From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Wed, 20 Feb 2008 18:58:21 +0900 (JST) Message-Id: <20080220.185821.61784723.taka@valinux.co.jp> Subject: Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races. From: Hirokazu Takahashi In-Reply-To: <47BBC15E.5070405@linux.vnet.ibm.com> References: <20080219215431.1aa9fa8a.kamezawa.hiroyu@jp.fujitsu.com> <47BBC15E.5070405@linux.vnet.ibm.com> Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: balbir@linux.vnet.ibm.com Cc: hugh@veritas.com, kamezawa.hiroyu@jp.fujitsu.com, linux-mm@kvack.org, yamamoto@valinux.co.jp, riel@redhat.com List-ID: Hi, > >> I'd like to start from RFC. > >> > >> In following code > >> == > >> lock_page_cgroup(page); > >> pc = page_get_page_cgroup(page); > >> unlock_page_cgroup(page); > >> > >> access 'pc' later.. > >> == (See, page_cgroup_move_lists()) > >> > >> There is a race because 'pc' is not a stable value without lock_page_cgroup(). > >> (mem_cgroup_uncharge can free this 'pc'). > >> > >> For example, page_cgroup_move_lists() access pc without lock. > >> There is a small race window, between page_cgroup_move_lists() > >> and mem_cgroup_uncharge(). At uncharge, page_cgroup struct is immedieately > >> freed but move_list can access it after taking lru_lock. > >> (*) mem_cgroup_uncharge_page() can be called without zone->lru lock. > >> > >> This is not good manner. > >> ..... > >> There is no quick fix (maybe). Moreover, I hear some people around me said > >> current memcontrol.c codes are very complicated. > >> I agree ;( ..it's caued by my work. > >> > >> I'd like to fix problems in clean way. > >> (Note: current -rc2 codes works well under heavy pressure. but there > >> is possibility of race, I think.) > > > > Yes, yes, indeed, I've been working away on this too. > > > > Ever since the VM_BUG_ON(page_get_page_cgroup(page)) went into > > free_hot_cold_page (at my own prompting), I've been hitting it > > just very occasionally in my kernel build testing. Was unable > > to reproduce it over the New Year, but a week or two ago found > > one machine and config on which it is relatively reproducible, > > pretty sure to happen within 12 hours. > > > > And on Saturday evening at last identified the cause, exactly > > where you have: that unsafety in mem_cgroup_move_lists - which > > has the nice property of putting pages from the lru on to SLUB's > > freelist! > > > > Unlike the unsafeties of force_empty, this is liable to hit anyone > > running with MEM_CONT compiled in, they don't have to be consciously > > using mem_cgroups at all. > > > > (I consider that, by the way, quite a serious defect in the current > > mem_cgroup work: that a distro compiling it in for 1% of customers > > is then subjecting all to the mem_cgroup overhead - effectively > > doubled struct page size and unnecessary accounting overhead. I > > believe there needs to be a way to opt out, a force_empty which > > sticks. Yes, I know the page_cgroup which does that doubling of > > size is only allocated on demand, but every page cache page and > > every anonymous page is going to have one. A kmem_cache for them > > will reduce the extra, but there still needs to be a way to opt > > out completely.) > > > > I've been thinking along these lines as well > > 1. Have a boot option to turn on/off the memory controller It will be much convenient if the memory controller can be turned on/off on demand. I think you can turn it off if there aren't any mem_cgroups except the root mem_cgroup, > 2. Have a separate cache for the page_cgroup structure. I sent this suggestion > out just yesterday or so. I think the policy that every mem_cgroup should be dynamically allocated and assigned to the proper page every time is causing some overheads and spinlock contentions. What do you think if you allocate all page_cgroups and assign to all the pages when the memory controller gets turned on, which will allow you to remove most of the spinlocks. And you may possibly have a chance to remove page->page_cgroup member if you allocate array of page_cgroups and attach them to the zone which the pages belong to. zone page[] +----+ page_cgroup[] +----+<---- ---->+----+ | | | | | | +----+ | | +----+ | | +----+ | | +----+ +----+ | | | | +----+ +----+ | | | | +----+ +----+ | | | | +----+ +----+ > I agree that these are necessary enhancements/changes. Thank you, Hirokazu Takahashi. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org