* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree [not found] <200708272119.l7RLJoOD028582@imap1.linux-foundation.org> @ 2007-08-28 6:35 ` Nick Piggin 2007-08-28 7:26 ` Balbir Singh 0 siblings, 1 reply; 10+ messages in thread From: Nick Piggin @ 2007-08-28 6:35 UTC (permalink / raw) To: akpm Cc: balbir, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes, svaidy, xemul, Linux Memory Management akpm@linux-foundation.org wrote: > The patch titled > Memory controller: memory accounting What are the plans to deal with types of pages other than pagecache? Surely not adding hooks on a case-by-case basis throughout the kernel? Or do we not actually care about giving real guarantees about memory availability? Presumably you do, otherwise you wouldn't be carefully charging things before you actually insert them and uncharging afterwards on failure, etc. If you do, then I would have thought the best way to go would be to start with the simplest approach, and add things like targeted reclaim after that. I just don't want to see little bits that add more and more hooks slowly go in under the radar ;) So... there is a coherent plan, right? > +/* > + * Charge the memory controller for page usage. > + * Return > + * 0 if the charge was successful > + * < 0 if the container is over its limit > + */ > +int mem_container_charge(struct page *page, struct mm_struct *mm) > +{ > + struct mem_container *mem; > + struct page_container *pc, *race_pc; > + > + /* > + * Should page_container's go to their own slab? 
> + * One could optimize the performance of the charging routine > + * by saving a bit in the page_flags and using it as a lock > + * to see if the container page already has a page_container associated > + * with it > + */ > + lock_page_container(page); > + pc = page_get_page_container(page); > + /* > + * The page_container exists and the page has already been accounted > + */ > + if (pc) { > + atomic_inc(&pc->ref_cnt); > + goto done; > + } > + > + unlock_page_container(page); > + > + pc = kzalloc(sizeof(struct page_container), GFP_KERNEL); > + if (pc == NULL) > + goto err; > + > + rcu_read_lock(); > + /* > + * We always charge the container the mm_struct belongs to > + * the mm_struct's mem_container changes on task migration if the > + * thread group leader migrates. It's possible that mm is not > + * set, if so charge the init_mm (happens for pagecache usage). > + */ > + if (!mm) > + mm = &init_mm; > + > + mem = rcu_dereference(mm->mem_container); > + /* > + * For every charge from the container, increment reference > + * count > + */ > + css_get(&mem->css); > + rcu_read_unlock(); Where's the corresponding rcu_assign_pointer? Oh, in the next patch. That's unconventional (can you rearrange it so they go in together, please?). > @@ -1629,6 +1637,9 @@ gotten: > goto oom; > cow_user_page(new_page, old_page, address, vma); > > + if (mem_container_charge(new_page, mm)) > + goto oom_free_new; > + > /* > * Re-check the pte - we dropped the lock > */ > @@ -1660,7 +1671,9 @@ gotten: > /* Free the old page.. 
*/ > new_page = old_page; > ret |= VM_FAULT_WRITE; > - } > + } else > + mem_container_uncharge_page(new_page); > + > if (new_page) > page_cache_release(new_page); > if (old_page) > @@ -1681,6 +1694,8 @@ unlock: > put_page(dirty_page); > } > return ret; > +oom_free_new: > + __free_page(new_page); > oom: > if (old_page) > page_cache_release(old_page); > @@ -2085,6 +2100,11 @@ static int do_swap_page(struct mm_struct > } > > delayacct_clear_flag(DELAYACCT_PF_SWAPIN); > + if (mem_container_charge(page, mm)) { > + ret = VM_FAULT_OOM; > + goto out; > + } The oom-from-pagefault path is quite crap (not this code, mainline). It doesn't obey any of the given oom killing policy. I had a patch to fix this, but nobody else was too concerned at that stage. I don't expect that path would have been hit very much before, but with a lot of -EFAULTs coming out of containers, I think VM_FAULT_OOMs could easily trigger in practice now. http://lkml.org/lkml/2006/10/12/158 Patch may be a little outdated, but the basics should still work. You might be in a good position to test it with these container patches? 
> diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c > --- a/mm/rmap.c~memory-controller-memory-accounting-v7 > +++ a/mm/rmap.c > @@ -48,6 +48,7 @@ > #include <linux/rcupdate.h> > #include <linux/module.h> > #include <linux/kallsyms.h> > +#include <linux/memcontrol.h> > > #include <asm/tlbflush.h> > > @@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag > VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); > if (atomic_inc_and_test(&page->_mapcount)) > __page_set_anon_rmap(page, vma, address); > - else > + else { > __page_check_anon_rmap(page, vma, address); > + /* > + * We unconditionally charged during prepare, we uncharge here > + * This takes care of balancing the reference counts > + */ > + mem_container_uncharge_page(page); > + } > } > > /* > @@ -582,6 +589,12 @@ void page_add_file_rmap(struct page *pag > { > if (atomic_inc_and_test(&page->_mapcount)) > __inc_zone_page_state(page, NR_FILE_MAPPED); > + else > + /* > + * We unconditionally charged during prepare, we uncharge here > + * This takes care of balancing the reference counts > + */ > + mem_container_uncharge_page(page); > } What's "during prepare"? Better would be "before adding the page to page tables" or something. But... why do you do it? The refcounts and charging are already taken care of in your charge function, aren't they? Just unconditionally charge at map time and unconditionally uncharge at unmap time, and let your accounting implementation take care of what to actually do. (This is what I mean about mem container creeping into core code -- it's fine to have some tasteful hooks, but introducing these more complex interactions between core VM and container accounting details is nasty). I would much prefer this whole thing to _not_ hook into rmap like this at all. Do the call unconditionally, and your container implementation can do as much weird and wonderful refcounting as its heart desires. 
> #ifdef CONFIG_DEBUG_VM > @@ -642,6 +655,8 @@ void page_remove_rmap(struct page *page, > page_clear_dirty(page); > set_page_dirty(page); > } > + mem_container_uncharge_page(page); > + > __dec_zone_page_state(page, > PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED); > } > diff -puN mm/swap_state.c~memory-controller-memory-accounting-v7 mm/swap_state.c > --- a/mm/swap_state.c~memory-controller-memory-accounting-v7 > +++ a/mm/swap_state.c > @@ -17,6 +17,7 @@ > #include <linux/backing-dev.h> > #include <linux/pagevec.h> > #include <linux/migrate.h> > +#include <linux/memcontrol.h> > > #include <asm/pgtable.h> > > @@ -80,6 +81,11 @@ static int __add_to_swap_cache(struct pa > BUG_ON(PagePrivate(page)); > error = radix_tree_preload(gfp_mask); > if (!error) { > + > + error = mem_container_charge(page, current->mm); > + if (error) > + goto out; > + > write_lock_irq(&swapper_space.tree_lock); > error = radix_tree_insert(&swapper_space.page_tree, > entry.val, page); > @@ -89,10 +95,13 @@ static int __add_to_swap_cache(struct pa > set_page_private(page, entry.val); > total_swapcache_pages++; > __inc_zone_page_state(page, NR_FILE_PAGES); > - } > + } else > + mem_container_uncharge_page(page); > + > write_unlock_irq(&swapper_space.tree_lock); > radix_tree_preload_end(); > } > +out: > return error; > } Uh. You have to be really careful here (and in add_to_page_cache, and possibly others I haven't really looked)... you're ignoring gfp masks because mem_container_charge allocates with GFP_KERNEL. You can have all sorts of GFP_ATOMIC, GFP_NOIO, NOFS etc. come in these places that will deadlock. I think I would much prefer you make mem_container_charge run atomically, and then have just a single hook at the site of where everything else is checked, rather than having all this pre-charge post-uncharge messiness. Ditto for mm/memory.c... all call sites really... should make them much nicer looking. ... I didn't look at all code in all patches. 
But overall I guess it isn't looking too bad (though I'd really like people like Andrea and Hugh to take a look as well, eventually). It is still a lot uglier in terms of core mm hooks than I would like, but with luck that could be improved (some of my suggestions might even help a bit). Haven't looked at the targeted reclaim patches yet (sigh, more container tentacles :P). -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree 2007-08-28 6:35 ` + memory-controller-memory-accounting-v7.patch added to -mm tree Nick Piggin @ 2007-08-28 7:26 ` Balbir Singh 2007-08-28 9:29 ` Nick Piggin 0 siblings, 1 reply; 10+ messages in thread From: Balbir Singh @ 2007-08-28 7:26 UTC (permalink / raw) To: Nick Piggin Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes, svaidy, xemul, Linux Memory Management Nick Piggin wrote: > akpm@linux-foundation.org wrote: >> The patch titled >> Memory controller: memory accounting > > What's the plans to deal with other types of pages other than pagecache? > Surely not adding hooks on a case-by-case basis throughout all the kernel? > Or do we not actually care about giving real guarantees about memory > availability? Presumably you do, otherwise you wouldn't be carefully > charging things before you actually insert them and uncharing afterwards > on failure, etc. > We deal with RSS and Page/Swap Cache in this controller. > If you do, then I would have thought the best way to go would be to start > with the simplest approach, and add things like targetted reclaim after > that. > I think the kernel accounting/control system will have to be an independent controller (since we cannot reclaim kernel pages, just limit them by accounting (except for some slab pages which are free)). > I just don't want to see little bits to add more and more hooks slowly go > in under the radar ;) So... there is a coherent plan, right? > Nothing will go under the radar :-) Everything will be posted to linux-mm and linux-kernel. We'll make sure you don't wake up one day and see the mm code has changed to something you cannot recognize ;) > >> +/* >> + * Charge the memory controller for page usage. 
>> + * Return >> + * 0 if the charge was successful >> + * < 0 if the container is over its limit >> + */ >> +int mem_container_charge(struct page *page, struct mm_struct *mm) >> +{ >> + struct mem_container *mem; >> + struct page_container *pc, *race_pc; >> + >> + /* >> + * Should page_container's go to their own slab? >> + * One could optimize the performance of the charging routine >> + * by saving a bit in the page_flags and using it as a lock >> + * to see if the container page already has a page_container >> associated >> + * with it >> + */ >> + lock_page_container(page); >> + pc = page_get_page_container(page); >> + /* >> + * The page_container exists and the page has already been accounted >> + */ >> + if (pc) { >> + atomic_inc(&pc->ref_cnt); >> + goto done; >> + } >> + >> + unlock_page_container(page); >> + >> + pc = kzalloc(sizeof(struct page_container), GFP_KERNEL); >> + if (pc == NULL) >> + goto err; >> + >> + rcu_read_lock(); >> + /* >> + * We always charge the container the mm_struct belongs to >> + * the mm_struct's mem_container changes on task migration if the >> + * thread group leader migrates. It's possible that mm is not >> + * set, if so charge the init_mm (happens for pagecache usage). >> + */ >> + if (!mm) >> + mm = &init_mm; >> + >> + mem = rcu_dereference(mm->mem_container); >> + /* >> + * For every charge from the container, increment reference >> + * count >> + */ >> + css_get(&mem->css); >> + rcu_read_unlock(); > > Where's the corresponding rcu_assign_pointer? Oh, in the next patch. That's > unconventional (can you rearrange it so they go in together, please?). > > The movement is associated with task migration, it should be easy to move the patch around. 
>> @@ -1629,6 +1637,9 @@ gotten: >> goto oom; >> cow_user_page(new_page, old_page, address, vma); >> >> + if (mem_container_charge(new_page, mm)) >> + goto oom_free_new; >> + >> /* >> * Re-check the pte - we dropped the lock >> */ >> @@ -1660,7 +1671,9 @@ gotten: >> /* Free the old page.. */ >> new_page = old_page; >> ret |= VM_FAULT_WRITE; >> - } >> + } else >> + mem_container_uncharge_page(new_page); >> + >> if (new_page) >> page_cache_release(new_page); >> if (old_page) >> @@ -1681,6 +1694,8 @@ unlock: >> put_page(dirty_page); >> } >> return ret; >> +oom_free_new: >> + __free_page(new_page); >> oom: >> if (old_page) >> page_cache_release(old_page); >> @@ -2085,6 +2100,11 @@ static int do_swap_page(struct mm_struct >> } >> >> delayacct_clear_flag(DELAYACCT_PF_SWAPIN); >> + if (mem_container_charge(page, mm)) { >> + ret = VM_FAULT_OOM; >> + goto out; >> + } > > The oom-from-pagefault path is quite crap (not this code, mainline). > It doesn't obey any of the given oom killing policy. > > I had a patch to fix this, but nobody else was too concerned at that > stage. I don't expect that path would have been hit very much before, > but with a lot of -EFAULTs coming out of containers, I think > VM_FAULT_OOMs could easily trigger in practice now. > > http://lkml.org/lkml/2006/10/12/158 > > Patch may be a little outdated, but the basics should still work. You > might be in a good position to test it with these container patches? > I'll test the patch. Thanks for pointing me in the right direction. 
> >> diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c >> --- a/mm/rmap.c~memory-controller-memory-accounting-v7 >> +++ a/mm/rmap.c >> @@ -48,6 +48,7 @@ >> #include <linux/rcupdate.h> >> #include <linux/module.h> >> #include <linux/kallsyms.h> >> +#include <linux/memcontrol.h> >> >> #include <asm/tlbflush.h> >> >> @@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag >> VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); >> if (atomic_inc_and_test(&page->_mapcount)) >> __page_set_anon_rmap(page, vma, address); >> - else >> + else { >> __page_check_anon_rmap(page, vma, address); >> + /* >> + * We unconditionally charged during prepare, we uncharge here >> + * This takes care of balancing the reference counts >> + */ >> + mem_container_uncharge_page(page); >> + } >> } >> >> /* >> @@ -582,6 +589,12 @@ void page_add_file_rmap(struct page *pag >> { >> if (atomic_inc_and_test(&page->_mapcount)) >> __inc_zone_page_state(page, NR_FILE_MAPPED); >> + else >> + /* >> + * We unconditionally charged during prepare, we uncharge here >> + * This takes care of balancing the reference counts >> + */ >> + mem_container_uncharge_page(page); >> } > > What's "during prepare"? Better would be "before adding the page to > page tables" or something. > Yeah.. I'll change that comment > But... why do you do it? The refcounts and charging are alrady taken > care of in your charge function, aren't they? Just unconditionally > charge at map time and unconditionally uncharge at unmap time, and > let your accounting implementation take care of what to actually do. > > (This is what I mean about mem container creeping into core code -- > it's fine to have some tasteful hooks, but introducing these more > complex interactions between core VM and container accounting details > is nasty). > > I would much prefer this whole thing to _not_ to hook into rmap like > this at all. 
Do the call unconditionally, and your container > implementation can do as much weird and wonderful refcounting as its > heart desires. > The reason we have the accounting this way is that, after we charge, some other code path could fail and we need to uncharge the page. We do most of the refcounting internally. mem_container_charge() could fail if the container is over its limit and we cannot reclaim enough pages to allow a new charge to be added. In that case we go to OOM. > >> #ifdef CONFIG_DEBUG_VM >> @@ -642,6 +655,8 @@ void page_remove_rmap(struct page *page, >> page_clear_dirty(page); >> set_page_dirty(page); >> } >> + mem_container_uncharge_page(page); >> + >> __dec_zone_page_state(page, >> PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED); >> } >> diff -puN mm/swap_state.c~memory-controller-memory-accounting-v7 >> mm/swap_state.c >> --- a/mm/swap_state.c~memory-controller-memory-accounting-v7 >> +++ a/mm/swap_state.c >> @@ -17,6 +17,7 @@ >> #include <linux/backing-dev.h> >> #include <linux/pagevec.h> >> #include <linux/migrate.h> >> +#include <linux/memcontrol.h> >> >> #include <asm/pgtable.h> >> >> @@ -80,6 +81,11 @@ static int __add_to_swap_cache(struct pa >> BUG_ON(PagePrivate(page)); >> error = radix_tree_preload(gfp_mask); >> if (!error) { >> + >> + error = mem_container_charge(page, current->mm); >> + if (error) >> + goto out; >> + >> write_lock_irq(&swapper_space.tree_lock); >> error = radix_tree_insert(&swapper_space.page_tree, >> entry.val, page); >> @@ -89,10 +95,13 @@ static int __add_to_swap_cache(struct pa >> set_page_private(page, entry.val); >> total_swapcache_pages++; >> __inc_zone_page_state(page, NR_FILE_PAGES); >> - } >> + } else >> + mem_container_uncharge_page(page); >> + >> write_unlock_irq(&swapper_space.tree_lock); >> radix_tree_preload_end(); >> } >> +out: >> return error; >> } > > Uh. You have to be really careful here (and in add_to_page_cache, and > possibly others I haven't really looked)... 
you're ignoring gfp masks > because mem_container_charge allocates with GFP_KERNEL. You can have > all sorts of GFP_ATOMIC, GFP_NOIO, NOFS etc. come in these places that > will deadlock. > I ran into some trouble with move_to_swap_cache() and move_from_swap_cache() since they use GFP_ATOMIC. Given the page is already accounted for and refcounted, the refcounting helped avoid blocking. > I think I would much prefer you make mem_container_charge run atomically, > and then have just a single hook at the site of where everything else is > checked, rather than having all this pre-charge post-uncharge messiness. > Ditto for mm/memory.c... all call sites really... should make them much > nicer looking. > The problem is that we need to charge, before we add the page to the page table and that needs to be a blocking context (since we reclaim when the container goes over limit). The uncharge part comes in when we handle errors from other paths (like I've mentioned earlier in this email). > > I didn't look at all code in all patches. But overall I guess it isn't > looking too bad (though I'd really like people like Andrea and Hugh to > take a look as well, eventually). It is still a lot uglier in terms of > core mm hooks than I would like, but with luck that could be improved > (some of my suggestions might even help a bit). > I would like Andrea and Hugh to review it as well. Eventually, we want the best thing to go in and make it as non-intrusive as possible. > Haven't looked at the targetted reclaim patches yet (sigh, more > container tentacles :P). > :-) -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree 2007-08-28 7:26 ` Balbir Singh @ 2007-08-28 9:29 ` Nick Piggin 2007-08-28 11:39 ` Balbir Singh 0 siblings, 1 reply; 10+ messages in thread From: Nick Piggin @ 2007-08-28 9:29 UTC (permalink / raw) To: balbir Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes, svaidy, xemul, Linux Memory Management Balbir Singh wrote: > Nick Piggin wrote: > >>akpm@linux-foundation.org wrote: >> >>>The patch titled >>> Memory controller: memory accounting >> >>What's the plans to deal with other types of pages other than pagecache? >>Surely not adding hooks on a case-by-case basis throughout all the kernel? >>Or do we not actually care about giving real guarantees about memory >>availability? Presumably you do, otherwise you wouldn't be carefully >>charging things before you actually insert them and uncharing afterwards >>on failure, etc. >> > > > We deal with RSS and Page/Swap Cache in this controller. Sure. And if all you intend is workload management, then that's probably fine. If you want guarantees, then it's useless on its own. >>If you do, then I would have thought the best way to go would be to start >>with the simplest approach, and add things like targetted reclaim after >>that. >> > > > I think the kernel accounting/control system will have to be an independent > controller (since we cannot reclaim kernel pages, just limit them by > accounting (except for some slab pages which are free)). I don't see why accounting would have to be different. Control obviously will, but for the purposes of using a page, it doesn't matter to anybody else in the system whether some container has used it for a pagecache page, or for something else like page tables. 
The end result obviously is only that you can either kill the container or reclaim its reclaimable memory, but before you get to that point, you need to do the actual accounting, and AFAIKS it would be identical for all pages (at least for this basic first-touch charging scheme). >>I just don't want to see little bits to add more and more hooks slowly go >>in under the radar ;) So... there is a coherent plan, right? >> > > > Nothing will go under the radar :-) Everything will be posted to linux-mm > and linux-kernel. We'll make sure you don't wake up one day and see > the mm code has changed to something you cannot recognize ;) I don't mean to say it would be done on purpose, but without a coherent strategy then things might slowly just get added as people think they are needed. I'm fairly sure several people want to really guarantee memory resources in an untrusted environment, don't they? And that's obviously not going to scale by putting calls all throughout the kernel. >>>+ mem = rcu_dereference(mm->mem_container); >>>+ /* >>>+ * For every charge from the container, increment reference >>>+ * count >>>+ */ >>>+ css_get(&mem->css); >>>+ rcu_read_unlock(); >> >>Where's the corresponding rcu_assign_pointer? Oh, in the next patch. That's >>unconventional (can you rearrange it so they go in together, please?). >> >> > > > The movement is associated with task migration, it should be easy to move > the patch around. Thanks. >>>+ if (mem_container_charge(page, mm)) { >>>+ ret = VM_FAULT_OOM; >>>+ goto out; >>>+ } >> >>The oom-from-pagefault path is quite crap (not this code, mainline). >>It doesn't obey any of the given oom killing policy. >> >>I had a patch to fix this, but nobody else was too concerned at that >>stage. I don't expect that path would have been hit very much before, >>but with a lot of -EFAULTs coming out of containers, I think >>VM_FAULT_OOMs could easily trigger in practice now. 
>> >>http://lkml.org/lkml/2006/10/12/158 >> >>Patch may be a little outdated, but the basics should still work. You >>might be in a good position to test it with these container patches? >> > > > > I'll test the patch. Thanks for pointing me in the right direction. OK, thanks. Let me know how it goes and we could try again to get it merged. >>>@@ -582,6 +589,12 @@ void page_add_file_rmap(struct page *pag >>> { >>> if (atomic_inc_and_test(&page->_mapcount)) >>> __inc_zone_page_state(page, NR_FILE_MAPPED); >>>+ else >>>+ /* >>>+ * We unconditionally charged during prepare, we uncharge here >>>+ * This takes care of balancing the reference counts >>>+ */ >>>+ mem_container_uncharge_page(page); >>> } >> >>What's "during prepare"? Better would be "before adding the page to >>page tables" or something. >> > > > Yeah.. I'll change that comment > > >>But... why do you do it? The refcounts and charging are alrady taken >>care of in your charge function, aren't they? Just unconditionally >>charge at map time and unconditionally uncharge at unmap time, and >>let your accounting implementation take care of what to actually do. >> >>(This is what I mean about mem container creeping into core code -- >>it's fine to have some tasteful hooks, but introducing these more >>complex interactions between core VM and container accounting details >>is nasty). >> >>I would much prefer this whole thing to _not_ to hook into rmap like >>this at all. Do the call unconditionally, and your container >>implementation can do as much weird and wonderful refcounting as its >>heart desires. >> > > > The reason why we have the accounting this way is because > > After we charge, some other code path could fail and we need > to uncharge the page. We do most of the refcounting internally. > mem_container_charge() could fail if the container is over > its limit and we cannot reclaim enough pages to allow a new > charge to be added. In that case we go to OOM. 
But at this point you have already charged the container, and have put it in the page tables, if I read correctly. Nothing is going to fail at this point and the page could get uncharged when it is unmapped? >>>@@ -80,6 +81,11 @@ static int __add_to_swap_cache(struct pa >>> BUG_ON(PagePrivate(page)); >>> error = radix_tree_preload(gfp_mask); >>> if (!error) { >>>+ >>>+ error = mem_container_charge(page, current->mm); >>>+ if (error) >>>+ goto out; >>>+ >>> write_lock_irq(&swapper_space.tree_lock); >>> error = radix_tree_insert(&swapper_space.page_tree, >>> entry.val, page); >>>@@ -89,10 +95,13 @@ static int __add_to_swap_cache(struct pa >>> set_page_private(page, entry.val); >>> total_swapcache_pages++; >>> __inc_zone_page_state(page, NR_FILE_PAGES); >>>- } >>>+ } else >>>+ mem_container_uncharge_page(page); >>>+ >>> write_unlock_irq(&swapper_space.tree_lock); >>> radix_tree_preload_end(); >>> } >>>+out: >>> return error; >>> } >> >>Uh. You have to be really careful here (and in add_to_page_cache, and >>possibly others I haven't really looked)... you're ignoring gfp masks >>because mem_container_charge allocates with GFP_KERNEL. You can have >>all sorts of GFP_ATOMIC, GFP_NOIO, NOFS etc. come in these places that >>will deadlock. >> > > > I ran into some trouble with move_to_swap_cache() and move_from_swap_cache() > since they use GFP_ATOMIC. Given the page is already accounted for and > refcounted, the refcounting helped avoid blocking. add_to_page_cache gets called with GFP_ATOMIC as well, and it gets called with GFP_NOFS for new pages. But I don't think you were suggesting that this isn't a problem, were you? Relying on implementation details in the VM would signal more broken layering. >>I think I would much prefer you make mem_container_charge run atomically, >>and then have just a single hook at the site of where everything else is >>checked, rather than having all this pre-charge post-uncharge messiness. >>Ditto for mm/memory.c... all call sites really... 
should make them much >>nicer looking. >> > > > The problem is that we need to charge, before we add the page to the page > table and that needs to be a blocking context (since we reclaim when > the container goes over limit). The uncharge part comes in when we > handle errors from other paths (like I've mentioned earlier in this > email). So I don't understand how you'd deal with GFP_ATOMIC and such restricted masks, then, if you want to block here. It would be so so much easier and cleaner for the VM if you did all the accounting in page alloc and freeing hooks, and just put the page on per-container LRUs when it goes on the regular LRU. I'm still going to keep pushing for that approach until either someone explains why it can't be done or the current patch gets a fair bit cleaner. Has that already been tried and shown not to work? I would have thought so seeing as it would be the simplest patch, however I can't remember hearing about the actual problems with it. But anyway I don't have any problems with it getting into -mm at this stage for more exposure and review. Maybe my concerns will get overridden anyway ;) -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree 2007-08-28 9:29 ` Nick Piggin @ 2007-08-28 11:39 ` Balbir Singh 2007-08-29 7:28 ` Nick Piggin 0 siblings, 1 reply; 10+ messages in thread From: Balbir Singh @ 2007-08-28 11:39 UTC (permalink / raw) To: Nick Piggin Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes, svaidy, xemul, Linux Memory Management Nick Piggin wrote: > Balbir Singh wrote: >> Nick Piggin wrote: >> >>> akpm@linux-foundation.org wrote: >>> >>>> The patch titled >>>> Memory controller: memory accounting >>> >>> What's the plans to deal with other types of pages other than pagecache? >>> Surely not adding hooks on a case-by-case basis throughout all the >>> kernel? >>> Or do we not actually care about giving real guarantees about memory >>> availability? Presumably you do, otherwise you wouldn't be carefully >>> charging things before you actually insert them and uncharing afterwards >>> on failure, etc. >>> >> >> >> We deal with RSS and Page/Swap Cache in this controller. > > Sure. And if all you intend is workload management, then that's probably > fine. If you want guarantees, then its useless on its own. > Yes. Guarantees are hard to do with just this controller (due to non-reclaimable pages). It should be easy to add a mlock() controller on top of this one. Pavel is planning to work on a kernel memory controller. With that in place, guarantees might be possible. > >>> If you do, then I would have thought the best way to go would be to >>> start >>> with the simplest approach, and add things like targetted reclaim after >>> that. >>> >> >> >> I think the kernel accounting/control system will have to be an >> independent >> controller (since we cannot reclaim kernel pages, just limit them by >> accounting (except for some slab pages which are free)). > > I don't see why accounting would have to be different. 
> > Control obviously will, but for the purposes of using a page, it doesn't > matter to anybody else in the system whether some container has used it > for a pagecache page, or for something else like page tables. The end > result obviously is only that you can either kill the container or > reclaim its reclaimablememory, but before you get to that point, you > need to do the actual accounting, and AFAIKS it would be identical for > all pages (at least for this basic first-touch charging scheme). > Yes.. I see your point > >>> I just don't want to see little bits to add more and more hooks >>> slowly go >>> in under the radar ;) So... there is a coherent plan, right? >>> >> >> >> Nothing will go under the radar :-) Everything will be posted to linux-mm >> and linux-kernel. We'll make sure you don't wake up one day and see >> the mm code has changed to something you cannot recognize ;) > > I don't mean to say it would be done on purpose, but without a coherent > strategy then things might slowly just get added as people think they are > needed. I'm fairly sure several people want to really guarantee memory > resources in an untrusted environment, don't they? And that's obviously > not going to scale by putting calls all throughout the kernel. > We sent out an RFC last year and got several comments from stakeholders:
1. Pavel Emelianov
2. Paul Menage
3. Vaidyanathan Srinivasan
4. YAMAMOTO Takashi
5. KAMEZAWA Hiroyuki
At the OLS resource management BoF, we had broader participation and several comments and suggestions. We've incorporated suggestions and comments from all stakeholders > >>>> + mem = rcu_dereference(mm->mem_container); >>>> + /* >>>> + * For every charge from the container, increment reference >>>> + * count >>>> + */ >>>> + css_get(&mem->css); >>>> + rcu_read_unlock(); >>> >>> Where's the corresponding rcu_assign_pointer? Oh, in the next patch. >>> That's >>> unconventional (can you rearrange it so they go in together, please?). 
>>> >>> >> >> >> The movement is associated with task migration, it should be easy to move >> the patch around. > > Thanks. > > >>>> + if (mem_container_charge(page, mm)) { >>>> + ret = VM_FAULT_OOM; >>>> + goto out; >>>> + } >>> >>> The oom-from-pagefault path is quite crap (not this code, mainline). >>> It doesn't obey any of the given oom killing policy. >>> >>> I had a patch to fix this, but nobody else was too concerned at that >>> stage. I don't expect that path would have been hit very much before, >>> but with a lot of -EFAULTs coming out of containers, I think >>> VM_FAULT_OOMs could easily trigger in practice now. >>> >>> http://lkml.org/lkml/2006/10/12/158 >>> >>> Patch may be a little outdated, but the basics should still work. You >>> might be in a good position to test it with these container patches? >>> >> >> >> >> I'll test the patch. Thanks for pointing me in the right direction. > > OK, thanks. Let me know how it goes and we could try again to get it > merged. > > Sure I'll start testing out that code. >>>> @@ -582,6 +589,12 @@ void page_add_file_rmap(struct page *pag >>>> { >>>> if (atomic_inc_and_test(&page->_mapcount)) >>>> __inc_zone_page_state(page, NR_FILE_MAPPED); >>>> + else >>>> + /* >>>> + * We unconditionally charged during prepare, we uncharge here >>>> + * This takes care of balancing the reference counts >>>> + */ >>>> + mem_container_uncharge_page(page); >>>> } >>> >>> What's "during prepare"? Better would be "before adding the page to >>> page tables" or something. >>> >> >> >> Yeah.. I'll change that comment >> >> >>> But... why do you do it? The refcounts and charging are alrady taken >>> care of in your charge function, aren't they? Just unconditionally >>> charge at map time and unconditionally uncharge at unmap time, and >>> let your accounting implementation take care of what to actually do. 
>>> >>> (This is what I mean about mem container creeping into core code -- >>> it's fine to have some tasteful hooks, but introducing these more >>> complex interactions between core VM and container accounting details >>> is nasty). >>> >>> I would much prefer this whole thing to _not_ to hook into rmap like >>> this at all. Do the call unconditionally, and your container >>> implementation can do as much weird and wonderful refcounting as its >>> heart desires. >>> >> >> >> The reason why we have the accounting this way is because >> >> After we charge, some other code path could fail and we need >> to uncharge the page. We do most of the refcounting internally. >> mem_container_charge() could fail if the container is over >> its limit and we cannot reclaim enough pages to allow a new >> charge to be added. In that case we go to OOM. > > But at this point you have already charged the container, and have put > it in the page tables, if I read correctly. Nothing is going to fail > at this point and the page could get uncharged when it is unmapped? > > Here's the sequence of steps 1. Charge (we might need to block on reclaim) 2. Check to see if there is a race in updating the PTE 3. Add to the page table, update _mapcount under a lock Several times the error occurs in step 2, due to which we need to uncharge the memory container. 
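The three-step sequence above can be modelled in a few lines of userspace C. This is only an illustration of the ordering (charge first, possibly blocking; then detect the PTE race; back out the charge on failure). The names `mem_container_charge()` / `fault_path()` and the behaviour on hitting the limit are simplified stand-ins for the actual patch code, not the real kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

struct mem_container {
    long usage;   /* pages charged so far */
    long limit;   /* hard limit */
};

/* Step 1: charge; fails if over limit (the kernel would try reclaim here,
 * which is why this step may block). */
static int mem_container_charge(struct mem_container *mem)
{
    if (mem->usage >= mem->limit)
        return -1;              /* reclaim failed: OOM path */
    mem->usage++;
    return 0;
}

static void mem_container_uncharge(struct mem_container *mem)
{
    mem->usage--;
}

/* Steps 2-3: fault-handler skeleton.  pte_raced models another thread
 * having installed the PTE while we slept inside the charge. */
static int fault_path(struct mem_container *mem, bool pte_raced)
{
    if (mem_container_charge(mem))
        return -1;                    /* step 1 failed: OOM */
    if (pte_raced) {
        mem_container_uncharge(mem);  /* step 2 failed: back out */
        return 0;                     /* the other thread's mapping stands */
    }
    /* step 3: install the PTE, bump _mapcount under the lock (elided) */
    return 0;
}
```

The point of the model is that the uncharge is the undo of an optimistic charge, which is exactly the pairing Nick is probing at.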
>>>> @@ -80,6 +81,11 @@ static int __add_to_swap_cache(struct pa >>>> BUG_ON(PagePrivate(page)); >>>> error = radix_tree_preload(gfp_mask); >>>> if (!error) { >>>> + >>>> + error = mem_container_charge(page, current->mm); >>>> + if (error) >>>> + goto out; >>>> + >>>> write_lock_irq(&swapper_space.tree_lock); >>>> error = radix_tree_insert(&swapper_space.page_tree, >>>> entry.val, page); >>>> @@ -89,10 +95,13 @@ static int __add_to_swap_cache(struct pa >>>> set_page_private(page, entry.val); >>>> total_swapcache_pages++; >>>> __inc_zone_page_state(page, NR_FILE_PAGES); >>>> - } >>>> + } else >>>> + mem_container_uncharge_page(page); >>>> + >>>> write_unlock_irq(&swapper_space.tree_lock); >>>> radix_tree_preload_end(); >>>> } >>>> +out: >>>> return error; >>>> } >>> >>> Uh. You have to be really careful here (and in add_to_page_cache, and >>> possibly others I haven't really looked)... you're ignoring gfp masks >>> because mem_container_charge allocates with GFP_KERNEL. You can have >>> all sorts of GFP_ATOMIC, GFP_NOIO, NOFS etc. come in these places that >>> will deadlock. >>> >> >> >> I ran into some trouble with move_to_swap_cache() and >> move_from_swap_cache() >> since they use GFP_ATOMIC. Given the page is already accounted for and >> refcounted, the refcounting helped avoid blocking. > > add_to_page_cache gets called with GFP_ATOMIC as well, and it gets called > with GFP_NOFS for new pages. > > But I don't think you were suggesting that this isn't a problem, where > you? Relying on implementation in the VM would signal more broken layering. > > Good, thanks for spotting this. It should be possible to fix this case. I'll work on a fix and send it out. >>> I think I would much prefer you make mem_container_charge run >>> atomically, >>> and then have just a single hook at the site of where everything else is >>> checked, rather than having all this pre-charge post-uncharge messiness. >>> Ditto for mm/memory.c... all call sites really... 
should make them much >>> nicer looking. >>> >> >> >> The problem is that we need to charge, before we add the page to the page >> table and that needs to be a blocking context (since we reclaim when >> the container goes over limit). The uncharge part comes in when we >> handle errors from other paths (like I've mentioned earlier in this >> email). > > So I don't understand how you'd deal with GFP_ATOMIC and such restricted > masks, then, if you want to block here. > > It would be so so much easier and cleaner for the VM if you did all the > accounting in page alloc and freeing hooks, and just put the page > on per-container LRUs when it goes on the regular LRU. > The advantage of the current approach is that not all page allocations have to worry about charges. Only user space pages (unmapped cache and mapped RSS) are accounted for. We tried syncing the container and the global LRU. Pavel is working on that approach. The problem with adding the page at the same time as the regular LRU is that we end up calling the addition under the zone->lru_lock. Holding the entire zone's lru_lock for list addition might be an overhead. Although, we do the list isolation under the zone's lru lock. > I'm still going to keep pushing for that approach until either someone > explains why it can't be done or the current patch gets a fair bit > cleaner. Has that already been tried and shown not to work? I would have > thought so seeing as it would be the simplest patch, however I can't > remember hearing about the actual problems with it. > I am all eyes and ears to patches/suggestions/improvements to the current container. > But anyway I don't have any problems with it getting into -mm at this > stage for more exposure and review. 
Maybe my concerns will get > overridden anyway ;) > Thanks, we'll do the right thing to override your concerns (whether that is to change code or to convince you that we are already doing it right) -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
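One plausible shape for the gfp-mask fix Balbir promises above is to thread the caller's mask into the charge routine, so an atomic caller fails fast instead of entering blocking reclaim. The sketch below is a userspace model under that assumption; the flag values and the reclaim stub are illustrative only, not the real allocator flags or interface.

```c
#include <assert.h>

#define __GFP_WAIT  0x10u              /* caller may sleep */
#define GFP_ATOMIC  0x20u              /* no __GFP_WAIT */
#define GFP_KERNEL  (0x20u | __GFP_WAIT)

struct mem_container { long usage, limit; };

/* Pretend reclaim always frees one page if anything is charged. */
static int try_reclaim(struct mem_container *mem)
{
    if (mem->usage > 0) { mem->usage--; return 0; }
    return -1;
}

static int mem_container_charge(struct mem_container *mem, unsigned gfp)
{
    if (mem->usage >= mem->limit) {
        if (!(gfp & __GFP_WAIT))
            return -1;          /* atomic caller: fail fast, never block */
        if (try_reclaim(mem))
            return -1;          /* reclaim failed: OOM path */
    }
    mem->usage++;
    return 0;
}
```

With this shape, `__add_to_swap_cache` and `add_to_page_cache` could simply forward their own `gfp_mask`, and the deadlock Nick describes cannot occur for restricted masks.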
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree 2007-08-28 11:39 ` Balbir Singh @ 2007-08-29 7:28 ` Nick Piggin 2007-08-29 8:15 ` Balbir Singh 0 siblings, 1 reply; 10+ messages in thread From: Nick Piggin @ 2007-08-29 7:28 UTC (permalink / raw) To: balbir Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes, svaidy, xemul, Linux Memory Management Balbir Singh wrote: > Nick Piggin wrote: >>Sure. And if all you intend is workload management, then that's probably >>fine. If you want guarantees, then its useless on its own. >> > > > Yes. Guarantees are hard to do with just this controller (due to non-reclaimable > pages). It should be easy to add a mlock() controller on top of this one. > Pavel is planning to work on a kernel memory controller. With that in place, > guarantees might be possible. So it doesn't sound like they are necessarily a requirement for linux kernel memory control -- that's fine, I just want to know where you stand with that. >>I don't mean to say it would be done on purpose, but without a coherent >>strategy then things might slowly just get added as people think they are >>needed. I'm fairly sure several people want to really guarantee memory >>resources in an untrusted environment, don't they? And that's obviously >>not going to scale by putting calls all throughout the kernel. >> > > > We sent out an RFC last year and got several comments from stake holders > > 1. Pavel Emelianov > 2. Paul Menage > 3. Vaidyanathan Srinivasan > 4. YAMAMOTO Takshi > 5. KAMEZAWA Hiroyuki > > > At the OLS resource management BoF, we had a broader participation and > several comments and suggestions. > > We've incorporated suggestions and comments from all stake holders :) I know everyone's really worked hard at this and is trying to do the right thing. But for example some of the non-container VM people may not know exactly what you are up to or planning (I don't). 
It can make things a bit harder to review IF there is an expectation that more functionality is going to be required (which is what my expectation is). Anyway, let's not continue this tangent. Instead, I'll try to be more constructive and ask for eg. your future plans if they are not clear. >>But at this point you have already charged the container, and have put >>it in the page tables, if I read correctly. Nothing is going to fail >>at this point and the page could get uncharged when it is unmapped? >> >> > > > Here's the sequence of steps > > 1. Charge (we might need to block on reclaim) > 2. Check to see if there is a race in updating the PTE > 3. Add to the page table, update _mapcount under a lock > > Several times the error occurs in step 2, due to which we need to uncharge > the memory container. Oh yeah, I understand all _those_ uncharging calls you do. But let me just quote the code again: diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c --- a/mm/rmap.c~memory-controller-memory-accounting-v7 +++ a/mm/rmap.c @@ -48,6 +48,7 @@ #include <linux/rcupdate.h> #include <linux/module.h> #include <linux/kallsyms.h> +#include <linux/memcontrol.h> #include <asm/tlbflush.h> @@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); if (atomic_inc_and_test(&page->_mapcount)) __page_set_anon_rmap(page, vma, address); - else + else { __page_check_anon_rmap(page, vma, address); + /* + * We unconditionally charged during prepare, we uncharge here + * This takes care of balancing the reference counts + */ + mem_container_uncharge_page(page); + } } At the point you are uncharging here, the pte has been updated and the page is in the pagetables of the process. I guess this uncharging works on the presumption that you are not the first container to map the page, but I thought that you already check for that in your accounting implementation. Now how does it take care of the refcounts? 
I guess it is because your rmap removal function also takes care of refcounts by only uncharging if mapcount has gone to zero... however that's polluting the VM with knowledge that your accounting scheme is a first-touch one, isn't it? Aside, I'm slightly suspicious of whether this is correct against mapcount races, but I didn't look closely yet. I remember you bringing that up with me, so I guess you've been careful there... >>add_to_page_cache gets called with GFP_ATOMIC as well, and it gets called >>with GFP_NOFS for new pages. >> >>But I don't think you were suggesting that this isn't a problem, where >>you? Relying on implementation in the VM would signal more broken layering. >> >> > > > Good, thanks for spotting this. It should be possible to fix this case. > I'll work on a fix and send it out. OK. >>It would be so so much easier and cleaner for the VM if you did all the >>accounting in page alloc and freeing hooks, and just put the page >>on per-container LRUs when it goes on the regular LRU. >> > > > The advantage of the current approach is that not all page allocations have > to worry about charges. Only user space pages (unmapped cache and mapped RSS) > are accounted for. That could be true. OTOH, if you have a significant amount of non userspace page allocations happening, the chances are that your userspace-only accounting is going out the window too :) You'd still be able to avoid charging if a process isn't in a container or accounting is turned off, of course. > We tried syncing the container and the global LRU. Pavel is working on > that approach. The problem with adding the page at the same time as the > regular LRU is that we end up calling the addition under the zone->lru_lock. > Holding the entire zone's lru_lock for list addition might be an overhead. > Although, we do the list isolation under the zone's lru lock. Well it may not have to be _exactly_ the same time as the page goes on the normal lru. 
You could try doing it in pagevec_lru_add, for example, after the lru locks have been released? >>I'm still going to keep pushing for that approach until either someone >>explains why it can't be done or the current patch gets a fair bit >>cleaner. Has that already been tried and shown not to work? I would have >>thought so seeing as it would be the simplest patch, however I can't >>remember hearing about the actual problems with it. >> > > > I am all eyes and ears to patches/suggestions/improvements to the current > container. Well I have made a few. I'm actually not too interested in containers and resource control myself, but I suspect it may be a fair bit easier to play around with eg. my ideas for VM hooks with an implementation in -mm, so I might get motivated to try a patch... Thanks, Nick -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
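Nick's pagevec suggestion amounts to batching container-LRU insertions so that the (elided) lock is taken once per batch of pages rather than once per page, and outside the zone->lru_lock. A minimal userspace sketch of that batching, with all names hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define PAGEVEC_SIZE 14         /* matches the kernel's pagevec batch size */

struct page { struct page *lru_next; };
struct container_lru { struct page *head; long nr; };

/* In the real thing this would take the container's LRU lock once for the
 * whole batch; here we just splice the pages in. */
static void container_lru_add_batch(struct container_lru *lru,
                                    struct page **pages, int nr)
{
    for (int i = 0; i < nr; i++) {
        pages[i]->lru_next = lru->head;
        lru->head = pages[i];
        lru->nr++;
    }
}

struct pagevec { struct page *pages[PAGEVEC_SIZE]; int nr; };

static void pagevec_container_add(struct pagevec *pvec,
                                  struct container_lru *lru,
                                  struct page *page)
{
    pvec->pages[pvec->nr++] = page;
    if (pvec->nr == PAGEVEC_SIZE) {        /* drain when the batch is full */
        container_lru_add_batch(lru, pvec->pages, pvec->nr);
        pvec->nr = 0;
    }
}
```

The design point is that the container addition need not be atomic with the global-LRU addition; it only needs to happen before the page can be isolated for container reclaim.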
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree 2007-08-29 7:28 ` Nick Piggin @ 2007-08-29 8:15 ` Balbir Singh 2007-08-30 7:39 ` Nick Piggin 0 siblings, 1 reply; 10+ messages in thread From: Balbir Singh @ 2007-08-29 8:15 UTC (permalink / raw) To: Nick Piggin Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes, svaidy, xemul, Linux Memory Management Nick Piggin wrote: > Balbir Singh wrote: >> Nick Piggin wrote: > >>> Sure. And if all you intend is workload management, then that's probably >>> fine. If you want guarantees, then its useless on its own. >>> >> >> >> Yes. Guarantees are hard to do with just this controller (due to >> non-reclaimable >> pages). It should be easy to add a mlock() controller on top of this one. >> Pavel is planning to work on a kernel memory controller. With that in >> place, >> guarantees might be possible. > > So it doesn't sound like they are necessarily a requirement for linux > kernel memory control -- that's fine, I just want to know where you > stand with that. > > >>> I don't mean to say it would be done on purpose, but without a coherent >>> strategy then things might slowly just get added as people think they >>> are >>> needed. I'm fairly sure several people want to really guarantee memory >>> resources in an untrusted environment, don't they? And that's obviously >>> not going to scale by putting calls all throughout the kernel. >>> >> >> >> We sent out an RFC last year and got several comments from stake holders >> >> 1. Pavel Emelianov >> 2. Paul Menage >> 3. Vaidyanathan Srinivasan >> 4. YAMAMOTO Takshi >> 5. KAMEZAWA Hiroyuki >> >> >> At the OLS resource management BoF, we had a broader participation and >> several comments and suggestions. >> >> We've incorporated suggestions and comments from all stake holders > > :) I know everyone's really worked hard at this and is trying to do the > right thing. 
But for example some of the non-container VM people may not > know exactly what you are up to or planning (I don't). It can make things > a bit harder to review IF there is an expectation that more functionality > is going to be required (which is what my expectation is). > > Anyway, let's not continue this tangent. Instead, I'll try to be more > constructive and ask for eg. your future plans if they are not clear. > Thanks, I am going to be present at the VM summit and we have a roadmap as well. We can discuss it and we'll also send it out for comments. Just FYI, we started with an RFC about a year ago and got several comments from non-container folks. We'll try to discuss the roadmap further and get in more inputs. > >>> But at this point you have already charged the container, and have put >>> it in the page tables, if I read correctly. Nothing is going to fail >>> at this point and the page could get uncharged when it is unmapped? >>> >>> >> >> >> Here's the sequence of steps >> >> 1. Charge (we might need to block on reclaim) >> 2. Check to see if there is a race in updating the PTE >> 3. Add to the page table, update _mapcount under a lock >> >> Several times the error occurs in step 2, due to which we need to >> uncharge >> the memory container. > > Oh yeah, I understand all _those_ uncharging calls you do. 
But let me
> just quote the code again:
>
> diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c
> --- a/mm/rmap.c~memory-controller-memory-accounting-v7
> +++ a/mm/rmap.c
> @@ -48,6 +48,7 @@
> #include <linux/rcupdate.h>
> #include <linux/module.h>
> #include <linux/kallsyms.h>
> +#include <linux/memcontrol.h>
>
> #include <asm/tlbflush.h>
>
> @@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag
> VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> if (atomic_inc_and_test(&page->_mapcount))
> __page_set_anon_rmap(page, vma, address);
> - else
> + else {
> __page_check_anon_rmap(page, vma, address);
> + /*
> + * We unconditionally charged during prepare, we uncharge here
> + * This takes care of balancing the reference counts
> + */
> + mem_container_uncharge_page(page);
> + }
> }
>
> At the point you are uncharging here, the pte has been updated and
> the page is in the pagetables of the process. I guess this uncharging
> works on the presumption that you are not the first container to map
> the page, but I thought that you already check for that in your
> accounting implementation.
>
> Now how does it take care of the refcounts? I guess it is because your
> rmap removal function also takes care of refcounts by only uncharging
> if mapcount has gone to zero... however that's polluting the VM with
> knowledge that your accounting scheme is a first-touch one, isn't it?
>
> Aside, I'm slightly suspicious of whether this is correct against
> mapcount races, but I didn't look closely yet. I remember you bringing
> that up with me, so I guess you've been careful there...
>

Very good review comment. Here's what we see

1. Page comes in through page cache, we increment the reference count
2. Page comes into rmap, we increment the refcount again
3. We race in page_add.*rmap(), the problem we have is that for
   the same page, for rmap(), step 2 would have taken place more than
   once
4.
That's why we uncharge.

I think I need to add a big fat comment in the ref_cnt member. ref_cnt
is held once for the page in the page cache and once for the page mapped
anywhere in the page tables.

Reference counting helps us correctly determine when to take the page
off the LRU (it moves from page tables to swap cache, but it's still
on the LRU).

>>> add_to_page_cache gets called with GFP_ATOMIC as well, and it gets
>>> called
>>> with GFP_NOFS for new pages.
>>>
>>> But I don't think you were suggesting that this isn't a problem, where
>>> you? Relying on implementation in the VM would signal more broken
>>> layering.
>>>
>>>
>>
>>
>> Good, thanks for spotting this. It should be possible to fix this case.
>> I'll work on a fix and send it out.
>
> OK.
>
>
>>> It would be so so much easier and cleaner for the VM if you did all the
>>> accounting in page alloc and freeing hooks, and just put the page
>>> on per-container LRUs when it goes on the regular LRU.
>>>
>>
>>
>> The advantage of the current approach is that not all page allocations
>> have
>> to worry about charges. Only user space pages (unmapped cache and
>> mapped RSS)
>> are accounted for.
>
> That could be true. OTOH, if you have a significant amount of non userspace
> page allocations happening, the chances are that your userspace-only
> accounting is going out the window too :)
>
> You'd still be able to avoid charging if a process isn't in a container
> or accounting is turned off, of course.
>

Yes, we need a mechanism to turn off the container at boot time.
Currently if the config is enabled, we charge :-)

>
>> We tried syncing the container and the global LRU. Pavel is working on
>> that approach. The problem with adding the page at the same time as the
>> regular LRU is that we end up calling the addition under the
>> zone->lru_lock.
>> Holding the entire zone's lru_lock for list addition might be an
>> overhead.
>> Although, we do the list isolation under the zone's lru lock.
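The ref_cnt scheme Balbir describes in this mail (one reference while the page sits in the page cache, one while it is mapped anywhere, and LRU membership tied to the count) can be sketched in userspace as follows. `pc_get()` / `pc_put()` are hypothetical helpers, and `on_lru` stands in for membership on the container LRU; none of these names come from the patch.

```c
#include <assert.h>
#include <stdbool.h>

struct page_container {
    int ref_cnt;   /* at most 2: one for page cache, one for "mapped anywhere" */
    bool on_lru;   /* on the container LRU while any reference is held */
};

static void pc_get(struct page_container *pc)
{
    if (pc->ref_cnt++ == 0)
        pc->on_lru = true;      /* first reference: goes on the container LRU */
}

static void pc_put(struct page_container *pc)
{
    if (--pc->ref_cnt == 0)
        pc->on_lru = false;     /* last reference: comes off the container LRU */
}
```

This makes the invariant explicit: the page stays on the container LRU across a move from page tables to swap cache, because the cache reference is still held.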
> > Well it may not have to be _exactly_ the same time as the page goes on the > normal lru. You could try doing it in pagevec_lru_add, for example, after > the lru locks have been released? > I'll play around with Pavel's patches or work with him to explore this further. > >>> I'm still going to keep pushing for that approach until either someone >>> explains why it can't be done or the current patch gets a fair bit >>> cleaner. Has that already been tried and shown not to work? I would have >>> thought so seeing as it would be the simplest patch, however I can't >>> remember hearing about the actual problems with it. >>> >> >> >> I am all eyes and ears to patches/suggestions/improvements to the current >> container. > > Well I have made a few. I'm actually not too interested in containers and > resource control myself, but I suspect it may be a fair bit easier to play > around with eg. my ideas for VM hooks with an implementation in -mm, so I > might get motivated to try a patch... > Great! Thanks! > Thanks, > Nick > -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree 2007-08-29 8:15 ` Balbir Singh @ 2007-08-30 7:39 ` Nick Piggin 2007-08-30 9:04 ` Balbir Singh 0 siblings, 1 reply; 10+ messages in thread From: Nick Piggin @ 2007-08-30 7:39 UTC (permalink / raw) To: balbir Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes, svaidy, xemul, Linux Memory Management Balbir Singh wrote: > Nick Piggin wrote: >>Oh yeah, I understand all _those_ uncharging calls you do. But let me >>just quote the code again: >> >>diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c >>--- a/mm/rmap.c~memory-controller-memory-accounting-v7 >>+++ a/mm/rmap.c >>@@ -48,6 +48,7 @@ >> #include <linux/rcupdate.h> >> #include <linux/module.h> >> #include <linux/kallsyms.h> >>+#include <linux/memcontrol.h> >> >> #include <asm/tlbflush.h> >> >>@@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag >> VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); >> if (atomic_inc_and_test(&page->_mapcount)) >> __page_set_anon_rmap(page, vma, address); >>- else >>+ else { >> __page_check_anon_rmap(page, vma, address); >>+ /* >>+ * We unconditionally charged during prepare, we uncharge here >>+ * This takes care of balancing the reference counts >>+ */ >>+ mem_container_uncharge_page(page); >>+ } >> } >> >>At the point you are uncharging here, the pte has been updated and >>the page is in the pagetables of the process. I guess this uncharging >>works on the presumption that you are not the first container to map >>the page, but I thought that you already check for that in your >>accounting implementation. >> >>Now how does it take care of the refcounts? I guess it is because your >>rmap removal function also takes care of refcounts by only uncharging >>if mapcount has gone to zero... however that's polluting the VM with >>knowledge that your accounting scheme is a first-touch one, isn't it? 
>> >>Aside, I'm slightly suspicious of whether this is correct against >>mapcount races, but I didn't look closely yet. I remember you bringing >>that up with me, so I guess you've been careful there... >> > > > Very good review comment. Here's what we see > > 1. Page comes in through page cache, we increment the reference ount > 2. Page comes into rmap, we increment the refcount again > 3. We race in page_add.*rmap(), the problem we have is that for > the same page, for rmap(), step 2 would have taken place more than > once > 4. That's why we uncharge > > I think I need to add a big fat comment in the ref_cnt member. ref_cnt > is held once for the page in the page cache and once for the page mapped > anywhere in the page tables. > > reference counting helps us correctly determine when to take the page > of the LRU (it moves from page tables to swap cache, but it's still > on the LRU). Still don't understand. You increment the refcount once when you put the page in the pagecache, then again when the first process maps the page, then again while subsequent processes map the page but you soon drop it afterwards. That's fine, I don't pretend to understand why you're doing it, but presumably the controller has a good reason for that. But my point is, why should the VM know or care about that? You should handle all those details in your controller. If, in order to do that, you need to differentiate between when a process puts a page in pagecache and when it maps a page, that's fine, just use different hooks for those events. The situation now is that your one hook is not actually a "this page was mapped" hook, or a "this page was added to pagecache", or "we are about to map this page". These are easy for VM maintainers to maintain because they're simple VM concepts. But your hook is "increment ref_cnt and do some other stuff". So now the VM needs to know about when and why your container implementation needs to increment and decrement this ref_cnt. 
I don't know this, and I don't want to know this ;) -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
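The hook split Nick is arguing for could look roughly like this: the VM calls plainly named event hooks ("page was added to pagecache", "page was mapped", "page was unmapped") and the controller keeps all ref_cnt bookkeeping behind them. The interface and every function name below are hypothetical, sketched only to show where the first-touch detail would live.

```c
#include <assert.h>

struct page_container { int ref_cnt; int mapcount; };

/* Controller-internal; the VM never sees ref_cnt. */
static void pc_ref(struct page_container *pc)   { pc->ref_cnt++; }
static void pc_unref(struct page_container *pc) { pc->ref_cnt--; }

/* Hooks named after VM events, called unconditionally by the VM. */
static void mem_container_page_cached(struct page_container *pc)
{
    pc_ref(pc);                 /* one reference for the page cache */
}

static void mem_container_page_mapped(struct page_container *pc)
{
    if (pc->mapcount++ == 0)
        pc_ref(pc);             /* only the first mapping takes a reference */
}

static void mem_container_page_unmapped(struct page_container *pc)
{
    if (--pc->mapcount == 0)
        pc_unref(pc);           /* the first-touch detail stays in here */
}
```

With this shape, rmap would never need the conditional uncharge in its else-branch; the controller absorbs duplicate "mapped" events itself, which is precisely the encapsulation being asked for.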
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree 2007-08-30 7:39 ` Nick Piggin @ 2007-08-30 9:04 ` Balbir Singh 2007-08-31 0:35 ` Nick Piggin 0 siblings, 1 reply; 10+ messages in thread From: Balbir Singh @ 2007-08-30 9:04 UTC (permalink / raw) To: Nick Piggin Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes, svaidy, xemul, Linux Memory Management Nick Piggin wrote: > Balbir Singh wrote: >> Nick Piggin wrote: > >> Very good review comment. Here's what we see >> >> 1. Page comes in through page cache, we increment the reference ount >> 2. Page comes into rmap, we increment the refcount again >> 3. We race in page_add.*rmap(), the problem we have is that for >> the same page, for rmap(), step 2 would have taken place more than >> once >> 4. That's why we uncharge >> >> I think I need to add a big fat comment in the ref_cnt member. ref_cnt >> is held once for the page in the page cache and once for the page mapped >> anywhere in the page tables. >> >> reference counting helps us correctly determine when to take the page >> of the LRU (it moves from page tables to swap cache, but it's still >> on the LRU). > > Still don't understand. You increment the refcount once when you put > the page in the pagecache, then again when the first process maps the > page, then again while subsequent processes map the page but you soon > drop it afterwards. That's fine, I don't pretend to understand why > you're doing it, but presumably the controller has a good reason for > that. > > But my point is, why should the VM know or care about that? You should > handle all those details in your controller. If, in order to do that, > you need to differentiate between when a process puts a page in > pagecache and when it maps a page, that's fine, just use different > hooks for those events. 
> > The situation now is that your one hook is not actually a "this page > was mapped" hook, or a "this page was added to pagecache", or "we are > about to map this page". These are easy for VM maintainers to maintain > because they're simple VM concepts. > > But your hook is "increment ref_cnt and do some other stuff". So now > the VM needs to know about when and why your container implementation > needs to increment and decrement this ref_cnt. I don't know this, and > I don't want to know this ;) > My hook really is -- there was a race, there is no rmap lock to prevent several independent processes from mapping the same page into their page tables. I want to increment the reference count just once (apart from it being accounted in the page cache), since we account the page once. I'll revisit this hook to see if it can be made cleaner Thanks for your detailed review comments. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
  2007-08-30  9:04           ` Balbir Singh
@ 2007-08-31  0:35             ` Nick Piggin
  2007-08-31  4:40               ` Balbir Singh
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2007-08-31 0:35 UTC (permalink / raw)
  To: balbir
  Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
	svaidy, xemul, Linux Memory Management

Balbir Singh wrote:
> Nick Piggin wrote:
>
>> Balbir Singh wrote:
>>
>>> Nick Piggin wrote:
>>
>>> Very good review comment. Here's what we see
>>>
>>> 1. Page comes in through page cache, we increment the reference count
>>> 2. Page comes into rmap, we increment the refcount again
>>> 3. We race in page_add.*rmap(); the problem we have is that for
>>>    the same page, for rmap(), step 2 would have taken place more than
>>>    once
>>> 4. That's why we uncharge
>>>
>>> I think I need to add a big fat comment in the ref_cnt member. ref_cnt
>>> is held once for the page in the page cache and once for the page
>>> mapped anywhere in the page tables.
>>>
>>> Reference counting helps us correctly determine when to take the page
>>> off the LRU (it moves from page tables to swap cache, but it's still
>>> on the LRU).
>>
>> Still don't understand. You increment the refcount once when you put
>> the page in the pagecache, then again when the first process maps the
>> page, then again while subsequent processes map the page but you soon
>> drop it afterwards. That's fine, I don't pretend to understand why
>> you're doing it, but presumably the controller has a good reason for
>> that.
>>
>> But my point is, why should the VM know or care about that? You should
>> handle all those details in your controller. If, in order to do that,
>> you need to differentiate between when a process puts a page in
>> pagecache and when it maps a page, that's fine, just use different
>> hooks for those events.
>>
>> The situation now is that your one hook is not actually a "this page
>> was mapped" hook, or a "this page was added to pagecache", or "we are
>> about to map this page". These are easy for VM maintainers to maintain
>> because they're simple VM concepts.
>>
>> But your hook is "increment ref_cnt and do some other stuff". So now
>> the VM needs to know about when and why your container implementation
>> needs to increment and decrement this ref_cnt. I don't know this, and
>> I don't want to know this ;)
>
> My hook really is -- there was a race; there is no rmap lock to prevent
> several independent processes from mapping the same page into their
> page tables. I want to increment the reference count just once (apart
> from it being accounted in the page cache), since we account the page
> once.
>
> I'll revisit this hook to see if it can be made cleaner.

If you just have a different hook for mapping a page into the page
tables, your controller can take care of any races, no?

--
SUSE Labs, Novell Inc.
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
  2007-08-31  0:35             ` Nick Piggin
@ 2007-08-31  4:40               ` Balbir Singh
  0 siblings, 0 replies; 10+ messages in thread
From: Balbir Singh @ 2007-08-31 4:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
	svaidy, xemul, Linux Memory Management

<snip>

Nick Piggin wrote:
>>
>> My hook really is -- there was a race; there is no rmap lock to prevent
>> several independent processes from mapping the same page into their
>> page tables. I want to increment the reference count just once (apart
>> from it being accounted in the page cache), since we account the page
>> once.
>>
>> I'll revisit this hook to see if it can be made cleaner.
>
> If you just have a different hook for mapping a page into the page
> tables, your controller can take care of any races, no?
>

We increment the reference count in the mem_container_charge() routine.
Not all pages go into page tables, so it makes sense to do the reference
counting there. The charge routine also does reclaim, so we cannot call
it at mapping time, since the mapping happens under the pte lock.

I'll see if I can refactor the way we reference count; I agree that it
would simplify code maintenance.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
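The split Nick is suggesting, and the constraint Balbir raises, fit a two-phase shape: the expensive charge (which may sleep to reclaim) runs before the pte lock is taken, and the per-mapping hook that runs under the lock only updates counters. The sketch below is a hypothetical userspace model of that shape -- the names, the trivial limit check, and the missing reclaim step are all illustrative, not the code from the actual patch.

```c
#include <errno.h>
#include <assert.h>

/* Hypothetical model of the proposed two-hook split. */
struct mem_container {
	long usage;		/* pages charged */
	long limit;		/* container limit */
};

/* Heavyweight hook: may sleep to reclaim, so the fault path must call
 * it before acquiring the pte lock. */
static int mem_container_charge(struct mem_container *mem)
{
	if (mem->usage >= mem->limit)
		return -ENOMEM;	/* a real version would try reclaim here */
	mem->usage++;
	return 0;
}

/* Lightweight hook: pure counter updates, no allocation and no
 * sleeping, so it is safe to call under the pte lock. */
static void mem_container_page_mapped(int *mapcount)
{
	(*mapcount)++;
}

/* Fault-path shape: charge first, then take the lock, then the cheap
 * mapping hook; the lock/unlock is only marked by comments here. */
static int map_page(struct mem_container *mem, int *mapcount)
{
	if (mem_container_charge(mem))
		return -ENOMEM;
	/* pte lock would be taken here */
	mem_container_page_mapped(mapcount);
	/* pte lock dropped */
	return 0;
}
```

With a limit of two pages, the first two `map_page()` calls succeed and the third fails with `-ENOMEM`, while the mapping hook itself never had to do anything that could sleep -- which is the property that makes the per-hook split workable under the pte lock.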
Thread overview: 10+ messages
[not found] <200708272119.l7RLJoOD028582@imap1.linux-foundation.org>
2007-08-28 6:35 ` + memory-controller-memory-accounting-v7.patch added to -mm tree Nick Piggin
2007-08-28 7:26 ` Balbir Singh
2007-08-28 9:29 ` Nick Piggin
2007-08-28 11:39 ` Balbir Singh
2007-08-29 7:28 ` Nick Piggin
2007-08-29 8:15 ` Balbir Singh
2007-08-30 7:39 ` Nick Piggin
2007-08-30 9:04 ` Balbir Singh
2007-08-31 0:35 ` Nick Piggin
2007-08-31 4:40 ` Balbir Singh