* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
[not found] <200708272119.l7RLJoOD028582@imap1.linux-foundation.org>
@ 2007-08-28 6:35 ` Nick Piggin
2007-08-28 7:26 ` Balbir Singh
0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2007-08-28 6:35 UTC (permalink / raw)
To: akpm
Cc: balbir, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
akpm@linux-foundation.org wrote:
> The patch titled
> Memory controller: memory accounting
What are the plans for dealing with types of pages other than pagecache?
Surely not adding hooks on a case-by-case basis throughout the whole kernel?
Or do we not actually care about giving real guarantees about memory
availability? Presumably you do, otherwise you wouldn't be carefully
charging things before you actually insert them and uncharging afterwards
on failure, etc.
If you do, then I would have thought the best way to go would be to start
with the simplest approach, and add things like targeted reclaim after
that.
I just don't want to see little bits that add more and more hooks slowly go
in under the radar ;) So... there is a coherent plan, right?
> +/*
> + * Charge the memory controller for page usage.
> + * Return
> + * 0 if the charge was successful
> + * < 0 if the container is over its limit
> + */
> +int mem_container_charge(struct page *page, struct mm_struct *mm)
> +{
> + struct mem_container *mem;
> + struct page_container *pc, *race_pc;
> +
> + /*
> + * Should page_container's go to their own slab?
> + * One could optimize the performance of the charging routine
> + * by saving a bit in the page_flags and using it as a lock
> + * to see if the container page already has a page_container associated
> + * with it
> + */
> + lock_page_container(page);
> + pc = page_get_page_container(page);
> + /*
> + * The page_container exists and the page has already been accounted
> + */
> + if (pc) {
> + atomic_inc(&pc->ref_cnt);
> + goto done;
> + }
> +
> + unlock_page_container(page);
> +
> + pc = kzalloc(sizeof(struct page_container), GFP_KERNEL);
> + if (pc == NULL)
> + goto err;
> +
> + rcu_read_lock();
> + /*
> + * We always charge the container the mm_struct belongs to
> + * the mm_struct's mem_container changes on task migration if the
> + * thread group leader migrates. It's possible that mm is not
> + * set, if so charge the init_mm (happens for pagecache usage).
> + */
> + if (!mm)
> + mm = &init_mm;
> +
> + mem = rcu_dereference(mm->mem_container);
> + /*
> + * For every charge from the container, increment reference
> + * count
> + */
> + css_get(&mem->css);
> + rcu_read_unlock();
Where's the corresponding rcu_assign_pointer? Oh, in the next patch. That's
unconventional (can you rearrange it so they go in together, please?).
> @@ -1629,6 +1637,9 @@ gotten:
> goto oom;
> cow_user_page(new_page, old_page, address, vma);
>
> + if (mem_container_charge(new_page, mm))
> + goto oom_free_new;
> +
> /*
> * Re-check the pte - we dropped the lock
> */
> @@ -1660,7 +1671,9 @@ gotten:
> /* Free the old page.. */
> new_page = old_page;
> ret |= VM_FAULT_WRITE;
> - }
> + } else
> + mem_container_uncharge_page(new_page);
> +
> if (new_page)
> page_cache_release(new_page);
> if (old_page)
> @@ -1681,6 +1694,8 @@ unlock:
> put_page(dirty_page);
> }
> return ret;
> +oom_free_new:
> + __free_page(new_page);
> oom:
> if (old_page)
> page_cache_release(old_page);
> @@ -2085,6 +2100,11 @@ static int do_swap_page(struct mm_struct
> }
>
> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> + if (mem_container_charge(page, mm)) {
> + ret = VM_FAULT_OOM;
> + goto out;
> + }
The oom-from-pagefault path is quite crap (not this code, mainline).
It doesn't obey any of the given oom killing policy.
I had a patch to fix this, but nobody else was too concerned at that
stage. I don't expect that path would have been hit very much before,
but with a lot of -EFAULTs coming out of containers, I think
VM_FAULT_OOMs could easily trigger in practice now.
http://lkml.org/lkml/2006/10/12/158
Patch may be a little outdated, but the basics should still work. You
might be in a good position to test it with these container patches?
> diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c
> --- a/mm/rmap.c~memory-controller-memory-accounting-v7
> +++ a/mm/rmap.c
> @@ -48,6 +48,7 @@
> #include <linux/rcupdate.h>
> #include <linux/module.h>
> #include <linux/kallsyms.h>
> +#include <linux/memcontrol.h>
>
> #include <asm/tlbflush.h>
>
> @@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag
> VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> if (atomic_inc_and_test(&page->_mapcount))
> __page_set_anon_rmap(page, vma, address);
> - else
> + else {
> __page_check_anon_rmap(page, vma, address);
> + /*
> + * We unconditionally charged during prepare, we uncharge here
> + * This takes care of balancing the reference counts
> + */
> + mem_container_uncharge_page(page);
> + }
> }
>
> /*
> @@ -582,6 +589,12 @@ void page_add_file_rmap(struct page *pag
> {
> if (atomic_inc_and_test(&page->_mapcount))
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> + else
> + /*
> + * We unconditionally charged during prepare, we uncharge here
> + * This takes care of balancing the reference counts
> + */
> + mem_container_uncharge_page(page);
> }
What's "during prepare"? Better would be "before adding the page to
page tables" or something.
But... why do you do it? The refcounts and charging are already taken
care of in your charge function, aren't they? Just unconditionally
charge at map time and unconditionally uncharge at unmap time, and
let your accounting implementation take care of what to actually do.
(This is what I mean about mem container creeping into core code --
it's fine to have some tasteful hooks, but introducing these more
complex interactions between core VM and container accounting details
is nasty).
I would much prefer this whole thing to _not_ hook into rmap like
this at all. Do the call unconditionally, and your container
implementation can do as much weird and wonderful refcounting as its
heart desires.
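To make it concrete, here's a throwaway user-space model of what I mean
(all names invented, nothing to do with the actual patch's API): the
accounting layer owns the per-page refcount, so core VM just calls charge
at every map site and uncharge at every unmap site, and the container code
decides internally whether any real accounting happens.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of "charge unconditionally at map time, uncharge
 * unconditionally at unmap time".  The accounting layer keeps the
 * per-page refcount itself, so core VM callers never need to know
 * whether a page was already accounted. */

struct mem_container_model { long usage, limit; };

struct page_model {
    int ref_cnt;                        /* outstanding charges      */
    struct mem_container_model *cont;   /* container billed, if any */
};

/* Callable from every map site; bills the container only on 0 -> 1. */
static int model_charge(struct page_model *page,
                        struct mem_container_model *cont)
{
    if (page->ref_cnt > 0) {
        page->ref_cnt++;                /* already accounted: ref only */
        return 0;
    }
    if (cont->usage >= cont->limit)
        return -1;                      /* over limit: caller OOMs */
    page->ref_cnt = 1;
    page->cont = cont;
    cont->usage++;
    return 0;
}

/* Callable from every unmap site; unbills only on 1 -> 0. */
static void model_uncharge(struct page_model *page)
{
    if (--page->ref_cnt == 0) {
        page->cont->usage--;
        page->cont = NULL;
    }
}
```

With something like that, rmap wouldn't need the "charged during prepare"
special cases at all.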
> #ifdef CONFIG_DEBUG_VM
> @@ -642,6 +655,8 @@ void page_remove_rmap(struct page *page,
> page_clear_dirty(page);
> set_page_dirty(page);
> }
> + mem_container_uncharge_page(page);
> +
> __dec_zone_page_state(page,
> PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> }
> diff -puN mm/swap_state.c~memory-controller-memory-accounting-v7 mm/swap_state.c
> --- a/mm/swap_state.c~memory-controller-memory-accounting-v7
> +++ a/mm/swap_state.c
> @@ -17,6 +17,7 @@
> #include <linux/backing-dev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> +#include <linux/memcontrol.h>
>
> #include <asm/pgtable.h>
>
> @@ -80,6 +81,11 @@ static int __add_to_swap_cache(struct pa
> BUG_ON(PagePrivate(page));
> error = radix_tree_preload(gfp_mask);
> if (!error) {
> +
> + error = mem_container_charge(page, current->mm);
> + if (error)
> + goto out;
> +
> write_lock_irq(&swapper_space.tree_lock);
> error = radix_tree_insert(&swapper_space.page_tree,
> entry.val, page);
> @@ -89,10 +95,13 @@ static int __add_to_swap_cache(struct pa
> set_page_private(page, entry.val);
> total_swapcache_pages++;
> __inc_zone_page_state(page, NR_FILE_PAGES);
> - }
> + } else
> + mem_container_uncharge_page(page);
> +
> write_unlock_irq(&swapper_space.tree_lock);
> radix_tree_preload_end();
> }
> +out:
> return error;
> }
Uh. You have to be really careful here (and in add_to_page_cache, and
possibly others I haven't really looked)... you're ignoring gfp masks
because mem_container_charge allocates with GFP_KERNEL. You can have
all sorts of GFP_ATOMIC, GFP_NOIO, NOFS etc. come in these places that
will deadlock.
I think I would much prefer you make mem_container_charge run atomically,
and then have just a single hook at the site of where everything else is
checked, rather than having all this pre-charge post-uncharge messiness.
Ditto for mm/memory.c... all call sites really... should make them much
nicer looking.
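Roughly, a gfp-aware charge could look like this toy user-space model
(the flag value and all names are invented for illustration): it only
falls back to blocking reclaim when the caller's mask allows sleeping,
so a single hook at the existing check site would be enough.

```c
#include <assert.h>

/* Toy model of a charge that respects the caller's gfp mask: it only
 * falls back to (blocking) reclaim when the mask allows sleeping. */

#define MODEL_GFP_WAIT  0x1     /* stand-in for __GFP_WAIT */

struct container_model { long usage, limit; };

/* Pretend reclaim: frees one charged page if there is one. */
static int model_reclaim(struct container_model *c)
{
    if (c->usage == 0)
        return 0;
    c->usage--;
    return 1;
}

static int model_charge(struct container_model *c, unsigned int gfp_mask)
{
    while (c->usage >= c->limit) {
        if (!(gfp_mask & MODEL_GFP_WAIT))
            return -1;          /* atomic caller: fail, never block */
        if (!model_reclaim(c))
            return -1;          /* over limit, nothing reclaimable */
    }
    c->usage++;
    return 0;
}
```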
...
I didn't look at all code in all patches. But overall I guess it isn't
looking too bad (though I'd really like people like Andrea and Hugh to
take a look as well, eventually). It is still a lot uglier in terms of
core mm hooks than I would like, but with luck that could be improved
(some of my suggestions might even help a bit).
Haven't looked at the targeted reclaim patches yet (sigh, more
container tentacles :P).
--
SUSE Labs, Novell Inc.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-28 6:35 ` + memory-controller-memory-accounting-v7.patch added to -mm tree Nick Piggin
@ 2007-08-28 7:26 ` Balbir Singh
2007-08-28 9:29 ` Nick Piggin
0 siblings, 1 reply; 10+ messages in thread
From: Balbir Singh @ 2007-08-28 7:26 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
Nick Piggin wrote:
> akpm@linux-foundation.org wrote:
>> The patch titled
>> Memory controller: memory accounting
>
> What are the plans for dealing with types of pages other than pagecache?
> Surely not adding hooks on a case-by-case basis throughout the whole kernel?
> Or do we not actually care about giving real guarantees about memory
> availability? Presumably you do, otherwise you wouldn't be carefully
> charging things before you actually insert them and uncharging afterwards
> on failure, etc.
>
We deal with RSS and Page/Swap Cache in this controller.
> If you do, then I would have thought the best way to go would be to start
> with the simplest approach, and add things like targeted reclaim after
> that.
>
I think the kernel accounting/control system will have to be an independent
controller (since we cannot reclaim kernel pages, just limit them by
accounting (except for some slab pages which are free)).
> I just don't want to see little bits that add more and more hooks slowly go
> in under the radar ;) So... there is a coherent plan, right?
>
Nothing will go under the radar :-) Everything will be posted to linux-mm
and linux-kernel. We'll make sure you don't wake up one day and see
the mm code has changed to something you cannot recognize ;)
>
>> +/*
>> + * Charge the memory controller for page usage.
>> + * Return
>> + * 0 if the charge was successful
>> + * < 0 if the container is over its limit
>> + */
>> +int mem_container_charge(struct page *page, struct mm_struct *mm)
>> +{
>> + struct mem_container *mem;
>> + struct page_container *pc, *race_pc;
>> +
>> + /*
>> + * Should page_container's go to their own slab?
>> + * One could optimize the performance of the charging routine
>> + * by saving a bit in the page_flags and using it as a lock
>> + * to see if the container page already has a page_container
>> associated
>> + * with it
>> + */
>> + lock_page_container(page);
>> + pc = page_get_page_container(page);
>> + /*
>> + * The page_container exists and the page has already been accounted
>> + */
>> + if (pc) {
>> + atomic_inc(&pc->ref_cnt);
>> + goto done;
>> + }
>> +
>> + unlock_page_container(page);
>> +
>> + pc = kzalloc(sizeof(struct page_container), GFP_KERNEL);
>> + if (pc == NULL)
>> + goto err;
>> +
>> + rcu_read_lock();
>> + /*
>> + * We always charge the container the mm_struct belongs to
>> + * the mm_struct's mem_container changes on task migration if the
>> + * thread group leader migrates. It's possible that mm is not
>> + * set, if so charge the init_mm (happens for pagecache usage).
>> + */
>> + if (!mm)
>> + mm = &init_mm;
>> +
>> + mem = rcu_dereference(mm->mem_container);
>> + /*
>> + * For every charge from the container, increment reference
>> + * count
>> + */
>> + css_get(&mem->css);
>> + rcu_read_unlock();
>
> Where's the corresponding rcu_assign_pointer? Oh, in the next patch. That's
> unconventional (can you rearrange it so they go in together, please?).
>
>
The movement is associated with task migration; it should be easy to move
the patch around.
>> @@ -1629,6 +1637,9 @@ gotten:
>> goto oom;
>> cow_user_page(new_page, old_page, address, vma);
>>
>> + if (mem_container_charge(new_page, mm))
>> + goto oom_free_new;
>> +
>> /*
>> * Re-check the pte - we dropped the lock
>> */
>> @@ -1660,7 +1671,9 @@ gotten:
>> /* Free the old page.. */
>> new_page = old_page;
>> ret |= VM_FAULT_WRITE;
>> - }
>> + } else
>> + mem_container_uncharge_page(new_page);
>> +
>> if (new_page)
>> page_cache_release(new_page);
>> if (old_page)
>> @@ -1681,6 +1694,8 @@ unlock:
>> put_page(dirty_page);
>> }
>> return ret;
>> +oom_free_new:
>> + __free_page(new_page);
>> oom:
>> if (old_page)
>> page_cache_release(old_page);
>> @@ -2085,6 +2100,11 @@ static int do_swap_page(struct mm_struct
>> }
>>
>> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>> + if (mem_container_charge(page, mm)) {
>> + ret = VM_FAULT_OOM;
>> + goto out;
>> + }
>
> The oom-from-pagefault path is quite crap (not this code, mainline).
> It doesn't obey any of the given oom killing policy.
>
> I had a patch to fix this, but nobody else was too concerned at that
> stage. I don't expect that path would have been hit very much before,
> but with a lot of -EFAULTs coming out of containers, I think
> VM_FAULT_OOMs could easily trigger in practice now.
>
> http://lkml.org/lkml/2006/10/12/158
>
> Patch may be a little outdated, but the basics should still work. You
> might be in a good position to test it with these container patches?
>
I'll test the patch. Thanks for pointing me in the right direction.
>
>> diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c
>> --- a/mm/rmap.c~memory-controller-memory-accounting-v7
>> +++ a/mm/rmap.c
>> @@ -48,6 +48,7 @@
>> #include <linux/rcupdate.h>
>> #include <linux/module.h>
>> #include <linux/kallsyms.h>
>> +#include <linux/memcontrol.h>
>>
>> #include <asm/tlbflush.h>
>>
>> @@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag
>> VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
>> if (atomic_inc_and_test(&page->_mapcount))
>> __page_set_anon_rmap(page, vma, address);
>> - else
>> + else {
>> __page_check_anon_rmap(page, vma, address);
>> + /*
>> + * We unconditionally charged during prepare, we uncharge here
>> + * This takes care of balancing the reference counts
>> + */
>> + mem_container_uncharge_page(page);
>> + }
>> }
>>
>> /*
>> @@ -582,6 +589,12 @@ void page_add_file_rmap(struct page *pag
>> {
>> if (atomic_inc_and_test(&page->_mapcount))
>> __inc_zone_page_state(page, NR_FILE_MAPPED);
>> + else
>> + /*
>> + * We unconditionally charged during prepare, we uncharge here
>> + * This takes care of balancing the reference counts
>> + */
>> + mem_container_uncharge_page(page);
>> }
>
> What's "during prepare"? Better would be "before adding the page to
> page tables" or something.
>
Yeah.. I'll change that comment
> But... why do you do it? The refcounts and charging are already taken
> care of in your charge function, aren't they? Just unconditionally
> charge at map time and unconditionally uncharge at unmap time, and
> let your accounting implementation take care of what to actually do.
>
> (This is what I mean about mem container creeping into core code --
> it's fine to have some tasteful hooks, but introducing these more
> complex interactions between core VM and container accounting details
> is nasty).
>
> I would much prefer this whole thing to _not_ hook into rmap like
> this at all. Do the call unconditionally, and your container
> implementation can do as much weird and wonderful refcounting as its
> heart desires.
>
The reason we have the accounting this way is that after we
charge, some other code path could still fail, and we then need
to uncharge the page. We do most of the refcounting internally.
mem_container_charge() can fail if the container is over
its limit and we cannot reclaim enough pages to allow a new
charge to be added. In that case we go to OOM.
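As a rough user-space sketch of that charge-then-maybe-uncharge pattern
(all names here are invented for illustration, not the actual patch code):

```c
#include <assert.h>

/* Sketch of the pre-charge / post-uncharge pattern: charge before the
 * page becomes visible, and uncharge on any later failure so the
 * accounting stays balanced. */

struct container_model { long usage, limit; };

static int model_charge(struct container_model *c)
{
    if (c->usage >= c->limit)
        return -1;              /* over limit: caller goes to OOM */
    c->usage++;
    return 0;
}

static void model_uncharge(struct container_model *c)
{
    c->usage--;
}

/* Stands in for a path like __add_to_swap_cache(): a step after the
 * charge can still fail, so its error path must uncharge. */
static int model_insert(struct container_model *c, int insert_fails)
{
    if (model_charge(c))
        return -1;
    if (insert_fails) {
        model_uncharge(c);      /* back out the speculative charge */
        return -1;
    }
    return 0;
}
```

The point is just that any failure after the charge has an error path
that restores the container's usage.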
>
>> #ifdef CONFIG_DEBUG_VM
>> @@ -642,6 +655,8 @@ void page_remove_rmap(struct page *page,
>> page_clear_dirty(page);
>> set_page_dirty(page);
>> }
>> + mem_container_uncharge_page(page);
>> +
>> __dec_zone_page_state(page,
>> PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
>> }
>> diff -puN mm/swap_state.c~memory-controller-memory-accounting-v7
>> mm/swap_state.c
>> --- a/mm/swap_state.c~memory-controller-memory-accounting-v7
>> +++ a/mm/swap_state.c
>> @@ -17,6 +17,7 @@
>> #include <linux/backing-dev.h>
>> #include <linux/pagevec.h>
>> #include <linux/migrate.h>
>> +#include <linux/memcontrol.h>
>>
>> #include <asm/pgtable.h>
>>
>> @@ -80,6 +81,11 @@ static int __add_to_swap_cache(struct pa
>> BUG_ON(PagePrivate(page));
>> error = radix_tree_preload(gfp_mask);
>> if (!error) {
>> +
>> + error = mem_container_charge(page, current->mm);
>> + if (error)
>> + goto out;
>> +
>> write_lock_irq(&swapper_space.tree_lock);
>> error = radix_tree_insert(&swapper_space.page_tree,
>> entry.val, page);
>> @@ -89,10 +95,13 @@ static int __add_to_swap_cache(struct pa
>> set_page_private(page, entry.val);
>> total_swapcache_pages++;
>> __inc_zone_page_state(page, NR_FILE_PAGES);
>> - }
>> + } else
>> + mem_container_uncharge_page(page);
>> +
>> write_unlock_irq(&swapper_space.tree_lock);
>> radix_tree_preload_end();
>> }
>> +out:
>> return error;
>> }
>
> Uh. You have to be really careful here (and in add_to_page_cache, and
> possibly others I haven't really looked)... you're ignoring gfp masks
> because mem_container_charge allocates with GFP_KERNEL. You can have
> all sorts of GFP_ATOMIC, GFP_NOIO, NOFS etc. come in these places that
> will deadlock.
>
I ran into some trouble with move_to_swap_cache() and move_from_swap_cache()
since they use GFP_ATOMIC. Given the page is already accounted for and
refcounted, the refcounting helped avoid blocking.
> I think I would much prefer you make mem_container_charge run atomically,
> and then have just a single hook at the site of where everything else is
> checked, rather than having all this pre-charge post-uncharge messiness.
> Ditto for mm/memory.c... all call sites really... should make them much
> nicer looking.
>
The problem is that we need to charge before we add the page to the page
table, and that needs to be a blocking context (since we reclaim when
the container goes over its limit). The uncharge part comes in when we
handle errors from other paths (as I've mentioned earlier in this
email).
>
> I didn't look at all code in all patches. But overall I guess it isn't
> looking too bad (though I'd really like people like Andrea and Hugh to
> take a look as well, eventually). It is still a lot uglier in terms of
> core mm hooks than I would like, but with luck that could be improved
> (some of my suggestions might even help a bit).
>
I would like Andrea and Hugh to review it as well. Eventually, we want
the best thing to go in and make it as non-intrusive as possible.
> Haven't looked at the targeted reclaim patches yet (sigh, more
> container tentacles :P).
>
:-)
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-28 7:26 ` Balbir Singh
@ 2007-08-28 9:29 ` Nick Piggin
2007-08-28 11:39 ` Balbir Singh
0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2007-08-28 9:29 UTC (permalink / raw)
To: balbir
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
Balbir Singh wrote:
> Nick Piggin wrote:
>
>>akpm@linux-foundation.org wrote:
>>
>>>The patch titled
>>> Memory controller: memory accounting
>>
>>What are the plans for dealing with types of pages other than pagecache?
>>Surely not adding hooks on a case-by-case basis throughout the whole kernel?
>>Or do we not actually care about giving real guarantees about memory
>>availability? Presumably you do, otherwise you wouldn't be carefully
>>charging things before you actually insert them and uncharging afterwards
>>on failure, etc.
>>
>
>
> We deal with RSS and Page/Swap Cache in this controller.
Sure. And if all you intend is workload management, then that's probably
fine. If you want guarantees, then it's useless on its own.
>>If you do, then I would have thought the best way to go would be to start
>>with the simplest approach, and add things like targeted reclaim after
>>that.
>>
>
>
> I think the kernel accounting/control system will have to be an independent
> controller (since we cannot reclaim kernel pages, just limit them by
> accounting (except for some slab pages which are free)).
I don't see why accounting would have to be different.
Control obviously will, but for the purposes of using a page, it doesn't
matter to anybody else in the system whether some container has used it
for a pagecache page, or for something else like page tables. The end
result obviously is only that you can either kill the container or
reclaim its reclaimable memory, but before you get to that point, you
need to do the actual accounting, and AFAIKS it would be identical for
all pages (at least for this basic first-touch charging scheme).
>>I just don't want to see little bits that add more and more hooks slowly go
>>in under the radar ;) So... there is a coherent plan, right?
>>
>
>
> Nothing will go under the radar :-) Everything will be posted to linux-mm
> and linux-kernel. We'll make sure you don't wake up one day and see
> the mm code has changed to something you cannot recognize ;)
I don't mean to say it would be done on purpose, but without a coherent
strategy then things might slowly just get added as people think they are
needed. I'm fairly sure several people want to really guarantee memory
resources in an untrusted environment, don't they? And that's obviously
not going to scale by putting calls all throughout the kernel.
>>>+ mem = rcu_dereference(mm->mem_container);
>>>+ /*
>>>+ * For every charge from the container, increment reference
>>>+ * count
>>>+ */
>>>+ css_get(&mem->css);
>>>+ rcu_read_unlock();
>>
>>Where's the corresponding rcu_assign_pointer? Oh, in the next patch. That's
>>unconventional (can you rearrange it so they go in together, please?).
>>
>>
>
>
> The movement is associated with task migration; it should be easy to move
> the patch around.
Thanks.
>>>+ if (mem_container_charge(page, mm)) {
>>>+ ret = VM_FAULT_OOM;
>>>+ goto out;
>>>+ }
>>
>>The oom-from-pagefault path is quite crap (not this code, mainline).
>>It doesn't obey any of the given oom killing policy.
>>
>>I had a patch to fix this, but nobody else was too concerned at that
>>stage. I don't expect that path would have been hit very much before,
>>but with a lot of -EFAULTs coming out of containers, I think
>>VM_FAULT_OOMs could easily trigger in practice now.
>>
>>http://lkml.org/lkml/2006/10/12/158
>>
>>Patch may be a little outdated, but the basics should still work. You
>>might be in a good position to test it with these container patches?
>>
>
>
>
> I'll test the patch. Thanks for pointing me in the right direction.
OK, thanks. Let me know how it goes and we could try again to get it
merged.
>>>@@ -582,6 +589,12 @@ void page_add_file_rmap(struct page *pag
>>> {
>>> if (atomic_inc_and_test(&page->_mapcount))
>>> __inc_zone_page_state(page, NR_FILE_MAPPED);
>>>+ else
>>>+ /*
>>>+ * We unconditionally charged during prepare, we uncharge here
>>>+ * This takes care of balancing the reference counts
>>>+ */
>>>+ mem_container_uncharge_page(page);
>>> }
>>
>>What's "during prepare"? Better would be "before adding the page to
>>page tables" or something.
>>
>
>
> Yeah.. I'll change that comment
>
>
>>But... why do you do it? The refcounts and charging are already taken
>>care of in your charge function, aren't they? Just unconditionally
>>charge at map time and unconditionally uncharge at unmap time, and
>>let your accounting implementation take care of what to actually do.
>>
>>(This is what I mean about mem container creeping into core code --
>>it's fine to have some tasteful hooks, but introducing these more
>>complex interactions between core VM and container accounting details
>>is nasty).
>>
>>I would much prefer this whole thing to _not_ hook into rmap like
>>this at all. Do the call unconditionally, and your container
>>implementation can do as much weird and wonderful refcounting as its
>>heart desires.
>>
>
>
> The reason we have the accounting this way is that after we
> charge, some other code path could still fail, and we then need
> to uncharge the page. We do most of the refcounting internally.
> mem_container_charge() can fail if the container is over
> its limit and we cannot reclaim enough pages to allow a new
> charge to be added. In that case we go to OOM.
But at this point you have already charged the container, and have put
it in the page tables, if I read correctly. Nothing is going to fail
at this point and the page could get uncharged when it is unmapped?
>>>@@ -80,6 +81,11 @@ static int __add_to_swap_cache(struct pa
>>> BUG_ON(PagePrivate(page));
>>> error = radix_tree_preload(gfp_mask);
>>> if (!error) {
>>>+
>>>+ error = mem_container_charge(page, current->mm);
>>>+ if (error)
>>>+ goto out;
>>>+
>>> write_lock_irq(&swapper_space.tree_lock);
>>> error = radix_tree_insert(&swapper_space.page_tree,
>>> entry.val, page);
>>>@@ -89,10 +95,13 @@ static int __add_to_swap_cache(struct pa
>>> set_page_private(page, entry.val);
>>> total_swapcache_pages++;
>>> __inc_zone_page_state(page, NR_FILE_PAGES);
>>>- }
>>>+ } else
>>>+ mem_container_uncharge_page(page);
>>>+
>>> write_unlock_irq(&swapper_space.tree_lock);
>>> radix_tree_preload_end();
>>> }
>>>+out:
>>> return error;
>>> }
>>
>>Uh. You have to be really careful here (and in add_to_page_cache, and
>>possibly others I haven't really looked)... you're ignoring gfp masks
>>because mem_container_charge allocates with GFP_KERNEL. You can have
>>all sorts of GFP_ATOMIC, GFP_NOIO, NOFS etc. come in these places that
>>will deadlock.
>>
>
>
> I ran into some trouble with move_to_swap_cache() and move_from_swap_cache()
> since they use GFP_ATOMIC. Given the page is already accounted for and
> refcounted, the refcounting helped avoid blocking.
add_to_page_cache gets called with GFP_ATOMIC as well, and it gets called
with GFP_NOFS for new pages.
But I don't think you were suggesting that this isn't a problem, were
you? Relying on implementation details in the VM would signal more broken
layering.
>>I think I would much prefer you make mem_container_charge run atomically,
>>and then have just a single hook at the site of where everything else is
>>checked, rather than having all this pre-charge post-uncharge messiness.
>>Ditto for mm/memory.c... all call sites really... should make them much
>>nicer looking.
>>
>
>
> The problem is that we need to charge before we add the page to the page
> table, and that needs to be a blocking context (since we reclaim when
> the container goes over its limit). The uncharge part comes in when we
> handle errors from other paths (as I've mentioned earlier in this
> email).
So I don't understand how you'd deal with GFP_ATOMIC and such restricted
masks, then, if you want to block here.
It would be so so much easier and cleaner for the VM if you did all the
accounting in page alloc and freeing hooks, and just put the page
on per-container LRUs when it goes on the regular LRU.
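Sketched as a toy user-space model (all names invented; a singly linked
push-front stands in for the kernel's list handling):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the alternative: do the accounting in the page
 * allocation/free hooks, and link the page into a per-container LRU
 * when it goes on the regular LRU.  (LRU unlink is elided here.) */

struct page_model;

struct container_model {
    long usage;
    struct page_model *lru;     /* container-local LRU head */
};

struct page_model {
    struct container_model *cont;
    struct page_model *lru_next;
};

static void model_account_alloc(struct container_model *c,
                                struct page_model *p)
{
    p->cont = c;                /* first-touch: allocator bills owner */
    c->usage++;
}

static void model_lru_add(struct page_model *p)
{
    p->lru_next = p->cont->lru; /* mirrors the global LRU insertion */
    p->cont->lru = p;
}

static void model_account_free(struct page_model *p)
{
    p->cont->usage--;
    p->cont = NULL;
}
```

Then the only hooks the core VM ever sees are in the allocator and in
LRU insertion, both of which it already does anyway.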
I'm still going to keep pushing for that approach until either someone
explains why it can't be done or the current patch gets a fair bit
cleaner. Has that already been tried and shown not to work? I would have
thought so seeing as it would be the simplest patch, however I can't
remember hearing about the actual problems with it.
But anyway I don't have any problems with it getting into -mm at this
stage for more exposure and review. Maybe my concerns will get
overridden anyway ;)
--
SUSE Labs, Novell Inc.
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-28 9:29 ` Nick Piggin
@ 2007-08-28 11:39 ` Balbir Singh
2007-08-29 7:28 ` Nick Piggin
0 siblings, 1 reply; 10+ messages in thread
From: Balbir Singh @ 2007-08-28 11:39 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
Nick Piggin wrote:
> Balbir Singh wrote:
>> Nick Piggin wrote:
>>
>>> akpm@linux-foundation.org wrote:
>>>
>>>> The patch titled
>>>> Memory controller: memory accounting
>>>
>>> What are the plans for dealing with types of pages other than pagecache?
>>> Surely not adding hooks on a case-by-case basis throughout the whole
>>> kernel?
>>> Or do we not actually care about giving real guarantees about memory
>>> availability? Presumably you do, otherwise you wouldn't be carefully
>>> charging things before you actually insert them and uncharging afterwards
>>> on failure, etc.
>>>
>>
>>
>> We deal with RSS and Page/Swap Cache in this controller.
>
> Sure. And if all you intend is workload management, then that's probably
> fine. If you want guarantees, then it's useless on its own.
>
Yes. Guarantees are hard to do with just this controller (due to non-reclaimable
pages). It should be easy to add a mlock() controller on top of this one.
Pavel is planning to work on a kernel memory controller. With that in place,
guarantees might be possible.
>
>>> If you do, then I would have thought the best way to go would be to
>>> start
>>> with the simplest approach, and add things like targeted reclaim after
>>> that.
>>>
>>
>>
>> I think the kernel accounting/control system will have to be an
>> independent
>> controller (since we cannot reclaim kernel pages, just limit them by
>> accounting (except for some slab pages which are free)).
>
> I don't see why accounting would have to be different.
>
> Control obviously will, but for the purposes of using a page, it doesn't
> matter to anybody else in the system whether some container has used it
> for a pagecache page, or for something else like page tables. The end
> result obviously is only that you can either kill the container or
> reclaim its reclaimable memory, but before you get to that point, you
> need to do the actual accounting, and AFAIKS it would be identical for
> all pages (at least for this basic first-touch charging scheme).
>
Yes.. I see your point
>
>>> I just don't want to see little bits that add more and more hooks
>>> slowly going in under the radar ;) So... there is a coherent plan, right?
>>>
>>
>>
>> Nothing will go under the radar :-) Everything will be posted to linux-mm
>> and linux-kernel. We'll make sure you don't wake up one day and see
>> the mm code has changed to something you cannot recognize ;)
>
> I don't mean to say it would be done on purpose, but without a coherent
> strategy then things might slowly just get added as people think they are
> needed. I'm fairly sure several people want to really guarantee memory
> resources in an untrusted environment, don't they? And that's obviously
> not going to scale by putting calls all throughout the kernel.
>
We sent out an RFC last year and got several comments from stakeholders:
1. Pavel Emelianov
2. Paul Menage
3. Vaidyanathan Srinivasan
4. YAMAMOTO Takashi
5. KAMEZAWA Hiroyuki
At the OLS resource management BoF, we had a broader participation and
several comments and suggestions.
We've incorporated suggestions and comments from all stake holders
>
>>>> + mem = rcu_dereference(mm->mem_container);
>>>> + /*
>>>> + * For every charge from the container, increment reference
>>>> + * count
>>>> + */
>>>> + css_get(&mem->css);
>>>> + rcu_read_unlock();
>>>
>>> Where's the corresponding rcu_assign_pointer? Oh, in the next patch.
>>> That's
>>> unconventional (can you rearrange it so they go in together, please?).
>>>
>>>
>>
>>
>> The movement is associated with task migration, it should be easy to move
>> the patch around.
>
> Thanks.
>
>
>>>> + if (mem_container_charge(page, mm)) {
>>>> + ret = VM_FAULT_OOM;
>>>> + goto out;
>>>> + }
>>>
>>> The oom-from-pagefault path is quite crap (not this code, mainline).
>>> It doesn't obey any of the given oom killing policy.
>>>
>>> I had a patch to fix this, but nobody else was too concerned at that
>>> stage. I don't expect that path would have been hit very much before,
>>> but with a lot of -EFAULTs coming out of containers, I think
>>> VM_FAULT_OOMs could easily trigger in practice now.
>>>
>>> http://lkml.org/lkml/2006/10/12/158
>>>
>>> Patch may be a little outdated, but the basics should still work. You
>>> might be in a good position to test it with these container patches?
>>>
>>
>>
>>
>> I'll test the patch. Thanks for pointing me in the right direction.
>
> OK, thanks. Let me know how it goes and we could try again to get it
> merged.
>
>
Sure I'll start testing out that code.
>>>> @@ -582,6 +589,12 @@ void page_add_file_rmap(struct page *pag
>>>> {
>>>> if (atomic_inc_and_test(&page->_mapcount))
>>>> __inc_zone_page_state(page, NR_FILE_MAPPED);
>>>> + else
>>>> + /*
>>>> + * We unconditionally charged during prepare, we uncharge here
>>>> + * This takes care of balancing the reference counts
>>>> + */
>>>> + mem_container_uncharge_page(page);
>>>> }
>>>
>>> What's "during prepare"? Better would be "before adding the page to
>>> page tables" or something.
>>>
>>
>>
>> Yeah.. I'll change that comment
>>
>>
>>> But... why do you do it? The refcounts and charging are alrady taken
>>> care of in your charge function, aren't they? Just unconditionally
>>> charge at map time and unconditionally uncharge at unmap time, and
>>> let your accounting implementation take care of what to actually do.
>>>
>>> (This is what I mean about mem container creeping into core code --
>>> it's fine to have some tasteful hooks, but introducing these more
>>> complex interactions between core VM and container accounting details
>>> is nasty).
>>>
>>> I would much prefer this whole thing to _not_ to hook into rmap like
>>> this at all. Do the call unconditionally, and your container
>>> implementation can do as much weird and wonderful refcounting as its
>>> heart desires.
>>>
>>
>>
>> The reason we do the accounting this way is that, after we charge,
>> some other code path could fail and we would need to uncharge the
>> page. We do most of the refcounting internally.
>> mem_container_charge() could fail if the container is over
>> its limit and we cannot reclaim enough pages to allow a new
>> charge to be added. In that case we go to OOM.
>
> But at this point you have already charged the container, and have put
> it in the page tables, if I read correctly. Nothing is going to fail
> at this point and the page could get uncharged when it is unmapped?
>
>
Here's the sequence of steps:
1. Charge (we might need to block on reclaim)
2. Check whether there was a race in updating the PTE
3. Add to the page table, update _mapcount under a lock
The failure typically occurs in step 2, in which case we need to uncharge
the memory container.
>>>> @@ -80,6 +81,11 @@ static int __add_to_swap_cache(struct pa
>>>> BUG_ON(PagePrivate(page));
>>>> error = radix_tree_preload(gfp_mask);
>>>> if (!error) {
>>>> +
>>>> + error = mem_container_charge(page, current->mm);
>>>> + if (error)
>>>> + goto out;
>>>> +
>>>> write_lock_irq(&swapper_space.tree_lock);
>>>> error = radix_tree_insert(&swapper_space.page_tree,
>>>> entry.val, page);
>>>> @@ -89,10 +95,13 @@ static int __add_to_swap_cache(struct pa
>>>> set_page_private(page, entry.val);
>>>> total_swapcache_pages++;
>>>> __inc_zone_page_state(page, NR_FILE_PAGES);
>>>> - }
>>>> + } else
>>>> + mem_container_uncharge_page(page);
>>>> +
>>>> write_unlock_irq(&swapper_space.tree_lock);
>>>> radix_tree_preload_end();
>>>> }
>>>> +out:
>>>> return error;
>>>> }
>>>
>>> Uh. You have to be really careful here (and in add_to_page_cache, and
>>> possibly others I haven't really looked)... you're ignoring gfp masks
>>> because mem_container_charge allocates with GFP_KERNEL. You can have
>>> all sorts of GFP_ATOMIC, GFP_NOIO, NOFS etc. come in these places that
>>> will deadlock.
>>>
>>
>>
>> I ran into some trouble with move_to_swap_cache() and
>> move_from_swap_cache()
>> since they use GFP_ATOMIC. Given the page is already accounted for and
>> refcounted, the refcounting helped avoid blocking.
>
> add_to_page_cache gets called with GFP_ATOMIC as well, and it gets called
> with GFP_NOFS for new pages.
>
> But I don't think you were suggesting that this isn't a problem, were
> you? Relying on implementation in the VM would signal more broken layering.
>
>
Good, thanks for spotting this. It should be possible to fix this case.
I'll work on a fix and send it out.
>>> I think I would much prefer you make mem_container_charge run
>>> atomically,
>>> and then have just a single hook at the site of where everything else is
>>> checked, rather than having all this pre-charge post-uncharge messiness.
>>> Ditto for mm/memory.c... all call sites really... should make them much
>>> nicer looking.
>>>
>>
>>
>> The problem is that we need to charge, before we add the page to the page
>> table and that needs to be a blocking context (since we reclaim when
>> the container goes over limit). The uncharge part comes in when we
>> handle errors from other paths (like I've mentioned earlier in this
>> email).
>
> So I don't understand how you'd deal with GFP_ATOMIC and such restricted
> masks, then, if you want to block here.
>
> It would be so so much easier and cleaner for the VM if you did all the
> accounting in page alloc and freeing hooks, and just put the page
> on per-container LRUs when it goes on the regular LRU.
>
The advantage of the current approach is that not all page allocations have
to worry about charges. Only user space pages (unmapped cache and mapped RSS)
are accounted for.
We tried syncing the container and the global LRU; Pavel is working on
that approach. The problem with adding the page to the container LRU at
the same time as the regular LRU is that the addition ends up being done
under the zone->lru_lock. Holding the entire zone's lru_lock for list
addition might be an overhead, although we do already do list isolation
under the zone's lru lock.
> I'm still going to keep pushing for that approach until either someone
> explains why it can't be done or the current patch gets a fair bit
> cleaner. Has that already been tried and shown not to work? I would have
> thought so seeing as it would be the simplest patch, however I can't
> remember hearing about the actual problems with it.
>
I am all eyes and ears for patches/suggestions/improvements to the current
container.
> But anyway I don't have any problems with it getting into -mm at this
> stage for more exposure and review. Maybe my concerns will get
> overridden anyway ;)
>
Thanks, we'll do the right thing to override your concerns (whether that
is to change code or to convince you that we are already doing it
right)
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-28 11:39 ` Balbir Singh
@ 2007-08-29 7:28 ` Nick Piggin
2007-08-29 8:15 ` Balbir Singh
0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2007-08-29 7:28 UTC (permalink / raw)
To: balbir
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
Balbir Singh wrote:
> Nick Piggin wrote:
>>Sure. And if all you intend is workload management, then that's probably
>>fine. If you want guarantees, then it's useless on its own.
>>
>
>
> Yes. Guarantees are hard to do with just this controller (due to non-reclaimable
> pages). It should be easy to add a mlock() controller on top of this one.
> Pavel is planning to work on a kernel memory controller. With that in place,
> guarantees might be possible.
So it doesn't sound like they are necessarily a requirement for linux
kernel memory control -- that's fine, I just want to know where you
stand with that.
>>I don't mean to say it would be done on purpose, but without a coherent
>>strategy then things might slowly just get added as people think they are
>>needed. I'm fairly sure several people want to really guarantee memory
>>resources in an untrusted environment, don't they? And that's obviously
>>not going to scale by putting calls all throughout the kernel.
>>
>
>
> We sent out an RFC last year and got several comments from stake holders
>
> 1. Pavel Emelianov
> 2. Paul Menage
> 3. Vaidyanathan Srinivasan
> 4. YAMAMOTO Takshi
> 5. KAMEZAWA Hiroyuki
>
>
> At the OLS resource management BoF, we had a broader participation and
> several comments and suggestions.
>
> We've incorporated suggestions and comments from all stake holders
:) I know everyone's really worked hard at this and is trying to do the
right thing. But for example some of the non-container VM people may not
know exactly what you are up to or planning (I don't). It can make things
a bit harder to review IF there is an expectation that more functionality
is going to be required (which is what my expectation is).
Anyway, let's not continue this tangent. Instead, I'll try to be more
constructive and ask for eg. your future plans if they are not clear.
>>But at this point you have already charged the container, and have put
>>it in the page tables, if I read correctly. Nothing is going to fail
>>at this point and the page could get uncharged when it is unmapped?
>>
>>
>
>
> Here's the sequence of steps
>
> 1. Charge (we might need to block on reclaim)
> 2. Check to see if there is a race in updating the PTE
> 3. Add to the page table, update _mapcount under a lock
>
> Several times the error occurs in step 2, due to which we need to uncharge
> the memory container.
Oh yeah, I understand all _those_ uncharging calls you do. But let me
just quote the code again:
diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c
--- a/mm/rmap.c~memory-controller-memory-accounting-v7
+++ a/mm/rmap.c
@@ -48,6 +48,7 @@
#include <linux/rcupdate.h>
#include <linux/module.h>
#include <linux/kallsyms.h>
+#include <linux/memcontrol.h>
#include <asm/tlbflush.h>
@@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
if (atomic_inc_and_test(&page->_mapcount))
__page_set_anon_rmap(page, vma, address);
- else
+ else {
__page_check_anon_rmap(page, vma, address);
+ /*
+ * We unconditionally charged during prepare, we uncharge here
+ * This takes care of balancing the reference counts
+ */
+ mem_container_uncharge_page(page);
+ }
}
At the point you are uncharging here, the pte has been updated and
the page is in the pagetables of the process. I guess this uncharging
works on the presumption that you are not the first container to map
the page, but I thought that you already check for that in your
accounting implementation.
Now how does it take care of the refcounts? I guess it is because your
rmap removal function also takes care of refcounts by only uncharging
if mapcount has gone to zero... however that's polluting the VM with
knowledge that your accounting scheme is a first-touch one, isn't it?
Aside, I'm slightly suspicious of whether this is correct against
mapcount races, but I didn't look closely yet. I remember you bringing
that up with me, so I guess you've been careful there...
>>add_to_page_cache gets called with GFP_ATOMIC as well, and it gets called
>>with GFP_NOFS for new pages.
>>
>>But I don't think you were suggesting that this isn't a problem, were
>>you? Relying on implementation in the VM would signal more broken layering.
>>
>>
>
>
> Good, thanks for spotting this. It should be possible to fix this case.
> I'll work on a fix and send it out.
OK.
>>It would be so so much easier and cleaner for the VM if you did all the
>>accounting in page alloc and freeing hooks, and just put the page
>>on per-container LRUs when it goes on the regular LRU.
>>
>
>
> The advantage of the current approach is that not all page allocations have
> to worry about charges. Only user space pages (unmapped cache and mapped RSS)
> are accounted for.
That could be true. OTOH, if you have a significant amount of non userspace
page allocations happening, the chances are that your userspace-only
accounting is going out the window too :)
You'd still be able to avoid charging if a process isn't in a container
or accounting is turned off, of course.
> We tried syncing the container and the global LRU. Pavel is working on
> that approach. The problem with adding the page at the same time as the
> regular LRU is that we end up calling the addition under the zone->lru_lock.
> Holding the entire zone's lru_lock for list addition might be an overhead.
> Although, we do the list isolation under the zone's lru lock.
Well it may not have to be _exactly_ the same time as the page goes on the
normal lru. You could try doing it in pagevec_lru_add, for example, after
the lru locks have been released?
>>I'm still going to keep pushing for that approach until either someone
>>explains why it can't be done or the current patch gets a fair bit
>>cleaner. Has that already been tried and shown not to work? I would have
>>thought so seeing as it would be the simplest patch, however I can't
>>remember hearing about the actual problems with it.
>>
>
>
> I am all eyes and ears to patches/suggestions/improvements to the current
> container.
Well I have made a few. I'm actually not too interested in containers and
resource control myself, but I suspect it may be a fair bit easier to play
around with eg. my ideas for VM hooks with an implementation in -mm, so I
might get motivated to try a patch...
Thanks,
Nick
--
SUSE Labs, Novell Inc.
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-29 7:28 ` Nick Piggin
@ 2007-08-29 8:15 ` Balbir Singh
2007-08-30 7:39 ` Nick Piggin
0 siblings, 1 reply; 10+ messages in thread
From: Balbir Singh @ 2007-08-29 8:15 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
Nick Piggin wrote:
> Balbir Singh wrote:
>> Nick Piggin wrote:
>
>>> Sure. And if all you intend is workload management, then that's probably
>>> fine. If you want guarantees, then it's useless on its own.
>>>
>>
>>
>> Yes. Guarantees are hard to do with just this controller (due to
>> non-reclaimable
>> pages). It should be easy to add a mlock() controller on top of this one.
>> Pavel is planning to work on a kernel memory controller. With that in
>> place,
>> guarantees might be possible.
>
> So it doesn't sound like they are necessarily a requirement for linux
> kernel memory control -- that's fine, I just want to know where you
> stand with that.
>
>
>>> I don't mean to say it would be done on purpose, but without a coherent
>>> strategy then things might slowly just get added as people think they
>>> are
>>> needed. I'm fairly sure several people want to really guarantee memory
>>> resources in an untrusted environment, don't they? And that's obviously
>>> not going to scale by putting calls all throughout the kernel.
>>>
>>
>>
>> We sent out an RFC last year and got several comments from stake holders
>>
>> 1. Pavel Emelianov
>> 2. Paul Menage
>> 3. Vaidyanathan Srinivasan
>> 4. YAMAMOTO Takshi
>> 5. KAMEZAWA Hiroyuki
>>
>>
>> At the OLS resource management BoF, we had a broader participation and
>> several comments and suggestions.
>>
>> We've incorporated suggestions and comments from all stake holders
>
> :) I know everyone's really worked hard at this and is trying to do the
> right thing. But for example some of the non-container VM people may not
> know exactly what you are up to or planning (I don't). It can make things
> a bit harder to review IF there is an expectation that more functionality
> is going to be required (which is what my expectation is).
>
> Anyway, let's not continue this tangent. Instead, I'll try to be more
> constructive and ask for eg. your future plans if they are not clear.
>
Thanks, I am going to be present at the VM summit and we have a roadmap
as well. We can discuss it there and we'll also send it out for comments.
Just FYI, we started with an RFC about a year ago and got several
comments from non-container folks. We'll try to discuss the roadmap
further and gather more input.
>
>>> But at this point you have already charged the container, and have put
>>> it in the page tables, if I read correctly. Nothing is going to fail
>>> at this point and the page could get uncharged when it is unmapped?
>>>
>>>
>>
>>
>> Here's the sequence of steps
>>
>> 1. Charge (we might need to block on reclaim)
>> 2. Check to see if there is a race in updating the PTE
>> 3. Add to the page table, update _mapcount under a lock
>>
>> Several times the error occurs in step 2, due to which we need to
>> uncharge
>> the memory container.
>
> Oh yeah, I understand all _those_ uncharging calls you do. But let me
> just quote the code again:
>
> diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c
> --- a/mm/rmap.c~memory-controller-memory-accounting-v7
> +++ a/mm/rmap.c
> @@ -48,6 +48,7 @@
> #include <linux/rcupdate.h>
> #include <linux/module.h>
> #include <linux/kallsyms.h>
> +#include <linux/memcontrol.h>
>
> #include <asm/tlbflush.h>
>
> @@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag
> VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> if (atomic_inc_and_test(&page->_mapcount))
> __page_set_anon_rmap(page, vma, address);
> - else
> + else {
> __page_check_anon_rmap(page, vma, address);
> + /*
> + * We unconditionally charged during prepare, we uncharge here
> + * This takes care of balancing the reference counts
> + */
> + mem_container_uncharge_page(page);
> + }
> }
>
> At the point you are uncharging here, the pte has been updated and
> the page is in the pagetables of the process. I guess this uncharging
> works on the presumption that you are not the first container to map
> the page, but I thought that you already check for that in your
> accounting implementation.
>
> Now how does it take care of the refcounts? I guess it is because your
> rmap removal function also takes care of refcounts by only uncharging
> if mapcount has gone to zero... however that's polluting the VM with
> knowledge that your accounting scheme is a first-touch one, isn't it?
>
> Aside, I'm slightly suspicious of whether this is correct against
> mapcount races, but I didn't look closely yet. I remember you bringing
> that up with me, so I guess you've been careful there...
>
Very good review comment. Here's what we see:
1. Page comes in through the page cache; we increment the reference count
2. Page comes into rmap; we increment the refcount again
3. We race in page_add.*rmap(); the problem we have is that for
   the same page, step 2 would have taken place more than once
4. That's why we uncharge
I think I need to add a big fat comment on the ref_cnt member. ref_cnt
is held once for the page in the page cache and once for the page mapped
anywhere in the page tables.
Reference counting helps us correctly determine when to take the page
off the LRU (it moves from page tables to swap cache, but it's still
on the LRU).
>>> add_to_page_cache gets called with GFP_ATOMIC as well, and it gets
>>> called
>>> with GFP_NOFS for new pages.
>>>
>>> But I don't think you were suggesting that this isn't a problem, were
>>> you? Relying on implementation in the VM would signal more broken
>>> layering.
>>>
>>>
>>
>>
>> Good, thanks for spotting this. It should be possible to fix this case.
>> I'll work on a fix and send it out.
>
> OK.
>
>
>>> It would be so so much easier and cleaner for the VM if you did all the
>>> accounting in page alloc and freeing hooks, and just put the page
>>> on per-container LRUs when it goes on the regular LRU.
>>>
>>
>>
>> The advantage of the current approach is that not all page allocations
>> have
>> to worry about charges. Only user space pages (unmapped cache and
>> mapped RSS)
>> are accounted for.
>
> That could be true. OTOH, if you have a significant amount of non userspace
> page allocations happening, the chances are that your userspace-only
> accounting is going out the window too :)
>
> You'd still be able to avoid charging if a process isn't in a container
> or accounting is turned off, of course.
>
Yes, we need a mechanism to turn off the container at boot time. Currently
if the config is enabled, we charge :-)
>
>> We tried syncing the container and the global LRU. Pavel is working on
>> that approach. The problem with adding the page at the same time as the
>> regular LRU is that we end up calling the addition under the
>> zone->lru_lock.
>> Holding the entire zone's lru_lock for list addition might be an
>> overhead.
>> Although, we do the list isolation under the zone's lru lock.
>
> Well it may not have to be _exactly_ the same time as the page goes on the
> normal lru. You could try doing it in pagevec_lru_add, for example, after
> the lru locks have been released?
>
I'll play around with Pavel's patches or work with him to explore this
further.
>
>>> I'm still going to keep pushing for that approach until either someone
>>> explains why it can't be done or the current patch gets a fair bit
>>> cleaner. Has that already been tried and shown not to work? I would have
>>> thought so seeing as it would be the simplest patch, however I can't
>>> remember hearing about the actual problems with it.
>>>
>>
>>
>> I am all eyes and ears to patches/suggestions/improvements to the current
>> container.
>
> Well I have made a few. I'm actually not too interested in containers and
> resource control myself, but I suspect it may be a fair bit easier to play
> around with eg. my ideas for VM hooks with an implementation in -mm, so I
> might get motivated to try a patch...
>
Great! Thanks!
> Thanks,
> Nick
>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-29 8:15 ` Balbir Singh
@ 2007-08-30 7:39 ` Nick Piggin
2007-08-30 9:04 ` Balbir Singh
0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2007-08-30 7:39 UTC (permalink / raw)
To: balbir
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
Balbir Singh wrote:
> Nick Piggin wrote:
>>Oh yeah, I understand all _those_ uncharging calls you do. But let me
>>just quote the code again:
>>
>>diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c
>>--- a/mm/rmap.c~memory-controller-memory-accounting-v7
>>+++ a/mm/rmap.c
>>@@ -48,6 +48,7 @@
>> #include <linux/rcupdate.h>
>> #include <linux/module.h>
>> #include <linux/kallsyms.h>
>>+#include <linux/memcontrol.h>
>>
>> #include <asm/tlbflush.h>
>>
>>@@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag
>> VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
>> if (atomic_inc_and_test(&page->_mapcount))
>> __page_set_anon_rmap(page, vma, address);
>>- else
>>+ else {
>> __page_check_anon_rmap(page, vma, address);
>>+ /*
>>+ * We unconditionally charged during prepare, we uncharge here
>>+ * This takes care of balancing the reference counts
>>+ */
>>+ mem_container_uncharge_page(page);
>>+ }
>> }
>>
>>At the point you are uncharging here, the pte has been updated and
>>the page is in the pagetables of the process. I guess this uncharging
>>works on the presumption that you are not the first container to map
>>the page, but I thought that you already check for that in your
>>accounting implementation.
>>
>>Now how does it take care of the refcounts? I guess it is because your
>>rmap removal function also takes care of refcounts by only uncharging
>>if mapcount has gone to zero... however that's polluting the VM with
>>knowledge that your accounting scheme is a first-touch one, isn't it?
>>
>>Aside, I'm slightly suspicious of whether this is correct against
>>mapcount races, but I didn't look closely yet. I remember you bringing
>>that up with me, so I guess you've been careful there...
>>
>
>
> Very good review comment. Here's what we see
>
> 1. Page comes in through page cache, we increment the reference count
> 2. Page comes into rmap, we increment the refcount again
> 3. We race in page_add.*rmap(), the problem we have is that for
> the same page, for rmap(), step 2 would have taken place more than
> once
> 4. That's why we uncharge
>
> I think I need to add a big fat comment in the ref_cnt member. ref_cnt
> is held once for the page in the page cache and once for the page mapped
> anywhere in the page tables.
>
> reference counting helps us correctly determine when to take the page
> off the LRU (it moves from page tables to swap cache, but it's still
> on the LRU).
Still don't understand. You increment the refcount once when you put
the page in the pagecache, then again when the first process maps the
page, then again while subsequent processes map the page but you soon
drop it afterwards. That's fine, I don't pretend to understand why
you're doing it, but presumably the controller has a good reason for
that.
But my point is, why should the VM know or care about that? You should
handle all those details in your controller. If, in order to do that,
you need to differentiate between when a process puts a page in
pagecache and when it maps a page, that's fine, just use different
hooks for those events.
The situation now is that your one hook is not actually a "this page
was mapped" hook, or a "this page was added to pagecache", or "we are
about to map this page". These are easy for VM maintainers to maintain
because they're simple VM concepts.
But your hook is "increment ref_cnt and do some other stuff". So now
the VM needs to know about when and why your container implementation
needs to increment and decrement this ref_cnt. I don't know this, and
I don't want to know this ;)
--
SUSE Labs, Novell Inc.
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-30 7:39 ` Nick Piggin
@ 2007-08-30 9:04 ` Balbir Singh
2007-08-31 0:35 ` Nick Piggin
0 siblings, 1 reply; 10+ messages in thread
From: Balbir Singh @ 2007-08-30 9:04 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
Nick Piggin wrote:
> Balbir Singh wrote:
>> Nick Piggin wrote:
>
>> Very good review comment. Here's what we see
>>
>> 1. Page comes in through page cache, we increment the reference count
>> 2. Page comes into rmap, we increment the refcount again
>> 3. We race in page_add.*rmap(), the problem we have is that for
>> the same page, for rmap(), step 2 would have taken place more than
>> once
>> 4. That's why we uncharge
>>
>> I think I need to add a big fat comment in the ref_cnt member. ref_cnt
>> is held once for the page in the page cache and once for the page mapped
>> anywhere in the page tables.
>>
>> reference counting helps us correctly determine when to take the page
>> off the LRU (it moves from page tables to swap cache, but it's still
>> on the LRU).
>
> Still don't understand. You increment the refcount once when you put
> the page in the pagecache, then again when the first process maps the
> page, then again while subsequent processes map the page but you soon
> drop it afterwards. That's fine, I don't pretend to understand why
> you're doing it, but presumably the controller has a good reason for
> that.
>
> But my point is, why should the VM know or care about that? You should
> handle all those details in your controller. If, in order to do that,
> you need to differentiate between when a process puts a page in
> pagecache and when it maps a page, that's fine, just use different
> hooks for those events.
>
> The situation now is that your one hook is not actually a "this page
> was mapped" hook, or a "this page was added to pagecache", or "we are
> about to map this page". These are easy for VM maintainers to maintain
> because they're simple VM concepts.
>
> But your hook is "increment ref_cnt and do some other stuff". So now
> the VM needs to know about when and why your container implementation
> needs to increment and decrement this ref_cnt. I don't know this, and
> I don't want to know this ;)
>
My hook really means: there was a race. There is no rmap lock to prevent
several independent processes from mapping the same page into their
page tables. I want to increment the reference count just once (apart from
the page being accounted in the page cache), since we account the page once.
I'll revisit this hook to see if it can be made cleaner.
Thanks for your detailed review comments.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-30 9:04 ` Balbir Singh
@ 2007-08-31 0:35 ` Nick Piggin
2007-08-31 4:40 ` Balbir Singh
0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2007-08-31 0:35 UTC (permalink / raw)
To: balbir
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
Balbir Singh wrote:
> Nick Piggin wrote:
>
>>Balbir Singh wrote:
>>
>>>Nick Piggin wrote:
>>
>>>Very good review comment. Here's what we see
>>>
>>>1. Page comes in through page cache, we increment the reference count
>>>2. Page comes into rmap, we increment the refcount again
>>>3. We race in page_add.*rmap(), the problem we have is that for
>>> the same page, for rmap(), step 2 would have taken place more than
>>> once
>>>4. That's why we uncharge
>>>
>>>I think I need to add a big fat comment in the ref_cnt member. ref_cnt
>>>is held once for the page in the page cache and once for the page mapped
>>>anywhere in the page tables.
>>>
>>>reference counting helps us correctly determine when to take the page
>>>off the LRU (it moves from page tables to swap cache, but it's still
>>>on the LRU).
>>
>>Still don't understand. You increment the refcount once when you put
>>the page in the pagecache, then again when the first process maps the
>>page, then again while subsequent processes map the page but you soon
>>drop it afterwards. That's fine, I don't pretend to understand why
>>you're doing it, but presumably the controller has a good reason for
>>that.
>>
>>But my point is, why should the VM know or care about that? You should
>>handle all those details in your controller. If, in order to do that,
>>you need to differentiate between when a process puts a page in
>>pagecache and when it maps a page, that's fine, just use different
>>hooks for those events.
>>
>>The situation now is that your one hook is not actually a "this page
>>was mapped" hook, or a "this page was added to pagecache", or "we are
>>about to map this page". These are easy for VM maintainers to maintain
>>because they're simple VM concepts.
>>
>>But your hook is "increment ref_cnt and do some other stuff". So now
>>the VM needs to know about when and why your container implementation
>>needs to increment and decrement this ref_cnt. I don't know this, and
>>I don't want to know this ;)
>>
>
>
> My hook really is -- there was a race, there is no rmap lock to prevent
> several independent processes from mapping the same page into their
> page tables. I want to increment the reference count just once (apart from
> it being accounted in the page cache), since we account the page once.
>
> I'll revisit this hook to see if it can be made cleaner
If you just have a different hook for mapping a page into the page
tables, your controller can take care of any races, no?
--
SUSE Labs, Novell Inc.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
2007-08-31 0:35 ` Nick Piggin
@ 2007-08-31 4:40 ` Balbir Singh
0 siblings, 0 replies; 10+ messages in thread
From: Balbir Singh @ 2007-08-31 4:40 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, a.p.zijlstra, dev, ebiederm, herbert, menage, rientjes,
svaidy, xemul, Linux Memory Management
<snip>
Nick Piggin wrote:
>>
>> My hook really is -- there was a race, there is no rmap lock to prevent
>> several independent processes from mapping the same page into their
>> page tables. I want to increment the reference count just once (apart
>> from
>> it being accounted in the page cache), since we account the page once.
>>
>> I'll revisit this hook to see if it can be made cleaner
>
> If you just have a different hook for mapping a page into the page
> tables, your controller can take care of any races, no?
>
We increment the reference count in the mem_container_charge() routine.
Not all pages go into page tables, so it makes sense to do the reference
counting there. The charge routine also does reclaim, so we cannot call
it at mapping time, since the mapping happens under the pte lock.
I'll see if I can refactor the way we reference count; I agree that
it would simplify code maintenance.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 10+ messages in thread