linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [BUGFIX][PATCH mmotm] memcg fix swap accounting leak
@ 2008-12-12  8:29 KAMEZAWA Hiroyuki
  2008-12-12  9:43 ` Daisuke Nishimura
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-12  8:29 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, balbir, nishimura, akpm


From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Fix swap-page-fault charge leak of memcg.

Now, memcg has hooks to swap-out operation and checks SwapCache is really
unused or not. That check depends on contents of struct page.
I.e. If PageAnon(page) && page_mapped(page), the page is recoginized as
still-in-use.

Now, reuse_swap_page() calles delete_from_swap_cache() before establishment
of any rmap. Then, in followinig sequence

	(Page fault with WRITE)
	Assume the page is SwapCache "on memory (still charged)"
	try_charge() (charge += PAGESIZE)
	commit_charge()
	   => (Check page_cgroup and found PCG_USED bit, charge-=PAGE_SIZE
	       because it seems already charged.)
	reuse_swap_page()
		-> delete_from_swapcache()
			-> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
	......

too much uncharge.....

To avoid this,  move commit_charge() after page_mapcount() goes up to 1.
By this,
        Assume the page is SwapCache "on memory"
	try_charge()		(charge += PAGESIZE)
	reuse_swap_page()	(may charge -= PAGESIZE if PCG_USED is set)
	commit_charge()		(Ony if page_cgroup is marked as PCG_USED,
				 charge -= PAGESIZE)
Accounting will be correct.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memory.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: mmotm-2.6.28-Dec11/mm/memory.c
===================================================================
--- mmotm-2.6.28-Dec11.orig/mm/memory.c
+++ mmotm-2.6.28-Dec11/mm/memory.c
@@ -2433,17 +2433,17 @@ static int do_swap_page(struct mm_struct
 	 * which may delete_from_swap_cache().
 	 */
 
-	mem_cgroup_commit_charge_swapin(page, ptr);
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
-
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	/* It's better to call commit-charge after rmap is established */
+	mem_cgroup_commit_charge_swapin(page, ptr);
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak
  2008-12-12  8:29 [BUGFIX][PATCH mmotm] memcg fix swap accounting leak KAMEZAWA Hiroyuki
@ 2008-12-12  9:43 ` Daisuke Nishimura
  2008-12-12 11:16   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2008-12-12  9:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, balbir, akpm, Hugh Dickins, nishimura

(add CC: Hugh Dickins <hugh@veritas.com>)

On Fri, 12 Dec 2008 17:29:30 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Fix swap-page-fault charge leak of memcg.
> 
> Now, memcg has hooks to swap-out operation and checks SwapCache is really
> unused or not. That check depends on contents of struct page.
> I.e. If PageAnon(page) && page_mapped(page), the page is recoginized as
> still-in-use.
> 
> Now, reuse_swap_page() calles delete_from_swap_cache() before establishment
> of any rmap. Then, in followinig sequence
> 
> 	(Page fault with WRITE)
> 	Assume the page is SwapCache "on memory (still charged)"
> 	try_charge() (charge += PAGESIZE)
> 	commit_charge()
> 	   => (Check page_cgroup and found PCG_USED bit, charge-=PAGE_SIZE
> 	       because it seems already charged.)
> 	reuse_swap_page()
> 		-> delete_from_swapcache()
> 			-> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
> 	......
> 
> too much uncharge.....
> 
> To avoid this,  move commit_charge() after page_mapcount() goes up to 1.
> By this,
>         Assume the page is SwapCache "on memory"
> 	try_charge()		(charge += PAGESIZE)
> 	reuse_swap_page()	(may charge -= PAGESIZE if PCG_USED is set)
> 	commit_charge()		(Ony if page_cgroup is marked as PCG_USED,
> 				 charge -= PAGESIZE)
> Accounting will be correct.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I've confirmed that the problem you reported offline is fixed, but...

> ---
>  mm/memory.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: mmotm-2.6.28-Dec11/mm/memory.c
> ===================================================================
> --- mmotm-2.6.28-Dec11.orig/mm/memory.c
> +++ mmotm-2.6.28-Dec11/mm/memory.c
> @@ -2433,17 +2433,17 @@ static int do_swap_page(struct mm_struct
>  	 * which may delete_from_swap_cache().
>  	 */
>  
The comment here says:

        /*
         * The page isn't present yet, go ahead with the fault.
         *
         * Be careful about the sequence of operations here.
         * To get its accounting right, reuse_swap_page() must be called
         * while the page is counted on swap but not yet in mapcount i.e.
         * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
         * must be called after the swap_free(), or it will never succeed.
         * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
         * in page->private, must be called before reuse_swap_page(),
         * which may delete_from_swap_cache().
         */

Hmm.. should we save page->private before calling reuse_swap_page and pass it
to mem_cgroup_commit_charge_swapin(I think it cannot be changed because the page
is locked)?


Thanks,
Daisuke Nishimura. 

> -	mem_cgroup_commit_charge_swapin(page, ptr);
>  	inc_mm_counter(mm, anon_rss);
>  	pte = mk_pte(page, vma->vm_page_prot);
>  	if (write_access && reuse_swap_page(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		write_access = 0;
>  	}
> -
>  	flush_icache_page(vma, page);
>  	set_pte_at(mm, address, page_table, pte);
>  	page_add_anon_rmap(page, vma, address);
> +	/* It's better to call commit-charge after rmap is established */
> +	mem_cgroup_commit_charge_swapin(page, ptr);
>  
>  	swap_free(entry);
>  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak
  2008-12-12  9:43 ` Daisuke Nishimura
@ 2008-12-12 11:16   ` KAMEZAWA Hiroyuki
  2008-12-13  7:03     ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2) KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-12 11:16 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, balbir, akpm, Hugh Dickins

Daisuke Nishimura said:
>
>         /*
>          * The page isn't present yet, go ahead with the fault.
>          *
>          * Be careful about the sequence of operations here.
>          * To get its accounting right, reuse_swap_page() must be called
>          * while the page is counted on swap but not yet in mapcount i.e.
>          * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
>          * must be called after the swap_free(), or it will never succeed.
>          * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
>          * in page->private, must be called before reuse_swap_page(),
>          * which may delete_from_swap_cache().
>          */
>
> Hmm.. should we save page->private before calling reuse_swap_page and pass
> it
> to mem_cgroup_commit_charge_swapin(I think it cannot be changed because
> the page
> is locked)?
>
seems not necessary (see below).  I'll fix comment if I uses my pc tomorrow..

Considering 2 cases,
 A. the SwapCache is already chareged before try_charge_swapin()
 B. the SwapCache is very new and not charged before try_charge_swapin()

Case A.
   0. We have charge of PAGE_SIZE to this page before reach here.
   1. try_charge_swapin() is called and charge += PAGE_SIZE
   2. reuse_swap_page() is called.
          when delete_from_swap_cache() is called..
          2-a. if already mapped, no change in charges.
          2-b. if not mapped, charge-=PAGE_SIZE. PCG_USED bit is cleared.
               and charge-record is written into swap_cgroup
          not called.
          2-c. no changes in charge.
   3. commit_charge is called.
          3-a. PCG_USED bit is set, so charge -= PAGE_SIZE.
          3-b. PCG_USED bit is cleared and so we set PCG_USED bit and no
               changes in charge.
          3-c. no changes in charge.
   4-b. swap_free() will clear record in swap_cgroup.

   Then, finally we have PAGE_SIZE of charge to this page.

Case B.
   0. We have no charges to this page.
   1. try_charge_swapin() is called and charge += PAGE_SIZE.
   2. reuse_swap_page() is called.
         2-a if delete_from_swap_cache() is called.
         the page is not mapped. but PCG_USED bit is not set.
         so, no change in charges finally. (just recorded in swap_cgroup)
         2-b. not called ... no changes in charge.
   3. commit_charge() is called and set PCG_USED bit. no changes in charnge.
   4. swap_free() is called and clear record in swap_cgroup.

   Then, finally we have PAGE_SIZE of charge to this page.



Thanks,
-Kame

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2)
  2008-12-12 11:16   ` KAMEZAWA Hiroyuki
@ 2008-12-13  7:03     ` KAMEZAWA Hiroyuki
  2008-12-13  9:49       ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, linux-mm, linux-kernel, balbir, akpm, Hugh Dickins

Updated explanation and fixed comment.
==

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Fix swapin charge operation of memcg.

Now, memcg has hooks to swap-out operation and checks SwapCache is really
unused or not. That check depends on contents of struct page.
I.e. If PageAnon(page) && page_mapped(page), the page is recoginized as
still-in-use.

Now, reuse_swap_page() calles delete_from_swap_cache() before establishment
of any rmap. Then, in followinig sequence

	(Page fault with WRITE)
	try_charge() (charge += PAGESIZE)
	commit_charge() (Check page_cgroup is used or not..)
	reuse_swap_page()
		-> delete_from_swapcache()
			-> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
	......
New charge is uncharged soon....
To avoid this,  move commit_charge() after page_mapcount() goes up to 1.
By this,

	try_charge()		(usage += PAGESIZE)
	reuse_swap_page()	(may usage -= PAGESIZE if PCG_USED is set)
	commit_charge()		(If page_cgroup is not marked as PCG_USED,
				 add new charge.)
Accounting will be correct.

Changelog (v1) -> (v2)
  - fixed comment.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/controllers/memcg_test.txt |   50 +++++++++++++++++++++++++++++--
 mm/memory.c                              |   11 +++---
 2 files changed, 54 insertions(+), 7 deletions(-)

Index: mmotm-2.6.28-Dec12/mm/memory.c
===================================================================
--- mmotm-2.6.28-Dec12.orig/mm/memory.c
+++ mmotm-2.6.28-Dec12/mm/memory.c
@@ -2428,22 +2428,23 @@ static int do_swap_page(struct mm_struct
 	 * while the page is counted on swap but not yet in mapcount i.e.
 	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
 	 * must be called after the swap_free(), or it will never succeed.
-	 * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
-	 * in page->private, must be called before reuse_swap_page(),
-	 * which may delete_from_swap_cache().
+	 * Because delete_from_swap_page() may be called by reuse_swap_page(),
+	 * mem_cgroup_commit_charge_swapin() may not be able to find swp_entry
+	 * in page->private. In this case, a record in swap_cgroup  is silently
+	 * discarded at swap_free().
 	 */
 
-	mem_cgroup_commit_charge_swapin(page, ptr);
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
-
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	/* It's better to call commit-charge after rmap is established */
+	mem_cgroup_commit_charge_swapin(page, ptr);
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
Index: mmotm-2.6.28-Dec12/Documentation/controllers/memcg_test.txt
===================================================================
--- mmotm-2.6.28-Dec12.orig/Documentation/controllers/memcg_test.txt
+++ mmotm-2.6.28-Dec12/Documentation/controllers/memcg_test.txt
@@ -1,6 +1,6 @@
 Memory Resource Controller(Memcg)  Implementation Memo.
-Last Updated: 2008/12/10
-Base Kernel Version: based on 2.6.28-rc7-mm.
+Last Updated: 2008/12/13
+Base Kernel Version: based on 2.6.28-rc8-mm.
 
 Because VM is getting complex (one of reasons is memcg...), memcg's behavior
 is complex. This is a document for memcg's internal behavior.
@@ -115,6 +115,52 @@ Under below explanation, we assume CONFI
 	(But racy state between (a) and (b) exists. We do check it.)
 	At charging, a charge recorded in swap_cgroup is moved to page_cgroup.
 
+	In case (a), reuse_swap_page() may call delete_from_swap_cache() if
+	the page can drop swp_entry and be reused for "WRITE".
+	Note: If the page may be accounted before (A), if it isn't kicked out
+	      to disk before page fault.
+
+	case A) the page is not accounted as SwapCache and SwapCache is deleted
+		by reuse_swap_page().
+		1. try_charge_swapin() is called and
+			- charge_for_memory +=1.
+			- charge_for_memsw  +=1.
+		2. reuse_swap_page -> delete_from_swap_cache() is called.
+			because the page is not accounted as SwapCache,
+			no changes in accounting.
+		3. commit_charge_swapin() finds PCG_USED bit is not set and
+		   set PCG_USED bit.
+		   Because page->private is empty by 2. no changes in charge.
+		4. swap_free(entry) is called.
+			- charge_for_memsw -= 1.
+
+		Finally, charge_for_memory +=1, charge_for_memsw = +-0.
+
+	case B) the page is accounted as SwapCache and SwapCache is deleted
+		by reuse_swap_page.
+		1. try_charge_swapin() is called.
+			- charge_for_memory += 1.
+			- charge_for_memsw += 1.
+		2. reuse_swap_page -> delete_from_swap_cache() is called.
+			PCG_USED bit is found and cleared.
+			- charge_for_memory -= 1. (swap_cgroup is recorded.)
+		3. commit_charge_swapin() finds PCG_USED bit is not set.
+		4. swap_free(entry) is called and
+			- charge_for_memsw -= 1.
+
+		Finally, charge_for_memory = +-0, charge_for_memsw = +-0.
+
+	case C) the page is not accounted as SwapCache and reuse_swap_page
+		doesn't call delete_from_swap_cache()
+		1. try_charge_swapin() is called.
+			- charge_for_memory += 1.
+			- charge_for_memsw += 1.
+		2. commit_charge_swapin() finds PCG_USED bit is not set
+		   and finds swap_cgroup records this entry.
+			- charge_for_memsw -= 1.
+
+		Finally, charge_for_memory +=1, charge_for_memsw = +-0
+
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2)
  2008-12-13  7:03     ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2) KAMEZAWA Hiroyuki
@ 2008-12-13  9:49       ` Hugh Dickins
  2008-12-13 10:27         ` KAMEZAWA Hiroyuki
  2008-12-13 10:38         ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2) KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-12-13  9:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, linux-kernel, balbir, akpm

On Sat, 13 Dec 2008, KAMEZAWA Hiroyuki wrote:
> --- mmotm-2.6.28-Dec12.orig/mm/memory.c
> +++ mmotm-2.6.28-Dec12/mm/memory.c
>  
> -	mem_cgroup_commit_charge_swapin(page, ptr);
>  	inc_mm_counter(mm, anon_rss);
>  	pte = mk_pte(page, vma->vm_page_prot);
>  	if (write_access && reuse_swap_page(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		write_access = 0;
>  	}
> -
>  	flush_icache_page(vma, page);
>  	set_pte_at(mm, address, page_table, pte);
>  	page_add_anon_rmap(page, vma, address);
> +	/* It's better to call commit-charge after rmap is established */
> +	mem_cgroup_commit_charge_swapin(page, ptr);
>  
>  	swap_free(entry);
>  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))

That ordering is back to how it was before I adjusted it
for reuse_swap_page()'s delete_from_swap_cache(), isn't it?

So I don't understand how you've fixed the bug I hit (not an
accounting imbalance but an oops or BUG, I forget) with this
ordering, without making some other change elsewhere.

mem_cgroup_commit_charge_swapin calls swap_cgroup_record with
bogus swp_entry_t 0, which appears to belong to swp_offset 0 of
swp_type 0, but the ctrl->map for type 0 may have been freed
ages ago (we do always start from 0, but maybe we swapped on
type 1 and swapped off type 0 meanwhile).  I'm guessing that
by looking at the code, not by retesting it, so I may have the
details wrong; but I didn't reorder your code just for fun.

Perhaps your restored ordering works if you check PageSwapCache
in mem_cgroup_commit_charge_swapin or check 0 in swap_cgroup_record,
but I don't see that in yesterday's mmotm, nor in this patch.

(And I should admit, I've not even attempted to follow your
accounting justification: I'll leave that to you memcg guys.)

An alternative could be not to clear page->private when deleting
from swap cache, that's only done for tidiness and to force notice
of races like this; but I'd want a much stronger reason to change that.

Or am I making this up?  As I say, I've not tested it this time around.

Hugh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2)
  2008-12-13  9:49       ` Hugh Dickins
@ 2008-12-13 10:27         ` KAMEZAWA Hiroyuki
  2008-12-15  7:07           ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3) KAMEZAWA Hiroyuki
  2008-12-13 10:38         ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2) KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-13 10:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, linux-kernel,
	balbir, akpm

Hugh Dickins said:
> On Sat, 13 Dec 2008, KAMEZAWA Hiroyuki wrote:
>> --- mmotm-2.6.28-Dec12.orig/mm/memory.c
>> +++ mmotm-2.6.28-Dec12/mm/memory.c
>>
>> -	mem_cgroup_commit_charge_swapin(page, ptr);
>>  	inc_mm_counter(mm, anon_rss);
>>  	pte = mk_pte(page, vma->vm_page_prot);
>>  	if (write_access && reuse_swap_page(page)) {
>>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>  		write_access = 0;
>>  	}
>> -
>>  	flush_icache_page(vma, page);
>>  	set_pte_at(mm, address, page_table, pte);
>>  	page_add_anon_rmap(page, vma, address);
>> +	/* It's better to call commit-charge after rmap is established */
>> +	mem_cgroup_commit_charge_swapin(page, ptr);
>>
>>  	swap_free(entry);
>>  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) ||
>> PageMlocked(page))
>
> That ordering is back to how it was before I adjusted it
> for reuse_swap_page()'s delete_from_swap_cache(), isn't it?
>
> So I don't understand how you've fixed the bug I hit (not an
> accounting imbalance but an oops or BUG, I forget) with this
> ordering, without making some other change elsewhere.
>
Ah, this is for fixing the bug by this order of calls.
==


> mem_cgroup_commit_charge_swapin calls swap_cgroup_record with
> bogus swp_entry_t 0, which appears to belong to swp_offset 0 of
> swp_type 0, but the ctrl->map for type 0 may have been freed
> ages ago (we do always start from 0, but maybe we swapped on
> type 1 and swapped off type 0 meanwhile).  I'm guessing that
> by looking at the code, not by retesting it, so I may have the
> details wrong; but I didn't reorder your code just for fun.
>
Ah, sorry. commit_charge_swapin() should chekc the page is still
SwapCache. Sorry. I'll update this in Monday.

> Perhaps your restored ordering works if you check PageSwapCache
> in mem_cgroup_commit_charge_swapin or check 0 in swap_cgroup_record,
> but I don't see that in yesterday's mmotm, nor in this patch.
>
yes. I'm wrong at that point.
I'll add PageSwapCache check to "commit" ops.

> (And I should admit, I've not even attempted to follow your
> accounting justification: I'll leave that to you memcg guys.)
>
> An alternative could be not to clear page->private when deleting
> from swap cache, that's only done for tidiness and to force notice
> of races like this; but I'd want a much stronger reason to change that.
>
Hmm, doesn't that change will add new unnecessary complex ?

> Or am I making this up?  As I say, I've not tested it this time around.
>
I'll revisit this Monday and think of swp_entry==0 problem.

Thank you for pointing out.

-Kame

> Hugh
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2)
  2008-12-13  9:49       ` Hugh Dickins
  2008-12-13 10:27         ` KAMEZAWA Hiroyuki
@ 2008-12-13 10:38         ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-13 10:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, linux-kernel,
	balbir, akpm

Hugh Dickins said:
> On Sat, 13 Dec 2008, KAMEZAWA Hiroyuki wrote:
>> --- mmotm-2.6.28-Dec12.orig/mm/memory.c
>> +++ mmotm-2.6.28-Dec12/mm/memory.c
>>
>> -	mem_cgroup_commit_charge_swapin(page, ptr);
>>  	inc_mm_counter(mm, anon_rss);
>>  	pte = mk_pte(page, vma->vm_page_prot);
>>  	if (write_access && reuse_swap_page(page)) {
>>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>  		write_access = 0;
>>  	}
>> -
>>  	flush_icache_page(vma, page);
>>  	set_pte_at(mm, address, page_table, pte);
>>  	page_add_anon_rmap(page, vma, address);
>> +	/* It's better to call commit-charge after rmap is established */
>> +	mem_cgroup_commit_charge_swapin(page, ptr);
>>
>>  	swap_free(entry);
>>  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) ||
>> PageMlocked(page))
>
> That ordering is back to how it was before I adjusted it
> for reuse_swap_page()'s delete_from_swap_cache(), isn't it?
>
> So I don't understand how you've fixed the bug I hit (not an
> accounting imbalance but an oops or BUG, I forget) with this
> ordering, without making some other change elsewhere.
>
Ah, this is a fix for the new bug by this order.
==
    try_charge()
    commit_charge()
    reuse_swap_page()
         -> delete_from_swapcache() -> uncharge_swapcache().
    increase mapcount here.
==
Because ucharge_swapcache() assumes following
  a. if mapcount==0, this swap cache is of no use and will be discarded.
  b. if mapcount >0, this swap cache is in use.
A charge commited by commit_charge() is discarded by reuse_swap_page().

By delaying commit (means checking flag of page_cgroup).
==
  try_charge()
  reuse_swap_page()
  commit_charge()
==
the leak of charge doesn't happen.
(reuse_swap_page() may drop page from swap-cache, but it's no probelm to
 commit. But as you say, this has swp_entry==0 bug.)

> mem_cgroup_commit_charge_swapin calls swap_cgroup_record with
> bogus swp_entry_t 0, which appears to belong to swp_offset 0 of
> swp_type 0, but the ctrl->map for type 0 may have been freed
> ages ago (we do always start from 0, but maybe we swapped on
> type 1 and swapped off type 0 meanwhile).  I'm guessing that
> by looking at the code, not by retesting it, so I may have the
> details wrong; but I didn't reorder your code just for fun.
>
> Perhaps your restored ordering works if you check PageSwapCache
> in mem_cgroup_commit_charge_swapin or check 0 in swap_cgroup_record,
> but I don't see that in yesterday's mmotm, nor in this patch.
>
Ahhhh, sorry. ok, swp_entry==0 is valid...Sigh...
I'll revisit this and check how commit_charge() works.
I think checking PageSwapCache() is enough but if not, do somehing other.
(Maybe Nishimura's suggestion to pass swp_entry directly to commit_charge()
 is one way.)

> (And I should admit, I've not even attempted to follow your
> accounting justification: I'll leave that to you memcg guys.)
>
Sorry for complication ;(

> An alternative could be not to clear page->private when deleting
> from swap cache, that's only done for tidiness and to force notice
> of races like this; but I'd want a much stronger reason to change that.
>
It seems that it  will add another complex or unexpected behavior..
I think I can do something workaround.

> Or am I making this up?  As I say, I've not tested it this time around.
>
I'll ask you if I found I can't do anything ;(

Thank you for pointing out!
I'll revisit this on Monday.

-Kame

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3)
  2008-12-13 10:27         ` KAMEZAWA Hiroyuki
@ 2008-12-15  7:07           ` KAMEZAWA Hiroyuki
  2008-12-15  8:37             ` Balbir Singh
                               ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-15  7:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, Daisuke Nishimura, linux-mm, linux-kernel, balbir, akpm


From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Fix swapin charge operation of memcg.

Now, memcg has hooks to swap-out operation and checks SwapCache is really
unused or not. That check depends on contents of struct page.
I.e. If PageAnon(page) && page_mapped(page), the page is recoginized as
still-in-use.

Now, reuse_swap_page() calles delete_from_swap_cache() before establishment
of any rmap. Then, in followinig sequence

	(Page fault with WRITE)
	try_charge() (charge += PAGESIZE)
	commit_charge() (Check page_cgroup is used or not..)
	reuse_swap_page()
		-> delete_from_swapcache()
			-> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
	......
New charge is uncharged soon....
To avoid this,  move commit_charge() after page_mapcount() goes up to 1.
By this,

	try_charge()		(usage += PAGESIZE)
	reuse_swap_page()	(may usage -= PAGESIZE if PCG_USED is set)
	commit_charge()		(If page_cgroup is not marked as PCG_USED,
				 add new charge.)
Accounting will be correct.

Changelog (v2) -> (v3)
  - fixed invalid charge to swp_entry==0.
  - updated documentation.
Changelog (v1) -> (v2)
  - fixed comment.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/controllers/memcg_test.txt |   41 +++++++++++++++++++++++++++----
 mm/memcontrol.c                          |    7 +++--
 mm/memory.c                              |   11 ++++----
 3 files changed, 46 insertions(+), 13 deletions(-)

Index: mmotm-2.6.28-Dec12/mm/memory.c
===================================================================
--- mmotm-2.6.28-Dec12.orig/mm/memory.c
+++ mmotm-2.6.28-Dec12/mm/memory.c
@@ -2428,22 +2428,23 @@ static int do_swap_page(struct mm_struct
 	 * while the page is counted on swap but not yet in mapcount i.e.
 	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
 	 * must be called after the swap_free(), or it will never succeed.
-	 * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
-	 * in page->private, must be called before reuse_swap_page(),
-	 * which may delete_from_swap_cache().
+	 * Because delete_from_swap_page() may be called by reuse_swap_page(),
+	 * mem_cgroup_commit_charge_swapin() may not be able to find swp_entry
+	 * in page->private. In this case, a record in swap_cgroup  is silently
+	 * discarded at swap_free().
 	 */
 
-	mem_cgroup_commit_charge_swapin(page, ptr);
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
-
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	/* It's better to call commit-charge after rmap is established */
+	mem_cgroup_commit_charge_swapin(page, ptr);
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
Index: mmotm-2.6.28-Dec12/Documentation/controllers/memcg_test.txt
===================================================================
--- mmotm-2.6.28-Dec12.orig/Documentation/controllers/memcg_test.txt
+++ mmotm-2.6.28-Dec12/Documentation/controllers/memcg_test.txt
@@ -1,6 +1,6 @@
 Memory Resource Controller(Memcg)  Implementation Memo.
-Last Updated: 2008/12/10
-Base Kernel Version: based on 2.6.28-rc7-mm.
+Last Updated: 2008/12/15
+Base Kernel Version: based on 2.6.28-rc8-mm.
 
 Because VM is getting complex (one of reasons is memcg...), memcg's behavior
 is complex. This is a document for memcg's internal behavior.
@@ -111,9 +111,40 @@ Under below explanation, we assume CONFI
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.
 
-	In case (a), we charge it. In case (b), we don't charge it.
-	(But racy state between (a) and (b) exists. We do check it.)
-	At charging, a charge recorded in swap_cgroup is moved to page_cgroup.
+	This swap-in is one of the most complicated work. In do_swap_page(),
+	following events occur when pte is unchanged.
+
+	(1) the page (SwapCache) is looked up.
+	(2) lock_page()
+	(3) try_charge_swapin()
+	(4) reuse_swap_page() (may call delete_swap_cache())
+	(5) commit_charge_swapin()
+	(6) swap_free().
+
+	Considering following situation for example.
+
+	(A) The page has not been charged before (2) and reuse_swap_page()
+	    doesn't call delete_from_swap_cache().
+	(B) The page has not been charged before (2) and reuse_swap_page()
+	    calls delete_from_swap_cache().
+	(C) The page has been charged before (2) and reuse_swap_page() doesn't
+	    call delete_from_swap_cache().
+	(D) The page has been charged before (2) and reuse_swap_page() calls
+	    delete_from_swap_cache().
+
+	    memory.usage/memsw.usage changes to this page/swp_entry will be
+	 Case          (A)      (B)       (C)     (D)
+         Event
+       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
+          ===========================================
+          (3)        +1/+1    +1/+1     +1/+1   +1/+1
+          (4)          -       0/ 0       -     -1/ 0
+          (5)         0/ 1     0/-1     -1/-1    0/ 0
+          (6)          -        -         -      0/-1
+          ===========================================
+       Result         1/ 1     1/1       1/ 1    1/ 1
+
+       In any cases, charges to this page should be 1/ 1.
 
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
Index: mmotm-2.6.28-Dec12/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec12.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec12/mm/memcontrol.c
@@ -1139,10 +1139,11 @@ void mem_cgroup_commit_charge_swapin(str
 	/*
 	 * Now swap is on-memory. This means this page may be
 	 * counted both as mem and swap....double count.
-	 * Fix it by uncharging from memsw. This SwapCache is stable
-	 * because we're still under lock_page().
+	 * Fix it by uncharging from memsw. Basically, this SwapCache is stable
+	 * under lock_page(). But in do_swap_page()::memory.c, reuse_swap_page()
+	 * may call delete_from_swap_cache() before reach here.
 	 */
-	if (do_swap_account) {
+	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t ent = {.val = page_private(page)};
 		struct mem_cgroup *memcg;
 		memcg = swap_cgroup_record(ent, NULL);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3)
  2008-12-15  7:07           ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3) KAMEZAWA Hiroyuki
@ 2008-12-15  8:37             ` Balbir Singh
  2008-12-15  8:40               ` KAMEZAWA Hiroyuki
  2008-12-15 10:34             ` Hugh Dickins
  2008-12-16  4:02             ` [PATCH mmotm] memcg: fix for documentation (Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3)) Daisuke Nishimura
  2 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2008-12-15  8:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, Daisuke Nishimura, linux-mm, linux-kernel, akpm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-15 16:07:51]:

> 
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Fix swapin charge operation of memcg.
> 
> Now, memcg has hooks to swap-out operation and checks SwapCache is really
> unused or not. That check depends on contents of struct page.
> I.e. If PageAnon(page) && page_mapped(page), the page is recoginized as
> still-in-use.
> 
> Now, reuse_swap_page() calles delete_from_swap_cache() before establishment
> of any rmap. Then, in followinig sequence
> 
> 	(Page fault with WRITE)
> 	try_charge() (charge += PAGESIZE)
> 	commit_charge() (Check page_cgroup is used or not..)
> 	reuse_swap_page()
> 		-> delete_from_swapcache()
> 			-> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
> 	......
> New charge is uncharged soon....
> To avoid this,  move commit_charge() after page_mapcount() goes up to 1.
> By this,
> 
> 	try_charge()		(usage += PAGESIZE)
> 	reuse_swap_page()	(may usage -= PAGESIZE if PCG_USED is set)
> 	commit_charge()		(If page_cgroup is not marked as PCG_USED,
> 				 add new charge.)
> Accounting will be correct.
> 
> Changelog (v2) -> (v3)
>   - fixed invalid charge to swp_entry==0.
>   - updated documentation.
> Changelog (v1) -> (v2)
>   - fixed comment.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  Documentation/controllers/memcg_test.txt |   41 +++++++++++++++++++++++++++----
>  mm/memcontrol.c                          |    7 +++--
>  mm/memory.c                              |   11 ++++----
>  3 files changed, 46 insertions(+), 13 deletions(-)
> 
> Index: mmotm-2.6.28-Dec12/mm/memory.c
> ===================================================================
> --- mmotm-2.6.28-Dec12.orig/mm/memory.c
> +++ mmotm-2.6.28-Dec12/mm/memory.c
> @@ -2428,22 +2428,23 @@ static int do_swap_page(struct mm_struct
>  	 * while the page is counted on swap but not yet in mapcount i.e.
>  	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
>  	 * must be called after the swap_free(), or it will never succeed.
> -	 * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
> -	 * in page->private, must be called before reuse_swap_page(),
> -	 * which may delete_from_swap_cache().
> +	 * Because delete_from_swap_page() may be called by reuse_swap_page(),
> +	 * mem_cgroup_commit_charge_swapin() may not be able to find swp_entry
> +	 * in page->private. In this case, a record in swap_cgroup  is silently
> +	 * discarded at swap_free().
>  	 */
> 
> -	mem_cgroup_commit_charge_swapin(page, ptr);
>  	inc_mm_counter(mm, anon_rss);
>  	pte = mk_pte(page, vma->vm_page_prot);
>  	if (write_access && reuse_swap_page(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		write_access = 0;
>  	}
> -

Removal of unassociated lines, not sure if that is a good practice.

>  	flush_icache_page(vma, page);
>  	set_pte_at(mm, address, page_table, pte);
>  	page_add_anon_rmap(page, vma, address);
> +	/* It's better to call commit-charge after rmap is established */
> +	mem_cgroup_commit_charge_swapin(page, ptr);
> 

Yes, it does make sense

>  	swap_free(entry);
>  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> Index: mmotm-2.6.28-Dec12/Documentation/controllers/memcg_test.txt
> ===================================================================
> --- mmotm-2.6.28-Dec12.orig/Documentation/controllers/memcg_test.txt
> +++ mmotm-2.6.28-Dec12/Documentation/controllers/memcg_test.txt
> @@ -1,6 +1,6 @@
>  Memory Resource Controller(Memcg)  Implementation Memo.
> -Last Updated: 2008/12/10
> -Base Kernel Version: based on 2.6.28-rc7-mm.
> +Last Updated: 2008/12/15
> +Base Kernel Version: based on 2.6.28-rc8-mm.
> 
>  Because VM is getting complex (one of reasons is memcg...), memcg's behavior
>  is complex. This is a document for memcg's internal behavior.
> @@ -111,9 +111,40 @@ Under below explanation, we assume CONFI
>  	(b) If the SwapCache has been mapped by processes, it has been
>  	    charged already.
> 
> -	In case (a), we charge it. In case (b), we don't charge it.
> -	(But racy state between (a) and (b) exists. We do check it.)
> -	At charging, a charge recorded in swap_cgroup is moved to page_cgroup.
> +	This swap-in is one of the most complicated work. In do_swap_page(),
> +	following events occur when pte is unchanged.
> +
> +	(1) the page (SwapCache) is looked up.
> +	(2) lock_page()
> +	(3) try_charge_swapin()
> +	(4) reuse_swap_page() (may call delete_swap_cache())
> +	(5) commit_charge_swapin()
> +	(6) swap_free().
> +
> +	Considering following situation for example.
> +
> +	(A) The page has not been charged before (2) and reuse_swap_page()
> +	    doesn't call delete_from_swap_cache().
> +	(B) The page has not been charged before (2) and reuse_swap_page()
> +	    calls delete_from_swap_cache().
> +	(C) The page has been charged before (2) and reuse_swap_page() doesn't
> +	    call delete_from_swap_cache().
> +	(D) The page has been charged before (2) and reuse_swap_page() calls
> +	    delete_from_swap_cache().
> +
> +	    memory.usage/memsw.usage changes to this page/swp_entry will be
> +	 Case          (A)      (B)       (C)     (D)
> +         Event
> +       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
> +          ===========================================
> +          (3)        +1/+1    +1/+1     +1/+1   +1/+1
> +          (4)          -       0/ 0       -     -1/ 0
> +          (5)         0/ 1     0/-1     -1/-1    0/ 0
> +          (6)          -        -         -      0/-1
> +          ===========================================
> +       Result         1/ 1     1/1       1/ 1    1/ 1
> +
> +       In any cases, charges to this page should be 1/ 1.
> 

The documentation patch failed to apply for me

>  	4.2 Swap-out.
>  	At swap-out, typical state transition is below.
> Index: mmotm-2.6.28-Dec12/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-Dec12.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-Dec12/mm/memcontrol.c
> @@ -1139,10 +1139,11 @@ void mem_cgroup_commit_charge_swapin(str
>  	/*
>  	 * Now swap is on-memory. This means this page may be
>  	 * counted both as mem and swap....double count.
> -	 * Fix it by uncharging from memsw. This SwapCache is stable
> -	 * because we're still under lock_page().
> +	 * Fix it by uncharging from memsw. Basically, this SwapCache is stable
> +	 * under lock_page(). But in do_swap_page()::memory.c, reuse_swap_page()
> +	 * may call delete_from_swap_cache() before reach here.
>  	 */
> -	if (do_swap_account) {
> +	if (do_swap_account && PageSwapCache(page)) {
>  		swp_entry_t ent = {.val = page_private(page)};
>  		struct mem_cgroup *memcg;
>  		memcg = swap_cgroup_record(ent, NULL);
> 
>


Looks good to me

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> 
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> 

-- 
	Balbir

----- End forwarded message -----
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3)
  2008-12-15  8:37             ` Balbir Singh
@ 2008-12-15  8:40               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-15  8:40 UTC (permalink / raw)
  To: balbir; +Cc: Hugh Dickins, Daisuke Nishimura, linux-mm, linux-kernel, akpm

On Mon, 15 Dec 2008 14:07:26 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-15 16:07:51]:
> 
> > 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Fix swapin charge operation of memcg.
> > 
> > Now, memcg has hooks to swap-out operation and checks SwapCache is really
> > unused or not. That check depends on contents of struct page.
> > I.e. If PageAnon(page) && page_mapped(page), the page is recoginized as
> > still-in-use.
> > 
> > Now, reuse_swap_page() calles delete_from_swap_cache() before establishment
> > of any rmap. Then, in followinig sequence
> > 
> > 	(Page fault with WRITE)
> > 	try_charge() (charge += PAGESIZE)
> > 	commit_charge() (Check page_cgroup is used or not..)
> > 	reuse_swap_page()
> > 		-> delete_from_swapcache()
> > 			-> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
> > 	......
> > New charge is uncharged soon....
> > To avoid this,  move commit_charge() after page_mapcount() goes up to 1.
> > By this,
> > 
> > 	try_charge()		(usage += PAGESIZE)
> > 	reuse_swap_page()	(may usage -= PAGESIZE if PCG_USED is set)
> > 	commit_charge()		(If page_cgroup is not marked as PCG_USED,
> > 				 add new charge.)
> > Accounting will be correct.
> > 
> > Changelog (v2) -> (v3)
> >   - fixed invalid charge to swp_entry==0.
> >   - updated documentation.
> > Changelog (v1) -> (v2)
> >   - fixed comment.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  Documentation/controllers/memcg_test.txt |   41 +++++++++++++++++++++++++++----
> >  mm/memcontrol.c                          |    7 +++--
> >  mm/memory.c                              |   11 ++++----
> >  3 files changed, 46 insertions(+), 13 deletions(-)
> > 
> > Index: mmotm-2.6.28-Dec12/mm/memory.c
> > ===================================================================
> > --- mmotm-2.6.28-Dec12.orig/mm/memory.c
> > +++ mmotm-2.6.28-Dec12/mm/memory.c
> > @@ -2428,22 +2428,23 @@ static int do_swap_page(struct mm_struct
> >  	 * while the page is counted on swap but not yet in mapcount i.e.
> >  	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
> >  	 * must be called after the swap_free(), or it will never succeed.
> > -	 * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
> > -	 * in page->private, must be called before reuse_swap_page(),
> > -	 * which may delete_from_swap_cache().
> > +	 * Because delete_from_swap_page() may be called by reuse_swap_page(),
> > +	 * mem_cgroup_commit_charge_swapin() may not be able to find swp_entry
> > +	 * in page->private. In this case, a record in swap_cgroup  is silently
> > +	 * discarded at swap_free().
> >  	 */
> > 
> > -	mem_cgroup_commit_charge_swapin(page, ptr);
> >  	inc_mm_counter(mm, anon_rss);
> >  	pte = mk_pte(page, vma->vm_page_prot);
> >  	if (write_access && reuse_swap_page(page)) {
> >  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >  		write_access = 0;
> >  	}
> > -
> 
> Removal of unassociated lines, not sure if that is a good practice.
> 
my mistake...

> >  	flush_icache_page(vma, page);
> >  	set_pte_at(mm, address, page_table, pte);
> >  	page_add_anon_rmap(page, vma, address);
> > +	/* It's better to call commit-charge after rmap is established */
> > +	mem_cgroup_commit_charge_swapin(page, ptr);
> > 
> 
> Yes, it does make sense
> 
> >  	swap_free(entry);
> >  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> > Index: mmotm-2.6.28-Dec12/Documentation/controllers/memcg_test.txt
> > ===================================================================
> > --- mmotm-2.6.28-Dec12.orig/Documentation/controllers/memcg_test.txt
> > +++ mmotm-2.6.28-Dec12/Documentation/controllers/memcg_test.txt
> > @@ -1,6 +1,6 @@
> >  Memory Resource Controller(Memcg)  Implementation Memo.
> > -Last Updated: 2008/12/10
> > -Base Kernel Version: based on 2.6.28-rc7-mm.
> > +Last Updated: 2008/12/15
> > +Base Kernel Version: based on 2.6.28-rc8-mm.
> > 
> >  Because VM is getting complex (one of reasons is memcg...), memcg's behavior
> >  is complex. This is a document for memcg's internal behavior.
> > @@ -111,9 +111,40 @@ Under below explanation, we assume CONFI
> >  	(b) If the SwapCache has been mapped by processes, it has been
> >  	    charged already.
> > 
> > -	In case (a), we charge it. In case (b), we don't charge it.
> > -	(But racy state between (a) and (b) exists. We do check it.)
> > -	At charging, a charge recorded in swap_cgroup is moved to page_cgroup.
> > +	This swap-in is one of the most complicated work. In do_swap_page(),
> > +	following events occur when pte is unchanged.
> > +
> > +	(1) the page (SwapCache) is looked up.
> > +	(2) lock_page()
> > +	(3) try_charge_swapin()
> > +	(4) reuse_swap_page() (may call delete_swap_cache())
> > +	(5) commit_charge_swapin()
> > +	(6) swap_free().
> > +
> > +	Considering following situation for example.
> > +
> > +	(A) The page has not been charged before (2) and reuse_swap_page()
> > +	    doesn't call delete_from_swap_cache().
> > +	(B) The page has not been charged before (2) and reuse_swap_page()
> > +	    calls delete_from_swap_cache().
> > +	(C) The page has been charged before (2) and reuse_swap_page() doesn't
> > +	    call delete_from_swap_cache().
> > +	(D) The page has been charged before (2) and reuse_swap_page() calls
> > +	    delete_from_swap_cache().
> > +
> > +	    memory.usage/memsw.usage changes to this page/swp_entry will be
> > +	 Case          (A)      (B)       (C)     (D)
> > +         Event
> > +       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
> > +          ===========================================
> > +          (3)        +1/+1    +1/+1     +1/+1   +1/+1
> > +          (4)          -       0/ 0       -     -1/ 0
> > +          (5)         0/ 1     0/-1     -1/-1    0/ 0
> > +          (6)          -        -         -      0/-1
> > +          ===========================================
> > +       Result         1/ 1     1/1       1/ 1    1/ 1
> > +
> > +       In any cases, charges to this page should be 1/ 1.
> > 
> 
> The documentation patch failed to apply for me
> 
Hmm... I'll check my queue again.

Thanks,
-Kame

> >  	4.2 Swap-out.
> >  	At swap-out, typical state transition is below.
> > Index: mmotm-2.6.28-Dec12/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.28-Dec12.orig/mm/memcontrol.c
> > +++ mmotm-2.6.28-Dec12/mm/memcontrol.c
> > @@ -1139,10 +1139,11 @@ void mem_cgroup_commit_charge_swapin(str
> >  	/*
> >  	 * Now swap is on-memory. This means this page may be
> >  	 * counted both as mem and swap....double count.
> > -	 * Fix it by uncharging from memsw. This SwapCache is stable
> > -	 * because we're still under lock_page().
> > +	 * Fix it by uncharging from memsw. Basically, this SwapCache is stable
> > +	 * under lock_page(). But in do_swap_page()::memory.c, reuse_swap_page()
> > +	 * may call delete_from_swap_cache() before reach here.
> >  	 */
> > -	if (do_swap_account) {
> > +	if (do_swap_account && PageSwapCache(page)) {
> >  		swp_entry_t ent = {.val = page_private(page)};
> >  		struct mem_cgroup *memcg;
> >  		memcg = swap_cgroup_record(ent, NULL);
> > 
> >
> 
> 
> Looks good to me
> 
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> 
> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> 
> 
> -- 
> 	Balbir
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3)
  2008-12-15  7:07           ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3) KAMEZAWA Hiroyuki
  2008-12-15  8:37             ` Balbir Singh
@ 2008-12-15 10:34             ` Hugh Dickins
  2008-12-16  4:02             ` [PATCH mmotm] memcg: fix for documentation (Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3)) Daisuke Nishimura
  2 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-12-15 10:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, linux-kernel, balbir, akpm

On Mon, 15 Dec 2008, KAMEZAWA Hiroyuki wrote:
> 
> Fix swapin charge operation of memcg.
> 
> @@ -1139,10 +1139,11 @@ void mem_cgroup_commit_charge_swapin(str
>  	/*
>  	 * Now swap is on-memory. This means this page may be
>  	 * counted both as mem and swap....double count.
> -	 * Fix it by uncharging from memsw. This SwapCache is stable
> -	 * because we're still under lock_page().
> +	 * Fix it by uncharging from memsw. Basically, this SwapCache is stable
> +	 * under lock_page(). But in do_swap_page()::memory.c, reuse_swap_page()
> +	 * may call delete_from_swap_cache() before reach here.
>  	 */
> -	if (do_swap_account) {
> +	if (do_swap_account && PageSwapCache(page)) {
>  		swp_entry_t ent = {.val = page_private(page)};
>  		struct mem_cgroup *memcg;
>  		memcg = swap_cgroup_record(ent, NULL);

Yes, that addition looks good to me.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH mmotm] memcg: fix for documentation (Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3))
  2008-12-15  7:07           ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3) KAMEZAWA Hiroyuki
  2008-12-15  8:37             ` Balbir Singh
  2008-12-15 10:34             ` Hugh Dickins
@ 2008-12-16  4:02             ` Daisuke Nishimura
  2008-12-16  4:53               ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2008-12-16  4:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, linux-mm, linux-kernel, balbir, akpm, nishimura

Sorry for late reply.

> @@ -111,9 +111,40 @@ Under below explanation, we assume CONFI
>  	(b) If the SwapCache has been mapped by processes, it has been
>  	    charged already.
>  
> -	In case (a), we charge it. In case (b), we don't charge it.
> -	(But racy state between (a) and (b) exists. We do check it.)
> -	At charging, a charge recorded in swap_cgroup is moved to page_cgroup.
> +	This swap-in is one of the most complicated work. In do_swap_page(),
> +	following events occur when pte is unchanged.
> +
> +	(1) the page (SwapCache) is looked up.
> +	(2) lock_page()
> +	(3) try_charge_swapin()
> +	(4) reuse_swap_page() (may call delete_swap_cache())
> +	(5) commit_charge_swapin()
> +	(6) swap_free().
> +
> +	Considering following situation for example.
> +
> +	(A) The page has not been charged before (2) and reuse_swap_page()
> +	    doesn't call delete_from_swap_cache().
> +	(B) The page has not been charged before (2) and reuse_swap_page()
> +	    calls delete_from_swap_cache().
> +	(C) The page has been charged before (2) and reuse_swap_page() doesn't
> +	    call delete_from_swap_cache().
> +	(D) The page has been charged before (2) and reuse_swap_page() calls
> +	    delete_from_swap_cache().
> +
> +	    memory.usage/memsw.usage changes to this page/swp_entry will be
> +	 Case          (A)      (B)       (C)     (D)
> +         Event
> +       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
> +          ===========================================
> +          (3)        +1/+1    +1/+1     +1/+1   +1/+1
> +          (4)          -       0/ 0       -     -1/ 0
> +          (5)         0/ 1     0/-1     -1/-1    0/ 0
> +          (6)          -        -         -      0/-1
> +          ===========================================
> +       Result         1/ 1     1/1       1/ 1    1/ 1
> +
> +       In any cases, charges to this page should be 1/ 1.
>  
I've verified that charges will result in valid values by tracing source code
in all of these cases, but in case of (B) I don't think commit_charge_swapin
does memsw-- because PageSwapCache has been cleared already. swap_free does
memsw-- in this case.

I attached a fix patch.

Thanks,
Daisuke Nishimura.

===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

fix for documentation.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 Documentation/controllers/memcg_test.txt |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/controllers/memcg_test.txt b/Documentation/controllers/memcg_test.txt
index 3c1458a..08d4d3e 100644
--- a/Documentation/controllers/memcg_test.txt
+++ b/Documentation/controllers/memcg_test.txt
@@ -139,10 +139,10 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
           ===========================================
           (3)        +1/+1    +1/+1     +1/+1   +1/+1
           (4)          -       0/ 0       -     -1/ 0
-          (5)         0/ 1     0/-1     -1/-1    0/ 0
-          (6)          -        -         -      0/-1
+          (5)         0/-1     0/ 0     -1/-1    0/ 0
+          (6)          -       0/-1       -      0/-1
           ===========================================
-       Result         1/ 1     1/1       1/ 1    1/ 1
+       Result         1/ 1     1/ 1      1/ 1    1/ 1
 
        In any cases, charges to this page should be 1/ 1.
 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mmotm] memcg: fix for documentation (Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3))
  2008-12-16  4:02             ` [PATCH mmotm] memcg: fix for documentation (Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3)) Daisuke Nishimura
@ 2008-12-16  4:53               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16  4:53 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Hugh Dickins, linux-mm, linux-kernel, balbir, akpm

On Tue, 16 Dec 2008 13:02:30 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Sorry for late reply.
> 
> > @@ -111,9 +111,40 @@ Under below explanation, we assume CONFI
> >  	(b) If the SwapCache has been mapped by processes, it has been
> >  	    charged already.
> >  
> > -	In case (a), we charge it. In case (b), we don't charge it.
> > -	(But racy state between (a) and (b) exists. We do check it.)
> > -	At charging, a charge recorded in swap_cgroup is moved to page_cgroup.
> > +	This swap-in is one of the most complicated work. In do_swap_page(),
> > +	following events occur when pte is unchanged.
> > +
> > +	(1) the page (SwapCache) is looked up.
> > +	(2) lock_page()
> > +	(3) try_charge_swapin()
> > +	(4) reuse_swap_page() (may call delete_swap_cache())
> > +	(5) commit_charge_swapin()
> > +	(6) swap_free().
> > +
> > +	Considering following situation for example.
> > +
> > +	(A) The page has not been charged before (2) and reuse_swap_page()
> > +	    doesn't call delete_from_swap_cache().
> > +	(B) The page has not been charged before (2) and reuse_swap_page()
> > +	    calls delete_from_swap_cache().
> > +	(C) The page has been charged before (2) and reuse_swap_page() doesn't
> > +	    call delete_from_swap_cache().
> > +	(D) The page has been charged before (2) and reuse_swap_page() calls
> > +	    delete_from_swap_cache().
> > +
> > +	    memory.usage/memsw.usage changes to this page/swp_entry will be
> > +	 Case          (A)      (B)       (C)     (D)
> > +         Event
> > +       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
> > +          ===========================================
> > +          (3)        +1/+1    +1/+1     +1/+1   +1/+1
> > +          (4)          -       0/ 0       -     -1/ 0
> > +          (5)         0/ 1     0/-1     -1/-1    0/ 0
> > +          (6)          -        -         -      0/-1
> > +          ===========================================
> > +       Result         1/ 1     1/1       1/ 1    1/ 1
> > +
> > +       In any cases, charges to this page should be 1/ 1.
> >  
> I've verified that charges will result in valid values by tracing source code
> in all of these cases, but in case of (B) I don't think commit_charge_swapin
> does memsw-- because PageSwapCache has been cleared already. swap_free does
> memsw-- in this case.
> 
> I attached a fix patch.
> 
you're right. it comes from this version's new change....sorry.

Acked-by; KAMEZAWA Hiroyuki <kamezaw.hiroyu@jp.fujitsu.com>


> Thanks,
> Daisuke Nishimura.
> 
> ===
> From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> fix for documentation.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  Documentation/controllers/memcg_test.txt |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/controllers/memcg_test.txt b/Documentation/controllers/memcg_test.txt
> index 3c1458a..08d4d3e 100644
> --- a/Documentation/controllers/memcg_test.txt
> +++ b/Documentation/controllers/memcg_test.txt
> @@ -139,10 +139,10 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
>            ===========================================
>            (3)        +1/+1    +1/+1     +1/+1   +1/+1
>            (4)          -       0/ 0       -     -1/ 0
> -          (5)         0/ 1     0/-1     -1/-1    0/ 0
> -          (6)          -        -         -      0/-1
> +          (5)         0/-1     0/ 0     -1/-1    0/ 0
> +          (6)          -       0/-1       -      0/-1
>            ===========================================
> -       Result         1/ 1     1/1       1/ 1    1/ 1
> +       Result         1/ 1     1/ 1      1/ 1    1/ 1
>  
>         In any cases, charges to this page should be 1/ 1.
>  
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-12-16  4:52 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-12  8:29 [BUGFIX][PATCH mmotm] memcg fix swap accounting leak KAMEZAWA Hiroyuki
2008-12-12  9:43 ` Daisuke Nishimura
2008-12-12 11:16   ` KAMEZAWA Hiroyuki
2008-12-13  7:03     ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2) KAMEZAWA Hiroyuki
2008-12-13  9:49       ` Hugh Dickins
2008-12-13 10:27         ` KAMEZAWA Hiroyuki
2008-12-15  7:07           ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3) KAMEZAWA Hiroyuki
2008-12-15  8:37             ` Balbir Singh
2008-12-15  8:40               ` KAMEZAWA Hiroyuki
2008-12-15 10:34             ` Hugh Dickins
2008-12-16  4:02             ` [PATCH mmotm] memcg: fix for documentation (Re: [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v3)) Daisuke Nishimura
2008-12-16  4:53               ` KAMEZAWA Hiroyuki
2008-12-13 10:38         ` [BUGFIX][PATCH mmotm] memcg fix swap accounting leak (v2) KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox