* [RFC][PATCH 0/2] memcg: some updates to move_account and file_stat races
@ 2010-10-15 8:06 KAMEZAWA Hiroyuki
2010-10-15 8:11 ` [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge KAMEZAWA Hiroyuki
2010-10-15 8:12 ` [RFC][PATCH 2/2] memcg: new lock for mutual execution of account_move and file stats KAMEZAWA Hiroyuki
0 siblings, 2 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-15 8:06 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, nishimura, balbir, Greg Thelen, akpm
While implementing dirty page accounting support, one problem is that the
PG_writeback update can be called from IRQ context. So, the following kind of deadlock
can occur:
    lock_page_cgroup()
        <--------------- IRQ
        try to update Writeback state of memcg
            lock_page_cgroup()
            DEAD LOCK
To avoid this, one idea is to disable IRQs in lock_page_cgroup(), but our concern is
that it's too heavy.
Considering this further, there is one key fact:
- the only reason update_file_stat() has to take lock_page_cgroup() is to avoid a
  race with move_account(). There is no race with charge/uncharge.
So, this series adds a new lock for mutual execution of move_account() and
update_file_stat(). This lock is always taken with IRQs disabled.
This adds a new lock to move_account()... so it makes move_account() slower.
It's a trade-off to be considered.
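For comparison, the rejected alternative would look roughly like the sketch below
(illustrative only; these helpers are not part of the series):

/*
 * Illustrative sketch of the rejected "big hammer" -- not part of this
 * series.  Making lock_page_cgroup() itself IRQ-safe closes the race,
 * but taxes every charge/uncharge path as well.
 */
static inline void lock_page_cgroup_irqsave(struct page_cgroup *pc,
					    unsigned long *flags)
{
	local_irq_save(*flags);
	bit_spin_lock(PCG_LOCK, &pc->flags);
}

static inline void unlock_page_cgroup_irqrestore(struct page_cgroup *pc,
						 unsigned long *flags)
{
	bit_spin_unlock(PCG_LOCK, &pc->flags);
	local_irq_restore(*flags);
}

With that approach every lock_page_cgroup() user pays the IRQ-disable cost, which is
why patch 2 introduces a separate, narrower lock instead.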
This series contains 2 patches. The first is an attempt at a performance improvement;
the second adds the new lock.
They are independent of each other.
All are on top of mmotm-1014 with memcg-reduce-lock-hold-time-during-charge-moving.patch removed.
Scores on my box when moving an 8GB anon process:
== mmotm ==
[root@bluextal kamezawa]# time echo 2530 > /cgroup/B/tasks
real 0m0.792s
user 0m0.000s
sys 0m0.780s
== After patch 1 ==
[root@bluextal kamezawa]# time echo 2257 > /cgroup/B/tasks
real 0m0.694s
user 0m0.000s
sys 0m0.683s
== After patch 2 ==
[root@bluextal kamezawa]# time echo 2238 > /cgroup/B/tasks
real 0m0.741s
user 0m0.000s
sys 0m0.730s
Any comments/advice are welcome.
Thanks,
-Kame
* [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge
2010-10-15 8:06 [RFC][PATCH 0/2] memcg: some updates to move_account and file_stat races KAMEZAWA Hiroyuki
@ 2010-10-15 8:11 ` KAMEZAWA Hiroyuki
2010-10-15 17:20 ` Christoph Lameter
2010-10-18 4:29 ` Daisuke Nishimura
2010-10-15 8:12 ` [RFC][PATCH 2/2] memcg: new lock for mutual execution of account_move and file stats KAMEZAWA Hiroyuki
1 sibling, 2 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-15 8:11 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-kernel, linux-mm, nishimura, balbir, Greg Thelen, akpm,
Mel Gorman, kosaki.motohiro, Christoph Lameter
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC'ed to Mel, Christoph, and KOSAKI because I wanted to be double-checked
in case I'm missing something important about isolate_lru_page().
Checking the performance of memcg's move_account with perf, you may notice
that put_page() costs too much.
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. .....................................
#
14.24% echo [kernel.kallsyms] [k] put_page
12.80% echo [kernel.kallsyms] [k] isolate_lru_page
9.67% echo [kernel.kallsyms] [k] is_target_pte_for_mc
8.11% echo [kernel.kallsyms] [k] ____pagevec_lru_add
7.22% echo [kernel.kallsyms] [k] putback_lru_page
This is because mc_handle_present_pte() does get_page(). Then,
page->count is updated 4 times:
get_page_unless_zero() #1
isolate_lru_page()
putback_lru_page()
put_page()
But all of the above are called under pte_offset_map_lock().
get_page_unless_zero() (#1) is not necessary because we do everything under
pte_offset_map_lock().
isolate_lru_page()'s comment says
# Restrictions:
# (1) Must be called with an elevated refcount on the page. This is a
# fundamentnal difference from isolate_lru_pages (which is called
# without a stable reference).
So, the current implementation does get_page_unless_zero() explicitly, but
holding the pte lock implies a stable reference. I removed #1.
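A minimal sketch of the resulting per-pte pattern for the mapped case (illustrative
only and simplified; the real change is in the diff below):

/* Illustrative sketch only -- see the diff below for the real change. */
static void move_charge_pte_range_sketch(struct vm_area_struct *vma, pmd_t *pmd,
					 unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	for (; addr != end; pte++, addr += PAGE_SIZE) {
		struct page *page;

		if (!pte_present(*pte))
			continue;
		page = vm_normal_page(vma, addr, *pte);
		if (!page)
			continue;
		/*
		 * No get_page() here: while ptl is held and the pte still
		 * maps the page, the mapping's own reference keeps
		 * page->count above zero.
		 */
		if (!isolate_lru_page(page)) {
			/* ... mem_cgroup_move_account() on its page_cgroup ... */
			putback_lru_page(page);
		}
		/* ... and no put_page() for the mapped case either. */
	}
	pte_unmap_unlock(pte - 1, ptl);
}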
Then, performance becomes:
[Before Patch]
[root@bluextal kamezawa]# time echo 2530 > /cgroup/B/tasks
real 0m0.792s
user 0m0.000s
sys 0m0.780s
[After Patch]
[root@bluextal kamezawa]# time echo 2257 > /cgroup/B/tasks
real 0m0.694s
user 0m0.000s
sys 0m0.683s
perf's log is
10.82% echo [kernel.kallsyms] [k] isolate_lru_page
10.01% echo [kernel.kallsyms] [k] mem_cgroup_move_account
8.75% echo [kernel.kallsyms] [k] is_target_pte_for_mc
8.52% echo [kernel.kallsyms] [k] ____pagevec_lru_add
6.90% echo [kernel.kallsyms] [k] putback_lru_page
6.36% echo [kernel.kallsyms] [k] mem_cgroup_add_lru_list
6.22% echo [kernel.kallsyms] [k] mem_cgroup_del_lru_list
5.68% echo [kernel.kallsyms] [k] lookup_page_cgroup
5.28% echo [kernel.kallsyms] [k] __lru_cache_add
5.00% echo [kernel.kallsyms] [k] release_pages
3.79% echo [kernel.kallsyms] [k] _raw_spin_lock_irq
3.52% echo [kernel.kallsyms] [k] memcg_check_events
3.38% echo [kernel.kallsyms] [k] bit_spin_lock
3.25% echo [kernel.kallsyms] [k] put_page
Seems nice. I updated isolate_lru_page()'s comment, too.
# Note: isolate_lru_page() is necessary before account move for avoiding
# memcg's LRU manipulation.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 63 +++++++++++++++++++++++++++++++++++---------------------
mm/vmscan.c | 3 +-
2 files changed, 42 insertions(+), 24 deletions(-)
Index: mmotm-1013/mm/memcontrol.c
===================================================================
--- mmotm-1013.orig/mm/memcontrol.c
+++ mmotm-1013/mm/memcontrol.c
@@ -1169,7 +1169,6 @@ static void mem_cgroup_end_move(struct m
* under hierarchy of moving cgroups. This is for
* waiting at hith-memory prressure caused by "move".
*/
-
static bool mem_cgroup_stealed(struct mem_cgroup *mem)
{
VM_BUG_ON(!rcu_read_lock_held());
@@ -4471,11 +4470,14 @@ one_by_one:
* Returns
* 0(MC_TARGET_NONE): if the pte is not a target for move charge.
* 1(MC_TARGET_PAGE): if the page corresponding to this pte is a target for
- * move charge. if @target is not NULL, the page is stored in target->page
- * with extra refcnt got(Callers should handle it).
+ * move charge and it's mapped.. if @target is not NULL, the page is
+ * stored in target->pagewithout extra refcnt.
* 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
* target for charge migration. if @target is not NULL, the entry is stored
* in target->ent.
+ * 3(MC_TARGET_UNMAPPED_PAGE): if the page corresponding to this pte is a
+ * target for move charge. if @target is not NULL, the page is stored in
+ * target->page with extra refcnt got(Callers should handle it).
*
* Called with pte lock held.
*/
@@ -4486,8 +4488,9 @@ union mc_target {
enum mc_target_type {
MC_TARGET_NONE, /* not used */
- MC_TARGET_PAGE,
+ MC_TARGET_PAGE, /* a page mapped */
MC_TARGET_SWAP,
+ MC_TARGET_UNMAPPED_PAGE, /* a page unmapped */
};
static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
@@ -4504,9 +4507,10 @@ static struct page *mc_handle_present_pt
} else if (!move_file())
/* we ignore mapcount for file pages */
return NULL;
- if (!get_page_unless_zero(page))
- return NULL;
-
+ /*
+ * Because we're under pte_lock and the page is mapped,
+ * get_page() isn't necessary
+ */
return page;
}
@@ -4570,14 +4574,18 @@ static int is_target_pte_for_mc(struct v
struct page *page = NULL;
struct page_cgroup *pc;
int ret = 0;
+ bool present = true;
swp_entry_t ent = { .val = 0 };
if (pte_present(ptent))
page = mc_handle_present_pte(vma, addr, ptent);
- else if (is_swap_pte(ptent))
- page = mc_handle_swap_pte(vma, addr, ptent, &ent);
- else if (pte_none(ptent) || pte_file(ptent))
- page = mc_handle_file_pte(vma, addr, ptent, &ent);
+ else {
+ present = false;
+ if (is_swap_pte(ptent))
+ page = mc_handle_swap_pte(vma, addr, ptent, &ent);
+ else if (pte_none(ptent) || pte_file(ptent))
+ page = mc_handle_file_pte(vma, addr, ptent, &ent);
+ }
if (!page && !ent.val)
return 0;
@@ -4589,11 +4597,15 @@ static int is_target_pte_for_mc(struct v
* the lock.
*/
if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
- ret = MC_TARGET_PAGE;
+ if (present)
+ ret = MC_TARGET_PAGE;
+ else
+ ret = MC_TARGET_UNMAPPED_PAGE;
if (target)
target->page = page;
}
- if (!ret || !target)
+ /* We got refcnt but the page is not for target */
+ if (!present && (!ret || !target))
put_page(page);
}
/* There is a swap entry and a page doesn't exist or isn't charged */
@@ -4780,19 +4792,24 @@ retry:
type = is_target_pte_for_mc(vma, addr, ptent, &target);
switch (type) {
case MC_TARGET_PAGE:
+ case MC_TARGET_UNMAPPED_PAGE:
page = target.page;
- if (isolate_lru_page(page))
- goto put;
- pc = lookup_page_cgroup(page);
- if (!mem_cgroup_move_account(pc,
+ if (!isolate_lru_page(page)) {
+ pc = lookup_page_cgroup(page);
+ if (!mem_cgroup_move_account(pc,
mc.from, mc.to, false)) {
- mc.precharge--;
- /* we uncharge from mc.from later. */
- mc.moved_charge++;
+ mc.precharge--;
+ /* we uncharge from mc.from later. */
+ mc.moved_charge++;
+ }
+ putback_lru_page(page);
}
- putback_lru_page(page);
-put: /* is_target_pte_for_mc() gets the page */
- put_page(page);
+ /*
+ * Because we holds pte_lock, we have a stable reference * to the page if mapped. If not mapped, we have an
+ * elevated refcnt. drop it.
+ */
+ if (type == MC_TARGET_UNMAPPED_PAGE)
+ put_page(page);
break;
case MC_TARGET_SWAP:
ent = target.ent;
Index: mmotm-1013/mm/vmscan.c
===================================================================
--- mmotm-1013.orig/mm/vmscan.c
+++ mmotm-1013/mm/vmscan.c
@@ -1166,7 +1166,8 @@ static unsigned long clear_active_flags(
* found will be decremented.
*
* Restrictions:
- * (1) Must be called with an elevated refcount on the page. This is a
+ * (1) Must be called with an elevated refcount on the page, IOW, the
+ * caller must guarantee that there is a stable reference. This is a
* fundamentnal difference from isolate_lru_pages (which is called
* without a stable reference).
* (2) the lru_lock must not be held.
* [RFC][PATCH 2/2] memcg: new lock for mutual execution of account_move and file stats
2010-10-15 8:06 [RFC][PATCH 0/2] memcg: some updates to move_account and file_stat races KAMEZAWA Hiroyuki
2010-10-15 8:11 ` [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge KAMEZAWA Hiroyuki
@ 2010-10-15 8:12 ` KAMEZAWA Hiroyuki
2010-10-17 5:33 ` Minchan Kim
1 sibling, 1 reply; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-15 8:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-kernel, linux-mm, nishimura, balbir, Greg Thelen, akpm
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
When we try to enhance the page status updates to support other flags,
one of the problems is updating status from IRQ context.
Now, mem_cgroup_update_file_stat() takes lock_page_cgroup() to avoid a
race with _account move_. IOW, there are no races with charge/uncharge
by nature. Considering an update from IRQ context, it seems better
to disable IRQs at lock_page_cgroup() to avoid deadlock.
But lock_page_cgroup() is used too widely, and adding IRQ disabling
there hurts performance. To avoid the big hammer, this patch
adds a new lock for update_stat().
This lock is for mutual execution of updating stats and account moving.
This adds a new lock to move_account(), so this makes move_account() slower.
But considering the trade-off, I think it's acceptable.
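As a concrete example of the kind of caller this is meant to enable, a future
writeback-accounting update from IRQ context could look roughly like the sketch below
(the function name and body are hypothetical, not part of this patch):

/* Hypothetical future caller -- illustrative only, not in this series. */
static void mem_cgroup_dec_writeback_stat(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);
	unsigned long flags;

	if (unlikely(!pc))
		return;
	/*
	 * Safe even when reached via end_page_writeback() in IRQ context:
	 * every holder of the stat lock runs with IRQs disabled, so the
	 * holder can never be interrupted and re-entered on the same CPU.
	 */
	lock_page_cgroup_stat(pc, &flags);
	/* ... decrement the per-memcg writeback counter here ... */
	unlock_page_cgroup_stat(pc, &flags);
}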
A score for moving 8GB of anon pages on an 8-CPU Xeon (3.1GHz) is below.
[before patch] (mmotm + optimization patch #1 in this series)
[root@bluextal kamezawa]# time echo 2257 > /cgroup/B/tasks
real 0m0.694s
user 0m0.000s
sys 0m0.683s
[After patch]
[root@bluextal kamezawa]# time echo 2238 > /cgroup/B/tasks
real 0m0.741s
user 0m0.000s
sys 0m0.730s
This moves 8GB == 2048k pages. There are no bad effects on code
other than "move".
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/page_cgroup.h | 29 +++++++++++++++++++++++++++++
mm/memcontrol.c | 11 +++++++++--
2 files changed, 38 insertions(+), 2 deletions(-)
Index: mmotm-1013/include/linux/page_cgroup.h
===================================================================
--- mmotm-1013.orig/include/linux/page_cgroup.h
+++ mmotm-1013/include/linux/page_cgroup.h
@@ -36,6 +36,7 @@ struct page_cgroup *lookup_page_cgroup(s
enum {
/* flags for mem_cgroup */
PCG_LOCK, /* page cgroup is locked */
+ PCG_LOCK_STATS, /* page cgroup's stat accounting flags are locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
@@ -104,6 +105,34 @@ static inline void unlock_page_cgroup(st
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
+/*
+ * Because page's status can be updated in IRQ context(PG_writeback)
+ * we disable IRQ at updating page's stat.
+ */
+static inline void lock_page_cgroup_stat(struct page_cgroup *pc,
+ unsigned long *flags)
+{
+ local_irq_save(*flags);
+ bit_spin_lock(PCG_LOCK_STATS, &pc->flags);
+}
+
+static inline void __lock_page_cgroup_stat(struct page_cgroup *pc)
+{
+ bit_spin_lock(PCG_LOCK_STATS, &pc->flags);
+}
+
+static inline void unlock_page_cgroup_stat(struct page_cgroup *pc,
+ unsigned long *flags)
+{
+ bit_spin_unlock(PCG_LOCK_STATS, &pc->flags);
+ local_irq_restore(*flags);
+}
+
+static inline void __unlock_page_cgroup_stat(struct page_cgroup *pc)
+{
+ bit_spin_unlock(PCG_LOCK_STATS, &pc->flags);
+}
+
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct page_cgroup;
Index: mmotm-1013/mm/memcontrol.c
===================================================================
--- mmotm-1013.orig/mm/memcontrol.c
+++ mmotm-1013/mm/memcontrol.c
@@ -1596,6 +1596,7 @@ static void mem_cgroup_update_file_stat(
struct mem_cgroup *mem;
struct page_cgroup *pc = lookup_page_cgroup(page);
bool need_unlock = false;
+ unsigned long flags = 0;
if (unlikely(!pc))
return;
@@ -1607,7 +1608,7 @@ static void mem_cgroup_update_file_stat(
/* pc->mem_cgroup is unstable ? */
if (unlikely(mem_cgroup_stealed(mem))) {
/* take a lock against to access pc->mem_cgroup */
- lock_page_cgroup(pc);
+ lock_page_cgroup_stat(pc, &flags);
need_unlock = true;
mem = pc->mem_cgroup;
if (!mem || !PageCgroupUsed(pc))
@@ -1629,7 +1630,7 @@ static void mem_cgroup_update_file_stat(
out:
if (unlikely(need_unlock))
- unlock_page_cgroup(pc);
+ unlock_page_cgroup_stat(pc, &flags);
rcu_read_unlock();
return;
}
@@ -2187,12 +2188,18 @@ static int mem_cgroup_move_account(struc
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
{
int ret = -EINVAL;
+
+ /* Avoiding dead-lock with page stat updates via irq context */
+ local_irq_disable();
lock_page_cgroup(pc);
if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+ __lock_page_cgroup_stat(pc);
__mem_cgroup_move_account(pc, from, to, uncharge);
+ __unlock_page_cgroup_stat(pc);
ret = 0;
}
unlock_page_cgroup(pc);
+ local_irq_enable();
/*
* check events
*/
* Re: [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge
2010-10-15 8:11 ` [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge KAMEZAWA Hiroyuki
@ 2010-10-15 17:20 ` Christoph Lameter
2010-10-15 22:36 ` Hiroyuki Kamezawa
2010-10-18 4:29 ` Daisuke Nishimura
1 sibling, 1 reply; 10+ messages in thread
From: Christoph Lameter @ 2010-10-15 17:20 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-kernel, linux-mm, nishimura, balbir, Greg Thelen, akpm,
Mel Gorman, kosaki.motohiro
On Fri, 15 Oct 2010, KAMEZAWA Hiroyuki wrote:
> But above is called all under pte_offset_map_lock().
> get_page_unless_zero() #1 is not necessary because we do all under a
> pte_offset_map_lock().
The two (ptl and refcount) are entirely different. The ptl is for
protecting the page table. The refcount handles only the page.
However, if the entry in the page table is pointing to the page then there
must have been a refcount taken on the page. So if you know that the page
is in the page table and you took the ptl then you can be sure that the
page refcount will not become zero. Therefore get_page_unless_zero() will
never fail and there is no need to take additional refcounts as long as
the page table lock is held and the page is not removed from the page
table.
> Index: mmotm-1013/mm/vmscan.c
> ===================================================================
> --- mmotm-1013.orig/mm/vmscan.c
> +++ mmotm-1013/mm/vmscan.c
> @@ -1166,7 +1166,8 @@ static unsigned long clear_active_flags(
> * found will be decremented.
> *
> * Restrictions:
> - * (1) Must be called with an elevated refcount on the page. This is a
> + * (1) Must be called with an elevated refcount on the page, IOW, the
> + * caller must guarantee that there is a stable reference. This is a
> * fundamentnal difference from isolate_lru_pages (which is called
> * without a stable reference).
> * (2) the lru_lock must not be held.
There is no need for this change since you have an elevated refcount.
IMHO the words "stable reference" may be confusing since the refcount may
change. The elevated refcount protects against the freeing of the page.
* Re: [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge
2010-10-15 17:20 ` Christoph Lameter
@ 2010-10-15 22:36 ` Hiroyuki Kamezawa
0 siblings, 0 replies; 10+ messages in thread
From: Hiroyuki Kamezawa @ 2010-10-15 22:36 UTC (permalink / raw)
To: Christoph Lameter
Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, nishimura, balbir,
Greg Thelen, akpm, Mel Gorman, kosaki.motohiro
2010/10/16 Christoph Lameter <cl@linux.com>:
> On Fri, 15 Oct 2010, KAMEZAWA Hiroyuki wrote:
>
>> But above is called all under pte_offset_map_lock().
>> get_page_unless_zero() #1 is not necessary because we do all under a
>> pte_offset_map_lock().
>
> The two (ptl and refcount) are entirely different. The ptl is for
> protecting the page table. The refcount handles only the page.
>
> However, if the entry in the page table is pointing to the page then there
> must have been a refcount taken on the page. So if you know that the page
> is in the page table and you took the ptl then you can be sure that the
> page refcount will not become zero. Therefore get_page_unless_zero() will
> never fail and there is no need to take additional refcounts as long as
> the page table lock is held and the page is not removed from the page
> table.
>
Ok, thank you for the explanation. I can make this function faster.
>> Index: mmotm-1013/mm/vmscan.c
>> ===================================================================
>> --- mmotm-1013.orig/mm/vmscan.c
>> +++ mmotm-1013/mm/vmscan.c
>> @@ -1166,7 +1166,8 @@ static unsigned long clear_active_flags(
>> * found will be decremented.
>> *
>> * Restrictions:
>> - * (1) Must be called with an elevated refcount on the page. This is a
>> + * (1) Must be called with an elevated refcount on the page, IOW, the
>> + * caller must guarantee that there is a stable reference. This is a
>> * fundamentnal difference from isolate_lru_pages (which is called
>> * without a stable reference).
>> * (2) the lru_lock must not be held.
>
> There is no need for this change since you have an elevated refcount.
> IMH The words "stable reference" may be confusing since the refcount may
> change. The elevated refcount protects against the freeing of the page.
>
Sure, I'll drop this change in v2. I misunderstood "elevated refcount"
to mean "extra get_page()".
Thanks,
-Kame
* Re: [RFC][PATCH 2/2] memcg: new lock for mutual execution of account_move and file stats
2010-10-15 8:12 ` [RFC][PATCH 2/2] memcg: new lock for mutual execution of account_move and file stats KAMEZAWA Hiroyuki
@ 2010-10-17 5:33 ` Minchan Kim
2010-10-17 5:35 ` Minchan Kim
0 siblings, 1 reply; 10+ messages in thread
From: Minchan Kim @ 2010-10-17 5:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-kernel, linux-mm, nishimura, balbir, Greg Thelen, akpm
On Fri, Oct 15, 2010 at 5:12 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> When we try to enhance page's status update to support other flags,
> one of problem is updating status from IRQ context.
>
> Now, mem_cgroup_update_file_stat() takes lock_page_cgroup() to avoid
> race with _account move_. IOW, there are no races with charge/uncharge
> in nature. Considering an update from IRQ context, it seems better
> to disable IRQ at lock_page_cgroup() to avoid deadlock.
>
> But lock_page_cgroup() is used too widerly and adding IRQ disable
> there makes the performance bad. To avoid the big hammer, this patch
> adds a new lock for update_stat().
>
> This lock is for mutual execustion of updating stat and accout moving.
> This adds a new lock to move_account..so, this makes move_account slow.
> But considering trade-off, I think it's acceptable.
>
> A score of moving 8GB anon pages, 8cpu Xeon(3.1GHz) is here.
>
> [before patch] (mmotm + optimization patch (#1 in this series)
> [root@bluextal kamezawa]# time echo 2257 > /cgroup/B/tasks
>
> real 0m0.694s
> user 0m0.000s
> sys 0m0.683s
>
> [After patch]
> [root@bluextal kamezawa]# time echo 2238 > /cgroup/B/tasks
>
> real 0m0.741s
> user 0m0.000s
> sys 0m0.730s
>
> This moves 8Gbytes == 2048k pages. But no bad effects to codes
> other than "move".
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
It looks better than the old approach.
Just a nitpick below.
> ---
> include/linux/page_cgroup.h | 29 +++++++++++++++++++++++++++++
> mm/memcontrol.c | 11 +++++++++--
> 2 files changed, 38 insertions(+), 2 deletions(-)
>
> Index: mmotm-1013/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-1013.orig/include/linux/page_cgroup.h
> +++ mmotm-1013/include/linux/page_cgroup.h
> @@ -36,6 +36,7 @@ struct page_cgroup *lookup_page_cgroup(s
> enum {
> /* flags for mem_cgroup */
> PCG_LOCK, /* page cgroup is locked */
> + PCG_LOCK_STATS, /* page cgroup's stat accounting flags are locked */
Hmm, I don't think the naming is good. Aren't both locks for stats?
As I understand it, both are used for stats:
one is just used by charge/uncharge and the other is used by
update_file_stat/move_account.
If you guys who are experts in memcg are comfortable with it, I am not against it.
But at least, people not familiar with memcg, like me, don't find it comfortable.
--
Kind regards,
Minchan Kim
* Re: [RFC][PATCH 2/2] memcg: new lock for mutual execution of account_move and file stats
2010-10-17 5:33 ` Minchan Kim
@ 2010-10-17 5:35 ` Minchan Kim
2010-10-18 0:19 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 10+ messages in thread
From: Minchan Kim @ 2010-10-17 5:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-kernel, linux-mm, nishimura, balbir, Greg Thelen, akpm
On Sun, Oct 17, 2010 at 2:33 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Fri, Oct 15, 2010 at 5:12 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> When we try to enhance page's status update to support other flags,
>> one of problem is updating status from IRQ context.
>>
>> Now, mem_cgroup_update_file_stat() takes lock_page_cgroup() to avoid
>> race with _account move_. IOW, there are no races with charge/uncharge
>> in nature. Considering an update from IRQ context, it seems better
>> to disable IRQ at lock_page_cgroup() to avoid deadlock.
>>
>> But lock_page_cgroup() is used too widerly and adding IRQ disable
>> there makes the performance bad. To avoid the big hammer, this patch
>> adds a new lock for update_stat().
>>
>> This lock is for mutual execustion of updating stat and accout moving.
>> This adds a new lock to move_account..so, this makes move_account slow.
>> But considering trade-off, I think it's acceptable.
>>
>> A score of moving 8GB anon pages, 8cpu Xeon(3.1GHz) is here.
>>
>> [before patch] (mmotm + optimization patch (#1 in this series)
>> [root@bluextal kamezawa]# time echo 2257 > /cgroup/B/tasks
>>
>> real 0m0.694s
>> user 0m0.000s
>> sys 0m0.683s
>>
>> [After patch]
>> [root@bluextal kamezawa]# time echo 2238 > /cgroup/B/tasks
>>
>> real 0m0.741s
>> user 0m0.000s
>> sys 0m0.730s
>>
>> This moves 8Gbytes == 2048k pages. But no bad effects to codes
>> other than "move".
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> It looks good than old approach.
> Just a below nitpick.
>
>> ---
>> include/linux/page_cgroup.h | 29 +++++++++++++++++++++++++++++
>> mm/memcontrol.c | 11 +++++++++--
>> 2 files changed, 38 insertions(+), 2 deletions(-)
>>
>> Index: mmotm-1013/include/linux/page_cgroup.h
>> ===================================================================
>> --- mmotm-1013.orig/include/linux/page_cgroup.h
>> +++ mmotm-1013/include/linux/page_cgroup.h
>> @@ -36,6 +36,7 @@ struct page_cgroup *lookup_page_cgroup(s
>> enum {
>> /* flags for mem_cgroup */
>> PCG_LOCK, /* page cgroup is locked */
>> + PCG_LOCK_STATS, /* page cgroup's stat accounting flags are locked */
>
> Hmm, I think naming isn't a good. Aren't both for stat?
> As I understand, Both are used for stat.
> One is just used by charge/uncharge and the other is used by
> pdate_file_stat/move_account.
> If you guys who are expert in mcg feel it with easy, I am not against.
> But at least, mcg-not-familiar people like me don't feel it comfortable.
>
And I think this patch would be better as part of Greg Thelen's series.
--
Kind regards,
Minchan Kim
* Re: [RFC][PATCH 2/2] memcg: new lock for mutual execution of account_move and file stats
2010-10-17 5:35 ` Minchan Kim
@ 2010-10-18 0:19 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-18 0:19 UTC (permalink / raw)
To: Minchan Kim; +Cc: linux-kernel, linux-mm, nishimura, balbir, Greg Thelen, akpm
On Sun, 17 Oct 2010 14:35:47 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> On Sun, Oct 17, 2010 at 2:33 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > On Fri, Oct 15, 2010 at 5:12 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >>
> >> When we try to enhance page's status update to support other flags,
> >> one of problem is updating status from IRQ context.
> >>
> >> Now, mem_cgroup_update_file_stat() takes lock_page_cgroup() to avoid
> >> race with _account move_. IOW, there are no races with charge/uncharge
> >> in nature. Considering an update from IRQ context, it seems better
> >> to disable IRQ at lock_page_cgroup() to avoid deadlock.
> >>
> >> But lock_page_cgroup() is used too widerly and adding IRQ disable
> >> there makes the performance bad. To avoid the big hammer, this patch
> >> adds a new lock for update_stat().
> >>
> >> This lock is for mutual execustion of updating stat and accout moving.
> >> This adds a new lock to move_account..so, this makes move_account slow.
> >> But considering trade-off, I think it's acceptable.
> >>
> >> A score of moving 8GB anon pages, 8cpu Xeon(3.1GHz) is here.
> >>
> >> [before patch] (mmotm + optimization patch (#1 in this series)
> >> [root@bluextal kamezawa]# time echo 2257 > /cgroup/B/tasks
> >>
> >> real 0m0.694s
> >> user 0m0.000s
> >> sys 0m0.683s
> >>
> >> [After patch]
> >> [root@bluextal kamezawa]# time echo 2238 > /cgroup/B/tasks
> >>
> >> real 0m0.741s
> >> user 0m0.000s
> >> sys 0m0.730s
> >>
> >> This moves 8Gbytes == 2048k pages. But no bad effects to codes
> >> other than "move".
> >>
> >> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > It looks good than old approach.
> > Just a below nitpick.
> >
> >> ---
> >> include/linux/page_cgroup.h | 29 +++++++++++++++++++++++++++++
> >> mm/memcontrol.c | 11 +++++++++--
> >> 2 files changed, 38 insertions(+), 2 deletions(-)
> >>
> >> Index: mmotm-1013/include/linux/page_cgroup.h
> >> ===================================================================
> >> --- mmotm-1013.orig/include/linux/page_cgroup.h
> >> +++ mmotm-1013/include/linux/page_cgroup.h
> >> @@ -36,6 +36,7 @@ struct page_cgroup *lookup_page_cgroup(s
> >> enum {
> >> /* flags for mem_cgroup */
> >> PCG_LOCK, /* page cgroup is locked */
> >> + PCG_LOCK_STATS, /* page cgroup's stat accounting flags are locked */
> >
> > Hmm, I think naming isn't a good. Aren't both for stat?
PCG_LOCK is for page_cgroup->mem_cgroup, not for stats.
But hmm... how about:
{
	PCG_LOCK,		/* for CACHE, USED and pc->mem_cgroup */
	PCG_CACHE,
	PCG_USED,
	PCG_ACCT_LRU,		/* no lock is used */
	PCG_MOVE_FLAGS_LOCK,	/* for MAPPED and I/O flags vs. account_move races */
	PCG_FILE_MAPPED,
	...
	PCG_MIGRATION,		/* for remembering page migration */
}
Anyway, documentation should be updated.
...
> > As I understand, Both are used for stat.
> > One is just used by charge/uncharge and the other is used by
> > pdate_file_stat/move_account.
> > If you guys who are expert in mcg feel it with easy, I am not against.
> > But at least, mcg-not-familiar people like me don't feel it comfortable.
> >
>
> And I think this patch would be better to be part of Greg Thelen's series.
>
Hmm. Greg, can you merge my new version (I'll post it today) into your series?
Thanks,
-Kame
* Re: [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge
2010-10-15 8:11 ` [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge KAMEZAWA Hiroyuki
2010-10-15 17:20 ` Christoph Lameter
@ 2010-10-18 4:29 ` Daisuke Nishimura
2010-10-18 4:38 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 10+ messages in thread
From: Daisuke Nishimura @ 2010-10-18 4:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-kernel, linux-mm, balbir, Greg Thelen, akpm, Mel Gorman,
kosaki.motohiro, Christoph Lameter, Daisuke Nishimura
On Fri, 15 Oct 2010 17:11:09 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> CC'ed to Mel and Chrisotph, KOSAKi because I wanted to be double-cheked
> that I miss something important at isolate_lru_page().
>
> Checking perofrmance of memcg's move_account with perf, you may notice
> that put_page() is too much.
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... ................. .....................................
> #
> 14.24% echo [kernel.kallsyms] [k] put_page
> 12.80% echo [kernel.kallsyms] [k] isolate_lru_page
> 9.67% echo [kernel.kallsyms] [k] is_target_pte_for_mc
> 8.11% echo [kernel.kallsyms] [k] ____pagevec_lru_add
> 7.22% echo [kernel.kallsyms] [k] putback_lru_page
>
> This is because mc_handle_present_pte() do get_page(). Then,
> page->count is updated 4 times.
> get_page_unless_zero() #1
> isolate_lru_page()
> putback_lru_page()
> put_page()
>
> But above is called all under pte_offset_map_lock().
> get_page_unless_zero() #1 is not necessary because we do all under a
> pte_offset_map_lock().
>
> isolate_lru_page()'s comment says
> # Restrictions:
> # (1) Must be called with an elevated refcount on the page. This is a
> # fundamentnal difference from isolate_lru_pages (which is called
> # without a stable reference).
>
> So, current implemnation does get_page_unless_zero() explicitly but
> holding pte_lock() implies a stable reference. I removed #1.
>
> Then, Performance will be
> [Before Patch]
> [root@bluextal kamezawa]# time echo 2530 > /cgroup/B/tasks
>
> real 0m0.792s
> user 0m0.000s
> sys 0m0.780s
>
> [After Patch]
> [root@bluextal kamezawa]# time echo 2257 > /cgroup/B/tasks
>
> real 0m0.694s
> user 0m0.000s
> sys 0m0.683s
>
Very nice!
Some nitpicks.
> perf's log is
> 10.82% echo [kernel.kallsyms] [k] isolate_lru_page
> 10.01% echo [kernel.kallsyms] [k] mem_cgroup_move_account
> 8.75% echo [kernel.kallsyms] [k] is_target_pte_for_mc
> 8.52% echo [kernel.kallsyms] [k] ____pagevec_lru_add
> 6.90% echo [kernel.kallsyms] [k] putback_lru_page
> 6.36% echo [kernel.kallsyms] [k] mem_cgroup_add_lru_list
> 6.22% echo [kernel.kallsyms] [k] mem_cgroup_del_lru_list
> 5.68% echo [kernel.kallsyms] [k] lookup_page_cgroup
> 5.28% echo [kernel.kallsyms] [k] __lru_cache_add
> 5.00% echo [kernel.kallsyms] [k] release_pages
> 3.79% echo [kernel.kallsyms] [k] _raw_spin_lock_irq
> 3.52% echo [kernel.kallsyms] [k] memcg_check_events
> 3.38% echo [kernel.kallsyms] [k] bit_spin_lock
> 3.25% echo [kernel.kallsyms] [k] put_page
>
> seems nice. I updated isolate_lru_page()'s comment, too.
>
> # Note: isolate_lru_page() is necessary before account move for avoinding
> memcg's LRU manipulation.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 63 +++++++++++++++++++++++++++++++++++---------------------
> mm/vmscan.c | 3 +-
> 2 files changed, 42 insertions(+), 24 deletions(-)
>
> Index: mmotm-1013/mm/memcontrol.c
> ===================================================================
> --- mmotm-1013.orig/mm/memcontrol.c
> +++ mmotm-1013/mm/memcontrol.c
> @@ -1169,7 +1169,6 @@ static void mem_cgroup_end_move(struct m
> * under hierarchy of moving cgroups. This is for
> * waiting at hith-memory prressure caused by "move".
> */
> -
> static bool mem_cgroup_stealed(struct mem_cgroup *mem)
> {
> VM_BUG_ON(!rcu_read_lock_held());
> @@ -4471,11 +4470,14 @@ one_by_one:
> * Returns
> * 0(MC_TARGET_NONE): if the pte is not a target for move charge.
> * 1(MC_TARGET_PAGE): if the page corresponding to this pte is a target for
> - * move charge. if @target is not NULL, the page is stored in target->page
> - * with extra refcnt got(Callers should handle it).
> + * move charge and it's mapped.. if @target is not NULL, the page is
> + * stored in target->pagewithout extra refcnt.
^^ needs ' '.
> * 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
> * target for charge migration. if @target is not NULL, the entry is stored
> * in target->ent.
> + * 3(MC_TARGET_UNMAPPED_PAGE): if the page corresponding to this pte is a
> + * target for move charge. if @target is not NULL, the page is stored in
> + * target->page with extra refcnt got(Callers should handle it).
> *
> * Called with pte lock held.
> */
> @@ -4486,8 +4488,9 @@ union mc_target {
>
> enum mc_target_type {
> MC_TARGET_NONE, /* not used */
> - MC_TARGET_PAGE,
> + MC_TARGET_PAGE, /* a page mapped */
> MC_TARGET_SWAP,
> + MC_TARGET_UNMAPPED_PAGE, /* a page unmapped */
> };
>
I prefer the order of "MC_TARGET_PAGE", "MC_TARGET_UNMAPPED_PAGE", and "MC_TARGET_SWAP".
> static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> @@ -4504,9 +4507,10 @@ static struct page *mc_handle_present_pt
> } else if (!move_file())
> /* we ignore mapcount for file pages */
> return NULL;
> - if (!get_page_unless_zero(page))
> - return NULL;
> -
> + /*
> + * Because we're under pte_lock and the page is mapped,
> + * get_page() isn't necessary
> + */
> return page;
> }
>
> @@ -4570,14 +4574,18 @@ static int is_target_pte_for_mc(struct v
> struct page *page = NULL;
> struct page_cgroup *pc;
> int ret = 0;
> + bool present = true;
> swp_entry_t ent = { .val = 0 };
>
> if (pte_present(ptent))
> page = mc_handle_present_pte(vma, addr, ptent);
> - else if (is_swap_pte(ptent))
> - page = mc_handle_swap_pte(vma, addr, ptent, &ent);
> - else if (pte_none(ptent) || pte_file(ptent))
> - page = mc_handle_file_pte(vma, addr, ptent, &ent);
> + else {
> + present = false;
> + if (is_swap_pte(ptent))
> + page = mc_handle_swap_pte(vma, addr, ptent, &ent);
> + else if (pte_none(ptent) || pte_file(ptent))
> + page = mc_handle_file_pte(vma, addr, ptent, &ent);
> + }
>
> if (!page && !ent.val)
> return 0;
> @@ -4589,11 +4597,15 @@ static int is_target_pte_for_mc(struct v
> * the lock.
> */
> if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> - ret = MC_TARGET_PAGE;
> + if (present)
> + ret = MC_TARGET_PAGE;
> + else
> + ret = MC_TARGET_UNMAPPED_PAGE;
> if (target)
> target->page = page;
> }
> - if (!ret || !target)
> + /* We got refcnt but the page is not for target */
> + if (!present && (!ret || !target))
> put_page(page);
> }
> /* There is a swap entry and a page doesn't exist or isn't charged */
> @@ -4780,19 +4792,24 @@ retry:
> type = is_target_pte_for_mc(vma, addr, ptent, &target);
> switch (type) {
> case MC_TARGET_PAGE:
> + case MC_TARGET_UNMAPPED_PAGE:
> page = target.page;
> - if (isolate_lru_page(page))
> - goto put;
> - pc = lookup_page_cgroup(page);
> - if (!mem_cgroup_move_account(pc,
> + if (!isolate_lru_page(page)) {
> + pc = lookup_page_cgroup(page);
> + if (!mem_cgroup_move_account(pc,
> mc.from, mc.to, false)) {
> - mc.precharge--;
> - /* we uncharge from mc.from later. */
> - mc.moved_charge++;
> + mc.precharge--;
> + /* we uncharge from mc.from later. */
> + mc.moved_charge++;
> + }
> + putback_lru_page(page);
> }
> - putback_lru_page(page);
> -put: /* is_target_pte_for_mc() gets the page */
> - put_page(page);
> + /*
> + * Because we holds pte_lock, we have a stable reference * to the page if mapped. If not mapped, we have an
You need a new line :)
Thanks,
Daisuke Nishimura.
> + * elevated refcnt. drop it.
> + */
> + if (type == MC_TARGET_UNMAPPED_PAGE)
> + put_page(page);
> break;
> case MC_TARGET_SWAP:
> ent = target.ent;
> Index: mmotm-1013/mm/vmscan.c
> ===================================================================
> --- mmotm-1013.orig/mm/vmscan.c
> +++ mmotm-1013/mm/vmscan.c
> @@ -1166,7 +1166,8 @@ static unsigned long clear_active_flags(
> * found will be decremented.
> *
> * Restrictions:
> - * (1) Must be called with an elevated refcount on the page. This is a
> + * (1) Must be called with an elevated refcount on the page, IOW, the
> + * caller must guarantee that there is a stable reference. This is a
> * fundamentnal difference from isolate_lru_pages (which is called
> * without a stable reference).
> * (2) the lru_lock must not be held.
>
* Re: [RFC][PATCH 1/2] memcg: avoiding unnecessary get_page at move_charge
2010-10-18 4:29 ` Daisuke Nishimura
@ 2010-10-18 4:38 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-18 4:38 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: linux-kernel, linux-mm, balbir, Greg Thelen, akpm, Mel Gorman,
kosaki.motohiro, Christoph Lameter
On Mon, 18 Oct 2010 13:29:01 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Fri, 15 Oct 2010 17:11:09 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > perf's log is
> > 10.82% echo [kernel.kallsyms] [k] isolate_lru_page
> > 10.01% echo [kernel.kallsyms] [k] mem_cgroup_move_account
> > 8.75% echo [kernel.kallsyms] [k] is_target_pte_for_mc
> > 8.52% echo [kernel.kallsyms] [k] ____pagevec_lru_add
> > 6.90% echo [kernel.kallsyms] [k] putback_lru_page
> > 6.36% echo [kernel.kallsyms] [k] mem_cgroup_add_lru_list
> > 6.22% echo [kernel.kallsyms] [k] mem_cgroup_del_lru_list
> > 5.68% echo [kernel.kallsyms] [k] lookup_page_cgroup
> > 5.28% echo [kernel.kallsyms] [k] __lru_cache_add
> > 5.00% echo [kernel.kallsyms] [k] release_pages
> > 3.79% echo [kernel.kallsyms] [k] _raw_spin_lock_irq
> > 3.52% echo [kernel.kallsyms] [k] memcg_check_events
> > 3.38% echo [kernel.kallsyms] [k] bit_spin_lock
> > 3.25% echo [kernel.kallsyms] [k] put_page
> >
> > seems nice. I updated isolate_lru_page()'s comment, too.
> >
> > # Note: isolate_lru_page() is necessary before account move for avoinding
> > memcg's LRU manipulation.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> > mm/memcontrol.c | 63 +++++++++++++++++++++++++++++++++++---------------------
> > mm/vmscan.c | 3 +-
> > 2 files changed, 42 insertions(+), 24 deletions(-)
> >
> > Index: mmotm-1013/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-1013.orig/mm/memcontrol.c
> > +++ mmotm-1013/mm/memcontrol.c
> > @@ -1169,7 +1169,6 @@ static void mem_cgroup_end_move(struct m
> > * under hierarchy of moving cgroups. This is for
> > * waiting at hith-memory prressure caused by "move".
> > */
> > -
> > static bool mem_cgroup_stealed(struct mem_cgroup *mem)
> > {
> > VM_BUG_ON(!rcu_read_lock_held());
> > @@ -4471,11 +4470,14 @@ one_by_one:
> > * Returns
> > * 0(MC_TARGET_NONE): if the pte is not a target for move charge.
> > * 1(MC_TARGET_PAGE): if the page corresponding to this pte is a target for
> > - * move charge. if @target is not NULL, the page is stored in target->page
> > - * with extra refcnt got(Callers should handle it).
> > + * move charge and it's mapped.. if @target is not NULL, the page is
> > + * stored in target->pagewithout extra refcnt.
> ^^ needs ' '.
>
> > * 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
> > * target for charge migration. if @target is not NULL, the entry is stored
> > * in target->ent.
> > + * 3(MC_TARGET_UNMAPPED_PAGE): if the page corresponding to this pte is a
> > + * target for move charge. if @target is not NULL, the page is stored in
> > + * target->page with extra refcnt got(Callers should handle it).
> > *
> > * Called with pte lock held.
> > */
> > @@ -4486,8 +4488,9 @@ union mc_target {
> >
> > enum mc_target_type {
> > MC_TARGET_NONE, /* not used */
> > - MC_TARGET_PAGE,
> > + MC_TARGET_PAGE, /* a page mapped */
> > MC_TARGET_SWAP,
> > + MC_TARGET_UNMAPPED_PAGE, /* a page unmapped */
> > };
> >
> I prefer the order of "MC_TARGET_PAGE", "MC_TARGET_UNMAPPED_PAGE", and "MC_TARGET_SWAP".
>
ok.
> > static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> > @@ -4504,9 +4507,10 @@ static struct page *mc_handle_present_pt
> > } else if (!move_file())
> > /* we ignore mapcount for file pages */
> > return NULL;
> > - if (!get_page_unless_zero(page))
> > - return NULL;
> > -
> > + /*
> > + * Because we're under pte_lock and the page is mapped,
> > + * get_page() isn't necessary
> > + */
> > return page;
> > }
> >
> > @@ -4570,14 +4574,18 @@ static int is_target_pte_for_mc(struct v
> > struct page *page = NULL;
> > struct page_cgroup *pc;
> > int ret = 0;
> > + bool present = true;
> > swp_entry_t ent = { .val = 0 };
> >
> > if (pte_present(ptent))
> > page = mc_handle_present_pte(vma, addr, ptent);
> > - else if (is_swap_pte(ptent))
> > - page = mc_handle_swap_pte(vma, addr, ptent, &ent);
> > - else if (pte_none(ptent) || pte_file(ptent))
> > - page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > + else {
> > + present = false;
> > + if (is_swap_pte(ptent))
> > + page = mc_handle_swap_pte(vma, addr, ptent, &ent);
> > + else if (pte_none(ptent) || pte_file(ptent))
> > + page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > + }
> >
> > if (!page && !ent.val)
> > return 0;
> > @@ -4589,11 +4597,15 @@ static int is_target_pte_for_mc(struct v
> > * the lock.
> > */
> > if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> > - ret = MC_TARGET_PAGE;
> > + if (present)
> > + ret = MC_TARGET_PAGE;
> > + else
> > + ret = MC_TARGET_UNMAPPED_PAGE;
> > if (target)
> > target->page = page;
> > }
> > - if (!ret || !target)
> > + /* We got refcnt but the page is not for target */
> > + if (!present && (!ret || !target))
> > put_page(page);
> > }
> > /* There is a swap entry and a page doesn't exist or isn't charged */
> > @@ -4780,19 +4792,24 @@ retry:
> > type = is_target_pte_for_mc(vma, addr, ptent, &target);
> > switch (type) {
> > case MC_TARGET_PAGE:
> > + case MC_TARGET_UNMAPPED_PAGE:
> > page = target.page;
> > - if (isolate_lru_page(page))
> > - goto put;
> > - pc = lookup_page_cgroup(page);
> > - if (!mem_cgroup_move_account(pc,
> > + if (!isolate_lru_page(page)) {
> > + pc = lookup_page_cgroup(page);
> > + if (!mem_cgroup_move_account(pc,
> > mc.from, mc.to, false)) {
> > - mc.precharge--;
> > - /* we uncharge from mc.from later. */
> > - mc.moved_charge++;
> > + mc.precharge--;
> > + /* we uncharge from mc.from later. */
> > + mc.moved_charge++;
> > + }
> > + putback_lru_page(page);
> > }
> > - putback_lru_page(page);
> > -put: /* is_target_pte_for_mc() gets the page */
> > - put_page(page);
> > + /*
> > + * Because we holds pte_lock, we have a stable reference * to the page if mapped. If not mapped, we have an
>
> You need a new line :)
>
yes.
I'll reorder the patches and push this after the dirty_ratio patches go in.
But before doing that, I'll show an extreme version.
Thanks,
-Kame