[PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec)

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec)
@ 2009-12-04  5:46 Daisuke Nishimura
  2009-12-04  5:47 ` [PATCH -mmotm 1/7] cgroup: introduce cancel_attach() Daisuke Nishimura
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  5:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage,
	Daisuke Nishimura, linux-mm

Hi.

These are current patches of my move-charge-at-task-migration feature.

The biggest change from previous(19/Nov) version is improvement in performance.

I measured the elapsed time of "echo [pid] > <some path>/tasks" on KVM guest
with 4CPU/4GB(Xeon/3GHz) in three patterns:

  (1) / -> /00
  (2) /00 -> /01

  we don't need to call res_counter_uncharge against root, so (1) would be smaller
  than (2).

  (3) /00(setting mem.limit to half size of total) -> /01

  To compare the overhead of anon and swap.

In 19/Nov version:
       |  252M  |  512M  |   1G
  -----+--------+--------+--------
   (1) |  0.21  |  0.41  |  0.821
  -----+--------+--------+--------
   (2) |  0.43  |  0.85  |  1.71
  -----+--------+--------+--------
   (3) |  0.40  |  0.81  |  1.62
  -----+--------+--------+--------

In this version:
       |  252M  |  512M  |   1G
  -----+--------+--------+--------
   (1) |  0.15  |  0.30  |  0.60
  -----+--------+--------+--------
   (2) |  0.15  |  0.30  |  0.60
  -----+--------+--------+--------
   (3) |  0.22  |  0.44  |  0.89

Please read patch descriptions for each patch([4/7],[7/7]) for details of
how and how much the patch improved the performance.

  [1/7] cgroup: introduce cancel_attach()
  [2/7] memcg: add interface to move charge at task migration
  [3/7] memcg: move charges of anonymous page
  [4/7] memcg: improbe performance in moving charge
  [5/7] memcg: avoid oom during moving charge
  [6/7] memcg: move charges of anonymous swap
  [7/7] memcg: improbe performance in moving swap charge

Current version supports only recharge of non-shared(mapcount == 1) anonymous pages
and swaps of those pages. I think it's enough as a first step.


Overall history of this patch set:
2009/12/04
- rebase on mmotm-2009-11-24-16-47.
- change the term "recharge" to "move charge".
- improve performance in moving charge.
- parse the page table in can_attach() phase again(go back to the old behavior),
  because it doesn't add so big overheads, so it would be better to calculate
  the precharge count more accurately.
2009/11/19
- rebase on mmotm-2009-11-17-14-03 + KAMEZAWA-san's show per-process swap usage
  via procfs patch(v3).
- in can_attach(), instead of parsing the page table, make use of per process
  mm_counter(anon_rss, swap_usage).
- handle recharge_at_immigrate as bitmask(as I did in first version)
- use mm->owner instead of thread_group_leader()
2009/11/06
- remove "[RFC]".
- rebase on mmotm-2009-11-01-10-01.
- drop support for file cache and shmem/tmpfs(revisit in future).
- update Documentation/cgroup/memory.txt.
2009/10/13
- rebase on mmotm-2009-10-09-01-07 + KAMEZAWA-san's batched charge/uncharge(Oct09) + part
of KAMEZAWA-san's cleanup/fix patches(4,5,7 of Sep25 with some fixes).
- change the term "migrate" to "recharge".
2009/09/24
- change "migrate_charge" flag from "int" to "bool".
- in can_attach(), parse the page table of the task and count only the number
  of target ptes and call try_charge() repeatedly. No isolation at this phase.
- in attach(), parse the page table of the task again, and isolate the target
  page and call move_account() one by one.
- do no swap-in in moving swap account any more.
- add support for shmem/tmpfs's swap.
- update Documentation/cgroup/cgroup.txt.
2009/09/17
- first version


Regards,
Dasiuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH -mmotm 1/7] cgroup: introduce cancel_attach()
  2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
@ 2009-12-04  5:47 ` Daisuke Nishimura
  2009-12-04  5:48 ` [PATCH -mmotm 2/7] memcg: add interface to move charge at task migration Daisuke Nishimura
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  5:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage,
	Daisuke Nishimura, linux-mm

This patch adds cancel_attach() operation to struct cgroup_subsys.
cancel_attach() can be used when can_attach() operation prepares something
for the subsys, but we should rollback what can_attach() operation has prepared
if attach task fails after we've succeeded in can_attach().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>

Changelog: 2009/12/04
- update comments.
Changelog: 2009/11/19
- fix typo and coding style.
Changelog: 2009/09/24
- add explanation about cancel_attach() to Documentation/cgroup/cgroup.txt.
---
 Documentation/cgroups/cgroups.txt |   13 +++++++++++-
 include/linux/cgroup.h            |    2 +
 kernel/cgroup.c                   |   40 ++++++++++++++++++++++++++++++------
 3 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0b33bfe..d450826 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -536,10 +536,21 @@ returns an error, this will abort the attach operation.  If a NULL
 task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex. If threadgroup is
+remain valid while the caller holds cgroup_mutex and it is ensured that either
+attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
 thread's threadgroup can be moved together.
 
+void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+	       struct task_struct *task, bool threadgroup)
+(cgroup_mutex held by caller)
+
+Called when a task attach operation has failed after can_attach() has succeeded.
+A subsystem whose can_attach() has some side-effects should provide this
+function, so that the subsytem can implement a rollback. If not, not necessary.
+This will be called only about subsystems whose can_attach() operation have
+succeeded.
+
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	    struct cgroup *old_cgrp, struct task_struct *task,
 	    bool threadgroup)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0008dee..d4cc200 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -427,6 +427,8 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct task_struct *tsk, bool threadgroup);
+	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			  struct task_struct *tsk, bool threadgroup);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *tsk,
 			bool threadgroup);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0249f4b..d67d471 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1539,7 +1539,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	int retval = 0;
-	struct cgroup_subsys *ss;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
 	struct css_set *cg;
 	struct css_set *newcg;
@@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
 			retval = ss->can_attach(ss, cgrp, tsk, false);
-			if (retval)
-				return retval;
+			if (retval) {
+				/*
+				 * Remember on which subsystem the can_attach()
+				 * failed, so that we only call cancel_attach()
+				 * against the subsystems whose can_attach()
+				 * succeeded. (See below)
+				 */
+				failed_ss = ss;
+				goto out;
+			}
 		}
 	}
 
@@ -1568,14 +1576,17 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	 */
 	newcg = find_css_set(cg, cgrp);
 	put_css_set(cg);
-	if (!newcg)
-		return -ENOMEM;
+	if (!newcg) {
+		retval = -ENOMEM;
+		goto out;
+	}
 
 	task_lock(tsk);
 	if (tsk->flags & PF_EXITING) {
 		task_unlock(tsk);
 		put_css_set(newcg);
-		return -ESRCH;
+		retval = -ESRCH;
+		goto out;
 	}
 	rcu_assign_pointer(tsk->cgroups, newcg);
 	task_unlock(tsk);
@@ -1601,7 +1612,22 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	 * is no longer empty.
 	 */
 	cgroup_wakeup_rmdir_waiter(cgrp);
-	return 0;
+out:
+	if (retval) {
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				/*
+				 * This subsystem was the one that failed the
+				 * can_attach() check earlier, so we don't need
+				 * to call cancel_attach() against it or any
+				 * remaining subsystems.
+				 */
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, tsk, false);
+		}
+	}
+	return retval;
 }
 
 /*
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH -mmotm 2/7] memcg: add interface to move charge at task migration
  2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
  2009-12-04  5:47 ` [PATCH -mmotm 1/7] cgroup: introduce cancel_attach() Daisuke Nishimura
@ 2009-12-04  5:48 ` Daisuke Nishimura
  2009-12-04  6:55   ` KAMEZAWA Hiroyuki
  2009-12-04  5:49 ` [PATCH -mmotm 3/7] memcg: move charges of anonymous page Daisuke Nishimura
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  5:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage,
	Daisuke Nishimura, linux-mm

In current memcg, charges associated with a task aren't moved to the new cgroup
at task migration. Some users feel this behavior to be strange.
These patches are for this feature, that is, for charging to the new cgroup
and, of course, uncharging from the old cgroup at task migration.

This patch adds "memory.move_charge_at_immigrate" file, which is a flag file to
determine whether charges should be moved to the new cgroup at task migration or
not and what type of charges should be moved. This patch also adds read and
write handlers of the file.

This patch also adds no-op handlers for this feature. These handlers will be
implemented in later patches. And you cannot write any values other than 0
to move_charge_at_immigrate yet.

Changelog: 2009/12/04
- change the term "recharge" to "move_charge".
- update memory.txt.
Changelog: 2009/11/19
- consolidate changes in Documentation/cgroup/memory.txt, which were made in
  other patches separately.
- handle recharge_at_immigrate as bitmask(as I did in first version).
- use mm->owner instead of thread_group_leader().
Changelog: 2009/09/24
- change the term "migration" to "recharge".
- handle the flag as bool not bitmask to make codes simple.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 Documentation/cgroups/memory.txt |   48 ++++++++++++++++++-
 mm/memcontrol.c                  |   97 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 139 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..19b01f7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages).
 4.2 Task migration
 
 When a task migrates from one cgroup to another, it's charge is not
-carried forward. The pages allocated from the original cgroup still
+carried forward by default. The pages allocated from the original cgroup still
 remain charged to it, the charge is dropped when the page is freed or
 reclaimed.
 
+Note: You can move charges of a task along with task migration. See 8.
+
 4.3 Removing a cgroup
 
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
@@ -414,7 +416,49 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
 NOTE2: It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
 
-8. TODO
+8. Move charges at task migration
+
+Users can move charges associated with a task along with task migration, that
+is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
+
+8.1 Interface
+
+This feature is disabled by default. It can be enabled(and disabled again) by
+writing to memory.move_charge_at_immigrate of the destination cgroup.
+
+If you want to enable it:
+
+# echo (some positive value) > memory.move_charge_at_immigrate
+
+Note: Each bits of move_charge_at_immigrate has its own meaning about what type
+      of charges should be moved. See 8.2 for details.
+Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread
+      group.
+Note: If we cannot find enough space for the task in the destination cgroup, we
+      try to make space by reclaiming memory. Task migration may fail if we
+      cannot make enough space.
+Note: It can take several seconds if you move charges in giga bytes order.
+
+And if you want disable it again:
+
+# echo 0 > memory.move_charge_at_immigrate
+
+8.2 Type of charges which can be move
+
+Each bits of move_charge_at_immigrate has its own meaning about what type of
+charges should be moved.
+
+  bit | what type of charges would be moved ?
+ -----+------------------------------------------------------------------------
+   0  | A charge of an anonymous page(or swap of it) used by the target task.
+      | Those pages and swaps must be used only by the target task. You must
+      | enable Swap Extension(see 2.4) to enable move of swap charges.
+
+Note: Those pages and swaps must be charged to the old cgroup.
+Note: More type of pages(e.g. file cache, shmem,) will be supported by other
+      bits in future.
+
+9. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 951c103..2624d23 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -226,11 +226,26 @@ struct mem_cgroup {
 	bool		memsw_is_minimum;
 
 	/*
+	 * Should we move charges of a task when a task is moved into this
+	 * mem_cgroup ? And what type of charges should we move ?
+	 */
+	unsigned long 	move_charge_at_immigrate;
+
+	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
 };
 
+/* Stuffs for move charges at task migration. */
+/*
+ * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
+ * left-shifted bitmap of these types.
+ */
+enum move_type {
+	NR_MOVE_TYPE,
+};
+
 /*
  * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
  * limit reclaim to prevent infinite loops, if they ever occur.
@@ -2867,6 +2882,31 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 	return 0;
 }
 
+static u64 mem_cgroup_move_charge_read(struct cgroup *cgrp,
+					struct cftype *cft)
+{
+	return mem_cgroup_from_cont(cgrp)->move_charge_at_immigrate;
+}
+
+static int mem_cgroup_move_charge_write(struct cgroup *cgrp,
+					struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	if (val >= (1 << NR_MOVE_TYPE))
+		return -EINVAL;
+	/*
+	 * We check this value several times in both in can_attach() and
+	 * attach(), so we need cgroup lock to prevent this value from being
+	 * inconsistent.
+	 */
+	cgroup_lock();
+	mem->move_charge_at_immigrate = val;
+	cgroup_unlock();
+
+	return 0;
+}
+
 
 /* For read statistics */
 enum {
@@ -3100,6 +3140,11 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_swappiness_read,
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
+	{
+		.name = "move_charge_at_immigrate",
+		.read_u64 = mem_cgroup_move_charge_read,
+		.write_u64 = mem_cgroup_move_charge_write,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -3347,6 +3392,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (parent)
 		mem->swappiness = get_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
+	mem->move_charge_at_immigrate = 0;
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
@@ -3383,16 +3429,57 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 	return ret;
 }
 
+/* Handlers for move charge at task migration. */
+static int mem_cgroup_can_move_charge(void)
+{
+	return 0;
+}
+
+static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
+				struct cgroup *cgroup,
+				struct task_struct *p,
+				bool threadgroup)
+{
+	int ret = 0;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+
+	if (mem->move_charge_at_immigrate) {
+		struct mm_struct *mm;
+		struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+		VM_BUG_ON(from == mem);
+
+		mm = get_task_mm(p);
+		if (!mm)
+			return 0;
+
+		/* We move charges only when we move a owner of the mm */
+		if (mm->owner == p)
+			ret = mem_cgroup_can_move_charge();
+
+		mmput(mm);
+	}
+	return ret;
+}
+
+static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
+				struct cgroup *cgroup,
+				struct task_struct *p,
+				bool threadgroup)
+{
+}
+
+static void mem_cgroup_move_charge(void)
+{
+}
+
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	/*
-	 * FIXME: It's better to move charges of this process from old
-	 * memcg to new memcg. But it's just on TODO-List now.
-	 */
+	mem_cgroup_move_charge();
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
@@ -3402,6 +3489,8 @@ struct cgroup_subsys mem_cgroup_subsys = {
 	.pre_destroy = mem_cgroup_pre_destroy,
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
+	.can_attach = mem_cgroup_can_attach,
+	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
 	.use_id = 1,
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 2/7] memcg: add interface to move charge at task migration
  2009-12-04  5:48 ` [PATCH -mmotm 2/7] memcg: add interface to move charge at task migration Daisuke Nishimura
@ 2009-12-04  6:55   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  6:55 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm

On Fri, 4 Dec 2009 14:48:36 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> In current memcg, charges associated with a task aren't moved to the new cgroup
> at task migration. Some users feel this behavior to be strange.
> These patches are for this feature, that is, for charging to the new cgroup
> and, of course, uncharging from the old cgroup at task migration.
> 
> This patch adds "memory.move_charge_at_immigrate" file, which is a flag file to
> determine whether charges should be moved to the new cgroup at task migration or
> not and what type of charges should be moved. This patch also adds read and
> write handlers of the file.
> 
> This patch also adds no-op handlers for this feature. These handlers will be
> implemented in later patches. And you cannot write any values other than 0
> to move_charge_at_immigrate yet.
> 
> Changelog: 2009/12/04
> - change the term "recharge" to "move_charge".
> - update memory.txt.
> Changelog: 2009/11/19
> - consolidate changes in Documentation/cgroup/memory.txt, which were made in
>   other patches separately.
> - handle recharge_at_immigrate as bitmask(as I did in first version).
> - use mm->owner instead of thread_group_leader().
> Changelog: 2009/09/24
> - change the term "migration" to "recharge".
> - handle the flag as bool not bitmask to make codes simple.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH -mmotm 3/7] memcg: move charges of anonymous page
  2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
  2009-12-04  5:47 ` [PATCH -mmotm 1/7] cgroup: introduce cancel_attach() Daisuke Nishimura
  2009-12-04  5:48 ` [PATCH -mmotm 2/7] memcg: add interface to move charge at task migration Daisuke Nishimura
@ 2009-12-04  5:49 ` Daisuke Nishimura
  2009-12-04  5:50 ` [PATCH -mmotm 4/7] memcg: improbe performance in moving charge Daisuke Nishimura
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  5:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage,
	Daisuke Nishimura, linux-mm

This patch is the core part of this move-charge-at-task-migration feature.
It implements functions to move charges of anonymous pages mapped only by
the target task.

Implementation:
- define struct move_charge_struct and a valuable of it(mc) to remember the
  count of pre-charges and other information.
- At can_attach(), get anon_rss of the target mm, call __mem_cgroup_try_charge()
  repeatedly and count up mc.precharge.
- At attach(), parse the page table, find a target page to be move, and call
  mem_cgroup_move_account() about the page.
- Cancel all precharges if mc.precharge > 0 on failure or at the end of
  task move.

Changelog: 2009/12/04
- change the term "recharge" to "move_charge".
- handle a signal in can_attach() phase.
- parse the page table in can_attach() phase again(go back to the old behavior),
  because it doesn't add so big overheads, so it would be better to calculate
  the precharge count more accurately.
Changelog: 2009/11/19
- in can_attach(), instead of parsing the page table, make use of per process
  mm_counter(anon_rss).
- loosen the valid check in is_target_pte_for_recharge().
Changelog: 2009/11/06
- drop support for file cache, shmem/tmpfs and shared(used by multiple processes)
  pages(revisit in future).
Changelog: 2009/10/13
- change the term "migrate" to "recharge".
Changelog: 2009/09/24
- in can_attach(), parse the page table of the task and count only the number
  of target ptes and call try_charge() repeatedly. No isolation at this phase.
- in attach(), parse the page table of the task again, and isolate the target
  page and call move_account() one by one.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |  295 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 285 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2624d23..e38f211 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -21,6 +21,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
 #include <linux/pagemap.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
@@ -243,8 +244,17 @@ struct mem_cgroup {
  * left-shifted bitmap of these types.
  */
 enum move_type {
+	MOVE_CHARGE_TYPE_ANON,	/* private anonymous page and swap of it */
 	NR_MOVE_TYPE,
 };
+/* "mc" and its members are protected by cgroup_mutex */
+struct move_charge_struct {
+	struct mem_cgroup *from;
+	struct mem_cgroup *to;
+	unsigned long precharge;
+};
+static struct move_charge_struct mc;
+
 
 /*
  * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
@@ -1508,7 +1518,7 @@ charged:
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
 	 * if they exceeds softlimit.
 	 */
-	if (mem_cgroup_soft_limit_check(mem))
+	if (page && mem_cgroup_soft_limit_check(mem))
 		mem_cgroup_update_tree(mem, page);
 done:
 	return 0;
@@ -1688,8 +1698,9 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of
-	 * this function is just force_empty() and it's garanteed that
-	 * "to" is never removed. So, we don't check rmdir status here.
+	 * this function is just force_empty() and move charge, so it's
+	 * garanteed that "to" is never removed. So, we don't check rmdir
+	 * status here.
 	 */
 }
 
@@ -3430,11 +3441,171 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 }
 
 /* Handlers for move charge at task migration. */
-static int mem_cgroup_can_move_charge(void)
+static int mem_cgroup_do_precharge(void)
+{
+	int ret = -ENOMEM;
+	struct mem_cgroup *mem = mc.to;
+
+	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL);
+	if (ret || !mem)
+		return -ENOMEM;
+
+	mc.precharge++;
+	return ret;
+}
+
+/**
+ * is_target_pte_for_mc - check a pte whether it is valid for move charge
+ * @vma: the vma the pte to be checked belongs
+ * @addr: the address corresponding to the pte to be checked
+ * @ptent: the pte to be checked
+ * @target: the pointer the target page will be stored(can be NULL)
+ *
+ * Returns
+ *   0(MC_TARGET_NONE): if the pte is not a target for move charge.
+ *   1(MC_TARGET_PAGE): if the page corresponding to this pte is a target for
+ *     move charge. if @target is not NULL, the page is stored in target->page
+ *     with extra refcnt got(Callers should handle it).
+ *
+ * Called with pte lock held.
+ */
+/* We add a new member later. */
+union mc_target {
+	struct page	*page;
+};
+
+/* We add a new type later. */
+enum mc_target_type {
+	MC_TARGET_NONE,	/* not used */
+	MC_TARGET_PAGE,
+};
+
+static int is_target_pte_for_mc(struct vm_area_struct *vma,
+		unsigned long addr, pte_t ptent, union mc_target *target)
+{
+	struct page *page;
+	struct page_cgroup *pc;
+	int ret = 0;
+	bool move_anon = test_bit(MOVE_CHARGE_TYPE_ANON,
+					&mc.to->move_charge_at_immigrate);
+
+	if (!pte_present(ptent))
+		return 0;
+
+	page = vm_normal_page(vma, addr, ptent);
+	if (!page || !page_mapped(page))
+		return 0;
+	/*
+	 * TODO: We don't move charges of file(including shmem/tmpfs) pages for
+	 * now.
+	 */
+	if (!move_anon || !PageAnon(page))
+		return 0;
+	/*
+	 * TODO: We don't move charges of shared(used by multiple processes)
+	 * pages for now.
+	 */
+	if (page_mapcount(page) > 1)
+		return 0;
+	if (!get_page_unless_zero(page))
+		return 0;
+
+	pc = lookup_page_cgroup(page);
+	/*
+	 * Do only loose check w/o page_cgroup lock. mem_cgroup_move_account()
+	 * checks the pc is valid or not under the lock.
+	 */
+	if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+		ret = MC_TARGET_PAGE;
+		if (target)
+			target->page = page;
+	}
+
+	if (!ret || !target)
+		put_page(page);
+
+	return ret;
+}
+
+static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
+					unsigned long addr, unsigned long end,
+					struct mm_walk *walk)
 {
+	struct vm_area_struct *vma = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE)
+		if (is_target_pte_for_mc(vma, addr, *pte, NULL))
+			mc.precharge++;	/* increment precharge temporarily */
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
 	return 0;
 }
 
+static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
+{
+	unsigned long precharge;
+	struct vm_area_struct *vma;
+
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		struct mm_walk mem_cgroup_count_precharge_walk = {
+			.pmd_entry = mem_cgroup_count_precharge_pte_range,
+			.mm = mm,
+			.private = vma,
+		};
+		if (is_vm_hugetlb_page(vma))
+			continue;
+		/* TODO: We don't move charges of shmem/tmpfs pages for now. */
+		if (vma->vm_flags & VM_SHARED)
+			continue;
+		walk_page_range(vma->vm_start, vma->vm_end,
+					&mem_cgroup_count_precharge_walk);
+	}
+	up_read(&mm->mmap_sem);
+
+	precharge = mc.precharge;
+	mc.precharge = 0;
+
+	return precharge;
+}
+
+#define PRECHARGE_AT_ONCE	256
+static int mem_cgroup_precharge_mc(struct mm_struct *mm)
+{
+	int ret = 0;
+	int count = PRECHARGE_AT_ONCE;
+	unsigned long precharge = mem_cgroup_count_precharge(mm);
+
+	while (!ret && precharge--) {
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+		if (!count--) {
+			count = PRECHARGE_AT_ONCE;
+			cond_resched();
+		}
+		ret = mem_cgroup_do_precharge();
+	}
+
+	return ret;
+}
+
+static void mem_cgroup_clear_mc(void)
+{
+	/* we must uncharge all the leftover precharges from mc.to */
+	while (mc.precharge) {
+		mem_cgroup_cancel_charge(mc.to);
+		mc.precharge--;
+	}
+	mc.from = NULL;
+	mc.to = NULL;
+}
+
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
 				struct task_struct *p,
@@ -3452,11 +3623,19 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 		mm = get_task_mm(p);
 		if (!mm)
 			return 0;
-
 		/* We move charges only when we move a owner of the mm */
-		if (mm->owner == p)
-			ret = mem_cgroup_can_move_charge();
-
+		if (mm->owner == p) {
+			VM_BUG_ON(mc.from);
+			VM_BUG_ON(mc.to);
+			VM_BUG_ON(mc.precharge);
+			mc.from = from;
+			mc.to = mem;
+			mc.precharge = 0;
+
+			ret = mem_cgroup_precharge_mc(mm);
+			if (ret)
+				mem_cgroup_clear_mc();
+		}
 		mmput(mm);
 	}
 	return ret;
@@ -3467,10 +3646,95 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct task_struct *p,
 				bool threadgroup)
 {
+	mem_cgroup_clear_mc();
 }
 
-static void mem_cgroup_move_charge(void)
+static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
+				unsigned long addr, unsigned long end,
+				struct mm_walk *walk)
 {
+	int ret = 0;
+	struct vm_area_struct *vma = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+retry:
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; addr += PAGE_SIZE) {
+		pte_t ptent = *(pte++);
+		union mc_target target;
+		int type;
+		struct page *page;
+		struct page_cgroup *pc;
+
+		if (!mc.precharge)
+			break;
+
+		type = is_target_pte_for_mc(vma, addr, ptent, &target);
+		switch (type) {
+		case MC_TARGET_PAGE:
+			page = target.page;
+			if (isolate_lru_page(page))
+				goto put;
+			pc = lookup_page_cgroup(page);
+			if (!mem_cgroup_move_account(pc, mc.from, mc.to)) {
+				css_put(&mc.to->css);
+				mc.precharge--;
+			}
+			putback_lru_page(page);
+put:			/* is_target_pte_for_mc() gets the page */
+			put_page(page);
+			break;
+		default:
+			break;
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	if (addr != end) {
+		/*
+		 * We have consumed all precharges we got in can_attach().
+		 * We try charge one by one, but don't do any additional
+		 * charges to mc.to if we have failed in charge once in attach()
+		 * phase.
+		 */
+		ret = mem_cgroup_do_precharge();
+		if (!ret)
+			goto retry;
+	}
+
+	return ret;
+}
+
+static void mem_cgroup_move_charge(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	lru_add_drain_all();
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		int ret;
+		struct mm_walk mem_cgroup_move_charge_walk = {
+			.pmd_entry = mem_cgroup_move_charge_pte_range,
+			.mm = mm,
+			.private = vma,
+		};
+		if (is_vm_hugetlb_page(vma))
+			continue;
+		/* TODO: We don't move charges of shmem/tmpfs pages for now. */
+		if (vma->vm_flags & VM_SHARED)
+			continue;
+		ret = walk_page_range(vma->vm_start, vma->vm_end,
+						&mem_cgroup_move_charge_walk);
+		if (ret)
+			/*
+			 * means we have consumed all precharges and failed in
+			 * doing additional charge. Just abandon here.
+			 */
+			break;
+	}
+	up_read(&mm->mmap_sem);
 }
 
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
@@ -3479,7 +3743,18 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	mem_cgroup_move_charge();
+	struct mm_struct *mm;
+
+	if (!mc.to)
+		/* no need to move charge */
+		return;
+
+	mm = get_task_mm(p);
+	if (mm) {
+		mem_cgroup_move_charge(mm);
+		mmput(mm);
+	}
+	mem_cgroup_clear_mc();
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH -mmotm 4/7] memcg: improbe performance in moving charge
  2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
                   ` (2 preceding siblings ...)
  2009-12-04  5:49 ` [PATCH -mmotm 3/7] memcg: move charges of anonymous page Daisuke Nishimura
@ 2009-12-04  5:50 ` Daisuke Nishimura
  2009-12-04  7:10   ` KAMEZAWA Hiroyuki
  2009-12-04  5:51 ` [PATCH -mmotm 5/7] memcg: avoid oom during " Daisuke Nishimura
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  5:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage,
	Daisuke Nishimura, linux-mm

This patch tries to reduce overheads in moving charge by:

- Instead of calling res_counter_uncharge against the old cgroup in
  __mem_cgroup_move_account everytime, call res_counter_uncharge at the end of
  task migration once.
- Instead of calling res_counter_charge(via __mem_cgroup_try_charge) repeatedly,
  call res_counter_charge(PAGE_SIZE * count) in can_attach() if possible.
- Adds a new arg(count) to __css_put and make it decrement the css->refcnt
  by "count", not 1.
- Add a new function(__css_get), which takes "count" as a arg and increment
  the css->recnt by "count".
- Instead of calling css_get/css_put repeatedly, call new __css_get/__css_put
  if possible.
- removed css_get(&to->css) from __mem_cgroup_move_account(callers should have
  already called css_get), and removed css_put(&to->css) too, which is called by
  callers of move_account on success of move_account.

These changes reduces the overhead from 1.7sec to 0.6sec to move charges of 1G
anonymous memory in my test environment.

Changelog: 2009/12/04
- new patch

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/cgroup.h |   12 +++-
 kernel/cgroup.c        |    5 +-
 mm/memcontrol.c        |  151 +++++++++++++++++++++++++++++++-----------------
 3 files changed, 109 insertions(+), 59 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d4cc200..61f75ae 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -75,6 +75,12 @@ enum {
 	CSS_REMOVED, /* This CSS is dead */
 };
 
+/* Caller must verify that the css is not for root cgroup */
+static inline void __css_get(struct cgroup_subsys_state *css, int count)
+{
+	atomic_add(count, &css->refcnt);
+}
+
 /*
  * Call css_get() to hold a reference on the css; it can be used
  * for a reference obtained via:
@@ -86,7 +92,7 @@ static inline void css_get(struct cgroup_subsys_state *css)
 {
 	/* We don't need to reference count the root state */
 	if (!test_bit(CSS_ROOT, &css->flags))
-		atomic_inc(&css->refcnt);
+		__css_get(css, 1);
 }
 
 static inline bool css_is_removed(struct cgroup_subsys_state *css)
@@ -117,11 +123,11 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
  * css_get() or css_tryget()
  */
 
-extern void __css_put(struct cgroup_subsys_state *css);
+extern void __css_put(struct cgroup_subsys_state *css, int count);
 static inline void css_put(struct cgroup_subsys_state *css)
 {
 	if (!test_bit(CSS_ROOT, &css->flags))
-		__css_put(css);
+		__css_put(css, 1);
 }
 
 /* bits in struct cgroup flags field */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d67d471..44f5924 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3729,12 +3729,13 @@ static void check_for_release(struct cgroup *cgrp)
 	}
 }
 
-void __css_put(struct cgroup_subsys_state *css)
+/*  Caller must verify that the css is not for root cgroup */
+void __css_put(struct cgroup_subsys_state *css, int count)
 {
 	struct cgroup *cgrp = css->cgroup;
 	int val;
 	rcu_read_lock();
-	val = atomic_dec_return(&css->refcnt);
+	val = atomic_sub_return(count, &css->refcnt);
 	if (val == 1) {
 		if (notify_on_release(cgrp)) {
 			set_bit(CGRP_RELEASABLE, &cgrp->flags);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e38f211..769b85a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -252,6 +252,7 @@ struct move_charge_struct {
 	struct mem_cgroup *from;
 	struct mem_cgroup *to;
 	unsigned long precharge;
+	unsigned long moved_charge;
 };
 static struct move_charge_struct mc;
 
@@ -1532,14 +1533,23 @@ nomem:
  * This function is for that and do uncharge, put css's refcnt.
  * gotten by try_charge().
  */
-static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+static void __mem_cgroup_cancel_charge(struct mem_cgroup *mem,
+							unsigned long count)
 {
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		res_counter_uncharge(&mem->res, PAGE_SIZE * count);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE * count);
+		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
+		WARN_ON_ONCE(count > INT_MAX);
+		__css_put(&mem->css, (int)count);
 	}
-	css_put(&mem->css);
+	/* we don't need css_put for root */
+}
+
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+{
+	__mem_cgroup_cancel_charge(mem, 1);
 }
 
 /*
@@ -1645,17 +1655,20 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
  * @pc:	page_cgroup of the page.
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
+ * @uncharge: whether we should call uncharge and css_put against @from.
  *
  * The caller must confirm following.
  * - page is not on LRU (isolate_page() is useful.)
  * - the pc is locked, used, and ->mem_cgroup points to @from.
  *
- * This function does "uncharge" from old cgroup but doesn't do "charge" to
- * new cgroup. It should be done by a caller.
+ * This function doesn't do "charge" nor css_get to new cgroup. It should be
+ * done by a caller(__mem_cgroup_try_charge would be usefull). If @uncharge is
+ * true, this function does "uncharge" from old cgroup, but it doesn't if
+ * @uncharge is false, so a caller should do "uncharge".
  */
 
 static void __mem_cgroup_move_account(struct page_cgroup *pc,
-	struct mem_cgroup *from, struct mem_cgroup *to)
+	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	struct page *page;
 	int cpu;
@@ -1668,10 +1681,6 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
-	if (!mem_cgroup_is_root(from))
-		res_counter_uncharge(&from->res, PAGE_SIZE);
-	mem_cgroup_charge_statistics(from, pc, false);
-
 	page = pc->page;
 	if (page_mapped(page) && !PageAnon(page)) {
 		cpu = smp_processor_id();
@@ -1687,12 +1696,12 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_FILE_MAPPED,
 						1);
 	}
+	mem_cgroup_charge_statistics(from, pc, false);
+	if (uncharge)
+		/* This is not "cancel", but cancel_charge does all we need. */
+		mem_cgroup_cancel_charge(from);
 
-	if (do_swap_account && !mem_cgroup_is_root(from))
-		res_counter_uncharge(&from->memsw, PAGE_SIZE);
-	css_put(&from->css);
-
-	css_get(&to->css);
+	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
 	/*
@@ -1709,12 +1718,12 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
  * __mem_cgroup_move_account()
  */
 static int mem_cgroup_move_account(struct page_cgroup *pc,
-				struct mem_cgroup *from, struct mem_cgroup *to)
+		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
-		__mem_cgroup_move_account(pc, from, to);
+		__mem_cgroup_move_account(pc, from, to, uncharge);
 		ret = 0;
 	}
 	unlock_page_cgroup(pc);
@@ -1750,11 +1759,9 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
 	if (ret || !parent)
 		goto put_back;
 
-	ret = mem_cgroup_move_account(pc, child, parent);
-	if (!ret)
-		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
-	else
-		mem_cgroup_cancel_charge(parent);	/* does css_put */
+	ret = mem_cgroup_move_account(pc, child, parent, true);
+	if (ret)
+		mem_cgroup_cancel_charge(parent);
 put_back:
 	putback_lru_page(page);
 put:
@@ -3441,16 +3448,57 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 }
 
 /* Handlers for move charge at task migration. */
-static int mem_cgroup_do_precharge(void)
+#define PRECHARGE_COUNT_AT_ONCE	256
+static int mem_cgroup_do_precharge(unsigned long count)
 {
-	int ret = -ENOMEM;
+	int ret = 0;
+	int batch_count = PRECHARGE_COUNT_AT_ONCE;
 	struct mem_cgroup *mem = mc.to;
 
-	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL);
-	if (ret || !mem)
-		return -ENOMEM;
-
-	mc.precharge++;
+	if (mem_cgroup_is_root(mem)) {
+		mc.precharge += count;
+		/* we don't need css_get for root */
+		return ret;
+	}
+	/* try to charge at once */
+	if (count > 1) {
+		struct res_counter *dummy;
+		/*
+		 * "mem" cannot be under rmdir() because we've already checked
+		 * by cgroup_lock_live_cgroup() that it is not removed and we
+		 * are still under the same cgroup_mutex. So we can postpone
+		 * css_get().
+		 */
+		if (res_counter_charge(&mem->res, PAGE_SIZE * count, &dummy))
+			goto one_by_one;
+		if (res_counter_charge(&mem->memsw,
+						PAGE_SIZE * count, &dummy)) {
+			res_counter_uncharge(&mem->res, PAGE_SIZE * count);
+			goto one_by_one;
+		}
+		mc.precharge += count;
+		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
+		WARN_ON_ONCE(count > INT_MAX);
+		__css_get(&mem->css, (int)count);
+		return ret;
+	}
+one_by_one:
+	/* fall back to one by one charge */
+	while (!ret && count--) {
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+		if (!batch_count--) {
+			batch_count = PRECHARGE_COUNT_AT_ONCE;
+			cond_resched();
+		}
+		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem,
+								false, NULL);
+		if (ret || !mem)
+			return -ENOMEM;
+		mc.precharge++;
+	}
 	return ret;
 }
 
@@ -3573,34 +3621,25 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 	return precharge;
 }
 
-#define PRECHARGE_AT_ONCE	256
 static int mem_cgroup_precharge_mc(struct mm_struct *mm)
 {
-	int ret = 0;
-	int count = PRECHARGE_AT_ONCE;
-	unsigned long precharge = mem_cgroup_count_precharge(mm);
-
-	while (!ret && precharge--) {
-		if (signal_pending(current)) {
-			ret = -EINTR;
-			break;
-		}
-		if (!count--) {
-			count = PRECHARGE_AT_ONCE;
-			cond_resched();
-		}
-		ret = mem_cgroup_do_precharge();
-	}
-
-	return ret;
+	return mem_cgroup_do_precharge(mem_cgroup_count_precharge(mm));
 }
 
 static void mem_cgroup_clear_mc(void)
 {
 	/* we must uncharge all the leftover precharges from mc.to */
-	while (mc.precharge) {
-		mem_cgroup_cancel_charge(mc.to);
-		mc.precharge--;
+	if (mc.precharge) {
+		__mem_cgroup_cancel_charge(mc.to, mc.precharge);
+		mc.precharge = 0;
+	}
+	/*
+	 * we didn't uncharge from mc.from at mem_cgroup_move_account(), so
+	 * we must uncharge here.
+	 */
+	if (mc.moved_charge) {
+		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
+		mc.moved_charge = 0;
 	}
 	mc.from = NULL;
 	mc.to = NULL;
@@ -3628,9 +3667,11 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 			VM_BUG_ON(mc.from);
 			VM_BUG_ON(mc.to);
 			VM_BUG_ON(mc.precharge);
+			VM_BUG_ON(mc.moved_charge);
 			mc.from = from;
 			mc.to = mem;
 			mc.precharge = 0;
+			mc.moved_charge = 0;
 
 			ret = mem_cgroup_precharge_mc(mm);
 			if (ret)
@@ -3677,9 +3718,11 @@ retry:
 			if (isolate_lru_page(page))
 				goto put;
 			pc = lookup_page_cgroup(page);
-			if (!mem_cgroup_move_account(pc, mc.from, mc.to)) {
-				css_put(&mc.to->css);
+			if (!mem_cgroup_move_account(pc,
+						mc.from, mc.to, false)) {
 				mc.precharge--;
+				/* we uncharge from mc.from later. */
+				mc.moved_charge++;
 			}
 			putback_lru_page(page);
 put:			/* is_target_pte_for_mc() gets the page */
@@ -3699,7 +3742,7 @@ put:			/* is_target_pte_for_mc() gets the page */
 		 * charges to mc.to if we have failed in charge once in attach()
 		 * phase.
 		 */
-		ret = mem_cgroup_do_precharge();
+		ret = mem_cgroup_do_precharge(1);
 		if (!ret)
 			goto retry;
 	}
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 4/7] memcg: improbe performance in moving charge
  2009-12-04  5:50 ` [PATCH -mmotm 4/7] memcg: improbe performance in moving charge Daisuke Nishimura
@ 2009-12-04  7:10   ` KAMEZAWA Hiroyuki
  2009-12-04  7:29     ` Daisuke Nishimura
  0 siblings, 1 reply; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  7:10 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm

On Fri, 4 Dec 2009 14:50:49 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> This patch tries to reduce overheads in moving charge by:
> 
> - Instead of calling res_counter_uncharge against the old cgroup in
>   __mem_cgroup_move_account everytime, call res_counter_uncharge at the end of
>   task migration once.
> - Instead of calling res_counter_charge(via __mem_cgroup_try_charge) repeatedly,
>   call res_counter_charge(PAGE_SIZE * count) in can_attach() if possible.
> - Adds a new arg(count) to __css_put and make it decrement the css->refcnt
>   by "count", not 1.
> - Add a new function(__css_get), which takes "count" as a arg and increment
>   the css->recnt by "count".
> - Instead of calling css_get/css_put repeatedly, call new __css_get/__css_put
>   if possible.
> - removed css_get(&to->css) from __mem_cgroup_move_account(callers should have
>   already called css_get), and removed css_put(&to->css) too, which is called by
>   callers of move_account on success of move_account.
> 
> These changes reduces the overhead from 1.7sec to 0.6sec to move charges of 1G
> anonymous memory in my test environment.
> 
> Changelog: 2009/12/04
> - new patch
> 
seems nice in general.


> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/cgroup.h |   12 +++-
>  kernel/cgroup.c        |    5 +-
>  mm/memcontrol.c        |  151 +++++++++++++++++++++++++++++++-----------------
>  3 files changed, 109 insertions(+), 59 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index d4cc200..61f75ae 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -75,6 +75,12 @@ enum {
>  	CSS_REMOVED, /* This CSS is dead */
>  };
>  
> +/* Caller must verify that the css is not for root cgroup */
> +static inline void __css_get(struct cgroup_subsys_state *css, int count)
> +{
> +	atomic_add(count, &css->refcnt);
> +}
> +
>  /*
>   * Call css_get() to hold a reference on the css; it can be used
>   * for a reference obtained via:
> @@ -86,7 +92,7 @@ static inline void css_get(struct cgroup_subsys_state *css)
>  {
>  	/* We don't need to reference count the root state */
>  	if (!test_bit(CSS_ROOT, &css->flags))
> -		atomic_inc(&css->refcnt);
> +		__css_get(css, 1);
>  }
>  
>  static inline bool css_is_removed(struct cgroup_subsys_state *css)
> @@ -117,11 +123,11 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
>   * css_get() or css_tryget()
>   */
>  
> -extern void __css_put(struct cgroup_subsys_state *css);
> +extern void __css_put(struct cgroup_subsys_state *css, int count);
>  static inline void css_put(struct cgroup_subsys_state *css)
>  {
>  	if (!test_bit(CSS_ROOT, &css->flags))
> -		__css_put(css);
> +		__css_put(css, 1);
>  }
> 

Maybe it's better to divide cgroup part in other patches. Li or Paul has to review.

 
>  /* bits in struct cgroup flags field */
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index d67d471..44f5924 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -3729,12 +3729,13 @@ static void check_for_release(struct cgroup *cgrp)
>  	}
>  }
>  
> -void __css_put(struct cgroup_subsys_state *css)
> +/*  Caller must verify that the css is not for root cgroup */
> +void __css_put(struct cgroup_subsys_state *css, int count)
>  {
>  	struct cgroup *cgrp = css->cgroup;
>  	int val;
>  	rcu_read_lock();
> -	val = atomic_dec_return(&css->refcnt);
> +	val = atomic_sub_return(count, &css->refcnt);
>  	if (val == 1) {
>  		if (notify_on_release(cgrp)) {
>  			set_bit(CGRP_RELEASABLE, &cgrp->flags);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e38f211..769b85a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -252,6 +252,7 @@ struct move_charge_struct {
>  	struct mem_cgroup *from;
>  	struct mem_cgroup *to;
>  	unsigned long precharge;
> +	unsigned long moved_charge;
>  };
>  static struct move_charge_struct mc;
>  
> @@ -1532,14 +1533,23 @@ nomem:
>   * This function is for that and do uncharge, put css's refcnt.
>   * gotten by try_charge().
>   */
> -static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> +static void __mem_cgroup_cancel_charge(struct mem_cgroup *mem,
> +							unsigned long count)
>  {
>  	if (!mem_cgroup_is_root(mem)) {
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> +		res_counter_uncharge(&mem->res, PAGE_SIZE * count);
>  		if (do_swap_account)
> -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +			res_counter_uncharge(&mem->memsw, PAGE_SIZE * count);
> +		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
> +		WARN_ON_ONCE(count > INT_MAX);

Hmm. is this WARN_ON necessary ? ...maybe res_counter_uncharge() will catch
this, anyway.

> +		__css_put(&mem->css, (int)count);
>  	}
> -	css_put(&mem->css);
> +	/* we don't need css_put for root */
> +}
> +
> +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> +{
> +	__mem_cgroup_cancel_charge(mem, 1);
>  }
>  
>  /*
> @@ -1645,17 +1655,20 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>   * @pc:	page_cgroup of the page.
>   * @from: mem_cgroup which the page is moved from.
>   * @to:	mem_cgroup which the page is moved to. @from != @to.
> + * @uncharge: whether we should call uncharge and css_put against @from.
>   *
>   * The caller must confirm following.
>   * - page is not on LRU (isolate_page() is useful.)
>   * - the pc is locked, used, and ->mem_cgroup points to @from.
>   *
> - * This function does "uncharge" from old cgroup but doesn't do "charge" to
> - * new cgroup. It should be done by a caller.
> + * This function doesn't do "charge" nor css_get to new cgroup. It should be
> + * done by a caller(__mem_cgroup_try_charge would be usefull). If @uncharge is
> + * true, this function does "uncharge" from old cgroup, but it doesn't if
> + * @uncharge is false, so a caller should do "uncharge".
>   */
>  
>  static void __mem_cgroup_move_account(struct page_cgroup *pc,
> -	struct mem_cgroup *from, struct mem_cgroup *to)
> +	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
>  	struct page *page;
>  	int cpu;
> @@ -1668,10 +1681,6 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> -	if (!mem_cgroup_is_root(from))
> -		res_counter_uncharge(&from->res, PAGE_SIZE);
> -	mem_cgroup_charge_statistics(from, pc, false);
> -
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		cpu = smp_processor_id();
> @@ -1687,12 +1696,12 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_FILE_MAPPED,
>  						1);
>  	}
> +	mem_cgroup_charge_statistics(from, pc, false);
> +	if (uncharge)
> +		/* This is not "cancel", but cancel_charge does all we need. */
> +		mem_cgroup_cancel_charge(from);
>  
> -	if (do_swap_account && !mem_cgroup_is_root(from))
> -		res_counter_uncharge(&from->memsw, PAGE_SIZE);
> -	css_put(&from->css);
> -
> -	css_get(&to->css);
> +	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
>  	/*
> @@ -1709,12 +1718,12 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>   * __mem_cgroup_move_account()
>   */
>  static int mem_cgroup_move_account(struct page_cgroup *pc,
> -				struct mem_cgroup *from, struct mem_cgroup *to)
> +		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
>  	int ret = -EINVAL;
>  	lock_page_cgroup(pc);
>  	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> -		__mem_cgroup_move_account(pc, from, to);
> +		__mem_cgroup_move_account(pc, from, to, uncharge);
>  		ret = 0;
>  	}
>  	unlock_page_cgroup(pc);
> @@ -1750,11 +1759,9 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
>  	if (ret || !parent)
>  		goto put_back;
>  
> -	ret = mem_cgroup_move_account(pc, child, parent);
> -	if (!ret)
> -		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
> -	else
> -		mem_cgroup_cancel_charge(parent);	/* does css_put */
> +	ret = mem_cgroup_move_account(pc, child, parent, true);
> +	if (ret)
> +		mem_cgroup_cancel_charge(parent);
>  put_back:
>  	putback_lru_page(page);
>  put:
> @@ -3441,16 +3448,57 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
>  }
>  
>  /* Handlers for move charge at task migration. */
> -static int mem_cgroup_do_precharge(void)
> +#define PRECHARGE_COUNT_AT_ONCE	256
> +static int mem_cgroup_do_precharge(unsigned long count)
>  {
> -	int ret = -ENOMEM;
> +	int ret = 0;
> +	int batch_count = PRECHARGE_COUNT_AT_ONCE;
>  	struct mem_cgroup *mem = mc.to;
>  
> -	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL);
> -	if (ret || !mem)
> -		return -ENOMEM;
> -
> -	mc.precharge++;
> +	if (mem_cgroup_is_root(mem)) {
> +		mc.precharge += count;
> +		/* we don't need css_get for root */
> +		return ret;
> +	}
> +	/* try to charge at once */
> +	if (count > 1) {
> +		struct res_counter *dummy;
> +		/*
> +		 * "mem" cannot be under rmdir() because we've already checked
> +		 * by cgroup_lock_live_cgroup() that it is not removed and we
> +		 * are still under the same cgroup_mutex. So we can postpone
> +		 * css_get().
> +		 */
> +		if (res_counter_charge(&mem->res, PAGE_SIZE * count, &dummy))
> +			goto one_by_one;

 if (do_swap_account) here.

> +		if (res_counter_charge(&mem->memsw,
> +						PAGE_SIZE * count, &dummy)) {
> +			res_counter_uncharge(&mem->res, PAGE_SIZE * count);
> +			goto one_by_one;
> +		}
> +		mc.precharge += count;
> +		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
> +		WARN_ON_ONCE(count > INT_MAX);
> +		__css_get(&mem->css, (int)count);
> +		return ret;
> +	}
> +one_by_one:
> +	/* fall back to one by one charge */
> +	while (!ret && count--) {

!ret check seems unnecessary.
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +		if (!batch_count--) {
> +			batch_count = PRECHARGE_COUNT_AT_ONCE;
> +			cond_resched();
> +		}
> +		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem,
> +								false, NULL);
> +		if (ret || !mem)
> +			return -ENOMEM;

returning without uncharge here ?

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 4/7] memcg: improbe performance in moving charge
  2009-12-04  7:10   ` KAMEZAWA Hiroyuki
@ 2009-12-04  7:29     ` Daisuke Nishimura
  0 siblings, 0 replies; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  7:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm,
	Daisuke Nishimura

On Fri, 4 Dec 2009 16:10:04 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 4 Dec 2009 14:50:49 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > This patch tries to reduce overheads in moving charge by:
> > 
> > - Instead of calling res_counter_uncharge against the old cgroup in
> >   __mem_cgroup_move_account everytime, call res_counter_uncharge at the end of
> >   task migration once.
> > - Instead of calling res_counter_charge(via __mem_cgroup_try_charge) repeatedly,
> >   call res_counter_charge(PAGE_SIZE * count) in can_attach() if possible.
> > - Adds a new arg(count) to __css_put and make it decrement the css->refcnt
> >   by "count", not 1.
> > - Add a new function(__css_get), which takes "count" as a arg and increment
> >   the css->recnt by "count".
> > - Instead of calling css_get/css_put repeatedly, call new __css_get/__css_put
> >   if possible.
> > - removed css_get(&to->css) from __mem_cgroup_move_account(callers should have
> >   already called css_get), and removed css_put(&to->css) too, which is called by
> >   callers of move_account on success of move_account.
> > 
> > These changes reduces the overhead from 1.7sec to 0.6sec to move charges of 1G
> > anonymous memory in my test environment.
> > 
> > Changelog: 2009/12/04
> > - new patch
> > 
> seems nice in general.
>
Thank you.
 
> 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  include/linux/cgroup.h |   12 +++-
> >  kernel/cgroup.c        |    5 +-
> >  mm/memcontrol.c        |  151 +++++++++++++++++++++++++++++++-----------------
> >  3 files changed, 109 insertions(+), 59 deletions(-)
> > 
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index d4cc200..61f75ae 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -75,6 +75,12 @@ enum {
> >  	CSS_REMOVED, /* This CSS is dead */
> >  };
> >  
> > +/* Caller must verify that the css is not for root cgroup */
> > +static inline void __css_get(struct cgroup_subsys_state *css, int count)
> > +{
> > +	atomic_add(count, &css->refcnt);
> > +}
> > +
> >  /*
> >   * Call css_get() to hold a reference on the css; it can be used
> >   * for a reference obtained via:
> > @@ -86,7 +92,7 @@ static inline void css_get(struct cgroup_subsys_state *css)
> >  {
> >  	/* We don't need to reference count the root state */
> >  	if (!test_bit(CSS_ROOT, &css->flags))
> > -		atomic_inc(&css->refcnt);
> > +		__css_get(css, 1);
> >  }
> >  
> >  static inline bool css_is_removed(struct cgroup_subsys_state *css)
> > @@ -117,11 +123,11 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
> >   * css_get() or css_tryget()
> >   */
> >  
> > -extern void __css_put(struct cgroup_subsys_state *css);
> > +extern void __css_put(struct cgroup_subsys_state *css, int count);
> >  static inline void css_put(struct cgroup_subsys_state *css)
> >  {
> >  	if (!test_bit(CSS_ROOT, &css->flags))
> > -		__css_put(css);
> > +		__css_put(css, 1);
> >  }
> > 
> 
> Maybe it's better to divide cgroup part in other patches. Li or Paul has to review.
> 
>  
I agree.

> >  /* bits in struct cgroup flags field */
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index d67d471..44f5924 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -3729,12 +3729,13 @@ static void check_for_release(struct cgroup *cgrp)
> >  	}
> >  }
> >  
> > -void __css_put(struct cgroup_subsys_state *css)
> > +/*  Caller must verify that the css is not for root cgroup */
> > +void __css_put(struct cgroup_subsys_state *css, int count)
> >  {
> >  	struct cgroup *cgrp = css->cgroup;
> >  	int val;
> >  	rcu_read_lock();
> > -	val = atomic_dec_return(&css->refcnt);
> > +	val = atomic_sub_return(count, &css->refcnt);
> >  	if (val == 1) {
> >  		if (notify_on_release(cgrp)) {
> >  			set_bit(CGRP_RELEASABLE, &cgrp->flags);
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e38f211..769b85a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -252,6 +252,7 @@ struct move_charge_struct {
> >  	struct mem_cgroup *from;
> >  	struct mem_cgroup *to;
> >  	unsigned long precharge;
> > +	unsigned long moved_charge;
> >  };
> >  static struct move_charge_struct mc;
> >  
> > @@ -1532,14 +1533,23 @@ nomem:
> >   * This function is for that and do uncharge, put css's refcnt.
> >   * gotten by try_charge().
> >   */
> > -static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> > +static void __mem_cgroup_cancel_charge(struct mem_cgroup *mem,
> > +							unsigned long count)
> >  {
> >  	if (!mem_cgroup_is_root(mem)) {
> > -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +		res_counter_uncharge(&mem->res, PAGE_SIZE * count);
> >  		if (do_swap_account)
> > -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > +			res_counter_uncharge(&mem->memsw, PAGE_SIZE * count);
> > +		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
> > +		WARN_ON_ONCE(count > INT_MAX);
> 
> Hmm. is this WARN_ON necessary ? ...maybe res_counter_uncharge() will catch
> this, anyway.
> 
The arg of atomic_add/dec is "int", so IMHO, it would be better to check here
(I think we neve hit this in real use).

> > +		__css_put(&mem->css, (int)count);
> >  	}
> > -	css_put(&mem->css);
> > +	/* we don't need css_put for root */
> > +}
> > +
> > +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> > +{
> > +	__mem_cgroup_cancel_charge(mem, 1);
> >  }
> >  
> >  /*
> > @@ -1645,17 +1655,20 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >   * @pc:	page_cgroup of the page.
> >   * @from: mem_cgroup which the page is moved from.
> >   * @to:	mem_cgroup which the page is moved to. @from != @to.
> > + * @uncharge: whether we should call uncharge and css_put against @from.
> >   *
> >   * The caller must confirm following.
> >   * - page is not on LRU (isolate_page() is useful.)
> >   * - the pc is locked, used, and ->mem_cgroup points to @from.
> >   *
> > - * This function does "uncharge" from old cgroup but doesn't do "charge" to
> > - * new cgroup. It should be done by a caller.
> > + * This function doesn't do "charge" nor css_get to new cgroup. It should be
> > + * done by a caller(__mem_cgroup_try_charge would be usefull). If @uncharge is
> > + * true, this function does "uncharge" from old cgroup, but it doesn't if
> > + * @uncharge is false, so a caller should do "uncharge".
> >   */
> >  
> >  static void __mem_cgroup_move_account(struct page_cgroup *pc,
> > -	struct mem_cgroup *from, struct mem_cgroup *to)
> > +	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> >  {
> >  	struct page *page;
> >  	int cpu;
> > @@ -1668,10 +1681,6 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	VM_BUG_ON(!PageCgroupUsed(pc));
> >  	VM_BUG_ON(pc->mem_cgroup != from);
> >  
> > -	if (!mem_cgroup_is_root(from))
> > -		res_counter_uncharge(&from->res, PAGE_SIZE);
> > -	mem_cgroup_charge_statistics(from, pc, false);
> > -
> >  	page = pc->page;
> >  	if (page_mapped(page) && !PageAnon(page)) {
> >  		cpu = smp_processor_id();
> > @@ -1687,12 +1696,12 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_FILE_MAPPED,
> >  						1);
> >  	}
> > +	mem_cgroup_charge_statistics(from, pc, false);
> > +	if (uncharge)
> > +		/* This is not "cancel", but cancel_charge does all we need. */
> > +		mem_cgroup_cancel_charge(from);
> >  
> > -	if (do_swap_account && !mem_cgroup_is_root(from))
> > -		res_counter_uncharge(&from->memsw, PAGE_SIZE);
> > -	css_put(&from->css);
> > -
> > -	css_get(&to->css);
> > +	/* caller should have done css_get */
> >  	pc->mem_cgroup = to;
> >  	mem_cgroup_charge_statistics(to, pc, true);
> >  	/*
> > @@ -1709,12 +1718,12 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >   * __mem_cgroup_move_account()
> >   */
> >  static int mem_cgroup_move_account(struct page_cgroup *pc,
> > -				struct mem_cgroup *from, struct mem_cgroup *to)
> > +		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> >  {
> >  	int ret = -EINVAL;
> >  	lock_page_cgroup(pc);
> >  	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> > -		__mem_cgroup_move_account(pc, from, to);
> > +		__mem_cgroup_move_account(pc, from, to, uncharge);
> >  		ret = 0;
> >  	}
> >  	unlock_page_cgroup(pc);
> > @@ -1750,11 +1759,9 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
> >  	if (ret || !parent)
> >  		goto put_back;
> >  
> > -	ret = mem_cgroup_move_account(pc, child, parent);
> > -	if (!ret)
> > -		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
> > -	else
> > -		mem_cgroup_cancel_charge(parent);	/* does css_put */
> > +	ret = mem_cgroup_move_account(pc, child, parent, true);
> > +	if (ret)
> > +		mem_cgroup_cancel_charge(parent);
> >  put_back:
> >  	putback_lru_page(page);
> >  put:
> > @@ -3441,16 +3448,57 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
> >  }
> >  
> >  /* Handlers for move charge at task migration. */
> > -static int mem_cgroup_do_precharge(void)
> > +#define PRECHARGE_COUNT_AT_ONCE	256
> > +static int mem_cgroup_do_precharge(unsigned long count)
> >  {
> > -	int ret = -ENOMEM;
> > +	int ret = 0;
> > +	int batch_count = PRECHARGE_COUNT_AT_ONCE;
> >  	struct mem_cgroup *mem = mc.to;
> >  
> > -	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL);
> > -	if (ret || !mem)
> > -		return -ENOMEM;
> > -
> > -	mc.precharge++;
> > +	if (mem_cgroup_is_root(mem)) {
> > +		mc.precharge += count;
> > +		/* we don't need css_get for root */
> > +		return ret;
> > +	}
> > +	/* try to charge at once */
> > +	if (count > 1) {
> > +		struct res_counter *dummy;
> > +		/*
> > +		 * "mem" cannot be under rmdir() because we've already checked
> > +		 * by cgroup_lock_live_cgroup() that it is not removed and we
> > +		 * are still under the same cgroup_mutex. So we can postpone
> > +		 * css_get().
> > +		 */
> > +		if (res_counter_charge(&mem->res, PAGE_SIZE * count, &dummy))
> > +			goto one_by_one;
> 
>  if (do_swap_account) here.
> 
Ah, yes. you're right.

> > +		if (res_counter_charge(&mem->memsw,
> > +						PAGE_SIZE * count, &dummy)) {
> > +			res_counter_uncharge(&mem->res, PAGE_SIZE * count);
> > +			goto one_by_one;
> > +		}
> > +		mc.precharge += count;
> > +		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
> > +		WARN_ON_ONCE(count > INT_MAX);
> > +		__css_get(&mem->css, (int)count);
> > +		return ret;
> > +	}
> > +one_by_one:
> > +	/* fall back to one by one charge */
> > +	while (!ret && count--) {
> 
> !ret check seems unnecessary.
yes.. will remove.

> > +		if (signal_pending(current)) {
> > +			ret = -EINTR;
> > +			break;
> > +		}
> > +		if (!batch_count--) {
> > +			batch_count = PRECHARGE_COUNT_AT_ONCE;
> > +			cond_resched();
> > +		}
> > +		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem,
> > +								false, NULL);
> > +		if (ret || !mem)
> > +			return -ENOMEM;
> 
> returning without uncharge here ?
> 
If we return here with -ENOMEM, mem_cgroup_can_attach calls mem_cgroup_clear_mc,
which will do uncharge. I will add some comments.

Thank you for your careful review.


Regards,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH -mmotm 5/7] memcg: avoid oom during moving charge
  2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
                   ` (3 preceding siblings ...)
  2009-12-04  5:50 ` [PATCH -mmotm 4/7] memcg: improbe performance in moving charge Daisuke Nishimura
@ 2009-12-04  5:51 ` Daisuke Nishimura
  2009-12-04  7:14   ` KAMEZAWA Hiroyuki
  2009-12-04  5:52 ` [PATCH -mmotm 6/7] memcg: move charges of anonymous swap Daisuke Nishimura
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  5:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage,
	Daisuke Nishimura, linux-mm

This move-charge-at-task-migration feature has extra charges on "to"(pre-charges)
and "from"(leftover charges) during moving charge. This means unnecessary oom
can happen.

This patch tries to avoid such oom.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Changelog: 2009/12/04
- take account of "from" too, because we uncharge from "from" at once in
  mem_cgroup_clear_mc(), so leftover charges exist during moving charge.
- check use_hierarchy of "mem_over_limit", instead of "to" or "from"(bugfix).
---
 mm/memcontrol.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 769b85a..f50ad15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -253,6 +253,7 @@ struct move_charge_struct {
 	struct mem_cgroup *to;
 	unsigned long precharge;
 	unsigned long moved_charge;
+	struct task_struct *moving_task;	/* a task moving charges */
 };
 static struct move_charge_struct mc;
 
@@ -1504,6 +1505,40 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (mem_cgroup_check_under_limit(mem_over_limit))
 			continue;
 
+		/* try to avoid oom while someone is moving charge */
+		if (mc.moving_task && current != mc.moving_task) {
+			struct mem_cgroup *from, *to;
+			bool do_continue = false;
+			/*
+			 * There is a small race that "from" or "to" can be
+			 * freed by rmdir, so we use css_tryget().
+			 */
+			rcu_read_lock();
+			from = mc.from;
+			to = mc.to;
+			if (from && css_tryget(&from->css)) {
+				if (mem_over_limit->use_hierarchy)
+					do_continue = css_is_ancestor(
+							&from->css,
+							&mem_over_limit->css);
+				else
+					do_continue = (from == mem_over_limit);
+				css_put(&from->css);
+			}
+			if (!do_continue && to && css_tryget(&to->css)) {
+				if (mem_over_limit->use_hierarchy)
+					do_continue = css_is_ancestor(
+							&to->css,
+							&mem_over_limit->css);
+				else
+					do_continue = (to == mem_over_limit);
+				css_put(&to->css);
+			}
+			rcu_read_unlock();
+			if (do_continue)
+				continue;
+		}
+
 		if (!nr_retries--) {
 			if (oom) {
 				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
@@ -3643,6 +3678,7 @@ static void mem_cgroup_clear_mc(void)
 	}
 	mc.from = NULL;
 	mc.to = NULL;
+	mc.moving_task = NULL;
 }
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
@@ -3668,10 +3704,12 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 			VM_BUG_ON(mc.to);
 			VM_BUG_ON(mc.precharge);
 			VM_BUG_ON(mc.moved_charge);
+			VM_BUG_ON(mc.moving_task);
 			mc.from = from;
 			mc.to = mem;
 			mc.precharge = 0;
 			mc.moved_charge = 0;
+			mc.moving_task = current;
 
 			ret = mem_cgroup_precharge_mc(mm);
 			if (ret)
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 5/7] memcg: avoid oom during moving charge
  2009-12-04  5:51 ` [PATCH -mmotm 5/7] memcg: avoid oom during " Daisuke Nishimura
@ 2009-12-04  7:14   ` KAMEZAWA Hiroyuki
  2009-12-04  7:43     ` Daisuke Nishimura
  0 siblings, 1 reply; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  7:14 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm

On Fri, 4 Dec 2009 14:51:54 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> This move-charge-at-task-migration feature has extra charges on "to"(pre-charges)
> and "from"(leftover charges) during moving charge. This means unnecessary oom
> can happen.
> 
> This patch tries to avoid such oom.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> Changelog: 2009/12/04
> - take account of "from" too, because we uncharge from "from" at once in
>   mem_cgroup_clear_mc(), so leftover charges exist during moving charge.
> - check use_hierarchy of "mem_over_limit", instead of "to" or "from"(bugfix).
> ---
>  mm/memcontrol.c |   38 ++++++++++++++++++++++++++++++++++++++
>  1 files changed, 38 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 769b85a..f50ad15 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -253,6 +253,7 @@ struct move_charge_struct {
>  	struct mem_cgroup *to;
>  	unsigned long precharge;
>  	unsigned long moved_charge;
> +	struct task_struct *moving_task;	/* a task moving charges */
>  };
>  static struct move_charge_struct mc;
>  
> @@ -1504,6 +1505,40 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		if (mem_cgroup_check_under_limit(mem_over_limit))
>  			continue;
>  
> +		/* try to avoid oom while someone is moving charge */
> +		if (mc.moving_task && current != mc.moving_task) {
> +			struct mem_cgroup *from, *to;
> +			bool do_continue = false;
> +			/*
> +			 * There is a small race that "from" or "to" can be
> +			 * freed by rmdir, so we use css_tryget().
> +			 */
> +			rcu_read_lock();
> +			from = mc.from;
> +			to = mc.to;
> +			if (from && css_tryget(&from->css)) {
> +				if (mem_over_limit->use_hierarchy)
> +					do_continue = css_is_ancestor(
> +							&from->css,
> +							&mem_over_limit->css);
> +				else
> +					do_continue = (from == mem_over_limit);
> +				css_put(&from->css);
> +			}
> +			if (!do_continue && to && css_tryget(&to->css)) {
> +				if (mem_over_limit->use_hierarchy)
> +					do_continue = css_is_ancestor(
> +							&to->css,
> +							&mem_over_limit->css);
> +				else
> +					do_continue = (to == mem_over_limit);
> +				css_put(&to->css);
> +			}
> +			rcu_read_unlock();
> +			if (do_continue)
> +				continue;

Hmm. do countine without any relaxing ? can't this occupy cpu ?
Can't we add schedule() or some and put into sleep ?

Maybe the best way is enqueue this thread to mc.wait_queue and wait for
the end of task moving.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 5/7] memcg: avoid oom during moving charge
  2009-12-04  7:14   ` KAMEZAWA Hiroyuki
@ 2009-12-04  7:43     ` Daisuke Nishimura
  0 siblings, 0 replies; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  7:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm,
	Daisuke Nishimura

On Fri, 4 Dec 2009 16:14:39 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 4 Dec 2009 14:51:54 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > This move-charge-at-task-migration feature has extra charges on "to"(pre-charges)
> > and "from"(leftover charges) during moving charge. This means unnecessary oom
> > can happen.
> > 
> > This patch tries to avoid such oom.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > 
> > Changelog: 2009/12/04
> > - take account of "from" too, because we uncharge from "from" at once in
> >   mem_cgroup_clear_mc(), so leftover charges exist during moving charge.
> > - check use_hierarchy of "mem_over_limit", instead of "to" or "from"(bugfix).
> > ---
> >  mm/memcontrol.c |   38 ++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 38 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 769b85a..f50ad15 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -253,6 +253,7 @@ struct move_charge_struct {
> >  	struct mem_cgroup *to;
> >  	unsigned long precharge;
> >  	unsigned long moved_charge;
> > +	struct task_struct *moving_task;	/* a task moving charges */
> >  };
> >  static struct move_charge_struct mc;
> >  
> > @@ -1504,6 +1505,40 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  		if (mem_cgroup_check_under_limit(mem_over_limit))
> >  			continue;
> >  
> > +		/* try to avoid oom while someone is moving charge */
> > +		if (mc.moving_task && current != mc.moving_task) {
> > +			struct mem_cgroup *from, *to;
> > +			bool do_continue = false;
> > +			/*
> > +			 * There is a small race that "from" or "to" can be
> > +			 * freed by rmdir, so we use css_tryget().
> > +			 */
> > +			rcu_read_lock();
> > +			from = mc.from;
> > +			to = mc.to;
> > +			if (from && css_tryget(&from->css)) {
> > +				if (mem_over_limit->use_hierarchy)
> > +					do_continue = css_is_ancestor(
> > +							&from->css,
> > +							&mem_over_limit->css);
> > +				else
> > +					do_continue = (from == mem_over_limit);
> > +				css_put(&from->css);
> > +			}
> > +			if (!do_continue && to && css_tryget(&to->css)) {
> > +				if (mem_over_limit->use_hierarchy)
> > +					do_continue = css_is_ancestor(
> > +							&to->css,
> > +							&mem_over_limit->css);
> > +				else
> > +					do_continue = (to == mem_over_limit);
> > +				css_put(&to->css);
> > +			}
> > +			rcu_read_unlock();
> > +			if (do_continue)
> > +				continue;
> 
> Hmm. do countine without any relaxing ? can't this occupy cpu ?
> Can't we add schedule() or some and put into sleep ?
> 
> Maybe the best way is enqueue this thread to mc.wait_queue and wait for
> the end of task moving.
> 
Good idea. I'll try it.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH -mmotm 6/7] memcg: move charges of anonymous swap
  2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
                   ` (4 preceding siblings ...)
  2009-12-04  5:51 ` [PATCH -mmotm 5/7] memcg: avoid oom during " Daisuke Nishimura
@ 2009-12-04  5:52 ` Daisuke Nishimura
  2009-12-04  7:32   ` KAMEZAWA Hiroyuki
  2009-12-04  5:54 ` [PATCH -mmotm 7/7] memcg: improbe performance in moving swap charge Daisuke Nishimura
  2009-12-04  6:53 ` [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) KAMEZAWA Hiroyuki
  7 siblings, 1 reply; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage,
	Daisuke Nishimura, linux-mm

This patch is another core part of this move-charge-at-task-migration feature.
It enables moving charges of anonymous swaps.

To move the charge of swap, we need to exchange swap_cgroup's record.

In current implementation, swap_cgroup's record is protected by:

  - page lock: if the entry is on swap cache.
  - swap_lock: if the entry is not on swap cache.

This works well in usual swap-in/out activity.

But this behavior make the feature of moving swap charge check many conditions
to exchange swap_cgroup's record safely.

So I changed modification of swap_cgroup's recored(swap_cgroup_record())
to use xchg, and define a new function to cmpxchg swap_cgroup's record.

This patch also enables moving charge of non pte_present but not uncharged swap
caches, which can be exist on swap-out path, by getting the target pages via
find_get_page() as do_mincore() does.

Changelog: 2009/12/04
- minor changes in comments and valuable names.
Changelog: 2009/11/19
- in can_attach(), instead of parsing the page table, make use of per process
  mm_counter(swap_usage).
Changelog: 2009/11/06
- drop support for shmem's swap(revisit in future).
- add mem_cgroup_count_swap_user() to prevent moving charges of swaps used by
  multiple processes(revisit in future).
Changelog: 2009/09/24
- do no swap-in in moving swap account any more.
- add support for shmem's swap.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |    2 +
 include/linux/swap.h        |    1 +
 mm/memcontrol.c             |  154 +++++++++++++++++++++++++++++++++----------
 mm/page_cgroup.c            |   35 +++++++++-
 mm/swapfile.c               |   31 +++++++++
 5 files changed, 185 insertions(+), 38 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index b0e4eb1..30b0813 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -118,6 +118,8 @@ static inline void __init page_cgroup_init_flatmem(void)
 #include <linux/swap.h>
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
+					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9f0ca32..2a3209e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -355,6 +355,7 @@ static inline void disable_swap_token(void)
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
+extern int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep);
 #else
 static inline void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f50ad15..6b3d17f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -33,6 +33,7 @@
 #include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
@@ -2258,6 +2259,53 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 	}
 	rcu_read_unlock();
 }
+
+/**
+ * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
+ * @entry: swap entry to be moved
+ * @from:  mem_cgroup which the entry is moved from
+ * @to:  mem_cgroup which the entry is moved to
+ *
+ * It succeeds only when the swap_cgroup's record for this entry is the same
+ * as the mem_cgroup's id of @from.
+ *
+ * Returns 0 on success, -EINVAL on failure.
+ *
+ * The caller must have charged to @to, IOW, called res_counter_charge() about
+ * both res and memsw, and called css_get().
+ */
+static int mem_cgroup_move_swap_account(swp_entry_t entry,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	unsigned short old_id, new_id;
+
+	old_id = css_id(&from->css);
+	new_id = css_id(&to->css);
+
+	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
+		if (!mem_cgroup_is_root(from))
+			res_counter_uncharge(&from->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(from, false);
+		mem_cgroup_put(from);
+		/*
+		 * we charged both to->res and to->memsw, so we should uncharge
+		 * to->res.
+		 */
+		if (!mem_cgroup_is_root(to))
+			res_counter_uncharge(&to->res, PAGE_SIZE);
+		mem_cgroup_swap_statistics(to, true);
+		mem_cgroup_get(to);
+
+		return 0;
+	}
+	return -EINVAL;
+}
+#else
+static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	return -EINVAL;
+}
 #endif
 
 /*
@@ -3542,71 +3590,96 @@ one_by_one:
  * @vma: the vma the pte to be checked belongs
  * @addr: the address corresponding to the pte to be checked
  * @ptent: the pte to be checked
- * @target: the pointer the target page will be stored(can be NULL)
+ * @target: the pointer the target page or swap ent will be stored(can be NULL)
  *
  * Returns
  *   0(MC_TARGET_NONE): if the pte is not a target for move charge.
  *   1(MC_TARGET_PAGE): if the page corresponding to this pte is a target for
  *     move charge. if @target is not NULL, the page is stored in target->page
  *     with extra refcnt got(Callers should handle it).
+ *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
+ *     target for charge migration. if @target is not NULL, the entry is stored
+ *     in target->ent.
  *
  * Called with pte lock held.
  */
-/* We add a new member later. */
 union mc_target {
 	struct page	*page;
+	swp_entry_t	ent;
 };
 
-/* We add a new type later. */
 enum mc_target_type {
 	MC_TARGET_NONE,	/* not used */
 	MC_TARGET_PAGE,
+	MC_TARGET_SWAP,
 };
 
 static int is_target_pte_for_mc(struct vm_area_struct *vma,
 		unsigned long addr, pte_t ptent, union mc_target *target)
 {
-	struct page *page;
+	struct page *page = NULL;
 	struct page_cgroup *pc;
 	int ret = 0;
+	swp_entry_t ent = { .val = 0 };
+	int usage_count = 0;
 	bool move_anon = test_bit(MOVE_CHARGE_TYPE_ANON,
 					&mc.to->move_charge_at_immigrate);
 
-	if (!pte_present(ptent))
-		return 0;
-
-	page = vm_normal_page(vma, addr, ptent);
-	if (!page || !page_mapped(page))
-		return 0;
-	/*
-	 * TODO: We don't move charges of file(including shmem/tmpfs) pages for
-	 * now.
-	 */
-	if (!move_anon || !PageAnon(page))
-		return 0;
-	/*
-	 * TODO: We don't move charges of shared(used by multiple processes)
-	 * pages for now.
-	 */
-	if (page_mapcount(page) > 1)
-		return 0;
-	if (!get_page_unless_zero(page))
+	if (!pte_present(ptent)) {
+		/* TODO: handle swap of shmes/tmpfs */
+		if (pte_none(ptent) || pte_file(ptent))
+			return 0;
+		else if (is_swap_pte(ptent)) {
+			ent = pte_to_swp_entry(ptent);
+			if (!move_anon || non_swap_entry(ent))
+				return 0;
+			usage_count = mem_cgroup_count_swap_user(ent, &page);
+		}
+	} else {
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page || !page_mapped(page))
+			return 0;
+		/*
+		 * TODO: We don't move charges of file(including shmem/tmpfs)
+		 * pages for now.
+		 */
+		if (!move_anon || !PageAnon(page))
+			return 0;
+		if (!get_page_unless_zero(page))
+			return 0;
+		usage_count = page_mapcount(page);
+	}
+	if (usage_count > 1) {
+		/*
+		 * TODO: We don't move charges of shared(used by multiple
+		 * processes) pages for now.
+		 */
+		if (page)
+			put_page(page);
 		return 0;
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Do only loose check w/o page_cgroup lock. mem_cgroup_move_account()
-	 * checks the pc is valid or not under the lock.
-	 */
-	if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
-		ret = MC_TARGET_PAGE;
+	}
+	if (page) {
+		pc = lookup_page_cgroup(page);
+		/*
+		 * Do only loose check w/o page_cgroup lock.
+		 * mem_cgroup_move_account() checks the pc is valid or not under
+		 * the lock.
+		 */
+		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+			ret = MC_TARGET_PAGE;
+			if (target)
+				target->page = page;
+		}
+		if (!ret || !target)
+			put_page(page);
+	}
+	/* throught */
+	if (ent.val && do_swap_account && !ret &&
+			css_id(&mc.from->css) == lookup_swap_cgroup(ent)) {
+		ret = MC_TARGET_SWAP;
 		if (target)
-			target->page = page;
+			target->ent = ent;
 	}
-
-	if (!ret || !target)
-		put_page(page);
-
 	return ret;
 }
 
@@ -3745,6 +3818,7 @@ retry:
 		int type;
 		struct page *page;
 		struct page_cgroup *pc;
+		swp_entry_t ent;
 
 		if (!mc.precharge)
 			break;
@@ -3766,6 +3840,14 @@ retry:
 put:			/* is_target_pte_for_mc() gets the page */
 			put_page(page);
 			break;
+		case MC_TARGET_SWAP:
+			ent = target.ent;
+			if (!mem_cgroup_move_swap_account(ent,
+						mc.from, mc.to)) {
+				css_put(&mc.to->css);
+				mc.precharge--;
+			}
+			break;
 		default:
 			break;
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 3d535d5..213b0ee 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <asm/cmpxchg.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -335,6 +336,37 @@ not_enough_page:
 }
 
 /**
+ * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
+ * @end: swap entry to be cmpxchged
+ * @old: old id
+ * @new: new id
+ *
+ * Returns old id at success, 0 at failure.
+ * (There is no mem_cgroup useing 0 as its id)
+ */
+unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
+					unsigned short old, unsigned short new)
+{
+	int type = swp_type(ent);
+	unsigned long offset = swp_offset(ent);
+	unsigned long idx = offset / SC_PER_PAGE;
+	unsigned long pos = offset & SC_POS_MASK;
+	struct swap_cgroup_ctrl *ctrl;
+	struct page *mappage;
+	struct swap_cgroup *sc;
+
+	ctrl = &swap_cgroup_ctrl[type];
+
+	mappage = ctrl->map[idx];
+	sc = page_address(mappage);
+	sc += pos;
+	if (cmpxchg(&sc->id, old, new) == old)
+		return old;
+	else
+		return 0;
+}
+
+/**
  * swap_cgroup_record - record mem_cgroup for this swp_entry.
  * @ent: swap entry to be recorded into
  * @mem: mem_cgroup to be recorded
@@ -358,8 +390,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
 	mappage = ctrl->map[idx];
 	sc = page_address(mappage);
 	sc += pos;
-	old = sc->id;
-	sc->id = id;
+	old = xchg(&sc->id, id);
 
 	return old;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 67f808b..09ab458 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -722,6 +722,37 @@ int free_swap_and_cache(swp_entry_t entry)
 	return p != NULL;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/**
+ * mem_cgroup_count_swap_user - count the user of a swap entry
+ * @ent: the swap entry to be checked
+ * @pagep: the pointer for the swap cache page of the entry to be stored
+ *
+ * Returns the number of the user of the swap entry. The number is valid only
+ * for swaps of anonymous pages.
+ * If the entry is found on swap cache, the page is stored to pagep with
+ * refcount of it being incremented.
+ */
+int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep)
+{
+	struct page *page;
+	struct swap_info_struct *p;
+	int count = 0;
+
+	page = find_get_page(&swapper_space, ent.val);
+	if (page)
+		count += page_mapcount(page);
+	p = swap_info_get(ent);
+	if (p) {
+		count += swap_count(p->swap_map[swp_offset(ent)]);
+		spin_unlock(&swap_lock);
+	}
+
+	*pagep = page;
+	return count;
+}
+#endif
+
 #ifdef CONFIG_HIBERNATION
 /*
  * Find the swap type that corresponds to given device (if any).
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 6/7] memcg: move charges of anonymous swap
  2009-12-04  5:52 ` [PATCH -mmotm 6/7] memcg: move charges of anonymous swap Daisuke Nishimura
@ 2009-12-04  7:32   ` KAMEZAWA Hiroyuki
  2009-12-04 10:53     ` Daisuke Nishimura
  0 siblings, 1 reply; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  7:32 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm

On Fri, 4 Dec 2009 14:52:55 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> This patch is another core part of this move-charge-at-task-migration feature.
> It enables moving charges of anonymous swaps.
> 
> To move the charge of swap, we need to exchange swap_cgroup's record.
> 
> In current implementation, swap_cgroup's record is protected by:
> 
>   - page lock: if the entry is on swap cache.
>   - swap_lock: if the entry is not on swap cache.
> 
> This works well in usual swap-in/out activity.
> 
> But this behavior make the feature of moving swap charge check many conditions
> to exchange swap_cgroup's record safely.
> 
> So I changed modification of swap_cgroup's recored(swap_cgroup_record())
> to use xchg, and define a new function to cmpxchg swap_cgroup's record.
> 
> This patch also enables moving charge of non pte_present but not uncharged swap
> caches, which can be exist on swap-out path, by getting the target pages via
> find_get_page() as do_mincore() does.
> 
> Changelog: 2009/12/04
> - minor changes in comments and valuable names.
> Changelog: 2009/11/19
> - in can_attach(), instead of parsing the page table, make use of per process
>   mm_counter(swap_usage).
> Changelog: 2009/11/06
> - drop support for shmem's swap(revisit in future).
> - add mem_cgroup_count_swap_user() to prevent moving charges of swaps used by
>   multiple processes(revisit in future).
> Changelog: 2009/09/24
> - do no swap-in in moving swap account any more.
> - add support for shmem's swap.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/page_cgroup.h |    2 +
>  include/linux/swap.h        |    1 +
>  mm/memcontrol.c             |  154 +++++++++++++++++++++++++++++++++----------
>  mm/page_cgroup.c            |   35 +++++++++-
>  mm/swapfile.c               |   31 +++++++++
>  5 files changed, 185 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index b0e4eb1..30b0813 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -118,6 +118,8 @@ static inline void __init page_cgroup_init_flatmem(void)
>  #include <linux/swap.h>
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
> +					unsigned short old, unsigned short new);
>  extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
>  extern unsigned short lookup_swap_cgroup(swp_entry_t ent);
>  extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 9f0ca32..2a3209e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -355,6 +355,7 @@ static inline void disable_swap_token(void)
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  extern void
>  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
> +extern int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep);
>  #else
>  static inline void
>  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f50ad15..6b3d17f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -33,6 +33,7 @@
>  #include <linux/rbtree.h>
>  #include <linux/slab.h>
>  #include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include <linux/spinlock.h>
>  #include <linux/fs.h>
>  #include <linux/seq_file.h>
> @@ -2258,6 +2259,53 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
>  	}
>  	rcu_read_unlock();
>  }
> +
> +/**
> + * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
> + * @entry: swap entry to be moved
> + * @from:  mem_cgroup which the entry is moved from
> + * @to:  mem_cgroup which the entry is moved to
> + *
> + * It succeeds only when the swap_cgroup's record for this entry is the same
> + * as the mem_cgroup's id of @from.
> + *
> + * Returns 0 on success, -EINVAL on failure.
> + *
> + * The caller must have charged to @to, IOW, called res_counter_charge() about
> + * both res and memsw, and called css_get().
> + */
> +static int mem_cgroup_move_swap_account(swp_entry_t entry,
> +				struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	unsigned short old_id, new_id;
> +
> +	old_id = css_id(&from->css);
> +	new_id = css_id(&to->css);
> +
> +	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> +		if (!mem_cgroup_is_root(from))
> +			res_counter_uncharge(&from->memsw, PAGE_SIZE);
> +		mem_cgroup_swap_statistics(from, false);
> +		mem_cgroup_put(from);
> +		/*
> +		 * we charged both to->res and to->memsw, so we should uncharge
> +		 * to->res.
> +		 */
> +		if (!mem_cgroup_is_root(to))
> +			res_counter_uncharge(&to->res, PAGE_SIZE);
> +		mem_cgroup_swap_statistics(to, true);
> +		mem_cgroup_get(to);
> +
> +		return 0;
> +	}
> +	return -EINVAL;
> +}

Hmm. Aren't there race with swapin ?

    Thread A
   ----------
   do_swap_page()
   mem_cgroup_try_charge_swapin() <<== charge "memory" against old memcg.
   page table lock
   mem_cgroup_commit_charge_swapin()
         lookup memcg from swap_cgroup() <<=== finds new memcg, moved one.
                res_counter_uncharge(&new_memcg->memsw,...)
                 
Then, Thread A does
   old_memcg->res +1 
   new_memcg->memsw -1

move_swap_account() does
   old_memcg->memsw - 1
   new_memcg->res - 1
   new_memcg->memsw + 1

Hmm. old_memcg->res doesn't leak ? I think some check in commit_charge()
for this new race is necessary.



> +#else
> +static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
> +				struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	return -EINVAL;
> +}
>  #endif
>  
>  /*
> @@ -3542,71 +3590,96 @@ one_by_one:
>   * @vma: the vma the pte to be checked belongs
>   * @addr: the address corresponding to the pte to be checked
>   * @ptent: the pte to be checked
> - * @target: the pointer the target page will be stored(can be NULL)
> + * @target: the pointer the target page or swap ent will be stored(can be NULL)
>   *
>   * Returns
>   *   0(MC_TARGET_NONE): if the pte is not a target for move charge.
>   *   1(MC_TARGET_PAGE): if the page corresponding to this pte is a target for
>   *     move charge. if @target is not NULL, the page is stored in target->page
>   *     with extra refcnt got(Callers should handle it).
> + *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
> + *     target for charge migration. if @target is not NULL, the entry is stored
> + *     in target->ent.
>   *
>   * Called with pte lock held.
>   */
> -/* We add a new member later. */
>  union mc_target {
>  	struct page	*page;
> +	swp_entry_t	ent;
>  };
>  
> -/* We add a new type later. */
>  enum mc_target_type {
>  	MC_TARGET_NONE,	/* not used */
>  	MC_TARGET_PAGE,
> +	MC_TARGET_SWAP,
>  };
>  
>  static int is_target_pte_for_mc(struct vm_area_struct *vma,
>  		unsigned long addr, pte_t ptent, union mc_target *target)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  	struct page_cgroup *pc;
>  	int ret = 0;
> +	swp_entry_t ent = { .val = 0 };
> +	int usage_count = 0;
>  	bool move_anon = test_bit(MOVE_CHARGE_TYPE_ANON,
>  					&mc.to->move_charge_at_immigrate);
>  
> -	if (!pte_present(ptent))
> -		return 0;
> -
> -	page = vm_normal_page(vma, addr, ptent);
> -	if (!page || !page_mapped(page))
> -		return 0;
> -	/*
> -	 * TODO: We don't move charges of file(including shmem/tmpfs) pages for
> -	 * now.
> -	 */
> -	if (!move_anon || !PageAnon(page))
> -		return 0;
> -	/*
> -	 * TODO: We don't move charges of shared(used by multiple processes)
> -	 * pages for now.
> -	 */
> -	if (page_mapcount(page) > 1)
> -		return 0;
> -	if (!get_page_unless_zero(page))
> +	if (!pte_present(ptent)) {
> +		/* TODO: handle swap of shmes/tmpfs */
> +		if (pte_none(ptent) || pte_file(ptent))
> +			return 0;
> +		else if (is_swap_pte(ptent)) {
> +			ent = pte_to_swp_entry(ptent);
> +			if (!move_anon || non_swap_entry(ent))
> +				return 0;
> +			usage_count = mem_cgroup_count_swap_user(ent, &page);
> +		}
> +	} else {
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page || !page_mapped(page))
> +			return 0;
> +		/*
> +		 * TODO: We don't move charges of file(including shmem/tmpfs)
> +		 * pages for now.
> +		 */
> +		if (!move_anon || !PageAnon(page))
> +			return 0;
> +		if (!get_page_unless_zero(page))
> +			return 0;
> +		usage_count = page_mapcount(page);
> +	}
> +	if (usage_count > 1) {
> +		/*
> +		 * TODO: We don't move charges of shared(used by multiple
> +		 * processes) pages for now.
> +		 */
> +		if (page)
> +			put_page(page);
>  		return 0;
> -
> -	pc = lookup_page_cgroup(page);
> -	/*
> -	 * Do only loose check w/o page_cgroup lock. mem_cgroup_move_account()
> -	 * checks the pc is valid or not under the lock.
> -	 */
> -	if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> -		ret = MC_TARGET_PAGE;
> +	}
> +	if (page) {
> +		pc = lookup_page_cgroup(page);
> +		/*
> +		 * Do only loose check w/o page_cgroup lock.
> +		 * mem_cgroup_move_account() checks the pc is valid or not under
> +		 * the lock.
> +		 */
> +		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> +			ret = MC_TARGET_PAGE;
> +			if (target)
> +				target->page = page;
> +		}
> +		if (!ret || !target)
> +			put_page(page);
> +	}
> +	/* throught */
> +	if (ent.val && do_swap_account && !ret &&
> +			css_id(&mc.from->css) == lookup_swap_cgroup(ent)) {
> +		ret = MC_TARGET_SWAP;
>  		if (target)
> -			target->page = page;
> +			target->ent = ent;
>  	}
> -
> -	if (!ret || !target)
> -		put_page(page);
> -
>  	return ret;
>  }
>  
> @@ -3745,6 +3818,7 @@ retry:
>  		int type;
>  		struct page *page;
>  		struct page_cgroup *pc;
> +		swp_entry_t ent;
>  
>  		if (!mc.precharge)
>  			break;
> @@ -3766,6 +3840,14 @@ retry:
>  put:			/* is_target_pte_for_mc() gets the page */
>  			put_page(page);
>  			break;
> +		case MC_TARGET_SWAP:
> +			ent = target.ent;
> +			if (!mem_cgroup_move_swap_account(ent,
> +						mc.from, mc.to)) {
> +				css_put(&mc.to->css);
> +				mc.precharge--;
> +			}
> +			break;
>  		default:
>  			break;
>  		}

complicated than expected. But ok for the first version.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 6/7] memcg: move charges of anonymous swap
  2009-12-04  7:32   ` KAMEZAWA Hiroyuki
@ 2009-12-04 10:53     ` Daisuke Nishimura
  0 siblings, 0 replies; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04 10:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm,
	Daisuke Nishimura

On Fri, 4 Dec 2009 16:32:37 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri, 4 Dec 2009 14:52:55 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > This patch is another core part of this move-charge-at-task-migration feature.
> > It enables moving charges of anonymous swaps.
> > 
> > To move the charge of swap, we need to exchange swap_cgroup's record.
> > 
> > In current implementation, swap_cgroup's record is protected by:
> > 
> >   - page lock: if the entry is on swap cache.
> >   - swap_lock: if the entry is not on swap cache.
> > 
> > This works well in usual swap-in/out activity.
> > 
> > But this behavior make the feature of moving swap charge check many conditions
> > to exchange swap_cgroup's record safely.
> > 
> > So I changed modification of swap_cgroup's recored(swap_cgroup_record())
> > to use xchg, and define a new function to cmpxchg swap_cgroup's record.
> > 
> > This patch also enables moving charge of non pte_present but not uncharged swap
> > caches, which can be exist on swap-out path, by getting the target pages via
> > find_get_page() as do_mincore() does.
> > 
> > Changelog: 2009/12/04
> > - minor changes in comments and valuable names.
> > Changelog: 2009/11/19
> > - in can_attach(), instead of parsing the page table, make use of per process
> >   mm_counter(swap_usage).
> > Changelog: 2009/11/06
> > - drop support for shmem's swap(revisit in future).
> > - add mem_cgroup_count_swap_user() to prevent moving charges of swaps used by
> >   multiple processes(revisit in future).
> > Changelog: 2009/09/24
> > - do no swap-in in moving swap account any more.
> > - add support for shmem's swap.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  include/linux/page_cgroup.h |    2 +
> >  include/linux/swap.h        |    1 +
> >  mm/memcontrol.c             |  154 +++++++++++++++++++++++++++++++++----------
> >  mm/page_cgroup.c            |   35 +++++++++-
> >  mm/swapfile.c               |   31 +++++++++
> >  5 files changed, 185 insertions(+), 38 deletions(-)
> > 
> > diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> > index b0e4eb1..30b0813 100644
> > --- a/include/linux/page_cgroup.h
> > +++ b/include/linux/page_cgroup.h
> > @@ -118,6 +118,8 @@ static inline void __init page_cgroup_init_flatmem(void)
> >  #include <linux/swap.h>
> >  
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
> > +					unsigned short old, unsigned short new);
> >  extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
> >  extern unsigned short lookup_swap_cgroup(swp_entry_t ent);
> >  extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 9f0ca32..2a3209e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -355,6 +355,7 @@ static inline void disable_swap_token(void)
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  extern void
> >  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
> > +extern int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep);
> >  #else
> >  static inline void
> >  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f50ad15..6b3d17f 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/rbtree.h>
> >  #include <linux/slab.h>
> >  #include <linux/swap.h>
> > +#include <linux/swapops.h>
> >  #include <linux/spinlock.h>
> >  #include <linux/fs.h>
> >  #include <linux/seq_file.h>
> > @@ -2258,6 +2259,53 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
> >  	}
> >  	rcu_read_unlock();
> >  }
> > +
> > +/**
> > + * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
> > + * @entry: swap entry to be moved
> > + * @from:  mem_cgroup which the entry is moved from
> > + * @to:  mem_cgroup which the entry is moved to
> > + *
> > + * It succeeds only when the swap_cgroup's record for this entry is the same
> > + * as the mem_cgroup's id of @from.
> > + *
> > + * Returns 0 on success, -EINVAL on failure.
> > + *
> > + * The caller must have charged to @to, IOW, called res_counter_charge() about
> > + * both res and memsw, and called css_get().
> > + */
> > +static int mem_cgroup_move_swap_account(swp_entry_t entry,
> > +				struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> > +	unsigned short old_id, new_id;
> > +
> > +	old_id = css_id(&from->css);
> > +	new_id = css_id(&to->css);
> > +
> > +	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> > +		if (!mem_cgroup_is_root(from))
> > +			res_counter_uncharge(&from->memsw, PAGE_SIZE);
> > +		mem_cgroup_swap_statistics(from, false);
> > +		mem_cgroup_put(from);
> > +		/*
> > +		 * we charged both to->res and to->memsw, so we should uncharge
> > +		 * to->res.
> > +		 */
> > +		if (!mem_cgroup_is_root(to))
> > +			res_counter_uncharge(&to->res, PAGE_SIZE);
> > +		mem_cgroup_swap_statistics(to, true);
> > +		mem_cgroup_get(to);
> > +
> > +		return 0;
> > +	}
> > +	return -EINVAL;
> > +}
> 
> Hmm. Aren't there race with swapin ?
> 
>     Thread A
>    ----------
>    do_swap_page()
>    mem_cgroup_try_charge_swapin() <<== charge "memory" against old memcg.
>    page table lock
>    mem_cgroup_commit_charge_swapin()
>          lookup memcg from swap_cgroup() <<=== finds new memcg, moved one.
>                 res_counter_uncharge(&new_memcg->memsw,...)
>                  
> Then, Thread A does
>    old_memcg->res +1 
>    new_memcg->memsw -1
> 
> move_swap_account() does
>    old_memcg->memsw - 1
>    new_memcg->res - 1
>    new_memcg->memsw + 1
> 
> Hmm. old_memcg->res doesn't leak ? I think some check in commit_charge()
> for this new race is necessary.
> 
The race exists, but it's not a "leak": I mean the page will be charged
as memory to old_memcg.

Before state: (with sc->id == id of old_memcg)
	old_memcg->res: 0		new_memcg->res: 0
	old_memcg->memsw: 1	new_memcg->memsw: 0

Thread A does:
	do_swap_page()
		mem_cgroup_try_charge_swapin()
			=> old_memcg->res++, old_memcg->memsw++
		<task migration context moves the charge of swap>
		pte_offset_map_lock()
		mem_cgroup_try_charge_swapin()
			=> set pc->mem_cgroup to old_memcg(it's passed as an arg)
			mem_cgroup_lookup()
				=> finds new_memcg
			res_counter_uncharge()
				=> new_memcg->memsw--
		pte_unmap_unlock()

OTOH, the task migration context does:
	can_attach()
		do_precharge()
			=> new_memcg->res++, new_memcg->memsw++
	attach()
		pte_offset_map_lock()
		mem_cgroup_move_swap_account()
			=> old_memcg->memsw--, new_memcg->res--
		pte_unmap_unlock()

So,

After state: (with pc->mem_cgroup == old_memcg)
	old_memcg->res: 1		new_memcg->res: 0
	old_memcg->memsw: 1	new_memcg->memsw: 0

IMHO, some races where we cannot move the account would be unavoidable.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH -mmotm 7/7] memcg: improbe performance in moving swap charge
  2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
                   ` (5 preceding siblings ...)
  2009-12-04  5:52 ` [PATCH -mmotm 6/7] memcg: move charges of anonymous swap Daisuke Nishimura
@ 2009-12-04  5:54 ` Daisuke Nishimura
  2009-12-04  6:53 ` [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) KAMEZAWA Hiroyuki
  7 siblings, 0 replies; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  5:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage,
	Daisuke Nishimura, linux-mm

This patch tries to reduce overheads in moving charge by:

- Adds a new function(__mem_cgroup_get/put), which takes "count" as a arg and
  increment/decrement mem->refcnt by "count".
- removed res_counter_uncharge, css_put, and mem_cgroup_get/put from the path of
  moving swap account, and consolidate all of them into mem_cgroup_clear_mc.

These changes reduces the overhead from 1.3sec to 0.9sec to move charges of 1G
anonymous memory(including 500MB swap) in my test environment.

Changelog: 2009/12/04
- new patch

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   80 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 64 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6b3d17f..a0d06f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -254,6 +254,7 @@ struct move_charge_struct {
 	struct mem_cgroup *to;
 	unsigned long precharge;
 	unsigned long moved_charge;
+	unsigned long moved_swap;
 	struct task_struct *moving_task;	/* a task moving charges */
 };
 static struct move_charge_struct mc;
@@ -2265,6 +2266,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
  * @entry: swap entry to be moved
  * @from:  mem_cgroup which the entry is moved from
  * @to:  mem_cgroup which the entry is moved to
+ * @need_fixup: whether we should fixup res_counters and refcounts.
  *
  * It succeeds only when the swap_cgroup's record for this entry is the same
  * as the mem_cgroup's id of @from.
@@ -2275,7 +2277,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
  * both res and memsw, and called css_get().
  */
 static int mem_cgroup_move_swap_account(swp_entry_t entry,
-				struct mem_cgroup *from, struct mem_cgroup *to)
+		struct mem_cgroup *from, struct mem_cgroup *to, bool need_fixup)
 {
 	unsigned short old_id, new_id;
 
@@ -2283,19 +2285,25 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = css_id(&to->css);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		if (!mem_cgroup_is_root(from))
-			res_counter_uncharge(&from->memsw, PAGE_SIZE);
 		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_put(from);
+		mem_cgroup_swap_statistics(to, true);
 		/*
-		 * we charged both to->res and to->memsw, so we should uncharge
-		 * to->res.
+		 * This function is only called from task migration context now,
+		 * so we can safely postpone mem_cgroup_get/put and uncharge
+		 * till the end of this context(mem_cgroup_clear_mc).
 		 */
-		if (!mem_cgroup_is_root(to))
-			res_counter_uncharge(&to->res, PAGE_SIZE);
-		mem_cgroup_swap_statistics(to, true);
-		mem_cgroup_get(to);
-
+		if (need_fixup) {
+			if (!mem_cgroup_is_root(from))
+				res_counter_uncharge(&from->memsw, PAGE_SIZE);
+			mem_cgroup_put(from);
+			/*
+			 * we charged both to->res and to->memsw, so we should
+			 * uncharge to->res.
+			 */
+			if (!mem_cgroup_is_root(to))
+				res_counter_uncharge(&to->res, PAGE_SIZE);
+			mem_cgroup_get(to);
+		}
 		return 0;
 	}
 	return -EINVAL;
@@ -3376,14 +3384,19 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 		vfree(mem);
 }
 
+static void __mem_cgroup_get(struct mem_cgroup *mem, int count)
+{
+	atomic_add(count, &mem->refcnt);
+}
+
 static void mem_cgroup_get(struct mem_cgroup *mem)
 {
-	atomic_inc(&mem->refcnt);
+	__mem_cgroup_get(mem, 1);
 }
 
-static void mem_cgroup_put(struct mem_cgroup *mem)
+static void __mem_cgroup_put(struct mem_cgroup *mem, int count)
 {
-	if (atomic_dec_and_test(&mem->refcnt)) {
+	if (atomic_sub_and_test(count, &mem->refcnt)) {
 		struct mem_cgroup *parent = parent_mem_cgroup(mem);
 		__mem_cgroup_free(mem);
 		if (parent)
@@ -3391,6 +3404,11 @@ static void mem_cgroup_put(struct mem_cgroup *mem)
 	}
 }
 
+static void mem_cgroup_put(struct mem_cgroup *mem)
+{
+	__mem_cgroup_put(mem, 1);
+}
+
 /*
  * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
  */
@@ -3749,6 +3767,33 @@ static void mem_cgroup_clear_mc(void)
 		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
+	/* we must fixup refcnts and charges */
+	if (mc.moved_swap) {
+		WARN_ON_ONCE(mc.moved_swap > INT_MAX);
+		/* uncharge swap account from the old cgroup */
+		if (!mem_cgroup_is_root(mc.from))
+			res_counter_uncharge(&mc.from->memsw,
+						PAGE_SIZE * mc.moved_swap);
+		__mem_cgroup_put(mc.from, mc.moved_swap);
+
+		if (!mem_cgroup_is_root(mc.to)) {
+			/*
+			 * we charged both to->res and to->memsw, so we should
+			 * uncharge to->res.
+			 */
+			res_counter_uncharge(&mc.to->res,
+						PAGE_SIZE * mc.moved_swap);
+			/*
+			 * we must do css_put to cancel the refcnt got in
+			 * can_attach.
+			 */
+			VM_BUG_ON(test_bit(CSS_ROOT, &mc.to->css.flags));
+			__css_put(&mc.to->css, mc.moved_swap);
+		}
+		__mem_cgroup_get(mc.to, mc.moved_swap);
+
+		mc.moved_swap = 0;
+	}
 	mc.from = NULL;
 	mc.to = NULL;
 	mc.moving_task = NULL;
@@ -3777,11 +3822,13 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 			VM_BUG_ON(mc.to);
 			VM_BUG_ON(mc.precharge);
 			VM_BUG_ON(mc.moved_charge);
+			VM_BUG_ON(mc.moved_swap);
 			VM_BUG_ON(mc.moving_task);
 			mc.from = from;
 			mc.to = mem;
 			mc.precharge = 0;
 			mc.moved_charge = 0;
+			mc.moved_swap = 0;
 			mc.moving_task = current;
 
 			ret = mem_cgroup_precharge_mc(mm);
@@ -3843,9 +3890,10 @@ put:			/* is_target_pte_for_mc() gets the page */
 		case MC_TARGET_SWAP:
 			ent = target.ent;
 			if (!mem_cgroup_move_swap_account(ent,
-						mc.from, mc.to)) {
-				css_put(&mc.to->css);
+						mc.from, mc.to, false)) {
 				mc.precharge--;
+				/* we fixup refcnts and charges later. */
+				mc.moved_swap++;
 			}
 			break;
 		default:
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec)
  2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
                   ` (6 preceding siblings ...)
  2009-12-04  5:54 ` [PATCH -mmotm 7/7] memcg: improbe performance in moving swap charge Daisuke Nishimura
@ 2009-12-04  6:53 ` KAMEZAWA Hiroyuki
  2009-12-04  7:00   ` KAMEZAWA Hiroyuki
  2009-12-04  7:09   ` Daisuke Nishimura
  7 siblings, 2 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  6:53 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm

On Fri, 4 Dec 2009 14:46:09 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi.
> 
> These are current patches of my move-charge-at-task-migration feature.
> 
> The biggest change from previous(19/Nov) version is improvement in performance.
> 
> I measured the elapsed time of "echo [pid] > <some path>/tasks" on KVM guest
> with 4CPU/4GB(Xeon/3GHz) in three patterns:
> 
>   (1) / -> /00
>   (2) /00 -> /01
> 
>   we don't need to call res_counter_uncharge against root, so (1) would be smaller
>   than (2).
> 
>   (3) /00(setting mem.limit to half size of total) -> /01
> 
>   To compare the overhead of anon and swap.
> 
> In 19/Nov version:
>        |  252M  |  512M  |   1G
>   -----+--------+--------+--------
>    (1) |  0.21  |  0.41  |  0.821
>   -----+--------+--------+--------
>    (2) |  0.43  |  0.85  |  1.71
>   -----+--------+--------+--------
>    (3) |  0.40  |  0.81  |  1.62
>   -----+--------+--------+--------
> 
> In this version:
>        |  252M  |  512M  |   1G
>   -----+--------+--------+--------
>    (1) |  0.15  |  0.30  |  0.60
>   -----+--------+--------+--------
>    (2) |  0.15  |  0.30  |  0.60
>   -----+--------+--------+--------
>    (3) |  0.22  |  0.44  |  0.89
> 
Nice !

> Please read patch descriptions for each patch([4/7],[7/7]) for details of
> how and how much the patch improved the performance.
> 
>   [1/7] cgroup: introduce cancel_attach()
>   [2/7] memcg: add interface to move charge at task migration
>   [3/7] memcg: move charges of anonymous page
>   [4/7] memcg: improbe performance in moving charge
>   [5/7] memcg: avoid oom during moving charge
>   [6/7] memcg: move charges of anonymous swap
>   [7/7] memcg: improbe performance in moving swap charge
> 
> Current version supports only recharge of non-shared(mapcount == 1) anonymous pages
> and swaps of those pages. I think it's enough as a first step.
> 
Hmm. shared swap entry (very rare one?) is moved ?

Thanks,
-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec)
  2009-12-04  6:53 ` [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) KAMEZAWA Hiroyuki
@ 2009-12-04  7:00   ` KAMEZAWA Hiroyuki
  2009-12-07  6:34     ` Daisuke Nishimura
  2009-12-04  7:09   ` Daisuke Nishimura
  1 sibling, 1 reply; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  7:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Andrew Morton, Balbir Singh, Li Zefan,
	Paul Menage, linux-mm

On Fri, 4 Dec 2009 15:53:17 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri, 4 Dec 2009 14:46:09 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > In this version:
> >        |  252M  |  512M  |   1G
> >   -----+--------+--------+--------
> >    (1) |  0.15  |  0.30  |  0.60
> >   -----+--------+--------+--------
> >    (2) |  0.15  |  0.30  |  0.60
> >   -----+--------+--------+--------
> >    (3) |  0.22  |  0.44  |  0.89
> > 
> Nice !
> 

Ah. could you clarify...

 1. How is fork()/exit() affected by this move ?
 2. How long cpuset's migration-at-task-move requires ?
    I guess much longer than this.
 3. If need to reclaim memory for moving tasks, can this be longer ?
    If so, we may need some trick to release cgroup_mutex in task moving.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec)
  2009-12-04  7:00   ` KAMEZAWA Hiroyuki
@ 2009-12-07  6:34     ` Daisuke Nishimura
  2009-12-09  0:21       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-07  6:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm,
	Daisuke Nishimura

[-- Attachment #1: Type: text/plain, Size: 3296 bytes --]

On Fri, 4 Dec 2009 16:00:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 4 Dec 2009 15:53:17 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Fri, 4 Dec 2009 14:46:09 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > In this version:
> > >        |  252M  |  512M  |   1G
> > >   -----+--------+--------+--------
> > >    (1) |  0.15  |  0.30  |  0.60
> > >   -----+--------+--------+--------
> > >    (2) |  0.15  |  0.30  |  0.60
> > >   -----+--------+--------+--------
> > >    (3) |  0.22  |  0.44  |  0.89
> > > 
> > Nice !
> > 
> 
> Ah. could you clarify...
> 
>  1. How is fork()/exit() affected by this move ?
I measured using unixbench(./Run -c 1 spawn execl). I used the attached script to do
task migration infinitly(./switch3.sh /cgroup/memory/01 /cgroup/memory/02 [pid of bash]).
The script is executed on a different cpu from the unixbench's by taskset.

(1) no task migration(run on /01)

Execl Throughput                                192.7 lps   (29.9 s, 2 samples)
Process Creation                                475.5 lps   (30.0 s, 2 samples)

Execl Throughput                                191.2 lps   (29.9 s, 2 samples)
Process Creation                                463.4 lps   (30.0 s, 2 samples)

Execl Throughput                                191.0 lps   (29.9 s, 2 samples)
Process Creation                                474.9 lps   (30.0 s, 2 samples)


(2) under task migration between /01 and /02 w/o setting move_charge_at_immigrate

Execl Throughput                                150.2 lps   (29.8 s, 2 samples)
Process Creation                                344.1 lps   (30.0 s, 2 samples)

Execl Throughput                                146.9 lps   (29.8 s, 2 samples)
Process Creation                                337.7 lps   (30.0 s, 2 samples)

Execl Throughput                                150.5 lps   (29.8 s, 2 samples)
Process Creation                                345.3 lps   (30.0 s, 2 samples)


(3) under task migration between /01 and /02 w/ setting move_charge_at_immigrate

Execl Throughput                                142.9 lps   (29.9 s, 2 samples)
Process Creation                                323.1 lps   (30.0 s, 2 samples)

Execl Throughput                                146.6 lps   (29.8 s, 2 samples)
Process Creation                                332.0 lps   (30.0 s, 2 samples)

Execl Throughput                                150.9 lps   (29.8 s, 2 samples)
Process Creation                                344.2 lps   (30.0 s, 2 samples)


(those values seem terrible :(  I run them on KVM guest...)
(2) seems a bit better than (3), but the impact of task migration itself is
far bigger.


>  2. How long cpuset's migration-at-task-move requires ?
>     I guess much longer than this.
I measured in the same environment using fakenuma. It took 1.17sec for 256M,
2.33sec for 512M, and 4.69sec for 1G.


>  3. If need to reclaim memory for moving tasks, can this be longer ?
I think so.

>     If so, we may need some trick to release cgroup_mutex in task moving.
> 
hmm, I see your concern but I think it isn't so easy.. IMHO, we need changes
in cgroup layer and should take care not to cause dead lock.


Regards,
Daisuke Nishimura.

[-- Attachment #2: switch3.sh --]
[-- Type: text/x-sh, Size: 451 bytes --]

#!/bin/bash

SRC=$1
DST=$2
EXCLUDE_LIST="$3"

move_task()
{
	local ret

	for pid in $1
	do
		echo "${EXCLUDE_LIST}" | grep -w -q $pid
		if [ $? -eq 0 ]; then
			continue
		fi
		echo -n "$pid "
		/bin/echo $pid >$2/tasks 2>/dev/null
	done
	echo ""
}

stopflag=0

interrupt()
{
	stopflag=1
}

trap interrupt INT

while [ $stopflag -ne 1 ]
do
	echo "----- `date` -----"
	move_task "`cat ${SRC}/tasks`" ${DST}
	TMP=${SRC}
	SRC=${DST}
	DST=${TMP}
done





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec)
  2009-12-07  6:34     ` Daisuke Nishimura
@ 2009-12-09  0:21       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-09  0:21 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm

On Mon, 7 Dec 2009 15:34:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 4 Dec 2009 16:00:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 4 Dec 2009 15:53:17 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Fri, 4 Dec 2009 14:46:09 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > > In this version:
> > > >        |  252M  |  512M  |   1G
> > > >   -----+--------+--------+--------
> > > >    (1) |  0.15  |  0.30  |  0.60
> > > >   -----+--------+--------+--------
> > > >    (2) |  0.15  |  0.30  |  0.60
> > > >   -----+--------+--------+--------
> > > >    (3) |  0.22  |  0.44  |  0.89
> > > > 
> > > Nice !
> > > 
> > 
> > Ah. could you clarify...
> > 
> >  1. How is fork()/exit() affected by this move ?
> I measured using unixbench(./Run -c 1 spawn execl). I used the attached script to do
> task migration infinitly(./switch3.sh /cgroup/memory/01 /cgroup/memory/02 [pid of bash]).
> The script is executed on a different cpu from the unixbench's by taskset.
> 
> (1) no task migration(run on /01)
> 
> Execl Throughput                                192.7 lps   (29.9 s, 2 samples)
> Process Creation                                475.5 lps   (30.0 s, 2 samples)
> 
> Execl Throughput                                191.2 lps   (29.9 s, 2 samples)
> Process Creation                                463.4 lps   (30.0 s, 2 samples)
> 
> Execl Throughput                                191.0 lps   (29.9 s, 2 samples)
> Process Creation                                474.9 lps   (30.0 s, 2 samples)
> 
> 
> (2) under task migration between /01 and /02 w/o setting move_charge_at_immigrate
> 
> Execl Throughput                                150.2 lps   (29.8 s, 2 samples)
> Process Creation                                344.1 lps   (30.0 s, 2 samples)
> 
> Execl Throughput                                146.9 lps   (29.8 s, 2 samples)
> Process Creation                                337.7 lps   (30.0 s, 2 samples)
> 
> Execl Throughput                                150.5 lps   (29.8 s, 2 samples)
> Process Creation                                345.3 lps   (30.0 s, 2 samples)
> 
> 
> (3) under task migration between /01 and /02 w/ setting move_charge_at_immigrate
> 
> Execl Throughput                                142.9 lps   (29.9 s, 2 samples)
> Process Creation                                323.1 lps   (30.0 s, 2 samples)
> 
> Execl Throughput                                146.6 lps   (29.8 s, 2 samples)
> Process Creation                                332.0 lps   (30.0 s, 2 samples)
> 
> Execl Throughput                                150.9 lps   (29.8 s, 2 samples)
> Process Creation                                344.2 lps   (30.0 s, 2 samples)
> 
> 
> (those values seem terrible :(  I run them on KVM guest...)
> (2) seems a bit better than (3), but the impact of task migration itself is
> far bigger.
> 

Thank you for interesting tests. (3) seems faster than I expected.


> 
> >  2. How long cpuset's migration-at-task-move requires ?
> >     I guess much longer than this.
> I measured in the same environment using fakenuma. It took 1.17sec for 256M,
> 2.33sec for 512M, and 4.69sec for 1G.
> 
Wow..

> 
> >  3. If need to reclaim memory for moving tasks, can this be longer ?
> I think so.
> 
> >     If so, we may need some trick to release cgroup_mutex in task moving.
> > 
> hmm, I see your concern but I think it isn't so easy.. IMHO, we need changes
> in cgroup layer and should take care not to cause dead lock.
> 

I agree here. If you can find somewhere good to write this on TO-DO-LIST, please.
No other requests from me, now.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec)
  2009-12-04  6:53 ` [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) KAMEZAWA Hiroyuki
  2009-12-04  7:00   ` KAMEZAWA Hiroyuki
@ 2009-12-04  7:09   ` Daisuke Nishimura
  2009-12-04  7:34     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 21+ messages in thread
From: Daisuke Nishimura @ 2009-12-04  7:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm,
	Daisuke Nishimura

On Fri, 4 Dec 2009 15:53:17 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 4 Dec 2009 14:46:09 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > Hi.
> > 
> > These are current patches of my move-charge-at-task-migration feature.
> > 
> > The biggest change from previous(19/Nov) version is improvement in performance.
> > 
> > I measured the elapsed time of "echo [pid] > <some path>/tasks" on KVM guest
> > with 4CPU/4GB(Xeon/3GHz) in three patterns:
> > 
> >   (1) / -> /00
> >   (2) /00 -> /01
> > 
> >   we don't need to call res_counter_uncharge against root, so (1) would be smaller
> >   than (2).
> > 
> >   (3) /00(setting mem.limit to half size of total) -> /01
> > 
> >   To compare the overhead of anon and swap.
> > 
> > In 19/Nov version:
> >        |  252M  |  512M  |   1G
> >   -----+--------+--------+--------
> >    (1) |  0.21  |  0.41  |  0.821
> >   -----+--------+--------+--------
> >    (2) |  0.43  |  0.85  |  1.71
> >   -----+--------+--------+--------
> >    (3) |  0.40  |  0.81  |  1.62
> >   -----+--------+--------+--------
> > 
> > In this version:
> >        |  252M  |  512M  |   1G
> >   -----+--------+--------+--------
> >    (1) |  0.15  |  0.30  |  0.60
> >   -----+--------+--------+--------
> >    (2) |  0.15  |  0.30  |  0.60
> >   -----+--------+--------+--------
> >    (3) |  0.22  |  0.44  |  0.89
> > 
> Nice !
> 
> > Please read patch descriptions for each patch([4/7],[7/7]) for details of
> > how and how much the patch improved the performance.
> > 
> >   [1/7] cgroup: introduce cancel_attach()
> >   [2/7] memcg: add interface to move charge at task migration
> >   [3/7] memcg: move charges of anonymous page
> >   [4/7] memcg: improbe performance in moving charge
> >   [5/7] memcg: avoid oom during moving charge
> >   [6/7] memcg: move charges of anonymous swap
> >   [7/7] memcg: improbe performance in moving swap charge
> > 
> > Current version supports only recharge of non-shared(mapcount == 1) anonymous pages
> > and swaps of those pages. I think it's enough as a first step.
> > 
> Hmm. shared swap entry (very rare one?) is moved ?
> 
Well, do you mean the charge of shared swap entry(IOW, swap entry with swap_count > 1)
is moved ? If so, no. I check swap_count in mem_cgroup_count_swap_user(see [6/7]),
and don't move the charge of it if it's shared.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec)
  2009-12-04  7:09   ` Daisuke Nishimura
@ 2009-12-04  7:34     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  7:34 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Balbir Singh, Li Zefan, Paul Menage, linux-mm

On Fri, 4 Dec 2009 16:09:01 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 4 Dec 2009 15:53:17 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 4 Dec 2009 14:46:09 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > Hi.
> > > 
> > > These are current patches of my move-charge-at-task-migration feature.
> > > 
> > > The biggest change from previous(19/Nov) version is improvement in performance.
> > > 
> > > I measured the elapsed time of "echo [pid] > <some path>/tasks" on KVM guest
> > > with 4CPU/4GB(Xeon/3GHz) in three patterns:
> > > 
> > >   (1) / -> /00
> > >   (2) /00 -> /01
> > > 
> > >   we don't need to call res_counter_uncharge against root, so (1) would be smaller
> > >   than (2).
> > > 
> > >   (3) /00(setting mem.limit to half size of total) -> /01
> > > 
> > >   To compare the overhead of anon and swap.
> > >
> > > Please read patch descriptions for each patch([4/7],[7/7]) for details of
> > > how and how much the patch improved the performance.
> > > 
> > >   [1/7] cgroup: introduce cancel_attach()
> > >   [2/7] memcg: add interface to move charge at task migration
> > >   [3/7] memcg: move charges of anonymous page
> > >   [4/7] memcg: improbe performance in moving charge
> > >   [5/7] memcg: avoid oom during moving charge
> > >   [6/7] memcg: move charges of anonymous swap
> > >   [7/7] memcg: improbe performance in moving swap charge
> > > 
> > > Current version supports only recharge of non-shared(mapcount == 1) anonymous pages
> > > and swaps of those pages. I think it's enough as a first step.
> > > 
> > Hmm. shared swap entry (very rare one?) is moved ?
> > 
> Well, do you mean the charge of shared swap entry(IOW, swap entry with swap_count > 1)
> is moved ? If so, no. I check swap_count in mem_cgroup_count_swap_user(see [6/7]),
> and don't move the charge of it if it's shared.
> 
I see. thank you for explanation.

Regards,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2009-12-09  0:25 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-04  5:46 [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) Daisuke Nishimura
2009-12-04  5:47 ` [PATCH -mmotm 1/7] cgroup: introduce cancel_attach() Daisuke Nishimura
2009-12-04  5:48 ` [PATCH -mmotm 2/7] memcg: add interface to move charge at task migration Daisuke Nishimura
2009-12-04  6:55   ` KAMEZAWA Hiroyuki
2009-12-04  5:49 ` [PATCH -mmotm 3/7] memcg: move charges of anonymous page Daisuke Nishimura
2009-12-04  5:50 ` [PATCH -mmotm 4/7] memcg: improbe performance in moving charge Daisuke Nishimura
2009-12-04  7:10   ` KAMEZAWA Hiroyuki
2009-12-04  7:29     ` Daisuke Nishimura
2009-12-04  5:51 ` [PATCH -mmotm 5/7] memcg: avoid oom during " Daisuke Nishimura
2009-12-04  7:14   ` KAMEZAWA Hiroyuki
2009-12-04  7:43     ` Daisuke Nishimura
2009-12-04  5:52 ` [PATCH -mmotm 6/7] memcg: move charges of anonymous swap Daisuke Nishimura
2009-12-04  7:32   ` KAMEZAWA Hiroyuki
2009-12-04 10:53     ` Daisuke Nishimura
2009-12-04  5:54 ` [PATCH -mmotm 7/7] memcg: improbe performance in moving swap charge Daisuke Nishimura
2009-12-04  6:53 ` [PATCH -mmotm 0/7] memcg: move charge at task migration (04/Dec) KAMEZAWA Hiroyuki
2009-12-04  7:00   ` KAMEZAWA Hiroyuki
2009-12-07  6:34     ` Daisuke Nishimura
2009-12-09  0:21       ` KAMEZAWA Hiroyuki
2009-12-04  7:09   ` Daisuke Nishimura
2009-12-04  7:34     ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox