* [-mm] Add an owner to the mm_struct (v8)
@ 2008-04-04 8:05 Balbir Singh
2008-04-04 8:12 ` Paul Menage
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Balbir Singh @ 2008-04-04 8:05 UTC (permalink / raw)
To: Paul Menage, Pavel Emelianov
Cc: Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
taka, linux-mm, David Rientjes, Balbir Singh, Andrew Morton,
KAMEZAWA Hiroyuki
Changelog v7
------------
1. Make mm_need_new_owner() more readable
2. Remove extra white space from init_task.h
Changelog v6
------------
1. Fix typos
2. Document the use of delay_group_leader()
Changelog v5
------------
Remove the hooks for .owner from init_task.h and move it to init/main.c
Changelog v4
------------
1. Release rcu_read_lock() after acquiring task_lock(). Also get a reference
to the task_struct
2. Change cgroup mm_owner_changed callback to callback only if the
cgroup of old and new task is different and to pass the old and new
cgroups instead of task pointers
3. Port the patch to 2.6.25-rc8-mm1
Changelog v3
------------
1. Add mm->owner change callbacks using cgroups
This patch removes the mem_cgroup member from mm_struct and instead adds
an owner. This approach was suggested by Paul Menage. The advantage of
this approach is that, once mm->owner is known, the corresponding cgroup
can be determined through the subsystem id. It also allows several control
groups that are virtually grouped by mm_struct to exist independently of
the memory controller, i.e., without adding a mem_cgroup-style pointer to
mm_struct for each controller.
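For example (a sketch only; it mirrors mem_cgroup_from_task() further down in
this patch, and mem_cgroup_from_mm() is a made-up helper name, not something
the patch adds), a controller can recover its per-cgroup state from the owner:

	/* Sketch: derive a controller's state from mm->owner via its subsys id */
	static struct mem_cgroup *mem_cgroup_from_mm(struct mm_struct *mm)
	{
		/* memcontrol.c reads mm->owner under rcu_read_lock() */
		return container_of(task_subsys_state(mm->owner,
						      mem_cgroup_subsys_id),
				    struct mem_cgroup, css);
	}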
A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.
This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_owner_changed callback is called with the task_lock() of
the new task held, just prior to changing mm->owner.
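As an illustration only (the subsystem below is hypothetical; its name and
the accounting it would do are not part of this patch), a controller that
cares about ownership changes would wire up the callback roughly like this:

	/* Hypothetical subsystem; runs with the new owner's task_lock() held */
	static void foo_mm_owner_changed(struct cgroup_subsys *ss,
					 struct cgroup *old,
					 struct cgroup *new)
	{
		/* migrate any per-mm accounting from @old to @new here */
	}

	struct cgroup_subsys foo_subsys = {
		.name			= "foo",
		.mm_owner_changed	= foo_mm_owner_changed,
	};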
I am indebted to Paul Menage for his several reviews of this patchset
and for helping me make it lighter and simpler.
This patch was tested on a powerpc box; it was compiled with the
MM_OWNER config both turned on and off.
After the thread group leader exits, it is moved to the init_css_set by
cgroup_exit(), so all future charges from running threads will be
redirected to the init_css_set's subsystem.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
fs/exec.c | 1
include/linux/cgroup.h | 15 +++++++
include/linux/memcontrol.h | 16 +-------
include/linux/mm_types.h | 5 +-
include/linux/sched.h | 13 ++++++
init/Kconfig | 15 +++++++
init/main.c | 1
kernel/cgroup.c | 30 +++++++++++++++
kernel/exit.c | 89 +++++++++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 11 ++++-
mm/memcontrol.c | 24 +-----------
11 files changed, 181 insertions(+), 39 deletions(-)
diff -puN fs/exec.c~memory-controller-add-mm-owner fs/exec.c
--- linux-2.6.25-rc8/fs/exec.c~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/fs/exec.c 2008-04-03 22:43:27.000000000 +0530
@@ -735,6 +735,7 @@ static int exec_mmap(struct mm_struct *m
tsk->active_mm = mm;
activate_mm(active_mm, mm);
task_unlock(tsk);
+ mm_update_next_owner(mm);
arch_pick_mmap_layout(mm);
if (old_mm) {
up_read(&old_mm->mmap_sem);
diff -puN include/linux/cgroup.h~memory-controller-add-mm-owner include/linux/cgroup.h
--- linux-2.6.25-rc8/include/linux/cgroup.h~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/include/linux/cgroup.h 2008-04-03 22:43:27.000000000 +0530
@@ -300,6 +300,12 @@ struct cgroup_subsys {
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
+ /*
+ * This routine is called with the task_lock of mm->owner held
+ */
+ void (*mm_owner_changed)(struct cgroup_subsys *ss,
+ struct cgroup *old,
+ struct cgroup *new);
int subsys_id;
int active;
int disabled;
@@ -385,4 +391,13 @@ static inline int cgroupstats_build(stru
#endif /* !CONFIG_CGROUPS */
+#ifdef CONFIG_MM_OWNER
+extern void
+cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new);
+#else /* !CONFIG_MM_OWNER */
+static inline void
+cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
+{
+}
+#endif /* CONFIG_MM_OWNER */
#endif /* _LINUX_CGROUP_H */
diff -puN include/linux/init_task.h~memory-controller-add-mm-owner include/linux/init_task.h
diff -puN include/linux/memcontrol.h~memory-controller-add-mm-owner include/linux/memcontrol.h
--- linux-2.6.25-rc8/include/linux/memcontrol.h~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/include/linux/memcontrol.h 2008-04-03 22:43:27.000000000 +0530
@@ -27,9 +27,6 @@ struct mm_struct;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
-extern void mm_free_cgroup(struct mm_struct *mm);
-
#define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
extern struct page_cgroup *page_get_page_cgroup(struct page *page);
@@ -48,8 +45,10 @@ extern unsigned long mem_cgroup_isolate_
extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+
#define mm_match_cgroup(mm, cgroup) \
- ((cgroup) == rcu_dereference((mm)->mem_cgroup))
+ ((cgroup) == mem_cgroup_from_task((mm)->owner))
extern int mem_cgroup_prepare_migration(struct page *page);
extern void mem_cgroup_end_migration(struct page *page);
@@ -73,15 +72,6 @@ extern long mem_cgroup_calc_reclaim_inac
struct zone *zone, int priority);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
-static inline void mm_init_cgroup(struct mm_struct *mm,
- struct task_struct *p)
-{
-}
-
-static inline void mm_free_cgroup(struct mm_struct *mm)
-{
-}
-
static inline void page_reset_bad_cgroup(struct page *page)
{
}
diff -puN include/linux/mm_types.h~memory-controller-add-mm-owner include/linux/mm_types.h
--- linux-2.6.25-rc8/include/linux/mm_types.h~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/include/linux/mm_types.h 2008-04-03 22:43:27.000000000 +0530
@@ -230,8 +230,9 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock; /* aio lock */
struct kioctx *ioctx_list;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- struct mem_cgroup *mem_cgroup;
+#ifdef CONFIG_MM_OWNER
+ struct task_struct *owner; /* The thread group leader that */
+ /* owns the mm_struct. */
#endif
#ifdef CONFIG_PROC_FS
diff -puN include/linux/sched.h~memory-controller-add-mm-owner include/linux/sched.h
--- linux-2.6.25-rc8/include/linux/sched.h~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/include/linux/sched.h 2008-04-03 22:43:27.000000000 +0530
@@ -2144,6 +2144,19 @@ static inline void migration_init(void)
#define TASK_STATE_TO_CHAR_STR "RSDTtZX"
+#ifdef CONFIG_MM_OWNER
+extern void mm_update_next_owner(struct mm_struct *mm);
+extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
+#else
+static inline void mm_update_next_owner(struct mm_struct *mm)
+{
+}
+
+static inline void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
+{
+}
+#endif /* CONFIG_MM_OWNER */
+
#endif /* __KERNEL__ */
#endif
diff -puN init/Kconfig~memory-controller-add-mm-owner init/Kconfig
--- linux-2.6.25-rc8/init/Kconfig~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/init/Kconfig 2008-04-03 22:45:18.000000000 +0530
@@ -371,9 +371,21 @@ config RESOURCE_COUNTERS
infrastructure that works with cgroups
depends on CGROUPS
+config MM_OWNER
+ bool "Enable ownership of mm structure"
+ help
+ This option enables mm_structs to have an owner. The advantage
+ of this approach is that it allows several independent memory-based
+ cgroup controllers to co-exist without too much space overhead.
+
+ This feature adds fork/exit overhead, so enable this only if
+ you need resource controllers.
+
config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on CGROUPS && RESOURCE_COUNTERS
+ select MM_OWNER
help
Provides a memory resource controller that manages both page cache and
RSS memory.
@@ -386,6 +398,9 @@ config CGROUP_MEM_RES_CTLR
Only enable when you're ok with these trade offs and really
sure you need the memory resource controller.
+ This config option also selects the MM_OWNER config option, which
+ could in turn add some fork/exit overhead.
+
config SYSFS_DEPRECATED
bool
diff -puN kernel/cgroup.c~memory-controller-add-mm-owner kernel/cgroup.c
--- linux-2.6.25-rc8/kernel/cgroup.c~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/kernel/cgroup.c 2008-04-03 22:43:27.000000000 +0530
@@ -118,6 +118,7 @@ static int root_count;
* be called.
*/
static int need_forkexit_callback;
+static int need_mm_owner_callback;
/* convenient tests for these bits */
inline int cgroup_is_removed(const struct cgroup *cgrp)
@@ -2485,6 +2486,7 @@ static void __init cgroup_init_subsys(st
}
need_forkexit_callback |= ss->fork || ss->exit;
+ need_mm_owner_callback |= !!ss->mm_owner_changed;
ss->active = 1;
}
@@ -2721,6 +2723,34 @@ void cgroup_fork_callbacks(struct task_s
}
}
+#ifdef CONFIG_MM_OWNER
+/**
+ * cgroup_mm_owner_callbacks - run callbacks when the mm->owner changes
+ * @old: the old owner
+ * @new: the new owner
+ *
+ * Called on every change to mm->owner. mm_init_owner() does not
+ * invoke this routine, since it assigns the mm->owner the first time
+ * and does not change it.
+ */
+void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
+{
+ struct cgroup *oldcgrp, *newcgrp;
+
+ if (need_mm_owner_callback) {
+ int i;
+ for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ struct cgroup_subsys *ss = subsys[i];
+ oldcgrp = task_cgroup(old, ss->subsys_id);
+ newcgrp = task_cgroup(new, ss->subsys_id);
+ if (oldcgrp == newcgrp)
+ continue;
+ if (ss->mm_owner_changed)
+ ss->mm_owner_changed(ss, oldcgrp, newcgrp);
+ }
+ }
+}
+#endif /* CONFIG_MM_OWNER */
+
/**
* cgroup_post_fork - called on a new task after adding it to the task list
* @child: the task in question
diff -puN kernel/exit.c~memory-controller-add-mm-owner kernel/exit.c
--- linux-2.6.25-rc8/kernel/exit.c~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/kernel/exit.c 2008-04-04 00:56:51.000000000 +0530
@@ -577,6 +577,94 @@ void exit_fs(struct task_struct *tsk)
EXPORT_SYMBOL_GPL(exit_fs);
+#ifdef CONFIG_MM_OWNER
+/*
+ * Task p is exiting and it owned mm, so let's find a new owner for it
+ */
+static inline int
+mm_need_new_owner(struct mm_struct *mm, struct task_struct *p)
+{
+ /*
+ * If there are other users of the mm and the owner (us) is exiting
+ * we need to find a new owner to take on the responsibility.
+ * When we use thread groups (CLONE_THREAD), the thread group
+ * leader is kept around in zombie state, even after it exits.
+ * delay_group_leader() ensures that if the group leader is around
+ * we need not select a new owner.
+ */
+ if (!mm)
+ return 0;
+ if (atomic_read(&mm->mm_users) <= 1)
+ return 0;
+ if (mm->owner != p)
+ return 0;
+ if (delay_group_leader(p))
+ return 0;
+ return 1;
+}
+
+void mm_update_next_owner(struct mm_struct *mm)
+{
+ struct task_struct *c, *g, *p = current;
+
+retry:
+ if (!mm_need_new_owner(mm, p))
+ return;
+
+ rcu_read_lock();
+ /*
+ * Search in the children
+ */
+ list_for_each_entry(c, &p->children, sibling) {
+ if (c->mm == mm)
+ goto assign_new_owner;
+ }
+
+ /*
+ * Search in the siblings
+ */
+ list_for_each_entry(c, &p->parent->children, sibling) {
+ if (c->mm == mm)
+ goto assign_new_owner;
+ }
+
+ /*
+ * Search through everything else. We should not get
+ * here often
+ */
+ do_each_thread(g, c) {
+ if (c->mm == mm)
+ goto assign_new_owner;
+ } while_each_thread(g, c);
+
+ rcu_read_unlock();
+ return;
+
+assign_new_owner:
+ BUG_ON(c == p);
+ get_task_struct(c);
+ /*
+ * The task_lock protects c->mm from changing.
+ * We always want mm->owner->mm == mm
+ */
+ task_lock(c);
+ /*
+ * Delay rcu_read_unlock() till we have the task_lock()
+ * to ensure that c does not slip away underneath us
+ */
+ rcu_read_unlock();
+ if (c->mm != mm) {
+ task_unlock(c);
+ put_task_struct(c);
+ goto retry;
+ }
+ cgroup_mm_owner_callbacks(mm->owner, c);
+ mm->owner = c;
+ task_unlock(c);
+ put_task_struct(c);
+}
+#endif /* CONFIG_MM_OWNER */
+
/*
* Turn us into a lazy TLB process if we
* aren't already..
@@ -616,6 +704,7 @@ static void exit_mm(struct task_struct *
/* We don't want this task to be frozen prematurely */
clear_freeze_flag(tsk);
task_unlock(tsk);
+ mm_update_next_owner(mm);
mmput(mm);
}
diff -puN kernel/fork.c~memory-controller-add-mm-owner kernel/fork.c
--- linux-2.6.25-rc8/kernel/fork.c~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/kernel/fork.c 2008-04-03 22:43:27.000000000 +0530
@@ -358,14 +358,13 @@ static struct mm_struct * mm_init(struct
mm->ioctx_list = NULL;
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
- mm_init_cgroup(mm, p);
+ mm_init_owner(mm, p);
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
return mm;
}
- mm_free_cgroup(mm);
free_mm(mm);
return NULL;
}
@@ -416,7 +415,6 @@ void mmput(struct mm_struct *mm)
spin_unlock(&mmlist_lock);
}
put_swap_token(mm);
- mm_free_cgroup(mm);
mmdrop(mm);
}
}
@@ -996,6 +994,13 @@ static void rt_mutex_init_task(struct ta
#endif
}
+#ifdef CONFIG_MM_OWNER
+void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
+{
+ mm->owner = p;
+}
+#endif /* CONFIG_MM_OWNER */
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
diff -puN mm/memcontrol.c~memory-controller-add-mm-owner mm/memcontrol.c
--- linux-2.6.25-rc8/mm/memcontrol.c~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/mm/memcontrol.c 2008-04-03 22:46:51.000000000 +0530
@@ -238,26 +238,12 @@ static struct mem_cgroup *mem_cgroup_fro
css);
}
-static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
struct mem_cgroup, css);
}
-void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p)
-{
- struct mem_cgroup *mem;
-
- mem = mem_cgroup_from_task(p);
- css_get(&mem->css);
- mm->mem_cgroup = mem;
-}
-
-void mm_free_cgroup(struct mm_struct *mm)
-{
- css_put(&mm->mem_cgroup->css);
-}
-
static inline int page_cgroup_locked(struct page *page)
{
return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
@@ -478,6 +464,7 @@ unsigned long mem_cgroup_isolate_pages(u
int zid = zone_idx(z);
struct mem_cgroup_per_zone *mz;
+ BUG_ON(!mem_cont);
mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
if (active)
src = &mz->active_list;
@@ -576,7 +563,7 @@ retry:
mm = &init_mm;
rcu_read_lock();
- mem = rcu_dereference(mm->mem_cgroup);
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
/*
* For every charge from the cgroup, increment reference count
*/
@@ -1006,7 +993,6 @@ mem_cgroup_create(struct cgroup_subsys *
if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
- init_mm.mem_cgroup = mem;
page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
} else
mem = kzalloc(sizeof(struct mem_cgroup), GFP_KERNEL);
@@ -1087,10 +1073,6 @@ static void mem_cgroup_move_task(struct
if (!thread_group_leader(p))
goto out;
- css_get(&mem->css);
- rcu_assign_pointer(mm->mem_cgroup, mem);
- css_put(&old_mem->css);
-
out:
mmput(mm);
}
diff -puN init/main.c~memory-controller-add-mm-owner init/main.c
--- linux-2.6.25-rc8/init/main.c~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
+++ linux-2.6.25-rc8-balbir/init/main.c 2008-04-03 22:43:27.000000000 +0530
@@ -537,6 +537,7 @@ asmlinkage void __init start_kernel(void
printk(KERN_NOTICE);
printk(linux_banner);
setup_arch(&command_line);
+ mm_init_owner(&init_mm, &init_task);
setup_command_line(command_line);
unwind_setup();
setup_per_cpu_areas();
_
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-04 8:05 [-mm] Add an owner to the mm_struct (v8) Balbir Singh
@ 2008-04-04 8:12 ` Paul Menage
2008-04-04 8:28 ` Balbir Singh
2008-04-07 22:09 ` Andrew Morton
2008-04-09 0:42 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-04 8:12 UTC (permalink / raw)
To: Balbir Singh
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
On Fri, Apr 4, 2008 at 1:05 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> After the thread group leader exits, it is moved to the init_css_set by
> cgroup_exit(), so all future charges from running threads will be
> redirected to the init_css_set's subsystem.
And its uncharges, which is more of the problem I was getting at
earlier - surely when the mm is finally destroyed, all its virtual
address space charges will be uncharged from the root cgroup rather
than the correct cgroup, if we left the delayed group leader as the
owner? Which is why I think the group leader optimization is unsafe.
Paul
> [full quoted patch snipped]
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-04 8:12 ` Paul Menage
@ 2008-04-04 8:28 ` Balbir Singh
2008-04-04 8:50 ` Paul Menage
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-04 8:28 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Fri, Apr 4, 2008 at 1:05 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> After the thread group leader exits, it is moved to the init_css_set by
>> cgroup_exit(), so all future charges from running threads will be
>> redirected to the init_css_set's subsystem.
>
> And its uncharges, which is more of the problem I was getting at
> earlier - surely when the mm is finally destroyed, all its virtual
> address space charges will be uncharged from the root cgroup rather
> than the correct cgroup, if we left the delayed group leader as the
> owner? Which is why I think the group leader optimization is unsafe.
It won't uncharge for the memory controller from the root cgroup since each page
has the mem_cgroup information associated with it. For other controllers,
they'll need to monitor exit() callbacks to know when the leader is dead :( (sigh).
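To illustrate the first point, the uncharge path is driven by the per-page
page_cgroup rather than by mm->owner; roughly (a sketch from memory of
memcontrol.c, details may differ):

	/* Sketch: uncharge uses the cgroup recorded at charge time */
	struct page_cgroup *pc = page_get_page_cgroup(page);
	if (pc) {
		struct mem_cgroup *mem = pc->mem_cgroup;
		res_counter_uncharge(&mem->res, PAGE_SIZE);
		css_put(&mem->css);
	}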
Not having the group leader optimization can introduce big overheads (consider
thousands of tasks, with the group leader being the first one to exit).
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-04 8:28 ` Balbir Singh
@ 2008-04-04 8:50 ` Paul Menage
2008-04-04 9:25 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-04 8:50 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
On Fri, Apr 4, 2008 at 1:28 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> It won't uncharge for the memory controller from the root cgroup since each page
> has the mem_cgroup information associated with it.
Right, I realise that the memory controller is OK because of the ref counts.
> For other controllers,
> they'll need to monitor exit() callbacks to know when the leader is dead :( (sigh).
That sounds like a nightmare ...
>
> Not having the group leader optimization can introduce big overheads (consider
> thousands of tasks, with the group leader being the first one to exit).
Can you test the overhead?
As long as we find someone to pass the mm to quickly, it shouldn't be
too bad - I think we're already optimized for that case. Generally the
group leader's first child will be the new owner, and any subsequent
times the owner exits, they're unlikely to have any children so
they'll go straight to the sibling check and pass the mm to the
parent's first child.
Unless they all exit in strict sibling order and hence pass the mm
along the chain one by one, we should be fine. And if that exit
ordering does turn out to be common, then simply walking the child and
sibling lists in reverse order to find a victim will minimize the
amount of passing.
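For concreteness, a sketch of that reverse-order walk, reusing the lists that
mm_update_next_owner() already scans (untested, just to show the idea):

	/* Sketch: prefer the most recently created child/sibling as new owner */
	list_for_each_entry_reverse(c, &p->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}
	list_for_each_entry_reverse(c, &p->parent->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}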
One other thing occurred to me - what lock protects the child and
sibling links? I don't see any documentation anywhere, but from the
code it looks as though it's tasklist_lock rather than RCU - so maybe
we should be holding that with a read_lock(), at least for the first
two parts of the search? (The full thread search is RCU-safe).
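Something like this, perhaps (again only a sketch, and it assumes tasklist_lock
really is what protects those lists):

	read_lock(&tasklist_lock);
	list_for_each_entry(c, &p->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}
	list_for_each_entry(c, &p->parent->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}
	read_unlock(&tasklist_lock);
	/* fall through to the RCU-protected full thread search as before */

assign_new_owner:
	get_task_struct(c);
	/* with a reference held, the lock can be dropped before task_lock(c) */
	read_unlock(&tasklist_lock);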
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-04 8:50 ` Paul Menage
@ 2008-04-04 9:25 ` Balbir Singh
2008-04-04 19:11 ` Paul Menage
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-04 9:25 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Fri, Apr 4, 2008 at 1:28 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> It won't uncharge for the memory controller from the root cgroup since each page
>> has the mem_cgroup information associated with it.
>
> Right, I realise that the memory controller is OK because of the ref counts.
>
>> For other controllers,
>> they'll need to monitor exit() callbacks to know when the leader is dead :( (sigh).
>
> That sounds like a nightmare ...
>
Yes, it would be, but worth the trouble. Is it really critical to move a dead
cgroup leader to init_css_set in cgroup_exit()?
>> Not having the group leader optimization can introduce big overheads (consider
>> thousands of tasks, with the group leader being the first one to exit).
>
> Can you test the overhead?
>
I probably can write a program and see what the overhead looks like
> As long as we find someone to pass the mm to quickly, it shouldn't be
> too bad - I think we're already optimized for that case. Generally the
> group leader's first child will be the new owner, and any subsequent
> times the owner exits, they're unlikely to have any children so
> they'll go straight to the sibling check and pass the mm to the
> parent's first child.
>
> Unless they all exit in strict sibling order and hence pass the mm
> along the chain one by one, we should be fine. And if that exit
> ordering does turn out to be common, then simply walking the child and
> sibling lists in reverse order to find a victim will minimize the
> amount of passing.
>
Finding the next mm might not be all that bad, but doing it each time a task
exits can be an overhead, especially for large multi-threaded programs. This can
get severe if the new mm->owner belongs to a different cgroup, in which case we
need to use callbacks as well.
If half the threads belonged to a different cgroup and the new mm->owner kept
switching between cgroups, the overhead would be really high, with the callbacks
and the mm->owner changing frequently.
> One other thing occurred to me - what lock protects the child and
> sibling links? I don't see any documentation anywhere, but from the
> code it looks as though it's tasklist_lock rather than RCU - so maybe
> we should be holding that with a read_lock(), at least for the first
> two parts of the search? (The full thread search is RCU-safe).
>
You are right about the read_lock()
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-04 9:25 ` Balbir Singh
@ 2008-04-04 19:11 ` Paul Menage
2008-04-05 14:47 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-04 19:11 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
On Fri, Apr 4, 2008 at 2:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> For other controllers,
> >> they'll need to monitor exit() callbacks to know when the leader is dead :( (sigh).
> >
> > That sounds like a nightmare ...
> >
>
> Yes, it would be, but worth the trouble. Is it really critical to move a dead
> cgroup leader to init_css_set in cgroup_exit()?
It struck me that this whole group leader optimization is broken as it
stands since there could (in strange configurations) be multiple
thread groups sharing the same mm.
I wonder if we can't just delay the exit_mm() call of a group leader
until all its threads have exited?
>
> > As long as we find someone to pass the mm to quickly, it shouldn't be
> > too bad - I think we're already optimized for that case. Generally the
> > group leader's first child will be the new owner, and any subsequent
> > times the owner exits, they're unlikely to have any children so
> > they'll go straight to the sibling check and pass the mm to the
> > parent's first child.
> >
> > Unless they all exit in strict sibling order and hence pass the mm
> > along the chain one by one, we should be fine. And if that exit
> > ordering does turn out to be common, then simply walking the child and
> > sibling lists in reverse order to find a victim will minimize the
> > amount of passing.
> >
>
>
> Finding the next mm might not be all that bad, but doing it each time a task
> exits can be an overhead, especially for large multi-threaded programs.
Right, but we only have that overhead if we actually end up passing
the mm from one to another each time they exit. It would be
interesting to know what order the threads in a large multi-threaded
process exit typically (when the main process exits and all the
threads die).
I guess it's likely to be one of:
- in thread creation order (i.e. in order of parent->children list),
in which case we should try to throw the mm to the parent's last child
- in reverse creation order, in which case we should try to throw the
mm to the parent's first child
- in random order depending on which threads the scheduler runs first
(in which case we can expect that a small fraction of the threads will
have to throw the mm whichever end we start from)
> This can
> get severe if the new mm->owner belongs to a different cgroup, in which case we
> need to use callbacks as well.
>
> If half the threads belonged to a different cgroup and the new mm->owner kept
> switching between cgroups, the overhead would be really high, with the callbacks
> and the mm->owner changing frequently.
To me, it seems that setting up a *virtual address space* cgroup
hierarchy and then putting half your threads in one group and half in
the another is asking for trouble. We need to not break in that
situation, but I'm not sure it's a case to optimize for.
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-04 19:11 ` Paul Menage
@ 2008-04-05 14:47 ` Balbir Singh
2008-04-05 17:23 ` Paul Menage
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-05 14:47 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Fri, Apr 4, 2008 at 2:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> >> For other controllers,
>> >> they'll need to monitor exit() callbacks to know when the leader is dead :( (sigh).
>> >
>> > That sounds like a nightmare ...
>> >
>>
>> Yes, it would be, but worth the trouble. Is it really critical to move a dead
>> cgroup leader to init_css_set in cgroup_exit()?
>
> It struck me that this whole group leader optimization is broken as it
> stands since there could (in strange configurations) be multiple
> thread groups sharing the same mm.
>
> I wonder if we can't just delay the exit_mm() call of a group leader
> until all its threads have exited?
>
I'm not sure about this one. I suspect keeping the group_leader around is an
optimization, and I'm not sure how changing exit_mm() for the group_leader would
impact functionality or standards. It might even break some applications.
Repeating my question earlier
Can we delay setting task->cgroups = &init_css_set for the group_leader, until
all threads have exited? If the user is unable to remove a cgroup node, it will
be for a valid reason: the group_leader is still around because the threads are
still around. The user in that case should wait for notify_on_release.
>> > As long as we find someone to pass the mm to quickly, it shouldn't be
>> > too bad - I think we're already optimized for that case. Generally the
>> > group leader's first child will be the new owner, and any subsequent
>> > times the owner exits, they're unlikely to have any children so
>> > they'll go straight to the sibling check and pass the mm to the
>> > parent's first child.
>> >
>> > Unless they all exit in strict sibling order and hence pass the mm
>> > along the chain one by one, we should be fine. And if that exit
>> > ordering does turn out to be common, then simply walking the child and
>> > sibling lists in reverse order to find a victim will minimize the
>> > amount of passing.
>> >
>>
>>
>> Finding the next mm might not be all that bad, but doing it each time a task
>> exits can be an overhead, especially for large multi-threaded programs.
>
> Right, but we only have that overhead if we actually end up passing
> the mm from one to another each time they exit. It would be
> interesting to know what order the threads in a large multi-threaded
> process exit typically (when the main process exits and all the
> threads die).
>
> I guess it's likely to be one of:
>
> - in thread creation order (i.e. in order of parent->children list),
> in which case we should try to throw the mm to the parent's last child
> - in reverse creation order, in which case we should try to throw the
> mm to the parent's first child
> - in random order depending on which threads the scheduler runs first
> (in which case we can expect that a small fraction of the threads will
> have to throw the mm whichever end we start from)
>
>> This can
>> get severe if the new mm->owner belongs to a different cgroup, in which case we
>> need to use callbacks as well.
>>
>> If half the threads belonged to a different cgroup and the new mm->owner kept
>> switching between cgroups, the overhead would be really high, with the callbacks
>> and the mm->owner changing frequently.
>
> To me, it seems that setting up a *virtual address space* cgroup
> hierarchy and then putting half your threads in one group and half in
> the another is asking for trouble. We need to not break in that
> situation, but I'm not sure it's a case to optimize for.
That could potentially happen, if the virtual address space cgroup and cpu
control cgroup were bound together in the same hierarchy by the sysadmin.
I measured the overhead of removing the delay_group_leader optimization and
found a 4% impact on throughput (with volanomark, that is one of the
multi-threaded benchmarks I know of).
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-05 14:47 ` Balbir Singh
@ 2008-04-05 17:23 ` Paul Menage
2008-04-05 17:48 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-05 17:23 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
On Sat, Apr 5, 2008 at 7:47 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> Repeating my question earlier
>
> Can we delay setting task->cgroups = &init_css_set for the group_leader, until
> all threads have exited?
Potentially, yes. It also might make more sense to move the
exit_cgroup() for all threads to a later point rather than special
case delayed group leaders.
> If the user is unable to remove a cgroup node, it will
> be for a valid reason: the group_leader is still around because the threads are
> still around. The user in that case should wait for notify_on_release.
>
> >
> > To me, it seems that setting up a *virtual address space* cgroup
> > hierarchy and then putting half your threads in one group and half in
> > the another is asking for trouble. We need to not break in that
> > situation, but I'm not sure it's a case to optimize for.
>
> That could potentially happen, if the virtual address space cgroup and cpu
> control cgroup were bound together in the same hierarchy by the sysadmin.
Yes, I agree it could potentially happen. But it seems like a strange
thing to do if you're planning to be not have the same groupings for
cpu and va.
>
> I measured the overhead of removing the delay_group_leader optimization and
> found a 4% impact on throughput (with volanomark, that is one of the
> multi-threaded benchmarks I know of).
Interesting, I thought (although I've never actually looked at the
code) that volanomark was more of a scheduling benchmark than a
process start/exit benchmark. How frequently does it have processes
(not threads) exiting?
How many runs was that over? Ingo's recently posted volanomark tests
against -rc7 showed ~3% random variation between runs.
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-05 17:23 ` Paul Menage
@ 2008-04-05 17:48 ` Balbir Singh
2008-04-05 17:57 ` Paul Menage
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-05 17:48 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Sat, Apr 5, 2008 at 7:47 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> Repeating my question earlier
>>
>> Can we delay setting task->cgroups = &init_css_set for the group_leader, until
>> all threads have exited?
>
> Potentially, yes. It also might make more sense to move the
> exit_cgroup() for all threads to a later point rather than special
> case delayed group leaders.
>
Yes, that makes sense. I think that patch should be independent of this one
though? What do you think?
>> If the user is unable to remove a cgroup node, it will
>> be for a valid reason: the group_leader is still around because the threads are
>> still around. The user in that case should wait for notify_on_release.
>>
>> >
>> > To me, it seems that setting up a *virtual address space* cgroup
>> > hierarchy and then putting half your threads in one group and half in
>> > the another is asking for trouble. We need to not break in that
>> > situation, but I'm not sure it's a case to optimize for.
>>
>> That could potentially happen, if the virtual address space cgroup and cpu
>> control cgroup were bound together in the same hierarchy by the sysadmin.
>
> Yes, I agree it could potentially happen. But it seems like a strange
> thing to do if you're planning to be not have the same groupings for
> cpu and va.
>
It's easier to set it up that way. Usually the end user gets the same SLA for
memory, CPU and other resources, so it makes sense to bind the controllers together.
>> I measured the overhead of removing the delay_group_leader optimization and
>> found a 4% impact on throughput (with volanomark, that is one of the
>> multi-threaded benchmarks I know of).
>
> Interesting, I thought (although I've never actually looked at the
> code) that volanomark was more of a scheduling benchmark than a
> process start/exit benchmark. How frequently does it have processes
> (not threads) exiting?
>
I could not find any other interesting benchmark for benchmarking fork/exits. I
know that volanomark is heavily threaded, so I used it. The threads quickly exit
after processing the messages, I thought that would be a good test to see the
overhead.
> How many runs was that over? Ingo's recently posted volanomark tests
> against -rc7 showed ~3% random variation between runs.
I ran the test four times and took the average. I did see some variation
between runs, but I did not calculate the standard deviation.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-05 17:48 ` Balbir Singh
@ 2008-04-05 17:57 ` Paul Menage
2008-04-05 18:59 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-05 17:57 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
On Sat, Apr 5, 2008 at 10:48 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Paul Menage wrote:
> > On Sat, Apr 5, 2008 at 7:47 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> Repeating my question earlier
> >>
> >> Can we delay setting task->cgroups = &init_css_set for the group_leader, until
> >> all threads have exited?
> >
> > Potentially, yes. It also might make more sense to move the
> > exit_cgroup() for all threads to a later point rather than special
> > case delayed group leaders.
> >
>
> Yes, that makes sense. I think that patch should be independent of this one
> though? What do you think?
Yes, it would probably need to be a separate patch. The current
positioning of cgroup_exit() is more or less inherited from cpusets.
I'd need to figure out if a change like that would break anything.
> >
> > Yes, I agree it could potentially happen. But it seems like a strange
> > thing to do if you're planning to be not have the same groupings for
> > cpu and va.
>
> It's easier to set it up that way. Usually the end user gets the same SLA for
> memory, CPU and other resources, so it makes sense to bind the controllers together.
>
True - but in that case why wouldn't they have the same SLA for
virtual address space too?
>
> >> I measured the overhead of removing the delay_group_leader optimization and
> >> found a 4% impact on throughput (with volanomark, that is one of the
> >> multi-threaded benchmarks I know of).
> >
> > Interesting, I thought (although I've never actually looked at the
> > code) that volanomark was more of a scheduling benchmark than a
> > process start/exit benchmark. How frequently does it have processes
> > (not threads) exiting?
> >
>
> I could not find any other interesting benchmark for benchmarking fork/exits. I
> know that volanomark is heavily threaded, so I used it. The threads quickly exit
> after processing the messages, I thought that would be a good test to see the
> overhead.
But surely the performance of thread exits wouldn't be affected by the
delay_group_leader(p) change, since none of the exiting threads would
be a group leader. That optimization only matters when the entire
process exits.
Does oprofile show any interesting differences?
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-05 17:57 ` Paul Menage
@ 2008-04-05 18:59 ` Balbir Singh
2008-04-05 23:29 ` Paul Menage
2008-04-05 23:31 ` Paul Menage
0 siblings, 2 replies; 26+ messages in thread
From: Balbir Singh @ 2008-04-05 18:59 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
lizf, linux-kernel, taka, linux-mm, David Rientjes,
Andrew Morton, KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Sat, Apr 5, 2008 at 10:48 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> Paul Menage wrote:
>> > On Sat, Apr 5, 2008 at 7:47 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> >> Repeating my question earlier
>> >>
>> >> Can we delay setting task->cgroups = &init_css_set for the group_leader, until
>> >> all threads have exited?
>> >
>> > Potentially, yes. It also might make more sense to move the
>> > exit_cgroup() for all threads to a later point rather than special
>> > case delayed group leaders.
>> >
>>
>> Yes, that makes sense. I think that patch should be independent of this one
>> though? What do you think?
>
> Yes, it would probably need to be a separate patch. The current
> positioning of cgroup_exit() is more or less inherited from cpusets.
> I'd need to figure out if a change like that would break anything.
>
Yes, that's understandable.
>> >
>> > Yes, I agree it could potentially happen. But it seems like a strange
>> > thing to do if you're planning to be not have the same groupings for
>> > cpu and va.
>>
>> It's easier to set it up that way. Usually the end user gets the same SLA for
>> memory, CPU and other resources, so it makes sense to bind the controllers together.
>>
>
> True - but in that case why wouldn't they have the same SLA for
> virtual address space too?
>
Yes, mostly. That's why I made the virtual address space patches a config
option on top of the memory controller :)
>> >> I measured the overhead of removing the delay_group_leader optimization and
>> >> found a 4% impact on throughput (with volanomark, that is one of the
>> >> multi-threaded benchmarks I know of).
>> >
>> > Interesting, I thought (although I've never actually looked at the
>> > code) that volanomark was more of a scheduling benchmark than a
>> > process start/exit benchmark. How frequently does it have processes
>> > (not threads) exiting?
>> >
>>
>> I could not find any other interesting benchmark for benchmarking fork/exits. I
>> know that volanomark is heavily threaded, so I used it. The threads quickly exit
>> after processing the messages, I thought that would be a good test to see the
>> overhead.
>
> But surely the performance of thread exits wouldn't be affected by the
> delay_group_leader(p) change, since none of the exiting threads would
> be a group leader. That optimization only matters when the entire
> process exits.
>
On the client side, each JVM instance exits after the test. I see the thread
group leader exit as well through getdelays (I see TGID exits).
> Does oprofile show any interesting differences?
I need to try oprofile.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-05 18:59 ` Balbir Singh
@ 2008-04-05 23:29 ` Paul Menage
2008-04-06 5:38 ` Balbir Singh
2008-04-05 23:31 ` Paul Menage
1 sibling, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-05 23:29 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
On Sat, Apr 5, 2008 at 11:59 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > But surely the performance of thread exits wouldn't be affected by the
> > delay_group_leader(p) change, since none of the exiting threads would
> > be a group leader. That optimization only matters when the entire
> > process exits.
> >
>
> On the client side, each JVM instance exits after the test. I see the thread
> group leader exit as well through getdelays (I see TGID exits).
How long does the test run for? How many threads does each client have?
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-05 23:29 ` Paul Menage
@ 2008-04-06 5:38 ` Balbir Singh
2008-04-08 6:37 ` Paul Menage
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-06 5:38 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Sat, Apr 5, 2008 at 11:59 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> > But surely the performance of thread exits wouldn't be affected by the
>> > delay_group_leader(p) change, since none of the exiting threads would
>> > be a group leader. That optimization only matters when the entire
>> > process exits.
>> >
>>
>> On the client side, each JVM instance exits after the test. I see the thread
>> group leader exit as well through getdelays (I see TGID exits).
>
> How long does the test run for? How many threads does each client have?
The test on each client side runs for about 10 seconds. I saw the client create
up to 411 threads.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-06 5:38 ` Balbir Singh
@ 2008-04-08 6:37 ` Paul Menage
2008-04-08 6:52 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-08 6:37 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
On Sat, Apr 5, 2008 at 10:38 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > How long does the test run for? How many threads does each client have?
>
> The test on each client side runs for about 10 seconds. I saw the client create
> up to 411 threads.
>
I'm not convinced that an application that creates 400 threads and
exits in 10 seconds is particularly representative of a high-performance
application.
But I agree that it's an example of something it may be worth trying
to optimize for.
You mention that you saw tgid exits - what order did the individual
threads exit in? If we threw the mm to the last thread in the thread
group rather than the first, would that help?
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-08 6:37 ` Paul Menage
@ 2008-04-08 6:52 ` Balbir Singh
2008-04-08 6:57 ` Paul Menage
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-08 6:52 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Sat, Apr 5, 2008 at 10:38 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> >
>> > How long does the test run for? How many threads does each client have?
>>
>> The test on each client side runs for about 10 seconds. I saw the client create
>> up to 411 threads.
>>
>
> I'm not convinced that an application that creates 400 threads and
> exits in 10 seconds is particular representative of a high-performance
> application.
>
I agree, but like I said earlier, this was the most easily available ready-made
application I found. Do you know of any other highly threaded microbenchmark?
> But I agree that it's an example of something it may be worth trying
> to optimize for.
>
> You mention that you saw tgid exits - what order did the individual
> threads exit in? If we threw the mm to the last thread in the thread
> group rather than the first, would that help?
The order was different each time. With that many threads all exiting at once,
all running in parallel, I don't think we can rely on or predict the order in
which the threads exit.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-08 6:52 ` Balbir Singh
@ 2008-04-08 6:57 ` Paul Menage
2008-04-08 7:05 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-08 6:57 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
On Mon, Apr 7, 2008 at 11:52 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> I agree, but like I said earlier, this was the easily available ready made
> application I found. Do you know of any other highly threaded micro benchmark?
>
How about a simple program that creates N threads that just sleep,
then has the main thread exit?
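Something like the following minimal sketch, say (POSIX threads; NTHREADS
and the sleep interval are just illustrative values). The main thread exits
with pthread_exit(), so the sleeping threads keep the process and its mm
alive and, with the mm->owner patch, ownership has to be passed to one of them:

/* thread_exit_test.c -- build with: gcc -O2 -pthread thread_exit_test.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 400

static void *sleeper(void *arg)
{
	sleep(30);		/* keep the mm in use after the leader exits */
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int i;

	for (i = 0; i < NTHREADS; i++) {
		if (pthread_create(&tid, NULL, sleeper, NULL)) {
			perror("pthread_create");
			exit(1);
		}
	}
	/* Exit only the main thread (the thread group leader); the
	 * sleepers keep running, so mm->owner must be reassigned. */
	pthread_exit(NULL);
}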
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-08 6:57 ` Paul Menage
@ 2008-04-08 7:05 ` Balbir Singh
2008-04-08 7:29 ` Paul Menage
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-08 7:05 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Mon, Apr 7, 2008 at 11:52 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> I agree, but like I said earlier, this was the easily available ready made
>> application I found. Do you know of any other highly threaded micro benchmark?
>>
>
> How about a simple program that creates N threads that just sleep,
> then has the main thread exit?
>
That is not really representative of anything. I have that program handy. How do
we measure the impact on throughput?
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-08 7:05 ` Balbir Singh
@ 2008-04-08 7:29 ` Paul Menage
2008-04-10 9:09 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-08 7:29 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
On Tue, Apr 8, 2008 at 12:05 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> Paul Menage wrote:
> > On Mon, Apr 7, 2008 at 11:52 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> I agree, but like I said earlier, this was the easily available ready made
> >> application I found. Do you know of any other highly threaded micro benchmark?
> >>
> >
> > How about a simple program that creates N threads that just sleep,
> > then has the main thread exit?
> >
>
> That is not really representative of anything. I have that program handy. How do
> we measure the impact on throughput?
It's very representative of how much additional overhead in terms of
mm->owner churn there is in a large multi-threaded application
exiting, which is the thing that you're trying to optimize with the
delayed thread group leader checks.
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-08 7:29 ` Paul Menage
@ 2008-04-10 9:09 ` Balbir Singh
0 siblings, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2008-04-10 9:09 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Tue, Apr 8, 2008 at 12:05 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> Paul Menage wrote:
>> > On Mon, Apr 7, 2008 at 11:52 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> >> I agree, but like I said earlier, this was the easily available ready made
>> >> application I found. Do you know of any other highly threaded micro benchmark?
>> >>
>> >
>> > How about a simple program that creates N threads that just sleep,
>> > then has the main thread exit?
>> >
>>
>> That is not really representative of anything. I have that program handy. How do
>> we measure the impact on throughput?
>
> It's very representative of how much additional overhead in terms of
> mm->owner churn there is in a large multi-threaded application
> exiting, which is the thing that you're trying to optimize with the
> delayed thread group leader checks.
>
I see almost no overhead after the notification change optimization (notify only
when the old and new owners belong to different cgroups).
My program creates n processes with k threads each and forces the thread group
leader to exit. For my experiment I created 10 processes with 800 threads each
(NOTE: you need to change ulimit -s for this to work).
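Not the exact program I used, but a sketch along those lines; NPROCS, NTHREADS
and the reduced per-thread stack size are illustrative, with the small stacks
standing in for the ulimit -s change mentioned above:

/* owner_stress.c -- build with: gcc -O2 -pthread owner_stress.c */
#include <pthread.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS   10
#define NTHREADS 800

static void *sleeper(void *arg)
{
	sleep(10);
	return NULL;
}

static void run_child(void)
{
	pthread_attr_t attr;
	pthread_t tid;
	int i;

	/* Shrink thread stacks so NTHREADS threads fit comfortably. */
	pthread_attr_init(&attr);
	pthread_attr_setstacksize(&attr, 256 * 1024);

	for (i = 0; i < NTHREADS; i++)
		if (pthread_create(&tid, &attr, sleeper, NULL))
			exit(1);

	/* The thread group leader exits while the other threads still
	 * share the mm, so mm->owner has to be handed over in each child. */
	pthread_exit(NULL);
}

int main(void)
{
	int i;

	for (i = 0; i < NPROCS; i++)
		if (fork() == 0)
			run_child();

	while (wait(NULL) > 0)	/* reap the children as they finish */
		;
	return 0;
}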
I am going to remove the delay_group_leader() optimization and submit v9.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-05 18:59 ` Balbir Singh
2008-04-05 23:29 ` Paul Menage
@ 2008-04-05 23:31 ` Paul Menage
2008-04-06 6:31 ` Balbir Singh
1 sibling, 1 reply; 26+ messages in thread
From: Paul Menage @ 2008-04-05 23:31 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
On Sat, Apr 5, 2008 at 11:59 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> It's easier to set it up that way. Usually the end user gets the same SLA for
> >> memory, CPU and other resources, so it makes sense to bind the controllers together.
> >>
> >
> > True - but in that case why wouldn't they have the same SLA for
> > virtual address space too?
> >
>
> Yes, mostly. That's why I had made the virtual address space patches as a config
> option on top of the memory controller :)
>
*If* they want to use the virtual address space controller, that is.
By that argument, you should make the memory and cpu controllers the
same controller, since in your scenario they'll usually be used
together..
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-05 23:31 ` Paul Menage
@ 2008-04-06 6:31 ` Balbir Singh
2008-04-08 6:32 ` Paul Menage
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-06 6:31 UTC (permalink / raw)
To: Paul Menage
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
Paul Menage wrote:
> On Sat, Apr 5, 2008 at 11:59 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> >> It's easier to set it up that way. Usually the end user gets the same SLA for
>> >> memory, CPU and other resources, so it makes sense to bind the controllers together.
>> >>
>> >
>> > True - but in that case why wouldn't they have the same SLA for
>> > virtual address space too?
>> >
>>
>> Yes, mostly. That's why I had made the virtual address space patches as a config
>> option on top of the memory controller :)
>>
>
> *If* they want to use the virtual address space controller, that is.
>
> By that argument, you should make the memory and cpu controllers the
> same controller, since in your scenario they'll usually be used
> together..
Heh, virtual address space and memory are more closely interlinked than CPU and memory.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-06 6:31 ` Balbir Singh
@ 2008-04-08 6:32 ` Paul Menage
0 siblings, 0 replies; 26+ messages in thread
From: Paul Menage @ 2008-04-08 6:32 UTC (permalink / raw)
To: balbir
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
linux-kernel, taka, linux-mm, David Rientjes, Andrew Morton,
KAMEZAWA Hiroyuki
On Sat, Apr 5, 2008 at 11:31 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > *If* they want to use the virtual address space controller, that is.
> >
> > By that argument, you should make the memory and cpu controllers the
> > same controller, since in your scenario they'll usually be used
> > together..
>
> Heh, Virtual address and memory are more closely interlinked than CPU and Memory.
If you consider virtual address space limits a useful way to limit
swap usage, that's true.
But if you don't, then memory and CPU are more closely linked since
they represent real resource usage, whereas virtual address space is a
more abstract quantity.
Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-04 8:05 [-mm] Add an owner to the mm_struct (v8) Balbir Singh
2008-04-04 8:12 ` Paul Menage
@ 2008-04-07 22:09 ` Andrew Morton
2008-04-08 2:39 ` Balbir Singh
2008-04-09 0:42 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2008-04-07 22:09 UTC (permalink / raw)
Cc: menage, xemul, hugh, skumar, yamamoto, lizf, linux-kernel, taka,
linux-mm, rientjes, balbir, kamezawa.hiroyu
On Fri, 04 Apr 2008 13:35:44 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 1. Add mm->owner change callbacks using cgroups
>
> ...
>
> +config MM_OWNER
> + bool "Enable ownership of mm structure"
> + help
> + This option enables mm_struct's to have an owner. The advantage
> + of this approach is that it allows for several independent memory
> + based cgroup controllers to co-exist independently without too
> + much space overhead
> +
> + This feature adds fork/exit overhead. So enable this only if
> + you need resource controllers
Do we really want to offer this option to people? It's rather a low-level
thing and it's likely to cause more confusion than it's worth. Remember
that most kernels get to our users via kernel vendors - to what will they
be setting this config option?
> config CGROUP_MEM_RES_CTLR
> bool "Memory Resource Controller for Control Groups"
> depends on CGROUPS && RESOURCE_COUNTERS
> + select MM_OWNER
Presumably they'll always be setting it to "y" if they are enabling cgroups
at all.
> --- linux-2.6.25-rc8/kernel/cgroup.c~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
> +++ linux-2.6.25-rc8-balbir/kernel/cgroup.c 2008-04-03 22:43:27.000000000 +0530
> @@ -118,6 +118,7 @@ static int root_count;
> * be called.
> */
> static int need_forkexit_callback;
> +static int need_mm_owner_callback;
I suppose these should be __read_mostly.
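Something like the following sketch, keeping the variable names from the hunk
above; __read_mostly groups these flags into the read-mostly data section so
their readers don't share cache lines with frequently written data:

static int need_forkexit_callback __read_mostly;
static int need_mm_owner_callback __read_mostly;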
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-07 22:09 ` Andrew Morton
@ 2008-04-08 2:39 ` Balbir Singh
2008-04-08 2:55 ` Andrew Morton
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-04-08 2:39 UTC (permalink / raw)
To: Andrew Morton
Cc: menage, xemul, hugh, skumar, yamamoto, lizf, linux-kernel, taka,
linux-mm, rientjes, kamezawa.hiroyu
Andrew Morton wrote:
> On Fri, 04 Apr 2008 13:35:44 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> 1. Add mm->owner change callbacks using cgroups
>>
>> ...
>>
>> +config MM_OWNER
>> + bool "Enable ownership of mm structure"
>> + help
>> + This option enables mm_struct's to have an owner. The advantage
>> + of this approach is that it allows for several independent memory
>> + based cgroup controllers to co-exist independently without too
>> + much space overhead
>> +
>> + This feature adds fork/exit overhead. So enable this only if
>> + you need resource controllers
>
> Do we really want to offer this option to people? It's rather a low-level
> thing and it's likely to cause more confusion than it's worth. Remember
> that most kernels get to our users via kernel vendors - to what will they
> be setting this config option?
>
I suspect that this kernel option will not be explicitly set by users. This option
will be selected by other config options (memory controller, swap namespace,
revoke*).
>> config CGROUP_MEM_RES_CTLR
>> bool "Memory Resource Controller for Control Groups"
>> depends on CGROUPS && RESOURCE_COUNTERS
>> + select MM_OWNER
>
> Presumably they'll always be setting it to "y" if they are enabling cgroups
> at all.
>
>> --- linux-2.6.25-rc8/kernel/cgroup.c~memory-controller-add-mm-owner 2008-04-03 22:43:27.000000000 +0530
>> +++ linux-2.6.25-rc8-balbir/kernel/cgroup.c 2008-04-03 22:43:27.000000000 +0530
>> @@ -118,6 +118,7 @@ static int root_count;
>> * be called.
>> */
>> static int need_forkexit_callback;
>> +static int need_mm_owner_callback;
>
> I suppose these should be __read_mostly.
>
Yes, good point. I'll send out v9 with this fix.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-08 2:39 ` Balbir Singh
@ 2008-04-08 2:55 ` Andrew Morton
0 siblings, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2008-04-08 2:55 UTC (permalink / raw)
To: balbir
Cc: menage, xemul, hugh, skumar, yamamoto, lizf, linux-kernel, taka,
linux-mm, rientjes, kamezawa.hiroyu
On Tue, 08 Apr 2008 08:09:57 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Andrew Morton wrote:
> > On Fri, 04 Apr 2008 13:35:44 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> >> 1. Add mm->owner change callbacks using cgroups
> >>
> >> ...
> >>
> >> +config MM_OWNER
> >> + bool "Enable ownership of mm structure"
> >> + help
> >> + This option enables mm_struct's to have an owner. The advantage
> >> + of this approach is that it allows for several independent memory
> >> + based cgroup controllers to co-exist independently without too
> >> + much space overhead
> >> +
> >> + This feature adds fork/exit overhead. So enable this only if
> >> + you need resource controllers
> >
> > Do we really want to offer this option to people? It's rather a low-level
> > thing and it's likely to cause more confusion than it's worth. Remember
> > that most kernels get to our users via kernel vendors - to what will they
> > be setting this config option?
> >
>
> I suspect that this kernel option will not be explicitly set it. This option
> will be selected by other config options (memory controller, swap namespace,
> revoke*)
I believe that the way to do this is to not give the option a `help'
section. That makes it a Kconfig-internal-only thing.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [-mm] Add an owner to the mm_struct (v8)
2008-04-04 8:05 [-mm] Add an owner to the mm_struct (v8) Balbir Singh
2008-04-04 8:12 ` Paul Menage
2008-04-07 22:09 ` Andrew Morton
@ 2008-04-09 0:42 ` KAMEZAWA Hiroyuki
2 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-09 0:42 UTC (permalink / raw)
To: Balbir Singh
Cc: Paul Menage, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
David Rientjes, Andrew Morton
On Fri, 04 Apr 2008 13:35:44 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> + /*
> + * Search through everything else. We should not get
> + * here often
> + */
> + do_each_thread(g, c) {
> + if (c->mm == mm)
> + goto assign_new_owner;
> + } while_each_thread(g, c);
> +
I'm sorry for my laziness.
Why do_each_thread()? Is for_each_process() not enough?
(Because of delay_group_leader()?)
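A fragment sketching the two variants, following the quoted hunk;
for_each_process() visits only thread group leaders, so it could miss a
process whose leader has already exited while its remaining threads still
use the mm:

	/* Leaders only: */
	for_each_process(c) {
		if (c->mm == mm)
			goto assign_new_owner;
	}

	/* Every thread, as in the patch: */
	do_each_thread(g, c) {
		if (c->mm == mm)
			goto assign_new_owner;
	} while_each_thread(g, c);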
And what we have to test for the worst case is the following, right?
==
1. create tons of threads.
2. create a process which calls vfork().
3. keep the child alive while the vfork() caller exits
==
Thanks,
-Kame
^ permalink raw reply [flat|nested] 26+ messages in thread
End of thread (newest message: ~2008-04-10 9:11 UTC)
Thread overview: 26+ messages
2008-04-04 8:05 [-mm] Add an owner to the mm_struct (v8) Balbir Singh
2008-04-04 8:12 ` Paul Menage
2008-04-04 8:28 ` Balbir Singh
2008-04-04 8:50 ` Paul Menage
2008-04-04 9:25 ` Balbir Singh
2008-04-04 19:11 ` Paul Menage
2008-04-05 14:47 ` Balbir Singh
2008-04-05 17:23 ` Paul Menage
2008-04-05 17:48 ` Balbir Singh
2008-04-05 17:57 ` Paul Menage
2008-04-05 18:59 ` Balbir Singh
2008-04-05 23:29 ` Paul Menage
2008-04-06 5:38 ` Balbir Singh
2008-04-08 6:37 ` Paul Menage
2008-04-08 6:52 ` Balbir Singh
2008-04-08 6:57 ` Paul Menage
2008-04-08 7:05 ` Balbir Singh
2008-04-08 7:29 ` Paul Menage
2008-04-10 9:09 ` Balbir Singh
2008-04-05 23:31 ` Paul Menage
2008-04-06 6:31 ` Balbir Singh
2008-04-08 6:32 ` Paul Menage
2008-04-07 22:09 ` Andrew Morton
2008-04-08 2:39 ` Balbir Singh
2008-04-08 2:55 ` Andrew Morton
2008-04-09 0:42 ` KAMEZAWA Hiroyuki