* [RFC][PATCH] Per-cgroup OOM handler
From: Ying Han @ 2008-11-03 21:40 UTC
To: linux-mm; +Cc: Rohit Seth, Paul Menage, David Rientjes
Per-cgroup OOM handler, ported from cpusets to cgroups.
The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
again; alternatively, userspace can let the OOM killer go ahead as normal.
It's a standalone subsystem that can work with either the memory cgroup or
with cpusets (where memory is constrained by NUMA nodes).
The features are:
- an oom.delay file that controls how long a thread will pause in the
OOM killer waiting for a response from userspace (in milliseconds)
- an oom.await file that a userspace handler can write a timeout value
to; the writer is woken either when a process in that cgroup enters the
OOM killer or when the timeout expires.
example:
(mount oom as a normal cgroup subsystem alongside cpuset)
1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
(configure the sample cpuset to contain a single fake-NUMA node with
128M and one CPU core)
2. mkdir /dev/cpuset/sample
   echo 1 > /dev/cpuset/sample/cpuset.mems
   echo 1 > /dev/cpuset/sample/cpuset.cpus
(set oom.delay to 10 seconds)
3. echo 10000 > /dev/cpuset/sample/oom.delay
(put the shell on the wait queue, waiting at most 60 seconds)
4. echo 60000 > /dev/cpuset/sample/oom.await
(trigger the OOM by mlockall()ing 600M of anonymous memory)
5. /oom 600000000
When the sample cpuset triggers the OOM, it will wake up the
OOM-handler thread that slept in step 4, sleep for a jiffy, and then
return to alloc_pages() to try again. This sleep gives the OOM handler
time to deal with the OOM, for example by giving another memory node
to the OOMing cpuset.
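For reference, a minimal userspace handler built on these two files might
look like the sketch below. The loop structure and the response of adding
node 2 to the cpuset are illustrative assumptions, not part of the patch:

	#!/bin/sh
	# Hypothetical handler loop for the sample cpuset. A successful
	# write to oom.await means an OOM occurred; a failed write means
	# the timeout expired (-ETIMEDOUT) or the cgroup went away (-ENODEV).
	while true; do
		if echo 60000 > /dev/cpuset/sample/oom.await; then
			# Woken by an OOM: give the cpuset another memory
			# node before oom.delay expires and the kill proceeds.
			echo 1-2 > /dev/cpuset/sample/cpuset.mems
		fi
	done

The /oom test program used in step 5 is assumed to be something like:

	/* oom.c - hypothetical test: fault in and lock N bytes of
	 * anonymous memory, forcing an OOM inside a small cpuset. */
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main(int argc, char **argv)
	{
		size_t size;
		char *p;

		if (argc < 2)
			return 1;
		size = strtoul(argv[1], NULL, 10);
		p = malloc(size);
		if (!p)
			return 1;
		mlockall(MCL_CURRENT | MCL_FUTURE);
		memset(p, 0, size);	/* touch every page */
		pause();		/* hold on to the memory */
		return 0;
	}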
We're sending out this in-house patch to start discussion about what
might be appropriate for supporting user-space OOM-handling in the
mainline kernel. Potential improvements include:
- providing more information in the OOM notification, such as the pid
that triggered the OOM, and a unique id for that OOM instance that can
be tied to later OOM-kill notifications.
- allowing better notifications from userspace back to the kernel.
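For instance, the richer notification could carry a record like the
following hypothetical structure (for discussion only, not implemented
here) that the handler reads instead of getting a bare wakeup:

	/* hypothetical richer OOM notification */
	struct oom_notification {
		pid_t pid;	/* task that triggered the OOM */
		u64 oom_id;	/* unique per-OOM id, matched against any
				 * later OOM-kill notification */
	};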
Documentation/cgroups/oom-handler.txt | 49 ++++++++
include/linux/cgroup_subsys.h | 12 ++
include/linux/cpuset.h | 7 +-
init/Kconfig | 8 ++
kernel/cpuset.c | 8 +-
mm/oom_kill.c | 220 +++++++++++++++++++++++++++++++++
6 files changed, 301 insertions(+), 3 deletions(-)
diff --git a/Documentation/cgroups/oom-handler.txt b/Documentation/cgroups/oom-handler.txt
new file mode 100644
index 0000000..aa006fe
--- /dev/null
+++ b/Documentation/cgroups/oom-handler.txt
@@ -0,0 +1,49 @@
+The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
+the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
+again; alternatively, userspace can let the OOM killer go ahead as normal.
+
+It's a standalone subsystem that can work with either the memory cgroup or
+with cpusets (where memory is constrained by NUMA nodes).
+
+The features are:
+
+- an oom.delay file that controls how long a thread will pause in the
+OOM killer waiting for a response from userspace (in milliseconds)
+
+- an oom.await file that a userspace handler can write a timeout value
+to; the writer is woken either when a process in that cgroup enters the
+OOM killer or when the timeout expires.
+
+example:
+(mount oom as a normal cgroup subsystem alongside cpuset)
+1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
+
+(configure the sample cpuset to contain a single fake-NUMA node with
+128M and one CPU core)
+2. mkdir /dev/cpuset/sample
+   echo 1 > /dev/cpuset/sample/cpuset.mems
+   echo 1 > /dev/cpuset/sample/cpuset.cpus
+
+(set oom.delay to 10 seconds)
+3. echo 10000 > /dev/cpuset/sample/oom.delay
+
+(put the shell on the wait queue, waiting at most 60 seconds)
+4. echo 60000 > /dev/cpuset/sample/oom.await
+
+(trigger the OOM by mlockall()ing 600M of anonymous memory)
+5. /oom 600000000
+
+When the sample cpuset triggers the OOM, it will wake up the
+OOM-handler thread that slept in step 4, sleep for a jiffy, and then
+return to alloc_pages() to try again. This sleep gives the OOM handler
+time to deal with the OOM, for example by giving another memory node
+to the OOMing cpuset.
+
+Potential improvements include:
+- providing more information in the OOM notification, such as the pid
+that triggered the OOM, and a unique id for that OOM instance that can
+be tied to later OOM-kill notifications.
+
+- allowing better notifications from userspace back to the kernel.
+
+
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c22396..1e63bd5 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -54,3 +54,15 @@ SUBSYS(freezer)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+SUBSYS(oom_cgroup)
+#endif
+
+/* */
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+SUBSYS(oom_cgroup)
+#endif
+
+/* */
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2691926..26dab22 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -25,7 +25,7 @@ extern void cpuset_cpus_allowed_locked(struct task_struct *p, cpumask_t *mask);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
-void cpuset_update_task_memory_state(void);
+int cpuset_update_task_memory_state(void);
int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +103,10 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}
-static inline void cpuset_update_task_memory_state(void) {}
+static inline int cpuset_update_task_memory_state(void)
+{
+ return 1;
+}
static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
{
diff --git a/init/Kconfig b/init/Kconfig
index 44e9208..971b0b5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -324,6 +324,14 @@ config CPUSETS
Say N if unsure.
+config CGROUP_OOM_CONT
+ bool "OOM controller for cgroups"
+ depends on CGROUPS
+ help
+ This option allows userspace to trap OOM conditions on a
+ per-cgroup basis, and take action that might prevent the OOM from
+ occurring.
+
#
# Architectures with an unreliable sched_clock() should select this:
#
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3e00526..c986423 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -355,13 +355,17 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
* within the tasks context, when it is trying to allocate memory
* (in various mm/mempolicy.c routines) and notices that some other
* task has been modifying its cpuset.
+ *
+ * Returns non-zero if the state was updated, including when it is
+ * an effective no-op.
*/
-void cpuset_update_task_memory_state(void)
+int cpuset_update_task_memory_state(void)
{
int my_cpusets_mem_gen;
struct task_struct *tsk = current;
struct cpuset *cs;
+ int ret = 0;
if (task_cs(tsk) == &top_cpuset) {
/* Don't need rcu for top_cpuset. It's never freed. */
@@ -389,7 +393,9 @@ void cpuset_update_task_memory_state(void)
task_unlock(tsk);
mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
+ ret = 1;
}
+ return ret;
}
/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 64e5b4b..5677b72 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,6 +32,219 @@ int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks;
static DEFINE_SPINLOCK(zone_scan_mutex);
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+struct oom_cgroup {
+ struct cgroup_subsys_state css;
+
+ /* How long between first OOM indication and actual OOM kill
+ * for processes in this cgroup */
+ unsigned long oom_delay;
+
+ /* When the current OOM delay began. Zero means no delay in progress */
+ unsigned long oom_since;
+
+ /* Wait queue for userspace OOM handler */
+ wait_queue_head_t oom_wait;
+
+ spinlock_t oom_lock;
+};
+
+static inline
+struct oom_cgroup *oom_cgroup_from_cont(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont, oom_cgroup_subsys_id),
+ struct oom_cgroup, css);
+}
+
+static inline
+struct oom_cgroup *oom_cgroup_from_task(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, oom_cgroup_subsys_id),
+ struct oom_cgroup, css);
+}
+
+/*
+ * Takes oom_lock during call.
+ */
+static int oom_cgroup_write_delay(struct cgroup *cont, struct cftype *cft,
+ u64 delay)
+{
+ struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
+
+ /* Sanity check */
+ if (unlikely(delay > 60 * 1000))
+ return -EINVAL;
+ spin_lock(&cs->oom_lock);
+ cs->oom_delay = msecs_to_jiffies(delay);
+ spin_unlock(&cs->oom_lock);
+ return 0;
+}
+
+/*
+ * sleeps until the cgroup enters OOM (or a maximum of N milliseconds if N is
+ * passed). Clears the OOM condition in the cgroup when it returns.
+ */
+static int oom_cgroup_write_await(struct cgroup *cont, struct cftype *cft,
+ u64 await)
+{
+ int retval = 0;
+ struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
+
+ /* Don't try to wait for more than a minute */
+ await = min(await, 60ULL * 1000);
+ /* Wait for an OOM condition, up to the requested timeout */
+ wait_event_interruptible_timeout(cs->oom_wait, cs->oom_since ||
+ cgroup_is_removed(cs->css.cgroup),
+ msecs_to_jiffies(await));
+ spin_lock(&cs->oom_lock);
+ if (cgroup_is_removed(cs->css.cgroup)) {
+ /* The cgroup was removed while we slept */
+ retval = -ENODEV;
+ } else if (cs->oom_since) {
+ /* We reached OOM. Clear the OOM condition now that
+ * userspace knows about it */
+ cs->oom_since = 0;
+ } else if (signal_pending(current)) {
+ retval = -EINTR;
+ } else {
+ /* No OOM yet */
+ retval = -ETIMEDOUT;
+ }
+ spin_unlock(&cs->oom_lock);
+ return retval;
+}
+
+static u64 oom_cgroup_read_delay(struct cgroup *cont, struct cftype *cft)
+{
+ return oom_cgroup_from_cont(cont)->oom_delay;
+}
+
+static struct cftype oom_cgroup_files[] = {
+ {
+ .name = "delay",
+ .read_u64 = oom_cgroup_read_delay,
+ .write_u64 = oom_cgroup_write_delay,
+ },
+
+ {
+ .name = "await",
+ .write_u64 = oom_cgroup_write_await,
+ },
+};
+
+static struct cgroup_subsys_state *oom_cgroup_create(
+ struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ struct oom_cgroup *oom;
+
+ oom = kmalloc(sizeof(*oom), GFP_KERNEL);
+ if (!oom)
+ return ERR_PTR(-ENOMEM);
+
+ oom->oom_delay = 0;
+ init_waitqueue_head(&oom->oom_wait);
+ oom->oom_since = 0;
+ spin_lock_init(&oom->oom_lock);
+
+ return &oom->css;
+}
+
+static void oom_cgroup_destroy(struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ kfree(oom_cgroup_from_cont(cont));
+}
+
+static int oom_cgroup_populate(struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, oom_cgroup_files,
+ ARRAY_SIZE(oom_cgroup_files));
+}
+
+struct cgroup_subsys oom_cgroup_subsys = {
+ .name = "oom",
+ .subsys_id = oom_cgroup_subsys_id,
+ .create = oom_cgroup_create,
+ .destroy = oom_cgroup_destroy,
+ .populate = oom_cgroup_populate,
+};
+
+
+/*
+ * Call with no cpuset mutex held. Determines whether this process
+ * should allow an OOM to proceed as normal (retval==1) or should try
+ * again to allocate memory (retval==0). If necessary, sleeps and then
+ * updates the task's mems_allowed to let userspace update the memory
+ * nodes for the task's cpuset.
+ */
+static int cgroup_should_oom(void)
+{
+ int ret = 1; /* OOM by default */
+ struct oom_cgroup *cs;
+
+ task_lock(current);
+ cs = oom_cgroup_from_task(current);
+
+ spin_lock(&cs->oom_lock);
+ if (cs->oom_delay) {
+ /* We have an OOM delay configured */
+ if (cs->oom_since) {
+ /* We're already OOMing - see if we're over
+ * the time limit. Also make sure that jiffy
+ * wrap-around doesn't make us think we're in
+ * an incredibly long OOM delay */
+ unsigned long deadline = cs->oom_since + cs->oom_delay;
+ if (time_after(deadline, jiffies) &&
+ !time_after(cs->oom_since, jiffies)) {
+ /* Not OOM yet */
+ ret = 0;
+ }
+ } else {
+ /* This is the first OOM */
+ ret = 0;
+ cs->oom_since = jiffies;
+ /* Avoid problems with jiffy wrap - make an
+ * oom_since of zero always mean not
+ * OOMing */
+ if (!cs->oom_since)
+ cs->oom_since = 1;
+ printk(KERN_WARNING
+ "Cpuset %s (pid %d) sending memory "
+ "notification to userland at %lu%s\n",
+ cs->css.cgroup->dentry->d_name.name,
+ current->pid, jiffies,
+ waitqueue_active(&cs->oom_wait) ?
+ "" : " (no waiters)");
+ }
+ if (!ret) {
+ /* If we're planning to retry, we should wake
+ * up any userspace waiter in order to let it
+ * handle the OOM
+ */
+ wake_up_all(&cs->oom_wait);
+ }
+ }
+
+ spin_unlock(&cs->oom_lock);
+ task_unlock(current);
+ if (!ret) {
+ /* If we're not going to OOM, we should sleep for a
+ * bit to give userspace a chance to respond before we
+ * go back and try to reclaim again */
+ schedule_timeout_uninterruptible(1);
+ }
+ return ret;
+}
+#else /* !CONFIG_CGROUP_OOM_CONT */
+static inline int cgroup_should_oom(void)
+{
+ return 1;
+}
+#endif
+
/* #define DEBUG */
/**
@@ -526,6 +739,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
unsigned long freed = 0;
enum oom_constraint constraint;
+ /*
+ * It is important to call in this order since cgroup_should_oom()
+ * might sleep and give userspace a chance to update mems.
+ */
+ if (!cgroup_should_oom() || cpuset_update_task_memory_state())
+ return;
+
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
/* Got some memory back in the last second. */
* Re: [RFC][PATCH] Per-cgroup OOM handler
From: Ying Han @ 2008-11-03 22:19 UTC
To: linux-mm; +Cc: Rohit Seth, Paul Menage, David Rientjes
Sorry, please use the following patch instead (it deletes the double
definition in cgroup_subsys.h from the last patch).
Per-cgroup OOM handler, ported from cpusets to cgroups.
The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
again; alternatively, userspace can let the OOM killer go ahead as normal.
It's a standalone subsystem that can work with either the memory cgroup or
with cpusets (where memory is constrained by NUMA nodes).
The features are:
- an oom.delay file that controls how long a thread will pause in the
OOM killer waiting for a response from userspace (in milliseconds)
- an oom.await file that a userspace handler can write a timeout value
to; the writer is woken either when a process in that cgroup enters the
OOM killer or when the timeout expires.
example:
(mount oom as a normal cgroup subsystem alongside cpuset)
1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
(configure the sample cpuset to contain a single fake-NUMA node with
128M and one CPU core)
2. mkdir /dev/cpuset/sample
   echo 1 > /dev/cpuset/sample/cpuset.mems
   echo 1 > /dev/cpuset/sample/cpuset.cpus
(set oom.delay to 10 seconds)
3. echo 10000 > /dev/cpuset/sample/oom.delay
(put the shell on the wait queue, waiting at most 60 seconds)
4. echo 60000 > /dev/cpuset/sample/oom.await
(trigger the OOM by mlockall()ing 600M of anonymous memory)
5. /oom 600000000
When the sample cpuset triggers the OOM, it will wake up the
OOM-handler thread that slept in step 4, sleep for a jiffy, and then
return to alloc_pages() to try again. This sleep gives the OOM handler
time to deal with the OOM, for example by giving another memory node
to the OOMing cpuset.
We're sending out this in-house patch to start discussion about what
might be appropriate for supporting user-space OOM-handling in the
mainline kernel. Potential improvements include:
- providing more information in the OOM notification, such as the pid
that triggered the OOM, and a unique id for that OOM instance that can
be tied to later OOM-kill notifications.
- allowing better notifications from userspace back to the kernel.
Documentation/cgroups/oom-handler.txt | 49 ++++++++
include/linux/cgroup_subsys.h | 12 ++
include/linux/cpuset.h | 7 +-
init/Kconfig | 8 ++
kernel/cpuset.c | 8 +-
mm/oom_kill.c | 220 +++++++++++++++++++++++++++++++++
6 files changed, 301 insertions(+), 3 deletions(-)
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Ying Han <yinghan@google.com>
diff --git a/Documentation/cgroups/oom-handler.txt b/Documentation/cgroups/oom-handler.txt
new file mode 100644
index 0000000..aa006fe
--- /dev/null
+++ b/Documentation/cgroups/oom-handler.txt
@@ -0,0 +1,49 @@
+The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
+the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
+again; alternatively, userspace can let the OOM killer go ahead as normal.
+
+It's a standalone subsystem that can work with either the memory cgroup or
+with cpusets (where memory is constrained by NUMA nodes).
+
+The features are:
+
+- an oom.delay file that controls how long a thread will pause in the
+OOM killer waiting for a response from userspace (in milliseconds)
+
+- an oom.await file that a userspace handler can write a timeout value
+to; the writer is woken either when a process in that cgroup enters the
+OOM killer or when the timeout expires.
+
+example:
+(mount oom as a normal cgroup subsystem alongside cpuset)
+1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
+
+(configure the sample cpuset to contain a single fake-NUMA node with
+128M and one CPU core)
+2. mkdir /dev/cpuset/sample
+   echo 1 > /dev/cpuset/sample/cpuset.mems
+   echo 1 > /dev/cpuset/sample/cpuset.cpus
+
+(set oom.delay to 10 seconds)
+3. echo 10000 > /dev/cpuset/sample/oom.delay
+
+(put the shell on the wait queue, waiting at most 60 seconds)
+4. echo 60000 > /dev/cpuset/sample/oom.await
+
+(trigger the OOM by mlockall()ing 600M of anonymous memory)
+5. /oom 600000000
+
+When the sample cpuset triggers the OOM, it will wake up the
+OOM-handler thread that slept in step 4, sleep for a jiffy, and then
+return to alloc_pages() to try again. This sleep gives the OOM handler
+time to deal with the OOM, for example by giving another memory node
+to the OOMing cpuset.
+
+Potential improvements include:
+- providing more information in the OOM notification, such as the pid
+that triggered the OOM, and a unique id for that OOM instance that can
+be tied to later OOM-kill notifications.
+
+- allowing better notifications from userspace back to the kernel.
+
+
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c22396..23fe6c7 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -54,3 +54,9 @@ SUBSYS(freezer)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+SUBSYS(oom_cgroup)
+#endif
+
+/* */
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2691926..26dab22 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -25,7 +25,7 @@ extern void cpuset_cpus_allowed_locked(struct task_struct *p, cpumask_t *mask);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
-void cpuset_update_task_memory_state(void);
+int cpuset_update_task_memory_state(void);
int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +103,10 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}
-static inline void cpuset_update_task_memory_state(void) {}
+static inline int cpuset_update_task_memory_state(void)
+{
+ return 1;
+}
static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
{
diff --git a/init/Kconfig b/init/Kconfig
index 44e9208..971b0b5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -324,6 +324,14 @@ config CPUSETS
Say N if unsure.
+config CGROUP_OOM_CONT
+ bool "OOM controller for cgroups"
+ depends on CGROUPS
+ help
+ This option allows userspace to trap OOM conditions on a
+ per-cgroup basis, and take action that might prevent the OOM from
+ occurring.
+
#
# Architectures with an unreliable sched_clock() should select this:
#
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3e00526..c986423 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -355,13 +355,17 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
* within the tasks context, when it is trying to allocate memory
* (in various mm/mempolicy.c routines) and notices that some other
* task has been modifying its cpuset.
+ *
+ * Returns non-zero if the state was updated, including when it is
+ * an effective no-op.
*/
-void cpuset_update_task_memory_state(void)
+int cpuset_update_task_memory_state(void)
{
int my_cpusets_mem_gen;
struct task_struct *tsk = current;
struct cpuset *cs;
+ int ret = 0;
if (task_cs(tsk) == &top_cpuset) {
/* Don't need rcu for top_cpuset. It's never freed. */
@@ -389,7 +393,9 @@ void cpuset_update_task_memory_state(void)
task_unlock(tsk);
mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
+ ret = 1;
}
+ return ret;
}
/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 64e5b4b..5677b72 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,6 +32,219 @@ int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks;
static DEFINE_SPINLOCK(zone_scan_mutex);
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+struct oom_cgroup {
+ struct cgroup_subsys_state css;
+
+ /* How long between first OOM indication and actual OOM kill
+ * for processes in this cgroup */
+ unsigned long oom_delay;
+
+ /* When the current OOM delay began. Zero means no delay in progress */
+ unsigned long oom_since;
+
+ /* Wait queue for userspace OOM handler */
+ wait_queue_head_t oom_wait;
+
+ spinlock_t oom_lock;
+};
+
+static inline
+struct oom_cgroup *oom_cgroup_from_cont(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont, oom_cgroup_subsys_id),
+ struct oom_cgroup, css);
+}
+
+static inline
+struct oom_cgroup *oom_cgroup_from_task(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, oom_cgroup_subsys_id),
+ struct oom_cgroup, css);
+}
+
+/*
+ * Takes oom_lock during call.
+ */
+static int oom_cgroup_write_delay(struct cgroup *cont, struct cftype *cft,
+ u64 delay)
+{
+ struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
+
+ /* Sanity check */
+ if (unlikely(delay > 60 * 1000))
+ return -EINVAL;
+ spin_lock(&cs->oom_lock);
+ cs->oom_delay = msecs_to_jiffies(delay);
+ spin_unlock(&cs->oom_lock);
+ return 0;
+}
+
+/*
+ * sleeps until the cgroup enters OOM (or a maximum of N milliseconds if N is
+ * passed). Clears the OOM condition in the cgroup when it returns.
+ */
+static int oom_cgroup_write_await(struct cgroup *cont, struct cftype *cft,
+ u64 await)
+{
+ int retval = 0;
+ struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
+
+ /* Don't try to wait for more than a minute */
+ await = min(await, 60ULL * 1000);
+ /* Wait for an OOM condition, up to the requested timeout */
+ wait_event_interruptible_timeout(cs->oom_wait, cs->oom_since ||
+ cgroup_is_removed(cs->css.cgroup),
+ msecs_to_jiffies(await));
+ spin_lock(&cs->oom_lock);
+ if (cgroup_is_removed(cs->css.cgroup)) {
+ /* The cgroup was removed while we slept */
+ retval = -ENODEV;
+ } else if (cs->oom_since) {
+ /* We reached OOM. Clear the OOM condition now that
+ * userspace knows about it */
+ cs->oom_since = 0;
+ } else if (signal_pending(current)) {
+ retval = -EINTR;
+ } else {
+ /* No OOM yet */
+ retval = -ETIMEDOUT;
+ }
+ spin_unlock(&cs->oom_lock);
+ return retval;
+}
+
+static u64 oom_cgroup_read_delay(struct cgroup *cont, struct cftype *cft)
+{
+ return oom_cgroup_from_cont(cont)->oom_delay;
+}
+
+static struct cftype oom_cgroup_files[] = {
+ {
+ .name = "delay",
+ .read_u64 = oom_cgroup_read_delay,
+ .write_u64 = oom_cgroup_write_delay,
+ },
+
+ {
+ .name = "await",
+ .write_u64 = oom_cgroup_write_await,
+ },
+};
+
+static struct cgroup_subsys_state *oom_cgroup_create(
+ struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ struct oom_cgroup *oom;
+
+ oom = kmalloc(sizeof(*oom), GFP_KERNEL);
+ if (!oom)
+ return ERR_PTR(-ENOMEM);
+
+ oom->oom_delay = 0;
+ init_waitqueue_head(&oom->oom_wait);
+ oom->oom_since = 0;
+ spin_lock_init(&oom->oom_lock);
+
+ return &oom->css;
+}
+
+static void oom_cgroup_destroy(struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ kfree(oom_cgroup_from_cont(cont));
+}
+
+static int oom_cgroup_populate(struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, oom_cgroup_files,
+ ARRAY_SIZE(oom_cgroup_files));
+}
+
+struct cgroup_subsys oom_cgroup_subsys = {
+ .name = "oom",
+ .subsys_id = oom_cgroup_subsys_id,
+ .create = oom_cgroup_create,
+ .destroy = oom_cgroup_destroy,
+ .populate = oom_cgroup_populate,
+};
+
+
+/*
+ * Call with no cpuset mutex held. Determines whether this process
+ * should allow an OOM to proceed as normal (retval==1) or should try
+ * again to allocate memory (retval==0). If necessary, sleeps and then
+ * updates the task's mems_allowed to let userspace update the memory
+ * nodes for the task's cpuset.
+ */
+static int cgroup_should_oom(void)
+{
+ int ret = 1; /* OOM by default */
+ struct oom_cgroup *cs;
+
+ task_lock(current);
+ cs = oom_cgroup_from_task(current);
+
+ spin_lock(&cs->oom_lock);
+ if (cs->oom_delay) {
+ /* We have an OOM delay configured */
+ if (cs->oom_since) {
+ /* We're already OOMing - see if we're over
+ * the time limit. Also make sure that jiffy
+ * wrap-around doesn't make us think we're in
+ * an incredibly long OOM delay */
+ unsigned long deadline = cs->oom_since + cs->oom_delay;
+ if (time_after(deadline, jiffies) &&
+ !time_after(cs->oom_since, jiffies)) {
+ /* Not OOM yet */
+ ret = 0;
+ }
+ } else {
+ /* This is the first OOM */
+ ret = 0;
+ cs->oom_since = jiffies;
+ /* Avoid problems with jiffy wrap - make an
+ * oom_since of zero always mean not
+ * OOMing */
+ if (!cs->oom_since)
+ cs->oom_since = 1;
+ printk(KERN_WARNING
+ "Cpuset %s (pid %d) sending memory "
+ "notification to userland at %lu%s\n",
+ cs->css.cgroup->dentry->d_name.name,
+ current->pid, jiffies,
+ waitqueue_active(&cs->oom_wait) ?
+ "" : " (no waiters)");
+ }
+ if (!ret) {
+ /* If we're planning to retry, we should wake
+ * up any userspace waiter in order to let it
+ * handle the OOM
+ */
+ wake_up_all(&cs->oom_wait);
+ }
+ }
+
+ spin_unlock(&cs->oom_lock);
+ task_unlock(current);
+ if (!ret) {
+ /* If we're not going to OOM, we should sleep for a
+ * bit to give userspace a chance to respond before we
+ * go back and try to reclaim again */
+ schedule_timeout_uninterruptible(1);
+ }
+ return ret;
+}
+#else /* !CONFIG_CGROUP_OOM_CONT */
+static inline int cgroup_should_oom(void)
+{
+ return 1;
+}
+#endif
+
/* #define DEBUG */
/**
@@ -526,6 +739,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
unsigned long freed = 0;
enum oom_constraint constraint;
+ /*
+ * It is important to call in this order since cgroup_should_oom()
+ * might sleep and give userspace a chance to update mems.
+ */
+ if (!cgroup_should_oom() || cpuset_update_task_memory_state())
+ return;
+
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
/* Got some memory back in the last second. */
* Re: [RFC][PATCH] Per-cgroup OOM handler
From: KAMEZAWA Hiroyuki @ 2008-11-06 5:34 UTC
To: Ying Han; +Cc: linux-mm, Rohit Seth, Paul Menage, David Rientjes
Thank you for posting.
On Mon, 3 Nov 2008 14:19:11 -0800
Ying Han <yinghan@google.com> wrote:
> Sorry, please use the following patch instead (it deletes the double
> definition in cgroup_subsys.h from the last patch).
>
> Per-cgroup OOM handler, ported from cpusets to cgroups.
>
> The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
> the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
> again; alternatively, userspace can let the OOM killer go ahead as normal.
>
> It's a standalone subsystem that can work with either the memory cgroup or
> with cpusets (where memory is constrained by NUMA nodes).
>
> The features are:
>
> - an oom.delay file that controls how long a thread will pause in the
> OOM killer waiting for a response from userspace (in milliseconds)
>
> - an oom.await file that a userspace handler can write a timeout value
> to; the writer is woken either when a process in that cgroup enters the
> OOM killer or when the timeout expires.
>
> example:
> (mount oom as a normal cgroup subsystem alongside cpuset)
> 1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
>
> (configure the sample cpuset to contain a single fake-NUMA node with
> 128M and one CPU core)
> 2. mkdir /dev/cpuset/sample
>    echo 1 > /dev/cpuset/sample/cpuset.mems
>    echo 1 > /dev/cpuset/sample/cpuset.cpus
>
> (set oom.delay to 10 seconds)
> 3. echo 10000 > /dev/cpuset/sample/oom.delay
>
> (put the shell on the wait queue, waiting at most 60 seconds)
> 4. echo 60000 > /dev/cpuset/sample/oom.await
>
> (trigger the OOM by mlockall()ing 600M of anonymous memory)
> 5. /oom 600000000
>
> When the sample cpuset triggers the OOM, it will wake up the
> OOM-handler thread that slept in step 4, sleep for a jiffy, and then
> return to alloc_pages() to try again. This sleep gives the OOM handler
> time to deal with the OOM, for example by giving another memory node
> to the OOMing cpuset.
>
Where does this one-tick wait come from? Does it work well?
(Before OOM, the system tends to wait in congestion_wait() or similar.)
Should the OOM handler be in another cpuset, or mlocked, in this case?
> We're sending out this in-house patch to start discussion about what
> might be appropriate for supporting user-space OOM-handling in the
> mainline kernel. Potential improvements include:
>
thanks.
> - providing more information in the OOM notification, such as the pid
> that triggered the OOM, and a unique id for that OOM instance that can
> be tied to later OOM-kill notifications.
With this patch, why the system OOMed is unknown.
I think the following information is necessary at least:
- OOM because of a global memory shortage.
- OOM because of a cpuset memory shortage.
- OOM because a memcg hit its limit.
>
> - allowing better notifications from userspace back to the kernel.
I love some interface allowing poll().
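(For illustration, a handler built on such an interface might look like
the sketch below; the oom.notify file is hypothetical and does not exist
in this patch:)

	/* hypothetical poll()-based wait for a per-cgroup OOM event */
	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>

	int main(void)
	{
		struct pollfd pfd;

		pfd.fd = open("/dev/cpuset/sample/oom.notify", O_RDONLY);
		if (pfd.fd < 0)
			return 1;
		pfd.events = POLLPRI;
		for (;;) {
			/* wake on OOM, or give up after 60s */
			if (poll(&pfd, 1, 60000) > 0)
				printf("cgroup hit OOM, take action\n");
		}
	}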
I'm wondering about:
- freeze-all-threads-in-group-at-oom
- freeing emergency memory to the page allocator, pooled at cgroup
creation, rather than the 1-tick wait
BTW, it seems this patch always allows task detach/attach. Is it safe (and sane)?
At first impression, it needs some progress but is interesting in general. Thanks.
-Kame
* Re: [RFC][PATCH] Per-cgroup OOM handler
From: KAMEZAWA Hiroyuki @ 2008-11-11 7:28 UTC
To: Ying Han; +Cc: linux-mm, Rohit Seth, Paul Menage, David Rientjes
On Mon, 10 Nov 2008 20:42:23 -0800
Ying Han <yinghan@google.com> wrote:
> Thank you for your comments.
> On Wed, Nov 5, 2008 at 9:34 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Here is how we do the one-tick wait in cgroup_should_oom() in oom_kill.c
> >	if (!ret) {
> >		/* If we're not going to OOM, we should sleep for a
> >		 * bit to give userspace a chance to respond before we
> >		 * go back and try to reclaim again */
> >		schedule_timeout_uninterruptible(1);
> >	}
> and it works well in-house so far, as I mentioned earlier. What's
> important here is not "sleeping for one tick"; the idea is to
> reschedule the OOMing thread so the OOM handler can take action (like
> adding a memory node to the cpuset) and the subsequent page allocation in
> get_page_from_freelist() can use it.
>
Can't we avoid this kind of magical one-tick wait?
>
> > (Before OOM, the system tends to wait in congestion_wait() or similar.)
>
> I am not sure how the call to congestion_wait() is relevant to the
> "one-tick wait". We are simply trying to reschedule the OOMing task, so
> that the OOM handler that has woken up has a chance to do something.
>
if lucky.
> >
> >
> > Should the OOM handler be in another cpuset, or mlocked, in this case?
>
> The oom-handler is in the same cgroup as the OOMing task; that is why it's
> called a per-cgroup oom-handler. However, there's probably a livelock if the
> userspace OOM handler is the one that triggers the OOM and detaches/reattaches
> without ever freeing or adding memory. For this case, we can either detect it
> in the kernel by doing something like if (current == pid), or just leave the
> problem up to userspace (the OOM handler shouldn't detach itself after
> getting the OOM notification; is it considered a user bug?).
>
Hmm, from the discussion of the mem_notify handler in Feb/March of this
year, an oom-handler cannot work well when memory is near OOM, in general.
mlockall() was therefore recommended for the handler (and it must not do
file access).
I wonder whether creating a small cpuset (and an isolated node) for the
oom-handler may be another help.
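(For concreteness, such a setup might look like this sketch; the cpuset
name, the reserved node, and the handler pid are hypothetical:)

	# Run the handler from its own small cpuset on an isolated node,
	# so it cannot be starved by the OOMing group.
	mkdir /dev/cpuset/oom-handler
	echo 0 > /dev/cpuset/oom-handler/cpuset.mems
	echo 0 > /dev/cpuset/oom-handler/cpuset.cpus
	echo $HANDLER_PID > /dev/cpuset/oom-handler/tasks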
> >
> > I'm wondering about:
> > - freeze-all-threads-in-group-at-oom
> > - freeing emergency memory to the page allocator, pooled at cgroup
> > creation, rather than the 1-tick wait
> >
> > BTW, it seems this patch always allows task detach/attach. Is it safe
> > (and sane)?
>
> yes, we allow task detach/attach. So far we don't see any race condition
> except the livelock I mentioned above. Any particular scenario you can
> think of now? thanks
>
I don't find it ;)
BTW, shouldn't we disable preemption (or irqs) before taking these spinlocks?
> > +static int cgroup_should_oom(void)
> > +{
> > + int ret = 1; /* OOM by default */
> > + struct oom_cgroup *cs;
> > +
> > + task_lock(current);
> > + cs = oom_cgroup_from_task(current);
> > +
Thanks,
-Kame
* Re: [RFC][PATCH] Per-cgroup OOM handler
From: David Rientjes @ 2008-11-11 8:14 UTC
To: KAMEZAWA Hiroyuki; +Cc: Ying Han, linux-mm, Rohit Seth, Paul Menage
Sorry, there's been some confusion in this proposal.
On Tue, 11 Nov 2008, KAMEZAWA Hiroyuki wrote:
> > Here is how we do the one-tick wait in cgroup_should_oom() in oom_kill.c
> > >	if (!ret) {
> > >		/* If we're not going to OOM, we should sleep for a
> > >		 * bit to give userspace a chance to respond before we
> > >		 * go back and try to reclaim again */
> > >		schedule_timeout_uninterruptible(1);
> > >	}
> > and it works well in-house so far, as I mentioned earlier. What's
> > important here is not "sleeping for one tick"; the idea is to
> > reschedule the OOMing thread so the OOM handler can take action (like
> > adding a memory node to the cpuset) and the subsequent page allocation in
> > get_page_from_freelist() can use it.
> >
> Can't we avoid this kind of magical one-tick wait?
>
cgroup_should_oom() determines whether the oom killer should be invoked or
whether userspace should be given the opportunity to act first; it returns
zero only when the kernel has deferred to userspace.
In these situations, the kernel will return to the page allocator to
attempt the allocation again. If current were not rescheduled like this
(the schedule_timeout is simply more powerful than a cond_resched), there
is a very high likelihood that this subsequent allocation attempt would
fail just as it did before the oom killer was triggered and then we'd
enter reclaim unnecessarily when userspace could have reclaimed on its
own, killed a task by overriding the kernel's heuristics, added a node to
the cpuset, increased its memcg allocation, etc.
So this reschedule simply prevents needlessly entering reclaim, just as
its comment indicates.
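(Concretely, the handler responses listed above map to operations like
the following; the paths, the victim pid, and the memcg mount point are
assumptions for illustration:)

	# reclaim on its own behalf, e.g. drop clean page cache
	echo 1 > /proc/sys/vm/drop_caches
	# kill a victim chosen by its own policy, overriding the kernel
	kill -9 $VICTIM_PID
	# add a node to the cpuset
	echo 1-2 > /dev/cpuset/sample/cpuset.mems
	# or raise the memcg limit
	echo 256M > /dev/memcg/sample/memory.limit_in_bytes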
> > > (Before OOM, the system tends to wait in congestion_wait() or similar.)
> >
Yes, we wait on block congestion as part of direct reclaim but at this
point we've yet to notify the userspace oom handler so that it may act to
avoid invoking the oom killer.
> Hmm, from the discussion of the mem_notify handler in Feb/March of this
> year, an oom-handler cannot work well when memory is near OOM, in general.
> mlockall() was therefore recommended for the handler (and it must not do
> file access).
>
This would be a legitimate point if we were talking about a system-wide
oom notifier like /dev/mem_notify was and we were addressing unconstrained
ooms. This patch was specific to cgroups and the only likely usecases are
for either cpusets or memcg.
> I wonder whether creating a small cpuset (and an isolated node) for the
> oom-handler may be another help.
>
This is obviously a pure userspace issue; any sane oom handler that is
itself subjected to the same memory constraints would be written to avoid
memory allocations when woken up.
> > > I'm wondering about:
> > > - freeze-all-threads-in-group-at-oom
> > > - freeing emergency memory to the page allocator, pooled at cgroup
> > > creation, rather than the 1-tick wait
> > >
> > > BTW, it seems this patch always allows task detach/attach. Is it safe
> > > (and sane)?
> >
> > yes, we allow task detach/attach. So far we don't see any race condition
> > except the livelock I mentioned above. Any particular scenario you can
> > think of now? thanks
> >
> I don't find it ;)
> BTW, shouldn't we disable preemption (or irqs) before taking these spinlocks?
>
I don't know which spinlock you're specifically referring to here, but the
oom killer (and thus the handling of the oom handler) is invoked in
process context with irqs enabled.
David
* Re: [RFC][PATCH] Per-cgroup OOM handler
From: Paul Menage @ 2008-11-11 8:27 UTC
To: Ying Han; +Cc: KAMEZAWA Hiroyuki, linux-mm, Rohit Seth, David Rientjes
On Mon, Nov 10, 2008 at 8:42 PM, Ying Han <yinghan@google.com> wrote:
>> Should the OOM handler be in another cpuset, or mlocked, in this case?
>
> The oom-handler is in the same cgroup as the OOMing task.
No, that's not how we've been using it - the OOM handler runs in a
system-control cpuset that hopefully doesn't end up OOMing itself.
Paul