From: Ying Han <yinghan@google.com>
To: linux-mm@kvack.org
Cc: Rohit Seth <rohitseth@google.com>,
Paul Menage <menage@google.com>,
David Rientjes <rientjes@google.com>
Subject: [RFC][PATCH]Per-cgroup OOM handler
Date: Mon, 3 Nov 2008 13:40:03 -0800
Message-ID: <604427e00811031340k56634773g6e260d79e6cb51e7@mail.gmail.com>

Per-cgroup OOM handler, ported from cpuset to cgroup.

The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
again; alternatively, userspace can let the OOM killer go ahead as normal.

It's a standalone subsystem that can work with either the memory cgroup or
with cpusets (where memory is constrained by NUMA nodes).

The features are:

- an oom.delay file that controls how long a thread will pause in the
  OOM killer waiting for a response from userspace (in milliseconds)

- an oom.await file to which a userspace handler can write a timeout value
  (in milliseconds), and be awoken either when a process in that cgroup
  enters the OOM killer or when the timeout expires.
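
To illustrate the oom.await interface described above, here is a minimal userspace sketch in C. It is not part of the patch. It assumes that the write to oom.await succeeds when an OOM was signalled and otherwise fails with the error codes returned by oom_cgroup_write_await() (ETIMEDOUT, EINTR, ENODEV). The names oom_await(), classify_await_errno() and enum oom_event are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical outcome codes for one wait on oom.await. */
enum oom_event {
	OOM_EVENT_OOM,		/* a task in the cgroup entered the OOM killer */
	OOM_EVENT_TIMEOUT,	/* the timeout expired without an OOM */
	OOM_EVENT_GONE,		/* the cgroup was removed while we slept */
	OOM_EVENT_INTERRUPTED,	/* a signal interrupted the wait */
	OOM_EVENT_ERROR		/* anything else, e.g. the file is missing */
};

/* Map the errno of the write (0 on success) to an outcome. */
enum oom_event classify_await_errno(int err)
{
	switch (err) {
	case 0:
		return OOM_EVENT_OOM;
	case ETIMEDOUT:
		return OOM_EVENT_TIMEOUT;
	case ENODEV:
		return OOM_EVENT_GONE;
	case EINTR:
		return OOM_EVENT_INTERRUPTED;
	default:
		return OOM_EVENT_ERROR;
	}
}

/* Write timeout_ms to the given oom.await file and block until it returns. */
enum oom_event oom_await(const char *await_path, unsigned int timeout_ms)
{
	char buf[32];
	int fd, len, err;
	ssize_t n;

	fd = open(await_path, O_WRONLY);
	if (fd < 0)
		return OOM_EVENT_ERROR;

	len = snprintf(buf, sizeof(buf), "%u\n", timeout_ms);
	assert(len > 0 && (size_t)len < sizeof(buf));

	/* Assumption: this write sleeps in the kernel until OOM or timeout. */
	n = write(fd, buf, (size_t)len);
	err = (n < 0) ? errno : 0;
	close(fd);

	return classify_await_errno(err);
}
```

A handler would typically call oom_await() in a loop on the cgroup's oom.await file and, on OOM_EVENT_OOM, take some corrective action before waiting again.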

Example:

(mount oom as a normal cgroup subsystem alongside cpuset)
1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset

(configure a sample cpuset containing a single fake-NUMA node with 128M and
one CPU core)
2. mkdir /dev/cpuset/sample
   echo 1 > /dev/cpuset/sample/cpuset.mems
   echo 1 > /dev/cpuset/sample/cpuset.cpus

(set oom.delay to 10 seconds)
3. echo 10000 > /dev/cpuset/sample/oom.delay

(put the shell on the wait queue, waiting at most 60 seconds)
4. echo 60000 > /dev/cpuset/sample/oom.await

(trigger the OOM by mlockall()ing 600M of anonymous memory)
5. /oom 600000000

When the sample cpuset triggers the OOM, the OOMing thread wakes up the
OOM-handler thread that went to sleep in step 4, sleeps for a jiffy, and then
returns to alloc_pages() to try again. This sleep gives the OOM-handler
time to deal with the OOM, for example by giving another memory node
to the OOMing cpuset.
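
The retry-or-kill decision described above can be modelled in plain userspace C. The sketch below is a simplified model of the logic in cgroup_should_oom() from this patch, with locking, wakeups and the one-jiffy sleep left out. The names struct oom_state, ticks_after() and should_oom() are hypothetical, and the tick counter stands in for jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wrap-safe "a is later than b", the same idea as the kernel's time_after(). */
bool ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct oom_state {
	unsigned long delay;	/* configured delay in ticks, 0 = disabled */
	unsigned long since;	/* tick of the first OOM, 0 = not OOMing */
};

/*
 * Returns true if the OOM killer should proceed, false if the caller
 * should go back and retry the allocation.
 */
bool should_oom(struct oom_state *s, unsigned long now)
{
	unsigned long deadline;

	assert(s != NULL);

	if (s->delay == 0)
		return true;		/* no delay configured: OOM as normal */

	if (s->since == 0) {
		/* First OOM: start the delay window; 0 is reserved. */
		s->since = now ? now : 1;
		return false;
	}

	/* Already OOMing: retry only while inside the delay window. */
	deadline = s->since + s->delay;
	if (ticks_after(deadline, now) && !ticks_after(s->since, now))
		return false;

	return true;
}
```

In this model the userspace handler ends the window early by resetting since to 0, which is what a successful return from oom.await does in the patch.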

We're sending out this in-house patch to start discussion about what
might be appropriate for supporting user-space OOM-handling in the
mainline kernel. Potential improvements include:

- providing more information in the OOM notification, such as the pid
  that triggered the OOM, and a unique id for that OOM instance that can
  be tied to later OOM-kill notifications.

- allowing better notifications from userspace back to the kernel.

Documentation/cgroups/oom-handler.txt | 49 ++++++++
include/linux/cgroup_subsys.h | 12 ++
include/linux/cpuset.h | 7 +-
init/Kconfig | 8 ++
kernel/cpuset.c | 8 +-
mm/oom_kill.c | 220 +++++++++++++++++++++++++++++++++
6 files changed, 301 insertions(+), 3 deletions(-)
diff --git a/Documentation/cgroups/oom-handler.txt b/Documentation/cgroups/oom-handler.txt
new file mode 100644
index 0000000..aa006fe
--- /dev/null
+++ b/Documentation/cgroups/oom-handler.txt
@@ -0,0 +1,49 @@
+The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
+the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
+again; alternatively, userspace can let the OOM killer go ahead as normal.
+
+It's a standalone subsystem that can work with either the memory cgroup or
+with cpusets (where memory is constrained by NUMA nodes).
+
+The features are:
+
+- an oom.delay file that controls how long a thread will pause in the
+OOM killer waiting for a response from userspace (in milliseconds)
+
+- an oom.await file that a userspace handler can write a timeout value
+to, and be awoken either when a process in that cgroup enters the OOM
+killer, or the timeout expires.
+
+example:
+(mount oom as normal cgroup subsystem as well as cpuset)
+1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
+
+(config sample cpuset contains single fakenuma node with 128M and one
+cpu core)
+2. mkdir /dev/cpuset/sample
+ echo 1 > /dev/cpuset/sample/cpuset.mems
+ echo 1 > /dev/cpuset/sample/cpuset.cpus
+
+(set oom.delay to 10 seconds)
+3. echo 10000 > /dev/cpuset/sample/oom.delay
+
+(put the shell on the wait queue, waiting at most 60 seconds)
+4. echo 60000 > /dev/cpuset/sample/oom.await
+
+(trigger the oom by mlockall 600M anon memory)
+5. /oom 600000000
+
+When the sample cpuset triggers the OOM, the OOMing thread wakes up the
+OOM-handler thread that went to sleep in step 4, sleeps for a jiffy, and then
+returns to alloc_pages() to try again. This sleep gives the OOM-handler
+time to deal with the OOM, for example by giving another memory node
+to the OOMing cpuset.
+
+Potential improvements include:
+- providing more information in the OOM notification, such as the pid
+that triggered the OOM, and a unique id for that OOM instance that can
+be tied to later OOM-kill notifications.
+
+- allowing better notifications from userspace back to the kernel.
+
+
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c22396..1e63bd5 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -54,3 +54,15 @@ SUBSYS(freezer)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+SUBSYS(oom_cgroup)
+#endif
+
+/* */
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+SUBSYS(oom_cgroup)
+#endif
+
+/* */
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2691926..26dab22 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -25,7 +25,7 @@ extern void cpuset_cpus_allowed_locked(struct task_struct *p, cpumask_t *mask);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
-void cpuset_update_task_memory_state(void);
+int cpuset_update_task_memory_state(void);
int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +103,10 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}
-static inline void cpuset_update_task_memory_state(void) {}
+static inline int cpuset_update_task_memory_state(void)
+{
+ return 1;
+}
static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
{
diff --git a/init/Kconfig b/init/Kconfig
index 44e9208..971b0b5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -324,6 +324,14 @@ config CPUSETS
Say N if unsure.
+config CGROUP_OOM_CONT
+ bool "OOM controller for cgroups"
+ depends on CGROUPS
+ help
+ This option allows userspace to trap OOM conditions on a
+ per-cgroup basis, and take action that might prevent the OOM from
+ occurring.
+
#
# Architectures with an unreliable sched_clock() should select this:
#
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3e00526..c986423 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -355,13 +355,17 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
* within the tasks context, when it is trying to allocate memory
* (in various mm/mempolicy.c routines) and notices that some other
* task has been modifying its cpuset.
+ *
+ * Returns non-zero if the state was updated, including when it is
+ * an effective no-op.
*/
-void cpuset_update_task_memory_state(void)
+int cpuset_update_task_memory_state(void)
{
int my_cpusets_mem_gen;
struct task_struct *tsk = current;
struct cpuset *cs;
+ int ret = 0;
if (task_cs(tsk) == &top_cpuset) {
/* Don't need rcu for top_cpuset. It's never freed. */
@@ -389,7 +393,9 @@ void cpuset_update_task_memory_state(void)
task_unlock(tsk);
mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
+ ret = 1;
}
+ return ret;
}
/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 64e5b4b..5677b72 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,6 +32,219 @@ int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks;
static DEFINE_SPINLOCK(zone_scan_mutex);
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+struct oom_cgroup {
+ struct cgroup_subsys_state css;
+
+ /* How long between first OOM indication and actual OOM kill
+ * for processes in this cgroup */
+ unsigned long oom_delay;
+
+ /* When the current OOM delay began. Zero means no delay in progress */
+ unsigned long oom_since;
+
+ /* Wait queue for userspace OOM handler */
+ wait_queue_head_t oom_wait;
+
+ spinlock_t oom_lock;
+};
+
+static inline
+struct oom_cgroup *oom_cgroup_from_cont(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont, oom_cgroup_subsys_id),
+ struct oom_cgroup, css);
+}
+
+static inline
+struct oom_cgroup *oom_cgroup_from_task(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, oom_cgroup_subsys_id),
+ struct oom_cgroup, css);
+}
+
+/*
+ * Takes oom_lock during call.
+ */
+static int oom_cgroup_write_delay(struct cgroup *cont, struct cftype *cft,
+ u64 delay)
+{
+ struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
+
+ /* Sanity check */
+ if (unlikely(delay > 60 * 1000))
+ return -EINVAL;
+ spin_lock(&cs->oom_lock);
+ cs->oom_delay = msecs_to_jiffies(delay);
+ spin_unlock(&cs->oom_lock);
+ return 0;
+}
+
+/*
+ * sleeps until the cgroup enters OOM (or a maximum of N milliseconds if N is
+ * passed). Clears the OOM condition in the cgroup when it returns.
+ */
+static int oom_cgroup_write_await(struct cgroup *cont, struct cftype *cft,
+ u64 await)
+{
+ int retval = 0;
+ struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
+
+ /* Don't try to wait for more than a minute */
+ await = min(await, 60ULL * 1000);
+ /* Wait for up to 'await' milliseconds for an OOM condition */
+ wait_event_interruptible_timeout(cs->oom_wait, cs->oom_since ||
+ cgroup_is_removed(cs->css.cgroup),
+ msecs_to_jiffies(await));
+ spin_lock(&cs->oom_lock);
+ if (cgroup_is_removed(cs->css.cgroup)) {
+ /* The cgroup was removed while we slept */
+ retval = -ENODEV;
+ } else if (cs->oom_since) {
+ /* We reached OOM. Clear the OOM condition now that
+ * userspace knows about it */
+ cs->oom_since = 0;
+ } else if (signal_pending(current)) {
+ retval = -EINTR;
+ } else {
+ /* No OOM yet */
+ retval = -ETIMEDOUT;
+ }
+ spin_unlock(&cs->oom_lock);
+ return retval;
+}
+
+static u64 oom_cgroup_read_delay(struct cgroup *cont, struct cftype *cft)
+{
+ return oom_cgroup_from_cont(cont)->oom_delay;
+}
+
+static struct cftype oom_cgroup_files[] = {
+ {
+ .name = "delay",
+ .read_u64 = oom_cgroup_read_delay,
+ .write_u64 = oom_cgroup_write_delay,
+ },
+
+ {
+ .name = "await",
+ .write_u64 = oom_cgroup_write_await,
+ },
+};
+
+static struct cgroup_subsys_state *oom_cgroup_create(
+ struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ struct oom_cgroup *oom;
+
+ oom = kmalloc(sizeof(*oom), GFP_KERNEL);
+ if (!oom)
+ return ERR_PTR(-ENOMEM);
+
+ oom->oom_delay = 0;
+ init_waitqueue_head(&oom->oom_wait);
+ oom->oom_since = 0;
+ spin_lock_init(&oom->oom_lock);
+
+ return &oom->css;
+}
+
+static void oom_cgroup_destroy(struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ kfree(oom_cgroup_from_cont(cont));
+}
+
+static int oom_cgroup_populate(struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, oom_cgroup_files,
+ ARRAY_SIZE(oom_cgroup_files));
+}
+
+struct cgroup_subsys oom_cgroup_subsys = {
+ .name = "oom",
+ .subsys_id = oom_cgroup_subsys_id,
+ .create = oom_cgroup_create,
+ .destroy = oom_cgroup_destroy,
+ .populate = oom_cgroup_populate,
+};
+
+
+/*
+ * Call with no cpuset mutex held. Determines whether this process
+ * should allow an OOM to proceed as normal (retval==1) or should try
+ * again to allocate memory (retval==0). If necessary, sleeps and then
+ * updates the task's mems_allowed to let userspace update the memory
+ * nodes for the task's cpuset.
+ */
+static int cgroup_should_oom(void)
+{
+ int ret = 1; /* OOM by default */
+ struct oom_cgroup *cs;
+
+ task_lock(current);
+ cs = oom_cgroup_from_task(current);
+
+ spin_lock(&cs->oom_lock);
+ if (cs->oom_delay) {
+ /* We have an OOM delay configured */
+ if (cs->oom_since) {
+ /* We're already OOMing - see if we're over
+ * the time limit. Also make sure that jiffie
+ * wrap-around doesn't make us think we're in
+ * an incredibly long OOM delay */
+ unsigned long deadline = cs->oom_since + cs->oom_delay;
+ if (time_after(deadline, jiffies) &&
+ !time_after(cs->oom_since, jiffies)) {
+ /* Not OOM yet */
+ ret = 0;
+ }
+ } else {
+ /* This is the first OOM */
+ ret = 0;
+ cs->oom_since = jiffies;
+ /* Avoid problems with jiffie wrap - make an
+ * oom_since of zero always mean not
+ * OOMing */
+ if (!cs->oom_since)
+ cs->oom_since = 1;
+ printk(KERN_WARNING
+ "Cpuset %s (pid %d) sending memory "
+ "notification to userland at %lu%s\n",
+ cs->css.cgroup->dentry->d_name.name,
+ current->pid, jiffies,
+ waitqueue_active(&cs->oom_wait) ?
+ "" : " (no waiters)");
+ }
+ if (!ret) {
+ /* If we're planning to retry, we should wake
+ * up any userspace waiter in order to let it
+ * handle the OOM
+ */
+ wake_up_all(&cs->oom_wait);
+ }
+ }
+
+ spin_unlock(&cs->oom_lock);
+ task_unlock(current);
+ if (!ret) {
+ /* If we're not going to OOM, we should sleep for a
+ * bit to give userspace a chance to respond before we
+ * go back and try to reclaim again */
+ schedule_timeout_uninterruptible(1);
+ }
+ return ret;
+}
+#else /* !CONFIG_CGROUP_OOM_CONT */
+static inline int cgroup_should_oom(void)
+{
+ return 1;
+}
+#endif
+
/* #define DEBUG */
/**
@@ -526,6 +739,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
unsigned long freed = 0;
enum oom_constraint constraint;
+ /*
+ * It is important to call in this order since cgroup_should_oom()
+ * might sleep and give userspace chance to update mems.
+ */
+ if (!cgroup_should_oom() || cpuset_update_task_memory_state())
+ return;
+
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
/* Got some memory back in the last second. */