Date: Mon, 3 Nov 2008 14:19:11 -0800
From: Ying Han
To: linux-mm@kvack.org
Cc: Rohit Seth, Paul Menage, David Rientjes
Subject: Re: [RFC][PATCH] Per-cgroup OOM handler
Message-ID: <604427e00811031419k2e990061kdb03f4b715b51fb9@mail.gmail.com>
In-Reply-To: <604427e00811031340k56634773g6e260d79e6cb51e7@mail.gmail.com>

Sorry, please use the following patch (it deletes the double definition in
cgroup_subsys.h from the last patch).

Per-cgroup OOM handler, ported from cpuset to cgroup.

The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
again; alternatively, userspace can let the OOM killer go ahead as normal.

It's a standalone subsystem that can work with either the memory cgroup or
with cpusets (where memory is constrained by NUMA nodes).

The features are:

- an oom.delay file that controls how long a thread will pause in the
OOM killer waiting for a response from userspace (in milliseconds)

- an oom.await file that a userspace handler can write a timeout value
to, and be woken either when a process in that cgroup enters the OOM
killer or when the timeout expires.

Example:

(mount oom as a normal cgroup subsystem alongside cpuset)
1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset

(configure a sample cpuset containing a single fake-NUMA node with 128M
and one CPU core)
2. mkdir /dev/cpuset/sample
   echo 1 > /dev/cpuset/sample/cpuset.mems
   echo 1 > /dev/cpuset/sample/cpuset.cpus

(configure the OOM delay to be 10 seconds)
3. echo 10000 > /dev/cpuset/sample/oom.delay

(put the shell in the wait queue, waiting at most 60 seconds)
4. echo 60000 > /dev/cpuset/sample/oom.await

(trigger the OOM by mlocking 600M of anonymous memory; a sketch of this
test program follows the patch)
5. /oom 600000000

When the sample cpuset triggers the OOM, it will wake up the OOM-handler
thread that slept in step 4, sleep for a jiffy, and then return to
alloc_pages() to try again. This sleep gives the OOM handler time to deal
with the OOM, for example by giving another memory node to the OOMing cpuset.

We're sending out this in-house patch to start discussion about what might
be appropriate for supporting user-space OOM handling in the mainline
kernel. Potential improvements include:

- providing more information in the OOM notification, such as the pid that
triggered the OOM and a unique id for that OOM instance that can be tied to
later OOM-kill notifications.

- allowing better notifications from userspace back to the kernel.
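The patch below provides only the kernel half. As a rough sketch of what the
userspace half of the example above might look like (this is illustrative,
not part of the patch: the paths under /dev/cpuset/sample, the 10s/60s
timeouts, and "handling" the OOM by adding fake-NUMA node 2 to cpuset.mems
are all assumptions taken from, or invented for, the example):

/*
 * Illustrative userspace OOM handler for the example above (not part of
 * the patch).  Assumptions: the hierarchy is mounted at /dev/cpuset/sample,
 * and handling the OOM simply means giving the cpuset an extra memory node
 * by rewriting cpuset.mems.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write a string to a cgroup control file; returns 0 on success. */
static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	/* Step 3: let OOMing tasks wait up to 10s for us. */
	if (write_str("/dev/cpuset/sample/oom.delay", "10000"))
		exit(1);

	for (;;) {
		/*
		 * Step 4: block for up to 60s waiting for an OOM in the
		 * cgroup.  Per the patch, the write fails with ETIMEDOUT
		 * if no OOM occurred before the timeout.
		 */
		if (write_str("/dev/cpuset/sample/oom.await", "60000")) {
			if (errno == ETIMEDOUT)
				continue;	/* no OOM yet, keep waiting */
			perror("oom.await");
			exit(1);
		}

		/*
		 * An OOM happened: give the cpuset another fake-NUMA node
		 * so the stalled allocation can succeed when it retries.
		 */
		fprintf(stderr, "OOM in sample cpuset, adding node 2\n");
		if (write_str("/dev/cpuset/sample/cpuset.mems", "1-2"))
			exit(1);
	}
}

While such a handler runs, the OOMing task keeps looping between
alloc_pages() and cgroup_should_oom() (see mm/oom_kill.c below), so widening
cpuset.mems is enough to resolve the condition without a kill; if the handler
never responds, the kill proceeds once oom.delay expires.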
 Documentation/cgroups/oom-handler.txt |   49 ++++++++
 include/linux/cgroup_subsys.h         |   12 ++
 include/linux/cpuset.h                |    7 +-
 init/Kconfig                          |    8 ++
 kernel/cpuset.c                       |    8 +-
 mm/oom_kill.c                         |  220 +++++++++++++++++++++++++++++++++
 6 files changed, 301 insertions(+), 3 deletions(-)

Signed-off-by: Paul Menage
Signed-off-by: David Rientjes
Signed-off-by: Ying Han

diff --git a/Documentation/cgroups/oom-handler.txt b/Documentation/cgroups/oom-handler.txt
new file mode 100644
index 0000000..aa006fe
--- /dev/null
+++ b/Documentation/cgroups/oom-handler.txt
@@ -0,0 +1,49 @@
+The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
+the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
+again; alternatively, userspace can let the OOM killer go ahead as normal.
+
+It's a standalone subsystem that can work with either the memory cgroup or
+with cpusets (where memory is constrained by NUMA nodes).
+
+The features are:
+
+- an oom.delay file that controls how long a thread will pause in the
+OOM killer waiting for a response from userspace (in milliseconds)
+
+- an oom.await file that a userspace handler can write a timeout value
+to, and be woken either when a process in that cgroup enters the OOM
+killer or when the timeout expires.
+
+Example:
+(mount oom as a normal cgroup subsystem alongside cpuset)
+1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
+
+(configure a sample cpuset containing a single fake-NUMA node with 128M
+and one CPU core)
+2. mkdir /dev/cpuset/sample
+   echo 1 > /dev/cpuset/sample/cpuset.mems
+   echo 1 > /dev/cpuset/sample/cpuset.cpus
+
+(configure the OOM delay to be 10 seconds)
+3. echo 10000 > /dev/cpuset/sample/oom.delay
+
+(put the shell in the wait queue, waiting at most 60 seconds)
+4. echo 60000 > /dev/cpuset/sample/oom.await
+
+(trigger the OOM by mlocking 600M of anonymous memory)
+5. /oom 600000000
+
+When the sample cpuset triggers the OOM, it will wake up the
+OOM-handler thread that slept in step 4, sleep for a jiffy, and then
+return to alloc_pages() to try again. This sleep gives the OOM handler
+time to deal with the OOM, for example by giving another memory node
+to the OOMing cpuset.
+
+Potential improvements include:
+- providing more information in the OOM notification, such as the pid
+that triggered the OOM and a unique id for that OOM instance that can
+be tied to later OOM-kill notifications.
+
+- allowing better notifications from userspace back to the kernel.
+
+
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c22396..23fe6c7 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -54,3 +54,9 @@ SUBSYS(freezer)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+SUBSYS(oom_cgroup)
+#endif
+
+/* */
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2691926..26dab22 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -25,7 +25,7 @@ extern void cpuset_cpus_allowed_locked(struct task_struct *p, cpumask_t *mask);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
-void cpuset_update_task_memory_state(void);
+int cpuset_update_task_memory_state(void);
 int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +103,10 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
 
 static inline void cpuset_init_current_mems_allowed(void) {}
-static inline void cpuset_update_task_memory_state(void) {}
+static inline int cpuset_update_task_memory_state(void)
+{
+	return 1;
+}
 
 static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
diff --git a/init/Kconfig b/init/Kconfig
index 44e9208..971b0b5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -324,6 +324,14 @@ config CPUSETS
 
 	  Say N if unsure.
 
+config CGROUP_OOM_CONT
+	bool "OOM controller for cgroups"
+	depends on CGROUPS
+	help
+	  This option allows userspace to trap OOM conditions on a
+	  per-cgroup basis, and take action that might prevent the OOM from
+	  occurring.
+
 #
 # Architectures with an unreliable sched_clock() should select this:
 #
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3e00526..c986423 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -355,13 +355,17 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
  * within the tasks context, when it is trying to allocate memory
  * (in various mm/mempolicy.c routines) and notices that some other
  * task has been modifying its cpuset.
+ *
+ * Returns non-zero if the state was updated, including when it is
+ * an effective no-op.
  */
-void cpuset_update_task_memory_state(void)
+int cpuset_update_task_memory_state(void)
 {
 	int my_cpusets_mem_gen;
 	struct task_struct *tsk = current;
 	struct cpuset *cs;
+	int ret = 0;
 
 	if (task_cs(tsk) == &top_cpuset) {
 		/* Don't need rcu for top_cpuset. It's never freed. */
@@ -389,7 +393,9 @@ void cpuset_update_task_memory_state(void)
 		task_unlock(tsk);
 		mutex_unlock(&callback_mutex);
 		mpol_rebind_task(tsk, &tsk->mems_allowed);
+		ret = 1;
 	}
+	return ret;
 }
 
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 64e5b4b..5677b72 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,6 +32,219 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
 static DEFINE_SPINLOCK(zone_scan_mutex);
+
+#ifdef CONFIG_CGROUP_OOM_CONT
+struct oom_cgroup {
+	struct cgroup_subsys_state css;
+
+	/* How long between first OOM indication and actual OOM kill
+	 * for processes in this cgroup */
+	unsigned long oom_delay;
+
+	/* When the current OOM delay began.
+	 * Zero means no delay in progress */
+	unsigned long oom_since;
+
+	/* Wait queue for userspace OOM handler */
+	wait_queue_head_t oom_wait;
+
+	spinlock_t oom_lock;
+};
+
+static inline
+struct oom_cgroup *oom_cgroup_from_cont(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont, oom_cgroup_subsys_id),
+			    struct oom_cgroup, css);
+}
+
+static inline
+struct oom_cgroup *oom_cgroup_from_task(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, oom_cgroup_subsys_id),
+			    struct oom_cgroup, css);
+}
+
+/*
+ * Takes oom_lock during call.
+ */
+static int oom_cgroup_write_delay(struct cgroup *cont, struct cftype *cft,
+				  u64 delay)
+{
+	struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
+
+	/* Sanity check */
+	if (unlikely(delay > 60 * 1000))
+		return -EINVAL;
+	spin_lock(&cs->oom_lock);
+	cs->oom_delay = msecs_to_jiffies(delay);
+	spin_unlock(&cs->oom_lock);
+	return 0;
+}
+
+/*
+ * Sleeps until the cgroup enters OOM, or for a maximum of N milliseconds
+ * if N is passed. Clears the OOM condition in the cgroup when it returns.
+ */
+static int oom_cgroup_write_await(struct cgroup *cont, struct cftype *cft,
+				  u64 await)
+{
+	int retval = 0;
+	struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
+
+	/* Don't try to wait for more than a minute */
+	await = min(await, 60ULL * 1000);
+	/* Wait for an OOM condition, or until the timeout expires */
+	wait_event_interruptible_timeout(cs->oom_wait, cs->oom_since ||
+					 cgroup_is_removed(cs->css.cgroup),
+					 msecs_to_jiffies(await));
+	spin_lock(&cs->oom_lock);
+	if (cgroup_is_removed(cs->css.cgroup)) {
+		/* The cgroup was removed while we slept */
+		retval = -ENODEV;
+	} else if (cs->oom_since) {
+		/* We reached OOM. Clear the OOM condition now that
+		 * userspace knows about it */
+		cs->oom_since = 0;
+	} else if (signal_pending(current)) {
+		retval = -EINTR;
+	} else {
+		/* No OOM yet */
+		retval = -ETIMEDOUT;
+	}
+	spin_unlock(&cs->oom_lock);
+	return retval;
+}
+
+static u64 oom_cgroup_read_delay(struct cgroup *cont, struct cftype *cft)
+{
+	/* Report the delay in milliseconds, matching what was written */
+	return jiffies_to_msecs(oom_cgroup_from_cont(cont)->oom_delay);
+}
+
+static struct cftype oom_cgroup_files[] = {
+	{
+		.name = "delay",
+		.read_u64 = oom_cgroup_read_delay,
+		.write_u64 = oom_cgroup_write_delay,
+	},
+
+	{
+		.name = "await",
+		.write_u64 = oom_cgroup_write_await,
+	},
+};
+
+static struct cgroup_subsys_state *oom_cgroup_create(
+	struct cgroup_subsys *ss,
+	struct cgroup *cont)
+{
+	struct oom_cgroup *oom;
+
+	oom = kmalloc(sizeof(*oom), GFP_KERNEL);
+	if (!oom)
+		return ERR_PTR(-ENOMEM);
+
+	oom->oom_delay = 0;
+	init_waitqueue_head(&oom->oom_wait);
+	oom->oom_since = 0;
+	spin_lock_init(&oom->oom_lock);
+
+	return &oom->css;
+}
+
+static void oom_cgroup_destroy(struct cgroup_subsys *ss,
+			       struct cgroup *cont)
+{
+	kfree(oom_cgroup_from_cont(cont));
+}
+
+static int oom_cgroup_populate(struct cgroup_subsys *ss,
+			       struct cgroup *cont)
+{
+	return cgroup_add_files(cont, ss, oom_cgroup_files,
+				ARRAY_SIZE(oom_cgroup_files));
+}
+
+struct cgroup_subsys oom_cgroup_subsys = {
+	.name = "oom",
+	.subsys_id = oom_cgroup_subsys_id,
+	.create = oom_cgroup_create,
+	.destroy = oom_cgroup_destroy,
+	.populate = oom_cgroup_populate,
+};
+
+
+/*
+ * Call with no cpuset mutex held. Determines whether this process
+ * should allow an OOM to proceed as normal (retval == 1) or should try
+ * again to allocate memory (retval == 0). If necessary, sleeps and then
+ * updates the task's mems_allowed to let userspace update the memory
+ * nodes for the task's cpuset.
+ */
+static int cgroup_should_oom(void)
+{
+	int ret = 1;	/* OOM by default */
+	struct oom_cgroup *cs;
+
+	task_lock(current);
+	cs = oom_cgroup_from_task(current);
+
+	spin_lock(&cs->oom_lock);
+	if (cs->oom_delay) {
+		/* We have an OOM delay configured */
+		if (cs->oom_since) {
+			/* We're already OOMing - see if we're over
+			 * the time limit. Also make sure that jiffy
+			 * wrap-around doesn't make us think we're in
+			 * an incredibly long OOM delay */
+			unsigned long deadline = cs->oom_since + cs->oom_delay;
+			if (time_after(deadline, jiffies) &&
+			    !time_after(cs->oom_since, jiffies)) {
+				/* Not OOM yet */
+				ret = 0;
+			}
+		} else {
+			/* This is the first OOM */
+			ret = 0;
+			cs->oom_since = jiffies;
+			/* Avoid problems with jiffy wrap - make an
+			 * oom_since of zero always mean not
+			 * OOMing */
+			if (!cs->oom_since)
+				cs->oom_since = 1;
+			printk(KERN_WARNING
+			       "Cgroup %s (pid %d) sending memory "
+			       "notification to userland at %lu%s\n",
+			       cs->css.cgroup->dentry->d_name.name,
+			       current->pid, jiffies,
+			       waitqueue_active(&cs->oom_wait) ?
+			       "" : " (no waiters)");
+		}
+		if (!ret) {
+			/* If we're planning to retry, we should wake
+			 * up any userspace waiter in order to let it
+			 * handle the OOM
+			 */
+			wake_up_all(&cs->oom_wait);
+		}
+	}
+
+	spin_unlock(&cs->oom_lock);
+	task_unlock(current);
+	if (!ret) {
+		/* If we're not going to OOM, we should sleep for a
+		 * bit to give userspace a chance to respond before we
+		 * go back and try to reclaim again */
+		schedule_timeout_uninterruptible(1);
+	}
+	return ret;
+}
+#else /* !CONFIG_CGROUP_OOM_CONT */
+static inline int cgroup_should_oom(void)
+{
+	return 1;
+}
+#endif
+
 /* #define DEBUG */
 
 /**
@@ -526,6 +739,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 	unsigned long freed = 0;
 	enum oom_constraint constraint;
 
+	/*
+	 * It is important to call in this order, since cgroup_should_oom()
+	 * might sleep and give userspace a chance to update mems.
+	 */
+	if (!cgroup_should_oom() || cpuset_update_task_memory_state())
+		return;
+
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
 		/* Got some memory back in the last second. */
--
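One loose end from the example: the /oom binary used in step 5 is not part
of this patch. A minimal equivalent, assuming it simply maps, touches and
mlocks the requested number of bytes of anonymous memory (as the example
describes) and then stays resident, might look like:

/*
 * Sketch of the "/oom <bytes>" test program from step 5 (not part of the
 * patch): map and touch the requested amount of anonymous memory, then
 * mlockall() it so it cannot be reclaimed, driving the small cpuset into OOM.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t bytes;
	char *buf;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <bytes>\n", argv[0]);
		return 1;
	}
	bytes = strtoull(argv[1], NULL, 0);

	buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0xa5, bytes);	/* fault in every page */

	/* Pin the pages so they can't be swapped or reclaimed. */
	if (mlockall(MCL_CURRENT)) {
		perror("mlockall");
		return 1;
	}
	pause();	/* stay resident until the OOM killer (or admin) acts */
	return 0;
}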