Date: Thu, 6 Nov 2008 14:34:38 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [RFC][PATCH] Per-cgroup OOM handler
Message-Id: <20081106143438.5557b87c.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <604427e00811031419k2e990061kdb03f4b715b51fb9@mail.gmail.com>
References: <604427e00811031340k56634773g6e260d79e6cb51e7@mail.gmail.com>
 <604427e00811031419k2e990061kdb03f4b715b51fb9@mail.gmail.com>
To: Ying Han
Cc: linux-mm@kvack.org, Rohit Seth, Paul Menage, David Rientjes

Thank you for posting.

On Mon, 3 Nov 2008 14:19:11 -0800 Ying Han wrote:

> sorry, please use the following patch. (deleted the double definition
> in cgroup_subsys.h from last patch)
>
> Per-cgroup OOM handler ported from cpuset to cgroup.
>
> The per-cgroup OOM handler allows a userspace handler to catch and handle
> the OOM; the OOMing thread doesn't trigger a kill, but returns to
> alloc_pages to try again. Alternatively, userspace can cause the OOM
> killer to go ahead as normal.
>
> It's a standalone subsystem that can work with either the memory cgroup or
> with cpusets (where memory is constrained by NUMA nodes).
>
> The features are:
>
> - an oom.delay file that controls how long a thread will pause in the
>   OOM killer waiting for a response from userspace (in milliseconds)
>
> - an oom.await file that a userspace handler can write a timeout value
>   to, and be awoken either when a process in that cgroup enters the OOM
>   killer, or the timeout expires.
>
> example:
> (mount oom as a normal cgroup subsystem, together with cpuset)
> 1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
>
> (configure a sample cpuset containing a single fake-NUMA node with 128M
>  and one cpu core)
> 2. mkdir /dev/cpuset/sample
>    echo 1 > /dev/cpuset/sample/cpuset.mems
>    echo 1 > /dev/cpuset/sample/cpuset.cpus
>
> (configure oom.delay to be 10 sec)
> 3. echo 10000 > /dev/cpuset/sample/oom.delay
>
> (put the shell in the wait queue, waiting at most 60 sec)
> 4. echo 60000 > /dev/cpuset/sample/oom.await
>
> (trigger the OOM by mlocking 600M of anon memory)
> 5. /oom 600000000
>
> When the sample cpuset triggers the OOM, it will wake up the
> OOM-handler thread that slept in step 4, sleep for a jiffy, and then
> return to alloc_pages() to try again. This sleep gives the OOM handler
> time to deal with the OOM, for example by giving another memory node
> to the OOMing cpuset.
>
Where does this one-tick wait come from? Does it work well?
(Before OOM, the system tends to wait in congestion_wait() or similar.)
Should the OOM handler live in another cpuset, or be mlocked, in this case?
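For reference, this is roughly how I imagine the userspace handler loop
would look against this interface. Just an untested sketch: the cgroup
path, the oom.delay/oom.await file names (taken from the cftype names in
the patch), and the "add one more mems node" recovery policy are my
assumptions, not part of the patch.

  /* Hypothetical userspace OOM handler for the proposed oom subsystem. */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  static int write_str(const char *path, const char *val)
  {
      int fd = open(path, O_WRONLY);
      ssize_t ret;

      if (fd < 0)
          return -1;
      ret = write(fd, val, strlen(val));
      close(fd);
      return ret < 0 ? -1 : 0;
  }

  int main(void)
  {
      for (;;) {
          /* Block for up to 60 seconds; the write returns success only
           * when a task in the group has entered the OOM killer. */
          if (write_str("/dev/cpuset/sample/oom.await", "60000"))
              continue;   /* timeout, signal, or group removed */

          /* The group's tasks are now looping in alloc_pages(); give the
           * cpuset another fake-NUMA node before oom.delay expires (the
           * actual recovery policy is up to the handler). */
          write_str("/dev/cpuset/sample/cpuset.mems", "1-2");
      }
  }
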
> We're sending out this in-house patch to start discussion about what
> might be appropriate for supporting user-space OOM-handling in the
> mainline kernel. Potential improvements include:
>
thanks.

> - providing more information in the OOM notification, such as the pid
> that triggered the OOM, and a unique id for that OOM instance that can
> be tied to later OOM-kill notifications.
>
With this patch, the reason for the OOM is not visible to the handler.
I think the following information is necessary at least:
 - OOM because of global memory shortage.
 - OOM because of cpuset memory shortage.
 - OOM because memcg memory hit its limit.

> - allowing better notifications from userspace back to the kernel.
>
I would love some interface allowing poll(); a rough sketch of what I
mean is below. I'm also wondering about
 - freeze-all-threads-in-group-at-OOM
 - freeing emergency memory to the page allocator, pooled at cgroup
   creation, rather than the 1-tick wait.
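Something along these lines is what I have in mind. This is not in the
patch: the oom.notify file and its poll()/read() behaviour are purely
hypothetical, only to illustrate the kind of interface I would like.

  /* Hypothetical poll()-based notification, NOT implemented by this patch. */
  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[128];
      ssize_t len;
      int fd = open("/dev/cpuset/sample/oom.notify", O_RDONLY);
      struct pollfd pfd = { .fd = fd, .events = POLLIN };

      if (fd < 0)
          return 1;
      for (;;) {
          /* Sleep until the kernel marks the file readable because a
           * task in this group hit OOM; no timeout loop is needed. */
          if (poll(&pfd, 1, -1) <= 0)
              continue;

          /* A read could then report the cause (global shortage, cpuset
           * shortage, memcg limit) and the triggering pid. */
          len = read(fd, buf, sizeof(buf) - 1);
          if (len > 0) {
              buf[len] = '\0';
              printf("OOM event: %s\n", buf);
          }
          /* ...recovery policy as in the earlier sketch... */
      }
  }
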
BTW, it seems this patch always allows task attach/detach. Is that safe
(and sane)?

At first impression this needs some more work, but it is interesting in
general.

Thanks,
-Kame

>
>  Documentation/cgroups/oom-handler.txt |   49 ++++++++
>  include/linux/cgroup_subsys.h         |   12 ++
>  include/linux/cpuset.h                |    7 +-
>  init/Kconfig                          |    8 ++
>  kernel/cpuset.c                       |    8 +-
>  mm/oom_kill.c                         |  220 +++++++++++++++++++++++++++++++++
>  6 files changed, 301 insertions(+), 3 deletions(-)
>
> Signed-off-by: Paul Menage
>                David Rientjes
>                Ying Han
>
>
> diff --git a/Documentation/cgroups/oom-handler.txt b/Documentation/cgroups/oom-handler.txt
> new file mode 100644
> index 0000000..aa006fe
> --- /dev/null
> +++ b/Documentation/cgroups/oom-handler.txt
> @@ -0,0 +1,49 @@
> +The per-cgroup OOM handler allows a userspace handler to catch and handle
> +the OOM; the OOMing thread doesn't trigger a kill, but returns to
> +alloc_pages to try again. Alternatively, userspace can cause the OOM
> +killer to go ahead as normal.
> +
> +It's a standalone subsystem that can work with either the memory cgroup or
> +with cpusets (where memory is constrained by NUMA nodes).
> +
> +The features are:
> +
> +- an oom.delay file that controls how long a thread will pause in the
> +  OOM killer waiting for a response from userspace (in milliseconds)
> +
> +- an oom.await file that a userspace handler can write a timeout value
> +  to, and be awoken either when a process in that cgroup enters the OOM
> +  killer, or the timeout expires.
> +
> +example:
> +(mount oom as a normal cgroup subsystem, together with cpuset)
> +1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
> +
> +(configure a sample cpuset containing a single fake-NUMA node with 128M
> + and one cpu core)
> +2. mkdir /dev/cpuset/sample
> +   echo 1 > /dev/cpuset/sample/cpuset.mems
> +   echo 1 > /dev/cpuset/sample/cpuset.cpus
> +
> +(configure oom.delay to be 10 sec)
> +3. echo 10000 > /dev/cpuset/sample/oom.delay
> +
> +(put the shell in the wait queue, waiting at most 60 sec)
> +4. echo 60000 > /dev/cpuset/sample/oom.await
> +
> +(trigger the OOM by mlocking 600M of anon memory)
> +5. /oom 600000000
> +
> +When the sample cpuset triggers the OOM, it will wake up the
> +OOM-handler thread that slept in step 4, sleep for a jiffy, and then
> +return to alloc_pages() to try again. This sleep gives the OOM handler
> +time to deal with the OOM, for example by giving another memory node
> +to the OOMing cpuset.
> +
> +Potential improvements include:
> +- providing more information in the OOM notification, such as the pid
> +  that triggered the OOM, and a unique id for that OOM instance that can
> +  be tied to later OOM-kill notifications.
> +
> +- allowing better notifications from userspace back to the kernel.
> +
> +
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 9c22396..23fe6c7 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -54,3 +54,9 @@ SUBSYS(freezer)
>  #endif
>  
>  /* */
> +
> +#ifdef CONFIG_CGROUP_OOM_CONT
> +SUBSYS(oom_cgroup)
> +#endif
> +
> +/* */
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2691926..26dab22 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -25,7 +25,7 @@ extern void cpuset_cpus_allowed_locked(struct task_struct *p, cpumask_t *mask);
>  extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
>  #define cpuset_current_mems_allowed (current->mems_allowed)
>  void cpuset_init_current_mems_allowed(void);
> -void cpuset_update_task_memory_state(void);
> +int cpuset_update_task_memory_state(void);
>  int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
>  
>  extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
> @@ -103,7 +103,10 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>  
>  #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
>  static inline void cpuset_init_current_mems_allowed(void) {}
> -static inline void cpuset_update_task_memory_state(void) {}
> +static inline int cpuset_update_task_memory_state(void)
> +{
> +	return 1;
> +}
>  
>  static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
>  {
> diff --git a/init/Kconfig b/init/Kconfig
> index 44e9208..971b0b5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -324,6 +324,14 @@ config CPUSETS
>  
>  	  Say N if unsure.
>  
> +config CGROUP_OOM_CONT
> +	bool "OOM controller for cgroups"
> +	depends on CGROUPS
> +	help
> +	  This option allows userspace to trap OOM conditions on a
> +	  per-cgroup basis, and take action that might prevent the OOM from
> +	  occurring.
> +
>  #
>  # Architectures with an unreliable sched_clock() should select this:
>  #
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 3e00526..c986423 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -355,13 +355,17 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
>   * within the tasks context, when it is trying to allocate memory
>   * (in various mm/mempolicy.c routines) and notices that some other
>   * task has been modifying its cpuset.
> + *
> + * Returns non-zero if the state was updated, including when it is
> + * an effective no-op.
>   */
>  
> -void cpuset_update_task_memory_state(void)
> +int cpuset_update_task_memory_state(void)
>  {
>  	int my_cpusets_mem_gen;
>  	struct task_struct *tsk = current;
>  	struct cpuset *cs;
> +	int ret = 0;
>  
>  	if (task_cs(tsk) == &top_cpuset) {
>  		/* Don't need rcu for top_cpuset.  It's never freed. */
> @@ -389,7 +393,9 @@ void cpuset_update_task_memory_state(void)
>  		task_unlock(tsk);
>  		mutex_unlock(&callback_mutex);
>  		mpol_rebind_task(tsk, &tsk->mems_allowed);
> +		ret = 1;
>  	}
> +	return ret;
>  }
>  
>  /*
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 64e5b4b..5677b72 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -32,6 +32,219 @@ int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
>  int sysctl_oom_dump_tasks;
>  static DEFINE_SPINLOCK(zone_scan_mutex);
> +
> +#ifdef CONFIG_CGROUP_OOM_CONT
> +struct oom_cgroup {
> +	struct cgroup_subsys_state css;
> +
> +	/* How long between first OOM indication and actual OOM kill
> +	 * for processes in this cgroup */
> +	unsigned long oom_delay;
> +
> +	/* When the current OOM delay began. Zero means no delay in progress */
> +	unsigned long oom_since;
> +
> +	/* Wait queue for userspace OOM handler */
> +	wait_queue_head_t oom_wait;
> +
> +	spinlock_t oom_lock;
> +};
> +
> +static inline
> +struct oom_cgroup *oom_cgroup_from_cont(struct cgroup *cont)
> +{
> +	return container_of(cgroup_subsys_state(cont, oom_cgroup_subsys_id),
> +			    struct oom_cgroup, css);
> +}
> +
> +static inline
> +struct oom_cgroup *oom_cgroup_from_task(struct task_struct *task)
> +{
> +	return container_of(task_subsys_state(task, oom_cgroup_subsys_id),
> +			    struct oom_cgroup, css);
> +}
> +
> +/*
> + * Takes oom_lock during call.
> + */
> +static int oom_cgroup_write_delay(struct cgroup *cont, struct cftype *cft,
> +				  u64 delay)
> +{
> +	struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
> +
> +	/* Sanity check */
> +	if (unlikely(delay > 60 * 1000))
> +		return -EINVAL;
> +	spin_lock(&cs->oom_lock);
> +	cs->oom_delay = msecs_to_jiffies(delay);
> +	spin_unlock(&cs->oom_lock);
> +	return 0;
> +}
> +
> +/*
> + * sleeps until the cgroup enters OOM (or a maximum of N milliseconds if N is
> + * passed). Clears the OOM condition in the cgroup when it returns.
> + */
> +static int oom_cgroup_write_await(struct cgroup *cont, struct cftype *cft,
> +				  u64 await)
> +{
> +	int retval = 0;
> +	struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
> +
> +	/* Don't try to wait for more than a minute */
> +	await = min(await, 60ULL * 1000);
> +	/* Try waiting for up to a second for an OOM condition */
> +	wait_event_interruptible_timeout(cs->oom_wait, cs->oom_since ||
> +					 cgroup_is_removed(cs->css.cgroup),
> +					 msecs_to_jiffies(await));
> +	spin_lock(&cs->oom_lock);
> +	if (cgroup_is_removed(cs->css.cgroup)) {
> +		/* The cpuset was removed while we slept */
> +		retval = -ENODEV;
> +	} else if (cs->oom_since) {
> +		/* We reached OOM.  Clear the OOM condition now that
> +		 * userspace knows about it */
> +		cs->oom_since = 0;
> +	} else if (signal_pending(current)) {
> +		retval = -EINTR;
> +	} else {
> +		/* No OOM yet */
> +		retval = -ETIMEDOUT;
> +	}
> +	spin_unlock(&cs->oom_lock);
> +	return retval;
> +}
> +
> +static u64 oom_cgroup_read_delay(struct cgroup *cont, struct cftype *cft)
> +{
> +	return oom_cgroup_from_cont(cont)->oom_delay;
> +}
> +
> +static struct cftype oom_cgroup_files[] = {
> +	{
> +		.name = "delay",
> +		.read_u64 = oom_cgroup_read_delay,
> +		.write_u64 = oom_cgroup_write_delay,
> +	},
> +
> +	{
> +		.name = "await",
> +		.write_u64 = oom_cgroup_write_await,
> +	},
> +};
> +
> +static struct cgroup_subsys_state *oom_cgroup_create(
> +	struct cgroup_subsys *ss,
> +	struct cgroup *cont)
> +{
> +	struct oom_cgroup *oom;
> +
> +	oom = kmalloc(sizeof(*oom), GFP_KERNEL);
> +	if (!oom)
> +		return ERR_PTR(-ENOMEM);
> +
> +	oom->oom_delay = 0;
> +	init_waitqueue_head(&oom->oom_wait);
> +	oom->oom_since = 0;
> +	spin_lock_init(&oom->oom_lock);
> +
> +	return &oom->css;
> +}
> +
> +static void oom_cgroup_destroy(struct cgroup_subsys *ss,
> +			       struct cgroup *cont)
> +{
> +	kfree(oom_cgroup_from_cont(cont));
> +}
> +
> +static int oom_cgroup_populate(struct cgroup_subsys *ss,
> +			       struct cgroup *cont)
> +{
> +	return cgroup_add_files(cont, ss, oom_cgroup_files,
> +				ARRAY_SIZE(oom_cgroup_files));
> +}
> +
> +struct cgroup_subsys oom_cgroup_subsys = {
> +	.name = "oom",
> +	.subsys_id = oom_cgroup_subsys_id,
> +	.create = oom_cgroup_create,
> +	.destroy = oom_cgroup_destroy,
> +	.populate = oom_cgroup_populate,
> +};
> +
> +
> +/*
> + * Call with no cpuset mutex held. Determines whether this process
> + * should allow an OOM to proceed as normal (retval==1) or should try
> + * again to allocate memory (retval==0). If necessary, sleeps and then
> + * updates the task's mems_allowed to let userspace update the memory
> + * nodes for the task's cpuset.
> + */
> +static int cgroup_should_oom(void)
> +{
> +	int ret = 1;	/* OOM by default */
> +	struct oom_cgroup *cs;
> +
> +	task_lock(current);
> +	cs = oom_cgroup_from_task(current);
> +
> +	spin_lock(&cs->oom_lock);
> +	if (cs->oom_delay) {
> +		/* We have an OOM delay configured */
> +		if (cs->oom_since) {
> +			/* We're already OOMing - see if we're over
> +			 * the time limit. Also make sure that jiffie
> +			 * wrap-around doesn't make us think we're in
> +			 * an incredibly long OOM delay */
> +			unsigned long deadline = cs->oom_since + cs->oom_delay;
> +			if (time_after(deadline, jiffies) &&
> +			    !time_after(cs->oom_since, jiffies)) {
> +				/* Not OOM yet */
> +				ret = 0;
> +			}
> +		} else {
> +			/* This is the first OOM */
> +			ret = 0;
> +			cs->oom_since = jiffies;
> +			/* Avoid problems with jiffie wrap - make an
> +			 * oom_since of zero always mean not
> +			 * OOMing */
> +			if (!cs->oom_since)
> +				cs->oom_since = 1;
> +			printk(KERN_WARNING
> +			       "Cpuset %s (pid %d) sending memory "
> +			       "notification to userland at %lu%s\n",
> +			       cs->css.cgroup->dentry->d_name.name,
> +			       current->pid, jiffies,
> +			       waitqueue_active(&cs->oom_wait) ?
> + "" : " (no waiters)"); > + } > + if (!ret) { > + /* If we're planning to retry, we should wake > + * up any userspace waiter in order to let it > + * handle the OOM > + */ > + wake_up_all(&cs->oom_wait); > + } > + } > + > + spin_unlock(&cs->oom_lock); > + task_unlock(current); > + if (!ret) { > + /* If we're not going to OOM, we should sleep for a > + * bit to give userspace a chance to respond before we > + * go back and try to reclaim again */ > + schedule_timeout_uninterruptible(1); > + } > + return ret; > +} > +#else /* !CONFIG_CGROUP_OOM_CONT */ > +static inline int cgroup_should_oom(void) > +{ > + return 1; > +} > +#endif > + > /* #define DEBUG */ > > /** > @@ -526,6 +739,13 @@ void out_of_memory(struct zonelist *zonelist, > gfp_t gfp_mask, int order) > unsigned long freed = 0; > enum oom_constraint constraint; > > + /* > + * It is important to call in this order since cgroup_should_oom() > + * might sleep and give userspace chance to update mems. > + */ > + if (!cgroup_should_oom() || cpuset_update_task_memory_state()) > + return; > + > blocking_notifier_call_chain(&oom_notify_list, 0, &freed); > if (freed > 0) > /* Got some memory back in the last second. */ > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org