From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Ying Han <yinghan@google.com>
Cc: linux-mm@kvack.org, Rohit Seth <rohitseth@google.com>,
Paul Menage <menage@google.com>,
David Rientjes <rientjes@google.com>
Subject: Re: [RFC][PATCH]Per-cgroup OOM handler
Date: Thu, 6 Nov 2008 14:34:38 +0900
Message-ID: <20081106143438.5557b87c.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <604427e00811031419k2e990061kdb03f4b715b51fb9@mail.gmail.com>
Thank you for posting.
On Mon, 3 Nov 2008 14:19:11 -0800
Ying Han <yinghan@google.com> wrote:
> Sorry, please use the following patch (I deleted the duplicate definition
> in cgroup_subsys.h from the last patch).
>
> Per-cgroup OOM handler ported from cpuset to cgroup.
>
> The per-cgroup OOM handler lets a userspace handler catch and handle an
> OOM: the OOMing thread doesn't trigger a kill, but returns to
> alloc_pages() to try again; alternatively, userspace can let the OOM
> killer go ahead as normal.
>
> It's a standalone subsystem that can work with either the memory cgroup
> or with cpusets (where memory is constrained by NUMA nodes).
>
> The features are:
>
> - an oom.delay file that controls how long a thread will pause in the
> OOM killer waiting for a response from userspace (in milliseconds)
>
> - an oom.await file that a userspace handler can write a timeout value
> to, and be awoken either when a process in that cgroup enters the OOM
> killer, or when the timeout expires.
>
> example:
> (mount oom as a normal cgroup subsystem alongside cpuset)
> 1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
>
> (configure the sample cpuset to contain a single fake-NUMA node with
> 128M and one CPU core)
> 2. mkdir /dev/cpuset/sample
>    echo 1 > /dev/cpuset/sample/cpuset.mems
>    echo 1 > /dev/cpuset/sample/cpuset.cpus
>
> (set oom.delay to 10 seconds)
> 3. echo 10000 > /dev/cpuset/sample/oom.delay
>
> (put the shell on the wait queue, waiting at most 60 seconds)
> 4. echo 60000 > /dev/cpuset/sample/oom.await
>
> (trigger the OOM by mlocking 600M of anonymous memory)
> 5. /oom 600000000
>
> When the sample cpuset triggers an OOM, it wakes up the OOM-handler
> thread that slept in step 4, sleeps for a jiffy, and then returns to
> alloc_pages() to try again. This sleep gives the OOM handler time to
> deal with the OOM, for example by giving another memory node to the
> OOMing cpuset.
>
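Just to check my reading of the interface: a userspace handler would sit
in a loop like the sketch below. The file paths and the extra node are
assumptions taken from your example, not part of the patch; a write to
oom.await returns success only when the cgroup actually hit OOM, and
fails with ETIMEDOUT otherwise.

/* Minimal sketch of a userspace OOM handler for this interface.
 * Paths and the added node "2" are assumptions from the example. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	const char *await = "/dev/cpuset/sample/oom.await";
	const char *mems = "/dev/cpuset/sample/cpuset.mems";

	for (;;) {
		int fd = open(await, O_WRONLY);

		if (fd < 0)
			return 1;
		/* block for up to 60 seconds waiting for an OOM */
		if (write(fd, "60000", 5) == 5) {
			/* OOM happened: give the cpuset another node */
			int mfd = open(mems, O_WRONLY);

			if (mfd >= 0) {
				write(mfd, "1-2", 3);
				close(mfd);
			}
		}
		close(fd);
	}
	return 0;
}
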
Where does this one-tick wait come from? Does it work well?
(Before OOM, the system tends to be waiting in congestion_wait() or similar.)
Should the OOM handler live in another cpuset, or be mlocked, in this case?
> We're sending out this in-house patch to start discussion about what
> might be appropriate for supporting user-space OOM-handling in the
> mainline kernel. Potential improvements include:
>
Thanks.
> - providing more information in the OOM notification, such as the pid
> that triggered the OOM, and a unique id for that OOM instance that can
> be tied to later OOM-kill notifications.
With this patch, there is no way to know why the OOM happened.
I think the following information is necessary at least:
- OOM because of a global memory shortage.
- OOM because of a cpuset memory shortage.
- OOM because a memcg hit its limit.
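For example (a sketch only; these names are mine, not from the patch),
the notification could carry a cause field:

/* hypothetical cause field for an extended OOM notification */
enum oom_cause {
	OOM_CAUSE_GLOBAL,	/* global memory shortage */
	OOM_CAUSE_CPUSET,	/* cpuset mems_allowed exhausted */
	OOM_CAUSE_MEMCG,	/* memcg hit its limit */
};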
>
> - allowing better notifications from userspace back to the kernel.
I'd love an interface that allows poll().
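Something like the sketch below on the userspace side; this assumes a
hypothetical oom file that supports poll(), which the current patch
does not provide:

/* hypothetical: wait for OOM via poll() instead of a blocking write */
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* returns >0 if an OOM is pending, 0 on timeout, <0 on error */
int wait_for_oom(const char *path, int timeout_ms)
{
	struct pollfd pfd;
	int ret;

	pfd.fd = open(path, O_RDONLY);
	if (pfd.fd < 0)
		return -1;
	pfd.events = POLLIN;
	ret = poll(&pfd, 1, timeout_ms);
	close(pfd.fd);
	return ret;
}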
I'm also wondering about:
- freezing all threads in the group at OOM
- handing emergency memory, pooled at cgroup creation, back to the page
  allocator, rather than the one-tick wait
BTW, it seems this patch always allows task detach/attach. Is that safe
(and sane)?
At first impression it needs some more work, but it is interesting in
general. Thanks.
-Kame
>
> Documentation/cgroups/oom-handler.txt | 49 ++++++++
> include/linux/cgroup_subsys.h | 12 ++
> include/linux/cpuset.h | 7 +-
> init/Kconfig | 8 ++
> kernel/cpuset.c | 8 +-
> mm/oom_kill.c | 220 +++++++++++++++++++++++++++++++++
> 6 files changed, 301 insertions(+), 3 deletions(-)
>
> Signed-off-by: Paul Menage <menage@google.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Ying Han <yinghan@google.com>
>
>
> diff --git a/Documentation/cgroups/oom-handler.txt b/Documentation/cgroups/oom-handler.txt
> new file mode 100644
> index 0000000..aa006fe
> --- /dev/null
> +++ b/Documentation/cgroups/oom-handler.txt
> @@ -0,0 +1,49 @@
> +The per-cgroup OOM handler lets a userspace handler catch and handle an OOM:
> +the OOMing thread doesn't trigger a kill, but returns to alloc_pages() to try
> +again; alternatively, userspace can let the OOM killer go ahead as normal.
> +
> +It's a standalone subsystem that can work with either the memory cgroup or
> +with cpusets (where memory is constrained by NUMA nodes).
> +
> +The features are:
> +
> +- an oom.delay file that controls how long a thread will pause in the
> +OOM killer waiting for a response from userspace (in milliseconds)
> +
> +- an oom.await file that a userspace handler can write a timeout value
> +to, and be awoken either when a process in that cgroup enters the OOM
> +killer, or when the timeout expires.
> +
> +example:
> +(mount oom as a normal cgroup subsystem alongside cpuset)
> +1. mount -t cgroup -o cpuset,oom cpuset /dev/cpuset
> +
> +(configure the sample cpuset to contain a single fake-NUMA node with
> +128M and one CPU core)
> +2. mkdir /dev/cpuset/sample
> +   echo 1 > /dev/cpuset/sample/cpuset.mems
> +   echo 1 > /dev/cpuset/sample/cpuset.cpus
> +
> +(set oom.delay to 10 seconds)
> +3. echo 10000 > /dev/cpuset/sample/oom.delay
> +
> +(put the shell on the wait queue, waiting at most 60 seconds)
> +4. echo 60000 > /dev/cpuset/sample/oom.await
> +
> +(trigger the OOM by mlocking 600M of anonymous memory)
> +5. /oom 600000000
> +
> +When the sample cpuset triggers an OOM, it wakes up the OOM-handler
> +thread that slept in step 4, sleeps for a jiffy, and then returns to
> +alloc_pages() to try again. This sleep gives the OOM handler time to
> +deal with the OOM, for example by giving another memory node to the
> +OOMing cpuset.
> +
> +Potential improvements include:
> +- providing more information in the OOM notification, such as the pid
> +that triggered the OOM, and a unique id for that OOM instance that can
> +be tied to later OOM-kill notifications.
> +
> +- allowing better notifications from userspace back to the kernel.
> +
> +
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 9c22396..23fe6c7 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -54,3 +54,9 @@ SUBSYS(freezer)
> #endif
>
> /* */
> +
> +#ifdef CONFIG_CGROUP_OOM_CONT
> +SUBSYS(oom_cgroup)
> +#endif
> +
> +/* */
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2691926..26dab22 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -25,7 +25,7 @@ extern void cpuset_cpus_allowed_locked(struct task_struct *p, cpumask_t *mask);
> extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
> #define cpuset_current_mems_allowed (current->mems_allowed)
> void cpuset_init_current_mems_allowed(void);
> -void cpuset_update_task_memory_state(void);
> +int cpuset_update_task_memory_state(void);
> int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
>
> extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
> @@ -103,7 +103,10 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>
> #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
> static inline void cpuset_init_current_mems_allowed(void) {}
> -static inline void cpuset_update_task_memory_state(void) {}
> +static inline int cpuset_update_task_memory_state(void)
> +{
> + return 1;
> +}
>
> static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
> {
> diff --git a/init/Kconfig b/init/Kconfig
> index 44e9208..971b0b5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -324,6 +324,14 @@ config CPUSETS
>
> Say N if unsure.
>
> +config CGROUP_OOM_CONT
> + bool "OOM controller for cgroups"
> + depends on CGROUPS
> + help
> + This option allows userspace to trap OOM conditions on a
> + per-cgroup basis, and take action that might prevent the OOM from
> + occurring.
> +
> #
> # Architectures with an unreliable sched_clock() should select this:
> #
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 3e00526..c986423 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -355,13 +355,17 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
> * within the tasks context, when it is trying to allocate memory
> * (in various mm/mempolicy.c routines) and notices that some other
> * task has been modifying its cpuset.
> + *
> + * Returns non-zero if the state was updated, including when it is
> + * an effective no-op.
> */
>
> -void cpuset_update_task_memory_state(void)
> +int cpuset_update_task_memory_state(void)
> {
> int my_cpusets_mem_gen;
> struct task_struct *tsk = current;
> struct cpuset *cs;
> + int ret = 0;
>
> if (task_cs(tsk) == &top_cpuset) {
> /* Don't need rcu for top_cpuset. It's never freed. */
> @@ -389,7 +393,9 @@ void cpuset_update_task_memory_state(void)
> task_unlock(tsk);
> mutex_unlock(&callback_mutex);
> mpol_rebind_task(tsk, &tsk->mems_allowed);
> + ret = 1;
> }
> + return ret;
> }
>
> /*
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 64e5b4b..5677b72 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -32,6 +32,219 @@ int sysctl_panic_on_oom;
> int sysctl_oom_kill_allocating_task;
> int sysctl_oom_dump_tasks;
> static DEFINE_SPINLOCK(zone_scan_mutex);
> +
> +#ifdef CONFIG_CGROUP_OOM_CONT
> +struct oom_cgroup {
> + struct cgroup_subsys_state css;
> +
> + /* How long between first OOM indication and actual OOM kill
> + * for processes in this cgroup */
> + unsigned long oom_delay;
> +
> + /* When the current OOM delay began. Zero means no delay in progress */
> + unsigned long oom_since;
> +
> + /* Wait queue for userspace OOM handler */
> + wait_queue_head_t oom_wait;
> +
> + spinlock_t oom_lock;
> +};
> +
> +static inline
> +struct oom_cgroup *oom_cgroup_from_cont(struct cgroup *cont)
> +{
> + return container_of(cgroup_subsys_state(cont, oom_cgroup_subsys_id),
> + struct oom_cgroup, css);
> +}
> +
> +static inline
> +struct oom_cgroup *oom_cgroup_from_task(struct task_struct *task)
> +{
> + return container_of(task_subsys_state(task, oom_cgroup_subsys_id),
> + struct oom_cgroup, css);
> +}
> +
> +/*
> + * Takes oom_lock during call.
> + */
> +static int oom_cgroup_write_delay(struct cgroup *cont, struct cftype *cft,
> + u64 delay)
> +{
> + struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
> +
> + /* Sanity check */
> + if (unlikely(delay > 60 * 1000))
> + return -EINVAL;
> + spin_lock(&cs->oom_lock);
> + cs->oom_delay = msecs_to_jiffies(delay);
> + spin_unlock(&cs->oom_lock);
> + return 0;
> +}
> +
> +/*
> + * sleeps until the cgroup enters OOM (or a maximum of N milliseconds if N is
> + * passed). Clears the OOM condition in the cgroup when it returns.
> + */
> +static int oom_cgroup_write_await(struct cgroup *cont, struct cftype *cft,
> + u64 await)
> +{
> + int retval = 0;
> + struct oom_cgroup *cs = oom_cgroup_from_cont(cont);
> +
> + /* Don't try to wait for more than a minute */
> + await = min(await, 60ULL * 1000);
> + /* Wait for an OOM condition, up to the requested timeout */
> + wait_event_interruptible_timeout(cs->oom_wait, cs->oom_since ||
> + cgroup_is_removed(cs->css.cgroup),
> + msecs_to_jiffies(await));
> + spin_lock(&cs->oom_lock);
> + if (cgroup_is_removed(cs->css.cgroup)) {
> + /* The cpuset was removed while we slept */
> + retval = -ENODEV;
> + } else if (cs->oom_since) {
> + /* We reached OOM. Clear the OOM condition now that
> + * userspace knows about it */
> + cs->oom_since = 0;
> + } else if (signal_pending(current)) {
> + retval = -EINTR;
> + } else {
> + /* No OOM yet */
> + retval = -ETIMEDOUT;
> + }
> + spin_unlock(&cs->oom_lock);
> + return retval;
> +}
> +
> +static u64 oom_cgroup_read_delay(struct cgroup *cont, struct cftype *cft)
> +{
> + return oom_cgroup_from_cont(cont)->oom_delay;
> +}
> +
> +static struct cftype oom_cgroup_files[] = {
> + {
> + .name = "delay",
> + .read_u64 = oom_cgroup_read_delay,
> + .write_u64 = oom_cgroup_write_delay,
> + },
> +
> + {
> + .name = "await",
> + .write_u64 = oom_cgroup_write_await,
> + },
> +};
> +
> +static struct cgroup_subsys_state *oom_cgroup_create(
> + struct cgroup_subsys *ss,
> + struct cgroup *cont)
> +{
> + struct oom_cgroup *oom;
> +
> + oom = kmalloc(sizeof(*oom), GFP_KERNEL);
> + if (!oom)
> + return ERR_PTR(-ENOMEM);
> +
> + oom->oom_delay = 0;
> + init_waitqueue_head(&oom->oom_wait);
> + oom->oom_since = 0;
> + spin_lock_init(&oom->oom_lock);
> +
> + return &oom->css;
> +}
> +
> +static void oom_cgroup_destroy(struct cgroup_subsys *ss,
> + struct cgroup *cont)
> +{
> + kfree(oom_cgroup_from_cont(cont));
> +}
> +
> +static int oom_cgroup_populate(struct cgroup_subsys *ss,
> + struct cgroup *cont)
> +{
> + return cgroup_add_files(cont, ss, oom_cgroup_files,
> + ARRAY_SIZE(oom_cgroup_files));
> +}
> +
> +struct cgroup_subsys oom_cgroup_subsys = {
> + .name = "oom",
> + .subsys_id = oom_cgroup_subsys_id,
> + .create = oom_cgroup_create,
> + .destroy = oom_cgroup_destroy,
> + .populate = oom_cgroup_populate,
> +};
> +
> +
> +/*
> + * Call with no cpuset mutex held. Determines whether this process
> + * should allow an OOM to proceed as normal (retval==1) or should try
> + * again to allocate memory (retval==0). If necessary, sleeps and then
> + * updates the task's mems_allowed to let userspace update the memory
> + * nodes for the task's cpuset.
> + */
> +static int cgroup_should_oom(void)
> +{
> + int ret = 1; /* OOM by default */
> + struct oom_cgroup *cs;
> +
> + task_lock(current);
> + cs = oom_cgroup_from_task(current);
> +
> + spin_lock(&cs->oom_lock);
> + if (cs->oom_delay) {
> + /* We have an OOM delay configured */
> + if (cs->oom_since) {
> + /* We're already OOMing - see if we're over
> + * the time limit. Also make sure that jiffie
> + * wrap-around doesn't make us think we're in
> + * an incredibly long OOM delay */
> + unsigned long deadline = cs->oom_since + cs->oom_delay;
> + if (time_after(deadline, jiffies) &&
> + !time_after(cs->oom_since, jiffies)) {
> + /* Not OOM yet */
> + ret = 0;
> + }
> + } else {
> + /* This is the first OOM */
> + ret = 0;
> + cs->oom_since = jiffies;
> + /* Avoid problems with jiffie wrap - make an
> + * oom_since of zero always mean not
> + * OOMing */
> + if (!cs->oom_since)
> + cs->oom_since = 1;
> + printk(KERN_WARNING
> + "Cpuset %s (pid %d) sending memory "
> + "notification to userland at %lu%s\n",
> + cs->css.cgroup->dentry->d_name.name,
> + current->pid, jiffies,
> + waitqueue_active(&cs->oom_wait) ?
> + "" : " (no waiters)");
> + }
> + if (!ret) {
> + /* If we're planning to retry, we should wake
> + * up any userspace waiter in order to let it
> + * handle the OOM
> + */
> + wake_up_all(&cs->oom_wait);
> + }
> + }
> +
> + spin_unlock(&cs->oom_lock);
> + task_unlock(current);
> + if (!ret) {
> + /* If we're not going to OOM, we should sleep for a
> + * bit to give userspace a chance to respond before we
> + * go back and try to reclaim again */
> + schedule_timeout_uninterruptible(1);
> + }
> + return ret;
> +}
> +#else /* !CONFIG_CGROUP_OOM_CONT */
> +static inline int cgroup_should_oom(void)
> +{
> + return 1;
> +}
> +#endif
> +
> /* #define DEBUG */
>
> /**
> @@ -526,6 +739,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
> unsigned long freed = 0;
> enum oom_constraint constraint;
>
> + /*
> + * It is important to call in this order since cgroup_should_oom()
> + * might sleep and give userspace a chance to update mems.
> + */
> + if (!cgroup_should_oom() || cpuset_update_task_memory_state())
> + return;
> +
> blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> if (freed > 0)
> /* Got some memory back in the last second. */
>