From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: David Rientjes <rientjes@google.com>
Cc: kosaki.motohiro@jp.fujitsu.com,
Andrew Morton <akpm@linux-foundation.org>,
Rik van Riel <riel@redhat.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Nick Piggin <npiggin@suse.de>,
Andrea Arcangeli <aarcange@redhat.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
Lubos Lunak <l.lunak@suse.cz>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms
Date: Mon, 15 Feb 2010 14:03:06 +0900 (JST)
Message-ID: <20100215120924.7281.A69D9226@jp.fujitsu.com>
In-Reply-To: <alpine.DEB.2.00.1002100228370.8001@chino.kir.corp.google.com>
> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes. There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
>
> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score. This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
> include/linux/mempolicy.h | 13 +++++++-
> mm/mempolicy.c | 39 +++++++++++++++++++++++
> mm/oom_kill.c | 77 +++++++++++++++++++++++++++-----------------
> 3 files changed, 98 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> unsigned long addr, gfp_t gfp_flags,
> struct mempolicy **mpol, nodemask_t **nodemask);
> extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> + const nodemask_t *mask);
> extern unsigned slab_node(struct mempolicy *policy);
>
> extern enum zone_type policy_zone;
> @@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> return node_zonelist(0, gfp_flags);
> }
>
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
> +{
> + return false;
> +}
> +
> +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> + const nodemask_t *mask)
> +{
> + return false;
> +}
>
> static inline int do_migrate_pages(struct mm_struct *mm,
> const nodemask_t *from_nodes,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1638,6 +1638,45 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> }
> #endif
>
> +/*
> + * mempolicy_nodemask_intersects
> + *
> + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> + * policy. Otherwise, check for intersection between mask and the policy
> + * nodemask for 'bind' or 'interleave' policy, or mask to contain the single
> + * node for 'preferred' or 'local' policy.
> + */
> +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +					const nodemask_t *mask)
> +{
> +	struct mempolicy *mempolicy;
> +	bool ret = true;
> +
> +	mempolicy = tsk->mempolicy;
> +	mpol_get(mempolicy);
Why is this refcount increment necessary? tsk already holds a reference to
its mempolicy; IOW, it can never be freed within this function.
> +	if (!mask || !mempolicy)
> +		goto out;
> +
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		if (mempolicy->flags & MPOL_F_LOCAL)
> +			ret = node_isset(numa_node_id(), *mask);
Hmm, is this a good heuristic?
The task can migrate between CPUs, so "node_isset(numa_node_id(), *mask) == 0"
does not mean the task consumes no memory on the nodes in *mask.
> +		else
> +			ret = node_isset(mempolicy->v.preferred_node,
> +								*mask);
> +		break;
> +	case MPOL_BIND:
> +	case MPOL_INTERLEAVE:
> +		ret = nodes_intersects(mempolicy->v.nodes, *mask);
> +		break;
> +	default:
> +		BUG();
> +	}
> +out:
> +	mpol_put(mempolicy);
> +	return ret;
> +}
> +
> /* Allocate a page in interleaved policy.
> Own path because it needs to do special accounting. */
> static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -26,6 +26,7 @@
> #include <linux/module.h>
> #include <linux/notifier.h>
> #include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
> #include <linux/security.h>
>
> int sysctl_panic_on_oom;
> @@ -36,19 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>
> /*
> * Do all threads of the target process overlap our allowed nodes?
> + * @tsk: task struct of which task to consider
> + * @mask: nodemask passed to page allocator for mempolicy ooms
> */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> + const nodemask_t *mask)
> {
> - struct task_struct *t;
> + struct task_struct *start = tsk;
>
> - t = tsk;
> do {
> - if (cpuset_mems_allowed_intersects(current, t))
> - return 1;
> - t = next_thread(t);
> - } while (t != tsk);
> -
> - return 0;
> + if (mask) {
> + /*
> + * If this is a mempolicy constrained oom, tsk's
> + * cpuset is irrelevant. Only return true if its
> + * mempolicy intersects current, otherwise it may be
> + * needlessly killed.
> + */
> + if (mempolicy_nodemask_intersects(tsk, mask))
> + return true;
> + } else {
> + /*
> + * This is not a mempolicy constrained oom, so only
> + * check the mems of tsk's cpuset.
> + */
> + if (cpuset_mems_allowed_intersects(current, tsk))
> + return true;
> + }
> + tsk = next_thread(tsk);
> + } while (tsk != start);
> + return false;
> }
>
> /**
> @@ -236,7 +253,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> * (not docbooked, we don't want this one cluttering up the manual)
> */
> static struct task_struct *select_bad_process(unsigned long *ppoints,
> - struct mem_cgroup *mem)
> + struct mem_cgroup *mem, enum oom_constraint constraint,
> + const nodemask_t *mask)
> {
> struct task_struct *p;
> struct task_struct *chosen = NULL;
> @@ -258,7 +276,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> continue;
> if (mem && !task_in_mem_cgroup(p, mem))
> continue;
> - if (!has_intersects_mems_allowed(p))
> + if (!has_intersects_mems_allowed(p,
> + constraint == CONSTRAINT_MEMORY_POLICY ? mask :
> + NULL))
> continue;
>
> /*
> @@ -478,7 +498,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>
> read_lock(&tasklist_lock);
> retry:
> - p = select_bad_process(&points, mem);
> + p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
> if (PTR_ERR(p) == -1UL)
> goto out;
>
> @@ -560,7 +580,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> /*
> * Must be called with tasklist_lock held for read.
> */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> + enum oom_constraint constraint, const nodemask_t *mask)
> {
> struct task_struct *p;
> unsigned long points;
> @@ -574,7 +595,7 @@ retry:
> * Rambo mode: Shoot down a process and hope it solves whatever
> * issues we may have.
> */
> - p = select_bad_process(&points, NULL);
> + p = select_bad_process(&points, NULL, constraint, mask);
>
> if (PTR_ERR(p) == -1UL)
> return;
> @@ -615,7 +636,8 @@ void pagefault_out_of_memory(void)
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> read_lock(&tasklist_lock);
> - __out_of_memory(0, 0); /* unknown gfp_mask and order */
> + /* unknown gfp_mask and order */
> + __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> read_unlock(&tasklist_lock);
>
> /*
> @@ -632,6 +654,7 @@ rest_and_return:
> * @zonelist: zonelist pointer
> * @gfp_mask: memory allocation flags
> * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> *
> * If we run out of memory, we have the choice between either
> * killing a random task (bad), letting the system crash (worse)
> @@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> */
> constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> read_lock(&tasklist_lock);
> -
> - switch (constraint) {
> - case CONSTRAINT_MEMORY_POLICY:
> - oom_kill_process(current, gfp_mask, order, 0, NULL,
> - "No available memory (MPOL_BIND)");
> - break;
> -
> - case CONSTRAINT_NONE:
> - if (sysctl_panic_on_oom) {
> + if (unlikely(sysctl_panic_on_oom)) {
> + /*
> + * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> + * should not panic for cpuset or mempolicy induced memory
> + * failures.
> + */
> + if (constraint == CONSTRAINT_NONE) {
> dump_header(NULL, gfp_mask, order, NULL);
> - panic("out of memory. panic_on_oom is selected\n");
> + panic("Out of memory: panic_on_oom is enabled\n");
"enabled"? This feature is enabled at boot time. Would "triggered" or "fired" be better?
> }
> - /* Fall-through */
> - case CONSTRAINT_CPUSET:
> - __out_of_memory(gfp_mask, order);
> - break;
> }
> -
> + __out_of_memory(gfp_mask, order, constraint, nodemask);
> read_unlock(&tasklist_lock);
>
> /*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>