Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Rientjes <rientjes@google.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux-foundation.org>
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
Date: Tue, 17 Nov 2009 16:11:58 -0800 (PST)	[thread overview]
Message-ID: <alpine.DEB.2.00.0911171609370.12532@chino.kir.corp.google.com> (raw)
In-Reply-To: <20091111153414.3c263842.kamezawa.hiroyu@jp.fujitsu.com>

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> Fixing node-oriented allocation handling in oom-kill.c
> I myself think this as bugfix not as ehnancement.
> 
> In these days, things are changed as
>   - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask().
>   - mempolicy don't maintain its own private zonelists.
>   (And cpuset doesn't use nodemask for __alloc_pages_nodemask())
> 
> So, current oom-killer's check function is wrong.
> 
> This patch does
>   - check nodemask, if nodemask && nodemask doesn't cover all
>     node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
>   - Scan all zonelist under nodemask, if it hits cpuset's wall
>     this faiulre is from cpuset.
> And
>   - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
>     This doesn't change "current" behavior. If callers use __GFP_THISNODE
>     it should handle "page allocation failure" by itself.
> 
>   - handle __GFP_NOFAIL+__GFP_THISNODE path.
>     This is something like a FIXME but this gfpmask is not used now.
> 

Now that we're passing the nodemask into the oom killer, we should be able 
to do more intelligent CONSTRAINT_MEMORY_POLICY selection.  current is not 
always the ideal task to kill, so it's better to scan the tasklist and 
determine the best task depending on our heuristics, similiar to how we 
penalize candidates if they do not share the same cpuset.

Something like the following (untested) patch.  Comments?
---
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -201,7 +201,9 @@ extern void mpol_fix_fork_child_flag(struct task_struct *p);
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
-extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
+				nodemask_t *mask);
+extern bool init_nodemask_of_current_mempolicy(nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
+							nodemask_t *mask)
+{
+	return false;
+}
+
+static inline bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed3f392..3ab3021 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1373,7 +1373,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
 		 * global hstate attribute
 		 */
 		if (!(obey_mempolicy &&
-				init_nodemask_of_mempolicy(nodes_allowed))) {
+			init_nodemask_of_current_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
 			nodes_allowed = &node_states[N_HIGH_MEMORY];
 		}
@@ -1860,7 +1860,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 		NODEMASK_ALLOC(nodemask_t, nodes_allowed,
 						GFP_KERNEL | __GFP_NORETRY);
 		if (!(obey_mempolicy &&
-			       init_nodemask_of_mempolicy(nodes_allowed))) {
+			init_nodemask_of_current_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
 			nodes_allowed = &node_states[N_HIGH_MEMORY];
 		}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f11fdad..23c84bb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1568,24 +1568,18 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 	}
 	return zl;
 }
+#endif
 
 /*
- * init_nodemask_of_mempolicy
- *
- * If the current task's mempolicy is "default" [NULL], return 'false'
- * to indicate default policy.  Otherwise, extract the policy nodemask
- * for 'bind' or 'interleave' policy into the argument nodemask, or
- * initialize the argument nodemask to contain the single node for
- * 'preferred' or 'local' policy and return 'true' to indicate presence
- * of non-default mempolicy.
- *
- * We don't bother with reference counting the mempolicy [mpol_get/put]
- * because the current task is examining it's own mempolicy and a task's
- * mempolicy is only ever changed by the task itself.
+ * If tsk's mempolicy is "default" [NULL], return 'false' to indicate default
+ * policy.  Otherwise, extract the policy nodemask for 'bind' or 'interleave'
+ * policy into the argument nodemask, or initialize the argument nodemask to
+ * contain the single node for 'preferred' or 'local' policy and return 'true'
+ * to indicate presence of non-default mempolicy.
  *
  * N.B., it is the caller's responsibility to free a returned nodemask.
  */
-bool init_nodemask_of_mempolicy(nodemask_t *mask)
+bool init_nodemask_of_task_mempolicy(struct task_struct *tsk, nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
 	int nid;
@@ -1615,7 +1609,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 
 	return true;
 }
-#endif
+
+/*
+ * We don't bother with reference counting the mempolicy [mpol_get/put]
+ * because the current task is examining it's own mempolicy and a task's
+ * mempolicy is only ever changed by the task itself.
+ */
+bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
+{
+	return init_nodemask_of_task_mempolicy(current, mask);
+}
 
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ab04537..4c5c58b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
 #include <linux/security.h>
+#include <linux/mempolicy.h>
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
@@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do the nodes allowed by any of tsk's threads overlap ours?
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static int has_intersects_mems_allowed(struct task_struct *tsk,
+						nodemask_t *nodemask)
 {
-	struct task_struct *t;
+	struct task_struct *start = tsk;
+	NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
 
-	t = tsk;
+	if (!nodemask)
+		mpol_nodemask = NULL;
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
+		if (mpol_nodemask) {
+			mpol_get(tsk->mempolicy);
+			if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
+				nodes_intersects(*nodemask, *mpol_nodemask)) {
+				mpol_put(tsk->mempolicy);
+				return 1;
+			}
+			mpol_put(tsk->mempolicy);
+		}
+		if (cpuset_mems_allowed_intersects(current, tsk))
 			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
+		tsk = next_thread(tsk);
+	} while (tsk != start);
 
 	return 0;
 }
@@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
  * @uptime: current uptime in seconds
+ * @constraint: type of oom constraint
+ * @nodemask: nodemask passed to page allocator
  *
  * The formula used is relatively simple and documented inline in the
  * function. The main rationale is that we want to select a good task
@@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
  *    of least surprise ... (be careful when you change it)
  */
 
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long badness(struct task_struct *p, unsigned long uptime,
+			enum oom_constraint constraint, nodemask_t *nodemask)
 {
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
@@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 * because p may have allocated or otherwise mapped memory on
 	 * this node before. However it will be less likely.
 	 */
-	if (!has_intersects_mems_allowed(p))
+	if (!has_intersects_mems_allowed(p,
+			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
+								 NULL))
 		points /= 8;
 
 	/*
@@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+			struct mem_cgroup *mem, enum oom_constraint constraint,
+			nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (p->signal->oom_adj == OOM_DISABLE)
 			continue;
 
-		points = badness(p, uptime.tv_sec);
+		points = badness(p, uptime.tv_sec, constraint, nodemask);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+			enum oom_constraint constraint, nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	unsigned long points;
@@ -568,7 +588,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, constraint, nodemask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order */
+	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
 	case CONSTRAINT_NONE:
 		if (sysctl_panic_on_oom) {
 			dump_header(gfp_mask, order, NULL);
@@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		}
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
+	case CONSTRAINT_MEMORY_POLICY:
+		__out_of_memory(gfp_mask, order, constraint, nodemask);
 		break;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2009-11-18  0:12 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-11-04  8:09 [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask KAMEZAWA Hiroyuki
2009-11-06  0:02 ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v2 KAMEZAWA Hiroyuki
2009-11-10  7:24   ` KOSAKI Motohiro
2009-11-10  7:24     ` KAMEZAWA Hiroyuki
2009-11-10  7:39       ` KOSAKI Motohiro
2009-11-10  7:40         ` KAMEZAWA Hiroyuki
2009-11-10  8:03           ` Daisuke Nishimura
2009-11-10  8:17             ` KAMEZAWA Hiroyuki
2009-11-11  2:24               ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v3 KAMEZAWA Hiroyuki
2009-11-11  2:36                 ` KOSAKI Motohiro
2009-11-11  2:49                 ` David Rientjes
2009-11-11  3:02                   ` KOSAKI Motohiro
2009-11-11  3:10                     ` KAMEZAWA Hiroyuki
2009-11-11  3:14                     ` David Rientjes
2009-11-11  3:23                       ` KOSAKI Motohiro
2009-11-11  3:27                         ` David Rientjes
2009-11-11  3:04                   ` KAMEZAWA Hiroyuki
2009-11-11  4:45                 ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4 KAMEZAWA Hiroyuki
2009-11-11  5:28                   ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.1 KAMEZAWA Hiroyuki
2009-11-11  5:58                     ` David Rientjes
2009-11-11  6:20                       ` KAMEZAWA Hiroyuki
2009-11-11  6:26                         ` David Rientjes
2009-11-11  6:34                           ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2 KAMEZAWA Hiroyuki
2009-11-11  7:32                             ` David Rientjes
2009-11-18  0:11                             ` David Rientjes [this message]
2009-11-18  0:58                               ` KAMEZAWA Hiroyuki
2009-11-18  2:13                                 ` David Rientjes
2009-12-15  1:16                                   ` Andrew Morton
2009-12-15  1:32                                     ` KAMEZAWA Hiroyuki
2009-12-15  1:38                                       ` KOSAKI Motohiro
2009-12-15  4:30                                       ` David Rientjes
2009-12-15  4:35                                         ` KAMEZAWA Hiroyuki
2009-12-15  4:54                                           ` David Rientjes
2009-12-15  5:19                                             ` KOSAKI Motohiro
2009-12-17 22:21                                               ` David Rientjes
2009-12-18  4:30                                                 ` KOSAKI Motohiro
2009-12-18 10:04                                                   ` David Rientjes
2009-12-15  4:57                                           ` KAMEZAWA Hiroyuki
2009-12-15  4:43                                         ` KAMEZAWA Hiroyuki
2009-12-15  4:57                                           ` David Rientjes
2009-12-15  5:09                                             ` KAMEZAWA Hiroyuki
2009-12-17 22:23                                               ` David Rientjes
2009-12-17 23:33                                                 ` KAMEZAWA Hiroyuki
2009-12-15  4:47                                         ` KOSAKI Motohiro
2009-12-15  5:03                                           ` David Rientjes
2009-11-18  1:41                               ` Daisuke Nishimura

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.DEB.2.00.0911171609370.12532@chino.kir.corp.google.com \
    --to=rientjes@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux-foundation.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nishimura@mxp.nes.nec.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox