From: David Rientjes <rientjes@google.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>, Andrea Arcangeli <andrea@suse.de>,
linux-mm@kvack.org
Subject: Re: [PATCH 04 of 24] serialize oom killer
Date: Thu, 13 Sep 2007 17:36:06 -0700 (PDT)
Message-ID: <alpine.DEB.0.9999.0709131732330.21805@chino.kir.corp.google.com>
In-Reply-To: <Pine.LNX.4.64.0709131152400.9999@schroedinger.engr.sgi.com>
On Thu, 13 Sep 2007, Christoph Lameter wrote:
> > It's easier to serialize it outside of out_of_memory() instead, since it
> > only has a single caller and we don't need to serialize for sysrq.
> >
> > This seems like it would collapse down nicely to a global or per-cpuset
> > serialization with an added helper function implemented partially in
> > kernel/cpuset.c for the CONFIG_CPUSETS case.
> >
> > Then, in __alloc_pages(), we test for either a global or per-cpuset
> > spin_trylock() and, if we acquire it, call out_of_memory() and goto
> > restart as we currently do. If it's contended, we reschedule ourself and
> > goto restart when we awaken.
>
> Could you rephrase that in patch form? ;-)
>
Yeah, it turned out to be a little more invasive than I thought, but it
appears to be the cleanest solution for both the general CONSTRAINT_NONE
and the per-cpuset CONSTRAINT_CPUSET cases.
I've been trying to keep score at home, but I've lost track of which
patches from the series we're keeping, so this is against HEAD.
serialize oom killer
Serializes the OOM killer both globally and per-cpuset, depending on the
system configuration.
A new spinlock, oom_lock, is introduced for the global case. It
serializes the OOM killer for systems that are not using cpusets. Only
one system task may enter the OOM killer at a time to prevent
unnecessarily killing others.
A per-cpuset flag, CS_IS_OOM, is introduced in the flags field of struct
cpuset. It serializes the OOM killer only for hardwall allocations
targeted at that cpuset. Only one task per cpuset may enter the OOM
killer at a time, again to prevent unnecessarily killing others. When a
per-cpuset OOM killing is taking place, the global spinlock is also
taken, since we'll be alleviating the global condition at the same time.
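In rough outline, the per-cpuset case nests the two locks like this (a
sketch of the intent only, not the exact helpers in the patch below):

	/*
	 * Sketch: the CS_IS_OOM bit acts as the per-cpuset lock; the
	 * global oom_lock is then taken best-effort on top of it, since
	 * an OOM kill in this cpuset relieves global pressure too.
	 */
	if (!test_and_set_bit(CS_IS_OOM, &current->cpuset->flags))
		spin_trylock(&oom_lock);	/* failure here is tolerated */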
Regardless of the synchronization primitive used, if a task cannot
acquire the OOM lock, it is put to sleep before retrying the triggering
allocation so that the OOM killer may finish and free some memory.
We acquire either lock before making one last attempt at
get_page_from_freelist() with a very high watermark; otherwise we could
invoke the OOM killer needlessly if another thread reschedules between
this allocation attempt and our taking of the OOM lock.
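Condensed, the slow path in __alloc_pages() then looks roughly like this
(abridged from the page_alloc.c hunk below; the costly-order check is
omitted here for brevity):

	enum oom_constraint constraint = CONSTRAINT_NONE;

	if (!oom_test_and_set_lock(zonelist, gfp_mask, &constraint)) {
		/* last try at a very high watermark, to catch a
		 * parallel OOM killing that already freed memory */
		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL,
				order, zonelist,
				ALLOC_WMARK_HIGH|ALLOC_CPUSET);
		if (page) {
			oom_unlock(constraint);
			goto got_pg;
		}
		out_of_memory(gfp_mask, order, constraint);
		oom_unlock(constraint);
	} else
		/* another task is already OOM killing; sleep so it can
		 * finish and free memory before we retry */
		schedule_timeout_uninterruptible(1);
	goto restart;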
Also converts the CONSTRAINT_{NONE,CPUSET,MEMORY_POLICY} defines to an
enum and moves them to include/linux/swap.h. We're probably going to
need an include/linux/oom_kill.h soon.
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
drivers/char/sysrq.c | 3 +-
include/linux/cpuset.h | 13 ++++++++++-
include/linux/swap.h | 14 ++++++++++-
kernel/cpuset.c | 16 +++++++++++++
mm/oom_kill.c | 58 ++++++++++++++++++++++++++++++++++++-----------
mm/page_alloc.c | 42 +++++++++++++++++++++++-----------
6 files changed, 114 insertions(+), 32 deletions(-)
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -270,8 +270,7 @@ static struct sysrq_key_op sysrq_term_op = {
static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
- GFP_KERNEL, 0);
+ out_of_memory(GFP_KERNEL, 0, CONSTRAINT_NONE);
}
static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -60,7 +60,8 @@ extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
extern void cpuset_lock(void);
extern void cpuset_unlock(void);
-
+extern int cpuset_oom_test_and_set_lock(void);
+extern int cpuset_oom_unlock(void);
extern int cpuset_mem_spread_node(void);
static inline int cpuset_do_page_mem_spread(void)
@@ -129,6 +130,16 @@ static inline char *cpuset_task_status_allowed(struct task_struct *task,
static inline void cpuset_lock(void) {}
static inline void cpuset_unlock(void) {}
+static inline int cpuset_oom_test_and_set_lock(void)
+{
+ return -1;
+}
+
+static inline int cpuset_oom_unlock(void)
+{
+ return 0;
+}
+
static inline int cpuset_mem_spread_node(void)
{
return 0;
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -159,9 +159,21 @@ struct swap_list_t {
#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
/* linux/mm/oom_kill.c */
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+/*
+ * Types of limitations to the nodes from which allocations may occur
+ */
+enum oom_constraint {
+ CONSTRAINT_NONE,
+ CONSTRAINT_CPUSET,
+ CONSTRAINT_MEMORY_POLICY,
+};
+extern void out_of_memory(gfp_t gfp_mask, int order,
+ enum oom_constraint constraint);
extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);
+extern int oom_test_and_set_lock(struct zonelist *zonelist, gfp_t gfp_mask,
+ enum oom_constraint *constraint);
+extern void oom_unlock(enum oom_constraint constraint);
/* linux/mm/memory.c */
extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -109,6 +109,7 @@ typedef enum {
CS_NOTIFY_ON_RELEASE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+ CS_IS_OOM,
} cpuset_flagbits_t;
/* convenient tests for these bits */
@@ -147,6 +148,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}
+static inline int is_oom(const struct cpuset *cs)
+{
+ return test_bit(CS_IS_OOM, &cs->flags);
+}
+
/*
* Increment this integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
@@ -2527,6 +2533,16 @@ void cpuset_unlock(void)
mutex_unlock(&callback_mutex);
}
+int cpuset_oom_test_and_set_lock(void)
+{
+ return test_and_set_bit(CS_IS_OOM, &current->cpuset->flags);
+}
+
+int cpuset_oom_unlock(void)
+{
+ return test_and_clear_bit(CS_IS_OOM, &current->cpuset->flags);
+}
+
/**
* cpuset_mem_spread_node() - On which node to begin search for a page
*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
#include <linux/notifier.h>
int sysctl_panic_on_oom;
+static DEFINE_SPINLOCK(oom_lock);
/* #define DEBUG */
/**
@@ -164,13 +165,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
}
/*
- * Types of limitations to the nodes from which allocations may occur
- */
-#define CONSTRAINT_NONE 1
-#define CONSTRAINT_MEMORY_POLICY 2
-#define CONSTRAINT_CPUSET 3
-
-/*
* Determine the type of allocation constraint.
*/
static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
@@ -387,6 +381,48 @@ int unregister_oom_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(unregister_oom_notifier);
+/*
+ * If using cpusets, try to lock task's per-cpuset OOM lock; otherwise, try to
+ * lock the global OOM spinlock. Returns non-zero if the lock is contended or
+ * zero if acquired.
+ */
+int oom_test_and_set_lock(struct zonelist *zonelist, gfp_t gfp_mask,
+ enum oom_constraint *constraint)
+{
+ int ret;
+
+ *constraint = constrained_alloc(zonelist, gfp_mask);
+ switch (*constraint) {
+ case CONSTRAINT_CPUSET:
+ ret = cpuset_oom_test_and_set_lock();
+ if (!ret)
+ spin_trylock(&oom_lock);
+ break;
+ default:
+ ret = !spin_trylock(&oom_lock);
+ break;
+ }
+ return ret;
+}
+
+/*
+ * If using cpusets, unlock task's per-cpuset OOM lock; otherwise, unlock the
+ * global OOM spinlock.
+ */
+void oom_unlock(enum oom_constraint constraint)
+{
+ switch (constraint) {
+ case CONSTRAINT_CPUSET:
+ if (likely(spin_is_locked(&oom_lock)))
+ spin_unlock(&oom_lock);
+ cpuset_oom_unlock();
+ break;
+ default:
+ spin_unlock(&oom_lock);
+ break;
+ }
+}
+
/**
* out_of_memory - kill the "best" process when we run out of memory
*
@@ -395,12 +431,11 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(gfp_t gfp_mask, int order, enum oom_constraint constraint)
{
struct task_struct *p;
unsigned long points = 0;
unsigned long freed = 0;
- int constraint;
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
@@ -418,11 +453,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
if (sysctl_panic_on_oom == 2)
panic("out of memory. Compulsory panic_on_oom is selected.\n");
- /*
- * Check if there were limitations on the allocation (only relevant for
- * NUMA) that may require different handling.
- */
- constraint = constrained_alloc(zonelist, gfp_mask);
cpuset_lock();
read_lock(&tasklist_lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1352,22 +1352,36 @@ nofail_alloc:
if (page)
goto got_pg;
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
- /*
- * Go through the zonelist yet one more time, keep
- * very high watermark here, this is only to catch
- * a parallel oom killing, we must fail if we're still
- * under heavy pressure.
- */
- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
- zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
- if (page)
- goto got_pg;
+ enum oom_constraint constraint = CONSTRAINT_NONE;
- /* The OOM killer will not help higher order allocs so fail */
- if (order > PAGE_ALLOC_COSTLY_ORDER)
- goto nopage;
+ if (!oom_test_and_set_lock(zonelist, gfp_mask, &constraint)) {
+ /*
+ * Go through the zonelist yet one more time, keep
+ * very high watermark here, this is only to catch
+ * a previous oom killing, we must fail if we're still
+ * under heavy pressure.
+ */
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL,
+ order, zonelist,
+ ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+ if (page) {
+ oom_unlock(constraint);
+ goto got_pg;
+ }
+
+ /*
+ * The OOM killer will not help higher order allocs so
+ * fail
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER) {
+ oom_unlock(constraint);
+ goto nopage;
+ }
- out_of_memory(zonelist, gfp_mask, order);
+ out_of_memory(gfp_mask, order, constraint);
+ oom_unlock(constraint);
+ } else
+ schedule_timeout_uninterruptible(1);
goto restart;
}
--