* [PATCH 0/4] cpusets mems_allowed and oom
From: Paul Jackson @ 2005-07-11 1:58 UTC
To: Dinakar Guniguntala, Erich Focht, Simon Derr; +Cc: linux-mm, Paul Jackson
Time to make better use of the cpuset mem_exclusive flag ...
Dinakar has made good use of cpu_exclusive, by tying it to sched
domains. Good.
Now I'd like to use mem_exclusive, to support cpuset configurations that
allow GFP_KERNEL allocations to come from a potentially larger set
of memory nodes than GFP_USER allocations.
Here's an example usage scenario. For a few hours or more, a large
NUMA system at a university is to be divided into two halves, with a
bunch of student jobs running in one half of the system under some form
of batch manager, and with a big research project running in the
other half. Each of the student jobs is placed in a small cpuset, but
should share the classic Unix time-share facilities, such as buffered
pages of files in /bin and /usr/lib. The big research project wants no
interference whatsoever from the student jobs, and has highly tuned,
unusual memory and i/o patterns that are intended to make full use of all
the main memory on the nodes available to it.
In this example, we have two big sibling cpusets, one of which is
further divided into a more dynamic set of child cpusets.
We want kernel memory allocations constrained by the two big cpusets,
and user allocations constrained by the smaller child cpusets where
present.
I propose to use the 'mem_exclusive' flag of cpusets to control a
solution for such scenarios. Let memory allocations for user space
(GFP_USER) be constrained by a task's current cpuset, but let memory
allocations for kernel space (GFP_KERNEL) be constrained by the
nearest mem_exclusive ancestor of the current cpuset, even though
kernel space allocations will still _prefer_ to remain within the
current task's cpuset, if memory is easily available.
The current constraints imposed on setting mem_exclusive are unchanged.
A cpuset may only be mem_exclusive if its parent is also mem_exclusive,
and a mem_exclusive cpuset may not overlap any of its siblings'
memory nodes.
With this, one can configure a system so that allocations for kernel
use can come from a superset of the nodes allowed for user allocations.
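
To make that concrete, here is a rough sketch of what such a
configuration might look like through the cpuset virtual file system.
The mount point, cpuset names and cpu/node numbers are made up for
illustration (say a 16 cpu, 4 node system, nodes 0-1 for the student
half and nodes 2-3 for the research half); none of this is part of
the patches themselves:

  # Illustrative sketch only -- cpuset names and cpu/node numbers
  # are invented for this example.
  mount -t cpuset cpuset /dev/cpuset
  cd /dev/cpuset

  # Two big sibling cpusets, one per half of the machine, both
  # marked mem_exclusive so they hardwall each other's memory.
  mkdir students research
  /bin/echo 0-7  > students/cpus
  /bin/echo 0-1  > students/mems
  /bin/echo 1    > students/mem_exclusive
  /bin/echo 8-15 > research/cpus
  /bin/echo 2-3  > research/mems
  /bin/echo 1    > research/mem_exclusive

  # One small, non-mem_exclusive child cpuset per student job.
  mkdir students/job1
  /bin/echo 0-3  > students/job1/cpus
  /bin/echo 0    > students/job1/mems

With these patches applied, a task attached to students/job1 gets its
GFP_USER pages only from node 0, while its GFP_KERNEL allocations
prefer node 0 but may fall back to the rest of the enclosing
mem_exclusive 'students' cpuset (node 1), and never to the research
nodes.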
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* [PATCH 1/4] cpusets oom_kill and page_alloc tweaks
From: Paul Jackson @ 2005-07-11 1:58 UTC
To: Dinakar Guniguntala, Simon Derr, Erich Focht; +Cc: linux-mm, Paul Jackson
This patch applies a few comment and code cleanups to mm/oom_kill.c
and mm/page_alloc.c, prior to applying a patch set to improve
cpuset management of memory placement.
The first comment changed in oom_kill.c, and the only one changed in
page_alloc.c, were seriously misleading. The code
layout change in select_bad_process() makes room for adding another
condition on which a process can be spared the oom killer (see the
subsequent cpuset_nodes_overlap patch for this addition).
Also fixed a couple of typos and spellos that bugged me while I was here.
And add documentation for cpuset flag notify_on_release.
This patch should have no material effect.
Signed-off-by: Paul Jackson <pj@sgi.com>
Index: linux-2.6-mem_exclusive/mm/oom_kill.c
===================================================================
--- linux-2.6-mem_exclusive.orig/mm/oom_kill.c 2005-06-29 23:11:41.000000000 -0700
+++ linux-2.6-mem_exclusive/mm/oom_kill.c 2005-07-02 17:40:03.000000000 -0700
@@ -6,8 +6,8 @@
* for goading me into coding this file...
*
* The routines in this file are used to kill a process when
- * we're seriously out of memory. This gets called from kswapd()
- * in linux/mm/vmscan.c when we really run out of memory.
+ * we're seriously out of memory. This gets called from __alloc_pages()
+ * in mm/page_alloc.c when we really run out of memory.
*
* Since we won't call these routines often (on a well-configured
* machine) this file will double as a 'coding guide' and a signpost
@@ -26,7 +26,7 @@
/**
* oom_badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
- * @p: current uptime in seconds
+ * @uptime: current uptime in seconds
*
* The formula used is relatively simple and documented inline in the
* function. The main rationale is that we want to select a good task
@@ -57,9 +57,9 @@ unsigned long badness(struct task_struct
/*
* Processes which fork a lot of child processes are likely
- * a good choice. We add the vmsize of the childs if they
+ * a good choice. We add the vmsize of the children if they
* have an own mm. This prevents forking servers to flood the
- * machine with an endless amount of childs
+ * machine with an endless amount of children
*/
list_for_each(tsk, &p->children) {
struct task_struct *chld;
@@ -143,28 +143,32 @@ static struct task_struct * select_bad_p
struct timespec uptime;
do_posix_clock_monotonic_gettime(&uptime);
- do_each_thread(g, p)
- /* skip the init task with pid == 1 */
- if (p->pid > 1 && p->oomkilladj != OOM_DISABLE) {
- unsigned long points;
+ do_each_thread(g, p) {
+ unsigned long points;
+ int releasing;
- /*
- * This is in the process of releasing memory so wait it
- * to finish before killing some other task by mistake.
- */
- if ((unlikely(test_tsk_thread_flag(p, TIF_MEMDIE)) || (p->flags & PF_EXITING)) &&
- !(p->flags & PF_DEAD))
- return ERR_PTR(-1UL);
- if (p->flags & PF_SWAPOFF)
- return p;
-
- points = badness(p, uptime.tv_sec);
- if (points > maxpoints || !chosen) {
- chosen = p;
- maxpoints = points;
- }
+ /* skip the init task with pid == 1 */
+ if (p->pid == 1)
+ continue;
+ if (p->oomkilladj == OOM_DISABLE)
+ continue;
+ /*
+ * This is in the process of releasing memory so for wait it
+ * to finish before killing some other task by mistake.
+ */
+ releasing = test_tsk_thread_flag(p, TIF_MEMDIE) ||
+ p->flags & PF_EXITING;
+ if (releasing && !(p->flags & PF_DEAD))
+ return ERR_PTR(-1UL);
+ if (p->flags & PF_SWAPOFF)
+ return p;
+
+ points = badness(p, uptime.tv_sec);
+ if (points > maxpoints || !chosen) {
+ chosen = p;
+ maxpoints = points;
}
- while_each_thread(g, p);
+ } while_each_thread(g, p);
return chosen;
}
@@ -189,7 +193,8 @@ static void __oom_kill_task(task_t *p)
return;
}
task_unlock(p);
- printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", p->pid, p->comm);
+ printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n",
+ p->pid, p->comm);
/*
* We give our sacrificial lamb high priority and access to
Index: linux-2.6-mem_exclusive/mm/page_alloc.c
===================================================================
--- linux-2.6-mem_exclusive.orig/mm/page_alloc.c 2005-06-29 23:11:41.000000000 -0700
+++ linux-2.6-mem_exclusive/mm/page_alloc.c 2005-07-02 17:40:04.000000000 -0700
@@ -898,10 +898,9 @@ rebalance:
if (likely(did_some_progress)) {
/*
- * Go through the zonelist yet one more time, keep
- * very high watermark here, this is only to catch
- * a parallel oom killing, we must fail if we're still
- * under heavy pressure.
+ * Go through the zone list yet one more time, with
+ * min watermark and trying harder. Since try_to_free_pages
+ * made some progress, there might be something free.
*/
for (i = 0; (z = zones[i]) != NULL; i++) {
if (!zone_watermark_ok(z, order, z->pages_min,
Index: linux-2.6-mem_exclusive/Documentation/cpusets.txt
===================================================================
--- linux-2.6-mem_exclusive.orig/Documentation/cpusets.txt 2005-07-02 17:40:04.000000000 -0700
+++ linux-2.6-mem_exclusive/Documentation/cpusets.txt 2005-07-02 17:40:54.000000000 -0700
@@ -143,7 +143,7 @@ Cpusets extends these two mechanisms as
The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:
- - in main/init.c, to initialize the root cpuset at system boot.
+ - in init/main.c, to initialize the root cpuset at system boot.
- in fork and exit, to attach and detach a task from its cpuset.
- in sched_setaffinity, to mask the requested CPUs by what's
allowed in that tasks cpuset.
@@ -154,7 +154,7 @@ into the rest of the kernel, none in per
and related changes in both sched.c and arch/ia64/kernel/domain.c
- in the mbind and set_mempolicy system calls, to mask the requested
Memory Nodes by what's allowed in that tasks cpuset.
- - in page_alloc, to restrict memory to allowed nodes.
+ - in page_alloc.c, to restrict memory to allowed nodes.
- in vmscan.c, to restrict page recovery to the current cpuset.
In addition a new file system, of type "cpuset" may be mounted,
@@ -182,6 +182,7 @@ containing the following files describin
- mems: list of Memory Nodes in that cpuset
- cpu_exclusive flag: is cpu placement exclusive?
- mem_exclusive flag: is memory placement exclusive?
+ - notify_on_release: call /sbin/cpuset_release_agent on exit if set
- tasks: list of tasks (by pid) attached to that cpuset
New cpusets are created using the mkdir system call or shell
@@ -356,7 +357,8 @@ Now you want to do something with this c
In this directory you can find several files:
# ls
-cpus cpu_exclusive mems mem_exclusive tasks
+cpu_exclusive mem_exclusive notify_on_release
+cpus mems tasks
Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
Index: linux-2.6-mem_exclusive/include/linux/gfp.h
===================================================================
--- linux-2.6-mem_exclusive.orig/include/linux/gfp.h 2005-07-02 17:40:04.000000000 -0700
+++ linux-2.6-mem_exclusive/include/linux/gfp.h 2005-07-02 17:42:02.000000000 -0700
@@ -39,7 +39,7 @@ struct vm_area_struct;
#define __GFP_COMP 0x4000u /* Add compound page metadata */
#define __GFP_ZERO 0x8000u /* Return zeroed page on success */
#define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
-#define __GFP_NORECLAIM 0x20000u /* No realy zone reclaim during allocation */
+#define __GFP_NORECLAIM 0x20000u /* No zone reclaim during page_cache_alloc */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* [PATCH 2/4] cpusets new __GFP_HARDWALL flag
From: Paul Jackson @ 2005-07-11 1:58 UTC
To: Dinakar Guniguntala, Erich Focht, Simon Derr; +Cc: linux-mm, Paul Jackson
Add another GFP flag: __GFP_HARDWALL.
A subsequent "cpuset_zone_allowed" patch will use this flag to mark
GFP_USER allocations, and distinguish them from GFP_KERNEL allocations.
Allocations (such as GFP_USER) marked __GFP_HARDWALL are constrained to
the current task's cpuset. Other allocations (such as GFP_KERNEL) can
steal from the possibly larger nearest mem_exclusive cpuset ancestor,
if memory is tight on every node in the current cpuset.
This patch collides with Mel Gorman's patch to reduce fragmentation
in the standard buddy allocator, which adds two GFP flags. At first
glance, it seems that his added __GFP_USERRCLM flag could be used in
place of the following __GFP_HARDWALL, as they both seem to be set
the same way - for GFP_USER and GFP_HIGHUSER. Perhaps we should call
this flag __GFP_USER, rather than some name dependent on its use(s).
Signed-off-by: Paul Jackson <pj@sgi.com>
Index: linux-2.6-mem_exclusive/include/linux/gfp.h
===================================================================
--- linux-2.6-mem_exclusive.orig/include/linux/gfp.h 2005-07-02 17:42:02.000000000 -0700
+++ linux-2.6-mem_exclusive/include/linux/gfp.h 2005-07-02 17:43:00.000000000 -0700
@@ -40,6 +40,7 @@ struct vm_area_struct;
#define __GFP_ZERO 0x8000u /* Return zeroed page on success */
#define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
#define __GFP_NORECLAIM 0x20000u /* No zone reclaim during page_cache_alloc */
+#define __GFP_HARDWALL 0x40000u /* Enforce hardwall cpuset memory allocs */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -48,14 +49,15 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_NORECLAIM)
+ __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
+#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
+ __GFP_HIGHMEM)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* [PATCH 3/4] cpusets formalize intermediate GFP_KERNEL containment
From: Paul Jackson @ 2005-07-11 1:58 UTC
To: Dinakar Guniguntala, Simon Derr, Erich Focht; +Cc: linux-mm, Paul Jackson
This patch depends on the previous patches cpuset_gfp_hardwall_flag
and cpuset_mm_alloc_oom_fixes.
This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of
memory placement resolution. With this patch, there are now the
following four layers of memory placement available:
1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
3) The current task's cpuset (GFP_USER allocations constrained to here), and
4) Specific node placement, using mbind and set_mempolicy.
These nest - each layer is a subset (same or within) of the previous.
Layer (2) is new with this patch. The cpuset_zone_allowed() call that
is used to check whether a zone's node is in a task's mems_allowed
(which is constrained by the task's cpuset) is extended to take
a gfp_mask argument, and its logic is extended, in the case that
__GFP_HARDWALL is not set in the flag bits, to look up the cpuset
hierarchy for the nearest enclosing mem_exclusive cpuset, to determine
if the zone's node is allowed in that cpuset.
The definition of GFP_USER, which used to be identical to GFP_KERNEL,
is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.
GFP_ATOMIC and GFP_KERNEL allocations will stay within the current
task's cpuset, so long as some node therein has memory to spare,
but will escape to the larger layer, if need be.
The intended use is to allow something like a batch manager to
handle several jobs, each job in its own cpuset, but using common
kernel memory for caches and such. Swapper and oom_kill activity is
also constrained to Layer (2). A task in or below one mem_exclusive
cpuset should not cause swapping on nodes in another non-overlapping
mem_exclusive cpuset, nor provoke oom_killing of a task in another
such cpuset. Heavy use of kernel memory for i/o caching and such
by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.
This patch enables providing hardwall, inescapable cpusets for
memory allocations of each job, while sharing kernel memory
allocations between several jobs, in an enclosing mem_exclusive
cpuset.
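
As an illustrative sketch of that batch manager usage (not part of
this patch; the cpuset names, node numbers and pid variable are
invented), the manager would make one non-mem_exclusive child cpuset
per job under a shared mem_exclusive parent, and attach each job's
tasks to its child:

  # Assumes /dev/cpuset is mounted and /dev/cpuset/batch already
  # exists as a mem_exclusive cpuset spanning the batch nodes.
  cd /dev/cpuset/batch
  mkdir job37
  /bin/echo 4-5 > job37/cpus
  /bin/echo 1   > job37/mems
  /bin/echo $JOB37_PID > job37/tasks   # attach the job's top task

GFP_USER allocations by that job are then confined to node 1, while
its GFP_KERNEL allocations for page cache, dentries and such may
spill to any node of the enclosing 'batch' cpuset when node 1 runs
short.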
Like Dinakar's patch earlier to enable administering sched domains
using the cpu_exclusive flag, this patch also provides a useful
meaning to a cpuset flag that had previously done nothing much
useful other than restrict what cpuset configurations were allowed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Index: linux-2.6-mem_exclusive/Documentation/cpusets.txt
===================================================================
--- linux-2.6-mem_exclusive.orig/Documentation/cpusets.txt 2005-07-02 17:40:54.000000000 -0700
+++ linux-2.6-mem_exclusive/Documentation/cpusets.txt 2005-07-02 17:43:15.000000000 -0700
@@ -51,6 +51,7 @@ mems_allowed vector.
If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
ancestor or descendent, may share any of the same CPUs or Memory Nodes.
+
A cpuset that is cpu exclusive has a sched domain associated with it.
The sched domain consists of all cpus in the current cpuset that are not
part of any exclusive child cpusets.
@@ -60,6 +61,18 @@ all of the cpus in the system. This remo
load balancing code trying to pull tasks outside of the cpu exclusive
cpuset only to be prevented by the tasks' cpus_allowed mask.
+A cpuset that is mem_exclusive restricts kernel allocations for
+page, buffer and other data commonly shared by the kernel across
+multiple users. All cpusets, whether mem_exclusive or not, restrict
+allocations of memory for user space. This enables configuring a
+system so that several independent jobs can share common kernel
+data, such as file system pages, while isolating each jobs user
+allocation in its own cpuset. To do this, construct a large
+mem_exclusive cpuset to hold all the jobs, and construct child,
+non-mem_exclusive cpusets for each individual job. Only a small
+amount of typical kernel memory, such as requests from interrupt
+handlers, is allowed to be taken outside even a mem_exclusive cpuset.
+
User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
Index: linux-2.6-mem_exclusive/include/linux/cpuset.h
===================================================================
--- linux-2.6-mem_exclusive.orig/include/linux/cpuset.h 2005-07-02 17:40:04.000000000 -0700
+++ linux-2.6-mem_exclusive/include/linux/cpuset.h 2005-07-02 17:43:15.000000000 -0700
@@ -23,7 +23,7 @@ void cpuset_init_current_mems_allowed(vo
void cpuset_update_current_mems_allowed(void);
void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
-int cpuset_zone_allowed(struct zone *z);
+extern int cpuset_zone_allowed(struct zone *z, unsigned int __nocast gfp_mask);
extern struct file_operations proc_cpuset_operations;
extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
@@ -48,7 +48,8 @@ static inline int cpuset_zonelist_valid_
return 1;
}
-static inline int cpuset_zone_allowed(struct zone *z)
+static inline int cpuset_zone_allowed(struct zone *z,
+ unsigned int __nocast gfp_mask)
{
return 1;
}
Index: linux-2.6-mem_exclusive/mm/page_alloc.c
===================================================================
--- linux-2.6-mem_exclusive.orig/mm/page_alloc.c 2005-07-02 17:40:04.000000000 -0700
+++ linux-2.6-mem_exclusive/mm/page_alloc.c 2005-07-02 17:43:15.000000000 -0700
@@ -806,11 +806,14 @@ __alloc_pages(unsigned int __nocast gfp_
classzone_idx = zone_idx(zones[0]);
restart:
- /* Go through the zonelist once, looking for a zone with enough free */
+ /*
+ * Go through the zonelist once, looking for a zone with enough free.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
for (i = 0; (z = zones[i]) != NULL; i++) {
int do_reclaim = should_reclaim_zone(z, gfp_mask);
- if (!cpuset_zone_allowed(z))
+ if (!cpuset_zone_allowed(z, __GFP_HARDWALL))
continue;
/*
@@ -845,6 +848,7 @@ zone_reclaim_retry:
*
* This is the last chance, in general, before the goto nopage.
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
for (i = 0; (z = zones[i]) != NULL; i++) {
if (!zone_watermark_ok(z, order, z->pages_min,
@@ -852,7 +856,7 @@ zone_reclaim_retry:
gfp_mask & __GFP_HIGH))
continue;
- if (wait && !cpuset_zone_allowed(z))
+ if (wait && !cpuset_zone_allowed(z, gfp_mask))
continue;
page = buffered_rmqueue(z, order, gfp_mask);
@@ -867,7 +871,7 @@ zone_reclaim_retry:
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; (z = zones[i]) != NULL; i++) {
- if (!cpuset_zone_allowed(z))
+ if (!cpuset_zone_allowed(z, gfp_mask))
continue;
page = buffered_rmqueue(z, order, gfp_mask);
if (page)
@@ -908,7 +912,7 @@ rebalance:
gfp_mask & __GFP_HIGH))
continue;
- if (!cpuset_zone_allowed(z))
+ if (!cpuset_zone_allowed(z, gfp_mask))
continue;
page = buffered_rmqueue(z, order, gfp_mask);
@@ -927,7 +931,7 @@ rebalance:
classzone_idx, 0, 0))
continue;
- if (!cpuset_zone_allowed(z))
+ if (!cpuset_zone_allowed(z, __GFP_HARDWALL))
continue;
page = buffered_rmqueue(z, order, gfp_mask);
Index: linux-2.6-mem_exclusive/kernel/cpuset.c
===================================================================
--- linux-2.6-mem_exclusive.orig/kernel/cpuset.c 2005-07-02 17:40:04.000000000 -0700
+++ linux-2.6-mem_exclusive/kernel/cpuset.c 2005-07-02 17:43:15.000000000 -0700
@@ -1563,12 +1563,79 @@ int cpuset_zonelist_valid_mems_allowed(s
}
/*
- * Is 'current' valid, and is zone z allowed in current->mems_allowed?
+ * nearest_exclusive_ancestor() - Returns the nearest mem_exclusive
+ * ancestor to the specified cpuset. Call while holding cpuset_sem.
+ * If no ancestor is mem_exclusive (an unusual configuration), then
+ * returns the root cpuset.
*/
-int cpuset_zone_allowed(struct zone *z)
+static const struct cpuset *nearest_exclusive_ancestor(const struct cpuset *cs)
{
- return in_interrupt() ||
- node_isset(z->zone_pgdat->node_id, current->mems_allowed);
+ while (!is_mem_exclusive(cs) && cs->parent)
+ cs = cs->parent;
+ return cs;
+}
+
+/**
+ * cpuset_zone_allowed - Can we allocate memory on zone z's memory node?
+ * @z: is this zone on an allowed node?
+ * @gfp_mask: memory allocation flags (we use __GFP_HARDWALL)
+ *
+ * If we're in interrupt, yes, we can always allocate. If zone
+ * z's node is in our tasks mems_allowed, yes. If it's not a
+ * __GFP_HARDWALL request and this zone's nodes is in the nearest
+ * mem_exclusive cpuset ancestor to this tasks cpuset, yes.
+ * Otherwise, no.
+ *
+ * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
+ * and do not allow allocations outside the current tasks cpuset.
+ * GFP_KERNEL allocations are not so marked, so can escape to the
+ * nearest mem_exclusive ancestor cpuset.
+ *
+ * Scanning up parent cpusets requires cpuset_sem. The __alloc_pages()
+ * routine only calls here with __GFP_HARDWALL bit _not_ set if
+ * it's a GFP_KERNEL allocation, and all nodes in the current tasks
+ * mems_allowed came up empty on the first pass over the zonelist.
+ * So only GFP_KERNEL allocations, if all nodes in the cpuset are
+ * short of memory, might require taking the cpuset_sem semaphore.
+ *
+ * The first loop over the zonelist in mm/page_alloc.c:__alloc_pages()
+ * calls here with __GFP_HARDWALL always set in gfp_mask, enforcing
+ * hardwall cpusets - no allocation on a node outside the cpuset is
+ * allowed (unless in interrupt, of course).
+ *
+ * The second loop doesn't even call here for GFP_ATOMIC requests
+ * (if the __alloc_pages() local variable 'wait' is set). That check
+ * and the checks below have the combined affect in the second loop of
+ * the __alloc_pages() routine that:
+ * in_interrupt - any node ok (current task context irrelevant)
+ * GFP_ATOMIC - any node ok
+ * GFP_KERNEL - any node in enclosing mem_exclusive cpuset ok
+ * GFP_USER - only nodes in current tasks mems allowed ok.
+ **/
+int cpuset_zone_allowed(struct zone *z, unsigned int __nocast gfp_mask)
+{
+ int node; /* node that zone z is on */
+ const struct cpuset *cs; /* current cpuset ancestors */
+ int allowed = 1; /* is allocation in zone z allowed? */
+
+ if (in_interrupt())
+ return 1;
+ node = z->zone_pgdat->node_id;
+ if (node_isset(node, current->mems_allowed))
+ return 1;
+ if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
+ return 0;
+
+ /* Not hardwall and node outside mems_allowed: scan up cpusets */
+ down(&cpuset_sem);
+ cs = current->cpuset;
+ if (!cs)
+ goto done; /* current task exiting */
+ cs = nearest_exclusive_ancestor(cs);
+ allowed = node_isset(node, cs->mems_allowed);
+done:
+ up(&cpuset_sem);
+ return allowed;
}
/*
Index: linux-2.6-mem_exclusive/mm/vmscan.c
===================================================================
--- linux-2.6-mem_exclusive.orig/mm/vmscan.c 2005-07-02 17:40:04.000000000 -0700
+++ linux-2.6-mem_exclusive/mm/vmscan.c 2005-07-02 17:43:15.000000000 -0700
@@ -890,7 +890,7 @@ shrink_caches(struct zone **zones, struc
if (zone->present_pages == 0)
continue;
- if (!cpuset_zone_allowed(zone))
+ if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;
zone->temp_priority = sc->priority;
@@ -938,7 +938,7 @@ int try_to_free_pages(struct zone **zone
for (i = 0; zones[i] != NULL; i++) {
struct zone *zone = zones[i];
- if (!cpuset_zone_allowed(zone))
+ if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;
zone->temp_priority = DEF_PRIORITY;
@@ -984,7 +984,7 @@ out:
for (i = 0; zones[i] != 0; i++) {
struct zone *zone = zones[i];
- if (!cpuset_zone_allowed(zone))
+ if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;
zone->prev_priority = zone->temp_priority;
@@ -1254,7 +1254,7 @@ void wakeup_kswapd(struct zone *zone, in
return;
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
- if (!cpuset_zone_allowed(zone))
+ if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
return;
if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
return;
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* [PATCH 4/4] cpusets confine oom_killer to mem_exclusive cpuset
From: Paul Jackson @ 2005-07-11 1:59 UTC
To: Dinakar Guniguntala, Erich Focht, Simon Derr; +Cc: linux-mm, Paul Jackson
Now the real motivation for this cpuset mem_exclusive patch series
becomes trivial to implement. This patch depends on the previous cpuset_zone_allowed
patch and its prerequisites.
This patch keeps a task in or under one mem_exclusive cpuset from
provoking an oom kill of a task under a non-overlapping mem_exclusive
cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed
to escape mem_exclusive containment, there is little to gain from
oom killing a task under a non-overlapping mem_exclusive cpuset, as
almost all kernel and user memory allocation must come from disjoint
memory nodes.
This patch enables configuring a system so that a runaway job under
one mem_exclusive cpuset cannot cause the killing of a job in another
such cpuset that might be using very high compute and memory resources
for a prolonged time.
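
(An aside, not part of this patch: when checking this behaviour it
can help to see which mem_exclusive cpuset a candidate task lives in.
The existing proc hooks referenced in the header above,
proc_cpuset_operations and cpuset_task_status_allowed, expose that;
the pid below is illustrative.)

  # Which cpuset is task 1234 attached to?
  cat /proc/1234/cpuset
  # Which memory nodes may it use?
  grep Mems_allowed /proc/1234/status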
Signed-off-by: Paul Jackson <pj@sgi.com>
Index: linux-2.6-mem_exclusive/include/linux/cpuset.h
===================================================================
--- linux-2.6-mem_exclusive.orig/include/linux/cpuset.h 2005-07-02 17:43:44.000000000 -0700
+++ linux-2.6-mem_exclusive/include/linux/cpuset.h 2005-07-02 17:43:44.000000000 -0700
@@ -24,6 +24,7 @@ void cpuset_update_current_mems_allowed(
void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
extern int cpuset_zone_allowed(struct zone *z, unsigned int __nocast gfp_mask);
+extern int cpuset_nodes_overlap(const struct task_struct *p);
extern struct file_operations proc_cpuset_operations;
extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
@@ -54,6 +55,11 @@ static inline int cpuset_zone_allowed(st
return 1;
}
+static inline int cpuset_nodes_overlap(const struct task_struct *p)
+{
+ return 1;
+}
+
static inline char *cpuset_task_status_allowed(struct task_struct *task,
char *buffer)
{
Index: linux-2.6-mem_exclusive/kernel/cpuset.c
===================================================================
--- linux-2.6-mem_exclusive.orig/kernel/cpuset.c 2005-07-02 17:43:44.000000000 -0700
+++ linux-2.6-mem_exclusive/kernel/cpuset.c 2005-07-02 17:43:44.000000000 -0700
@@ -1638,6 +1638,39 @@ done:
return allowed;
}
+/**
+ * cpuset_nodes_overlap - Do task p's cpuset nodes overlap current tasks?
+ * @tsk: pointer to task_struct of some other task.
+ *
+ * Description: Returns true if specified task p's cpuset overlaps the
+ * current tasks cpuset. Actually compares the nearest mem_exclusive
+ * ancestor cpusets of p and current. Used by oom killer to determine
+ * if there is any chance that task p's memory usage might impact
+ * the memory available to the current task.
+ *
+ * Acquires cpuset_sem - not suitable for calling from a fast path.
+ **/
+int cpuset_nodes_overlap(const struct task_struct *p)
+{
+ const struct cpuset *cs1, *cs2; /* my and p's cpuset ancestors */
+ int overlap = 0; /* do cpusets overlap? */
+
+ down(&cpuset_sem);
+ cs1 = current->cpuset;
+ if (!cs1)
+ goto done; /* current task exiting */
+ cs2 = p->cpuset;
+ if (!cs2)
+ goto done; /* task p is exiting */
+ cs1 = nearest_exclusive_ancestor(cs1);
+ cs2 = nearest_exclusive_ancestor(cs2);
+ overlap = nodes_intersects(cs1->mems_allowed, cs2->mems_allowed);
+done:
+ up(&cpuset_sem);
+
+ return overlap;
+}
+
/*
* proc_cpuset_show()
* - Print tasks cpuset path into seq_file.
Index: linux-2.6-mem_exclusive/mm/oom_kill.c
===================================================================
--- linux-2.6-mem_exclusive.orig/mm/oom_kill.c 2005-07-02 17:43:44.000000000 -0700
+++ linux-2.6-mem_exclusive/mm/oom_kill.c 2005-07-02 17:43:44.000000000 -0700
@@ -20,6 +20,7 @@
#include <linux/swap.h>
#include <linux/timex.h>
#include <linux/jiffies.h>
+#include <linux/cpuset.h>
/* #define DEBUG */
@@ -152,6 +153,10 @@ static struct task_struct * select_bad_p
continue;
if (p->oomkilladj == OOM_DISABLE)
continue;
+ /* If p's nodes don't overlap ours, it won't help to kill p. */
+ if (!cpuset_nodes_overlap(p))
+ continue;
+
/*
* This is in the process of releasing memory so for wait it
* to finish before killing some other task by mistake.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373