[PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks

* [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
@ 2007-07-27 20:07 Lee Schermerhorn
  2007-07-28  6:19 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 17+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 20:07 UTC
  To: linux-mm
  Cc: Paul Mundt, Christoph Lameter, Nishanth Aravamudan, kxr, ak,
	KAMEZAWA Hiroyuki, akpm, Eric Whitney

Allow selected nodes to be excluded from MPOL_INTERLEAVE masks

Against:  2.6.23-rc1-mm1 atop Christoph Lameter's memoryless
	  node patch set.

This patch implements a new node state, N_INTERLEAVE, to specify
the subset of nodes with memory [state N_MEMORY] that are valid
for MPOL_INTERLEAVE node masks.  The new state mask is populated
from the N_MEMORY state mask, less any nodes excluded by a new
command line option, "no_interleave_nodes=<NodeList>".  Any nodemask
specified for an interleave policy is then masked by the N_INTERLEAVE
mask, including the temporary boot-time interleave policy.
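
The mask arithmetic described above can be sketched in user-space C.
This is only a minimal sketch, not the kernel code:  it assumes at most
64 nodes held in a uint64_t, and the names interleave_allowed() and
effective_interleave_mask() are hypothetical stand-ins for the
N_INTERLEAVE state mask and the nodes_and() step in mpol_new().

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a nodemask_t: bit n set means node n. */
typedef uint64_t nodemask64_t;

/* N_INTERLEAVE analogue: nodes with memory, less the excluded nodes. */
static nodemask64_t interleave_allowed(nodemask64_t has_memory,
				       nodemask64_t excluded)
{
	return has_memory & ~excluded;
}

/*
 * mpol_new() analogue: mask the requested nodes by the allowed set.
 * Returns 0 and stores the result, or -EINVAL if no node is left.
 */
static int effective_interleave_mask(nodemask64_t requested,
				     nodemask64_t allowed,
				     nodemask64_t *result)
{
	nodemask64_t mask = requested & allowed;

	if (mask == 0)
		return -EINVAL;
	*result = mask;
	return 0;
}
```

With nodes 0-3 holding memory and node 3 excluded, a request for
nodes 2-3 would be reduced to node 2 alone, and a request for node 3
only would fail with -EINVAL.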

Rationale:  some architectures and platforms include nodes with
memory that, in some cases, should never appear in MPOL_INTERLEAVE
node masks.  For example, the 'sh' architecture contains a small
amount of SRAM that is local to each cpu.  In some applications,
this memory should be reserved for explicit usage.  Another example
is the pseudo-node on HP ia64 platforms that is already interleaved
on a cache-line granularity by hardware.  Again, in some cases, we
want to reserve this for explicit usage, as it has bandwidth and
[average] latency characteristics quite different from the "real"
nodes.

Note that allocation of fresh hugepages in response to increases
in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
like to propose that allocate_fresh_huge_page() honor the
N_INTERLEAVE state, just as MPOL_INTERLEAVE does.  Then, one can
explicitly allocate hugepages on the excluded nodes, when needed,
using Nish Aravamudan's per node huge page sysfs attribute.
NOT in this patch.

Questions:

* do we need/want a sysctl for run time modifications?  IMO, no.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/kernel-parameters.txt |    9 +++++++++
 include/linux/nodemask.h            |    1 +
 mm/mempolicy.c                      |    9 +++++----
 mm/page_alloc.c                     |   24 +++++++++++++++++++++++-
 4 files changed, 38 insertions(+), 5 deletions(-)

Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h	2007-07-27 11:25:36.000000000 -0400
+++ Linux/include/linux/nodemask.h	2007-07-27 11:36:15.000000000 -0400
@@ -345,6 +345,7 @@ enum node_states {
 	N_ONLINE,	/* The node is online */
 	N_MEMORY,	/* The node has memory */
 	N_CPU,		/* The node has cpus */
+	N_INTERLEAVE,	/* The node is valid for MPOL_INTERLEAVE */
 	NR_NODE_STATES
 };
 
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-07-27 11:25:36.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-07-27 12:03:29.000000000 -0400
@@ -2003,6 +2003,21 @@ static char zonelist_order_name[3][8] = 
 
 
 #ifdef CONFIG_NUMA
+/*
+ * Command line:  no_interleave_nodes=<NodeList>
+ * Specify nodes to exclude from MPOL_INTERLEAVE masks.
+ */
+static nodemask_t no_interleave_nodes;	/* default:  none */
+
+static __init int setup_no_interleave_nodes(char *nodelist)
+{
+	if (nodelist) {
+		return nodelist_parse(nodelist, no_interleave_nodes);
+	}
+	return 0;
+}
+early_param("no_interleave_nodes", setup_no_interleave_nodes);
+
 /* The value user specified ....changed by config */
 static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
 /* string for sysctl */
@@ -2410,8 +2425,15 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
 
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages) {
 			node_set_state(nid, N_MEMORY);
+			/*
+			 * Only nodes with memory are valid for MPOL_INTERLEAVE,
+			 * but maybe not all of them?
+			 */
+			if (!node_isset(nid, no_interleave_nodes))
+				node_set_state(nid, N_INTERLEAVE);
+		}
 	}
 	return 0;
 }
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-07-27 11:25:36.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-07-27 11:50:01.000000000 -0400
@@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
 		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_MEMORY]);
+					node_states[N_INTERLEAVE]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -1612,11 +1612,12 @@ void __init numa_policy_init(void)
 
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
-	 * enabled across suitably sized nodes (default is >= 16MB), or
-	 * fall back to the largest node if they're all smaller.
+	 * enabled across suitably sized nodes (hard coded >= 16MB) on which
+	 * interleaving is allowed.  Fall back to the largest node if all
+	 * allowable nodes are smaller than the hard coded limit.
 	 */
 	nodes_clear(interleave_nodes);
-	for_each_node_state(nid, N_MEMORY) {
+	for_each_node_state(nid, N_INTERLEAVE) {
 		unsigned long total_pages = node_present_pages(nid);
 
 		/* Preserve the largest node */
Index: Linux/Documentation/kernel-parameters.txt
===================================================================
--- Linux.orig/Documentation/kernel-parameters.txt	2007-07-25 09:29:48.000000000 -0400
+++ Linux/Documentation/kernel-parameters.txt	2007-07-27 11:43:54.000000000 -0400
@@ -1181,6 +1181,15 @@ and is between 256 and 4096 characters. 
 	noinitrd	[RAM] Tells the kernel not to load any configured
 			initial RAM disk.
 
+	no_interleave_nodes [KNL, BOOT] Specifies a list of nodes to exclude
+			[remove] from any nodemask specified with the
+			MPOL_INTERLEAVE policy.  Some platforms have nodes
+			that are "special" in some way and should not be
+			used for policy based interleaving.
+			Format:  no_interleave_nodes=<NodeList>
+			NodeList format is described in
+				Documentation/filesystems/tmpfs.txt
+
 	nointroute	[IA-64]
 
 	nojitter	[IA64] Disables jitter checking for ITC timers.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-07-27 20:07 [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks Lee Schermerhorn
@ 2007-07-28  6:19 ` KAMEZAWA Hiroyuki
  2007-07-30 16:13   ` Lee Schermerhorn
  0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-07-28  6:19 UTC
  To: Lee Schermerhorn
  Cc: linux-mm, Paul Mundt, Christoph Lameter, Nishanth Aravamudan,
	kxr, ak, akpm, Eric Whitney

On Fri, 27 Jul 2007 16:07:57 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> Questions:
> 
> * do we need/want a sysctl for run time modifications?  IMO, no.
> 

I agree that runtime modification is not necessary. But won't applications
or libnuma use this information? Is handling it all implicitly enough?
(Maybe it is.)

BTW, could you print "nodes of XXXX are ignored in INTERLEAVE mempolicy" to
/var/log/messages at boot?
 
-Kame



* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-07-28  6:19 ` KAMEZAWA Hiroyuki
@ 2007-07-30 16:13   ` Lee Schermerhorn
  2007-07-30 18:29     ` Christoph Lameter
  2007-08-01 10:16     ` Paul Mundt
  0 siblings, 2 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2007-07-30 16:13 UTC
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Paul Mundt, Christoph Lameter, Nishanth Aravamudan,
	kxr, ak, akpm, Eric Whitney

On Sat, 2007-07-28 at 15:19 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jul 2007 16:07:57 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > Questions:
> > 
> > * do we need/want a sysctl for run time modifications?  IMO, no.
> > 
> 
> I can agree that runtime modification is not necessary. But applications or
> libnuma will not use this information ? Doing all in implicit way is enough ?
> (maybe enough)

I think it's enough.  But, maybe we should export this info as a node
attribute in sysfs?  Would be easy enough to do, if demand exists.

> 
> BTW, could you print "nodes of XXXX are ignored in INTERLEAVE mempolicy" to
> /var/log/messages at boot ?

Good idea.  It also prompts me to consider better error handling. 

How about this?

---

Introduce mask of nodes to exclude from MPOL_INTERLEAVE masks - V2

Against:  2.6.23-rc1-mm1 atop Christoph Lameter's memoryless
	  node patch set.

V1 -> V2:
+ issue KERN_NOTICE for successful parse of nodelist.
  Suggestion by Kamezawa Hiroyuki.
+ clear no_interleave_nodes nodemask and issue KERN_ERR for
  invalid nodelist argument.
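
The "all or nothing" handling in the second item can be sketched in
user-space C.  This is a minimal sketch under stated assumptions, not
the kernel's nodelist_parse():  parse_nodelist_sketch() and
SKETCH_MAX_NODES are hypothetical names, the mask is limited to 64
nodes, only decimal node ids and lo-hi ranges separated by commas are
accepted, and an empty list is treated as an error.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_NODES 64	/* assumption: mask fits one uint64_t */

/*
 * Parse a list such as "1,3-5" into *mask.  On any error the mask is
 * left cleared ("all or nothing") and -EINVAL is returned.
 */
static int parse_nodelist_sketch(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	*mask = 0;
	if (!s || !*s)
		return -EINVAL;

	for (;;) {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -EINVAL;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -EINVAL;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= SKETCH_MAX_NODES)
			return -EINVAL;
		for (unsigned long n = lo; n <= hi; n++)
			m |= (uint64_t)1 << n;

		if (*end == '\0')
			break;
		if (*end != ',')
			return -EINVAL;
		s = end + 1;
	}
	*mask = m;	/* commit only after the whole list parsed */
	return 0;
}
```

Building the result in a local variable and storing it only on success
means a partly valid list such as "2,x" never leaves a partial mask
behind.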

This patch implements a new node state, N_INTERLEAVE to specify
the subset of nodes with memory [state N_MEMORY] that are valid
for MPOL_INTERLEAVE node masks.  The new state mask is populated
from the N_MEMORY state mask, less any nodes excluded by a new
command line option, no_interleave_nodes.

Rationale:  some architectures and platforms include nodes with
memory that, in some cases, should never appear in MPOL_INTERLEAVE
node masks.  For example, the 'sh' architecture contains a small
amount of SRAM that is local to each cpu.  In some applications,
this memory should be reserved for explicit usage.  Another example
is the pseudo-node on HP ia64 platforms that is already interleaved
on a cache-line granularity by hardware.  Again, in some cases, we
want to reserve this for explicit usage, as it has bandwidth and
[average] latency characteristics quite different from the "real"
nodes.

Note that allocation of fresh hugepages in response to increases
in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
like to propose that allocate_fresh_huge_page() honor the
N_INTERLEAVE state, just as MPOL_INTERLEAVE does.  Then, one can
explicitly allocate hugepages on the excluded nodes, when needed,
using Nish Aravamudan's per node huge page sysfs attribute.
NOT in this patch.

Questions:

* do we need/want a sysctl for run time modifications?  IMO, no.
	Kame-san votes "No".

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/kernel-parameters.txt |    9 +++++++++
 include/linux/nodemask.h            |    1 +
 mm/mempolicy.c                      |    9 +++++----
 mm/page_alloc.c                     |   34 +++++++++++++++++++++++++++++++++-
 4 files changed, 48 insertions(+), 5 deletions(-)

Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h	2007-07-27 15:23:53.000000000 -0400
+++ Linux/include/linux/nodemask.h	2007-07-27 15:23:53.000000000 -0400
@@ -345,6 +345,7 @@ enum node_states {
 	N_ONLINE,	/* The node is online */
 	N_MEMORY,	/* The node has memory */
 	N_CPU,		/* The node has cpus */
+	N_INTERLEAVE,	/* The node is valid for MPOL_INTERLEAVE */
 	NR_NODE_STATES
 };
 
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-07-27 15:23:53.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-07-30 10:25:38.000000000 -0400
@@ -2003,6 +2003,31 @@ static char zonelist_order_name[3][8] = 
 
 
 #ifdef CONFIG_NUMA
+/*
+ * Command line:  no_interleave_nodes=<NodeList>
+ * Specify nodes to exclude from MPOL_INTERLEAVE masks.
+ */
+static nodemask_t no_interleave_nodes;	/* default:  none */
+
+static __init int setup_no_interleave_nodes(char *nodelist)
+{
+	if (nodelist) {
+		int err = nodelist_parse(nodelist, no_interleave_nodes);
+		if (err) {
+			printk(KERN_ERR
+				"Ignoring invalid no_interleave_nodes nodelist:"
+				"  %s\n", nodelist);
+			nodes_clear(no_interleave_nodes); /* all or nothing */
+			return err;
+		}
+		printk(KERN_NOTICE
+			"Nodes ignored for INTERLEAVE memory policy: %s\n",
+			nodelist);
+	}
+	return 0;
+}
+early_param("no_interleave_nodes", setup_no_interleave_nodes);
+
 /* The value user specified ....changed by config */
 static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
 /* string for sysctl */
@@ -2410,8 +2435,15 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
 
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages) {
 			node_set_state(nid, N_MEMORY);
+			/*
+			 * Only nodes with memory are valid for MPOL_INTERLEAVE,
+			 * but maybe not all of them?
+			 */
+			if (!node_isset(nid, no_interleave_nodes))
+				node_set_state(nid, N_INTERLEAVE);
+		}
 	}
 	return 0;
 }
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-07-27 15:23:53.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-07-30 11:09:20.000000000 -0400
@@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
 		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_MEMORY]);
+					node_states[N_INTERLEAVE]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -1612,11 +1612,12 @@ void __init numa_policy_init(void)
 
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
-	 * enabled across suitably sized nodes (default is >= 16MB), or
-	 * fall back to the largest node if they're all smaller.
+	 * enabled across suitably sized nodes (hard coded >= 16MB) on which
+	 * interleaving is allowed.  Fall back to the largest node if all
+	 * allowable nodes are smaller than the hard coded limit.
 	 */
 	nodes_clear(interleave_nodes);
-	for_each_node_state(nid, N_MEMORY) {
+	for_each_node_state(nid, N_INTERLEAVE) {
 		unsigned long total_pages = node_present_pages(nid);
 
 		/* Preserve the largest node */
Index: Linux/Documentation/kernel-parameters.txt
===================================================================
--- Linux.orig/Documentation/kernel-parameters.txt	2007-07-27 15:22:41.000000000 -0400
+++ Linux/Documentation/kernel-parameters.txt	2007-07-27 15:23:53.000000000 -0400
@@ -1181,6 +1181,15 @@ and is between 256 and 4096 characters. 
 	noinitrd	[RAM] Tells the kernel not to load any configured
 			initial RAM disk.
 
+	no_interleave_nodes [KNL, BOOT] Specifies a list of nodes to exclude
+			[remove] from any nodemask specified with the
+			MPOL_INTERLEAVE policy.  Some platforms have nodes
+			that are "special" in some way and should not be
+			used for policy based interleaving.
+			Format:  no_interleave_nodes=<NodeList>
+			NodeList format is described in
+				Documentation/filesystems/tmpfs.txt
+
 	nointroute	[IA-64]
 
 	nojitter	[IA64] Disables jitter checking for ITC timers.



* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-07-30 16:13   ` Lee Schermerhorn
@ 2007-07-30 18:29     ` Christoph Lameter
  2007-07-30 20:32       ` Lee Schermerhorn
  2007-08-01 10:16     ` Paul Mundt
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2007-07-30 18:29 UTC
  To: Lee Schermerhorn
  Cc: KAMEZAWA Hiroyuki, linux-mm, Paul Mundt, Nishanth Aravamudan, ak,
	akpm, Eric Whitney

On Mon, 30 Jul 2007, Lee Schermerhorn wrote:

> +	return 0;
> +}
> +early_param("no_interleave_nodes", setup_no_interleave_nodes);
> +
>  /* The value user specified ....changed by config */
>  static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
>  /* string for sysctl */
> @@ -2410,8 +2435,15 @@ static int __build_all_zonelists(void *d
>  		build_zonelists(pgdat);
>  		build_zonelist_cache(pgdat);
>  
> -		if (pgdat->node_present_pages)
> +		if (pgdat->node_present_pages) {
>  			node_set_state(nid, N_MEMORY);
> +			/*
> +			 * Only nodes with memory are valid for MPOL_INTERLEAVE,
> +			 * but maybe not all of them?
> +			 */
> +			if (!node_isset(nid, no_interleave_nodes))
> +				node_set_state(nid, N_INTERLEAVE);

			else
			 printk ....

would be better since it will only list the nodes that have memory and are 
excluded from interleave.
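
As a rough user-space illustration of that suggestion (a sketch only,
assuming 64-bit node masks; format_excluded_nodes() is a hypothetical
helper, not an existing kernel function), the report would be built
from the intersection of "has memory" and "excluded" rather than from
the raw command line argument:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Write a comma-separated list of the nodes that have memory AND are
 * excluded from interleave into buf.  Returns the number of nodes
 * written; stops early if buf is too small.
 */
static int format_excluded_nodes(uint64_t has_memory, uint64_t excluded,
				 char *buf, size_t len)
{
	uint64_t report = has_memory & excluded;
	size_t used = 0;
	int count = 0;

	if (len)
		buf[0] = '\0';
	for (int nid = 0; nid < 64; nid++) {
		int n;

		if (!(report & ((uint64_t)1 << nid)))
			continue;
		n = snprintf(buf + used, len - used, "%s%d",
			     count ? "," : "", nid);
		if (n < 0 || (size_t)n >= len - used) {
			if (len)
				buf[used] = '\0';	/* drop partial entry */
			break;
		}
		used += (size_t)n;
		count++;
	}
	return count;
}
```

A memoryless node named in the exclusion list simply never shows up in
the output.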


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-07-30 18:29     ` Christoph Lameter
@ 2007-07-30 20:32       ` Lee Schermerhorn
  2007-07-30 21:57         ` Christoph Lameter
  0 siblings, 1 reply; 17+ messages in thread
From: Lee Schermerhorn @ 2007-07-30 20:32 UTC
  To: Christoph Lameter
  Cc: KAMEZAWA Hiroyuki, linux-mm, Paul Mundt, Nishanth Aravamudan, ak,
	akpm, Eric Whitney

On Mon, 2007-07-30 at 11:29 -0700, Christoph Lameter wrote:
> On Mon, 30 Jul 2007, Lee Schermerhorn wrote:
> 
> > +	return 0;
> > +}
> > +early_param("no_interleave_nodes", setup_no_interleave_nodes);
> > +
> >  /* The value user specified ....changed by config */
> >  static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
> >  /* string for sysctl */
> > @@ -2410,8 +2435,15 @@ static int __build_all_zonelists(void *d
> >  		build_zonelists(pgdat);
> >  		build_zonelist_cache(pgdat);
> >  
> > -		if (pgdat->node_present_pages)
> > +		if (pgdat->node_present_pages) {
> >  			node_set_state(nid, N_MEMORY);
> > +			/*
> > +			 * Only nodes with memory are valid for MPOL_INTERLEAVE,
> > +			 * but maybe not all of them?
> > +			 */
> > +			if (!node_isset(nid, no_interleave_nodes))
> > +				node_set_state(nid, N_INTERLEAVE);
> 
> 			else
> 			 printk ....
> 
> would be better since it will only list the nodes that have memory and are 
> excluded from interleave.

You mean instead of just listing the no_interleave_nodes node list
argument which might contain memoryless nodes? 

I'll fix that up on next respin.

Lee


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-07-30 20:32       ` Lee Schermerhorn
@ 2007-07-30 21:57         ` Christoph Lameter
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2007-07-30 21:57 UTC
  To: Lee Schermerhorn
  Cc: KAMEZAWA Hiroyuki, linux-mm, Paul Mundt, Nishanth Aravamudan, ak,
	akpm, Eric Whitney

On Mon, 30 Jul 2007, Lee Schermerhorn wrote:

> You mean instead of just listing the no_interleave_nodes node list
> argument which might contain memoryless nodes? 

Right.
List the nodes that have memory but that are not included in interleave.


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-07-30 16:13   ` Lee Schermerhorn
  2007-07-30 18:29     ` Christoph Lameter
@ 2007-08-01 10:16     ` Paul Mundt
  2007-08-01 10:33       ` Andi Kleen
  2007-08-01 13:39       ` Lee Schermerhorn
  1 sibling, 2 replies; 17+ messages in thread
From: Paul Mundt @ 2007-08-01 10:16 UTC
  To: Lee Schermerhorn
  Cc: KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter,
	Nishanth Aravamudan, kxr, ak, akpm, Eric Whitney

On Mon, Jul 30, 2007 at 12:13:48PM -0400, Lee Schermerhorn wrote:
> Rationale:  some architectures and platforms include nodes with
> memory that, in some cases, should never appear in MPOL_INTERLEAVE
> node masks.  For example, the 'sh' architecture contains a small
> amount of SRAM that is local to each cpu.  In some applications,
> this memory should be reserved for explicit usage.  Another example
> is the pseudo-node on HP ia64 platforms that is already interleaved
> on a cache-line granularity by hardware.  Again, in some cases, we
> want to reserve this for explicit usage, as it has bandwidth and
> [average] latency characteristics quite different from the "real"
> nodes.
> 
Well, it's not so much the interleave that's the problem so much as
_when_ we interleave. The problem with the interleave node mask at system
init is that the kernel attempts to spread out data structures across
these nodes, which results in us being completely out of memory by the
time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE
is not so much of a problem, applications just have to be careful with
their allocations.

The main thing is keeping the kernel away from these nodes unless it's
been specifically asked to fetch some memory from there. Every page does
count.

The real problem is how we want to deal with the node avoidance mask. In
SLOB things presently work quite well in this regard, Christoph's
slub_nodes= patch did a similar thing:

	http://marc.info/?l=linux-mm&m=118127465421877&w=2
	http://marc.info/?l=linux-mm&m=118127688911359&w=2

> Note that allocation of fresh hugepages in response to increases
> in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
> like to propose that allocate_fresh_huge_page() use the 
> N_INTERLEAVE state as well as MPOL_INTERLEAVE.  Then, one can
> explicity allocate hugepages on the excluded nodes, when needed,
> using Nish Aravamundan's per node huge page sysfs attribute.
> NOT in this patch.
> 
If we can differentiate between MPOL_INTERLEAVE from the kernel's point
of view, and explicit MPOL_INTERLEAVE specifiers via mbind() from
userspace, that works fine for my case. However, the mpol_new() changes
in this patch deny small nodes the ability to ever be included in an
MPOL_INTERLEAVE policy, when it's only the kernel policy that I have a
problem with.

Having said that, I do like the node states and using that to exclude a
node from the system init interleave nodelist, but this still won't
completely solve the tiny node problems.

> @@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
>  	case MPOL_INTERLEAVE:
>  		policy->v.nodes = *nodes;
>  		nodes_and(policy->v.nodes, policy->v.nodes,
> -					node_states[N_MEMORY]);
> +					node_states[N_INTERLEAVE]);
>  		if (nodes_weight(policy->v.nodes) == 0) {
>  			kmem_cache_free(policy_cache, policy);
>  			return ERR_PTR(-EINVAL);

Leaving this as node_states[N_MEMORY] combined with the rest of the patch
would work for me, but that sort of changes the scope of the entire patch
;-)
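
To make the distinction concrete, here is a minimal user-space sketch
of that variant, assuming 64-bit node masks.  The struct
node_state_sketch and filter_interleave_request() are hypothetical
names, and the boot_time flag is an assumption introduced purely for
illustration; nothing in the posted patch passes such a flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct node_state_sketch {
	uint64_t has_memory;	/* N_MEMORY analogue */
	uint64_t interleave_ok;	/* N_INTERLEAVE analogue */
};

/*
 * Filter an interleave request.  The kernel's own boot-time policy is
 * restricted to interleave_ok; an explicit user request (for example
 * via mbind()) may use any node that has memory.
 */
static int filter_interleave_request(const struct node_state_sketch *st,
				     uint64_t requested, bool boot_time,
				     uint64_t *result)
{
	uint64_t allowed = boot_time ? st->interleave_ok : st->has_memory;
	uint64_t mask = requested & allowed;

	if (mask == 0)
		return -EINVAL;
	*result = mask;
	return 0;
}
```

With nodes 0-2 holding memory and node 2 excluded, the boot-time
request for all nodes is trimmed to nodes 0-1, while a user request
for node 2 alone still succeeds.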


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-01 10:16     ` Paul Mundt
@ 2007-08-01 10:33       ` Andi Kleen
  2007-08-01 11:01         ` Paul Mundt
  2007-08-01 13:39       ` Lee Schermerhorn
  1 sibling, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2007-08-01 10:33 UTC
  To: Paul Mundt
  Cc: Lee Schermerhorn, KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter,
	Nishanth Aravamudan, kxr, akpm, Eric Whitney

On Wednesday 01 August 2007 12:16:51 Paul Mundt wrote:

> Well, it's not so much the interleave that's the problem so much as
> _when_ we interleave. The problem with the interleave node mask at system
> init is that the kernel attempts to spread out data structures across
> these nodes, which results in us being completely out of memory by the
> time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE
> is not so much of a problem, applications just have to be careful with
> their allocations.

I assume you got a mostly flat latency machine with a few additional
small nodes for special purposes, right?

Would the problem be solved if you just had a per arch CONFIG
to disable interleaving at boot?  That would be really simple.

-Andi (who is a bit sceptical of more and more boot options) 


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-01 10:33       ` Andi Kleen
@ 2007-08-01 11:01         ` Paul Mundt
  2007-08-01 11:07           ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: Paul Mundt @ 2007-08-01 11:01 UTC
  To: Andi Kleen
  Cc: Lee Schermerhorn, KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter,
	Nishanth Aravamudan, kxr, akpm, Eric Whitney

On Wed, Aug 01, 2007 at 12:33:01PM +0200, Andi Kleen wrote:
> On Wednesday 01 August 2007 12:16:51 Paul Mundt wrote:
> > Well, it's not so much the interleave that's the problem so much as
> > _when_ we interleave. The problem with the interleave node mask at system
> > init is that the kernel attempts to spread out data structures across
> > these nodes, which results in us being completely out of memory by the
> > time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE
> > is not so much of a problem, applications just have to be careful with
> > their allocations.
> 
> I assume you got a mostly flat latency machine with a few additional
> small nodes for special purposes, right?
> 
No, each one of the nodes has differing latency, and also differing
characteristics with regards to caching behaviour and things like that.
That's what I was attempting to convey in reply to Andrew:

	http://marc.info/?l=linux-mm&m=118594672828737&w=2

> Would the problem be solved if you just had a per arch CONFIG
> to disable interleaving at boot?  That would be really simple.
> 
As long as interleaving is possible after boot, then yes. It's only the
boot-time interleave that we would like to avoid, and even then, only
across specific nodes (which so far I've just hacked around by removing
small nodes from the interleave map at system init time).
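
For reference, the boot-time selection being discussed can be sketched
in user-space C along the lines of the comment in numa_policy_init()
quoted in the patch:  interleave across allowed nodes of at least 16MB,
and fall back to the largest allowed node if all are smaller.  This is
a minimal sketch assuming 64-bit node masks and per-node sizes in
bytes; boot_interleave_mask() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_INTERLEAVE_BYTES (16ULL << 20)	/* 16MB, per the comment */

/*
 * Build the boot-time interleave mask from the allowed nodes.
 * node_bytes[nid] is the memory present on node nid.  Nodes below the
 * threshold are skipped; if every allowed node is too small, fall
 * back to the largest allowed node alone.
 */
static uint64_t boot_interleave_mask(const uint64_t *node_bytes,
				     size_t nr_nodes, uint64_t allowed)
{
	uint64_t mask = 0, largest = 0;
	int prefer = -1;

	for (size_t nid = 0; nid < nr_nodes && nid < 64; nid++) {
		if (!(allowed & ((uint64_t)1 << nid)))
			continue;
		/* Remember the largest allowed node for the fallback. */
		if (prefer < 0 || node_bytes[nid] > largest) {
			largest = node_bytes[nid];
			prefer = (int)nid;
		}
		if (node_bytes[nid] >= MIN_INTERLEAVE_BYTES)
			mask |= (uint64_t)1 << nid;
	}
	if (mask == 0 && prefer >= 0)
		mask = (uint64_t)1 << prefer;
	return mask;
}
```

Clearing a node's bit in the allowed mask has the same effect as
removing it from the interleave map by hand at system init.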

I would also favour an option where we didn't have to set these things as
obscure boot options.


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-01 11:01         ` Paul Mundt
@ 2007-08-01 11:07           ` Andi Kleen
  2007-08-01 11:21             ` Paul Mundt
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2007-08-01 11:07 UTC
  To: Paul Mundt
  Cc: Lee Schermerhorn, KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter,
	Nishanth Aravamudan, kxr, akpm, Eric Whitney

> As long as interleaving is possible after boot, then yes. It's only the
> boot-time interleave that we would like to avoid,

But when anybody does interleaving later it could just as easily
fill up your small nodes, couldn't it?

Boot time allocations are small compared to what user space
later can allocate.

And do you really want them in the normal fallback lists? The normal zone
reservation heuristics probably won't work unless you put them into
special low zones.

-Andi


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-01 11:07           ` Andi Kleen
@ 2007-08-01 11:21             ` Paul Mundt
  2007-08-01 13:54               ` Lee Schermerhorn
  0 siblings, 1 reply; 17+ messages in thread
From: Paul Mundt @ 2007-08-01 11:21 UTC
  To: Andi Kleen
  Cc: Lee Schermerhorn, KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter,
	Nishanth Aravamudan, kxr, akpm, Eric Whitney

On Wed, Aug 01, 2007 at 01:07:43PM +0200, Andi Kleen wrote:
> 
> > As long as interleaving is possible after boot, then yes. It's only the
> > boot-time interleave that we would like to avoid,
> 
> But when anybody does interleaving later it could just as easily
> fill up your small nodes, couldn't it?
> 
Yes, but these are in embedded environments where we have control over
what the applications are doing. Most of these sorts of things are for
applications where we know what sort of latency requirements we have to
deal with, and so the workload is very much tied to the worst-case range of
nodes, or just to a particular node. We might only have certain buffers
that need to be backed by faster memory as well, so while most of the
application pages will come from node 0 (system memory), certain other
allocations will come from other nodes. We've been experimenting with
doing that through tmpfs with mpol tuning.

In the general case however it's fairly safe to include the tiny nodes as
part of a larger set with a prefer policy so we don't immediately OOM.

> Boot time allocations are small compared to what user space
> later can allocate.
> 
Yes, we only want certain applications to explicitly poke at those nodes,
but they do have a use case for interleave, so it is not functionality I
would want to lose completely.

> And do you really want them in the normal fallback lists? The normal zone
> reservation heuristics probably won't work unless you put them into
> special low zones.
> 
That's something else to look at also, though I would very much like to
avoid having to construct custom zonelists. It would be nice to keep things as
simple and as non-invasive as possible. As far as the existing NUMA code
goes, we're not quite all the way there yet in terms of supporting these
things as well as we can, but it has proven to be a pretty good starting
point.


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-01 10:16     ` Paul Mundt
  2007-08-01 10:33       ` Andi Kleen
@ 2007-08-01 13:39       ` Lee Schermerhorn
  2007-08-03  7:53         ` Paul Mundt
  1 sibling, 1 reply; 17+ messages in thread
From: Lee Schermerhorn @ 2007-08-01 13:39 UTC
  To: Paul Mundt
  Cc: KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter,
	Nishanth Aravamudan, kxr, ak, akpm, Eric Whitney

On Wed, 2007-08-01 at 19:16 +0900, Paul Mundt wrote:
> On Mon, Jul 30, 2007 at 12:13:48PM -0400, Lee Schermerhorn wrote:
> > Rationale:  some architectures and platforms include nodes with
> > memory that, in some cases, should never appear in MPOL_INTERLEAVE
> > node masks.  For example, the 'sh' architecture contains a small
> > amount of SRAM that is local to each cpu.  In some applications,
> > this memory should be reserved for explicit usage.  Another example
> > is the pseudo-node on HP ia64 platforms that is already interleaved
> > on a cache-line granularity by hardware.  Again, in some cases, we
> > want to reserve this for explicit usage, as it has bandwidth and
> > [average] latency characteristics quite different from the "real"
> > nodes.
> > 
> Well, it's not so much the interleave that's the problem so much as
> _when_ we interleave. The problem with the interleave node mask at system
> init is that the kernel attempts to spread out data structures across
> these nodes, which results in us being completely out of memory by the
> time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE
> is not so much of a problem, applications just have to be careful with
> their allocations.
> 
> The main thing is keeping the kernel away from these nodes unless it's
> been specifically asked to fetch some memory from there. Every page does
> count.
> 
> The real problem is how we want to deal with the node avoidance mask. In
> SLOB things presently work quite well in this regard, Christoph's
> slub_nodes= patch did a similar thing:
> 
> 	http://marc.info/?l=linux-mm&m=118127465421877&w=2
> 	http://marc.info/?l=linux-mm&m=118127688911359&w=2
> 
> > Note that allocation of fresh hugepages in response to increases
> > in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
> > like to propose that allocate_fresh_huge_page() use the 
> > N_INTERLEAVE state as well as MPOL_INTERLEAVE.  Then, one can
> > explicity allocate hugepages on the excluded nodes, when needed,
> > using Nish Aravamundan's per node huge page sysfs attribute.
> > NOT in this patch.
> > 
> If we can differentiate between MPOL_INTERLEAVE from the kernel's point
> of view, and explicit MPOL_INTERLEAVE specifiers via mbind() from
> userspace, that works fine for my case. However, the mpol_new() changes
> in this patch deny small nodes the ability to ever be included in an
> MPOL_INTERLEAVE policy, when it's only the kernel policy that I have a
> problem with.

Ah, but it would only "deny small nodes" if you nominate them in the
boot option.  I haven't changed your heuristic in numa_policy_init.  So,
it will still eliminate small nodes from the boot time interleave
nodemask, independent of whether or not you specify them in the
no_interleave_nodes list.
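
To make the two independent filters concrete, here is a minimal userspace
sketch, not the patch itself.  It models a nodemask as a 64-bit word; the
names interleave_state() and boot_interleave_mask() and the min_pages
threshold are assumptions for illustration, and the real size heuristic in
numa_policy_init may differ in detail.

```c
#include <stdint.h>

#define MAX_NODES 64

/* One bit per node; a userspace stand-in for the kernel's nodemask_t. */
typedef uint64_t nodemask;

/*
 * N_INTERLEAVE as described in this thread: the nodes with memory,
 * less any nodes named in no_interleave_nodes=.
 */
static nodemask interleave_state(nodemask n_memory, nodemask no_interleave)
{
	return n_memory & ~no_interleave;
}

/*
 * Boot-time interleave mask: start from N_INTERLEAVE, then apply the
 * size heuristic on top of it.  min_pages is a placeholder threshold.
 */
static nodemask boot_interleave_mask(nodemask n_interleave,
				     const unsigned long *present_pages,
				     unsigned long min_pages)
{
	nodemask mask = 0;
	int nid;

	for (nid = 0; nid < MAX_NODES; nid++) {
		if (!(n_interleave & ((nodemask)1 << nid)))
			continue;	/* excluded by the boot option */
		if (present_pages[nid] >= min_pages)
			mask |= (nodemask)1 << nid;	/* large enough */
	}
	return mask;
}
```

With nodes 0-2 populated and node 1 tiny, the boot mask drops node 1
whether or not it was listed in no_interleave_nodes, while N_INTERLEAVE
itself only loses the nodes that were explicitly listed.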

Or am I missing your point?
> 
> Having said that, I do like the node states and using that to exclude a
> node from the system init interleave nodelist, but this still won't
> completely solve the tiny node problems.

Right, so we should keep your boot time heuristic.

> 
> > @@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
> >  	case MPOL_INTERLEAVE:
> >  		policy->v.nodes = *nodes;
> >  		nodes_and(policy->v.nodes, policy->v.nodes,
> > -					node_states[N_MEMORY]);
> > +					node_states[N_INTERLEAVE]);
> >  		if (nodes_weight(policy->v.nodes) == 0) {
> >  			kmem_cache_free(policy_cache, policy);
> >  			return ERR_PTR(-EINVAL);
> 
> Leaving this as node_states[N_MEMORY] combined with the rest of the patch
> would work for me, but that sort of changes the scope of the entire patch
> ;-)

Yeah, it breaks one of my main reasons for proposing this.  I still have
no way to keep user requested interleaving off my "special" hardware
interleaved nodes in the case where we don't want this.  I should
mention that I'm assuming that the current "best practice" is to
interleave across "all available nodes" in the application's current
context.
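
The effect of the mpol_new() hunk quoted above can be modeled in a few
lines.  This is a userspace sketch under the same 64-bit nodemask
assumption, with a made-up function name; it only mirrors the
nodes_and()/-EINVAL logic visible in the diff.

```c
#include <errno.h>
#include <stdint.h>

typedef uint64_t nodemask;

/*
 * Model of the MPOL_INTERLEAVE case: the requested mask is ANDed with
 * the N_INTERLEAVE state.  Returns 0 and stores the effective mask, or
 * -EINVAL if no requested node survives the masking.
 */
static int interleave_policy_nodes(nodemask requested, nodemask n_interleave,
				   nodemask *effective)
{
	nodemask allowed = requested & n_interleave;

	if (allowed == 0)
		return -EINVAL;
	*effective = allowed;
	return 0;
}
```

An application that asks for "all available nodes" (say 0-3) on a system
where node 3 is excluded ends up interleaving over 0-2 without having to
know about node 3; asking for node 3 alone fails with -EINVAL.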

[more follow up to later messages]

Lee


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-01 11:21             ` Paul Mundt
@ 2007-08-01 13:54               ` Lee Schermerhorn
  2007-08-02 17:38                 ` Mark Gross
  0 siblings, 1 reply; 17+ messages in thread
From: Lee Schermerhorn @ 2007-08-01 13:54 UTC (permalink / raw)
  To: Paul Mundt
  Cc: Andi Kleen, KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter,
	Nishanth Aravamudan, kxr, akpm, Eric Whitney

On Wed, 2007-08-01 at 20:21 +0900, Paul Mundt wrote:
> On Wed, Aug 01, 2007 at 01:07:43PM +0200, Andi Kleen wrote:
> > 
> > > As long as interleaving is possible after boot, then yes. It's only the
> > > boot-time interleave that we would like to avoid,
> > 
> > But when anybody does interleaving later it could just as easily
> > fill up your small nodes, couldn't it?
> > 
> Yes, but these are in embedded environments where we have control over
> what the applications are doing. Most of these sorts of things are for
> applications where we know what sort of latency requires we have to deal
> with, and so the workload is very much tied to the worst-case range of
> nodes, or just to a particular node. We might only have certain buffers
> that need to be backed by faster memory as well, so while most of the
> application pages will come from node 0 (system memory), certain other
> allocations will come from other nodes. We've been experimenting with
> doing that through tmpfs with mpol tuning.
> 
> In the general case however it's fairly safe to include the tiny nodes as
> part of a larger set with a prefer policy so we don't immediately OOM.
> 
> > Boot time allocations are small compared to what user space
> > later can allocate.
> > 
> Yes, we only want certain applications to explicitly poke at those nodes,
> but they do have a use case for interleave, so it is not functionality I
> would want to lose completely.

This is why I wanted to use an "obscure boot option".  I don't see this
as strictly an architectural/platform issue.  Rather, it's a combination
of the arch/platform and how it's being used for specific applications.
So, I don't see how one could accomplish this with a heuristic.

As Paul mentioned, in embedded systems, one has a bit more control over
what applications are doing.  In that case, I could envision a config
option to specify the initial/default value for the no_interleave_nodes
at kernel build time and dispense with the boot option.  [Any interest
in such an option, Paul?]  But for platforms like ours, that tend to run
enterprise distro kernels, I need a way to specify on a per site or per
installation basis, what nodes should be used.  Our approach would be to
document this in a "best practices" doc that the customer or, more
likely, our field software specialists, would use to optimize the
platform and OS config for the application.
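
For reference, a node list of the form used by no_interleave_nodes=
(for example "1,3-5") could be turned into a mask roughly as follows.
This is a standalone userspace sketch; the kernel has its own node list
parsing helpers, and parse_node_list() here is a hypothetical stand-in,
not the code from the patch.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_NODES 64

typedef uint64_t nodemask;

/*
 * Parse a comma-separated list of node ids and ranges, e.g. "1,3-5".
 * Returns 0 and stores the mask on success, -1 on malformed input or
 * a node id outside 0..MAX_NODES-1.  An empty string yields an empty mask.
 */
static int parse_node_list(const char *s, nodemask *out)
{
	nodemask mask = 0;

	while (*s) {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= MAX_NODES)
			return -1;
		for (; lo <= hi; lo++)
			mask |= (nodemask)1 << lo;
		if (*end == ',') {
			end++;
			if (*end == '\0')	/* reject a trailing comma */
				return -1;
		} else if (*end != '\0') {
			return -1;
		}
		s = end;
	}
	*out = mask;
	return 0;
}
```

A build-time default, as floated above, would amount to feeding a
configured string through the same parser when no boot option is given.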
 
> 
> > And do you really want them in the normal fallback lists? The normal zone
> > reservation heuristics probably won't work unless you put them into
> > special low zones.
> > 
> That's something else to look at also, though I would very much like to
> avoid having to construct custom zonelists. it would be nice to keep things as
> simple and as non-invasive as possible. As far as the existing NUMA code
> goes, we're not quite all the way there yet in terms of supporting these
> things as well as we can, but it has proven to be a pretty good starting
> point.

Yes, there are rumblings on the mailing list about passing just a
starting [preferred] node and a node mask to the page allocator.  I'm
too backed up with other things to think too much about this, yet.

Lee


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-01 13:54               ` Lee Schermerhorn
@ 2007-08-02 17:38                 ` Mark Gross
  2007-08-02 18:46                   ` Lee Schermerhorn
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Gross @ 2007-08-02 17:38 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Paul Mundt, Andi Kleen, KAMEZAWA Hiroyuki, linux-mm,
	Christoph Lameter, Nishanth Aravamudan, kxr, akpm, Eric Whitney

On Wed, Aug 01, 2007 at 09:54:06AM -0400, Lee Schermerhorn wrote:
> On Wed, 2007-08-01 at 20:21 +0900, Paul Mundt wrote:
> > On Wed, Aug 01, 2007 at 01:07:43PM +0200, Andi Kleen wrote:
> > > 
> > > > As long as interleaving is possible after boot, then yes. It's only the
> > > > boot-time interleave that we would like to avoid,
> > > 
> > > But when anybody does interleaving later it could just as easily
> > > fill up your small nodes, couldn't it?
> > > 
> > Yes, but these are in embedded environments where we have control over
> > what the applications are doing. Most of these sorts of things are for
> > applications where we know what sort of latency requires we have to deal
> > with, and so the workload is very much tied to the worst-case range of
> > nodes, or just to a particular node. We might only have certain buffers
> > that need to be backed by faster memory as well, so while most of the
> > application pages will come from node 0 (system memory), certain other
> > allocations will come from other nodes. We've been experimenting with
> > doing that through tmpfs with mpol tuning.
> > 
> > In the general case however it's fairly safe to include the tiny nodes as
> > part of a larger set with a prefer policy so we don't immediately OOM.
> > 
> > > Boot time allocations are small compared to what user space
> > > later can allocate.
> > > 
> > Yes, we only want certain applications to explicitly poke at those nodes,
> > but they do have a use case for interleave, so it is not functionality I
> > would want to lose completely.
> 
> This is why I wanted to use an "obscure boot option".  I don't see this
> as strictly an architectural/platform issue.  Rather, it's a combination
> of the arch/platform and how it's being used for specific applications.
> So, I don't see how one could accomplish this with a heuristic.
> 
> As Paul mentioned, in embedded systems, one has a bit more control over
> what applications are doing.  In that case, I could envision a config
> option to specify the initial/default value for the no_interleave_nodes
> at kernel build time and dispense with the boot option.  [Any interest

Having the interleave as a build time option won't work for some power
managed memory applications.  I posted an RFC a few months back and will
be coming back to it in a few weeks, so take this comment with a grain
of salt.  But I want to be able to switch on some ACPI table entries to
trigger the non-interleave boot time allocation behavior for some
FB-DIMM based platforms.  My needs are in surprising alignment with Paul's on
this stuff.


--mgross

> in such an option, Paul?]  But for platforms like ours, that tend to run
> enterprise distro kernels, I need a way to specify on a per site or per
> installation basis, what nodes should be used.  Our approach would be to
> document this in a "best practices" doc that the customer or, more
> likely, our field software specialists, would use to optimize the
> platform and OS config for the application.
>  
> > 
> > > And do you really want them in the normal fallback lists? The normal zone
> > > reservation heuristics probably won't work unless you put them into
> > > special low zones.
> > > 
> > That's something else to look at also, though I would very much like to
> > avoid having to construct custom zonelists. it would be nice to keep things as
> > simple and as non-invasive as possible. As far as the existing NUMA code
> > goes, we're not quite all the way there yet in terms of supporting these
> > things as well as we can, but it has proven to be a pretty good starting
> > point.
> 
> Yes, there are rumblings on the mailing list about passing just a
> starting [preferred] node and a node mask to the page allocator.  I'm
> too backed up with other things to think too much about this, yet.
> 
> Lee

* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-02 17:38                 ` Mark Gross
@ 2007-08-02 18:46                   ` Lee Schermerhorn
  2007-08-06 16:42                     ` Mark Gross
  0 siblings, 1 reply; 17+ messages in thread
From: Lee Schermerhorn @ 2007-08-02 18:46 UTC (permalink / raw)
  To: mgross
  Cc: Paul Mundt, Andi Kleen, KAMEZAWA Hiroyuki, linux-mm,
	Christoph Lameter, Nishanth Aravamudan, kxr, akpm, Eric Whitney

On Thu, 2007-08-02 at 10:38 -0700, Mark Gross wrote:
> On Wed, Aug 01, 2007 at 09:54:06AM -0400, Lee Schermerhorn wrote:
> > On Wed, 2007-08-01 at 20:21 +0900, Paul Mundt wrote:
> > > On Wed, Aug 01, 2007 at 01:07:43PM +0200, Andi Kleen wrote:
> > > > 
> > > > > As long as interleaving is possible after boot, then yes. It's only the
> > > > > boot-time interleave that we would like to avoid,
> > > > 
> > > > But when anybody does interleaving later it could just as easily
> > > > fill up your small nodes, couldn't it?
> > > > 
> > > Yes, but these are in embedded environments where we have control over
> > > what the applications are doing. Most of these sorts of things are for
> > > applications where we know what sort of latency requires we have to deal
> > > with, and so the workload is very much tied to the worst-case range of
> > > nodes, or just to a particular node. We might only have certain buffers
> > > that need to be backed by faster memory as well, so while most of the
> > > application pages will come from node 0 (system memory), certain other
> > > allocations will come from other nodes. We've been experimenting with
> > > doing that through tmpfs with mpol tuning.
> > > 
> > > In the general case however it's fairly safe to include the tiny nodes as
> > > part of a larger set with a prefer policy so we don't immediately OOM.
> > > 
> > > > Boot time allocations are small compared to what user space
> > > > later can allocate.
> > > > 
> > > Yes, we only want certain applications to explicitly poke at those nodes,
> > > but they do have a use case for interleave, so it is not functionality I
> > > would want to lose completely.
> > 
> > This is why I wanted to use an "obscure boot option".  I don't see this
> > as strictly an architectural/platform issue.  Rather, it's a combination
> > of the arch/platform and how it's being used for specific applications.
> > So, I don't see how one could accomplish this with a heuristic.
> > 
> > As Paul mentioned, in embedded systems, one has a bit more control over
> > what applications are doing.  In that case, I could envision a config
> > option to specify the initial/default value for the no_interleave_nodes
> > at kernel build time and dispense with the boot option.  [Any interest
> 
> Having the interleave as a build time option won't work for some power
> managed memory applications.  I posted an RFC a few months back and will
> be coming back to it in a few weeks, so take this comment with a grain
> of salt.  But I want to be able to switch on some ACPI table entries to
> trigger the non-interleave boot time allocation behavior for some FBDIM
> based platforms.  My needs are in surprising alignment with Paul's on
> this stuff.
> 
> 
> --mgross
<snip>

Mark:  you mean "boot time option", right?

When you get back to it, can you verify that this patch won't affect
what you want to do in policy init [boot time interleave mask]?  That
should hold as long as no one specifies any no_interleave_nodes, of
course.  And even then, all that happens is that maybe more nodes get
excluded from the boot time policy mask than you would have excluded
based on ACPI info.

Until you have the ACPI table info and parsing in place [or maybe you
already have this], this patch could allow you to test with the desired
nodes excluded...

Lee


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-01 13:39       ` Lee Schermerhorn
@ 2007-08-03  7:53         ` Paul Mundt
  0 siblings, 0 replies; 17+ messages in thread
From: Paul Mundt @ 2007-08-03  7:53 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter,
	Nishanth Aravamudan, kxr, ak, akpm, Eric Whitney

On Wed, Aug 01, 2007 at 09:39:18AM -0400, Lee Schermerhorn wrote:
> On Wed, 2007-08-01 at 19:16 +0900, Paul Mundt wrote:
> > If we can differentiate between MPOL_INTERLEAVE from the kernel's point
> > of view, and explicit MPOL_INTERLEAVE specifiers via mbind() from
> > userspace, that works fine for my case. However, the mpol_new() changes
> > in this patch deny small nodes the ability to ever be included in an
> > MPOL_INTERLEAVE policy, when it's only the kernel policy that I have a
> > problem with.
> 
> Ah, but it would only "deny small nodes" if you nominate them in the
> boot option.  I haven't changed your heuristic in numa_policy_init.  So,
> it will still eliminate small nodes from the boot time interleave
> nodemask, independent of whether or not you specify them in the
> no_interleave_nodes list.
> 
> Or am I missing your point?

That's correct, as long as the size heuristic remains in
numa_policy_init() there's no problem with this. The point was more that
if we were able to use N_INTERLEAVE nodes for the system init policy, it
would be possible to do away with the size heuristic entirely.

Effectively we want the same things, but whereas you want the interleave
nodes to be something applied to all policies, I'm mostly concerned with
keeping the kernel away from the nodes we don't want to interleave.
Userland is basically a free-for-all in terms of the allowable nodemask,
so I don't have a need to restrict MPOL_INTERLEAVE policies once the
system is up.

The size heuristic itself is a bit of a kludge anyhow. I'd like to have a
single point where I can tell the kernel "these nodes are special, don't
use them unless you've been asked". And that's certainly something I
don't have an issue flagging in the pgdat when constructing the nodes in
the first place (at which point we already know which ones are special,
without having to bother with command line options). Whether this is
something that's best as a special node state or not is something that
will need some toying with. On the other hand, simply being able to take the
system init node list and keep that "pinned" is another option, so we
don't end up allocating there even if node 0 is under pressure.
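
A rough sketch of the "special unless asked" idea, purely hypothetical:
neither the special flag nor pick_node() exists in the kernel, and the
real allocator walks zonelists rather than a flat array.  The point is
only that an implicit allocation never lands on a flagged node, even
when the ordinary nodes are exhausted.

```c
#include <stdbool.h>

struct node_desc {
	bool has_memory;
	bool special;		/* hypothetical "don't use unless asked" flag */
	unsigned long free_pages;
};

/*
 * Pick a node for an allocation.  explicit_nid >= 0 means the caller
 * named a node; otherwise walk the nodes in order and skip special
 * ones.  Returns a node id, or -1 if nothing suitable has free pages.
 */
static int pick_node(const struct node_desc *nodes, int nr_nodes,
		     int explicit_nid)
{
	int nid;

	if (explicit_nid >= 0) {
		if (explicit_nid < nr_nodes &&
		    nodes[explicit_nid].has_memory &&
		    nodes[explicit_nid].free_pages > 0)
			return explicit_nid;
		return -1;
	}

	for (nid = 0; nid < nr_nodes; nid++) {
		if (!nodes[nid].has_memory || nodes[nid].special)
			continue;	/* never fall back onto special nodes */
		if (nodes[nid].free_pages > 0)
			return nid;
	}
	return -1;
}
```

In this model the flag would be set once when the node is constructed,
which matches the point above that the platform already knows which
nodes are special without needing a command line option.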

Page migration also poses an interesting problem, in that we don't have a
problem in migrating pages between and off of these nodes, but we do not
want to migrate pages that started out in system memory to them, as the
node will run out of pages too quickly (and also gives those pages up to
whatever is migrated first, rather than something that actually _wants_
those pages out of performance considerations). I don't see an easy way
to do this without having a page flag that indicates whether migration to
special nodes is permitted or not, and setting that when the page is
allocated.
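
The migration rule described here could be expressed as a simple
predicate.  Again a hypothetical sketch: may_use_special stands in for
the proposed page flag, and migration_allowed() is an invented name, not
an existing kernel function.

```c
#include <stdbool.h>

struct page_desc {
	int nid;		/* node currently backing the page */
	bool may_use_special;	/* hypothetical flag, set at allocation time */
};

/*
 * Migration to an ordinary node is always allowed, including off of a
 * special node.  Migration onto a special node is allowed only for
 * pages that were flagged when they were allocated.
 */
static bool migration_allowed(const struct page_desc *page, int dst_nid,
			      const bool *node_is_special)
{
	if (!node_is_special[dst_nid])
		return true;
	return page->may_use_special;
}
```

A page that started out in system memory would have the flag clear and
so could never be moved onto a small node, while pages allocated there
on purpose remain free to move between and off of such nodes.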


* Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
  2007-08-02 18:46                   ` Lee Schermerhorn
@ 2007-08-06 16:42                     ` Mark Gross
  0 siblings, 0 replies; 17+ messages in thread
From: Mark Gross @ 2007-08-06 16:42 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Paul Mundt, Andi Kleen, KAMEZAWA Hiroyuki, linux-mm,
	Christoph Lameter, Nishanth Aravamudan, kxr, akpm, Eric Whitney

On Thu, Aug 02, 2007 at 02:46:03PM -0400, Lee Schermerhorn wrote:
> On Thu, 2007-08-02 at 10:38 -0700, Mark Gross wrote:
> > On Wed, Aug 01, 2007 at 09:54:06AM -0400, Lee Schermerhorn wrote:
> > > On Wed, 2007-08-01 at 20:21 +0900, Paul Mundt wrote:
> > > > On Wed, Aug 01, 2007 at 01:07:43PM +0200, Andi Kleen wrote:
> > > > > 
> > > > > > As long as interleaving is possible after boot, then yes. It's only the
> > > > > > boot-time interleave that we would like to avoid,
> > > > > 
> > > > > But when anybody does interleaving later it could just as easily
> > > > > fill up your small nodes, couldn't it?
> > > > > 
> > > > Yes, but these are in embedded environments where we have control over
> > > > what the applications are doing. Most of these sorts of things are for
> > > > applications where we know what sort of latency requires we have to deal
> > > > with, and so the workload is very much tied to the worst-case range of
> > > > nodes, or just to a particular node. We might only have certain buffers
> > > > that need to be backed by faster memory as well, so while most of the
> > > > application pages will come from node 0 (system memory), certain other
> > > > allocations will come from other nodes. We've been experimenting with
> > > > doing that through tmpfs with mpol tuning.
> > > > 
> > > > In the general case however it's fairly safe to include the tiny nodes as
> > > > part of a larger set with a prefer policy so we don't immediately OOM.
> > > > 
> > > > > Boot time allocations are small compared to what user space
> > > > > later can allocate.
> > > > > 
> > > > Yes, we only want certain applications to explicitly poke at those nodes,
> > > > but they do have a use case for interleave, so it is not functionality I
> > > > would want to lose completely.
> > > 
> > > This is why I wanted to use an "obscure boot option".  I don't see this
> > > as strictly an architectural/platform issue.  Rather, it's a combination
> > > of the arch/platform and how it's being used for specific applications.
> > > So, I don't see how one could accomplish this with a heuristic.
> > > 
> > > As Paul mentioned, in embedded systems, one has a bit more control over
> > > what applications are doing.  In that case, I could envision a config
> > > option to specify the initial/default value for the no_interleave_nodes
> > > at kernel build time and dispense with the boot option.  [Any interest
> > 
> > Having the interleave as a build time option won't work for some power
> > managed memory applications.  I posted an RFC a few months back and will
> > be coming back to it in a few weeks, so take this comment with a grain
> > of salt.  But I want to be able to switch on some ACPI table entries to
> > trigger the non-interleave boot time allocation behavior for some FBDIM
> > based platforms.  My needs are in surprising alignment with Paul's on
> > this stuff.
> > 
> > 
> > --mgross
> <snip>
> 
> Mark:  you mean "boot time option", right?

I meant to express a preference for avoiding a compile-time-only
enablement of the non-interleave nodes.  (I like the boot time
option better.)

> 
> When you get back to it, can you verify that this patch won't affect
> what you want to do in policy init [boot time interleave mask]--?  ...as

I will.

> long as no one specifies any no_interleave_nodes, of course.  And even
> then, all that happens is that maybe more nodes get excluded from the
> boot time policy mask than you would have excluded based on ACPI info.  

Yes.

> 
> Until you have the ACPI table info and parsing in place [or maybe you
> already have this], this patch could allow you to test with the desired
> nodes excluded...

We have a custom BIOS / table for this, but having a boot option would
enable easier testing.

thanks,

--mgross


end of thread, other threads:[~2007-08-06 16:42 UTC | newest]

Thread overview: 17+ messages
2007-07-27 20:07 [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks Lee Schermerhorn
2007-07-28  6:19 ` KAMEZAWA Hiroyuki
2007-07-30 16:13   ` Lee Schermerhorn
2007-07-30 18:29     ` Christoph Lameter
2007-07-30 20:32       ` Lee Schermerhorn
2007-07-30 21:57         ` Christoph Lameter
2007-08-01 10:16     ` Paul Mundt
2007-08-01 10:33       ` Andi Kleen
2007-08-01 11:01         ` Paul Mundt
2007-08-01 11:07           ` Andi Kleen
2007-08-01 11:21             ` Paul Mundt
2007-08-01 13:54               ` Lee Schermerhorn
2007-08-02 17:38                 ` Mark Gross
2007-08-02 18:46                   ` Lee Schermerhorn
2007-08-06 16:42                     ` Mark Gross
2007-08-01 13:39       ` Lee Schermerhorn
2007-08-03  7:53         ` Paul Mundt
