Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: linux-mm <linux-mm@kvack.org>, Paul Mundt <lethal@linux-sh.org>,
	Christoph Lameter <clameter@sgi.com>,
	Nishanth Aravamudan <nacc@us.ibm.com>,
	kxr@sgi.com, ak@suse.de, akpm@linux-foundation.org,
	Eric Whitney <eric.whitney@hp.com>
Subject: Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
Date: Mon, 30 Jul 2007 12:13:48 -0400
Message-ID: <1185812028.5492.79.camel@localhost>
In-Reply-To: <20070728151912.c541aec0.kamezawa.hiroyu@jp.fujitsu.com>

On Sat, 2007-07-28 at 15:19 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jul 2007 16:07:57 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > Questions:
> > 
> > * do we need/want a sysctl for run time modifications?  IMO, no.
> > 
> 
> I can agree that runtime modification is not necessary. But applications or
> libnuma will not use this information ? Doing all in implicit way is enough ?
> (maybe enough)

I think it's enough.  But, maybe we should export this info as a node
attribute in sysfs?  Would be easy enough to do, if demand exists.

> 
> BTW, could you print "nodes of XXXX are ignored in INTERLEAVE mempolicy" to
> /var/log/messages at boot ?

Good idea.  It also prompts me to consider better error handling. 

How about this?

---

Introduce mask of nodes to exclude from MPOL_INTERLEAVE masks - V2

Against:  2.6.23-rc1-mm1 atop Christoph Lameter's memoryless
	  node patch set.

V1 -> V2:
+ issue KERN_NOTICE for successful parse of nodelist.
  Suggestion by Kamezawa Hiroyuki.
+ clear no_interleave_nodes nodemask and issue KERN_ERR for
  invalid nodelist argument.
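
To illustrate the "all or nothing" handling, here is a minimal userspace
sketch.  It is not the kernel code:  sketch_nodelist_parse() is a
hypothetical, simplified stand-in for nodelist_parse(), a uint64_t stands
in for nodemask_t, and the 64 node limit is an assumption made only for
this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_NODES 64	/* assumed limit, for this sketch only */

/*
 * Hypothetical stand-in for nodelist_parse():  accepts "N" and "N-M"
 * items separated by commas.  On error, *mask may hold a partial result.
 */
static int sketch_nodelist_parse(const char *s, uint64_t *mask)
{
	assert(s && mask);
	*mask = 0;
	while (*s) {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -EINVAL;
		lo = hi = strtoul(s, &end, 10);
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -EINVAL;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= SKETCH_MAX_NODES)
			return -EINVAL;
		for (unsigned long n = lo; n <= hi; n++)
			*mask |= (uint64_t)1 << n;
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -EINVAL;
		s = end;
	}
	return 0;
}

static uint64_t sketch_no_interleave_nodes;	/* default:  none */

/* Mirrors the V2 behaviour:  a bad list excludes no nodes at all. */
static int sketch_setup_no_interleave_nodes(const char *nodelist)
{
	int err = sketch_nodelist_parse(nodelist, &sketch_no_interleave_nodes);

	if (err)
		sketch_no_interleave_nodes = 0;	/* all or nothing */
	return err;
}
```

In this sketch, "1,4-5" excludes nodes 1, 4 and 5, while "2,x" is
rejected and leaves the exclusion mask empty, even though node 2 parsed
cleanly before the error was found.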

This patch implements a new node state, N_INTERLEAVE, to specify
the subset of nodes with memory [state N_MEMORY] that are valid
for MPOL_INTERLEAVE node masks.  The new state mask is populated
from the N_MEMORY state mask, less any nodes excluded by a new
command line option, no_interleave_nodes.
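
A rough userspace model of the mask arithmetic, for illustration only.
The sketch_* names and the 64 bit mask type are hypothetical stand-ins
for nodemask_t and the node_states[] array; the real changes are in the
diff below.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 64	/* assumed limit, for this sketch only */

typedef uint64_t sketch_nodemask_t;	/* one bit per node */

static sketch_nodemask_t sketch_node_bit(unsigned int nid)
{
	assert(nid < SKETCH_MAX_NODES);
	return (sketch_nodemask_t)1 << nid;
}

/* N_INTERLEAVE:  nodes with memory, less the excluded nodes. */
static sketch_nodemask_t sketch_interleave_state(sketch_nodemask_t has_memory,
						 sketch_nodemask_t excluded)
{
	return has_memory & ~excluded;
}

/*
 * Models the MPOL_INTERLEAVE case of mpol_new():  intersect the
 * requested mask with the allowed nodes and reject an empty result.
 */
static int sketch_interleave_policy(sketch_nodemask_t requested,
				    sketch_nodemask_t interleave_state,
				    sketch_nodemask_t *effective)
{
	sketch_nodemask_t nodes = requested & interleave_state;

	if (nodes == 0)
		return -EINVAL;
	*effective = nodes;
	return 0;
}
```

With memory on nodes 0-2 and no_interleave_nodes=2, the allowed set in
this model is nodes 0-1.  A request for nodes 0-3 is trimmed to 0-1, and
a request for node 2 alone fails with -EINVAL.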

Rationale:  some architectures and platforms include nodes with
memory that, in some cases, should never appear in MPOL_INTERLEAVE
node masks.  For example, the 'sh' architecture contains a small
amount of SRAM that is local to each cpu.  In some applications,
this memory should be reserved for explicit usage.  Another example
is the pseudo-node on HP ia64 platforms that is already interleaved
on a cache-line granularity by hardware.  Again, in some cases, we
want to reserve this for explicit usage, as it has bandwidth and
[average] latency characteristics quite different from the "real"
nodes.

Note that allocation of fresh hugepages in response to increases
in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
like to propose that allocate_fresh_huge_page() use the
N_INTERLEAVE state, just as MPOL_INTERLEAVE does.  Then, one can
explicitly allocate hugepages on the excluded nodes, when needed,
using Nish Aravamudan's per node huge page sysfs attribute.
NOT in this patch.

Questions:

* do we need/want a sysctl for run time modifications?  IMO, no.
	Kame-san votes "No".

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/kernel-parameters.txt |    9 +++++++++
 include/linux/nodemask.h            |    1 +
 mm/mempolicy.c                      |    9 +++++----
 mm/page_alloc.c                     |   34 +++++++++++++++++++++++++++++++++-
 4 files changed, 48 insertions(+), 5 deletions(-)

Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h	2007-07-27 15:23:53.000000000 -0400
+++ Linux/include/linux/nodemask.h	2007-07-27 15:23:53.000000000 -0400
@@ -345,6 +345,7 @@ enum node_states {
 	N_ONLINE,	/* The node is online */
 	N_MEMORY,	/* The node has memory */
 	N_CPU,		/* The node has cpus */
+	N_INTERLEAVE,	/* The node is valid for MPOL_INTERLEAVE */
 	NR_NODE_STATES
 };
 
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-07-27 15:23:53.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-07-30 10:25:38.000000000 -0400
@@ -2003,6 +2003,31 @@ static char zonelist_order_name[3][8] = 
 
 
 #ifdef CONFIG_NUMA
+/*
+ * Command line:  no_interleave_nodes=<NodeList>
+ * Specify nodes to exclude from MPOL_INTERLEAVE masks.
+ */
+static nodemask_t no_interleave_nodes;	/* default:  none */
+
+static __init int setup_no_interleave_nodes(char *nodelist)
+{
+	if (nodelist) {
+		int err = nodelist_parse(nodelist, no_interleave_nodes);
+		if (err) {
+			printk(KERN_ERR
+				"Ignoring invalid no_interleave_nodes nodelist:"
+				"  %s\n", nodelist);
+			nodes_clear(no_interleave_nodes); /* all or nothing */
+			return err;
+		}
+		printk(KERN_NOTICE
+			"Nodes ignored for INTERLEAVE memory policy: %s\n",
+			nodelist);
+	}
+	return 0;
+}
+early_param("no_interleave_nodes", setup_no_interleave_nodes);
+
 /* The value user specified ....changed by config */
 static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
 /* string for sysctl */
@@ -2410,8 +2435,15 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
 
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages) {
 			node_set_state(nid, N_MEMORY);
+			/*
+			 * Only nodes with memory are valid for MPOL_INTERLEAVE,
+			 * but maybe not all of them?
+			 */
+			if (!node_isset(nid, no_interleave_nodes))
+				node_set_state(nid, N_INTERLEAVE);
+		}
 	}
 	return 0;
 }
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-07-27 15:23:53.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-07-30 11:09:20.000000000 -0400
@@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
 		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_MEMORY]);
+					node_states[N_INTERLEAVE]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -1612,11 +1612,12 @@ void __init numa_policy_init(void)
 
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
-	 * enabled across suitably sized nodes (default is >= 16MB), or
-	 * fall back to the largest node if they're all smaller.
+	 * enabled across suitably sized nodes (hard coded >= 16MB) on which
+	 * interleaving is allowed.  Fall back to the largest node if all
+	 * allowable nodes are smaller than the hard coded limit.
 	 */
 	nodes_clear(interleave_nodes);
-	for_each_node_state(nid, N_MEMORY) {
+	for_each_node_state(nid, N_INTERLEAVE) {
 		unsigned long total_pages = node_present_pages(nid);
 
 		/* Preserve the largest node */
Index: Linux/Documentation/kernel-parameters.txt
===================================================================
--- Linux.orig/Documentation/kernel-parameters.txt	2007-07-27 15:22:41.000000000 -0400
+++ Linux/Documentation/kernel-parameters.txt	2007-07-27 15:23:53.000000000 -0400
@@ -1181,6 +1181,15 @@ and is between 256 and 4096 characters. 
 	noinitrd	[RAM] Tells the kernel not to load any configured
 			initial RAM disk.
 
+	no_interleave_nodes [KNL, BOOT] Specifies a list of nodes to exclude
+			[remove] from any nodemask specified with the
+			MPOL_INTERLEAVE policy.  Some platforms have nodes
+			that are "special" in some way and should not be
+			used for policy based interleaving.
+			Format:  no_interleave_nodes=<NodeList>
+			NodeList format is described in
+				Documentation/filesystems/tmpfs.txt
+
 	nointroute	[IA-64]
 
 	nojitter	[IA64] Disables jitter checking for ITC timers.

