From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
From: Lee Schermerhorn
In-Reply-To: <20070728151912.c541aec0.kamezawa.hiroyu@jp.fujitsu.com>
References: <1185566878.5069.123.camel@localhost> <20070728151912.c541aec0.kamezawa.hiroyu@jp.fujitsu.com>
Content-Type: text/plain
Date: Mon, 30 Jul 2007 12:13:48 -0400
Message-Id: <1185812028.5492.79.camel@localhost>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: KAMEZAWA Hiroyuki
Cc: linux-mm, Paul Mundt, Christoph Lameter, Nishanth Aravamudan, kxr@sgi.com, ak@suse.de, akpm@linux-foundation.org, Eric Whitney
List-ID:

On Sat, 2007-07-28 at 15:19 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jul 2007 16:07:57 -0400
> Lee Schermerhorn wrote:
>
> > Questions:
> >
> > * do we need/want a sysctl for run time modifications? IMO, no.
> >
>
> I can agree that runtime modification is not necessary. But applications or
> libnuma will not use this information ? Doing all in implicit way is enough ?
> (maybe enough)

I think it's enough. But, maybe we should export this info as a node
attribute in sysfs? Would be easy enough to do, if demand exists.

>
> BTW, could you print "nodes of XXXX are ignored in INTERLEAVE mempolicy" to
> /var/log/messages at boot ?

Good idea. It also prompts me to consider better error handling.

How about this?

---
Introduce mask of nodes to exclude from MPOL_INTERLEAVE masks - V2

Against: 2.6.23-rc1-mm1 atop Christoph Lameter's memoryless node patch set.

V1 -> V2:
+ issue KERN_NOTICE for successful parse of nodelist.
  Suggestion by Kamezawa Hiroyuki.
+ clear no_interleave_nodes nodemask and issue KERN_ERR for invalid
  nodelist argument.

This patch implements a new node state, N_INTERLEAVE to specify the subset
of nodes with memory [state N_MEMORY] that are valid for MPOL_INTERLEAVE
node masks.
The new state mask is populated from the N_MEMORY state mask,
less any nodes excluded by a new command line option, no_interleave_nodes.

Rationale: some architectures and platforms include nodes with memory
that, in some cases, should never appear in MPOL_INTERLEAVE node masks.
For example, the 'sh' architecture contains a small amount of SRAM that
is local to each cpu. In some applications, this memory should be
reserved for explicit usage. Another example is the pseudo-node on HP
ia64 platforms that is already interleaved on a cache-line granularity
by hardware. Again, in some cases, we want to reserve this for explicit
usage, as it has bandwidth and [average] latency characteristics quite
different from the "real" nodes.

Note that allocation of fresh hugepages in response to increases in
/proc/sys/vm/nr_hugepages is a form of interleaving. I would like to
propose that allocate_fresh_huge_page() use the N_INTERLEAVE state as
well as MPOL_INTERLEAVE. Then, one can explicitly allocate hugepages on
the excluded nodes, when needed, using Nish Aravamudan's per node huge
page sysfs attribute. NOT in this patch.

Questions:

* do we need/want a sysctl for run time modifications? IMO, no.
  Kame-san votes "No".

Signed-off-by: Lee Schermerhorn

 Documentation/kernel-parameters.txt |    9 +++++++++
 include/linux/nodemask.h            |    1 +
 mm/mempolicy.c                      |    9 +++++----
 mm/page_alloc.c                     |   34 +++++++++++++++++++++++++++++++++-
 4 files changed, 48 insertions(+), 5 deletions(-)

Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h	2007-07-27 15:23:53.000000000 -0400
+++ Linux/include/linux/nodemask.h	2007-07-27 15:23:53.000000000 -0400
@@ -345,6 +345,7 @@ enum node_states {
 	N_ONLINE,	/* The node is online */
 	N_MEMORY,	/* The node has memory */
 	N_CPU,		/* The node has cpus */
+	N_INTERLEAVE,	/* The node is valid for MPOL_INTERLEAVE */
 	NR_NODE_STATES
 };
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-07-27 15:23:53.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-07-30 10:25:38.000000000 -0400
@@ -2003,6 +2003,31 @@ static char zonelist_order_name[3][8] =
 #ifdef CONFIG_NUMA
+/*
+ * Command line: no_interleave_nodes=
+ * Specify nodes to exclude from MPOL_INTERLEAVE masks.
+ */
+static nodemask_t no_interleave_nodes;	/* default: none */
+
+static __init int setup_no_interleave_nodes(char *nodelist)
+{
+	if (nodelist) {
+		int err = nodelist_parse(nodelist, no_interleave_nodes);
+		if (err) {
+			printk(KERN_ERR
+				"Ignoring invalid no_interleave_nodes nodelist:"
+				" %s\n", nodelist);
+			nodes_clear(no_interleave_nodes); /* all or nothing */
+			return err;
+		}
+		printk(KERN_NOTICE
+			"Nodes ignored for INTERLEAVE memory policy: %s\n",
+			nodelist);
+	}
+	return 0;
+}
+early_param("no_interleave_nodes", setup_no_interleave_nodes);
+
 /* The value user specified ....changed by config */
 static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
 /* string for sysctl */
@@ -2410,8 +2435,15 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages) {
 			node_set_state(nid, N_MEMORY);
+			/*
+			 * Only nodes with memory are valid for MPOL_INTERLEAVE,
+			 * but maybe not all of them?
+			 */
+			if (!node_isset(nid, no_interleave_nodes))
+				node_set_state(nid, N_INTERLEAVE);
+		}
 	}
 	return 0;
 }
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-07-27 15:23:53.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-07-30 11:09:20.000000000 -0400
@@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
 		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_MEMORY]);
+					node_states[N_INTERLEAVE]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -1612,11 +1612,12 @@ void __init numa_policy_init(void)
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
-	 * enabled across suitably sized nodes (default is >= 16MB), or
-	 * fall back to the largest node if they're all smaller.
+	 * enabled across suitably sized nodes (hard coded >= 16MB) on which
+	 * interleaving is allowed.  Fall back to the largest node if all
+	 * allowable nodes are smaller than the hard coded limit.
 	 */
 	nodes_clear(interleave_nodes);
-	for_each_node_state(nid, N_MEMORY) {
+	for_each_node_state(nid, N_INTERLEAVE) {
 		unsigned long total_pages = node_present_pages(nid);
 		/* Preserve the largest node */
Index: Linux/Documentation/kernel-parameters.txt
===================================================================
--- Linux.orig/Documentation/kernel-parameters.txt	2007-07-27 15:22:41.000000000 -0400
+++ Linux/Documentation/kernel-parameters.txt	2007-07-27 15:23:53.000000000 -0400
@@ -1181,6 +1181,15 @@ and is between 256 and 4096 characters.
 	noinitrd	[RAM] Tells the kernel not to load any configured
 			initial RAM disk.
+	no_interleave_nodes [KNL, BOOT] Specifies a list of nodes to exclude
+			[remove] from any nodemask specified with the
+			MPOL_INTERLEAVE policy.  Some platforms have nodes
+			that are "special" in some way and should not be
+			used for policy based interleaving.
+			Format: no_interleave_nodes=
+			NodeList format is described in
+			Documentation/filesystems/tmpfs.txt
+
 	nointroute	[IA-64]
 	nojitter	[IA64] Disables jitter checking for ITC timers.
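
The mask arithmetic this patch relies on can be tried out in user space.
What follows is only a rough sketch, not kernel code: the nodemask typedef
and the interleave_state() and interleave_mask() helpers are hypothetical
names standing in for nodemask_t, the N_INTERLEAVE setup in
__build_all_zonelists(), and the nodes_and() check in mpol_new(). It also
assumes no more than 64 nodes, so that a mask fits in a single integer.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is present. */
typedef uint64_t nodemask;

/*
 * Build the N_INTERLEAVE state: every node with memory, less the nodes
 * named on the no_interleave_nodes= command line.
 */
static nodemask interleave_state(nodemask n_memory, nodemask no_interleave)
{
	return n_memory & ~no_interleave;
}

/*
 * Mirror the MPOL_INTERLEAVE case of mpol_new(): intersect the requested
 * nodes with N_INTERLEAVE and reject the policy if nothing is left.
 * Returns 0 and stores the usable mask in *out, or -EINVAL.
 */
static int interleave_mask(nodemask requested, nodemask n_interleave,
			   nodemask *out)
{
	nodemask usable = requested & n_interleave;

	if (usable == 0)
		return -EINVAL;
	*out = usable;
	return 0;
}
```

For example, with memory on nodes 0-3 and node 3 excluded, a request for
nodes 2-3 is trimmed to node 2, while a request for node 3 alone fails
with -EINVAL.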