From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 1 Aug 2007 19:16:51 +0900
From: Paul Mundt <lethal@linux-sh.org>
Subject: Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
Message-ID: <20070801101651.GA9113@linux-sh.org>
References: <1185566878.5069.123.camel@localhost> <20070728151912.c541aec0.kamezawa.hiroyu@jp.fujitsu.com> <1185812028.5492.79.camel@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1185812028.5492.79.camel@localhost>
Sender: owner-linux-mm@kvack.org
Return-Path: <owner-linux-mm@kvack.org>
To: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>, linux-mm <linux-mm@kvack.org>, Christoph Lameter <clameter@sgi.com>, Nishanth Aravamudan <nacc@us.ibm.com>, kxr@sgi.com, ak@suse.de, akpm@linux-foundation.org, Eric Whitney <eric.whitney@hp.com>
List-ID: <linux-mm.kvack.org>

On Mon, Jul 30, 2007 at 12:13:48PM -0400, Lee Schermerhorn wrote:
> Rationale:  some architectures and platforms include nodes with
> memory that, in some cases, should never appear in MPOL_INTERLEAVE
> node masks.  For example, the 'sh' architecture contains a small
> amount of SRAM that is local to each cpu.  In some applications,
> this memory should be reserved for explicit usage.  Another example
> is the pseudo-node on HP ia64 platforms that is already interleaved
> on a cache-line granularity by hardware.  Again, in some cases, we
> want to reserve this for explicit usage, as it has bandwidth and
> [average] latency characteristics quite different from the "real"
> nodes.
> 
Well, it's not so much the interleave that's the problem so much as
_when_ we interleave. The problem with the interleave node mask at system
init is that the kernel attempts to spread out data structures across
these nodes, which results in us being completely out of memory by the
time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE
is not so much of a problem, applications just have to be careful with
their allocations.

The main thing is keeping the kernel away from these nodes unless it's
been specifically asked to fetch some memory from there. Every page does
count.

The real problem is how we want to deal with the node avoidance mask. In
SLOB things presently work quite well in this regard, Christoph's
slub_nodes= patch did a similar thing:

	http://marc.info/?l=linux-mm&m=118127465421877&w=2
	http://marc.info/?l=linux-mm&m=118127688911359&w=2

> Note that allocation of fresh hugepages in response to increases
> in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
> like to propose that allocate_fresh_huge_page() use the 
> N_INTERLEAVE state as well as MPOL_INTERLEAVE.  Then, one can
> explicity allocate hugepages on the excluded nodes, when needed,
> using Nish Aravamundan's per node huge page sysfs attribute.
> NOT in this patch.
> 
If we can differentiate between MPOL_INTERLEAVE from the kernel's point
of view, and explicit MPOL_INTERLEAVE specifiers via mbind() from
userspace, that works fine for my case. However, the mpol_new() changes
in this patch deny small nodes the ability to ever be included in an
MPOL_INTERLEAVE policy, when it's only the kernel policy that I have a
problem with.

Having said that, I do like the node states and using that to exclude a
node from the system init interleave nodelist, but this still won't
completely solve the tiny node problems.

> @@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
>  	case MPOL_INTERLEAVE:
>  		policy->v.nodes = *nodes;
>  		nodes_and(policy->v.nodes, policy->v.nodes,
> -					node_states[N_MEMORY]);
> +					node_states[N_INTERLEAVE]);
>  		if (nodes_weight(policy->v.nodes) == 0) {
>  			kmem_cache_free(policy_cache, policy);
>  			return ERR_PTR(-EINVAL);

Leaving this as node_states[N_MEMORY] combined with the rest of the patch
would work for me, but that sort of changes the scope of the entire patch
;-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>