Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Paul Mundt <lethal@linux-sh.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	linux-mm <linux-mm@kvack.org>,
	Christoph Lameter <clameter@sgi.com>,
	Nishanth Aravamudan <nacc@us.ibm.com>,
	kxr@sgi.com, ak@suse.de, akpm@linux-foundation.org,
	Eric Whitney <eric.whitney@hp.com>
Subject: Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
Date: Wed, 01 Aug 2007 09:39:18 -0400
Message-ID: <1185975558.5059.18.camel@localhost>
In-Reply-To: <20070801101651.GA9113@linux-sh.org>

On Wed, 2007-08-01 at 19:16 +0900, Paul Mundt wrote:
> On Mon, Jul 30, 2007 at 12:13:48PM -0400, Lee Schermerhorn wrote:
> > Rationale:  some architectures and platforms include nodes with
> > memory that, in some cases, should never appear in MPOL_INTERLEAVE
> > node masks.  For example, the 'sh' architecture contains a small
> > amount of SRAM that is local to each cpu.  In some applications,
> > this memory should be reserved for explicit usage.  Another example
> > is the pseudo-node on HP ia64 platforms that is already interleaved
> > on a cache-line granularity by hardware.  Again, in some cases, we
> > want to reserve this for explicit usage, as it has bandwidth and
> > [average] latency characteristics quite different from the "real"
> > nodes.
> > 
> Well, the problem is not so much the interleave itself as _when_ we
> interleave. The problem with the interleave node mask at system
> init is that the kernel attempts to spread out data structures across
> these nodes, which results in us being completely out of memory by the
> time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE
> is not so much of a problem; applications just have to be careful with
> their allocations.
> 
> The main thing is keeping the kernel away from these nodes unless it's
> been specifically asked to fetch some memory from there. Every page does
> count.
> 
> The real problem is how we want to deal with the node avoidance mask. In
> SLOB things presently work quite well in this regard, Christoph's
> slub_nodes= patch did a similar thing:
> 
> 	http://marc.info/?l=linux-mm&m=118127465421877&w=2
> 	http://marc.info/?l=linux-mm&m=118127688911359&w=2
> 
> > Note that allocation of fresh hugepages in response to increases
> > in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
> > like to propose that allocate_fresh_huge_page() use the 
> > N_INTERLEAVE state as well as MPOL_INTERLEAVE.  Then, one can
> > explicitly allocate hugepages on the excluded nodes, when needed,
> > using Nish Aravamudan's per node huge page sysfs attribute.
> > NOT in this patch.
> > 
> If we can differentiate between MPOL_INTERLEAVE from the kernel's point
> of view, and explicit MPOL_INTERLEAVE specifiers via mbind() from
> userspace, that works fine for my case. However, the mpol_new() changes
> in this patch deny small nodes the ability to ever be included in an
> MPOL_INTERLEAVE policy, when it's only the kernel policy that I have a
> problem with.

Ah, but it would only "deny small nodes" if you nominate them in the
boot option.  I haven't changed your heuristic in numa_policy_init.  So,
it will still eliminate small nodes from the boot time interleave
nodemask, independent of whether or not you specify them in the
no_interleave_nodes list.

Or am I missing your point?
> 
> Having said that, I do like the node states and using that to exclude a
> node from the system init interleave nodelist, but this still won't
> completely solve the tiny node problems.

Right, so we should keep your boot time heuristic.
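
To make the two mechanisms concrete, here is a minimal user-space sketch of how a boot-time interleave mask could be built when a size heuristic and a no_interleave_nodes list both apply. It is only a model, not the numa_policy_init code: the names boot_interleave_mask, MODEL_MAX_NODES and min_pages are hypothetical, the threshold is whatever the caller passes in, and applying the boot option to the boot-time mask is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model only: one bit per node, node i is bit i. */
typedef uint64_t nodemask_t;

#define MODEL_MAX_NODES 64	/* hypothetical limit for this model */

/*
 * Build a boot-time interleave mask from per-node page counts.
 * A node is left out if it has no memory, if it is smaller than
 * min_pages (the size heuristic), or if it is named in no_interleave
 * (assumed here to model the no_interleave_nodes boot option).
 */
static nodemask_t boot_interleave_mask(const unsigned long *node_pages,
				       size_t nr_nodes,
				       unsigned long min_pages,
				       nodemask_t no_interleave)
{
	nodemask_t mask = 0;
	size_t nid;

	assert(nr_nodes <= MODEL_MAX_NODES);
	for (nid = 0; nid < nr_nodes; nid++) {
		nodemask_t bit = (nodemask_t)1 << nid;

		if (node_pages[nid] == 0)
			continue;	/* memoryless node */
		if (node_pages[nid] < min_pages)
			continue;	/* dropped by the size heuristic */
		if (no_interleave & bit)
			continue;	/* dropped by the exclusion list */
		mask |= bit;
	}
	return mask;
}
```

In this model a tiny node is dropped by the size check whether or not it also appears in the exclusion list, which is the independence described above.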

> 
> > @@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
> >  	case MPOL_INTERLEAVE:
> >  		policy->v.nodes = *nodes;
> >  		nodes_and(policy->v.nodes, policy->v.nodes,
> > -					node_states[N_MEMORY]);
> > +					node_states[N_INTERLEAVE]);
> >  		if (nodes_weight(policy->v.nodes) == 0) {
> >  			kmem_cache_free(policy_cache, policy);
> >  			return ERR_PTR(-EINVAL);
> 
> Leaving this as node_states[N_MEMORY] combined with the rest of the patch
> would work for me, but that sort of changes the scope of the entire patch
> ;-)

Yeah, it breaks one of my main reasons for proposing this.  I still have
no way to keep user requested interleaving off my "special" hardware
interleaved nodes in the case where we don't want this.  I should
mention that I'm assuming that the current "best practice" is to
interleave across "all available nodes" in the application's current
context.
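
The difference between masking with N_MEMORY and masking with N_INTERLEAVE in the quoted hunk can be sketched in user space as follows. This is an illustrative model under stated assumptions, not the kernel implementation: nodemasks are reduced to a 64-bit word, and node_states_model, model_node_states and interleave_mask are hypothetical names. Only the intersect-then-fail-with-EINVAL shape is taken from the hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model only: one bit per node, node i is bit i. */
typedef uint64_t nodemask_t;

/* Hypothetical stand-ins for two node_states[] entries. */
struct node_states_model {
	nodemask_t n_memory;		/* nodes that have memory */
	nodemask_t n_interleave;	/* n_memory minus excluded nodes */
};

/* Excluded nodes are removed from the interleave state only. */
static struct node_states_model
model_node_states(nodemask_t have_memory, nodemask_t no_interleave)
{
	struct node_states_model s;

	s.n_memory = have_memory;
	s.n_interleave = have_memory & ~no_interleave;
	return s;
}

/*
 * Model of the MPOL_INTERLEAVE case: intersect the requested mask with
 * the allowed state and fail with -EINVAL if no node is left.
 * On success, returns 0 and stores the effective mask in *out.
 */
static int interleave_mask(nodemask_t requested, nodemask_t allowed,
			   nodemask_t *out)
{
	nodemask_t effective = requested & allowed;

	assert(out != NULL);
	if (effective == 0)
		return -EINVAL;
	*out = effective;
	return 0;
}
```

With four memory nodes and node 3 excluded, a request for all nodes yields nodes 0-2 when masked with n_interleave and nodes 0-3 when masked with n_memory. A request for node 3 alone fails with -EINVAL under n_interleave, so in this model user-requested interleaving stays off the excluded node only when the N_INTERLEAVE mask is used.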

[more follow up to later messages]

Lee

