From mboxrd@z Thu Jan 1 00:00:00 1970 Subject: Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks From: Lee Schermerhorn In-Reply-To: <20070801101651.GA9113@linux-sh.org> References: <1185566878.5069.123.camel@localhost> <20070728151912.c541aec0.kamezawa.hiroyu@jp.fujitsu.com> <1185812028.5492.79.camel@localhost> <20070801101651.GA9113@linux-sh.org> Content-Type: text/plain Date: Wed, 01 Aug 2007 09:39:18 -0400 Message-Id: <1185975558.5059.18.camel@localhost> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Paul Mundt Cc: KAMEZAWA Hiroyuki , linux-mm , Christoph Lameter , Nishanth Aravamudan , kxr@sgi.com, ak@suse.de, akpm@linux-foundation.org, Eric Whitney List-ID: On Wed, 2007-08-01 at 19:16 +0900, Paul Mundt wrote: > On Mon, Jul 30, 2007 at 12:13:48PM -0400, Lee Schermerhorn wrote: > > Rationale: some architectures and platforms include nodes with > > memory that, in some cases, should never appear in MPOL_INTERLEAVE > > node masks. For example, the 'sh' architecture contains a small > > amount of SRAM that is local to each cpu. In some applications, > > this memory should be reserved for explicit usage. Another example > > is the pseudo-node on HP ia64 platforms that is already interleaved > > on a cache-line granularity by hardware. Again, in some cases, we > > want to reserve this for explicit usage, as it has bandwidth and > > [average] latency characteristics quite different from the "real" > > nodes. > > > Well, it's not so much the interleave that's the problem so much as > _when_ we interleave. The problem with the interleave node mask at system > init is that the kernel attempts to spread out data structures across > these nodes, which results in us being completely out of memory by the > time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE > is not so much of a problem, applications just have to be careful with > their allocations. > > The main thing is keeping the kernel away from these nodes unless it's > been specifically asked to fetch some memory from there. Every page does > count. > > The real problem is how we want to deal with the node avoidance mask. In > SLOB things presently work quite well in this regard, Christoph's > slub_nodes= patch did a similar thing: > > http://marc.info/?l=linux-mm&m=118127465421877&w=2 > http://marc.info/?l=linux-mm&m=118127688911359&w=2 > > > Note that allocation of fresh hugepages in response to increases > > in /proc/sys/vm/nr_hugepages is a form of interleaving. I would > > like to propose that allocate_fresh_huge_page() use the > > N_INTERLEAVE state as well as MPOL_INTERLEAVE. Then, one can > > explicity allocate hugepages on the excluded nodes, when needed, > > using Nish Aravamundan's per node huge page sysfs attribute. > > NOT in this patch. > > > If we can differentiate between MPOL_INTERLEAVE from the kernel's point > of view, and explicit MPOL_INTERLEAVE specifiers via mbind() from > userspace, that works fine for my case. However, the mpol_new() changes > in this patch deny small nodes the ability to ever be included in an > MPOL_INTERLEAVE policy, when it's only the kernel policy that I have a > problem with. Ah, but it would only "deny small nodes" if you nominate them in the boot option. I haven't changed your heuristic in numa_policy_init. So, it will still eliminate small nodes from the boot time interleave nodemask, independent of whether or not you specify them in the no_interleave_nodes list. Or am I missing your point? > > Having said that, I do like the node states and using that to exclude a > node from the system init interleave nodelist, but this still won't > completely solve the tiny node problems. Right, so we should keep your boot time heuristic. > > > @@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo > > case MPOL_INTERLEAVE: > > policy->v.nodes = *nodes; > > nodes_and(policy->v.nodes, policy->v.nodes, > > - node_states[N_MEMORY]); > > + node_states[N_INTERLEAVE]); > > if (nodes_weight(policy->v.nodes) == 0) { > > kmem_cache_free(policy_cache, policy); > > return ERR_PTR(-EINVAL); > > Leaving this as node_states[N_MEMORY] combined with the rest of the patch > would work for me, but that sort of changes the scope of the entire patch > ;-) Yeah, it breaks one of my main reasons for proposing this. I still have no way to keep user requested interleaving off my "special" hardware interleaved nodes in the case where we don't want this. I should mention that I'm assuming that the current "best practice" is to interleave across "all available nodes" in the applications current context. [more follow up to later messages] Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org