From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Wed, 1 Aug 2007 19:16:51 +0900 From: Paul Mundt Subject: Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks Message-ID: <20070801101651.GA9113@linux-sh.org> References: <1185566878.5069.123.camel@localhost> <20070728151912.c541aec0.kamezawa.hiroyu@jp.fujitsu.com> <1185812028.5492.79.camel@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1185812028.5492.79.camel@localhost> Sender: owner-linux-mm@kvack.org Return-Path: To: Lee Schermerhorn Cc: KAMEZAWA Hiroyuki , linux-mm , Christoph Lameter , Nishanth Aravamudan , kxr@sgi.com, ak@suse.de, akpm@linux-foundation.org, Eric Whitney List-ID: On Mon, Jul 30, 2007 at 12:13:48PM -0400, Lee Schermerhorn wrote: > Rationale: some architectures and platforms include nodes with > memory that, in some cases, should never appear in MPOL_INTERLEAVE > node masks. For example, the 'sh' architecture contains a small > amount of SRAM that is local to each cpu. In some applications, > this memory should be reserved for explicit usage. Another example > is the pseudo-node on HP ia64 platforms that is already interleaved > on a cache-line granularity by hardware. Again, in some cases, we > want to reserve this for explicit usage, as it has bandwidth and > [average] latency characteristics quite different from the "real" > nodes. > Well, it's not so much the interleave that's the problem so much as _when_ we interleave. The problem with the interleave node mask at system init is that the kernel attempts to spread out data structures across these nodes, which results in us being completely out of memory by the time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE is not so much of a problem, applications just have to be careful with their allocations. The main thing is keeping the kernel away from these nodes unless it's been specifically asked to fetch some memory from there. Every page does count. The real problem is how we want to deal with the node avoidance mask. In SLOB things presently work quite well in this regard, Christoph's slub_nodes= patch did a similar thing: http://marc.info/?l=linux-mm&m=118127465421877&w=2 http://marc.info/?l=linux-mm&m=118127688911359&w=2 > Note that allocation of fresh hugepages in response to increases > in /proc/sys/vm/nr_hugepages is a form of interleaving. I would > like to propose that allocate_fresh_huge_page() use the > N_INTERLEAVE state as well as MPOL_INTERLEAVE. Then, one can > explicity allocate hugepages on the excluded nodes, when needed, > using Nish Aravamundan's per node huge page sysfs attribute. > NOT in this patch. > If we can differentiate between MPOL_INTERLEAVE from the kernel's point of view, and explicit MPOL_INTERLEAVE specifiers via mbind() from userspace, that works fine for my case. However, the mpol_new() changes in this patch deny small nodes the ability to ever be included in an MPOL_INTERLEAVE policy, when it's only the kernel policy that I have a problem with. Having said that, I do like the node states and using that to exclude a node from the system init interleave nodelist, but this still won't completely solve the tiny node problems. > @@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo > case MPOL_INTERLEAVE: > policy->v.nodes = *nodes; > nodes_and(policy->v.nodes, policy->v.nodes, > - node_states[N_MEMORY]); > + node_states[N_INTERLEAVE]); > if (nodes_weight(policy->v.nodes) == 0) { > kmem_cache_free(policy_cache, policy); > return ERR_PTR(-EINVAL); Leaving this as node_states[N_MEMORY] combined with the rest of the patch would work for me, but that sort of changes the scope of the entire patch ;-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org