linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Christoph Lameter <clameter@sgi.com>
To: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>,
	Lee.Schermerhorn@hp.com, ak@suse.de, anton@samba.org,
	mel@csn.ul.ie, akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [PATCH v2] gfp.h: GFP_THISNODE can go to other nodes if some are unpopulated
Date: Mon, 11 Jun 2007 11:29:14 -0700 (PDT)	[thread overview]
Message-ID: <Pine.LNX.4.64.0706111122010.18327@schroedinger.engr.sgi.com> (raw)
In-Reply-To: <20070611171201.GB3798@us.ibm.com>

On Mon, 11 Jun 2007, Nishanth Aravamudan wrote:

> These are the exact semantics, I expected. so I'll be happy to test/work
> on these fixes.
> 
> This would also make it unnecessary to add the populated checks in
> various places, I think, as THISNODE will mean ONLYTHISNODE (and perhaps
> should be renamed in the series).

Here is a draft on how this could work:

It seems that MPOL_BIND already has the fixes needed. Adding empty zones 
does not look that easy. 
So I added a check interleave_nodes to verify that the selected node has
memory. If not skip it.

This does not work for the address based interleaving for anonymous vmas.
I am not sure what to do there. We could change the calculation of the
node to be based only on nodes with memory and then skip the memoryless 
ones. I have only added a comment to describe its brokennes for now.

Then there is the original issue of GFP_THISNODE. I added a check in 
alloc_pages_node to verify that the node has memory. If not then fail.
This will fix the alloc_pages_node case but not the alloc_pages() case.

In the alloc_pages() case we do not specify a node. Implicitly it is 
understood that we (in the case of no memory policy / cpuset options) 
allocate from the nearest node. So it may be argued there that the 
GFP_THISNODE behavior of taking the first node from the zonelist is okay.

There is some reason to worry about the performance impact from adding the 
check in alloc_pages_node and interleave. If the code is left in 
alloc_pages_node then alloc_pages_node should be uninlined and moved to 
mempolicy.c

Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c	2007-06-11 11:13:09.000000000 -0700
+++ linux-2.6/mm/mempolicy.c	2007-06-11 11:19:03.000000000 -0700
@@ -1125,9 +1125,11 @@ static unsigned interleave_nodes(struct 
 	struct task_struct *me = current;
 
 	nid = me->il_next;
-	next = next_node(nid, policy->v.nodes);
-	if (next >= MAX_NUMNODES)
-		next = first_node(policy->v.nodes);
+	do {
+		next = next_node(nid, policy->v.nodes);
+		if (next >= MAX_NUMNODES)
+			next = first_node(policy->v.nodes);
+	} while (!NODE_DATA(node)->present_pages);
 	me->il_next = next;
 	return nid;
 }
@@ -1191,6 +1193,11 @@ static inline unsigned interleave_nid(st
 		 * for huge pages, since vm_pgoff is in units of small
 		 * pages, we need to shift off the always 0 bits to get
 		 * a useful offset.
+		 *
+		 * For configurations with memoryless nodes this is broken
+		 * since the allocation attempts on that node will fall
+		 * back to other nodes and thus one neighboring node
+		 * will be overallocated from.
 		 */
 		BUG_ON(shift < PAGE_SHIFT);
 		off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h	2007-06-11 11:19:32.000000000 -0700
+++ linux-2.6/include/linux/gfp.h	2007-06-11 11:21:33.000000000 -0700
@@ -134,6 +134,9 @@ static inline struct page *alloc_pages_n
 	if (nid < 0)
 		nid = numa_node_id();
 
+	if ((gfp_mask & __GFP_THISNODE) && !NODE_DATA(nid)->present_pages)
+		return NULL;
+
 	return __alloc_pages(gfp_mask, order,
 		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2007-06-11 18:29 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-06-07 15:04 [PATCH] " Nishanth Aravamudan
2007-06-07 18:11 ` Christoph Lameter
2007-06-07 22:01   ` [PATCH v2] " Nishanth Aravamudan
2007-06-07 22:05     ` Christoph Lameter
2007-06-07 22:16       ` Nishanth Aravamudan
2007-06-11 12:49     ` Andy Whitcroft
2007-06-11 16:12       ` Christoph Lameter
2007-06-11 16:42         ` Christoph Lameter
2007-06-11 17:12           ` Nishanth Aravamudan
2007-06-11 18:29             ` Christoph Lameter [this message]
2007-06-11 18:46               ` Nishanth Aravamudan
2007-06-11 18:54                 ` Christoph Lameter
2007-06-11 19:36               ` Nishanth Aravamudan
2007-06-11 19:43                 ` Christoph Lameter
2007-06-11 20:18                   ` Nishanth Aravamudan
2007-06-11 18:23           ` Lee Schermerhorn
2007-06-11 18:40             ` Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0706111122010.18327@schroedinger.engr.sgi.com \
    --to=clameter@sgi.com \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=ak@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=anton@samba.org \
    --cc=apw@shadowen.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=nacc@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox