Date: Sun, 17 Sep 2006 15:27:23 -0700
From: Paul Jackson <pj@sgi.com>
To: David Rientjes
Cc: clameter@sgi.com, akpm@osdl.org, linux-mm@kvack.org
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator

David,

Could you run the following on your fake numa booted box, and report
the results:

	find /sys/devices -name distance | xargs head

Following Andrew's suggestion, I'm toying with the idea that, since
one fake numa node is as good as another, there is no reason to worry
about retrying skipped-over nodes or re-validating the cached zones
on such systems.

Roughly, my plan is:

    If the node on which we most recently found memory is 'just as
    good as' the first node in the zonelist, then go ahead and cache
    that node and continue to use it as long as we can.  We're in the
    fake NUMA case, and one node is as good as another.

    If that node is 'further away' than the first node in the
    zonelist, don't cache it.  We're in the real NUMA case, and we're
    happy to carry on just as we have in the past, always scanning
    from the beginning of the zonelist.

However, this requires some way to determine whether two fake nodes
are really on the same hardware node.

Hmmm ... there's a good chance that the kernel 'node_distance()'
routine, as shown in the above /sys/devices distance table, is not
the way to determine this.  Perhaps that table must reflect the fake
reality, not the underlying hardware reality.

Though, if node_distance() doesn't tell us this, there's a chance
this will cause problems elsewhere, and we will end up wanting to fix
node_distance() in the fake NUMA case to note that all nodes are
actually local, which is value 10, I believe.  The code in
arch/x86_64/mm/srat.c:slit_valid() may conflict with this, and the
concerns its comment raises about a SLIT table with all 10's may also
point to conflicts with this.

You've been looking at this fake NUMA code recently, David.  Perhaps
you can recommend some other way from within the mm/page_alloc.c code
to efficiently (just a couple cache lines) answer the question:

    Given two node numbers, are they really just two fake nodes on
    the same hardware node, or are they really on two distinct
    hardware nodes?

(I've appended a rough, untested sketch of what I have in mind, below
my signature.)

Granted, I'm not -entirely- following Andrew's lead here.  He's been
hoping that this most-recently-used-node cache would benefit both
fake and real NUMA systems, while I've been thinking we don't really
have a problem on the real NUMA systems, and it is better not to mess
with the memory allocation pattern there (if it ain't broke, don't
fix ...)

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
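
P.S. -- The sketch.  This is untested, and the names are made up for
illustration; it assumes node_distance() still reflects the hardware
reality under fake numa, which is exactly the open question above
(10 being the ACPI SLIT value for 'local'):

	/*
	 * Are nodes 'a' and 'b' really the same hardware node, just
	 * carved into separate fake numa nodes?  Untested sketch --
	 * only right if the distance table reflects the hardware,
	 * not the fake, reality.  10 == ACPI SLIT 'local'.
	 */
	static inline int same_hardware_node(int a, int b)
	{
		return node_distance(a, b) == 10;
	}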
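
And roughly how the caching rule in my plan would use it, somewhere
in the mm/page_alloc.c zonelist scan ('cached_node' and the helper
below are made-up names, not existing code):

	static int cached_node = -1;

	/*
	 * Called after we find memory on node 'n'; 'first' is the
	 * first node in the zonelist we scanned.
	 */
	static void note_allocation_node(int n, int first)
	{
		if (same_hardware_node(n, first))
			cached_node = n;	/* fake numa: reuse it next time */
		else
			cached_node = -1;	/* real NUMA: rescan from the top */
	}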