From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Wed, 4 Oct 2006 19:27:14 -0700 From: Paul Jackson Subject: Re: [RFC] another way to speed up fake numa node page_alloc Message-Id: <20061004192714.20412e08.pj@sgi.com> In-Reply-To: References: <20060925091452.14277.9236.sendpatchset@v0> <20061001231811.26f91c47.pj@sgi.com> <20061001234858.fe91109e.pj@sgi.com> <20061002014121.28b759da.pj@sgi.com> <20061003111517.a5cc30ea.pj@sgi.com> <20061004084552.a07025d7.pj@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: David Rientjes Cc: linux-mm@kvack.org, akpm@osdl.org, nickpiggin@yahoo.com.au, ak@suse.de, mbligh@google.com, rohitseth@google.com, menage@google.com, clameter@sgi.com List-ID: > Isn't this the exact behavior that ordered zonelists are supposed to solve > for real NUMA systems? Has there been an _observed_ case where the cost > to scan the zonelists was considered excessive on real NUMA systems? Well ... the good news is I understood your comments this time. I guess I should be happy it only took about 3 iterations. Historically the ordered zonelists addressed the situation where one almost always found free memory near the front of the ordered zonelist. Yes, you are correct that I originally didn't think we had a problem with real numa zonelist scans. Three days ago, when I introduced this alternative patch that started this current thread, I changed my position, stating at that time: > > There are two reasons I persued this alternative: > > 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems) > have seen real customer loads where the cost to scan the zonelist > was a problem, due to many nodes being full of memory before > we got to a node we could use. Or at least, I think we have. > This was related to me by another engineer, based on experiences > from some time past. So this is not guaranteed. Most likely, though. > > The following approach should help such real numa systems just as > much as it helps fake numa systems, or any combination thereof. > > 2) The effort to distinguish fake from real numa, using node_distance, > so that we could cache a fake numa node and optimize choosing > it over equivalent distance fake nodes, while continuing to > properly scan all real nodes in distance order, was going to > require a nasty blob of zonelist and node distance munging. > > The following approach has no new dependency on node distances or > zone sorting. David wrote: > I was under the impression that there was nothing wrong with the way > current real NUMA systems allocate pages. If not, please point me to the > thread that _specifically_ discusses this with _data_ that shows it's > inefficient. See above. I don't have data, so cannot justify going far out of our way. If someone has a better way to skin this fake numa cat, that does not benefit (or harm) real numa, that would still be worth careful consideration. > In fact, when this thread started you recommended as little > changes as possible to the code to not interfere with what already works. Yes, I did start with that recommendation. See above. And see above for my current reasons for persuing this patch. Some more things I like about this patch: * Conceptually, it is very localized, making no changes to the larger code or data structure, just adding a cache of some hot data. * Further, it makes few assumptions about the larger scheme of things. * It has no dependencies on zonelist sorting, node distances, fake vs real numa nodes or any of that. * It makes no discernable difference in the memory placement behaviour of a system. Downside - it's still a linear zonelist scan, and it's a cache bolted on the side of things, rather than an inherently fast algorithm and data structure. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson 1.925.600.0401 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org