From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 00/14] NUMA: Memoryless node support V4
From: Lee Schermerhorn
In-Reply-To: <20070729053516.5d85738a.pj@sgi.com>
References: <20070727194316.18614.36380.sendpatchset@localhost>
	 <20070729053516.5d85738a.pj@sgi.com>
Content-Type: text/plain
Date: Mon, 30 Jul 2007 12:07:08 -0400
Message-Id: <1185811629.5492.73.camel@localhost>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Paul Jackson
Cc: linux-mm@kvack.org, ak@suse.de, nacc@us.ibm.com, kxr@sgi.com,
	clameter@sgi.com, mel@skynet.ie, akpm@linux-foundation.org,
	kamezawa.hiroyu@jp.fujitsu.com

On Sun, 2007-07-29 at 05:35 -0700, Paul Jackson wrote:
> Lee,
>
> What is the motivation for memoryless nodes?  I'm not sure what I
> mean by that question -- perhaps the answer involves describing a
> piece of hardware, perhaps a somewhat hypothetical piece of hardware
> if the real hardware is proprietary.  But usually adding new mechanisms
> to the kernel should involve explaining why it is needed.

Hi, Paul.

My motivation for working on the memoryless nodes patches is to
properly support all configurations of our hardware.  We can configure
our platforms with anywhere from 0% to 100% "cell local memory" [CLM].
We also call 0% CLM "fully interleaved", as the hardware interleaves
the memory at cache-line granularity.  Our AMD-based x86_64 platforms
have a similar feature, altho' it's "all or nothing" on those
platforms.  I believe the Fujitsu ia64 platform supports a similar
feature.

One could reasonably ask why we have this feature.  My understanding is
that certain OSes supported on this hardware were not very "NUMA-aware"
when the hardware was released -- Linux included.  Hardware interleaving
smoothed out the "hot spots" and made it possible to run reasonably
well on the platform.  This did leave some performance "on the table",
as Linux has demonstrated in recent releases.  Linux now performs
better for some workloads, like AIM7, in 100% CLM mode.  This was not
the case a year or two ago.

A couple of other details for completeness:  like SGI platforms, on our
platforms cell local memory shows up at some ridiculously high physical
address, altho' maybe not so ridiculous as the Altix ;-).  Interleaved
memory shows up at physical address 0.  I understand that the
architecture requires some memory at phys addr 0.  For this reason,
even when we configure 100% CLM, we still get a "small" amount of
interleaved memory -- 512M on my 4-node test system.

I should also mention that when the HP-UX group runs the TPC-C
benchmark for reporting, they find that a mixture of cell local and
interleaved memory provides the best performance.  I don't know the
details of how they lay out the benchmark on this config, but I need to
find out for Linux testing...

Anyway, in 0% CLM/fully-interleaved mode, our platform looks like this:

available: 5 nodes (0-4)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 0 MB
node 2 free: 0 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 8191 MB   <= interleaved at phys addr 0
node 4 free: 105 MB    <= was running a test...

If I configure for 100% CLM and boot with mem=16G [on a 32G platform],
I get:

available: 5 nodes (0-4)
node 0 size: 7600 MB
node 0 free: 6647 MB
node 1 size: 8127 MB
node 1 free: 7675 MB
node 2 size: 144 MB
node 2 free: 94 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 511 MB    <= interleaved @ phys addr 0
node 4 free: 494 MB

Both configs include memoryless nodes.
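For anyone who wants to pull the same per-node numbers programmatically
rather than parsing tool output, something along the lines of the
following untested libnuma sketch [link with -lnuma] should do it; the
memoryless nodes should simply report 0 MB:

#include <stdio.h>
#include <numa.h>

/*
 * Rough sketch only: dump per-node total/free memory via libnuma.
 * Memoryless nodes should show up with a size of 0.
 */
int main(void)
{
	long long size, free;
	int node;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	for (node = 0; node <= numa_max_node(); node++) {
		size = numa_node_size64(node, &free);
		if (size < 0)
			continue;	/* node not present */
		printf("node %d size: %lld MB  free: %lld MB\n",
		       node, size >> 20, free >> 20);
	}
	return 0;
}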
> In this case, it might further involve explaining why we need memoryless
> nodes, as opposed to say a hack for the above (hypothetical?) hardware
> in question that pretends that any CPUs on such memoryless nodes are on
> the nearest memory equipped node -- and then entirely drops the idea of
> memoryless nodes.  Most likely you have good reason not to go this way.
> Good chance even you've already explained this, and I missed it.

No, I haven't explained it.  Christoph posted the original memoryless
nodes patch set in response to prompting from Andrew.  He considered
failure to support memoryless nodes a bug.  The system "sort of" worked
because for most allocations, the zonelists allow the memoryless nodes
to immediately "fall back" to a node with memory.  There were a few
corner cases that Christoph's series addresses.

I believe that the x86_64 kernel works as you suggest in fully
interleaved mode:  all memory shows up on node zero in the SRAT, and
all cpus are attached to this node.

For my part, given that our platforms can be configured in a couple of
ways, I would prefer that cpus not change their node association based
on the configuration.  But, that's just me...  I know one shouldn't
make any assumptions about cpu-to-node association.  Rather, we have
the libnuma APIs to query this information.  Still... why go there?

And then there's the fact that on some platforms, ours included, not
all nodes with memory are equal.  See my recent patch to allow selected
nodes to be excluded from interleave policy.  I don't want to exclude
these nodes from cpusets to achieve this, because there are cases [like
the TPC-C benchmark mentioned above] where we want the application to
be able to use the funky, interleaved memory, but only when requested
explicitly.  IMO, Christoph's generic nodemask mechanism makes it easy
to handle nodes with special characteristics -- no memory, excluded
from interleave, ... -- in a generic way.

> ===
>
> I have user level code that scans the 'cpu%d' entries below the
> /sys/devices/system/node%d directories, and then inverts the resulting
> map, in order to provide, for any given cpu, the nearest
> node.  This code is a simple form of node and cpu topology for user
> code that wants to setup cpusets with cpus and nodes 'near' each other.

Sounds useful for an administrator partitioning the machine.  I can see
why you might need it with the size of your systems ;-).  And, for our
platform in fully interleaved mode -- even tho' there is only one node
with memory to choose from.  Is this part of the SGI ProPack?

> Could you post the results, from such a (possibly hypothetical) machine,
> of the following two commands:
>
>	find /sys/devices/system/node* -name cpu[0-9]\*
>	ls /sys/devices/system/cpu
>
> And if the 'ls' shows cpus that the 'find' doesn't show, then can you
> recommend how user code should be written that would return, for any
> specified cpu (even one on a memoryless node) the number of the
> 'nearest' node that does have memory (for some plausible definition,
> your choice pretty much, of 'nearest')?

I verified that I see all cpus [16 on the 4-node, 16-cpu ia64 platform
I'm testing on] either way -- find or ls -- [w/ and w/o cell local
memory].

> Granted, this is not a pressing issue ... not much chance that my user
> code will be running on your (hypothetical?) hardware anytime soon,
> unless there is some deal in the works I don't know about for hp to
> buy sgi ;).
>
> In short, how should user code find 'nearby' memory nodes for cpus that
> are on memoryless nodes?
Again, on the fully interleaved config, there is only one node with
memory, so it's not hard.  And in the 100% CLM config with mem= [2nd
config above], the SLIT says that the interleaved pseudo-node is closer
to any real node than any other real node is -- based on the average
latency.  The interleaved node is always the highest numbered node.
Mileage may vary on other platforms...
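FWIW, here's a rough, completely untested sketch of what I'd imagine
such user code could look like:  find the cpu's node from the
node%d/cpu%d sysfs entries [the inverse of the map you describe
building], then pick the closest node that reports a non-zero MemTotal,
using the node%d/distance [SLIT] files.  It assumes node numbers are
contiguous starting at 0; the helper names are just illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>

#define SYSFS_NODE "/sys/devices/system/node"

/* Does /sys/devices/system/node/node<nid>/cpu<cpu> exist? */
static int cpu_on_node(int cpu, int nid)
{
	char path[PATH_MAX];

	snprintf(path, sizeof(path), SYSFS_NODE "/node%d/cpu%d", nid, cpu);
	return access(path, F_OK) == 0;
}

/* MemTotal from node<nid>/meminfo, in kB; 0 for a memoryless node. */
static long node_mem_kb(int nid)
{
	char path[PATH_MAX], line[256];
	long kb = 0;
	FILE *f;

	snprintf(path, sizeof(path), SYSFS_NODE "/node%d/meminfo", nid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, "MemTotal:");
		if (p) {
			kb = atol(p + strlen("MemTotal:"));
			break;
		}
	}
	fclose(f);
	return kb;
}

/* SLIT distance from <nid> to <to>, read from node<nid>/distance. */
static int node_distance(int nid, int to)
{
	char path[PATH_MAX];
	int i, d = INT_MAX;
	FILE *f;

	snprintf(path, sizeof(path), SYSFS_NODE "/node%d/distance", nid);
	f = fopen(path, "r");
	if (!f)
		return INT_MAX;
	for (i = 0; i <= to; i++)
		if (fscanf(f, "%d", &d) != 1)
			d = INT_MAX;
	fclose(f);
	return d;
}

int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 0;
	int nid, home = -1, best = -1, bestdist = INT_MAX, maxnode = 0;
	char path[PATH_MAX];

	/* Highest node number present in sysfs [assumes contiguous ids]. */
	for (nid = 0; ; nid++) {
		snprintf(path, sizeof(path), SYSFS_NODE "/node%d", nid);
		if (access(path, F_OK) != 0)
			break;
		maxnode = nid;
	}

	/* Which node is this cpu attached to? */
	for (nid = 0; nid <= maxnode; nid++)
		if (cpu_on_node(cpu, nid))
			home = nid;
	if (home < 0) {
		fprintf(stderr, "cpu%d not found under %s\n", cpu, SYSFS_NODE);
		return 1;
	}

	/* Closest node that actually has memory, by SLIT distance. */
	for (nid = 0; nid <= maxnode; nid++) {
		int d = node_distance(home, nid);
		if (node_mem_kb(nid) > 0 && d < bestdist) {
			bestdist = d;
			best = nid;
		}
	}
	printf("cpu%d: node%d, nearest node with memory: node%d\n",
	       cpu, home, best);
	return 0;
}

Run with a cpu number as the argument and it prints that cpu's node and
the nearest node with memory.  On the fully interleaved config above it
should always answer node 4, of course.

Lee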