Date: Tue, 19 Sep 2006 13:52:46 -0700 (PDT)
From: David Rientjes
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
In-Reply-To: <20060918093434.e66b8887.pj@sgi.com>
References: <20060914220011.2be9100a.akpm@osdl.org>
 <20060914234926.9b58fd77.pj@sgi.com>
 <20060915002325.bffe27d1.akpm@osdl.org>
 <20060915004402.88d462ff.pj@sgi.com>
 <20060915010622.0e3539d2.akpm@osdl.org>
 <20060917041707.28171868.pj@sgi.com>
 <20060917060358.ac16babf.pj@sgi.com>
 <20060917152723.5bb69b82.pj@sgi.com>
 <20060917192010.cc360ece.pj@sgi.com>
 <20060918093434.e66b8887.pj@sgi.com>
To: Paul Jackson
Cc: clameter@sgi.com, akpm@osdl.org, linux-mm@kvack.org, rientjes@google.com

On Mon, 18 Sep 2006, Paul Jackson wrote:

> For now, it could be that we can't handle hybrid systems, and that fake
> numa systems simply have a distance table of all 10's, driven by the
> kernel boot command "numa=fake=N".  But that apparatus will have to be
> extended at some point, to support hybrid fake and real NUMA combined.
> And this will have to mature from being an arch=x86_64 only thing to
> being generically available.  And it will have to become a mechanism
> that can be applied on a running system, creating (and removing) fake
> nodes on the fly, without a reboot, so long as the required physical
> memory is free and available.
>
> A comment above arch/x86_64/mm/srat.c slit_valid() raises concerns
> about a SLIT table with all 10's.  I suspect we will just have to find
> out the hard way what that problem is.  Change the table to all 10's
> on these fake numa systems and see what hurts.
>
> The generic kernel code should deal with this, and in particular, the
> get_page_from_freelist() loop that provoked this discussion should be
> coded so that it caches the last used node iff that node is distance
> 10 from the node at the front of the zonelist.
>
> The only way to make this kind of stuff hold up over the long term
> is to get a good conceptual model, and stick with it.  This fake
> numa provides for multiple logical nodes on a single physical node.
>
> The modal approach I recommended yesterday, where a system either
> supported fake NUMA or real NUMA, but not both, had the stench of
> an intermediate solution that would not hold over the long run.
>

Any solution that uses numa=fake=N as a means of resource management
comes with two prerequisites:

 1. An increase in N does not degrade the performance of
    get_page_from_freelist by more than a negligible amount.

 2. The current cpusets infrastructure is not changed in any risky (or
    major) way that affects real NUMA machines.

We can assume, for the purposes of this implementation, that the system
administrator (or whoever is responsible for using this means of memory
management) configures the machine appropriately.  So there is no worry
about allocating 256 nodes when the intent is to split the machine down
the middle and have two "containers" for its tasks.

My current fixes to numa=fake:

 1. N nodes are split evenly over the memory map, with no restriction
    that N be a power of 2 (see the sketch following this list).

 2. Memory can be allocated in 4M contiguous chunks on a machine of up
    to 64G.  This means my 3G box can be booted with numa=fake=256, or
    even higher if NODE_SHIFT is increased.

 3. Nodes can be asymmetric, so you can configure as many nodes as you
    want with the sizes you specify.
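For illustration only, here is a minimal userspace sketch of the even
split described in (1) and (2); the function and constant names are my
own, and this is just the arithmetic involved, not the actual
numa_emulation() code:

#include <stdio.h>

#define FAKE_NODE_MIN_SIZE	(4UL << 20)	/* 4M granularity */

/* Split [start, end) into n fake nodes, each a multiple of 4M. */
static void split_nodes(unsigned long start, unsigned long end, int n)
{
	unsigned long size = (end - start) / n;
	int i;

	size &= ~(FAKE_NODE_MIN_SIZE - 1);	/* round down to 4M */

	for (i = 0; i < n; i++) {
		unsigned long node_start = start + i * size;
		/* the last node absorbs any remainder up to 'end' */
		unsigned long node_end = (i == n - 1) ? end
						      : node_start + size;

		printf("node %3d: %#012lx - %#012lx\n",
		       i, node_start, node_end);
	}
}

int main(void)
{
	split_nodes(0, 3UL << 30, 256);		/* a 3G box, numa=fake=256 */
	return 0;
}

The point is only that the split is pure arithmetic: nothing requires N
to be a power of 2, and the 4M granularity is what bounds how large N
can be for a given amount of memory.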
This provides the system administrator with all the functionality
needed to configure the machine appropriately for its foreseen
workload.  It is not, however, the dynamic solution that Christoph
suggests, where nodes can be partitioned and repartitioned at runtime.
The problem we face is prerequisite #1: get_page_from_freelist needs to
be more efficient when it is dealing with fake nodes as opposed to real
nodes.

In favor of prerequisite #2, __node_distance in arch/x86_64/ can be
modified so that the distance between fake nodes is always 10 (local
distance).  This is intuitive: we have a UMA machine that is acting in
a NUMA environment, so memory locality is no longer a consideration.
Fake nodes are not currently emulated on top of real NUMA nodes in Andi
Kleen's implementation, so the change to __node_distance is trivial
once we have abstracted whether NUMA emulation is, in fact, being used.

The concern with that approach is the comment above slit_valid(), which
suggests that the local node should always have a smaller distance than
the others because of the NUMA heuristics.  One of those heuristics is
in find_next_best_node, where the next best node is chosen by
preferring the ones with the smallest distance (in addition to those
with CPUs, etc).  This is not an issue, since all nodes in a fake NUMA
system have the same distance and the heuristic becomes a non-factor.
The other heuristic is in build_zonelists, where pages are reclaimed
from the local node in preference to other nodes.  Since the distance
is, again, always the same between any two nodes in a fake NUMA system,
it never exceeds RECLAIM_DISTANCE, so it is acceptable at all times to
reclaim across nodes (this is, after all, a UMA machine) and no
preference needs to be given to local reclaim.

There is _no_ problem with the penalty applied to the first node of an
equidistant group in the build_zonelists iteration.  The load on each
individual node will be the same, since the distance between the nodes
never changes, and thus the ordering of nodes is appropriate even for
our emulation.  To be convinced of this, note that for any node N,
node_load[N] ends up the same for every node in our UMA machine:
find_next_best_node treats each node equally.

The return value of the numa_emulation call in arch/x86_64/mm/numa.c is
0 when CONFIG_NUMA_EMU is enabled and the command line was parsed and
set up correctly.  It is trivial to expose this to generic kernel code
by creating a macro that any architecture can test against (which may
also become useful later when NUMA emulation is made generically
available, as suggested earlier).

So, with these changes, we are now assured of:

 1. A macro in generic kernel code that can be tested to tell whether
    the kernel is running with numa=fake.

 2. That the ordering done by find_next_best_node treats each node
    equally, keeping in the back of our minds that this is really a
    UMA machine.

This no longer requires us to determine whether two fake nodes are
really on the same hardware node, because we recognize the NUMA
emulation case where appropriate and the rest of the cpusets
infrastructure remains unchanged (prerequisite #2).
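To make the __node_distance change concrete, here is a minimal sketch
of the intended behaviour; the numa_emu_active flag and the fallback
path are assumptions of mine, not existing kernel symbols:

#include <stdio.h>

#define LOCAL_DISTANCE	10
#define REMOTE_DISTANCE	20

/* Assumed to be set at boot when "numa=fake=N" parses successfully. */
static int numa_emu_active = 1;

/*
 * Under NUMA emulation every node is LOCAL_DISTANCE from every other
 * node, so find_next_best_node sees no preferred node and the distance
 * never exceeds RECLAIM_DISTANCE in build_zonelists.
 */
static int node_distance(int a, int b)
{
	if (numa_emu_active)
		return LOCAL_DISTANCE;
	/* stand-in for the real SLIT table lookup on actual NUMA */
	return a == b ? LOCAL_DISTANCE : REMOTE_DISTANCE;
}

int main(void)
{
	printf("distance(0, 3) = %d\n", node_distance(0, 3));	/* 10 */
	return 0;
}

The same boolean, exported as a macro, is what generic code would test
to recognize the numa=fake case without knowing anything about the
underlying hardware.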
Now stepping back to prerequisite #1, we are required to modify
__cpuset_zone_allowed for the case where we have NUMA emulation.  The
idea is that since we have N nodes, where N may be large,
get_page_from_freelist can end up scanning all N nodes before it finds
free memory.

This isn't a problem on a typical NUMA machine, because free pages can
normally be found quickly in a scan of the first few zonelist entries,
thanks to the N differently sorted zonelists.  It _is_ a problem under
NUMA emulation, because we have many more nodes than CPUs, but it can
be remedied with the proposed caching and the changes to
__node_distance.

As Paul and Andrew suggested, there are three additions to task_struct:

 1. a cached copy of the struct zonelist *zonelist that was passed into
    get_page_from_freelist,

 2. the index of the zone where free memory was last found, and

 3. the index of the next zone to try when (2) is full.

In the case where the passed-in zonelist* differs from (1), or in the
~GFP_HARDWALL & ~ALLOC_CPUSET case, get_page_from_freelist uses the
current implementation, walking the zonelist until it finds a zone with
enough free pages.  Otherwise, if we are in the NUMA emulation case,
the node where memory was most recently found can be used from the
cache, since all memory is equal.  No consideration is given to the
distance between the last used node and the node at the front of the
zonelist, because the distance between all nodes is 10.

(If the passed-in zonelist* differs from (1), then the three
task_struct additions are reset for the new configuration, in the same
sense as cpuset_update_task_memory_state, since the memory placement
has changed relative to current->cpuset, which cpusets allows to be
manipulated from the outside.)

Now, when get_page_from_freelist is called, (3) is tested for free
memory and used if some is found; otherwise it is incremented so that
we don't spin on one full zone (especially in the case of the memory
hogs from my experiment, where we would never get out of that spin).
This prevents us from having to take callback_mutex and makes the call
to __cpuset_zone_allowed more efficient.

Thus this solution correctly implements prerequisite #1 and keeps the
modification to the current cpusets infrastructure small (prerequisite
#2) by abstracting the NUMA emulation case away from the current code
path.  A rough sketch of the caching is appended below.

		David
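For illustration, a compile-only sketch of those three task_struct
additions and the cached fast path; the structure layouts and every
name below are simplifications of mine, not the actual mm/page_alloc.c
code:

#include <stddef.h>

/* Toy stand-ins for the real kernel structures. */
struct zone { long nr_free_pages; };
struct zonelist { struct zone *zones[257]; };	/* NULL-terminated */

/* The three proposed task_struct additions (names made up here). */
struct alloc_cache {
	struct zonelist *zonelist;	/* (1) zonelist seen last time       */
	int last_zone;			/* (2) index where memory was found  */
	int next_zone;			/* (3) index to try once (2) is full */
};

static int zone_has_free_pages(const struct zone *z)
{
	return z && z->nr_free_pages > 0;
}

/*
 * Fast path for the NUMA emulation case: every node is distance 10
 * from every other node, so any zone with free pages is as good as
 * any other and the cached one can be reused until it runs dry.
 */
static struct zone *cached_zone(struct alloc_cache *c, struct zonelist *zl)
{
	struct zone *z;

	if (c->zonelist != zl) {
		/* new zonelist (e.g. cpuset changed): reset the cache */
		c->zonelist = zl;
		c->last_zone = 0;
		c->next_zone = 0;
	}

	z = zl->zones[c->last_zone];
	if (zone_has_free_pages(z))
		return z;

	/* (2) is full: advance (3) so we never spin on one full zone. */
	while ((z = zl->zones[c->next_zone]) != NULL) {
		if (zone_has_free_pages(z)) {
			c->last_zone = c->next_zone;
			return z;
		}
		c->next_zone++;
	}
	return NULL;	/* every zone is full: fall back to the slow path */
}

The ~GFP_HARDWALL & ~ALLOC_CPUSET allocations, and the case where the
zonelist pointer has changed, would still take the existing path, so
real NUMA machines never see this code.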