From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 16 Sep 2006 14:51:17 -0700
From: Paul Jackson
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Message-Id: <20060916145117.9b44786d.pj@sgi.com>
In-Reply-To: <20060916083825.ba88eee8.akpm@osdl.org>
References: <20060914220011.2be9100a.akpm@osdl.org>
	<20060914234926.9b58fd77.pj@sgi.com>
	<20060915002325.bffe27d1.akpm@osdl.org>
	<20060916044847.99802d21.pj@sgi.com>
	<20060916083825.ba88eee8.akpm@osdl.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Andrew Morton
Cc: clameter@sgi.com, linux-mm@kvack.org, rientjes@google.com
List-ID:

Andrew wrote:
> Pretty much all loads?  If you haven't consumed most of the "container"'s
> memory then you have overprovisioned its size.

Not so on real NUMA boxes.  If you configure your system so that you
are having to go a long way off-node for much of your memory, then
your performance is screwed.

No one in their right mind would run a memory hog that eats 40 nodes
of memory and a kernel build both in the same 60 node, small CPU
count cpuset on a real NUMA box.

The primary motivation for cpusets is to improve memory locality on
NUMA boxes.  You're using fake numa and cpusets to simulate destroying
memory locality.

On a real 64 node NUMA box, there would be 64 differently sorted
zonelists, each one centered on a different node.  The kernel build
would be running on different CPUs, associated with different nodes
than the memory hog, and it would be using zonelists that had the
unloaded nodes (those that still have free memory) at the front of
the list.

Aha - maybe this is the problem - the fake numa stuff is missing the
properly sorted zone lists.

We normally pick which zone list to use based on which CPU the thread
is running on, but that assumes there are at least as many CPUs as
Nodes, so that we have a many-to-one map from CPUs to Nodes.  In this
fake numa setup, we've got way more Nodes than CPUs.

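To make the mismatch concrete, here is a minimal userspace sketch, not
kernel code.  Everything in it is an assumption made for illustration:
the *_sketch names are hypothetical, the node distance is a made-up
"nodes on a line" metric, and the CPU-to-node map simply puts CPU c on
node c.  It builds one fallback order per node, nearest node first,
and then counts how many of those orders can ever be selected when the
choice depends only on the current CPU.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_NODES 64	/* fake NUMA nodes, as in the 64 node example */
#define NR_CPUS   4	/* hypothetical small CPU count */

/* Hypothetical distance metric: nodes laid out on a line. */
static int node_distance_sketch(int a, int b)
{
	return abs(a - b);
}

/*
 * Fill order[] with the fallback order for node nid: nid itself first,
 * then the other nodes by increasing distance (ties go to the lower id).
 */
static void build_zonelist_sketch(int nid, int order[NR_NODES])
{
	int i, j, key;

	assert(nid >= 0 && nid < NR_NODES);
	for (i = 0; i < NR_NODES; i++)
		order[i] = i;

	/* Stable insertion sort keyed on distance from nid. */
	for (i = 1; i < NR_NODES; i++) {
		key = order[i];
		j = i - 1;
		while (j >= 0 && node_distance_sketch(nid, order[j]) >
				 node_distance_sketch(nid, key)) {
			order[j + 1] = order[j];
			j--;
		}
		order[j + 1] = key;
	}
}

/* Hypothetical map: CPU c sits on node c. */
static int cpu_to_node_sketch(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu;
}

/* Count the distinct per-node orders reachable via the current-CPU rule. */
static int reachable_zonelists_sketch(void)
{
	int seen[NR_NODES] = { 0 };
	int cpu, nid, count = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		nid = cpu_to_node_sketch(cpu);
		if (!seen[nid]) {
			seen[nid] = 1;
			count++;
		}
	}
	return count;
}
```

Under these assumptions, only NR_CPUS of the NR_NODES fallback orders
are ever chosen, so every task starts its search from the same few
nodes no matter where the free memory is.
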
The current macro wrapper that calls __alloc_pages() in gfp.h relies
on per-node data to select a zonelist, and the node is selected based
on the task's current CPU.  You can see this in the following code
from mmzone.h and gfp.h:

    #define numa_node_id()	(cpu_to_node(raw_smp_processor_id()))

    nid = numa_node_id();
    return __alloc_pages(gfp_mask, order,
	    NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));

For the fake numa mechanism to work well, it would need to assign a
different zone list to the kernel build tasks in this test case than
it assigned to the memory hogs.  These different zone lists would
have to have a different sort, with unloaded nodes at the front.

The definition of numa_node_id() would have to depend on more than
just the processor id for this to work.

Perhaps the memory hog and the kernel build should be in separate
cpusets, and the zonelists they were assigned should depend on their
cpuset (not just their current CPU, as it is now), so that they got
zonelists with a different sort to them.  Then the kernel build
wouldn't have to waddle past all the used up memory hog nodes to find
some free memory ...

Well, I intentionally overstated the case a bit.  I doubt that we
should put a big effort -at this time- into elaborating the zonelists
and numa_node_id() mechanisms for x86_64 fake numa configurations.

Rather, we should recognize that it is only in such configurations
that the existing code in get_page_from_freelist() and
cpuset_zone_allowed() has the suboptimal performance observed here,
due to the different zonelist configuration.

In an earlier message, Andrew wrote:
> Guys, it's still 50 cachelines per page.  This one needs more than tweaking
> - algorithmic changes are needed.

I disagree (respectfully disagree -- Andrew has *way* more experience
hacking memory code than I ;).  I still don't know if this x86_64 fake
numa mechanism has any real life outside of entertaining kernel memory
hackers.

We should first see how far the localized tweaks get us.
I suspect the tweaks will help quite a bit and provide an adequate
basis for effective machine partitioning, as proposed by the various
bright sparks.

If this partitioning proves useful, then I agree that we should
elaborate the algorithms and data structures to also handle this
usage well.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson 1.925.600.0401