Subject: Re: NUMA aware slab allocator V3
From: Dave Hansen
Date: Mon, 16 May 2005 10:22:15 -0700
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, Linux Kernel Mailing List, shai@scalex86.org, steiner@sgi.com
Message-Id: <1116264135.1005.73.camel@localhost>
References: <20050512000444.641f44a9.akpm@osdl.org> <20050513000648.7d341710.akpm@osdl.org> <20050513043311.7961e694.akpm@osdl.org> <1116251568.1005.29.camel@localhost>

On Mon, 2005-05-16 at 09:47 -0700, Christoph Lameter wrote:
> On Mon, 16 May 2005, Dave Hansen wrote:
> > There are some broken assumptions in the kernel that
> > CONFIG_DISCONTIG==CONFIG_NUMA.  These usually manifest when code
> > assumes that one pg_data_t means one NUMA node.
> >
> > However, NUMA node ids are actually distinct from "discontigmem nodes".
> > A "discontigmem node" is just one physically contiguous area of memory,
> > thus one pg_data_t.  Some (non-NUMA) Mac G5s have a gap in their
> > address space, so they get two discontigmem nodes.
>
> I thought the discontiguous memory in one node was handled through zones?
> I.e. ZONE_HIGHMEM in i386?

You can only have one zone of each type under each pg_data_t.  For
instance, you can't properly represent (DMA, NORMAL, HIGHMEM, <gap>,
HIGHMEM) in a single pg_data_t without wasting node_mem_map[] space.
The "proper" discontig way of representing that is like this:

	pg_data_t[0] (DMA, NORMAL, HIGHMEM)
	pg_data_t[1] (---, ------, HIGHMEM)

Where pg_data_t[1] has empty DMA and NORMAL zones.  Also, remember that
both of these could theoretically be on the same NUMA node, but I don't
think we ever do that in practice.

> > So, that #error is bogus.  It's perfectly valid to have multiple
> > discontigmem nodes when the number of NUMA nodes is 1.  MAX_NUMNODES
> > refers to discontigmem nodes, not NUMA nodes.
>
> Ok. We looked through the code and saw that the check may be removed
> without causing problems. However, there is still a feeling of uneasiness
> about this.

I don't blame you :)

> To what node does numa_node_id() refer?

That refers to the NUMA node that you're thinking of: close CPUs,
memory, I/O, etc.

> And is it legit to use
> numa_node_id() to index cpu maps and stuff?

Yes, those are all NUMA nodes.

> How do the concepts of numa node id relate to discontig node ids?

I believe there are quite a few assumptions on some architectures that,
when NUMA is on, they are equivalent.  It appears to be pretty much
assumed everywhere that CONFIG_NUMA=y means one pg_data_t per NUMA node.
Remember, as you saw, you can't assume that MAX_NUMNODES=1 when NUMA=n,
because of the DISCONTIG=y case.

So, in summary, if you want to do it right: use the
CONFIG_NEED_MULTIPLE_NODES that you see in -mm.  As plain DISCONTIG=y
gets replaced by sparsemem, any code using it is likely to keep working.
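
To make the one-zone-of-each-type-per-pg_data_t point concrete, here is
a rough sketch in kernel C.  It is illustrative only: the field and
symbol names (node_zones, present_pages, NODE_DATA(), MAX_NR_ZONES,
CONFIG_NEED_MULTIPLE_NODES) are what I remember from the 2.6-era headers
and the -mm tree, so check them against your own tree before relying on
them.

/* Walk every pg_data_t and print its non-empty zones.  On a G5-style
 * layout you would see pg_data_t[0] with DMA/NORMAL/HIGHMEM populated
 * and pg_data_t[1] with only HIGHMEM, even though NUMA=n. */
#include <linux/mmzone.h>
#include <linux/kernel.h>

static void dump_node_zones(void)
{
	int nid;

	/* MAX_NUMNODES counts discontigmem nodes here, not NUMA nodes */
	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		pg_data_t *pgdat = NODE_DATA(nid);
		int i;

		if (!pgdat)
			continue;

		for (i = 0; i < MAX_NR_ZONES; i++) {
			struct zone *z = pgdat->node_zones + i;

			/* skip empty zones, e.g. DMA/NORMAL in pg_data_t[1] */
			if (!z->present_pages)
				continue;
			printk(KERN_DEBUG "node %d zone %d: %lu pages\n",
			       nid, i, z->present_pages);
		}
	}
}

/* And the "do it right" guard from the summary above: key per-node code
 * off CONFIG_NEED_MULTIPLE_NODES (DISCONTIG || NUMA in -mm) instead of
 * assuming MAX_NUMNODES == 1 whenever CONFIG_NUMA is off. */
#ifdef CONFIG_NEED_MULTIPLE_NODES
	/* multiple pg_data_t's: valid for DISCONTIG=y even with NUMA=n */
#else
	/* one pg_data_t (contig_page_data); MAX_NUMNODES is 1 here */
#endif

The point of the guard is that it keeps working once DISCONTIG=y is
replaced by sparsemem, whereas checks written against CONFIG_NUMA or a
hard-coded MAX_NUMNODES==1 will not.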
-- Dave