* [patch 0/2] fix bugs while booting on NUMA system where some nodes have no mem
@ 2006-11-15 18:30 Christian Krafft
  2006-11-15 18:32 ` [patch 1/2] fix call to alloc_bootmem after bootmem has been freed Christian Krafft
  2006-11-15 18:34 ` [patch 2/2] enables booting a NUMA system where some nodes have no memory Christian Krafft
  0 siblings, 2 replies; 36+ messages in thread
From: Christian Krafft @ 2006-11-15 18:30 UTC
  To: linux-mm; +Cc: linux-kernel, krafft

Hi,

The following patches fix two problems that showed up while booting
a NUMA system where memory was limited to the first node.

Please cc me on comments, as I am not subscribed.

cheers,
Christian

PS: sorry for resending it, I didn't cc myself, and wasn't able to reply to this note.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [patch 1/2] fix call to alloc_bootmem after bootmem has been freed
  2006-11-15 18:30 [patch 0/2] fix bugs while booting on NUMA system where some nodes have no mem Christian Krafft
@ 2006-11-15 18:32 ` Christian Krafft
  2006-11-21 16:55 ` Andrew Morton
  2006-11-15 18:34 ` [patch 2/2] enables booting a NUMA system where some nodes have no memory Christian Krafft
  1 sibling, 1 reply; 36+ messages in thread
From: Christian Krafft @ 2006-11-15 18:32 UTC
  To: Christian Krafft; +Cc: linux-mm, linux-kernel

In some cases it might happen that alloc_bootmem is being called
after bootmem pages have been freed. This is because the condition
SYSTEM_BOOTING is still true after bootmem has been freed.

Signed-off-by: Christian Krafft <krafft@de.ibm.com>

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1931,7 +1931,7 @@ int zone_wait_table_init(struct zone *zo
 	alloc_size = zone->wait_table_hash_nr_entries
 					* sizeof(wait_queue_head_t);
 
-	if (system_state == SYSTEM_BOOTING) {
+	if (!slab_is_available()) {
 		zone->wait_table = (wait_queue_head_t *)
 			alloc_bootmem_node(pgdat, alloc_size);
 	} else {
* Re: [patch 1/2] fix call to alloc_bootmem after bootmem has been freed
  2006-11-15 18:32 ` [patch 1/2] fix call to alloc_bootmem after bootmem has been freed Christian Krafft
@ 2006-11-21 16:55 ` Andrew Morton
  2006-11-21 18:02 ` Christian Krafft
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-11-21 16:55 UTC
  To: Christian Krafft; +Cc: linux-mm, linux-kernel

On Wed, 15 Nov 2006 19:32:38 +0100
Christian Krafft <krafft@de.ibm.com> wrote:

> In some cases it might happen that alloc_bootmem is being called
> after bootmem pages have been freed. This is because the condition
> SYSTEM_BOOTING is still true after bootmem has been freed.
>
> Signed-off-by: Christian Krafft <krafft@de.ibm.com>
>
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -1931,7 +1931,7 @@ int zone_wait_table_init(struct zone *zo
>  	alloc_size = zone->wait_table_hash_nr_entries
>  					* sizeof(wait_queue_head_t);
>
> -	if (system_state == SYSTEM_BOOTING) {
> +	if (!slab_is_available()) {
>  		zone->wait_table = (wait_queue_head_t *)
>  			alloc_bootmem_node(pgdat, alloc_size);
>  	} else {

I don't think that slab_is_available() is an appropriate way of working out
if we can call vmalloc().

Also, a more complete description of the problem is needed, please. Which
caller is incorrectly allocating bootmem?
* Re: [patch 1/2] fix call to alloc_bootmem after bootmem has been freed
  2006-11-21 16:55 ` Andrew Morton
@ 2006-11-21 18:02 ` Christian Krafft
  2006-11-21 18:26 ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Christian Krafft @ 2006-11-21 18:02 UTC
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Tue, 21 Nov 2006 08:55:35 -0800
Andrew Morton <akpm@osdl.org> wrote:

> On Wed, 15 Nov 2006 19:32:38 +0100
> Christian Krafft <krafft@de.ibm.com> wrote:
>
> > In some cases it might happen that alloc_bootmem is being called
> > after bootmem pages have been freed. This is because the condition
> > SYSTEM_BOOTING is still true after bootmem has been freed.
> >
> > Signed-off-by: Christian Krafft <krafft@de.ibm.com>
> >
> > Index: linux/mm/page_alloc.c
> > ===================================================================
> > --- linux.orig/mm/page_alloc.c
> > +++ linux/mm/page_alloc.c
> > @@ -1931,7 +1931,7 @@ int zone_wait_table_init(struct zone *zo
> >  	alloc_size = zone->wait_table_hash_nr_entries
> >  					* sizeof(wait_queue_head_t);
> >
> > -	if (system_state == SYSTEM_BOOTING) {
> > +	if (!slab_is_available()) {
> >  		zone->wait_table = (wait_queue_head_t *)
> >  			alloc_bootmem_node(pgdat, alloc_size);
> >  	} else {
>
> I don't think that slab_is_available() is an appropriate way of working out
> if we can call vmalloc().

Afaik slab_is_available() is the generic replacement for mem_init_done, which exists only on powerpc.
If that's not appropriate, I don't know why. However, SYSTEM_BOOTING is definitely wrong.

> Also, a more complete description of the problem is needed, please. Which
> caller is incorrectly allocating bootmem?
>

spu_base is causing the call to alloc_bootmem, but only if it is built into the kernel.
Other components might have the same problem.

cheers,
ck

--
Mit freundlichen Grussen,
kind regards,

Christian Krafft
IBM Systems & Technology Group,
Linux Kernel Development
IT Specialist
* Re: [patch 1/2] fix call to alloc_bootmem after bootmem has been freed
  2006-11-21 18:02 ` Christian Krafft
@ 2006-11-21 18:26 ` Andrew Morton
  2006-11-22 9:23 ` Arnd Bergmann
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-11-21 18:26 UTC
  To: Christian Krafft; +Cc: linux-mm, linux-kernel

On Tue, 21 Nov 2006 19:02:13 +0100
Christian Krafft <krafft@de.ibm.com> wrote:

> > > Index: linux/mm/page_alloc.c
> > > ===================================================================
> > > --- linux.orig/mm/page_alloc.c
> > > +++ linux/mm/page_alloc.c
> > > @@ -1931,7 +1931,7 @@ int zone_wait_table_init(struct zone *zo
> > >  	alloc_size = zone->wait_table_hash_nr_entries
> > >  					* sizeof(wait_queue_head_t);
> > >
> > > -	if (system_state == SYSTEM_BOOTING) {
> > > +	if (!slab_is_available()) {
> > >  		zone->wait_table = (wait_queue_head_t *)
> > >  			alloc_bootmem_node(pgdat, alloc_size);
> > >  	} else {
> >
> > I don't think that slab_is_available() is an appropriate way of working out
> > if we can call vmalloc().
>
> Afaik slab_is_available() is the generic replacement for mem_init_done, which exists only on powerpc.
> If that's not appropriate, I don't know why. However, SYSTEM_BOOTING is definitely wrong.

slab is a very different thing from vmalloc. One could easily envisage
situations (now or in the future) in which slab is ready, but vmalloc is
not (more likely vice versa).

It'd be better to add a new vmalloc_is_available. (Just an int - no need
for a helper function).
* Re: [patch 1/2] fix call to alloc_bootmem after bootmem has been freed
  2006-11-21 18:26 ` Andrew Morton
@ 2006-11-22 9:23 ` Arnd Bergmann
  0 siblings, 0 replies; 36+ messages in thread
From: Arnd Bergmann @ 2006-11-22 9:23 UTC
  To: Andrew Morton; +Cc: Christian Krafft, linux-mm, linux-kernel

On Tuesday 21 November 2006 19:26, Andrew Morton wrote:
> slab is a very different thing from vmalloc. One could easily envisage
> situations (now or in the future) in which slab is ready, but vmalloc is
> not (more likely vice versa).
>
> It'd be better to add a new vmalloc_is_available. (Just an int - no need
> for a helper function).

In the time line, we currently have

start_kernel()
  ...
  setup_arch()
    init_bootmem()        # alloc_bootmem starts working
    ...
    paging_init()         # needed for vmalloc
  ...                     #
  mem_init()
    free_all_bootmem()    # alloc_bootmem stops working, alloc_pages
                          # starts working
  kmem_cache_init()       # kmalloc and vmalloc start working
  ...
  system_state = SYSTEM_RUNNING

The one interesting point here is where you have to transition between
calling alloc_bootmem and calling the regular allocator functions.
Maybe calling it slab_is_available() was not the best choice for a
name, but I don't see a point in having different names for essentially
the same question, "bootmem or not bootmem".

The powerpc platform has an integer variable called 'mem_init_done',
which expresses this well IMHO, but it's currently not portable.
Checking for SYSTEM_RUNNING is obviously the wrong choice, since it is
set at a very late point in bootup, long after bootmem is gone.

	Arnd <><
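The time line Arnd gives can be modelled in a few lines of user-space C. This is only a sketch, not kernel code: the enum boot_phase, its PHASE_* names and the three helper functions are hypothetical and exist only for illustration. It assumes the late caller (for example a built-in driver's init code) runs after kmem_cache_init() but before system_state becomes SYSTEM_RUNNING, which is where the old check goes wrong.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical boot phases, in the order of the time line in the thread. */
enum boot_phase {
	PHASE_EARLY,      /* before init_bootmem() */
	PHASE_BOOTMEM,    /* init_bootmem() done, bootmem not yet freed */
	PHASE_PAGE_ALLOC, /* free_all_bootmem() done, slab not yet set up */
	PHASE_SLAB,       /* kmem_cache_init() done, still SYSTEM_BOOTING */
	PHASE_RUNNING     /* system_state == SYSTEM_RUNNING */
};

/* Is the bootmem allocator actually usable in this phase? */
static bool bootmem_usable(enum boot_phase p)
{
	assert(p <= PHASE_RUNNING);
	return p == PHASE_BOOTMEM;
}

/* Models the old test: "system_state == SYSTEM_BOOTING". */
static bool old_check_picks_bootmem(enum boot_phase p)
{
	assert(p <= PHASE_RUNNING);
	return p < PHASE_RUNNING;
}

/* Models the new test: "!slab_is_available()". */
static bool new_check_picks_bootmem(enum boot_phase p)
{
	assert(p <= PHASE_RUNNING);
	return p < PHASE_SLAB;
}
```

In this model the old check still selects bootmem in PHASE_SLAB, although bootmem is gone by then, while the new check does not. The model also has a PHASE_PAGE_ALLOC window in which the new check would still select bootmem; whether that window matters in practice is not settled by the thread.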
* [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 18:30 [patch 0/2] fix bugs while booting on NUMA system where some nodes have no mem Christian Krafft
  2006-11-15 18:32 ` [patch 1/2] fix call to alloc_bootmem after bootmem has been freed Christian Krafft
@ 2006-11-15 18:34 ` Christian Krafft
  2006-11-15 21:24 ` Christoph Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Christian Krafft @ 2006-11-15 18:34 UTC
  To: Christian Krafft; +Cc: linux-mm, linux-kernel

When booting a NUMA system with nodes that have no memory (e.g. by limiting memory),
bootmem_alloc_core tried to find pages in an uninitialized bootmem_map.
This caused a null pointer access.
This fix adds a check, so that NULL is returned. That will enable the caller
(bootmem_alloc_nopanic) to allocate memory on other nodes without a panic.

Signed-off-by: Christian Krafft <krafft@de.ibm.com>

Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -196,6 +196,10 @@ __alloc_bootmem_core(struct bootmem_data
 	if (limit && bdata->node_boot_start >= limit)
 		return NULL;
 
+	/* on nodes without memory - bootmem_map is NULL */
+	if(!bdata->node_bootmem_map)
+		return NULL;
+
 	end_pfn = bdata->node_low_pfn;
 	limit = PFN_DOWN(limit);
 	if (limit && end_pfn > limit)
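The effect of the added NULL check can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel's bootmem code: struct node_bootmem, alloc_core() and alloc_any_node() are hypothetical names, the per-node map is simplified to one byte per page, and the "try the preferred node, then the others in order" fallback is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for per-node bootmem data: one "in use" byte per page. */
struct node_bootmem {
	unsigned char *map;	/* NULL on a node without memory */
	size_t nr_pages;
};

/*
 * Models the patched core allocator: returns the index of the page it
 * handed out, or -1 if this node cannot satisfy the request.
 */
static long alloc_core(struct node_bootmem *bd)
{
	if (!bd->map)
		return -1;	/* no memory on this node: fail instead of dereferencing NULL */

	for (size_t i = 0; i < bd->nr_pages; i++) {
		if (!bd->map[i]) {
			bd->map[i] = 1;
			return (long)i;
		}
	}
	return -1;
}

/*
 * Models a non-panicking caller: try the preferred node first, then every
 * other node. Returns the node id that satisfied the request, or -1.
 */
static int alloc_any_node(struct node_bootmem *nodes, int nr_nodes,
			  int preferred, long *page)
{
	assert(preferred >= 0 && preferred < nr_nodes);

	for (int k = 0; k < nr_nodes; k++) {
		int nid = (preferred + k) % nr_nodes;
		long pg = alloc_core(&nodes[nid]);

		if (pg >= 0) {
			*page = pg;
			return nid;
		}
	}
	return -1;
}
```

Because alloc_core() reports failure for a node whose map is NULL, a request that prefers a memoryless node is simply served from another node, which is the behaviour the patch description asks for.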
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 18:34 ` [patch 2/2] enables booting a NUMA system where some nodes have no memory Christian Krafft
@ 2006-11-15 21:24 ` Christoph Lameter
  2006-11-15 21:58 ` Jack Steiner
  2006-11-15 22:05 ` Martin Bligh
  0 siblings, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2006-11-15 21:24 UTC
  To: Christian Krafft; +Cc: linux-mm, linux-kernel

On Wed, 15 Nov 2006, Christian Krafft wrote:

> When booting a NUMA system with nodes that have no memory (e.g. by limiting memory),
> bootmem_alloc_core tried to find pages in an uninitialized bootmem_map.

Why should we support nodes with no memory? If a node has no memory then
its processors and other resources need to be attached to the nearest node
with memory.

AFAICT The primary role of a node is to manage memory.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 21:24 ` Christoph Lameter
@ 2006-11-15 21:58 ` Jack Steiner
  2006-11-15 22:40 ` Christoph Lameter
  2006-11-15 22:05 ` Martin Bligh
  1 sibling, 1 reply; 36+ messages in thread
From: Jack Steiner @ 2006-11-15 21:58 UTC
  To: Christoph Lameter; +Cc: Christian Krafft, linux-mm, linux-kernel

On Wed, Nov 15, 2006 at 01:24:55PM -0800, Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Christian Krafft wrote:
>
> > When booting a NUMA system with nodes that have no memory (e.g. by limiting memory),
> > bootmem_alloc_core tried to find pages in an uninitialized bootmem_map.
>
> Why should we support nodes with no memory? If a node has no memory then
> its processors and other resources need to be attached to the nearest node
> with memory.
>
> AFAICT The primary role of a node is to manage memory.
>

SGI has nodes that have neither memory nor cpus. These are IO nodes.
Think of them as ordinary nodes that have had the cpus & DIMMs removed.
Only the IO buses remain.

IO nodes have the same NUMA properties as regular nodes. They are connected
via the numalink fabric, they should be described in the SLIT table, they
should be identified in proximity_domains, etc.

A lot of the core infrastructure is currently missing that is required
to describe IO nodes as regular nodes, but in principle, I don't
see anything wrong with nodes w/o memory.

It is also possible to disable the DIMMs on a node that actually has cpus
& memory. I suspect this doesn't work but I see no reason that you should
HAVE to disable the cpus on nodes that have had the DIMMs disabled.
Our BIOS currently provides the capability to disable DIMMs. The BIOS has
a hack to automatically disable cpus if all DIMMs have been disabled. This
hack was required for several reasons, one of which was that Linux does not
support nodes with cpus & no memory.

-- jack
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 21:58 ` Jack Steiner
@ 2006-11-15 22:40 ` Christoph Lameter
  2006-11-15 22:43 ` Martin Bligh
  2006-11-16 1:35 ` Jack Steiner
  0 siblings, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2006-11-15 22:40 UTC
  To: Jack Steiner; +Cc: Christian Krafft, linux-mm, Martin Bligh, linux-kernel

On Wed, 15 Nov 2006, Jack Steiner wrote:

> A lot of the core infrastructure is currently missing that is required
> to describe IO nodes as regular nodes, but in principle, I don't
> see anything wrong with nodes w/o memory.

Every processor has a local node on which it runs. The kernel places
memory used by the processor on the local node. Even if we allow
nodes without memory: We still need to associate a "local" node to the
processor. If that is across some NUMA interlink then it is going to be
slower but it will work.

AFAIK It seems to be better to explicitly associate a memory node with a
processor during bootup in arch code.

Various kernel optimizations rely on local memory. Would we create
a special case here of a pglist_data structure without a zones structure?

It seems that the contents of pglist_data are targeted to a memory node.
If we do not have a pglist_data structure then the node would not exist
for the kernel.

What would the benefit or difference be of having nodes without memory?
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 22:40 ` Christoph Lameter
@ 2006-11-15 22:43 ` Martin Bligh
  2006-11-15 22:52 ` Christoph Lameter
  2006-11-16 1:35 ` Jack Steiner
  1 sibling, 1 reply; 36+ messages in thread
From: Martin Bligh @ 2006-11-15 22:43 UTC
  To: Christoph Lameter; +Cc: Jack Steiner, Christian Krafft, linux-mm, linux-kernel

Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Jack Steiner wrote:
>
>> A lot of the core infrastructure is currently missing that is required
>> to describe IO nodes as regular nodes, but in principle, I don't
>> see anything wrong with nodes w/o memory.
>
> Every processor has a local node on which it runs. The kernel places
> memory used by the processor on the local node. Even if we allow
> nodes without memory: We still need to associate a "local" node to the
> processor. If that is across some NUMA interlink then it is going to be
> slower but it will work.
>
> AFAIK It seems to be better to explicitly associate a memory node with a
> processor during bootup in arch code.
>
> Various kernel optimizations rely on local memory. Would we create
> a special case here of a pglist_data structure without a zones structure?
>
> It seems that the contents of pglist_data are targeted to a memory node.
> If we do not have a pglist_data structure then the node would not exist
> for the kernel.
>
> What would the benefit or difference be of having nodes without memory?

Some nodes really don't have memory. Either because it's been
deconfigured, or because it was never there in the first place.
We shouldn't need to kludge that.

All we need is an appropriate zonelist for each node, pointing to
the memory it should be accessing.

M.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 22:43 ` Martin Bligh
@ 2006-11-15 22:52 ` Christoph Lameter
  2006-11-16 0:54 ` KAMEZAWA Hiroyuki
  2006-11-16 2:01 ` Martin Bligh
  0 siblings, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2006-11-15 22:52 UTC
  To: Martin Bligh; +Cc: Jack Steiner, Christian Krafft, linux-mm, linux-kernel

On Wed, 15 Nov 2006, Martin Bligh wrote:

> All we need is an appropriate zonelist for each node, pointing to
> the memory it should be accessing.

But there is no memory on the node. Does the zonelist contain the zones of
the node without memory or not? We simply fall back each allocation to the
next node as if the node was overflowing?
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 22:52 ` Christoph Lameter
@ 2006-11-16 0:54 ` KAMEZAWA Hiroyuki
  2006-11-16 0:57 ` Christoph Lameter
  2006-11-16 2:01 ` Martin Bligh
  1 sibling, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-16 0:54 UTC
  To: Christoph Lameter; +Cc: mbligh, steiner, krafft, linux-mm, linux-kernel

On Wed, 15 Nov 2006 14:52:43 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 15 Nov 2006, Martin Bligh wrote:
>
> > All we need is an appropriate zonelist for each node, pointing to
> > the memory it should be accessing.
>
> But there is no memory on the node. Does the zonelist contain the zones of
> the node without memory or not? We simply fall back each allocation to the
> next node as if the node was overflowing?
>
yes. just fallback.
The zonelist[] doesn't contain empty-zone.

-Kame
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-16 0:54 ` KAMEZAWA Hiroyuki
@ 2006-11-16 0:57 ` Christoph Lameter
  2006-11-16 1:17 ` KAMEZAWA Hiroyuki
  2006-11-16 15:40 ` Christian Krafft
  0 siblings, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2006-11-16 0:57 UTC
  To: KAMEZAWA Hiroyuki; +Cc: mbligh, steiner, krafft, linux-mm, linux-kernel

On Thu, 16 Nov 2006, KAMEZAWA Hiroyuki wrote:

> > But there is no memory on the node. Does the zonelist contain the zones of
> > the node without memory or not? We simply fall back each allocation to the
> > next node as if the node was overflowing?
> yes. just fallback.

Ok, so we got a useless pglist_data struct and the struct zone contains a
zonelist that does not include the zone.

numa_node_id() points to this and we always get allocations redirected to
other nodes. The slab duplicates its per node structures on the fallback
node.

> The zonelist[] doesn't contain empty-zone.

So we will never encounter that zone except when going to the
pglist_data struct through numa_node_id()?
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-16 0:57 ` Christoph Lameter
@ 2006-11-16 1:17 ` KAMEZAWA Hiroyuki
  2006-11-16 15:40 ` Christian Krafft
  1 sibling, 0 replies; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-16 1:17 UTC
  To: Christoph Lameter; +Cc: mbligh, steiner, krafft, linux-mm, linux-kernel

On Wed, 15 Nov 2006 16:57:56 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> numa_node_id() points to this and we always get allocations redirected to
> other nodes. The slab duplicates its per node structures on the fallback
> node.
>
> > The zonelist[] doesn't contain empty-zone.
>
> So we will never encounter that zone except when going to the
> pglist_data struct through numa_node_id()?
>
Some pgdat/zone scanning code will access it.
See: for_each_zone() and populated_zone().

AFAIK, in the 2.6.9 era (i.e. RHEL4), cpus on a memory-less node were moved
to the nearest node, and there was no useless pgdat.

Now there are memory-less nodes. Cpus on a memory-less node are on a pgdat
with an empty zone. I think this is a much simpler way than remapping.

And I think cpus on a memory-less node share something (FSB, switch, etc.).
Tying cpus to a memory-less node may have some benefit.

-Kame
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-16 0:57 ` Christoph Lameter
  2006-11-16 1:17 ` KAMEZAWA Hiroyuki
@ 2006-11-16 15:40 ` Christian Krafft
  2006-11-16 15:49 ` Martin J. Bligh
  2006-11-16 18:46 ` Christoph Lameter
  1 sibling, 2 replies; 36+ messages in thread
From: Christian Krafft @ 2006-11-16 15:40 UTC
  To: Christoph Lameter
  Cc: KAMEZAWA Hiroyuki, mbligh, steiner, linux-mm, linux-kernel

On Wed, 15 Nov 2006 16:57:56 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 16 Nov 2006, KAMEZAWA Hiroyuki wrote:
>
> > > But there is no memory on the node. Does the zonelist contain the zones of
> > > the node without memory or not? We simply fall back each allocation to the
> > > next node as if the node was overflowing?
> > yes. just fallback.
>
> Ok, so we got a useless pglist_data struct and the struct zone contains a
> zonelist that does not include the zone.

Okay, I slowly understand what you are talking about.
I just tried a "numactl --cpunodebind 1 --membind 1 true" which hit an uninitialized zone in slab_node:

	return zone_to_nid(policy->v.zonelist->zones[0]);

I also still don't know if it makes sense to have memoryless nodes, but supporting it does.
So what would be reasonable: to have empty zonelists for those nodes, or to check if zonelists are uninitialized?

--
Mit freundlichen Grussen,
kind regards,

Christian Krafft
IBM Systems & Technology Group,
Linux Kernel Development
IT Specialist
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-16 15:40 ` Christian Krafft
@ 2006-11-16 15:49 ` Martin J. Bligh
  2006-11-16 18:46 ` Christoph Lameter
  1 sibling, 0 replies; 36+ messages in thread
From: Martin J. Bligh @ 2006-11-16 15:49 UTC
  To: Christian Krafft
  Cc: Christoph Lameter, KAMEZAWA Hiroyuki, steiner, linux-mm, linux-kernel

Christian Krafft wrote:
> On Wed, 15 Nov 2006 16:57:56 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
>> On Thu, 16 Nov 2006, KAMEZAWA Hiroyuki wrote:
>>
>>>> But there is no memory on the node. Does the zonelist contain the zones of
>>>> the node without memory or not? We simply fall back each allocation to the
>>>> next node as if the node was overflowing?
>>> yes. just fallback.
>> Ok, so we got a useless pglist_data struct and the struct zone contains a
>> zonelist that does not include the zone.
>
> Okay, I slowly understand what you are talking about.
> I just tried a "numactl --cpunodebind 1 --membind 1 true" which hit an uninitialized zone in slab_node:
>
> 	return zone_to_nid(policy->v.zonelist->zones[0]);
>
> I also still don't know if it makes sense to have memoryless nodes, but supporting it does.
> So what would be reasonable: to have empty zonelists for those nodes, or to check if zonelists are uninitialized?

You don't want empty zonelists on a node containing CPUs, else it
won't know where to allocate from. You just want to make sure that
the zones in that node (if existent) are not contained in *anyone's*
zonelist.

M.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-16 15:40 ` Christian Krafft
  2006-11-16 15:49 ` Martin J. Bligh
@ 2006-11-16 18:46 ` Christoph Lameter
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2006-11-16 18:46 UTC
  To: Christian Krafft
  Cc: KAMEZAWA Hiroyuki, mbligh, steiner, linux-mm, linux-kernel

On Thu, 16 Nov 2006, Christian Krafft wrote:

> Okay, I slowly understand what you are talking about.
> I just tried a "numactl --cpunodebind 1 --membind 1 true" which hit an uninitialized zone in slab_node:
>
> 	return zone_to_nid(policy->v.zonelist->zones[0]);

I think the above should work fine and give the expected OOM since the
node has no memory. The zone struct should redirect via the zonelist to
nodes that have memory for allocations that are not bound to a single node.

> I also still don't know if it makes sense to have memoryless nodes, but supporting it does.
> So what would be reasonable: to have empty zonelists for those nodes, or to check if zonelists are uninitialized?

zonelists of those nodes should contain a list of fallback zones with
available memory.
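The behaviour Christoph describes (out-of-memory for an allocation bound to the memoryless node, fallback for an unbound one) can be sketched as a small user-space model. Everything here is an assumption made for illustration: struct zone_model, alloc_page_model() and the allowed_nodes bitmask are hypothetical, and the kernel's memory policy code is not implemented this way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zone: the node it belongs to and how many pages are free. */
struct zone_model {
	int nid;
	unsigned long free_pages;
};

/*
 * Take one page from the first zone in the zonelist whose node is permitted
 * by allowed_nodes (bit n set means node n may be used). Returns the node id
 * that satisfied the request, or -1 to signal out-of-memory.
 */
static int alloc_page_model(struct zone_model **zonelist, size_t len,
			    unsigned long allowed_nodes)
{
	for (size_t i = 0; i < len; i++) {
		struct zone_model *z = zonelist[i];

		assert(z->nid >= 0 && z->nid < (int)(8 * sizeof(allowed_nodes)));
		if (!(allowed_nodes & (1UL << z->nid)))
			continue;	/* node excluded by the binding */
		if (z->free_pages == 0)
			continue;	/* zone exhausted, try the next one */
		z->free_pages--;
		return z->nid;
	}
	return -1;
}
```

If node 1 has no memory, its zonelist holds only zones of other nodes. A request restricted to node 1 then finds no permitted zone and fails, while an unrestricted request is served from the first fallback zone that still has free pages.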
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 22:52 ` Christoph Lameter
  2006-11-16 0:54 ` KAMEZAWA Hiroyuki
@ 2006-11-16 2:01 ` Martin Bligh
  1 sibling, 0 replies; 36+ messages in thread
From: Martin Bligh @ 2006-11-16 2:01 UTC
  To: Christoph Lameter; +Cc: Jack Steiner, Christian Krafft, linux-mm, linux-kernel

Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Martin Bligh wrote:
>
>> All we need is an appropriate zonelist for each node, pointing to
>> the memory it should be accessing.
>
> But there is no memory on the node. Does the zonelist contain the zones of
> the node without memory or not? We simply fall back each allocation to the
> next node as if the node was overflowing?

Sure. There's no point in putting an empty zone in the zonelist.
We should just skip anything where present_pages is zero.

M.
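Martin's rule, skipping every zone whose present_pages is zero when a zonelist is built, might look roughly like this in a user-space model. This is a sketch, not the kernel's zonelist builder: struct zone_model and build_zonelist() are hypothetical, each node has a single zone, and the fallback order (local node first, then ascending node ids with wrap-around) is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-zone-per-node model; present_pages mirrors the field named in the thread. */
struct zone_model {
	int nid;
	unsigned long present_pages;
};

/*
 * Fill out[] with the fallback order for node `local`: the local zone first,
 * then the remaining nodes in ascending id order with wrap-around. Zones
 * without pages are skipped. Returns the number of entries written.
 */
static size_t build_zonelist(const struct zone_model *zones, size_t nr_nodes,
			     size_t local, const struct zone_model **out)
{
	size_t n = 0;

	assert(local < nr_nodes);
	for (size_t k = 0; k < nr_nodes; k++) {
		const struct zone_model *z = &zones[(local + k) % nr_nodes];

		if (z->present_pages == 0)
			continue;	/* empty zone: never put it in a zonelist */
		out[n++] = z;
	}
	return n;
}
```

With three nodes of which node 1 is empty, the zonelist built for node 1 contains only the zones of nodes 2 and 0, and node 1's own zone appears in no node's zonelist.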
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory
  2006-11-15 22:40 ` Christoph Lameter
  2006-11-15 22:43 ` Martin Bligh
@ 2006-11-16 1:35 ` Jack Steiner
  2006-11-16 1:57 ` Christoph Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Jack Steiner @ 2006-11-16 1:35 UTC
  To: Christoph Lameter; +Cc: Christian Krafft, linux-mm, Martin Bligh, linux-kernel

On Wed, Nov 15, 2006 at 02:40:36PM -0800, Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Jack Steiner wrote:
>
> > A lot of the core infrastructure is currently missing that is required
> > to describe IO nodes as regular nodes, but in principle, I don't
> > see anything wrong with nodes w/o memory.
>
> Every processor has a local node on which it runs. The kernel places
> memory used by the processor on the local node. Even if we allow
> nodes without memory: We still need to associate a "local" node to the
> processor. If that is across some NUMA interlink then it is going to be
> slower but it will work.

True.

>
> AFAIK It seems to be better to explicitly associate a memory node with a
> processor during bootup in arch code.
>
> Various kernel optimizations rely on local memory. Would we create
> a special case here of a pglist_data structure without a zones structure?
>
> It seems that the contents of pglist_data are targeted to a memory node.
> If we do not have a pglist_data structure then the node would not exist
> for the kernel.
>
> What would the benefit or difference be of having nodes without memory?

I doubt that there is a demand for systems with memoryless nodes. However, if the
DIMM(s) on a node fails, I think the system may perform better
with the cpus on the node enabled than it will if they have to be
disabled.

-- jack
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-16 1:35 ` Jack Steiner @ 2006-11-16 1:57 ` Christoph Lameter 2006-11-16 2:09 ` Martin Bligh 2006-11-16 3:28 ` Jack Steiner 0 siblings, 2 replies; 36+ messages in thread From: Christoph Lameter @ 2006-11-16 1:57 UTC (permalink / raw) To: Jack Steiner; +Cc: Christian Krafft, linux-mm, Martin Bligh, linux-kernel On Wed, 15 Nov 2006, Jack Steiner wrote: > I doubt that there is a demand for systems with memoryless nodes. However, if the > DIMM(s) on a node fails, I think the system may perform better > with the cpus on the node enabled than it will if they have to be > disabled. Right now we do not have the capability to remove memory from a node while the system is running. If the DIMMs have failed and we boot up and the system finds out that there is no memory on that node then the cpus can be remapped to the next memory node. That is better than having lots of useless structures allocated.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-16 1:57 ` Christoph Lameter @ 2006-11-16 2:09 ` Martin Bligh 2006-11-16 2:35 ` Christoph Lameter 2006-11-16 3:28 ` Jack Steiner 1 sibling, 1 reply; 36+ messages in thread From: Martin Bligh @ 2006-11-16 2:09 UTC (permalink / raw) To: Christoph Lameter; +Cc: Jack Steiner, Christian Krafft, linux-mm, linux-kernel Christoph Lameter wrote: > On Wed, 15 Nov 2006, Jack Steiner wrote: > >> I doubt that there is a demand for systems with memoryless nodes. However, if the >> DIMM(s) on a node fails, I think the system may perform better >> with the cpus on the node enabled than it will if they have to be >> disabled. > > Right now we do not have the capability to remove memory from a node while > the system is running. > > If the DIMMs have failed and we boot up and the system finds out that > there is no memory on that node then the cpus can be remapped to > the next memory node. That is better than having lots of useless > structures allocated. A node without memory is a node without memory. Simply remapping the cpus to another node and pretending the world is different does not make much sense. Is there some fundamental problem you see with dealing with the nodes as is? Doesn't seem that hard to me. I'm not asking you to put the effort into fixing it, just whether you see some fundamental reason why it can't be fixed? M.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-16 2:09 ` Martin Bligh @ 2006-11-16 2:35 ` Christoph Lameter 0 siblings, 0 replies; 36+ messages in thread From: Christoph Lameter @ 2006-11-16 2:35 UTC (permalink / raw) To: Martin Bligh; +Cc: Jack Steiner, Christian Krafft, linux-mm, linux-kernel On Wed, 15 Nov 2006, Martin Bligh wrote: > A node without memory is a node without memory. Simply remapping the > cpus to another node and pretending the world is different does not > make much sense. It avoids overhead both in terms of memory and processing in the kernel and it seems that is the way we have traditionally dealt with the issue? Nodes without memory require the VM to allocate memory from different nodes in order to build up management structures for the node (these are useless since the node has no memory, caches will be split etc etc). The cpus will always fall back to the next node anyway since their zonelist begins with a zone in a node that has memory. > Is there some fundamental problem you see with dealing with the nodes > as is? Doesn't seem that hard to me. I'm not asking you to put the > effort in to fixing it, just if you see some fundamental reason why > it can't be fixed? I am not sure how memoryless nodes would affect various subsystems. And it seems that this patch only fixes the first issue that they found (?). If we go down this route then we may have to add more special casing to the VM in order to cleanly handle memoryless nodes. But maybe someone else already has experience with memoryless nodes?
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-16 1:57 ` Christoph Lameter 2006-11-16 2:09 ` Martin Bligh @ 2006-11-16 3:28 ` Jack Steiner 1 sibling, 0 replies; 36+ messages in thread From: Jack Steiner @ 2006-11-16 3:28 UTC (permalink / raw) To: Christoph Lameter; +Cc: Christian Krafft, linux-mm, Martin Bligh, linux-kernel On Wed, Nov 15, 2006 at 05:57:27PM -0800, Christoph Lameter wrote: > On Wed, 15 Nov 2006, Jack Steiner wrote: > > > I doubt that there is a demand for systems with memoryless nodes. However, if the > > DIMM(s) on a node fails, I think the system may perform better > > with the cpus on the node enabled than it will if they have to be > > disabled. > > Right now we do not have the capability to remove memory from a node while > the system is running. I know. I'm referring to a DIMM that fails power-on diags or one that is explicitly disabled from the system controller. Clearly a reboot is required in both cases, but the end result is a node with cpus and no memory. As I said earlier, the PROM (for several reasons) automatically disables the cpus on nodes w/o memory.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-15 21:24 ` Christoph Lameter 2006-11-15 21:58 ` Jack Steiner @ 2006-11-15 22:05 ` Martin Bligh 2006-11-15 22:41 ` Christoph Lameter 1 sibling, 1 reply; 36+ messages in thread From: Martin Bligh @ 2006-11-15 22:05 UTC (permalink / raw) To: Christoph Lameter; +Cc: Christian Krafft, linux-mm, linux-kernel Christoph Lameter wrote: > On Wed, 15 Nov 2006, Christian Krafft wrote: > >> When booting a NUMA system with nodes that have no memory (eg by limiting memory), >> bootmem_alloc_core tried to find pages in an uninitialized bootmem_map. > > Why should we support nodes with no memory? If a node has no memory then > its processors and other resources need to be attached to the nearest node > with memory. > > AFAICT The primary role of a node is to manage memory. A node is an arbitrary container object containing one or more of: CPUs Memory IO bus It does not have to contain memory. M.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-15 22:05 ` Martin Bligh @ 2006-11-15 22:41 ` Christoph Lameter 2006-11-15 22:46 ` Martin Bligh ` (3 more replies) 0 siblings, 4 replies; 36+ messages in thread From: Christoph Lameter @ 2006-11-15 22:41 UTC (permalink / raw) To: Martin Bligh; +Cc: Christian Krafft, linux-mm, linux-kernel On Wed, 15 Nov 2006, Martin Bligh wrote: > A node is an arbitrary container object containing one or more of: > > CPUs > Memory > IO bus > > It does not have to contain memory. I have never seen a node on Linux without memory. I have seen nodes without processors and without I/O but not without memory. This seems to be something new?
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-15 22:41 ` Christoph Lameter @ 2006-11-15 22:46 ` Martin Bligh 2006-11-15 22:51 ` Christoph Lameter 2006-11-16 0:26 ` Arnd Bergmann ` (2 subsequent siblings) 3 siblings, 1 reply; 36+ messages in thread From: Martin Bligh @ 2006-11-15 22:46 UTC (permalink / raw) To: Christoph Lameter; +Cc: Christian Krafft, linux-mm, linux-kernel Christoph Lameter wrote: > On Wed, 15 Nov 2006, Martin Bligh wrote: > >> A node is an arbitrary container object containing one or more of: >> >> CPUs >> Memory >> IO bus >> >> It does not have to contain memory. > > I have never seen a node on Linux without memory. I have seen nodes > without processors and without I/O but not without memory. This seems to be > something new? A node was always defined that way. Search back a few years in the lkml archives. We may be finding bugs in the implementation, but the definition has not changed. Supposing we hot-unplugged all the memory in a node? Or, as seems to have happened in this instance, booted with mem=, cutting out memory on that node. M.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-15 22:46 ` Martin Bligh @ 2006-11-15 22:51 ` Christoph Lameter 2006-11-16 0:59 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 36+ messages in thread From: Christoph Lameter @ 2006-11-15 22:51 UTC (permalink / raw) To: Martin Bligh; +Cc: Christian Krafft, linux-mm, linux-kernel On Wed, 15 Nov 2006, Martin Bligh wrote: > Supposing we hot-unplugged all the memory in a node? Or, as seems to have > happened in this instance, booted with mem=, cutting out memory on that > node. So a node with no memory has a pgdat_list structure but no zones? Or empty zones?
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-15 22:51 ` Christoph Lameter @ 2006-11-16 0:59 ` KAMEZAWA Hiroyuki 2006-11-16 1:22 ` Yasunori Goto 0 siblings, 1 reply; 36+ messages in thread From: KAMEZAWA Hiroyuki @ 2006-11-16 0:59 UTC (permalink / raw) To: Christoph Lameter; +Cc: mbligh, krafft, linux-mm, linux-kernel On Wed, 15 Nov 2006 14:51:26 -0800 (PST) Christoph Lameter <clameter@sgi.com> wrote: > On Wed, 15 Nov 2006, Martin Bligh wrote: > > > Supposing we hot-unplugged all the memory in a node? Or, as seems to have > > happened in this instance, booted with mem=, cutting out memory on that > > node. > > So a node with no memory has a pgdat_list structure but no zones? Or empty > zones? > The node just has empty zones. The pgdat/per-cpu area is allocated on another (nearest) node. I hear some vendor's machine has this configuration (ia64, maybe SGI or HP):

Node0: CPUx0 + XXXGb memory
Node1: CPUx2 + 16MB memory
Node2: CPUx2 + 16MB memory

The memory of Node1 and Node2 is trimmed at boot by GRANULE alignment. Then, the final view is:

Node0 : memory-only-node
Node1 : cpu-only-node
Node2 : cpu-only-node

-Kame
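Kame's remark that the 16MB on Node1 and Node2 is trimmed away by GRANULE alignment can be illustrated with a little arithmetic. The sketch below assumes that trimming means rounding a region's start up and its end down to a granule boundary; that reading, the function name usable_bytes_after_trim(), and the sample addresses in the note that follows are assumptions for illustration, not taken from the ia64 code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of [start, end) that survive when the region is shrunk to
 * granule boundaries: start is rounded up, end is rounded down.
 * `granule` must be a non-zero power of two.
 */
uint64_t usable_bytes_after_trim(uint64_t start, uint64_t end,
                                 uint64_t granule)
{
    assert(granule != 0 && (granule & (granule - 1)) == 0);
    assert(start <= end);

    uint64_t aligned_start = (start + granule - 1) & ~(granule - 1);
    uint64_t aligned_end = end & ~(granule - 1);

    /* A region smaller than, or straddling, one granule can vanish. */
    return aligned_end > aligned_start ? aligned_end - aligned_start : 0;
}
```

Under these assumptions, a 16MB region that starts at a 4MB offset leaves nothing usable with a 16MB granule, while the same region starting at 0 survives intact. With a larger granule, a 16MB region disappears regardless of where it starts, which would leave the node with CPUs but no memory.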
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-16 0:59 ` KAMEZAWA Hiroyuki @ 2006-11-16 1:22 ` Yasunori Goto 0 siblings, 0 replies; 36+ messages in thread From: Yasunori Goto @ 2006-11-16 1:22 UTC (permalink / raw) To: Christoph Lameter Cc: mbligh, krafft, linux-mm, linux-kernel, KAMEZAWA Hiroyuki

> I hear some vendor's machine has this configuration (ia64, maybe SGI or HP):
>
> Node0: CPUx0 + XXXGb memory
> Node1: CPUx2 + 16MB memory
> Node2: CPUx2 + 16MB memory
>
> The memory of Node1 and Node2 is trimmed at boot by GRANULE alignment.
> Then, the final view is:
> Node0 : memory-only-node
> Node1 : cpu-only-node
> Node2 : cpu-only-node

IIRC, this is an HP box. It uses memory interleaving among nodes. Bye. -- Yasunori Goto
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-15 22:41 ` Christoph Lameter 2006-11-15 22:46 ` Martin Bligh @ 2006-11-16 0:26 ` Arnd Bergmann 2006-11-16 0:45 ` Christoph Lameter 2006-11-16 0:44 ` Jesper Juhl 2006-11-16 15:21 ` Lee Schermerhorn 3 siblings, 1 reply; 36+ messages in thread From: Arnd Bergmann @ 2006-11-16 0:26 UTC (permalink / raw) To: Christoph Lameter; +Cc: Martin Bligh, Christian Krafft, linux-mm, linux-kernel On Wednesday 15 November 2006 23:41, Christoph Lameter wrote: > On Wed, 15 Nov 2006, Martin Bligh wrote: > > A node is an arbitrary container object containing one or more of: > > > > CPUs > > Memory > > IO bus + SPUs on a Cell processor > > It does not have to contain memory. > > I have never seen a node on Linux without memory. I have seen nodes > without processors and without I/O but not without memory. This seems to be > something new? In this particular case, we have a dual-socket Cell/B.E. blade server, where each of the two CPU-socket/south-bridge/memory combinations is treated as a separate node. The two points that make this tricky are: - we want to be able to boot with the 'mem=512M' option, which effectively disables the memory on the second node (each node has 512MiB). - Each node has 8 SPUs, all of which we want to use. In order to use an SPU, we call __add_pages to register the local memory on it, so we have struct page pointers we can hand out to user mappings with ->nopage(). The __add_pages call needs to do node-local allocations (there are probably more allocations that have the same problem, but this is the first one that crashes), which oops when there is no memory registered at all for that node, instead of returning an error or falling back on a non-local allocation. Arnd <><
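The behaviour Arnd asks for, a node-local allocation that either falls back to another node or returns an error instead of oopsing, can be sketched as a toy user-space model. Everything here is hypothetical: struct node_model and alloc_pages_fallback() are illustrative names, and the simple round-robin fallback stands in for whatever ordering the kernel's zonelists would actually provide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node accounting; not the kernel's pglist_data. */
struct node_model {
    unsigned long free_pages;
};

/*
 * Try to take `pages` from the preferred node first, then from the
 * remaining nodes in index order (wrapping around). Returns the id of
 * the node that satisfied the request, or -1 if no node could.
 */
int alloc_pages_fallback(struct node_model *nodes, int nr_nodes,
                         int preferred, unsigned long pages)
{
    assert(nodes != NULL && nr_nodes > 0);
    assert(preferred >= 0 && preferred < nr_nodes);

    for (int i = 0; i < nr_nodes; i++) {
        int nid = (preferred + i) % nr_nodes;

        if (nodes[nid].free_pages >= pages) {
            nodes[nid].free_pages -= pages;
            return nid; /* may be a non-local node */
        }
    }
    return -1; /* report failure to the caller instead of crashing */
}
```

In the Cell blade scenario described above (two nodes, memory limited to the first one), a request that prefers node 1 would be served from node 0, and a request that no node can satisfy produces an error the caller can handle.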
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-16 0:26 ` Arnd Bergmann @ 2006-11-16 0:45 ` Christoph Lameter 2006-11-16 13:08 ` Arnd Bergmann 0 siblings, 1 reply; 36+ messages in thread From: Christoph Lameter @ 2006-11-16 0:45 UTC (permalink / raw) To: Arnd Bergmann; +Cc: Martin Bligh, Christian Krafft, linux-mm, linux-kernel On Thu, 16 Nov 2006, Arnd Bergmann wrote: > - we want to be able to boot with the 'mem=512M' option, which effectively > disables the memory on the second node (each node has 512MiB). > - Each node has 8 SPUs, all of which we want to use. In order to use an > SPU, we call __add_pages to register the local memory on it, so we have > struct page pointers we can hand out to user mappings with ->nopage(). This is more like the bringup of a processor right? You need to have the memory online before the processor is brought up otherwise the slab cannot properly allocate its structures on the node when the per node portion is brought up. The page allocator has similar issues.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-16 0:45 ` Christoph Lameter @ 2006-11-16 13:08 ` Arnd Bergmann 0 siblings, 0 replies; 36+ messages in thread From: Arnd Bergmann @ 2006-11-16 13:08 UTC (permalink / raw) To: Christoph Lameter; +Cc: Martin Bligh, Christian Krafft, linux-mm, linux-kernel On Thursday 16 November 2006 01:45, Christoph Lameter wrote: > On Thu, 16 Nov 2006, Arnd Bergmann wrote: > > > - we want to be able to boot with the 'mem=512M' option, which effectively > > disables the memory on the second node (each node has 512MiB). > > - Each node has 8 SPUs, all of which we want to use. In order to use an > > SPU, we call __add_pages to register the local memory on it, so we have > > struct page pointers we can hand out to user mappings with ->nopage(). > > This is more like the bringup of a processor right? You need > to have the memory online before the processor is brought up otherwise > the slab cannot properly allocate its structures on the node when the > per node portion is brought up. The page allocator has similar issues. No, that's not really the issue here. The memory we're trying to add to the mem_map cannot be used for kernel allocations at all and is never entered into the buddy allocator. It can only be used for applications running on an SPU itself. So the problem is not the order in which we do things, but the fact that the node data structure has not been initialized, and never will be, when we add the SPU to the node. Arnd <><
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-15 22:41 ` Christoph Lameter 2006-11-15 22:46 ` Martin Bligh 2006-11-16 0:26 ` Arnd Bergmann @ 2006-11-16 0:44 ` Jesper Juhl 2006-11-16 0:46 ` Christoph Lameter 2006-11-16 15:21 ` Lee Schermerhorn 3 siblings, 1 reply; 36+ messages in thread From: Jesper Juhl @ 2006-11-16 0:44 UTC (permalink / raw) To: Christoph Lameter; +Cc: Martin Bligh, Christian Krafft, linux-mm, linux-kernel On 15/11/06, Christoph Lameter <clameter@sgi.com> wrote: > On Wed, 15 Nov 2006, Martin Bligh wrote: > > > A node is an arbitrary container object containing one or more of: > > > > CPUs > > Memory > > IO bus > > > > It does not have to contain memory. > > I have never seen a node on Linux without memory. I have seen nodes > without processors and without I/O but not without memory. This seems to be > something new? > What about SMP Opteron boards that have RAM slots for each CPU? With two (or more) CPUs and only memory slots populated for one of them, wouldn't that count as multiple NUMA nodes but only one of them with memory? That would seem to be a pretty common thing that could happen. -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-16 0:44 ` Jesper Juhl @ 2006-11-16 0:46 ` Christoph Lameter 0 siblings, 0 replies; 36+ messages in thread From: Christoph Lameter @ 2006-11-16 0:46 UTC (permalink / raw) To: Jesper Juhl; +Cc: Martin Bligh, Christian Krafft, linux-mm, linux-kernel On Thu, 16 Nov 2006, Jesper Juhl wrote: > What about SMP Opteron boards that have RAM slots for each CPU? > With two (or more) CPUs and only memory slots populated for one of > them, wouldn't that count as multiple NUMA nodes but only one of them > with memory? > That would seem to be a pretty common thing that could happen. I think so far we have handled these as two processors on one node.
* Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory 2006-11-15 22:41 ` Christoph Lameter ` (2 preceding siblings ...) 2006-11-16 0:44 ` Jesper Juhl @ 2006-11-16 15:21 ` Lee Schermerhorn 3 siblings, 0 replies; 36+ messages in thread From: Lee Schermerhorn @ 2006-11-16 15:21 UTC (permalink / raw) To: Christoph Lameter; +Cc: Martin Bligh, Christian Krafft, linux-mm, linux-kernel On Wed, 2006-11-15 at 14:41 -0800, Christoph Lameter wrote: > On Wed, 15 Nov 2006, Martin Bligh wrote: > > > A node is an arbitrary container object containing one or more of: > > > > CPUs > > Memory > > IO bus > > > > It does not have to contain memory. > > I have never seen a node on Linux without memory. I have seen nodes > without processors and without I/O but not without memory. This seems to be > something new? I sent this out earlier in response to another message from Christoph regarding nodes w/o memory. Don't know if it made it... >On Fri, 2006-11-10 at 10:16 -0800, Christoph Lameter wrote: >> On Wed, 8 Nov 2006, KAMEZAWA Hiroyuki wrote: >> >> > I wonder there are no code for creating NODE_DATA() for device-only-node. >> >> On IA64 we remap nodes with no memory / cpus to the nearest node with >> memory. I think that is sufficient. I don't think this happens anymore. Back in the ~2.6.5 days, when we would configure our numa platforms with 100% of memory interleaved [in hardware at cache line granularity], the cpus would move to the interleaved "pseudo-node" and the memoryless nodes would be removed. numactl --hardware would show something like this:

# uname -r
2.6.5-7.244-default
# numactl --hardware
available: 1 nodes (0-0)
node 0 size: 65443 MB
node 0 free: 64506 MB

I started seeing different behavior about the time SPARSEMEM went in. Now, with a 2.6.16 base kernel [same platform, hardware interleaved memory], I see:

# uname -r
2.6.16.21-0.8-default
# numactl --hardware
available: 5 nodes (0-4)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 0 MB
node 2 free: 0 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 65439 MB
node 4 free: 64492 MB
node distances:
node   0   1   2   3   4
  0:  10  17  17  17  14
  1:  17  10  17  17  14
  2:  17  17  10  17  14
  3:  17  17  17  10  14
  4:  14  14  14  14  10

[Aside: The firmware/SLIT says that the interleaved memory is closer to all nodes than other nodes' memory. This has interesting implications for the "overflow" zone lists...] Lee
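Lee's aside about the "overflow" zone lists can be made concrete with the distance table from the numactl --hardware output quoted above. The sketch below orders fallback nodes purely by SLIT distance and skips nodes without memory; fallback_order() is a hypothetical helper, ties are broken by lowest node id, and this is not the kernel's actual node-ordering algorithm.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 5

/* Distances copied from the numactl --hardware output quoted above. */
static const int node_distance[NR_NODES][NR_NODES] = {
    {10, 17, 17, 17, 14},
    {17, 10, 17, 17, 14},
    {17, 17, 10, 17, 14},
    {17, 17, 17, 10, 14},
    {14, 14, 14, 14, 10},
};

/*
 * Fill `order` with the nodes that have memory, nearest to `from`
 * first. Returns how many nodes were written.
 */
int fallback_order(int from, const bool has_memory[NR_NODES],
                   int order[NR_NODES])
{
    bool used[NR_NODES] = { false };
    int count = 0;

    assert(from >= 0 && from < NR_NODES);
    for (;;) {
        int best = -1;

        /* Pick the closest not-yet-listed node that has memory. */
        for (int nid = 0; nid < NR_NODES; nid++) {
            if (used[nid] || !has_memory[nid])
                continue;
            if (best < 0 ||
                node_distance[from][nid] < node_distance[from][best])
                best = nid;
        }
        if (best < 0)
            break;
        used[best] = true;
        order[count++] = best;
    }
    return count;
}
```

With only node 4 populated, as in the 2.6.16 output, every CPU node falls back to node 4. If nodes 0 to 3 also had memory, the order seen from node 0 would be 0, 4, 1, 2, 3: the interleaved node would be preferred over every other remote node, which is the implication the aside points at.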