* [patch 0/3] Fixes for NUMA allocations on memoryless nodes
@ 2007-06-12 20:48 clameter
2007-06-12 20:48 ` [patch 1/3] NUMA: introduce node_memory_map clameter
` (2 more replies)
0 siblings, 3 replies; 40+ messages in thread
From: clameter @ 2007-06-12 20:48 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn
This patchset fixes the dysfunctional behavior of for GFP_THISNODE and MPOL_INTERLEAVE
on systems with memoryless nodes. We introduce a new "node_memory_map" to be able to
determine efficiently if a node has memory.
Tested on IA64 NUMA and compile tested on i386 SMP (to verify that the new fallbacks work right)
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 40+ messages in thread* [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 20:48 [patch 0/3] Fixes for NUMA allocations on memoryless nodes clameter @ 2007-06-12 20:48 ` clameter 2007-06-12 21:03 ` David Rientjes 2007-06-12 20:48 ` [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes clameter 2007-06-12 20:48 ` [patch 3/3] Fix MPOL_INTERLEAVE " clameter 2 siblings, 1 reply; 40+ messages in thread From: clameter @ 2007-06-12 20:48 UTC (permalink / raw) To: akpm; +Cc: linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn [-- Attachment #1: node_memory_map --] [-- Type: text/plain, Size: 4293 bytes --] It is necessary to know if nodes have memory since we have recently begun to add support for memoryless nodes. For that purpose we introduce a new bitmap called node_memory_map A node has its bit in node_memory_map set if it has memory. If a node has memory then it has at least one zone defined in its pgdat structure that is located in the pgdat itself. The node_memory_map can then be used in various places to insure that we do the right thing when we encounter a memoryless node. Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.22-rc4-mm2/include/linux/nodemask.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/nodemask.h 2007-06-12 12:32:38.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/nodemask.h 2007-06-12 13:45:44.000000000 -0700 @@ -64,12 +64,16 @@ * * int node_online(node) Is some node online? * int node_possible(node) Is some node possible? + * int node_memory(node) Does a node have memory? * * int any_online_node(mask) First online node in mask * * node_set_online(node) set bit 'node' in node_online_map * node_set_offline(node) clear bit 'node' in node_online_map * + * node_set_memory(node) set bit 'node' in node_memory_map + * node_clear_memoryd(node) clear bit 'node' in node_memory_map + * * for_each_node(node) for-loop node over node_possible_map * for_each_online_node(node) for-loop node over node_online_map * @@ -344,12 +348,14 @@ static inline void __nodes_remap(nodemas extern nodemask_t node_online_map; extern nodemask_t node_possible_map; +extern nodemask_t node_memory_map; #if MAX_NUMNODES > 1 #define num_online_nodes() nodes_weight(node_online_map) #define num_possible_nodes() nodes_weight(node_possible_map) #define node_online(node) node_isset((node), node_online_map) #define node_possible(node) node_isset((node), node_possible_map) +#define node_memory(node) node_isset((node), node_memory_map) #define first_online_node first_node(node_online_map) #define next_online_node(nid) next_node((nid), node_online_map) extern int nr_node_ids; @@ -358,6 +364,8 @@ extern int nr_node_ids; #define num_possible_nodes() 1 #define node_online(node) ((node) == 0) #define node_possible(node) ((node) == 0) +#define node_memory(node) ((node) == 0) +#define node_populated(node) ((node) == 0) #define first_online_node 0 #define next_online_node(nid) (MAX_NUMNODES) #define nr_node_ids 1 @@ -375,6 +383,9 @@ extern int nr_node_ids; #define node_set_online(node) set_bit((node), node_online_map.bits) #define node_set_offline(node) clear_bit((node), node_online_map.bits) +#define node_set_memory(node) set_bit((node), node_memory_map.bits) +#define node_clear_memory(node) clear_bit((node), node_memory_map.bits) + #define for_each_node(node) for_each_node_mask((node), node_possible_map) #define for_each_online_node(node) for_each_node_mask((node), node_online_map) Index: linux-2.6.22-rc4-mm2/mm/page_alloc.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/page_alloc.c 2007-06-12 12:32:38.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/page_alloc.c 2007-06-12 12:32:43.000000000 -0700 @@ -54,6 +54,9 @@ nodemask_t node_online_map __read_mostly EXPORT_SYMBOL(node_online_map); nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL; EXPORT_SYMBOL(node_possible_map); +nodemask_t node_memory_map __read_mostly = NODE_MASK_NONE; +EXPORT_SYMBOL(node_memory_map); + unsigned long totalram_pages __read_mostly; unsigned long totalreserve_pages __read_mostly; long nr_swap_pages; @@ -2299,6 +2302,9 @@ static void build_zonelists(pg_data_t *p /* calculate node order -- i.e., DMA last! */ build_zonelists_in_zone_order(pgdat, j); } + + if (pgdat->node_present_pages) + node_set_memory(local_node); } /* Construct the zonelist performance cache - see further mmzone.h */ -- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 20:48 ` [patch 1/3] NUMA: introduce node_memory_map clameter @ 2007-06-12 21:03 ` David Rientjes 2007-06-12 21:08 ` Christoph Lameter 0 siblings, 1 reply; 40+ messages in thread From: David Rientjes @ 2007-06-12 21:03 UTC (permalink / raw) To: clameter; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, clameter@sgi.com wrote: > Index: linux-2.6.22-rc4-mm2/include/linux/nodemask.h > =================================================================== > --- linux-2.6.22-rc4-mm2.orig/include/linux/nodemask.h 2007-06-12 12:32:38.000000000 -0700 > +++ linux-2.6.22-rc4-mm2/include/linux/nodemask.h 2007-06-12 13:45:44.000000000 -0700 > @@ -64,12 +64,16 @@ > * > * int node_online(node) Is some node online? > * int node_possible(node) Is some node possible? > + * int node_memory(node) Does a node have memory? > * This name doesn't make sense; wouldn't node_has_memory() be better? > * int any_online_node(mask) First online node in mask > * > * node_set_online(node) set bit 'node' in node_online_map > * node_set_offline(node) clear bit 'node' in node_online_map > * > + * node_set_memory(node) set bit 'node' in node_memory_map > + * node_clear_memoryd(node) clear bit 'node' in node_memory_map > + * Extra 'd' there in node_clear_memory(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:03 ` David Rientjes @ 2007-06-12 21:08 ` Christoph Lameter 2007-06-12 21:10 ` David Rientjes 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-12 21:08 UTC (permalink / raw) To: David Rientjes; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, David Rientjes wrote: > > * int node_online(node) Is some node online? > > * int node_possible(node) Is some node possible? > > + * int node_memory(node) Does a node have memory? > > * > > This name doesn't make sense; wouldn't node_has_memory() be better? node_set_has_memory and node_clear_has_memory sounds a bit strange. > > > * int any_online_node(mask) First online node in mask > > * > > * node_set_online(node) set bit 'node' in node_online_map > > * node_set_offline(node) clear bit 'node' in node_online_map > > * > > + * node_set_memory(node) set bit 'node' in node_memory_map > > + * node_clear_memoryd(node) clear bit 'node' in node_memory_map > > + * > > Extra 'd' there in node_clear_memory(). Tss.... It survived my editing. Fix typo Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.22-rc4-mm2/include/linux/nodemask.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/nodemask.h 2007-06-12 14:07:52.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/nodemask.h 2007-06-12 14:08:03.000000000 -0700 @@ -72,7 +72,7 @@ * node_set_offline(node) clear bit 'node' in node_online_map * * node_set_memory(node) set bit 'node' in node_memory_map - * node_clear_memoryd(node) clear bit 'node' in node_memory_map + * node_clear_memory(node) clear bit 'node' in node_memory_map * * for_each_node(node) for-loop node over node_possible_map * for_each_online_node(node) for-loop node over node_online_map -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:08 ` Christoph Lameter @ 2007-06-12 21:10 ` David Rientjes 2007-06-12 21:27 ` Christoph Lameter 2007-06-12 21:36 ` Nishanth Aravamudan 0 siblings, 2 replies; 40+ messages in thread From: David Rientjes @ 2007-06-12 21:10 UTC (permalink / raw) To: Christoph Lameter Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, Christoph Lameter wrote: > On Tue, 12 Jun 2007, David Rientjes wrote: > > > > * int node_online(node) Is some node online? > > > * int node_possible(node) Is some node possible? > > > + * int node_memory(node) Does a node have memory? > > > * > > > > This name doesn't make sense; wouldn't node_has_memory() be better? > > node_set_has_memory and node_clear_has_memory sounds a bit strange. > This will probably be one of those things that people see in the source and have to look up everytime. node_has_memory() is straight-forward and to the point. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:10 ` David Rientjes @ 2007-06-12 21:27 ` Christoph Lameter 2007-06-12 21:34 ` David Rientjes 2007-06-12 21:36 ` Nishanth Aravamudan 1 sibling, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-12 21:27 UTC (permalink / raw) To: David Rientjes; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, David Rientjes wrote: > On Tue, 12 Jun 2007, Christoph Lameter wrote: > > > On Tue, 12 Jun 2007, David Rientjes wrote: > > > > > > * int node_online(node) Is some node online? > > > > * int node_possible(node) Is some node possible? > > > > + * int node_memory(node) Does a node have memory? > > > > * > > > > > > This name doesn't make sense; wouldn't node_has_memory() be better? > > > > node_set_has_memory and node_clear_has_memory sounds a bit strange. > > > > This will probably be one of those things that people see in the source > and have to look up everytime. node_has_memory() is straight-forward and > to the point. But node_possible is similar to node_memory. Would you also prefer node_is_possible over node_possible? node_is_online? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:27 ` Christoph Lameter @ 2007-06-12 21:34 ` David Rientjes 2007-06-12 21:37 ` Christoph Lameter 0 siblings, 1 reply; 40+ messages in thread From: David Rientjes @ 2007-06-12 21:34 UTC (permalink / raw) To: Christoph Lameter Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, Christoph Lameter wrote: > > > On Tue, 12 Jun 2007, David Rientjes wrote: > > > > > > > > * int node_online(node) Is some node online? > > > > > * int node_possible(node) Is some node possible? > > > > > + * int node_memory(node) Does a node have memory? > > > > > * > > > > > > > > This name doesn't make sense; wouldn't node_has_memory() be better? > > > > > > node_set_has_memory and node_clear_has_memory sounds a bit strange. > > > > > > > This will probably be one of those things that people see in the source > > and have to look up everytime. node_has_memory() is straight-forward and > > to the point. > > But node_possible is similar to node_memory. > > Would you also prefer node_is_possible over node_possible? > > node_is_online? > I think the problem is that online and possible are adverbs and adjectives, respectively, and memory is a noun. That's why when it appears in source code, it doesn't make a lot of sense for node_memory to return a boolean value. I suspect it would return the memory, whatever that is. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:34 ` David Rientjes @ 2007-06-12 21:37 ` Christoph Lameter 2007-06-12 21:38 ` David Rientjes 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-12 21:37 UTC (permalink / raw) To: David Rientjes; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, David Rientjes wrote: > I think the problem is that online and possible are adverbs and > adjectives, respectively, and memory is a noun. That's why when it > appears in source code, it doesn't make a lot of sense for node_memory to > return a boolean value. I suspect it would return the memory, whatever > that is. A sentence such as "This is a memory node" in opposition to "This is a memoryless nide" would put memory to use as an adjective. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:37 ` Christoph Lameter @ 2007-06-12 21:38 ` David Rientjes 0 siblings, 0 replies; 40+ messages in thread From: David Rientjes @ 2007-06-12 21:38 UTC (permalink / raw) To: Christoph Lameter Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, Christoph Lameter wrote: > A sentence such as "This is a memory node" in opposition to "This is a > memoryless nide" would put memory to use as an adjective. > Ok, is_memory_node() is great too ;) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:10 ` David Rientjes 2007-06-12 21:27 ` Christoph Lameter @ 2007-06-12 21:36 ` Nishanth Aravamudan 2007-06-12 21:39 ` Christoph Lameter 2007-06-13 9:14 ` Andy Whitcroft 1 sibling, 2 replies; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-12 21:36 UTC (permalink / raw) To: David Rientjes; +Cc: Christoph Lameter, akpm, linux-mm, ak, Lee Schermerhorn On 12.06.2007 [14:10:44 -0700], David Rientjes wrote: > On Tue, 12 Jun 2007, Christoph Lameter wrote: > > > On Tue, 12 Jun 2007, David Rientjes wrote: > > > > > > * int node_online(node) Is some node online? > > > > * int node_possible(node) Is some node possible? > > > > + * int node_memory(node) Does a node have memory? > > > > * > > > > > > This name doesn't make sense; wouldn't node_has_memory() be better? > > > > node_set_has_memory and node_clear_has_memory sounds a bit strange. > > > > This will probably be one of those things that people see in the > source and have to look up everytime. node_has_memory() is > straight-forward and to the point. Indeed, I did and (I like to think) I helped write the patches :) Why not just make the boolean sensible? We can keep node_set_memory() node_clear_memory() but change node_memory() to node_has_memory() ? Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:36 ` Nishanth Aravamudan @ 2007-06-12 21:39 ` Christoph Lameter 2007-06-12 21:42 ` Nishanth Aravamudan 2007-06-13 9:14 ` Andy Whitcroft 1 sibling, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-12 21:39 UTC (permalink / raw) To: Nishanth Aravamudan; +Cc: David Rientjes, akpm, linux-mm, ak, Lee Schermerhorn > Indeed, I did and (I like to think) I helped write the patches :) The patches contain your signoff because of your authorship... > We can keep > > node_set_memory() > node_clear_memory() > > but change node_memory() to node_has_memory() ? Hmmm.... That deviates from how the other node_xxx() things are so it disturbed my sense of order. We have no three word node_is/has_xxx yet. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:39 ` Christoph Lameter @ 2007-06-12 21:42 ` Nishanth Aravamudan 2007-06-12 21:45 ` David Rientjes 2007-06-12 22:26 ` Christoph Lameter 0 siblings, 2 replies; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-12 21:42 UTC (permalink / raw) To: Christoph Lameter; +Cc: David Rientjes, akpm, linux-mm, ak, Lee Schermerhorn On 12.06.2007 [14:39:21 -0700], Christoph Lameter wrote: > > > Indeed, I did and (I like to think) I helped write the patches :) > > The patches contain your signoff because of your authorship... > > > We can keep > > > > node_set_memory() > > node_clear_memory() > > > > but change node_memory() to node_has_memory() ? > > Hmmm.... That deviates from how the other node_xxx() things are so it > disturbed my sense of order. We have no three word node_is/has_xxx > yet. Yeah, I realize that -- but I also agree with David that node_memory() is not intuitive at all. And we've already admitted that a few of the macros in there are inconsistent already :) Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:42 ` Nishanth Aravamudan @ 2007-06-12 21:45 ` David Rientjes 2007-06-12 22:26 ` Christoph Lameter 1 sibling, 0 replies; 40+ messages in thread From: David Rientjes @ 2007-06-12 21:45 UTC (permalink / raw) To: Nishanth Aravamudan Cc: Christoph Lameter, akpm, linux-mm, ak, Lee Schermerhorn On Tue, 12 Jun 2007, Nishanth Aravamudan wrote: > Yeah, I realize that -- but I also agree with David that > > node_memory() > > is not intuitive at all. And we've already admitted that a few of the > macros in there are inconsistent already :) > Creating new macros that conform to the others in terms of node_<single-word-here>() is great, but useless if it isn't readable in soruce code. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:42 ` Nishanth Aravamudan 2007-06-12 21:45 ` David Rientjes @ 2007-06-12 22:26 ` Christoph Lameter 2007-06-12 22:32 ` Nishanth Aravamudan 1 sibling, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-12 22:26 UTC (permalink / raw) To: Nishanth Aravamudan Cc: David Rientjes, akpm, linux-mm, ak, pj, Lee Schermerhorn On Tue, 12 Jun 2007, Nishanth Aravamudan wrote: > is not intuitive at all. And we've already admitted that a few of the > macros in there are inconsistent already :) I do not want to add another inconsistency. if (node_memory(node)) is pretty clear as far as I can tell. Some of the macros in include/linux/nodemask.h are inconsistent. How can we make those consistent. Could you come up with a consistent naming scheme? Add the explanation for that scheme. But that should be a separate patch. And the patch would have to change all uses of those macros in the kernel source tree. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 22:26 ` Christoph Lameter @ 2007-06-12 22:32 ` Nishanth Aravamudan 0 siblings, 0 replies; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-12 22:32 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, akpm, linux-mm, ak, pj, Lee Schermerhorn On 12.06.2007 [15:26:25 -0700], Christoph Lameter wrote: > On Tue, 12 Jun 2007, Nishanth Aravamudan wrote: > > > is not intuitive at all. And we've already admitted that a few of the > > macros in there are inconsistent already :) > > I do not want to add another inconsistency. > > if (node_memory(node)) > > is pretty clear as far as I can tell. Fair enough. > Some of the macros in include/linux/nodemask.h are inconsistent. How > can we make those consistent. Could you come up with a consistent > naming scheme? Add the explanation for that scheme. I'll think it over :) > But that should be a separate patch. And the patch would have to > change all uses of those macros in the kernel source tree. Agreed, will be a down-the-road thing. In the meantime, I'm rebasing the hugetlb pool patch and sysfs interface patch on top of your stack and will repost. Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 1/3] NUMA: introduce node_memory_map 2007-06-12 21:36 ` Nishanth Aravamudan 2007-06-12 21:39 ` Christoph Lameter @ 2007-06-13 9:14 ` Andy Whitcroft 1 sibling, 0 replies; 40+ messages in thread From: Andy Whitcroft @ 2007-06-13 9:14 UTC (permalink / raw) To: Nishanth Aravamudan Cc: David Rientjes, Christoph Lameter, akpm, linux-mm, ak, Lee Schermerhorn Nishanth Aravamudan wrote: > On 12.06.2007 [14:10:44 -0700], David Rientjes wrote: >> On Tue, 12 Jun 2007, Christoph Lameter wrote: >> >>> On Tue, 12 Jun 2007, David Rientjes wrote: >>> >>>>> * int node_online(node) Is some node online? >>>>> * int node_possible(node) Is some node possible? >>>>> + * int node_memory(node) Does a node have memory? >>>>> * >>>> This name doesn't make sense; wouldn't node_has_memory() be better? >>> node_set_has_memory and node_clear_has_memory sounds a bit strange. >>> >> This will probably be one of those things that people see in the >> source and have to look up everytime. node_has_memory() is >> straight-forward and to the point. > > Indeed, I did and (I like to think) I helped write the patches :) > > Why not just make the boolean sensible? > > We can keep > > node_set_memory() > node_clear_memory() node_clear_memory() sounds like something to memset all of that node's memory or something. > but change node_memory() to node_has_memory() ? -apw -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-12 20:48 [patch 0/3] Fixes for NUMA allocations on memoryless nodes clameter 2007-06-12 20:48 ` [patch 1/3] NUMA: introduce node_memory_map clameter @ 2007-06-12 20:48 ` clameter 2007-06-12 21:03 ` David Rientjes 2007-06-13 21:10 ` Lee Schermerhorn 2007-06-12 20:48 ` [patch 3/3] Fix MPOL_INTERLEAVE " clameter 2 siblings, 2 replies; 40+ messages in thread From: clameter @ 2007-06-12 20:48 UTC (permalink / raw) To: akpm; +Cc: linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn [-- Attachment #1: gfp_thisnode_fix --] [-- Type: text/plain, Size: 1785 bytes --] GFP_THISNODE checks that the zone selected is within the pgdat (node) of the first zone of a nodelist. That only works if the node has memory. A memoryless node will have its first zone on another pgdat (node). Thus GFP_THISNODE may be returning memory on other nodes. GFP_THISNODE should fail if there is no local memory on a node. So we add a check to verify that the node specified has memory in alloc_pages_node(). If the node has no memory then return NULL. The case of alloc_pages(GFP_THISNODE) is not changed. alloc_pages() (with no memory policies in effect) is understood to prefer the current node. If a process is running on a node with no memory then its default allocations come from the next neighboring node. GFP_THISNODE will then force the memory to come from that node. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Index: linux-2.6.22-rc4-mm2/include/linux/gfp.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/gfp.h 2007-06-12 12:33:37.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/gfp.h 2007-06-12 12:38:37.000000000 -0700 @@ -175,6 +175,13 @@ static inline struct page *alloc_pages_n if (nid < 0) nid = numa_node_id(); + /* + * Check for the special case that GFP_THISNODE is used on a + * memoryless node + */ + if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) + return NULL; + return __alloc_pages(gfp_mask, order, NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask)); } -- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-12 20:48 ` [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes clameter @ 2007-06-12 21:03 ` David Rientjes 2007-06-12 21:07 ` Christoph Lameter 2007-06-13 21:10 ` Lee Schermerhorn 1 sibling, 1 reply; 40+ messages in thread From: David Rientjes @ 2007-06-12 21:03 UTC (permalink / raw) To: clameter; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, clameter@sgi.com wrote: > Index: linux-2.6.22-rc4-mm2/include/linux/gfp.h > =================================================================== > --- linux-2.6.22-rc4-mm2.orig/include/linux/gfp.h 2007-06-12 12:33:37.000000000 -0700 > +++ linux-2.6.22-rc4-mm2/include/linux/gfp.h 2007-06-12 12:38:37.000000000 -0700 > @@ -175,6 +175,13 @@ static inline struct page *alloc_pages_n > if (nid < 0) > nid = numa_node_id(); > > + /* > + * Check for the special case that GFP_THISNODE is used on a > + * memoryless node > + */ > + if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) > + return NULL; > + unlikely()? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-12 21:03 ` David Rientjes @ 2007-06-12 21:07 ` Christoph Lameter 2007-06-12 21:08 ` David Rientjes 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-12 21:07 UTC (permalink / raw) To: David Rientjes; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, David Rientjes wrote: > > =================================================================== > > --- linux-2.6.22-rc4-mm2.orig/include/linux/gfp.h 2007-06-12 12:33:37.000000000 -0700 > > +++ linux-2.6.22-rc4-mm2/include/linux/gfp.h 2007-06-12 12:38:37.000000000 -0700 > > @@ -175,6 +175,13 @@ static inline struct page *alloc_pages_n > > if (nid < 0) > > nid = numa_node_id(); > > > > + /* > > + * Check for the special case that GFP_THISNODE is used on a > > + * memoryless node > > + */ > > + if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) > > + return NULL; > > + > > unlikely()? The gfp_mask is typically constant. So the whole expression is folded by the compiler into either if (!node_memory(nid)) return NULL or nothing. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-12 21:07 ` Christoph Lameter @ 2007-06-12 21:08 ` David Rientjes 2007-06-12 21:25 ` Christoph Lameter 0 siblings, 1 reply; 40+ messages in thread From: David Rientjes @ 2007-06-12 21:08 UTC (permalink / raw) To: Christoph Lameter Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, Christoph Lameter wrote: > On Tue, 12 Jun 2007, David Rientjes wrote: > > > > =================================================================== > > > --- linux-2.6.22-rc4-mm2.orig/include/linux/gfp.h 2007-06-12 12:33:37.000000000 -0700 > > > +++ linux-2.6.22-rc4-mm2/include/linux/gfp.h 2007-06-12 12:38:37.000000000 -0700 > > > @@ -175,6 +175,13 @@ static inline struct page *alloc_pages_n > > > if (nid < 0) > > > nid = numa_node_id(); > > > > > > + /* > > > + * Check for the special case that GFP_THISNODE is used on a > > > + * memoryless node > > > + */ > > > + if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) > > > + return NULL; > > > + > > > > unlikely()? > > The gfp_mask is typically constant. So the whole expression is folded by the > compiler into either > > if (!node_memory(nid)) > return NULL > > or > > nothing. > That's the point. Isn't !node_memory(nid) unlikely? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-12 21:08 ` David Rientjes @ 2007-06-12 21:25 ` Christoph Lameter 2007-06-12 21:34 ` David Rientjes 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-12 21:25 UTC (permalink / raw) To: David Rientjes; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, David Rientjes wrote: > That's the point. Isn't !node_memory(nid) unlikely? Correct. Use unlikely Signed-off-cy: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.22-rc4-mm2/include/linux/gfp.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/gfp.h 2007-06-12 14:22:57.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/gfp.h 2007-06-12 14:24:46.000000000 -0700 @@ -179,7 +179,7 @@ static inline struct page *alloc_pages_n * Check for the special case that GFP_THISNODE is used on a * memoryless node */ - if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) + if (unlikely((gfp_mask & __GFP_THISNODE) && !node_memory(nid))) return NULL; return __alloc_pages(gfp_mask, order, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-12 21:25 ` Christoph Lameter @ 2007-06-12 21:34 ` David Rientjes 2007-06-12 21:51 ` Nishanth Aravamudan 0 siblings, 1 reply; 40+ messages in thread From: David Rientjes @ 2007-06-12 21:34 UTC (permalink / raw) To: Christoph Lameter Cc: akpm, linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn On Tue, 12 Jun 2007, Christoph Lameter wrote: > On Tue, 12 Jun 2007, David Rientjes wrote: > > > That's the point. Isn't !node_memory(nid) unlikely? > > Correct. > > Use unlikely > > Signed-off-cy: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-12 21:34 ` David Rientjes @ 2007-06-12 21:51 ` Nishanth Aravamudan 0 siblings, 0 replies; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-12 21:51 UTC (permalink / raw) To: David Rientjes; +Cc: Christoph Lameter, akpm, linux-mm, ak, Lee Schermerhorn On 12.06.2007 [14:34:40 -0700], David Rientjes wrote: > On Tue, 12 Jun 2007, Christoph Lameter wrote: > > > On Tue, 12 Jun 2007, David Rientjes wrote: > > > > > That's the point. Isn't !node_memory(nid) unlikely? > > > > Correct. > > > > Use unlikely > > > > Signed-off-cy: Christoph Lameter <clameter@sgi.com> > > Acked-by: David Rientjes <rientjes@google.com> Yep, definitely makes sense. Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-12 20:48 ` [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes clameter 2007-06-12 21:03 ` David Rientjes @ 2007-06-13 21:10 ` Lee Schermerhorn 2007-06-13 21:57 ` Nishanth Aravamudan ` (2 more replies) 1 sibling, 3 replies; 40+ messages in thread From: Lee Schermerhorn @ 2007-06-13 21:10 UTC (permalink / raw) To: clameter; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan On Tue, 2007-06-12 at 13:48 -0700, clameter@sgi.com wrote: > GFP_THISNODE checks that the zone selected is within the pgdat (node) of the > first zone of a nodelist. That only works if the node has memory. A > memoryless node will have its first zone on another pgdat (node). > > Thus GFP_THISNODE may be returning memory on other nodes. > GFP_THISNODE should fail if there is no local memory on a node. > > So we add a check to verify that the node specified has memory in > alloc_pages_node(). If the node has no memory then return NULL. > > The case of alloc_pages(GFP_THISNODE) is not changed. alloc_pages() (with > no memory policies in effect) is understood to prefer the current node. > If a process is running on a node with no memory then its default allocations > come from the next neighboring node. GFP_THISNODE will then force the memory > to come from that node. > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> > > Index: linux-2.6.22-rc4-mm2/include/linux/gfp.h > =================================================================== > --- linux-2.6.22-rc4-mm2.orig/include/linux/gfp.h 2007-06-12 12:33:37.000000000 -0700 > +++ linux-2.6.22-rc4-mm2/include/linux/gfp.h 2007-06-12 12:38:37.000000000 -0700 > @@ -175,6 +175,13 @@ static inline struct page *alloc_pages_n > if (nid < 0) > nid = numa_node_id(); > > + /* > + * Check for the special case that GFP_THISNODE is used on a > + * memoryless node > + */ > + if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) > + return NULL; > + > return __alloc_pages(gfp_mask, order, > NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask)); > } > Attached patch fixes alloc_pages_node() so that it never returns an off-node page when GFP_THISNODE is specified by. This requires a fix to SLUB early allocation, included in the patch. Works on HP ia64 platform with small DMA only node and "zone order" zonelists. Will test on x86_64 real soon now... --- PATCH fix GFP_THISNODE for DMA only nodes and zone-order zonelists The map of nodes with memory may include nodes with just DMA/DMA32 memory. Using this map/mask together with GFP_THISNODE will not guarantee on-node allocations at higher zones. Modify checks in alloc_pages_node() to ensure that the first zone in the selected zonelist is "on-node". This change will result in alloc_pages_node() returning NULL when GFP_THISNODE is specified and the first zone in the zonelist selected by (nid, gfp_zone(gfp_mask) is not on node 'nid'. This, in turn, BUGs out in slub.c:early_kmem_cache_node_alloc() which apparently can't handle a NULL page from new_slab(). Fix SLUB to handle NULL page in early allocation. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> include/linux/gfp.h | 11 ++++++++--- mm/slub.c | 22 ++++++++++++---------- 2 files changed, 20 insertions(+), 13 deletions(-) Index: Linux/include/linux/gfp.h =================================================================== --- Linux.orig/include/linux/gfp.h 2007-06-13 16:36:02.000000000 -0400 +++ Linux/include/linux/gfp.h 2007-06-13 16:38:41.000000000 -0400 @@ -168,6 +168,9 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { + pg_data_t *pgdat; + struct zonelist *zonelist; + if (unlikely(order >= MAX_ORDER)) return NULL; @@ -179,11 +182,13 @@ static inline struct page *alloc_pages_n * Check for the special case that GFP_THISNODE is used on a * memoryless node */ - if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) + pgdat = NODE_DATA(nid); + zonelist = pgdat->node_zonelists + gfp_zone(gfp_mask); + if ((gfp_mask & __GFP_THISNODE) && + pgdat != zonelist->zones[0]->zone_pgdat) return NULL; - return __alloc_pages(gfp_mask, order, - NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask)); + return __alloc_pages(gfp_mask, order, zonelist); } #ifdef CONFIG_NUMA Index: Linux/mm/slub.c =================================================================== --- Linux.orig/mm/slub.c 2007-06-13 16:36:02.000000000 -0400 +++ Linux/mm/slub.c 2007-06-13 16:38:41.000000000 -0400 @@ -1870,16 +1870,18 @@ static struct kmem_cache_node * __init e /* new_slab() disables interupts */ local_irq_enable(); - BUG_ON(!page); - n = page->freelist; - BUG_ON(!n); - page->freelist = get_freepointer(kmalloc_caches, n); - page->inuse++; - kmalloc_caches->node[node] = n; - setup_object_debug(kmalloc_caches, page, n); - init_kmem_cache_node(n); - atomic_long_inc(&n->nr_slabs); - add_partial(n, page); + if (page) { + n = page->freelist; + BUG_ON(!n); + page->freelist = get_freepointer(kmalloc_caches, n); + page->inuse++; + kmalloc_caches->node[node] = n; + setup_object_debug(kmalloc_caches, page, n); + init_kmem_cache_node(n); + atomic_long_inc(&n->nr_slabs); + add_partial(n, page); + } else + kmalloc_caches->node[node] = NULL; return n; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 21:10 ` Lee Schermerhorn @ 2007-06-13 21:57 ` Nishanth Aravamudan 2007-06-13 22:46 ` Christoph Lameter 2007-06-14 7:07 ` Christoph Lameter 2 siblings, 0 replies; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-13 21:57 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: clameter, akpm, linux-mm, ak On 13.06.2007 [17:10:32 -0400], Lee Schermerhorn wrote: > On Tue, 2007-06-12 at 13:48 -0700, clameter@sgi.com wrote: > > GFP_THISNODE checks that the zone selected is within the pgdat (node) of the > > first zone of a nodelist. That only works if the node has memory. A > > memoryless node will have its first zone on another pgdat (node). > > > > Thus GFP_THISNODE may be returning memory on other nodes. > > GFP_THISNODE should fail if there is no local memory on a node. > > > > So we add a check to verify that the node specified has memory in > > alloc_pages_node(). If the node has no memory then return NULL. > > > > The case of alloc_pages(GFP_THISNODE) is not changed. alloc_pages() (with > > no memory policies in effect) is understood to prefer the current node. > > If a process is running on a node with no memory then its default allocations > > come from the next neighboring node. GFP_THISNODE will then force the memory > > to come from that node. > > > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> > > > > Index: linux-2.6.22-rc4-mm2/include/linux/gfp.h > > =================================================================== > > --- linux-2.6.22-rc4-mm2.orig/include/linux/gfp.h 2007-06-12 12:33:37.000000000 -0700 > > +++ linux-2.6.22-rc4-mm2/include/linux/gfp.h 2007-06-12 12:38:37.000000000 -0700 > > @@ -175,6 +175,13 @@ static inline struct page *alloc_pages_n > > if (nid < 0) > > nid = numa_node_id(); > > > > + /* > > + * Check for the special case that GFP_THISNODE is used on a > > + * memoryless node > > + */ > > + if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) > > + return NULL; > > + > > return __alloc_pages(gfp_mask, order, > > NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask)); > > } > > > > Attached patch fixes alloc_pages_node() so that it never returns an > off-node page when GFP_THISNODE is specified by. This requires a fix to > SLUB early allocation, included in the patch. Works on HP ia64 platform > with small DMA only node and "zone order" zonelists. Will test on > x86_64 real soon now... > > --- > > PATCH fix GFP_THISNODE for DMA only nodes and zone-order zonelists > > The map of nodes with memory may include nodes with just > DMA/DMA32 memory. Using this map/mask together with > GFP_THISNODE will not guarantee on-node allocations at higher > zones. Modify checks in alloc_pages_node() to ensure that the > first zone in the selected zonelist is "on-node". > > This change will result in alloc_pages_node() returning NULL > when GFP_THISNODE is specified and the first zone in the zonelist > selected by (nid, gfp_zone(gfp_mask) is not on node 'nid'. This, > in turn, BUGs out in slub.c:early_kmem_cache_node_alloc() which > apparently can't handle a NULL page from new_slab(). Fix SLUB > to handle NULL page in early allocation. > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> > > include/linux/gfp.h | 11 ++++++++--- > mm/slub.c | 22 ++++++++++++---------- > 2 files changed, 20 insertions(+), 13 deletions(-) > > Index: Linux/include/linux/gfp.h > =================================================================== > --- Linux.orig/include/linux/gfp.h 2007-06-13 16:36:02.000000000 -0400 > +++ Linux/include/linux/gfp.h 2007-06-13 16:38:41.000000000 -0400 > @@ -168,6 +168,9 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i > static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, > unsigned int order) > { > + pg_data_t *pgdat; > + struct zonelist *zonelist; > + > if (unlikely(order >= MAX_ORDER)) > return NULL; > > @@ -179,11 +182,13 @@ static inline struct page *alloc_pages_n > * Check for the special case that GFP_THISNODE is used on a > * memoryless node > */ > - if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) > + pgdat = NODE_DATA(nid); > + zonelist = pgdat->node_zonelists + gfp_zone(gfp_mask); > + if ((gfp_mask & __GFP_THISNODE) && > + pgdat != zonelist->zones[0]->zone_pgdat) This hunk won't apply on top of the unlikely() change... Rediffed below, adding into the set to test now... From: Lee Schermerhorn <lee.schermerhorn@hp.com> PATCH fix GFP_THISNODE for DMA only nodes and zone-order zonelists The map of nodes with memory may include nodes with just DMA/DMA32 memory. Using this map/mask together with GFP_THISNODE will not guarantee on-node allocations at higher zones. Modify checks in alloc_pages_node() to ensure that the first zone in the selected zonelist is "on-node". This change will result in alloc_pages_node() returning NULL when GFP_THISNODE is specified and the first zone in the zonelist selected by (nid, gfp_zone(gfp_mask) is not on node 'nid'. This, in turn, BUGs out in slub.c:early_kmem_cache_node_alloc() which apparently can't handle a NULL page from new_slab(). Fix SLUB to handle NULL page in early allocation. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> diff --git a/include/linux/gfp.h b/include/linux/gfp.h index ecd8adb..98a76c9 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -168,6 +168,9 @@ FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *)); static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { + pg_data_t *pgdat; + struct zonelist *zonelist; + if (unlikely(order >= MAX_ORDER)) return NULL; @@ -175,15 +178,17 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, if (nid < 0) nid = numa_node_id(); + pgdat = NODE_DATA(nid); + zonelist = pgdat->node_zonelists + gfp_zone(gfp_mask); /* * Check for the special case that GFP_THISNODE is used on a * memoryless node */ - if (unlikely((gfp_mask & __GFP_THISNODE) && !node_memory(nid))) + if (unlikely((gfp_mask & __GFP_THISNODE) && + pgdat != zonelist->zones[0]->zone_pgdat)) return NULL; - return __alloc_pages(gfp_mask, order, - NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask)); + return __alloc_pages(gfp_mask, order, zonelist); } #ifdef CONFIG_NUMA diff --git a/mm/slub.c b/mm/slub.c index c16b75d..43a9270 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1870,16 +1870,18 @@ static struct kmem_cache_node * __init early_kmem_cache_node_alloc(gfp_t gfpflag /* new_slab() disables interupts */ local_irq_enable(); - BUG_ON(!page); - n = page->freelist; - BUG_ON(!n); - page->freelist = get_freepointer(kmalloc_caches, n); - page->inuse++; - kmalloc_caches->node[node] = n; - setup_object_debug(kmalloc_caches, page, n); - init_kmem_cache_node(n); - atomic_long_inc(&n->nr_slabs); - add_partial(n, page); + if (page) { + n = page->freelist; + BUG_ON(!n); + page->freelist = get_freepointer(kmalloc_caches, n); + page->inuse++; + kmalloc_caches->node[node] = n; + setup_object_debug(kmalloc_caches, page, n); + init_kmem_cache_node(n); + atomic_long_inc(&n->nr_slabs); + add_partial(n, page); + } else + kmalloc_caches->node[node] = NULL; return n; } -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 21:10 ` Lee Schermerhorn 2007-06-13 21:57 ` Nishanth Aravamudan @ 2007-06-13 22:46 ` Christoph Lameter 2007-06-13 23:11 ` Nishanth Aravamudan 2007-06-14 14:18 ` Lee Schermerhorn 2007-06-14 7:07 ` Christoph Lameter 2 siblings, 2 replies; 40+ messages in thread From: Christoph Lameter @ 2007-06-13 22:46 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan On Wed, 13 Jun 2007, Lee Schermerhorn wrote: > SLUB early allocation, included in the patch. Works on HP ia64 platform > with small DMA only node and "zone order" zonelists. Will test on > x86_64 real soon now... I do not see the difference?? How does this work? node_memory(x) fails there? > The map of nodes with memory may include nodes with just > DMA/DMA32 memory. Using this map/mask together with > GFP_THISNODE will not guarantee on-node allocations at higher > zones. Modify checks in alloc_pages_node() to ensure that the > first zone in the selected zonelist is "on-node". That check is already done by __alloc_pages. > This change will result in alloc_pages_node() returning NULL > when GFP_THISNODE is specified and the first zone in the zonelist > selected by (nid, gfp_zone(gfp_mask) is not on node 'nid'. This, > in turn, BUGs out in slub.c:early_kmem_cache_node_alloc() which > apparently can't handle a NULL page from new_slab(). Fix SLUB > to handle NULL page in early allocation. Ummm... Slub would need to consult node_memory_map instead I guess. > Index: Linux/mm/slub.c > =================================================================== > --- Linux.orig/mm/slub.c 2007-06-13 16:36:02.000000000 -0400 > +++ Linux/mm/slub.c 2007-06-13 16:38:41.000000000 -0400 > @@ -1870,16 +1870,18 @@ static struct kmem_cache_node * __init e > /* new_slab() disables interupts */ > local_irq_enable(); > > - BUG_ON(!page); > - n = page->freelist; > - BUG_ON(!n); > - page->freelist = get_freepointer(kmalloc_caches, n); > - page->inuse++; > - kmalloc_caches->node[node] = n; > - setup_object_debug(kmalloc_caches, page, n); > - init_kmem_cache_node(n); > - atomic_long_inc(&n->nr_slabs); > - add_partial(n, page); > + if (page) { > + n = page->freelist; > + BUG_ON(!n); > + page->freelist = get_freepointer(kmalloc_caches, n); > + page->inuse++; > + kmalloc_caches->node[node] = n; > + setup_object_debug(kmalloc_caches, page, n); > + init_kmem_cache_node(n); > + atomic_long_inc(&n->nr_slabs); > + add_partial(n, page); > + } else > + kmalloc_caches->node[node] = NULL; > return n; > } It would be easier to modify SLUB to loop over node_memory_map instead of node_online_map? Potentially we have to change all loops over online node in the slab allocators. --- include/linux/nodemask.h | 1 + mm/slub.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6/include/linux/nodemask.h =================================================================== --- linux-2.6.orig/include/linux/nodemask.h 2007-06-13 15:40:27.000000000 -0700 +++ linux-2.6/include/linux/nodemask.h 2007-06-13 15:40:48.000000000 -0700 @@ -377,5 +377,6 @@ extern int nr_node_ids; #define for_each_node(node) for_each_node_mask((node), node_possible_map) #define for_each_online_node(node) for_each_node_mask((node), node_online_map) +#define for_each_memory_node(node) for_each_node_mask((node), node_memory_map) #endif /* __LINUX_NODEMASK_H */ Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2007-06-13 15:39:16.000000000 -0700 +++ linux-2.6/mm/slub.c 2007-06-13 15:40:23.000000000 -0700 @@ -1836,7 +1836,7 @@ static int init_kmem_cache_nodes(struct else local_node = 0; - for_each_online_node(node) { + for_each_memory_node(node) { struct kmem_cache_node *n; if (local_node == node) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 22:46 ` Christoph Lameter @ 2007-06-13 23:11 ` Nishanth Aravamudan 2007-06-13 23:15 ` Christoph Lameter 2007-06-14 14:18 ` Lee Schermerhorn 1 sibling, 1 reply; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-13 23:11 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee Schermerhorn, akpm, linux-mm, ak On 13.06.2007 [15:46:11 -0700], Christoph Lameter wrote: > On Wed, 13 Jun 2007, Lee Schermerhorn wrote: > > > SLUB early allocation, included in the patch. Works on HP ia64 platform > > with small DMA only node and "zone order" zonelists. Will test on > > x86_64 real soon now... > > I do not see the difference?? How does this work? node_memory(x) fails > there? > > > The map of nodes with memory may include nodes with just > > DMA/DMA32 memory. Using this map/mask together with > > GFP_THISNODE will not guarantee on-node allocations at higher > > zones. Modify checks in alloc_pages_node() to ensure that the > > first zone in the selected zonelist is "on-node". > > That check is already done by __alloc_pages. > > > This change will result in alloc_pages_node() returning NULL > > when GFP_THISNODE is specified and the first zone in the zonelist > > selected by (nid, gfp_zone(gfp_mask) is not on node 'nid'. This, > > in turn, BUGs out in slub.c:early_kmem_cache_node_alloc() which > > apparently can't handle a NULL page from new_slab(). Fix SLUB > > to handle NULL page in early allocation. > > Ummm... Slub would need to consult node_memory_map instead I guess. > > > Index: Linux/mm/slub.c > > =================================================================== > > --- Linux.orig/mm/slub.c 2007-06-13 16:36:02.000000000 -0400 > > +++ Linux/mm/slub.c 2007-06-13 16:38:41.000000000 -0400 > > @@ -1870,16 +1870,18 @@ static struct kmem_cache_node * __init e > > /* new_slab() disables interupts */ > > local_irq_enable(); > > > > - BUG_ON(!page); > > - n = page->freelist; > > - BUG_ON(!n); > > - page->freelist = get_freepointer(kmalloc_caches, n); > > - page->inuse++; > > - kmalloc_caches->node[node] = n; > > - setup_object_debug(kmalloc_caches, page, n); > > - init_kmem_cache_node(n); > > - atomic_long_inc(&n->nr_slabs); > > - add_partial(n, page); > > + if (page) { > > + n = page->freelist; > > + BUG_ON(!n); > > + page->freelist = get_freepointer(kmalloc_caches, n); > > + page->inuse++; > > + kmalloc_caches->node[node] = n; > > + setup_object_debug(kmalloc_caches, page, n); > > + init_kmem_cache_node(n); > > + atomic_long_inc(&n->nr_slabs); > > + add_partial(n, page); > > + } else > > + kmalloc_caches->node[node] = NULL; > > return n; > > } > > It would be easier to modify SLUB to loop over node_memory_map instead of > node_online_map? Potentially we have to change all loops over online node > in the slab allocators. So, I think we are really close to closing the gaps here. Just need to figure out how to fix Lee's platform so it does what he wants, I think. I've tested the current set (which is going to change again, once we figure out how to deal with SLUB (I'm guessing we'll go with your patch Christoph, but it didn't exist when I was testing earlier :) and Lee's platform properly). But everything, including the sysfs allocator, works with a 4-node x86_64, with all nodes populated and a 4-node ppc64, with only 2 nodes populated. I would like to roll up the patches and small fixes into a set of 4 or 5 patches that Andrew can pick up, so once this is all stable, I'll post a fresh series. Sound good, Andrew? Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 23:11 ` Nishanth Aravamudan @ 2007-06-13 23:15 ` Christoph Lameter 2007-06-13 23:20 ` Nishanth Aravamudan 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-13 23:15 UTC (permalink / raw) To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, akpm, linux-mm, ak On Wed, 13 Jun 2007, Nishanth Aravamudan wrote: > I would like to roll up the patches and small fixes into a set of 4 or 5 > patches that Andrew can pick up, so once this is all stable, I'll post a > fresh series. Sound good, Andrew? NACK. This patchset is not ready for any inclusion and nothing like that should go into 2.6.22. We first need to assess the breakage that results if GFP_THISNODE now returns NULL for memoryless nodes. So far GFP_THISNODE returns memory on the nearest node and that seems to make lots of things keep working. Plus the implementation of GFP_THISNODE that we have so far is a bit complex. It would be good if that could be simpler. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 23:15 ` Christoph Lameter @ 2007-06-13 23:20 ` Nishanth Aravamudan 2007-06-13 23:26 ` Christoph Lameter 0 siblings, 1 reply; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-13 23:20 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee Schermerhorn, akpm, linux-mm, ak On 13.06.2007 [16:15:24 -0700], Christoph Lameter wrote: > On Wed, 13 Jun 2007, Nishanth Aravamudan wrote: > > > I would like to roll up the patches and small fixes into a set of 4 or 5 > > patches that Andrew can pick up, so once this is all stable, I'll post a > > fresh series. Sound good, Andrew? > > NACK. This patchset is not ready for any inclusion and nothing like > that should go into 2.6.22. We first need to assess the breakage that > results if GFP_THISNODE now returns NULL for memoryless nodes. So far > GFP_THISNODE returns memory on the nearest node and that seems to make > lots of things keep working. Who the heck said anything about mainline? See my other reply for discussing GFP_THISNODE. Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 23:20 ` Nishanth Aravamudan @ 2007-06-13 23:26 ` Christoph Lameter 2007-06-13 23:32 ` Nishanth Aravamudan 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-13 23:26 UTC (permalink / raw) To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, akpm, linux-mm, ak On Wed, 13 Jun 2007, Nishanth Aravamudan wrote: > Who the heck said anything about mainline? Well we do have discovered issues that are bugs. Handling of memoryless nodes is rather strange. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 23:26 ` Christoph Lameter @ 2007-06-13 23:32 ` Nishanth Aravamudan 2007-06-13 23:53 ` Christoph Lameter 0 siblings, 1 reply; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-13 23:32 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee Schermerhorn, akpm, linux-mm, ak On 13.06.2007 [16:26:50 -0700], Christoph Lameter wrote: > On Wed, 13 Jun 2007, Nishanth Aravamudan wrote: > > > Who the heck said anything about mainline? > > Well we do have discovered issues that are bugs. Handling of > memoryless nodes is rather strange. Without a doubt :) Sorry, the real reason for wrapping the patches and reposting, for me, is that we've had a lot of versions flying around, with small fixlets here and there. I wanted to start a new thread for the 3 or so patches I see that implement the core of dealing with memoryless nodes, and then keep the discussion going there, but that was purely for my own sanity. Sorry if I overreacted. Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 23:32 ` Nishanth Aravamudan @ 2007-06-13 23:53 ` Christoph Lameter 2007-06-14 0:04 ` Nishanth Aravamudan 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-13 23:53 UTC (permalink / raw) To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, akpm, linux-mm, ak On Wed, 13 Jun 2007, Nishanth Aravamudan wrote: > Sorry, the real reason for wrapping the patches and reposting, for me, > is that we've had a lot of versions flying around, with small fixlets > here and there. I wanted to start a new thread for the 3 or so patches I > see that implement the core of dealing with memoryless nodes, and then > keep the discussion going there, but that was purely for my own sanity. Ok. I will try to put all the fixes together. Basically that will be the three patches plus Lee's new suggestion for alloc_pages_node. Plus fixes to slab / slub / the uncached allocator etc where I see that the online map is used but what was really intended was the node_memory_map. There is also stuff in the oom killer, vmscan.c etc that seems to make that assumption. Sigh. That still leaves the issue of the name. node_memory set_node_has_memory set_node_no_memory Ok? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 23:53 ` Christoph Lameter @ 2007-06-14 0:04 ` Nishanth Aravamudan 0 siblings, 0 replies; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-14 0:04 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee Schermerhorn, akpm, linux-mm, ak On 13.06.2007 [16:53:26 -0700], Christoph Lameter wrote: > On Wed, 13 Jun 2007, Nishanth Aravamudan wrote: > > > Sorry, the real reason for wrapping the patches and reposting, for > > me, is that we've had a lot of versions flying around, with small > > fixlets here and there. I wanted to start a new thread for the 3 or > > so patches I see that implement the core of dealing with memoryless > > nodes, and then keep the discussion going there, but that was purely > > for my own sanity. > > Ok. I will try to put all the fixes together. Basically that will be > the three patches plus Lee's new suggestion for alloc_pages_node. Plus > fixes to slab / slub / the uncached allocator etc where I see that the > online map is used but what was really intended was the > node_memory_map. There is also stuff in the oom killer, vmscan.c etc > that seems to make that assumption. Sigh. Yeah, it's a big problem to solve, from an audit perspective. > That still leaves the issue of the name. > > node_memory node_has_memory? Reading code, if I see: if (node_has_memory(nid)) it's almost reading an English sentence! :) > set_node_has_memory > set_node_no_memory node_set_has_memory node_set_no_memory? Just to be close to (but not quite the same as) online/offline. Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 22:46 ` Christoph Lameter 2007-06-13 23:11 ` Nishanth Aravamudan @ 2007-06-14 14:18 ` Lee Schermerhorn 2007-06-14 14:24 ` Christoph Lameter 1 sibling, 1 reply; 40+ messages in thread From: Lee Schermerhorn @ 2007-06-14 14:18 UTC (permalink / raw) To: Christoph Lameter; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan On Wed, 2007-06-13 at 15:46 -0700, Christoph Lameter wrote: > On Wed, 13 Jun 2007, Lee Schermerhorn wrote: > > > SLUB early allocation, included in the patch. Works on HP ia64 platform > > with small DMA only node and "zone order" zonelists. Will test on > > x86_64 real soon now... > > I do not see the difference?? How does this work? node_memory(x) fails > there? On my system, pseudo-node 4 contains ~512MB of DMA zone. With the new zoneorder auto-config patch [which this platform also needs], the zonelist for gfp_zone == ZONE_NORMAL [hugetlb attempts to allocate from this zone--as it should] contains the following zones: zone0-normal, zone1-normal, zone2-normal, zone3-normal, zone4-dma node_memory(4) returns "true" -- node does have memory, in the dma zone, but it's last in the list, as required. alloc_pages_node() would call __alloc_pages() with this zonelist. If get_page_from_freelist() finds the requested page in zone0-normal, the check that the zone's pgdat == the pgdat of zonelist->zone[0] will succeed and we'll return an off-node page. > > > The map of nodes with memory may include nodes with just > > DMA/DMA32 memory. Using this map/mask together with > > GFP_THISNODE will not guarantee on-node allocations at higher > > zones. Modify checks in alloc_pages_node() to ensure that the > > first zone in the selected zonelist is "on-node". > > That check is already done by __alloc_pages. You mean in get_page_from_freelist()? No, it only checks that the zone under consideration is on the same node as the zone at the start of the list. This can be off-node if the node is populated only at lower zones; and the zonelists are in zone-order. > > > This change will result in alloc_pages_node() returning NULL > > when GFP_THISNODE is specified and the first zone in the zonelist > > selected by (nid, gfp_zone(gfp_mask) is not on node 'nid'. This, > > in turn, BUGs out in slub.c:early_kmem_cache_node_alloc() which > > apparently can't handle a NULL page from new_slab(). Fix SLUB > > to handle NULL page in early allocation. > > Ummm... Slub would need to consult node_memory_map instead I guess. Probably should check the node_memory_map to avoid attempting allocations from completely memoryless nodes. However, it should still be able to handle nulls from alloc_pages_nodes() because of the scenarios discussed above. Lee > > > Index: Linux/mm/slub.c > > =================================================================== > > --- Linux.orig/mm/slub.c 2007-06-13 16:36:02.000000000 -0400 > > +++ Linux/mm/slub.c 2007-06-13 16:38:41.000000000 -0400 > > @@ -1870,16 +1870,18 @@ static struct kmem_cache_node * __init e > > /* new_slab() disables interupts */ > > local_irq_enable(); > > > > - BUG_ON(!page); > > - n = page->freelist; > > - BUG_ON(!n); > > - page->freelist = get_freepointer(kmalloc_caches, n); > > - page->inuse++; > > - kmalloc_caches->node[node] = n; > > - setup_object_debug(kmalloc_caches, page, n); > > - init_kmem_cache_node(n); > > - atomic_long_inc(&n->nr_slabs); > > - add_partial(n, page); > > + if (page) { > > + n = page->freelist; > > + BUG_ON(!n); > > + page->freelist = get_freepointer(kmalloc_caches, n); > > + page->inuse++; > > + kmalloc_caches->node[node] = n; > > + setup_object_debug(kmalloc_caches, page, n); > > + init_kmem_cache_node(n); > > + atomic_long_inc(&n->nr_slabs); > > + add_partial(n, page); > > + } else > > + kmalloc_caches->node[node] = NULL; > > return n; > > } > > It would be easier to modify SLUB to loop over node_memory_map instead of > node_online_map? Potentially we have to change all loops over online node > in the slab allocators. Again, node_memory_map can't detect the "first zone in zonelist off-node" situation. That's the one that alloc_pages_node() must guard against. So, it can/should/must return NULL when attempting to allocate from a higher zone that is off-node. > > --- > include/linux/nodemask.h | 1 + > mm/slub.c | 2 +- > 2 files changed, 2 insertions(+), 1 deletion(-) > > Index: linux-2.6/include/linux/nodemask.h > =================================================================== > --- linux-2.6.orig/include/linux/nodemask.h 2007-06-13 15:40:27.000000000 -0700 > +++ linux-2.6/include/linux/nodemask.h 2007-06-13 15:40:48.000000000 -0700 > @@ -377,5 +377,6 @@ extern int nr_node_ids; > > #define for_each_node(node) for_each_node_mask((node), node_possible_map) > #define for_each_online_node(node) for_each_node_mask((node), node_online_map) > +#define for_each_memory_node(node) for_each_node_mask((node), node_memory_map) > > #endif /* __LINUX_NODEMASK_H */ > Index: linux-2.6/mm/slub.c > =================================================================== > --- linux-2.6.orig/mm/slub.c 2007-06-13 15:39:16.000000000 -0700 > +++ linux-2.6/mm/slub.c 2007-06-13 15:40:23.000000000 -0700 > @@ -1836,7 +1836,7 @@ static int init_kmem_cache_nodes(struct > else > local_node = 0; > > - for_each_online_node(node) { > + for_each_memory_node(node) { > struct kmem_cache_node *n; > > if (local_node == node) > > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-14 14:18 ` Lee Schermerhorn @ 2007-06-14 14:24 ` Christoph Lameter 2007-06-14 14:55 ` Lee Schermerhorn 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-14 14:24 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan On Thu, 14 Jun 2007, Lee Schermerhorn wrote: > > That check is already done by __alloc_pages. > > You mean in get_page_from_freelist()? No, it only checks that the zone > under consideration is on the same node as the zone at the start of the > list. This can be off-node if the node is populated only at lower > zones; and the zonelists are in zone-order. See the later discussion. I did not see the use the nodes pgdat here that you only have in alloc_pages_node(). > > Ummm... Slub would need to consult node_memory_map instead I guess. > > Probably should check the node_memory_map to avoid attempting > allocations from completely memoryless nodes. However, it should still > be able to handle nulls from alloc_pages_nodes() because of the > scenarios discussed above. It is able to handle NULLs during usual operations but not during bootstrap. > Again, node_memory_map can't detect the "first zone in zonelist > off-node" situation. That's the one that alloc_pages_node() must guard > against. So, it can/should/must return NULL when attempting to > allocate from a higher zone that is off-node. I think GFP_THISNODE should not fail in that case. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-14 14:24 ` Christoph Lameter @ 2007-06-14 14:55 ` Lee Schermerhorn 2007-06-14 15:51 ` Christoph Lameter 0 siblings, 1 reply; 40+ messages in thread From: Lee Schermerhorn @ 2007-06-14 14:55 UTC (permalink / raw) To: Christoph Lameter; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan On Thu, 2007-06-14 at 07:24 -0700, Christoph Lameter wrote: > On Thu, 14 Jun 2007, Lee Schermerhorn wrote: > > > > That check is already done by __alloc_pages. > > > > You mean in get_page_from_freelist()? No, it only checks that the zone > > under consideration is on the same node as the zone at the start of the > > list. This can be off-node if the node is populated only at lower > > zones; and the zonelists are in zone-order. > > See the later discussion. I did not see the use the nodes pgdat here that > you only have in alloc_pages_node(). > > > > Ummm... Slub would need to consult node_memory_map instead I guess. > > > > Probably should check the node_memory_map to avoid attempting > > allocations from completely memoryless nodes. However, it should still > > be able to handle nulls from alloc_pages_nodes() because of the > > scenarios discussed above. > > It is able to handle NULLs during usual operations but not during bootstrap. ??? it has to handle memoryless nodes, right? if so, why can't it handle alloc_pages_node('THISNODE,...) returning NULL? Too late in the process? > > > Again, node_memory_map can't detect the "first zone in zonelist > > off-node" situation. That's the one that alloc_pages_node() must guard > > against. So, it can/should/must return NULL when attempting to > > allocate from a higher zone that is off-node. > > I think GFP_THISNODE should not fail in that case. You think it should return off-node memory? If it can, we have to check the returned page's nid--e.g., in alloc_fresh_huge_page()--where we really don't want an off-node page. Could do that, I suppose. Or do you mean that GPF_THISNODE should return memory from the lower zones, if it satisfies the order requirement? Your custom 'thisnode' zonelist will enable this, right? That would work for my particular config of my platform, but on a larger platform, I'd have a larger DMA zone and wouldn't want hugetlb pages coming from there. I guess that in the long run, I need to hope that Nish's per node huge page attribute goes in. Then, we can provide a fancy script to configure huge pages on a specific set of nodes, rather than relying on the kernel to distribute them evenly across the set of nodes that "make sense" for the given platform. Later, Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-14 14:55 ` Lee Schermerhorn @ 2007-06-14 15:51 ` Christoph Lameter 0 siblings, 0 replies; 40+ messages in thread From: Christoph Lameter @ 2007-06-14 15:51 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan On Thu, 14 Jun 2007, Lee Schermerhorn wrote: > > It is able to handle NULLs during usual operations but not during bootstrap. > > ??? it has to handle memoryless nodes, right? if so, why can't it > handle alloc_pages_node('THISNODE,...) returning NULL? Too late in the > process? It should not call alloc_pages_node for a node that has no memory during bootstrap. > Or do you mean that GPF_THISNODE should return memory from the lower > zones, if it satisfies the order requirement? Your custom 'thisnode' > zonelist will enable this, right? That would work for my particular > config of my platform, but on a larger platform, I'd have a larger DMA > zone and wouldn't want hugetlb pages coming from there. Right. The fixes here are for the general vm and not for hugetlb. They are definitely not HP platform specific. > I guess that in the long run, I need to hope that Nish's per node huge > page attribute goes in. Then, we can provide a fancy script to configure > huge pages on a specific set of nodes, rather than relying on the kernel > to distribute them evenly across the set of nodes that "make sense" for > the given platform. Right. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-13 21:10 ` Lee Schermerhorn 2007-06-13 21:57 ` Nishanth Aravamudan 2007-06-13 22:46 ` Christoph Lameter @ 2007-06-14 7:07 ` Christoph Lameter 2007-06-14 14:23 ` Nishanth Aravamudan 2 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2007-06-14 7:07 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, Nishanth Aravamudan On Wed, 13 Jun 2007, Lee Schermerhorn wrote: > --- Linux.orig/include/linux/gfp.h 2007-06-13 16:36:02.000000000 -0400 > +++ Linux/include/linux/gfp.h 2007-06-13 16:38:41.000000000 -0400 > @@ -168,6 +168,9 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i > static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, > unsigned int order) > { > + pg_data_t *pgdat; > + struct zonelist *zonelist; > + > if (unlikely(order >= MAX_ORDER)) > return NULL; > > @@ -179,11 +182,13 @@ static inline struct page *alloc_pages_n > * Check for the special case that GFP_THISNODE is used on a > * memoryless node > */ > - if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) > + pgdat = NODE_DATA(nid); > + zonelist = pgdat->node_zonelists + gfp_zone(gfp_mask); > + if ((gfp_mask & __GFP_THISNODE) && > + pgdat != zonelist->zones[0]->zone_pgdat) > return NULL; > > - return __alloc_pages(gfp_mask, order, > - NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask)); > + return __alloc_pages(gfp_mask, order, zonelist); > } Good idea but I think this does not address the case where the DMA zone of a node was moved to the end of the zonelist. In that case the first zone is not on the first pgdat but the node has memory. The memory of the node is listed elsewhere in the nodelist. I can probably modify __alloc_pages to make GFP_THISNODE to check all zones but we do not have the pgdat reference there. Sigh. How about generating a special THISNODE zonelist in build_zonelist that only contains the zones of a single node. Then just use that one if GFP_THISNODE is set. Then we get rid of all the GFP_THISNODE crap that I added to __alloc_pages? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes 2007-06-14 7:07 ` Christoph Lameter @ 2007-06-14 14:23 ` Nishanth Aravamudan 0 siblings, 0 replies; 40+ messages in thread From: Nishanth Aravamudan @ 2007-06-14 14:23 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee Schermerhorn, akpm, linux-mm, ak On 14.06.2007 [00:07:28 -0700], Christoph Lameter wrote: > On Wed, 13 Jun 2007, Lee Schermerhorn wrote: > > > --- Linux.orig/include/linux/gfp.h 2007-06-13 16:36:02.000000000 -0400 > > +++ Linux/include/linux/gfp.h 2007-06-13 16:38:41.000000000 -0400 > > @@ -168,6 +168,9 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i > > static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, > > unsigned int order) > > { > > + pg_data_t *pgdat; > > + struct zonelist *zonelist; > > + > > if (unlikely(order >= MAX_ORDER)) > > return NULL; > > > > @@ -179,11 +182,13 @@ static inline struct page *alloc_pages_n > > * Check for the special case that GFP_THISNODE is used on a > > * memoryless node > > */ > > - if ((gfp_mask & __GFP_THISNODE) && !node_memory(nid)) > > + pgdat = NODE_DATA(nid); > > + zonelist = pgdat->node_zonelists + gfp_zone(gfp_mask); > > + if ((gfp_mask & __GFP_THISNODE) && > > + pgdat != zonelist->zones[0]->zone_pgdat) > > return NULL; > > > > - return __alloc_pages(gfp_mask, order, > > - NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask)); > > + return __alloc_pages(gfp_mask, order, zonelist); > > } > > Good idea but I think this does not address the case where the DMA > zone of a node was moved to the end of the zonelist. In that case the > first zone is not on the first pgdat but the node has memory. The > memory of the node is listed elsewhere in the nodelist. I can probably > modify __alloc_pages to make GFP_THISNODE to check all zones but we do > not have the pgdat reference there. Sigh. > > How about generating a special THISNODE zonelist in build_zonelist > that only contains the zones of a single node. Then just use that one > if GFP_THISNODE is set. Then we get rid of all the GFP_THISNODE crap > that I added to __alloc_pages? Makes sense to me. Thanks, NIsh -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* [patch 3/3] Fix MPOL_INTERLEAVE behavior for memoryless nodes 2007-06-12 20:48 [patch 0/3] Fixes for NUMA allocations on memoryless nodes clameter 2007-06-12 20:48 ` [patch 1/3] NUMA: introduce node_memory_map clameter 2007-06-12 20:48 ` [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes clameter @ 2007-06-12 20:48 ` clameter 2 siblings, 0 replies; 40+ messages in thread From: clameter @ 2007-06-12 20:48 UTC (permalink / raw) To: akpm; +Cc: linux-mm, ak, Nishanth Aravamudan, Lee Schermerhorn [-- Attachment #1: fix_interleave --] [-- Type: text/plain, Size: 1339 bytes --] MPOL_INTERLEAVE currently simply loops over all nodes. Allocations on memoryless nodes will be redirected to nodes with memory. This results in an imbalance because the neighboring nodes to memoryless nodes will get significantly more interleave hits that the rest of the nodes. We can avoid this imbalance by clearing the nodes in the interleave node set that have no memory. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Index: linux-2.6.22-rc4-mm2/mm/mempolicy.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/mempolicy.c 2007-06-12 13:45:16.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/mempolicy.c 2007-06-12 13:45:31.000000000 -0700 @@ -185,7 +185,8 @@ static struct mempolicy *mpol_new(int mo switch (mode) { case MPOL_INTERLEAVE: policy->v.nodes = *nodes; - if (nodes_weight(*nodes) == 0) { + nodes_and(policy->v.nodes, policy->v.nodes, node_memory_map); + if (nodes_weight(policy->v.nodes) == 0) { kmem_cache_free(policy_cache, policy); return ERR_PTR(-EINVAL); } -- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2007-06-14 15:51 UTC | newest] Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-06-12 20:48 [patch 0/3] Fixes for NUMA allocations on memoryless nodes clameter 2007-06-12 20:48 ` [patch 1/3] NUMA: introduce node_memory_map clameter 2007-06-12 21:03 ` David Rientjes 2007-06-12 21:08 ` Christoph Lameter 2007-06-12 21:10 ` David Rientjes 2007-06-12 21:27 ` Christoph Lameter 2007-06-12 21:34 ` David Rientjes 2007-06-12 21:37 ` Christoph Lameter 2007-06-12 21:38 ` David Rientjes 2007-06-12 21:36 ` Nishanth Aravamudan 2007-06-12 21:39 ` Christoph Lameter 2007-06-12 21:42 ` Nishanth Aravamudan 2007-06-12 21:45 ` David Rientjes 2007-06-12 22:26 ` Christoph Lameter 2007-06-12 22:32 ` Nishanth Aravamudan 2007-06-13 9:14 ` Andy Whitcroft 2007-06-12 20:48 ` [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes clameter 2007-06-12 21:03 ` David Rientjes 2007-06-12 21:07 ` Christoph Lameter 2007-06-12 21:08 ` David Rientjes 2007-06-12 21:25 ` Christoph Lameter 2007-06-12 21:34 ` David Rientjes 2007-06-12 21:51 ` Nishanth Aravamudan 2007-06-13 21:10 ` Lee Schermerhorn 2007-06-13 21:57 ` Nishanth Aravamudan 2007-06-13 22:46 ` Christoph Lameter 2007-06-13 23:11 ` Nishanth Aravamudan 2007-06-13 23:15 ` Christoph Lameter 2007-06-13 23:20 ` Nishanth Aravamudan 2007-06-13 23:26 ` Christoph Lameter 2007-06-13 23:32 ` Nishanth Aravamudan 2007-06-13 23:53 ` Christoph Lameter 2007-06-14 0:04 ` Nishanth Aravamudan 2007-06-14 14:18 ` Lee Schermerhorn 2007-06-14 14:24 ` Christoph Lameter 2007-06-14 14:55 ` Lee Schermerhorn 2007-06-14 15:51 ` Christoph Lameter 2007-06-14 7:07 ` Christoph Lameter 2007-06-14 14:23 ` Nishanth Aravamudan 2007-06-12 20:48 ` [patch 3/3] Fix MPOL_INTERLEAVE " clameter
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox