* [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
@ 2006-02-17 1:23 Andi Kleen
2006-02-17 1:40 ` Christoph Lameter
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Andi Kleen @ 2006-02-17 1:23 UTC (permalink / raw)
To: torvalds; +Cc: akpm, Christoph Lameter, linux-mm
The new function to set up the node fallback lists didn't handle
holes in the node map. This happens e.g. on Opterons when
the a CPU is missing memory, which is not that uncommon.
Empty nodes are not initialization, but the node number is still
allocated. And then it would early except or even triple fault here
because it would try to set up a fallback list for a NULL pgdat. Oops.
There was actually another bug that caused problems in this
configuration - fixed in the earlier x86-64 patchkit. But
this is the second fix to make it actually boot.
This change makes sure the fallback list initialization really
looks at all nodes (when there is a hole num_online_nodes() isn't
the highest index) and also skips missing nodes.
Cc: clameter@engr.sgi.com
Signed-off-by: Andi Kleen <ak@suse.de>
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1540,12 +1540,19 @@ static int __init find_next_best_node(in
int i, n, val;
int min_val = INT_MAX;
int best_node = -1;
+ int highest_node = 0;
+
+ for_each_online_node(i)
+ highest_node = i;
for_each_online_node(i) {
cpumask_t tmp;
/* Start from local node */
- n = (node+i) % num_online_nodes();
+ n = (node+i) % (highest_node + 1);
+ /* Handle holes in the nodemask */
+ if (!NODE_DATA(n))
+ continue;
/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 1:23 [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization Andi Kleen
@ 2006-02-17 1:40 ` Christoph Lameter
2006-02-17 1:46 ` Andi Kleen
2006-02-17 1:51 ` Christoph Lameter
` (2 subsequent siblings)
3 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2006-02-17 1:40 UTC (permalink / raw)
To: Andi Kleen; +Cc: torvalds, akpm, linux-mm
What happens if another node beyond higest_node comes online later?
Or one node in between comes online?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 1:40 ` Christoph Lameter
@ 2006-02-17 1:46 ` Andi Kleen
2006-02-17 2:12 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2006-02-17 1:46 UTC (permalink / raw)
To: Christoph Lameter; +Cc: torvalds, akpm, linux-mm
On Friday 17 February 2006 02:40, Christoph Lameter wrote:
> What happens if another node beyond higest_node comes online later?
> Or one node in between comes online?
I don't know. Whoever implements node hotplug has to handle it.
But I'm pretty sure the old code also didn't handle it, so it's not
a regression.
My primary interest is just to get all these Opterons booting again.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 1:23 [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization Andi Kleen
2006-02-17 1:40 ` Christoph Lameter
@ 2006-02-17 1:51 ` Christoph Lameter
2006-02-17 2:10 ` Andi Kleen
2006-02-17 6:10 ` Yasunori Goto
2006-02-17 3:33 ` Yasunori Goto
2006-02-17 16:52 ` Linus Torvalds
3 siblings, 2 replies; 17+ messages in thread
From: Christoph Lameter @ 2006-02-17 1:51 UTC (permalink / raw)
To: Andi Kleen; +Cc: torvalds, akpm, linux-mm
On Fri, 17 Feb 2006, Andi Kleen wrote:
> Empty nodes are not initialization, but the node number is still
> allocated. And then it would early except or even triple fault here
> because it would try to set up a fallback list for a NULL pgdat. Oops.
Isnt this an issue with the arch code? Simply do not allocate an empty
node. Is the mapping from linux Node id -> Hardware node id fixed on
x86_64? ia64 has a lookup table.
These are empty nodes without processor? Or a processor without a node?
In that case the processor will have to be assigned a default node.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 1:51 ` Christoph Lameter
@ 2006-02-17 2:10 ` Andi Kleen
2006-02-17 2:46 ` Christoph Lameter
2006-02-17 6:10 ` Yasunori Goto
1 sibling, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2006-02-17 2:10 UTC (permalink / raw)
To: Christoph Lameter; +Cc: torvalds, akpm, linux-mm
On Friday 17 February 2006 02:51, Christoph Lameter wrote:
> On Fri, 17 Feb 2006, Andi Kleen wrote:
>
> > Empty nodes are not initialization, but the node number is still
> > allocated. And then it would early except or even triple fault here
> > because it would try to set up a fallback list for a NULL pgdat. Oops.
>
> Isnt this an issue with the arch code? Simply do not allocate an empty
> node.
The node is not allocated (in the pgdat sense), but the nodes are not
renumbered when this happens.
> Is the mapping from linux Node id -> Hardware node id fixed on
> x86_64?
No, in theory not, but changing that would require considerable changes
in the NUMA discovery code and I'm not planning to do that for 2.6.16 now.
Also I think the generic code ought to handle that anyways. Why should
we have node bitmaps if they can't have holes?
> ia64 has a lookup table.
x86-64 too.
> These are empty nodes without processor? Or a processor without a node?
processor(s) without node
(it could be multiple processors in the multi core case)
On some systems it's even unavoidable because on cheaper motherboards
the vendors sometimes don't put DIMM slots to one of the CPUs.
> In that case the processor will have to be assigned a default node.
It will - it will get a nearby node.
In fact it has worked in the past (ok mostly there were bugs in it too, but
the last few releases were ok). But due to some changes there were regressions
and people are hitting this now.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 1:46 ` Andi Kleen
@ 2006-02-17 2:12 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-02-17 2:12 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, torvalds, akpm, linux-mm, Yasunori Goto
Andi Kleen wrote:
> On Friday 17 February 2006 02:40, Christoph Lameter wrote:
>> What happens if another node beyond higest_node comes online later?
>> Or one node in between comes online?
>
> I don't know. Whoever implements node hotplug has to handle it.
> But I'm pretty sure the old code also didn't handle it, so it's not
> a regression.
>
> My primary interest is just to get all these Opterons booting again.
>
All existing pgdat's default zonelist should be refreshed when a new
node comes in. So,I think this patch wouldn't be problem.
It's node-hotplug's problem.
Goto is implementing it now by this:
==
+static int __build_all_zonelists(void *dummy)
+{
+ int i;
+ for_each_online_node(i)
+ build_zonelists(NODE_DATA(i));
+ /* XXX: Cpuset must be updated when node is hotplugged. */
+ return 0;
+}
<snip>
+ stop_machine_run(__build_all_zonelists, zone->zone_pgdat, NR_CPUS);
==
If this is ok, next problem is "how to remove pgdat/zone from all zonelist....".
If there are no performance problem, adding list and seqlock , callback to
zonelist is one way to manage add-remove-zone/pgdat to zonelist.
But this will make codes more complicated.
Thanks,
-- Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 2:10 ` Andi Kleen
@ 2006-02-17 2:46 ` Christoph Lameter
0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2006-02-17 2:46 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, torvalds, akpm, kiran, linux-mm
On Fri, 17 Feb 2006, Andi Kleen wrote:
> No, in theory not, but changing that would require considerable changes
> in the NUMA discovery code and I'm not planning to do that for 2.6.16 now.
Are you sure that the kernel can handle nodelists with holes everywhere?
This is essentially a new feature requiring a review of all uses of
node ranges.... I'd rather suggest to fix the arch.
> Also I think the generic code ought to handle that anyways. Why should
> we have node bitmaps if they can't have holes?
There are special cases for example in the slab allocator and possibly
elsewhere too. F.e. have a look at __alloc_percpu which must allocate
memory for cpus on offline nodes. These will then never be used. So
hopefully not an issue just a waste of memory. There is more in
alloc_alien_cache(). That is just the stuff that I know about off hand.
> > ia64 has a lookup table.
> x86-64 too.
So this is fixable in arch specific code.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 1:23 [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization Andi Kleen
2006-02-17 1:40 ` Christoph Lameter
2006-02-17 1:51 ` Christoph Lameter
@ 2006-02-17 3:33 ` Yasunori Goto
2006-02-17 16:52 ` Linus Torvalds
3 siblings, 0 replies; 17+ messages in thread
From: Yasunori Goto @ 2006-02-17 3:33 UTC (permalink / raw)
To: Andi Kleen; +Cc: torvalds, akpm, Christoph Lameter, linux-mm
> - n = (node+i) % num_online_nodes();
> + n = (node+i) % (highest_node + 1);
To tell the truth, I think that both of their calculations
are conceptually strange.
The first and true issue is that variable name is wrong.
Ok, I'll rewrite true meaning of this calculation
True vaiable names are here.
i -> online_node_id.
node -> start_node_id.
n -> target_node_id.
So, this loop is like followings.
for_each_online_node(online_node_id){
target_node_id = (start_node_id + online_node_id)
% num_online_nodes()
:
}
What does mean (start_node_id + ONLINE_node_id)?
~~~
This means nothing even if using highest_node_id.
If one of node is removed, or offlined at first,
trouble will be occur. target_node_id may point offlined node.
ex) start_node_id = 1, online nodes are 0, 1, 3....
(start_node_id + online_node_id) % num_online_nodes
1 1 3
target_node_id will be 2! It is offlined node.
But, now, node id is contiguous, and there is/(was?) no trouble.
So, everyone haven't minded this....
Bye.
--
Yasunori Goto
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 1:51 ` Christoph Lameter
2006-02-17 2:10 ` Andi Kleen
@ 2006-02-17 6:10 ` Yasunori Goto
2006-02-17 9:58 ` Andi Kleen
2006-02-17 16:05 ` Christoph Lameter
1 sibling, 2 replies; 17+ messages in thread
From: Yasunori Goto @ 2006-02-17 6:10 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, torvalds, akpm, linux-mm
> > Empty nodes are not initialization, but the node number is still
> > allocated. And then it would early except or even triple fault here
> > because it would try to set up a fallback list for a NULL pgdat. Oops.
>
> Isnt this an issue with the arch code? Simply do not allocate an empty
> node. Is the mapping from linux Node id -> Hardware node id fixed on
> x86_64? ia64 has a lookup table.
Do you mention about pxm_to_nid_map[]?
I picked it out to driver/acpi/numa.c. (see: current -mm)
It is not arch specific. pxm is acpi's spec, and node id is generic
linux kernel code. :-P
> These are empty nodes without processor? Or a processor without a node?
> In that case the processor will have to be assigned a default node.
???
Ia64 added the feature of memory less node long time ago.
This is in arch/ia64/mm/discontig.c
406 /**
407 * memory_less_nodes - allocate and initialize CPU only nodes pernode
408 * information.
410 static void __init memory_less_nodes(void)
411 {
409 */
:
:
Bye.
--
Yasunori Goto
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 6:10 ` Yasunori Goto
@ 2006-02-17 9:58 ` Andi Kleen
2006-02-17 11:23 ` Bob Picco
2006-02-17 16:05 ` Christoph Lameter
1 sibling, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2006-02-17 9:58 UTC (permalink / raw)
To: Yasunori Goto; +Cc: Christoph Lameter, torvalds, akpm, linux-mm
On Friday 17 February 2006 07:10, Yasunori Goto wrote:
> > > Empty nodes are not initialization, but the node number is still
> > > allocated. And then it would early except or even triple fault here
> > > because it would try to set up a fallback list for a NULL pgdat. Oops.
> >
> > Isnt this an issue with the arch code? Simply do not allocate an empty
> > node. Is the mapping from linux Node id -> Hardware node id fixed on
> > x86_64? ia64 has a lookup table.
>
> Do you mention about pxm_to_nid_map[]?
I think he refers to cpu_to_node[]
> Ia64 added the feature of memory less node long time ago.
x86-64 too, but it just bitrotted and that is what I was trying to fix.
I did some tests with a simulator in a few combinations of memory less
CPUs and with the two patches they all boot so far. But will test it out more.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 9:58 ` Andi Kleen
@ 2006-02-17 11:23 ` Bob Picco
2006-02-17 12:15 ` Andi Kleen
0 siblings, 1 reply; 17+ messages in thread
From: Bob Picco @ 2006-02-17 11:23 UTC (permalink / raw)
To: Andi Kleen; +Cc: Yasunori Goto, Christoph Lameter, torvalds, akpm, linux-mm
Andi Kleen wrote: [Fri Feb 17 2006, 04:58:32AM EST]
> On Friday 17 February 2006 07:10, Yasunori Goto wrote:
> > > > Empty nodes are not initialization, but the node number is still
> > > > allocated. And then it would early except or even triple fault here
> > > > because it would try to set up a fallback list for a NULL pgdat. Oops.
> > >
> > > Isnt this an issue with the arch code? Simply do not allocate an empty
> > > node. Is the mapping from linux Node id -> Hardware node id fixed on
> > > x86_64? ia64 has a lookup table.
> >
> > Do you mention about pxm_to_nid_map[]?
>
> I think he refers to cpu_to_node[]
>
> > Ia64 added the feature of memory less node long time ago.
>
> x86-64 too, but it just bitrotted and that is what I was trying to fix.
> I did some tests with a simulator in a few combinations of memory less
> CPUs and with the two patches they all boot so far. But will test it out more.
>
> -Andi
>
> --
Yasunori thanks for mentioning memory less nodes for ia64. This is my
concern with the patch. I need to test/review the patch on HP
hardware/simulator (most default configured HP NUMA machines are memory less -
interleaved memory). This has caused us numerous NUMA issues.
bob
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 11:23 ` Bob Picco
@ 2006-02-17 12:15 ` Andi Kleen
2006-02-17 14:34 ` Lee Schermerhorn
0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2006-02-17 12:15 UTC (permalink / raw)
To: Bob Picco; +Cc: Yasunori Goto, Christoph Lameter, torvalds, akpm, linux-mm
On Friday 17 February 2006 12:23, Bob Picco wrote:
> Yasunori thanks for mentioning memory less nodes for ia64. This is my
> concern with the patch.
I very much doubt it worked before without this patch in 2.6.16-* (unless you have
the memory less nodes all at the end and not in the middle)
> I need to test/review the patch on HP
> hardware/simulator (most default configured HP NUMA machines are memory less -
> interleaved memory). This has caused us numerous NUMA issues.
Yes, it's causing me problems on x86-64 all the time too.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 12:15 ` Andi Kleen
@ 2006-02-17 14:34 ` Lee Schermerhorn
0 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2006-02-17 14:34 UTC (permalink / raw)
To: Andi Kleen
Cc: Bob Picco, Yasunori Goto, Christoph Lameter, torvalds, akpm, linux-mm
On Fri, 2006-02-17 at 13:15 +0100, Andi Kleen wrote:
> On Friday 17 February 2006 12:23, Bob Picco wrote:
>
> > Yasunori thanks for mentioning memory less nodes for ia64. This is my
> > concern with the patch.
>
> I very much doubt it worked before without this patch in 2.6.16-* (unless you have
> the memory less nodes all at the end and not in the middle)
Yes, that is the case with HP NUMA platforms. When configure with fully
hardware interleaved memory, all of the real nodes, 0-n, show up as having
no memory while containing all the cpus. The memory shows up as a ficticious
node n+1 with no cpus. For completeness, I should mention that even when
configured for "100% cell local memory", the platforms still have the
memory-only pseudo-node containing 512MB [on 4 node system, e.g.] of
interleaved memory--at physaddr 0, I believe.
Except for the ACPI slab corruption that Bjorn fixed recently, 2.6.16-rc*
has successfully booted on these platforms.
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 6:10 ` Yasunori Goto
2006-02-17 9:58 ` Andi Kleen
@ 2006-02-17 16:05 ` Christoph Lameter
1 sibling, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2006-02-17 16:05 UTC (permalink / raw)
To: Yasunori Goto; +Cc: Christoph Lameter, Andi Kleen, torvalds, akpm, linux-mm
On Fri, 17 Feb 2006, Yasunori Goto wrote:
> > These are empty nodes without processor? Or a processor without a node?
> > In that case the processor will have to be assigned a default node.
>
> ???
> Ia64 added the feature of memory less node long time ago.
Correct but a processor without a node is a new configuration. The
processor has to be assigned some node.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 1:23 [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization Andi Kleen
` (2 preceding siblings ...)
2006-02-17 3:33 ` Yasunori Goto
@ 2006-02-17 16:52 ` Linus Torvalds
2006-02-17 18:07 ` Andi Kleen
3 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2006-02-17 16:52 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, Christoph Lameter, linux-mm
On Fri, 17 Feb 2006, Andi Kleen wrote:
>
> The new function to set up the node fallback lists didn't handle
> holes in the node map. This happens e.g. on Opterons when
> the a CPU is missing memory, which is not that uncommon.
That whole function is crap. Your changes don't seem to make it any less
crap, and depends on some insane and unreliable node ordering
characteristic, as far as I can tell. The thing is horrid.
There is no way you can know that your
n = (node + i) % random_value;
hits all nodes, much less a valid node.
Think about it: because we do "for_each_online_node(i)", the "i" is _not_
guaranteed to be contiguous, which means that "node + i" is not guaranteed
to be contiguous, which in turn means that you may be hopping over all the
valid nodes, and every time (because you do that stupid and undefined
"node + i" crap) you may hit something invalid or empty.
THE WHOLE ALGORITHM IS BROKEN.
Your patch doesn't make it any less broken, it just makes it _differently_
broken, and so you think it fixed something. It fixed absolutely NOTHING.
Here is a totally untested diff that may not even compile, but at least it
makes SENSE from a conceptual standpoint. The magis is
- don't do the "node + i" crap, because it is by definition broken.
It has no semantic meaning, and I _guarantee_ that you can't get it to
work in general.
- prefer nodes that follow the current node, by artificially inflating
the distance to previous nodes. This should automatically get you
exactly the round-robin behaviour you wanted.
NOTE! I've not tested (and thus not debugged) it. I don't even have NUMA
enabled, so I've not even compiled it. Somebody else please test it, and
send it back to me with a sign-off and a proper explanation, and I'll sign
off on it again and apply it.
Linus
----
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62c1225..208812b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1541,29 +1541,29 @@ static int __initdata node_load[MAX_NUMN
*/
static int __init find_next_best_node(int node, nodemask_t *used_node_mask)
{
- int i, n, val;
+ int n, val;
int min_val = INT_MAX;
int best_node = -1;
- for_each_online_node(i) {
- cpumask_t tmp;
+ /* Use the local node if we haven't already */
+ if (!node_isset(node, *used_node_mask)) {
+ node_set(node, *used_node_mask);
+ return node;
+ }
- /* Start from local node */
- n = (node+i) % num_online_nodes();
+ for_each_online_node(n) {
+ cpumask_t tmp;
/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
continue;
- /* Use the local node if we haven't already */
- if (!node_isset(node, *used_node_mask)) {
- best_node = node;
- break;
- }
-
/* Use the distance array to find the distance */
val = node_distance(node, n);
+ /* Penalize nodes under us ("prefer the next node") */
+ val += (n < node);
+
/* Give preference to headless and unused nodes */
tmp = node_to_cpumask(n);
if (!cpus_empty(tmp))
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 16:52 ` Linus Torvalds
@ 2006-02-17 18:07 ` Andi Kleen
2006-02-17 18:38 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2006-02-17 18:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: akpm, Christoph Lameter, linux-mm
On Friday 17 February 2006 17:52, Linus Torvalds wrote:
>
> On Fri, 17 Feb 2006, Andi Kleen wrote:
> >
> > The new function to set up the node fallback lists didn't handle
> > holes in the node map. This happens e.g. on Opterons when
> > the a CPU is missing memory, which is not that uncommon.
>
> That whole function is crap. Your changes don't seem to make it any less
> crap, and depends on some insane and unreliable node ordering
> characteristic, as far as I can tell. The thing is horrid.
Yes the algorithm is a bit strange anyways. Essentially it's a bogosort
indexed on node_distance() with some additional tweaks.
Maybe it would be better to collect all data into an array and then do a
normal sort.
> Think about it: because we do "for_each_online_node(i)", the "i" is _not_
> guaranteed to be contiguous, which means that "node + i" is not guaranteed
> to be contiguous, which in turn means that you may be hopping over all the
> valid nodes, and every time (because you do that stupid and undefined
> "node + i" crap) you may hit something invalid or empty.
That is why I added the !NODE_DATA(...) continue check
It will just continue until it finds a usable node.
But you're right it can miss valid nodes.
> NOTE! I've not tested (and thus not debugged) it. I don't even have NUMA
> enabled, so I've not even compiled it. Somebody else please test it, and
> send it back to me with a sign-off and a proper explanation, and I'll sign
> off on it again and apply it.
I gave it a quick boot on the simulator with a missing node and it looks
good. Will test it a bit more and then resubmit it.
Thanks,
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization
2006-02-17 18:07 ` Andi Kleen
@ 2006-02-17 18:38 ` Linus Torvalds
0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2006-02-17 18:38 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, Christoph Lameter, linux-mm
On Fri, 17 Feb 2006, Andi Kleen wrote:
>
> That is why I added the !NODE_DATA(...) continue check
> It will just continue until it finds a usable node.
The thing is, there is nothing to guarantee that it _ever_ finds a usable
node. Let's say that we have nodes
3 10
in the node map, then neither the old "i + n % 2" nor your new "i + n % 11"
will ever actually hit either of the valid nodes at all when you traverse
the thing. See?
That's why I'm saying that the "(i + n) % any_random_number" just can't be
right, and that you absolutely _have_ to use the numbers that
"for_each_online_node()" gives you directly. Using anything else is always
going to be buggy.
> > NOTE! I've not tested (and thus not debugged) it. I don't even have NUMA
> > enabled, so I've not even compiled it. Somebody else please test it, and
> > send it back to me with a sign-off and a proper explanation, and I'll sign
> > off on it again and apply it.
>
> I gave it a quick boot on the simulator with a missing node and it looks
> good. Will test it a bit more and then resubmit it.
If it compiles (and I didn't just do something stupid like use the wrong
variable or test the order the wrong way), I think my version is always
safe. It doesn't play games with the node numbers, and it only really
edits the "distance function" to have a dependency on the node
relationship.
So if it works at all, I think it works every time. But I'm biased ;)
And I'll never argue against more testing.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2006-02-17 18:38 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-17 1:23 [PATCH for 2.6.16] Handle holes in node mask in node fallback list initialization Andi Kleen
2006-02-17 1:40 ` Christoph Lameter
2006-02-17 1:46 ` Andi Kleen
2006-02-17 2:12 ` KAMEZAWA Hiroyuki
2006-02-17 1:51 ` Christoph Lameter
2006-02-17 2:10 ` Andi Kleen
2006-02-17 2:46 ` Christoph Lameter
2006-02-17 6:10 ` Yasunori Goto
2006-02-17 9:58 ` Andi Kleen
2006-02-17 11:23 ` Bob Picco
2006-02-17 12:15 ` Andi Kleen
2006-02-17 14:34 ` Lee Schermerhorn
2006-02-17 16:05 ` Christoph Lameter
2006-02-17 3:33 ` Yasunori Goto
2006-02-17 16:52 ` Linus Torvalds
2006-02-17 18:07 ` Andi Kleen
2006-02-17 18:38 ` Linus Torvalds
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox