* Re: [PATCH] get_nodes mask miscalculation
[not found] <2rr7U-5xT-11@gated-at.bofh.it>
@ 2004-08-10 1:44 ` Andi Kleen
2004-08-10 5:25 ` Paul Jackson
2004-08-10 14:54 ` Brent Casavant
0 siblings, 2 replies; 7+ messages in thread
From: Andi Kleen @ 2004-08-10 1:44 UTC (permalink / raw)
To: Brent Casavant; +Cc: linux-kernel, linux-mm
Brent Casavant <bcasavan@sgi.com> writes:
The idea behind this is was to make it behave like select.
And select passes highest valid value + 1. In this case
valid value is not the highest bit number, but the length
of the bitmap.
For some reason nobody except me seems to get it though,
probably because the description in the manpages is a bit confusing:
get_mempolicy(2):
"maxnode is the maximum bit number plus one that can be stored into
nodemask."
I suppose this should be better described. For changing it
it is too late unfortunately because the libnuma binaries
are already out in the wild.
> It appears there is a nodemask miscalculation in the get_nodes()
> function in mm/mempolicy.c. This bug has two effects:
>
> 1. It is impossible to specify a length 1 nodemask.
Sure. You pass 2.
> 2. It is impossible to specify a nodemask containing the last node.
you pass number of nodes + 1.
> The following patch against 2.6.8-rc3 has been confirmed to solve
> both problems.
Problem is that you'll break all existing libnuma binaries
which pass NUMA_MAX_NODES + 1. In your scheme the kernel
will access one bit beyond the bitmap that got passed,
and depending on its random value you may get a failure or not.
BTW there is a minor problem in the code that there isn't a upper
limit. When you pass 0 it currently iterates through all your memory
until it hits an EFAULT, which can be a bit slow. But that's easy to
fix.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation
2004-08-10 1:44 ` [PATCH] get_nodes mask miscalculation Andi Kleen
@ 2004-08-10 5:25 ` Paul Jackson
2004-08-10 10:45 ` Andi Kleen
2004-08-10 14:54 ` Brent Casavant
1 sibling, 1 reply; 7+ messages in thread
From: Paul Jackson @ 2004-08-10 5:25 UTC (permalink / raw)
To: Andi Kleen; +Cc: bcasavan, linux-kernel, linux-mm
Andi wrote:
> The idea behind this is was to make it behave like select.
> And select passes highest valid value + 1. In this case
> valid value is not the highest bit number, but the length
> of the bitmap.
Andi,
I think I need to be a pain in the butt about this ... sorry.
I wish I had had enough numa-sense to recognize this issue when
you proposed your patch earlier, when it would have been easier.
But this off-by-one wrinkle to the numa API cost me some time as
well, and I don't think Brent nor I are going to be the exception.
I'd like to request that you consider the following, admittedly
ugly, changes.
First, a little more motivation ...
If, say, I have 32 Nodes, numbered 0 ... 31, then it would be entirely
unsurprising for me to consider 31 to be the highest valid value (highest
valid node number), and hence expect to pass 32.
The get_mempolicy(2) man page (numactl-0.7-pre1) states:
maxnode is the maximum bit number plus one that can be stored
into nodemask.
This seems to read consistently with such expectations - the max
bit number is 31, so pass 32. Wrong.
The mbind(2) and set_mempolicy(2) state this in different words,
but words which, to me, mean pretty much the same (wrong) thing:
nodemask is pointer to a bit field of nodes that contains upto
maxnode bits.
And the code is further confusing. There are several places in
the code that either decrement maxnode (maxnode--), or keep its
passed in value, and work with one less (maxnode - 1). So the
meaning of maxnode within the code is inconsistent.
Besides the man page, the other key documentation is the code.
Brent tried reading that, and came away with the wrong answer.
While I'm here, the statement that only the highest zone is policied
actually applies only to MPOL_BIND, right? The comment that asserts
this should be so qualified, I suspect.
And there are hardcoded numbers 64 and 8 in copy_nodes_to_user().
Please consider replacing with appropriate macros.
So ... here's what I'd suggest:
1) Change the man page wording, for each of get_mempolicy(2), mbind(2),
and set_mempolicy(2), to boldly state:
Beware:
Pass in a value of maxnode that is * one more * than the
number of nodes represented in nodemask. If for example,
nodemask represents 64 nodes, numbered 0 to 63, pass in a
value of 65 for maxnodes.
2) Review, test, fix, and apply as fixed the following patch. For extra
credit, get rid of the hard coded 8, 32 and 64 values in the compat stuff,
visible in the patch below. I compiled the patch, once, on an ia64.
Otherwise totally untested.
This patch:
a) Notes the situation in a prominent "==> Beware <==".
b) Consistently decrements maxnode immediately on each system call
entry (where someone reading the code might best notice).
c) Otherwise treats maxnode consistently within the code.
d) Addresses the MPOL_BIND max policy only comment.
e) Addresses the harcoded numbers 64 and 8 in copy_nodes_to_user().
Yes - it's ugly. The time that will be lost by those who try to use
this interface directly will be ugly too.
3) Could you propose a strategy for fixing this? It might take a couple
of steps, and there might be other inconsistencies in the interim, such
as special case handling in the kernel for maxnode values of 2049. If
you have to tell me that SGI has to put in a workaround for a year, on
any machine with _exactly_ 2049 nodes, such as padding the user nodemask
up to 2050 nodes, I'm prepared to deal with that <grin>.
Sure seems like it would be worth it, in the long run.
Thanks for considering this.
Signed-off-by: Paul Jackson <pj@sgi.com>
diff -Naurp 2.6.8-rc3-mm2.orig/mm/mempolicy.c 2.6.8-rc3-mm2/mm/mempolicy.c
--- 2.6.8-rc3-mm2.orig/mm/mempolicy.c 2004-08-09 21:35:16.000000000 -0700
+++ 2.6.8-rc3-mm2/mm/mempolicy.c 2004-08-09 21:35:34.000000000 -0700
@@ -7,6 +7,17 @@
* NUMA policy allows the user to give hints in which node(s) memory should
* be allocated.
*
+ *
+ * ==> Beware <==
+ *
+ * Pass in a value of maxnode to the system calls mbind(2),
+ * get_mempolicy(2) and set_mempolicy(2) that is one larger
+ * than you expect. If you have 64 nodes, represented by bits
+ * numbered 0..63 in your nodemask -- pass in 65 for maxnode.
+ * To remain compatible with existing libnuma user code, this
+ * cannot be changed.
+ *
+ *
* Support four policies per VMA and per process:
*
* The VMA policy has priority over the process policy for a page fault.
@@ -37,10 +48,10 @@
* is not applied, but the majority should be handled. When process policy
* is used it is not remembered over swap outs/swap ins.
*
- * Only the highest zone in the zone hierarchy gets policied. Allocations
- * requesting a lower zone just use default policy. This implies that
- * on systems with highmem kernel lowmem allocation don't get policied.
- * Same with GFP_DMA allocations.
+ * When using MPOL_BIND, only the highest zone in the zone hierarchy gets
+ * policied. Allocations requesting a lower zone just use default policy.
+ * This implies that on systems with highmem kernel lowmem allocation don't
+ * get policied by MPOL_BIND. Same with GFP_DMA allocations.
*
* For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
* all users and remembered even when nobody has memory mapped.
@@ -84,7 +95,7 @@ static kmem_cache_t *sn_cache;
#define PDprintk(fmt...)
/* Highest zone. An specific allocation for a zone below that is not
- policied. */
+ policied by MPOL_BIND */
static int policy_zone;
static struct mempolicy default_policy = {
@@ -134,7 +145,6 @@ static int get_nodes(unsigned long *node
unsigned long nlongs;
unsigned long endmask;
- --maxnode;
bitmap_zero(nodes, MAX_NUMNODES);
if (maxnode == 0 || !nmask)
return 0;
@@ -352,6 +362,8 @@ asmlinkage long sys_mbind(unsigned long
DECLARE_BITMAP(nodes, MAX_NUMNODES);
int err;
+ maxnode = maxnode - 1; /* See ==> Beware <==, above */
+
if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
return -EINVAL;
if (start & ~PAGE_MASK)
@@ -394,6 +406,8 @@ asmlinkage long sys_set_mempolicy(int mo
struct mempolicy *new;
DECLARE_BITMAP(nodes, MAX_NUMNODES);
+ maxnode = maxnode - 1; /* See ==> Beware <==, above */
+
if (mode > MPOL_MAX)
return -EINVAL;
err = get_nodes(nodes, nmask, maxnode, mode);
@@ -454,8 +468,7 @@ static int lookup_node(struct mm_struct
static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
void *nodes, unsigned nbytes)
{
- unsigned long copy = ALIGN(maxnode-1, 64) / 8;
-
+ unsigned long copy = BITS_TO_LONGS(maxnode) * sizeof(unsigned long);
if (copy > nbytes) {
if (copy > PAGE_SIZE)
return -EINVAL;
@@ -477,6 +490,8 @@ asmlinkage long sys_get_mempolicy(int __
struct vm_area_struct *vma = NULL;
struct mempolicy *pol = current->mempolicy;
+ maxnode = maxnode - 1; /* See ==> Beware <==, above */
+
if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
return -EINVAL;
if (nmask != NULL && maxnode < numnodes)
@@ -539,10 +554,13 @@ asmlinkage long compat_get_mempolicy(int
{
long err;
unsigned long __user *nm = NULL;
+
+ maxnode = maxnode - 1; /* See ==> Beware <==, above */
+
if (nmask)
- nm = compat_alloc_user_space(ALIGN(maxnode-1, 64) / 8);
+ nm = compat_alloc_user_space(ALIGN(maxnode, 64) / 8);
err = sys_get_mempolicy(policy, nm, maxnode, addr, flags);
- if (!err && copy_in_user(nmask, nm, ALIGN(maxnode-1, 32)/8))
+ if (!err && copy_in_user(nmask, nm, ALIGN(maxnode, 32)/8))
err = -EFAULT;
return err;
}
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation
2004-08-10 5:25 ` Paul Jackson
@ 2004-08-10 10:45 ` Andi Kleen
2004-08-17 3:43 ` Paul Jackson
0 siblings, 1 reply; 7+ messages in thread
From: Andi Kleen @ 2004-08-10 10:45 UTC (permalink / raw)
To: Paul Jackson; +Cc: bcasavan, linux-kernel, linux-mm
On Mon, Aug 09, 2004 at 10:25:31PM -0700, Paul Jackson wrote:
> While I'm here, the statement that only the highest zone is policied
> actually applies only to MPOL_BIND, right? The comment that asserts
That is correct. I can add a comment.
> 1) Change the man page wording, for each of get_mempolicy(2), mbind(2),
> and set_mempolicy(2), to boldly state:
>
> Beware:
> Pass in a value of maxnode that is * one more * than the
> number of nodes represented in nodemask. If for example,
> nodemask represents 64 nodes, numbered 0 to 63, pass in a
> value of 65 for maxnodes.
Yes, I will clarify the manpages.
I see no problem with hardcoding 8 bits per byte.
>
> 2) Review, test, fix, and apply as fixed the following patch. For extra
> credit, get rid of the hard coded 8, 32 and 64 values in the compat stuff,
> visible in the patch below. I compiled the patch, once, on an ia64.
> Otherwise totally untested.
>
> This patch:
> a) Notes the situation in a prominent "==> Beware <==".
> b) Consistently decrements maxnode immediately on each system call
> entry (where someone reading the code might best notice).
> c) Otherwise treats maxnode consistently within the code.
> d) Addresses the MPOL_BIND max policy only comment.
> e) Addresses the harcoded numbers 64 and 8 in copy_nodes_to_user().
Yes, the 64 should be addressed agreed. That came from a misguided
attempt by me to not require compat_* functions (by making the 32bit
and 64bit ABI be the same), but that didn't work out.
>
> Yes - it's ugly. The time that will be lost by those who try to use
> this interface directly will be ugly too.
>
> 3) Could you propose a strategy for fixing this? It might take a couple
The only way would be to allocate new system call slots for the
two calls. But I'm not convinced it is worth it.
Sorry, but I don't want to break binary compatibility.
If it was really that bad you should have complained earlier (the
code was out long enough for review), now it is too late.
> If
> you have to tell me that SGI has to put in a workaround for a year, on
> any machine with _exactly_ 2049 nodes, such as padding the user nodemask
> up to 2050 nodes, I'm prepared to deal with that <grin>.
There are many more users of these calls than SGI. And most of them are
using libnuma, which you are proposing to break.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation
2004-08-10 1:44 ` [PATCH] get_nodes mask miscalculation Andi Kleen
2004-08-10 5:25 ` Paul Jackson
@ 2004-08-10 14:54 ` Brent Casavant
2004-08-10 15:34 ` Andi Kleen
1 sibling, 1 reply; 7+ messages in thread
From: Brent Casavant @ 2004-08-10 14:54 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, linux-mm
On Tue, 10 Aug 2004, Andi Kleen wrote:
> The idea behind this is was to make it behave like select.
> And select passes highest valid value + 1.
Not that my opinion is worth much, but this seems a very peculiar
interface to mimic.
> In this case
> valid value is not the highest bit number, but the length
> of the bitmap.
>
> For some reason nobody except me seems to get it though,
> probably because the description in the manpages is a bit confusing:
>
> get_mempolicy(2):
>
> "maxnode is the maximum bit number plus one that can be stored into
> nodemask."
Actually, I worked from the following description:
mbind(3)
. . .
nodemask is a bit field of nodes that contains up to
maxnode bits.
. . .
This very clearly indicates that I should pass the number of bits
in the nodemask field, not the number of bits plus one.
Granted, the man page makes it clear that I should use libnuma, but
in this case I was working on implementing a new MPOL_ROUNDROBIN, and
wanted to go directly to the system call for testing purposes.
And not to make things even more confusing, but the way things are
designed now, the value I need to pass to mbind() is numa_max_node()+2.
Very confusing.
> Problem is that you'll break all existing libnuma binaries
> which pass NUMA_MAX_NODES + 1. In your scheme the kernel
> will access one bit beyond the bitmap that got passed,
> and depending on its random value you may get a failure or not.
Well, again, not that my opinion carries much weight (nor really should
it), but this whole off-by-one (or two) interface seems unnecessarily
confusing. Is there no way we can get this fixed? The temporary pain
of breaking the relatively new libnuma seems worth the price of getting
this forever cleaned up.
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation
2004-08-10 14:54 ` Brent Casavant
@ 2004-08-10 15:34 ` Andi Kleen
0 siblings, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2004-08-10 15:34 UTC (permalink / raw)
To: Brent Casavant; +Cc: linux-kernel, linux-mm
On Tue, Aug 10, 2004 at 09:54:08AM -0500, Brent Casavant wrote:
> On Tue, 10 Aug 2004, Andi Kleen wrote:
>
> > The idea behind this is was to make it behave like select.
> > And select passes highest valid value + 1.
>
> Not that my opinion is worth much, but this seems a very peculiar
> interface to mimic.
It is the only system call I am aware of that passes a variable
length bitmap. Do you know of any others?
> > nodemask."
>
> Actually, I worked from the following description:
>
> mbind(3)
> . . .
> nodemask is a bit field of nodes that contains up to
> maxnode bits.
> . . .
>
> This very clearly indicates that I should pass the number of bits
> in the nodemask field, not the number of bits plus one.
I really need to rewrite the mbind manpage - it was a quick
hack job and has several other errors.
> > Problem is that you'll break all existing libnuma binaries
> > which pass NUMA_MAX_NODES + 1. In your scheme the kernel
> > will access one bit beyond the bitmap that got passed,
> > and depending on its random value you may get a failure or not.
>
> Well, again, not that my opinion carries much weight (nor really should
> it), but this whole off-by-one (or two) interface seems unnecessarily
> confusing. Is there no way we can get this fixed? The temporary pain
Sure, new system call slots would do with a small compat wrapper
for the old calls. I'm just not sure it is worth it. I personally
don't find it that confusing, but I'm probably not a good benchmark
for this.
> of breaking the relatively new libnuma seems worth the price of getting
> this forever cleaned up.
I don't think breaking existing binaries is a good idea. Linux
has higher standards.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation
2004-08-10 10:45 ` Andi Kleen
@ 2004-08-17 3:43 ` Paul Jackson
0 siblings, 0 replies; 7+ messages in thread
From: Paul Jackson @ 2004-08-17 3:43 UTC (permalink / raw)
To: linux-kernel; +Cc: bcasavan, linux-mm, Andi Kleen
Paul wrote:
> a) Notes the situation in a prominent "==> Beware <==".
> b) Consistently decrements maxnode immediately on each system call
> entry (where someone reading the code might best notice).
Perhaps I missed it, but I didn't notice a response to these points in
your reply, Andi. In addition to improving the man pages, the other
primary documentation is the code itself. I look forward to seeing your
proposed patch on this, I hope that patch includes consistent handling
of the value 'maxnode' (which is sometimes numnodes, and sometimes
numnodes+1). The combination of consistent maxnode usage, and more
explicit, commented, handling of the calls expecting the plus one value
might improve the chances that a reader of the code will correctly use
it.
> The only way would be to allocate new system call slot ...
>
> ... most of them are using libnuma, which you are proposing to break.
No ... I didn't say "add a syscall", and I didn't say "break libnuma"
(at least, not break any real world usage).
Libnuma doesn't pass in just any old value ... it seems to only pass in
one of the two hard coded values 129 (x86_64 or i386 arches) or 2049
(other arches).
Without divulging any of SGI's plans, I think I can safely predict that
not many systems with _exactly_ 2049 nodes will ship in the next year.
Perhaps even zero. 2049 is an odd number.
Do you have any idea, Andi, what would be a reasonable upper bound on
the world wide number of shipments of 129 node x86_64 or i386 systems?
My conjecture - zero. It's another odd number ;).
If we changed both kernel and libnuma, and required that anyone actually
needing to use the full 128 nodes (or 2048 on the other arches) had to
match kernel and libnuma (both old or both new), then seems like a
change in this convention would actually have vanishingly small real
world impact (less impact in the long run than not doing this ;).
Either:
1) almost no 128 node x86_64 or i386 systems will ship, and it's not
a big issue, practically,
or
2) some will ship, and libnuma will soon be broken anyway, when someone
does the inevitable and ships an even larger x86_64 or i386 system.
So I am not asking you to break any real world usage, nor to add a
system call. I'm just asking you to describe what would be the _actual_
problems caused by changing this at this time. So far, I can't see
any real world problem that would be caused by changing this, while I
already see problems caused by not changing it.
Certainly, if we ever do change this, the sooner the better.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH] get_nodes mask miscalculation
@ 2004-08-09 22:39 Brent Casavant
0 siblings, 0 replies; 7+ messages in thread
From: Brent Casavant @ 2004-08-09 22:39 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel
It appears there is a nodemask miscalculation in the get_nodes()
function in mm/mempolicy.c. This bug has two effects:
1. It is impossible to specify a length 1 nodemask.
2. It is impossible to specify a nodemask containing the last node.
The following patch against 2.6.8-rc3 has been confirmed to solve
both problems.
Signed-off-by: Brent Casavant <bcasavan@sgi.com>
--- linux-2.6.8-rc3/mm/mempolicy.c 2004-08-03 16:28:51.000000000 -0500
+++ linux-work/mm/mempolicy.c 2004-08-09 14:56:44.000000000 -0500
@@ -132,7 +132,6 @@
unsigned long nlongs;
unsigned long endmask;
- --maxnode;
bitmap_zero(nodes, MAX_NUMNODES);
if (maxnode == 0 || !nmask)
return 0;
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2004-08-17 3:43 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <2rr7U-5xT-11@gated-at.bofh.it>
2004-08-10 1:44 ` [PATCH] get_nodes mask miscalculation Andi Kleen
2004-08-10 5:25 ` Paul Jackson
2004-08-10 10:45 ` Andi Kleen
2004-08-17 3:43 ` Paul Jackson
2004-08-10 14:54 ` Brent Casavant
2004-08-10 15:34 ` Andi Kleen
2004-08-09 22:39 Brent Casavant
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox