* Re: [PATCH] get_nodes mask miscalculation [not found] <2rr7U-5xT-11@gated-at.bofh.it> @ 2004-08-10 1:44 ` Andi Kleen 2004-08-10 5:25 ` Paul Jackson 2004-08-10 14:54 ` Brent Casavant 0 siblings, 2 replies; 7+ messages in thread From: Andi Kleen @ 2004-08-10 1:44 UTC (permalink / raw) To: Brent Casavant; +Cc: linux-kernel, linux-mm Brent Casavant <bcasavan@sgi.com> writes: The idea behind this is was to make it behave like select. And select passes highest valid value + 1. In this case valid value is not the highest bit number, but the length of the bitmap. For some reason nobody except me seems to get it though, probably because the description in the manpages is a bit confusing: get_mempolicy(2): "maxnode is the maximum bit number plus one that can be stored into nodemask." I suppose this should be better described. For changing it it is too late unfortunately because the libnuma binaries are already out in the wild. > It appears there is a nodemask miscalculation in the get_nodes() > function in mm/mempolicy.c. This bug has two effects: > > 1. It is impossible to specify a length 1 nodemask. Sure. You pass 2. > 2. It is impossible to specify a nodemask containing the last node. you pass number of nodes + 1. > The following patch against 2.6.8-rc3 has been confirmed to solve > both problems. Problem is that you'll break all existing libnuma binaries which pass NUMA_MAX_NODES + 1. In your scheme the kernel will access one bit beyond the bitmap that got passed, and depending on its random value you may get a failure or not. BTW there is a minor problem in the code that there isn't a upper limit. When you pass 0 it currently iterates through all your memory until it hits an EFAULT, which can be a bit slow. But that's easy to fix. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation 2004-08-10 1:44 ` [PATCH] get_nodes mask miscalculation Andi Kleen @ 2004-08-10 5:25 ` Paul Jackson 2004-08-10 10:45 ` Andi Kleen 2004-08-10 14:54 ` Brent Casavant 1 sibling, 1 reply; 7+ messages in thread From: Paul Jackson @ 2004-08-10 5:25 UTC (permalink / raw) To: Andi Kleen; +Cc: bcasavan, linux-kernel, linux-mm Andi wrote: > The idea behind this is was to make it behave like select. > And select passes highest valid value + 1. In this case > valid value is not the highest bit number, but the length > of the bitmap. Andi, I think I need to be a pain in the butt about this ... sorry. I wish I had had enough numa-sense to recognize this issue when you proposed your patch earlier, when it would have been easier. But this off-by-one wrinkle to the numa API cost me some time as well, and I don't think Brent nor I are going to be the exception. I'd like to request that you consider the following, admittedly ugly, changes. First, a little more motivation ... If, say, I have 32 Nodes, numbered 0 ... 31, then it would be entirely unsurprising for me to consider 31 to be the highest valid value (highest valid node number), and hence expect to pass 32. The get_mempolicy(2) man page (numactl-0.7-pre1) states: maxnode is the maximum bit number plus one that can be stored into nodemask. This seems to read consistently with such expectations - the max bit number is 31, so pass 32. Wrong. The mbind(2) and set_mempolicy(2) state this in different words, but words which, to me, mean pretty much the same (wrong) thing: nodemask is pointer to a bit field of nodes that contains upto maxnode bits. And the code is further confusing. There are several places in the code that either decrement maxnode (maxnode--), or keep its passed in value, and work with one less (maxnode - 1). So the meaning of maxnode within the code is inconsistent. Besides the man page, the other key documentation is the code. Brent tried reading that, and came away with the wrong answer. While I'm here, the statement that only the highest zone is policied actually applies only to MPOL_BIND, right? The comment that asserts this should be so qualified, I suspect. And there are hardcoded numbers 64 and 8 in copy_nodes_to_user(). Please consider replacing with appropriate macros. So ... here's what I'd suggest: 1) Change the man page wording, for each of get_mempolicy(2), mbind(2), and set_mempolicy(2), to boldly state: Beware: Pass in a value of maxnode that is * one more * than the number of nodes represented in nodemask. If for example, nodemask represents 64 nodes, numbered 0 to 63, pass in a value of 65 for maxnodes. 2) Review, test, fix, and apply as fixed the following patch. For extra credit, get rid of the hard coded 8, 32 and 64 values in the compat stuff, visible in the patch below. I compiled the patch, once, on an ia64. Otherwise totally untested. This patch: a) Notes the situation in a prominent "==> Beware <==". b) Consistently decrements maxnode immediately on each system call entry (where someone reading the code might best notice). c) Otherwise treats maxnode consistently within the code. d) Addresses the MPOL_BIND max policy only comment. e) Addresses the harcoded numbers 64 and 8 in copy_nodes_to_user(). Yes - it's ugly. The time that will be lost by those who try to use this interface directly will be ugly too. 3) Could you propose a strategy for fixing this? It might take a couple of steps, and there might be other inconsistencies in the interim, such as special case handling in the kernel for maxnode values of 2049. If you have to tell me that SGI has to put in a workaround for a year, on any machine with _exactly_ 2049 nodes, such as padding the user nodemask up to 2050 nodes, I'm prepared to deal with that <grin>. Sure seems like it would be worth it, in the long run. Thanks for considering this. Signed-off-by: Paul Jackson <pj@sgi.com> diff -Naurp 2.6.8-rc3-mm2.orig/mm/mempolicy.c 2.6.8-rc3-mm2/mm/mempolicy.c --- 2.6.8-rc3-mm2.orig/mm/mempolicy.c 2004-08-09 21:35:16.000000000 -0700 +++ 2.6.8-rc3-mm2/mm/mempolicy.c 2004-08-09 21:35:34.000000000 -0700 @@ -7,6 +7,17 @@ * NUMA policy allows the user to give hints in which node(s) memory should * be allocated. * + * + * ==> Beware <== + * + * Pass in a value of maxnode to the system calls mbind(2), + * get_mempolicy(2) and set_mempolicy(2) that is one larger + * than you expect. If you have 64 nodes, represented by bits + * numbered 0..63 in your nodemask -- pass in 65 for maxnode. + * To remain compatible with existing libnuma user code, this + * cannot be changed. + * + * * Support four policies per VMA and per process: * * The VMA policy has priority over the process policy for a page fault. @@ -37,10 +48,10 @@ * is not applied, but the majority should be handled. When process policy * is used it is not remembered over swap outs/swap ins. * - * Only the highest zone in the zone hierarchy gets policied. Allocations - * requesting a lower zone just use default policy. This implies that - * on systems with highmem kernel lowmem allocation don't get policied. - * Same with GFP_DMA allocations. + * When using MPOL_BIND, only the highest zone in the zone hierarchy gets + * policied. Allocations requesting a lower zone just use default policy. + * This implies that on systems with highmem kernel lowmem allocation don't + * get policied by MPOL_BIND. Same with GFP_DMA allocations. * * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between * all users and remembered even when nobody has memory mapped. @@ -84,7 +95,7 @@ static kmem_cache_t *sn_cache; #define PDprintk(fmt...) /* Highest zone. An specific allocation for a zone below that is not - policied. */ + policied by MPOL_BIND */ static int policy_zone; static struct mempolicy default_policy = { @@ -134,7 +145,6 @@ static int get_nodes(unsigned long *node unsigned long nlongs; unsigned long endmask; - --maxnode; bitmap_zero(nodes, MAX_NUMNODES); if (maxnode == 0 || !nmask) return 0; @@ -352,6 +362,8 @@ asmlinkage long sys_mbind(unsigned long DECLARE_BITMAP(nodes, MAX_NUMNODES); int err; + maxnode = maxnode - 1; /* See ==> Beware <==, above */ + if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX) return -EINVAL; if (start & ~PAGE_MASK) @@ -394,6 +406,8 @@ asmlinkage long sys_set_mempolicy(int mo struct mempolicy *new; DECLARE_BITMAP(nodes, MAX_NUMNODES); + maxnode = maxnode - 1; /* See ==> Beware <==, above */ + if (mode > MPOL_MAX) return -EINVAL; err = get_nodes(nodes, nmask, maxnode, mode); @@ -454,8 +468,7 @@ static int lookup_node(struct mm_struct static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode, void *nodes, unsigned nbytes) { - unsigned long copy = ALIGN(maxnode-1, 64) / 8; - + unsigned long copy = BITS_TO_LONGS(maxnode) * sizeof(unsigned long); if (copy > nbytes) { if (copy > PAGE_SIZE) return -EINVAL; @@ -477,6 +490,8 @@ asmlinkage long sys_get_mempolicy(int __ struct vm_area_struct *vma = NULL; struct mempolicy *pol = current->mempolicy; + maxnode = maxnode - 1; /* See ==> Beware <==, above */ + if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR)) return -EINVAL; if (nmask != NULL && maxnode < numnodes) @@ -539,10 +554,13 @@ asmlinkage long compat_get_mempolicy(int { long err; unsigned long __user *nm = NULL; + + maxnode = maxnode - 1; /* See ==> Beware <==, above */ + if (nmask) - nm = compat_alloc_user_space(ALIGN(maxnode-1, 64) / 8); + nm = compat_alloc_user_space(ALIGN(maxnode, 64) / 8); err = sys_get_mempolicy(policy, nm, maxnode, addr, flags); - if (!err && copy_in_user(nmask, nm, ALIGN(maxnode-1, 32)/8)) + if (!err && copy_in_user(nmask, nm, ALIGN(maxnode, 32)/8)) err = -EFAULT; return err; } -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.650.933.1373 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation 2004-08-10 5:25 ` Paul Jackson @ 2004-08-10 10:45 ` Andi Kleen 2004-08-17 3:43 ` Paul Jackson 0 siblings, 1 reply; 7+ messages in thread From: Andi Kleen @ 2004-08-10 10:45 UTC (permalink / raw) To: Paul Jackson; +Cc: bcasavan, linux-kernel, linux-mm On Mon, Aug 09, 2004 at 10:25:31PM -0700, Paul Jackson wrote: > While I'm here, the statement that only the highest zone is policied > actually applies only to MPOL_BIND, right? The comment that asserts That is correct. I can add a comment. > 1) Change the man page wording, for each of get_mempolicy(2), mbind(2), > and set_mempolicy(2), to boldly state: > > Beware: > Pass in a value of maxnode that is * one more * than the > number of nodes represented in nodemask. If for example, > nodemask represents 64 nodes, numbered 0 to 63, pass in a > value of 65 for maxnodes. Yes, I will clarify the manpages. I see no problem with hardcoding 8 bits per byte. > > 2) Review, test, fix, and apply as fixed the following patch. For extra > credit, get rid of the hard coded 8, 32 and 64 values in the compat stuff, > visible in the patch below. I compiled the patch, once, on an ia64. > Otherwise totally untested. > > This patch: > a) Notes the situation in a prominent "==> Beware <==". > b) Consistently decrements maxnode immediately on each system call > entry (where someone reading the code might best notice). > c) Otherwise treats maxnode consistently within the code. > d) Addresses the MPOL_BIND max policy only comment. > e) Addresses the harcoded numbers 64 and 8 in copy_nodes_to_user(). Yes, the 64 should be addressed agreed. That came from a misguided attempt by me to not require compat_* functions (by making the 32bit and 64bit ABI be the same), but that didn't work out. > > Yes - it's ugly. The time that will be lost by those who try to use > this interface directly will be ugly too. > > 3) Could you propose a strategy for fixing this? It might take a couple The only way would be to allocate new system call slots for the two calls. But I'm not convinced it is worth it. Sorry, but I don't want to break binary compatibility. If it was really that bad you should have complained earlier (the code was out long enough for review), now it is too late. > If > you have to tell me that SGI has to put in a workaround for a year, on > any machine with _exactly_ 2049 nodes, such as padding the user nodemask > up to 2050 nodes, I'm prepared to deal with that <grin>. There are many more users of these calls than SGI. And most of them are using libnuma, which you are proposing to break. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation 2004-08-10 10:45 ` Andi Kleen @ 2004-08-17 3:43 ` Paul Jackson 0 siblings, 0 replies; 7+ messages in thread From: Paul Jackson @ 2004-08-17 3:43 UTC (permalink / raw) To: linux-kernel; +Cc: bcasavan, linux-mm, Andi Kleen Paul wrote: > a) Notes the situation in a prominent "==> Beware <==". > b) Consistently decrements maxnode immediately on each system call > entry (where someone reading the code might best notice). Perhaps I missed it, but I didn't notice a response to these points in your reply, Andi. In addition to improving the man pages, the other primary documentation is the code itself. I look forward to seeing your proposed patch on this, I hope that patch includes consistent handling of the value 'maxnode' (which is sometimes numnodes, and sometimes numnodes+1). The combination of consistent maxnode usage, and more explicit, commented, handling of the calls expecting the plus one value might improve the chances that a reader of the code will correctly use it. > The only way would be to allocate new system call slot ... > > ... most of them are using libnuma, which you are proposing to break. No ... I didn't say "add a syscall", and I didn't say "break libnuma" (at least, not break any real world usage). Libnuma doesn't pass in just any old value ... it seems to only pass in one of the two hard coded values 129 (x86_64 or i386 arches) or 2049 (other arches). Without divulging any of SGI's plans, I think I can safely predict that not many systems with _exactly_ 2049 nodes will ship in the next year. Perhaps even zero. 2049 is an odd number. Do you have any idea, Andi, what would be a reasonable upper bound on the world wide number of shipments of 129 node x86_64 or i386 systems? My conjecture - zero. It's another odd number ;). If we changed both kernel and libnuma, and required that anyone actually needing to use the full 128 nodes (or 2048 on the other arches) had to match kernel and libnuma (both old or both new), then seems like a change in this convention would actually have vanishingly small real world impact (less impact in the long run than not doing this ;). Either: 1) almost no 128 node x86_64 or i386 systems will ship, and it's not a big issue, practically, or 2) some will ship, and libnuma will soon be broken anyway, when someone does the inevitable and ships an even larger x86_64 or i386 system. So I am not asking you to break any real world usage, nor to add a system call. I'm just asking you to describe what would be the _actual_ problems caused by changing this at this time. So far, I can't see any real world problem that would be caused by changing this, while I already see problems caused by not changing it. Certainly, if we ever do change this, the sooner the better. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.650.933.1373 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation 2004-08-10 1:44 ` [PATCH] get_nodes mask miscalculation Andi Kleen 2004-08-10 5:25 ` Paul Jackson @ 2004-08-10 14:54 ` Brent Casavant 2004-08-10 15:34 ` Andi Kleen 1 sibling, 1 reply; 7+ messages in thread From: Brent Casavant @ 2004-08-10 14:54 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, linux-mm On Tue, 10 Aug 2004, Andi Kleen wrote: > The idea behind this is was to make it behave like select. > And select passes highest valid value + 1. Not that my opinion is worth much, but this seems a very peculiar interface to mimic. > In this case > valid value is not the highest bit number, but the length > of the bitmap. > > For some reason nobody except me seems to get it though, > probably because the description in the manpages is a bit confusing: > > get_mempolicy(2): > > "maxnode is the maximum bit number plus one that can be stored into > nodemask." Actually, I worked from the following description: mbind(3) . . . nodemask is a bit field of nodes that contains up to maxnode bits. . . . This very clearly indicates that I should pass the number of bits in the nodemask field, not the number of bits plus one. Granted, the man page makes it clear that I should use libnuma, but in this case I was working on implementing a new MPOL_ROUNDROBIN, and wanted to go directly to the system call for testing purposes. And not to make things even more confusing, but the way things are designed now, the value I need to pass to mbind() is numa_max_node()+2. Very confusing. > Problem is that you'll break all existing libnuma binaries > which pass NUMA_MAX_NODES + 1. In your scheme the kernel > will access one bit beyond the bitmap that got passed, > and depending on its random value you may get a failure or not. Well, again, not that my opinion carries much weight (nor really should it), but this whole off-by-one (or two) interface seems unnecessarily confusing. Is there no way we can get this fixed? The temporary pain of breaking the relatively new libnuma seems worth the price of getting this forever cleaned up. Brent -- Brent Casavant bcasavan@sgi.com Forget bright-eyed and Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red- Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] get_nodes mask miscalculation 2004-08-10 14:54 ` Brent Casavant @ 2004-08-10 15:34 ` Andi Kleen 0 siblings, 0 replies; 7+ messages in thread From: Andi Kleen @ 2004-08-10 15:34 UTC (permalink / raw) To: Brent Casavant; +Cc: linux-kernel, linux-mm On Tue, Aug 10, 2004 at 09:54:08AM -0500, Brent Casavant wrote: > On Tue, 10 Aug 2004, Andi Kleen wrote: > > > The idea behind this is was to make it behave like select. > > And select passes highest valid value + 1. > > Not that my opinion is worth much, but this seems a very peculiar > interface to mimic. It is the only system call I am aware of that passes a variable length bitmap. Do you know of any others? > > nodemask." > > Actually, I worked from the following description: > > mbind(3) > . . . > nodemask is a bit field of nodes that contains up to > maxnode bits. > . . . > > This very clearly indicates that I should pass the number of bits > in the nodemask field, not the number of bits plus one. I really need to rewrite the mbind manpage - it was a quick hack job and has several other errors. > > Problem is that you'll break all existing libnuma binaries > > which pass NUMA_MAX_NODES + 1. In your scheme the kernel > > will access one bit beyond the bitmap that got passed, > > and depending on its random value you may get a failure or not. > > Well, again, not that my opinion carries much weight (nor really should > it), but this whole off-by-one (or two) interface seems unnecessarily > confusing. Is there no way we can get this fixed? The temporary pain Sure, new system call slots would do with a small compat wrapper for the old calls. I'm just not sure it is worth it. I personally don't find it that confusing, but I'm probably not a good benchmark for this. > of breaking the relatively new libnuma seems worth the price of getting > this forever cleaned up. I don't think breaking existing binaries is a good idea. Linux has higher standards. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH] get_nodes mask miscalculation
@ 2004-08-09 22:39 Brent Casavant
0 siblings, 0 replies; 7+ messages in thread
From: Brent Casavant @ 2004-08-09 22:39 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel
It appears there is a nodemask miscalculation in the get_nodes()
function in mm/mempolicy.c. This bug has two effects:
1. It is impossible to specify a length 1 nodemask.
2. It is impossible to specify a nodemask containing the last node.
The following patch against 2.6.8-rc3 has been confirmed to solve
both problems.
Signed-off-by: Brent Casavant <bcasavan@sgi.com>
--- linux-2.6.8-rc3/mm/mempolicy.c 2004-08-03 16:28:51.000000000 -0500
+++ linux-work/mm/mempolicy.c 2004-08-09 14:56:44.000000000 -0500
@@ -132,7 +132,6 @@
unsigned long nlongs;
unsigned long endmask;
- --maxnode;
bitmap_zero(nodes, MAX_NUMNODES);
if (maxnode == 0 || !nmask)
return 0;
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in threadend of thread, other threads:[~2004-08-17 3:43 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <2rr7U-5xT-11@gated-at.bofh.it>
2004-08-10 1:44 ` [PATCH] get_nodes mask miscalculation Andi Kleen
2004-08-10 5:25 ` Paul Jackson
2004-08-10 10:45 ` Andi Kleen
2004-08-17 3:43 ` Paul Jackson
2004-08-10 14:54 ` Brent Casavant
2004-08-10 15:34 ` Andi Kleen
2004-08-09 22:39 Brent Casavant
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox