linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [patch] check cpuset mems_allowed for sys_mbind
@ 2007-05-09 23:11 Ken Chen
  2007-05-09 23:48 ` Paul Jackson
  0 siblings, 1 reply; 9+ messages in thread
From: Ken Chen @ 2007-05-09 23:11 UTC (permalink / raw)
  To: Paul Jackson, Andrew Morton; +Cc: linux-mm

I wonder why we don't check cpuset's mems_allowed node mask in the
sys_mbind() path?  sys_set_mempolicy() however, does the enforcement
against cpuset so process can not accidentally set mempolicy with
memory node mask that are not allowed to allocated from.  I think we
should have the equivalent check in the mbind path.   Otherwise, there
are discrepancy in what sys_mbind agrees to versus what the page
allocation policy that enforced by cpuset.  This discrepancy
subsequently causes performance surprises to the application.

Or is it left out intentionally?  for what reason?


Signed-off-by: Ken Chen <kenchen@google.com>

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d76e8eb..ef81080 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -762,7 +762,7 @@ long do_mbind(unsigned long start, unsig
 	if (end == start)
 		return 0;

-	if (mpol_check_policy(mode, nmask))
+	if (contextualize_policy(mode, nmask))
 		return -EINVAL;

 	new = mpol_new(mode, nmask);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] check cpuset mems_allowed for sys_mbind
  2007-05-09 23:11 [patch] check cpuset mems_allowed for sys_mbind Ken Chen
@ 2007-05-09 23:48 ` Paul Jackson
  2007-05-10  0:47   ` Ken Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Jackson @ 2007-05-09 23:48 UTC (permalink / raw)
  To: Ken Chen; +Cc: akpm, linux-mm, clameter

Ken wrote:
> I wonder why we don't check cpuset's mems_allowed node mask in the
> sys_mbind() path?

Looking back through the version history of mm/mempolicy.c, I see that
we used to check the cpuset (by calling contextualize_policy), but then
with the following patch (Christoph added to CC list above), this was
changed.

=========================== begin ===========================
Subject: - remove-policy-contextualization-from-mbind.patch removed from -mm tree
To: clameter@engr.sgi.com, ak@muc.de, clameter@sgi.com,
   mm-commits@vger.kernel.org
From: akpm@osdl.org
Date:   Sun, 30 Oct 2005 00:27:40 -0700


The patch titled

     Remove policy contextualization from mbind

has been removed from the -mm tree.  Its filename is

     remove-policy-contextualization-from-mbind.patch

This patch was probably dropped from -mm because
it has already been merged into a subsystem tree
or into Linus's tree


From: Christoph Lameter <clameter@engr.sgi.com>

Policy contextualization is only useful for task based policies and not for
vma based policies.  It may be useful to define allowed nodes that are not
accessible from this thread because other threads may have access to these
nodes.  Without this patch strange memory policy situations may cause an
application to fail with out of memory.

Example:

Let's say we have two threads A and B that share the same address space and
a huge array computational array X.

Thread A is restricted by its cpuset to nodes 0 and 1 and thread B is
restricted by its cpuset to nodes 2 and 3.

Thread A now wants to restrict allocations to the first node and thus
applies a BIND policy on X to node 0 and 2.  The cpuset limits this to node
0.  Thus pages for X must be allocated on node 0 now.

Thread B now touches a page that has never been used in X and faults in a
page.  According to the BIND policy of the vma for X the page must be
allocated on page 0.  However, the cpuset of B does not allow allocation on
0 and 1.  Now the application fails in alloc_pages with out of memory.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 mm/mempolicy.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff -puN mm/mempolicy.c~remove-policy-contextualization-from-mbind mm/mempolicy.c
--- devel/mm/mempolicy.c~remove-policy-contextualization-from-mbind     2005-10-29 17:43:43.000000000 -0700
+++ devel-akpm/mm/mempolicy.c   2005-10-29 17:43:43.000000000 -0700
@@ -370,7 +370,7 @@ long do_mbind(unsigned long start, unsig
                return -EINVAL;
        if (end == start)
                return 0;
-       if (contextualize_policy(mode, nmask))
+       if (mpol_check_policy(mode, nmask))
                return -EINVAL;
        new = mpol_new(mode, nmask);
        if (IS_ERR(new))
_

Patches currently in -mm which might be from clameter@engr.sgi.com are

increase-maximum-kmalloc-size-to-256k.patch
use-alloc_percpu-to-allocate-workqueues-locally.patch
============================ end ============================

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] check cpuset mems_allowed for sys_mbind
  2007-05-09 23:48 ` Paul Jackson
@ 2007-05-10  0:47   ` Ken Chen
  2007-05-10  0:55     ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Ken Chen @ 2007-05-10  0:47 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, linux-mm, clameter

On 5/9/07, Paul Jackson <pj@sgi.com> wrote:
> Ken wrote:
> > I wonder why we don't check cpuset's mems_allowed node mask in the
> > sys_mbind() path?
>
> Looking back through the version history of mm/mempolicy.c, I see that
> we used to check the cpuset (by calling contextualize_policy), but then
> with the following patch (Christoph added to CC list above), this was
> changed.

oh, boy, never ending circle of fixing a bug by introduce another one.
 No wonder why number of kernel bugs never goes down because everyone
is running in circles.

I see Christoph's point that when two threads live in two disjoint
cpusets, they can affect each other's memory policy and cause
undesired oom behavior.

However, mbind shouldn't create discrepancy between what is allowed
and what is promised, especially with MPOL_BIND policy.  Since a
numa-aware app has already gone such a detail to request memory
placement on a specific nodemask, they fully expect memory to be
placed there for performance reason.  If kernel lies about it, we get
very unpleasant performance issue.

I suppose neither behavior is correct nor desired.  What if we "OR"
all the nodemask for all threads in a process group and use that
nodemask to check against what is being requested, is that reasonable?

- Ken

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] check cpuset mems_allowed for sys_mbind
  2007-05-10  0:47   ` Ken Chen
@ 2007-05-10  0:55     ` Christoph Lameter
  2007-05-10  1:26       ` Ken Chen
  2007-05-10 18:32       ` Ken Chen
  0 siblings, 2 replies; 9+ messages in thread
From: Christoph Lameter @ 2007-05-10  0:55 UTC (permalink / raw)
  To: Ken Chen; +Cc: Paul Jackson, akpm, linux-mm

On Wed, 9 May 2007, Ken Chen wrote:

> > Looking back through the version history of mm/mempolicy.c, I see that
> > we used to check the cpuset (by calling contextualize_policy), but then
> > with the following patch (Christoph added to CC list above), this was
> > changed.
> 
> oh, boy, never ending circle of fixing a bug by introduce another one.
> No wonder why number of kernel bugs never goes down because everyone
> is running in circles.
> 
> I see Christoph's point that when two threads live in two disjoint
> cpusets, they can affect each other's memory policy and cause
> undesired oom behavior.

s/undesired/unexpected/

> However, mbind shouldn't create discrepancy between what is allowed
> and what is promised, especially with MPOL_BIND policy.  Since a
> numa-aware app has already gone such a detail to request memory
> placement on a specific nodemask, they fully expect memory to be
> placed there for performance reason.  If kernel lies about it, we get
> very unpleasant performance issue.

How does the kernel lie? The memory is placed given the current cpuset and 
memory policy restrictions.
 
> I suppose neither behavior is correct nor desired.  What if we "OR"
> all the nodemask for all threads in a process group and use that
> nodemask to check against what is being requested, is that reasonable?

Pretty serious hackery there.

I think there is a rather large portioin of people rather frustated with 
the inconsistencies in the memory policy layer. If you can come up with 
some broader scheme to fix this then we would all be happier.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] check cpuset mems_allowed for sys_mbind
  2007-05-10  0:55     ` Christoph Lameter
@ 2007-05-10  1:26       ` Ken Chen
  2007-05-10  1:44         ` Christoph Lameter
  2007-05-10 18:32       ` Ken Chen
  1 sibling, 1 reply; 9+ messages in thread
From: Ken Chen @ 2007-05-10  1:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, akpm, linux-mm

On 5/9/07, Christoph Lameter <clameter@sgi.com> wrote:
> > However, mbind shouldn't create discrepancy between what is allowed
> > and what is promised, especially with MPOL_BIND policy.  Since a
> > numa-aware app has already gone such a detail to request memory
> > placement on a specific nodemask, they fully expect memory to be
> > placed there for performance reason.  If kernel lies about it, we get
> > very unpleasant performance issue.
>
> How does the kernel lie? The memory is placed given the current cpuset and
> memory policy restrictions.

sys_mbind lies.  A task in cpuset that has mems=0-7, it can do
sys_mbind(MPOL_BIND, 0x100, ...) and such call will return success.
The app fully rely on memory allocation occurs on node 8, but that
obviously can't happen because of cpuset.  Everything goes downhill
from this point on.  Granted, app shouldn't call with such nodemask,
but the fun starts with mbind being incompatible with cpuset (it
checks against global node_online_map which includes a mask of entire
system).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] check cpuset mems_allowed for sys_mbind
  2007-05-10  1:26       ` Ken Chen
@ 2007-05-10  1:44         ` Christoph Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2007-05-10  1:44 UTC (permalink / raw)
  To: Ken Chen; +Cc: Paul Jackson, akpm, linux-mm

On Wed, 9 May 2007, Ken Chen wrote:

> On 5/9/07, Christoph Lameter <clameter@sgi.com> wrote:
> > > However, mbind shouldn't create discrepancy between what is allowed
> > > and what is promised, especially with MPOL_BIND policy.  Since a
> > > numa-aware app has already gone such a detail to request memory
> > > placement on a specific nodemask, they fully expect memory to be
> > > placed there for performance reason.  If kernel lies about it, we get
> > > very unpleasant performance issue.
> > 
> > How does the kernel lie? The memory is placed given the current cpuset and
> > memory policy restrictions.
> 
> sys_mbind lies.  A task in cpuset that has mems=0-7, it can do
> sys_mbind(MPOL_BIND, 0x100, ...) and such call will return success.

I thought we assume that people know what they are doing if they run such 
NUMA applications?

I do not think there is an easy way out given the current way of managing 
memory policies and allocation constraints.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] check cpuset mems_allowed for sys_mbind
  2007-05-10  0:55     ` Christoph Lameter
  2007-05-10  1:26       ` Ken Chen
@ 2007-05-10 18:32       ` Ken Chen
  2007-05-10 18:41         ` Christoph Lameter
  1 sibling, 1 reply; 9+ messages in thread
From: Ken Chen @ 2007-05-10 18:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, akpm, linux-mm

On 5/9/07, Christoph Lameter <clameter@sgi.com> wrote:
> > However, mbind shouldn't create discrepancy between what is allowed
> > and what is promised, especially with MPOL_BIND policy.  Since a
> > numa-aware app has already gone such a detail to request memory
> > placement on a specific nodemask, they fully expect memory to be
> > placed there for performance reason.  If kernel lies about it, we get
> > very unpleasant performance issue.
>
> How does the kernel lie? The memory is placed given the current cpuset and
> memory policy restrictions.

I wish Christoph whack me a little bit harder ;-)  He already fixed
the darn thing 4 month ago.  And I should've set more rigorous habit
of cross check/test somewhat non-ancient kernel tree.  We are indeed
already restrict nodemask to current mems_allowed.

- Ken


commit 30150f8d7b76f25b1127a5079528b7a17307f995
Author: Christoph Lameter <clameter@sgi.com>
Date:   Mon Jan 22 20:40:45 2007 -0800

    [PATCH] mbind: restrict nodes to the currently allowed cpuset

    Currently one can specify an arbitrary node mask to mbind that includes
    nodes not allowed.  If that is done with an interleave policy then we will
    go around all the nodes.  Those outside of the currently allowed cpuset
    will be redirected to the border nodes.  Interleave will then create
    imbalances at the borders of the cpuset.

    This patch restricts the nodes to the currently allowed cpuset.

    The RFC for this patch was discussed at
    http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2

    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Paul Jackson <pj@sgi.com>
    Cc: Andi Kleen <ak@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da94639..c2aec0e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -884,6 +884,10 @@ asmlinkage long sys_mbind(unsigned long
        err = get_nodes(&nodes, nmask, maxnode);
        if (err)
                return err;
+#ifdef CONFIG_CPUSETS
+       /* Restrict the nodes to the allowed nodes in the cpuset */
+       nodes_and(nodes, nodes, current->mems_allowed);
+#endif
        return do_mbind(start, len, mode, &nodes, flags);
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] check cpuset mems_allowed for sys_mbind
  2007-05-10 18:32       ` Ken Chen
@ 2007-05-10 18:41         ` Christoph Lameter
  2007-05-10 19:30           ` Ken Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2007-05-10 18:41 UTC (permalink / raw)
  To: Ken Chen; +Cc: Paul Jackson, akpm, linux-mm

On Thu, 10 May 2007, Ken Chen wrote:

> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da94639..c2aec0e 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -884,6 +884,10 @@ asmlinkage long sys_mbind(unsigned long
>        err = get_nodes(&nodes, nmask, maxnode);
>        if (err)
>                return err;
> +#ifdef CONFIG_CPUSETS
> +       /* Restrict the nodes to the allowed nodes in the cpuset */
> +       nodes_and(nodes, nodes, current->mems_allowed);
> +#endif
>        return do_mbind(start, len, mode, &nodes, flags);

Did I screw up whitespace there?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] check cpuset mems_allowed for sys_mbind
  2007-05-10 18:41         ` Christoph Lameter
@ 2007-05-10 19:30           ` Ken Chen
  0 siblings, 0 replies; 9+ messages in thread
From: Ken Chen @ 2007-05-10 19:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, akpm, linux-mm

On 5/10/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 10 May 2007, Ken Chen wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da94639..c2aec0e 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -884,6 +884,10 @@ asmlinkage long sys_mbind(unsigned long
> >        err = get_nodes(&nodes, nmask, maxnode);
> >        if (err)
> >                return err;
> > +#ifdef CONFIG_CPUSETS
> > +       /* Restrict the nodes to the allowed nodes in the cpuset */
> > +       nodes_and(nodes, nodes, current->mems_allowed);
> > +#endif
> >        return do_mbind(start, len, mode, &nodes, flags);
>
> Did I screw up whitespace there?

No, the patch was fine.  I copy'n paste from xterm which turns all tab
into space.  It's crappy xterm stuff.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2007-05-10 19:30 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-09 23:11 [patch] check cpuset mems_allowed for sys_mbind Ken Chen
2007-05-09 23:48 ` Paul Jackson
2007-05-10  0:47   ` Ken Chen
2007-05-10  0:55     ` Christoph Lameter
2007-05-10  1:26       ` Ken Chen
2007-05-10  1:44         ` Christoph Lameter
2007-05-10 18:32       ` Ken Chen
2007-05-10 18:41         ` Christoph Lameter
2007-05-10 19:30           ` Ken Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox