linux-mm.kvack.org archive mirror
* NUMA policy interface
       [not found]                   ` <20050804170803.GB8266@wotan.suse.de>
@ 2005-08-04 17:34                     ` Christoph Lameter
  2005-08-04 21:14                       ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2005-08-04 17:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Paul Jackson, linux-kernel, linux-mm

On Thu, 4 Aug 2005, Andi Kleen wrote:

> > So your point of view is that there will be no control and monitoring of 
> > the memory usage and policies?
> 
> External control is implemented for named objects and for process policy.
> A process can also monitor its own policies if it wants.

Named objects like files, as opposed to processes and/or threads? But then these 
named objects do not have memory allocated to them.

> I think the payoff for external monitoring of policies vs. the complexity 
> and cleanliness of the interface and the long term code impact is too poor 
> to make it an attractive option.

Well, the implementation has the following issues right now:

1. BIND policy implemented in a way that fills up nodes from the lowest
   to the highest instead of allocating memory on the local node.

2. No separation between sys_ and do_ functions. Therefore difficult
   to use from kernel context (see the sketch after this list).

3. Functions have weird side effects (e.g. get_nodes updating
   and using cpuset policies). Code is therefore difficult
   to maintain.

4. Uses bitmaps instead of nodemask_t.

5. No means to figure out where the memory was allocated, although
   mempolicy.c implements scans over ptes that would allow that
   determination.

6. Needs a hook into the page migration layer to move pages either to
   conform to policy or to move them manually.
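
To illustrate point 2: a minimal sketch of the usual sys_/do_ split.
do_mbind() as shown here is hypothetical -- only the sys_mbind() entry
point exists today -- so this is the proposed shape, not current code:

#include <linux/nodemask.h>
#include <linux/syscalls.h>
#include <asm/uaccess.h>

/* Hypothetical do_ helper: takes kernel-space arguments only, so
 * in-kernel callers could use it directly, without set_fs() games. */
static long do_mbind(unsigned long start, unsigned long len,
                     unsigned long mode, nodemask_t *nodes,
                     unsigned long flags)
{
        /* ... validate the range, build the mempolicy, apply it ... */
        return 0;
}

asmlinkage long sys_mbind(unsigned long start, unsigned long len,
                          unsigned long mode, unsigned long __user *nmask,
                          unsigned long maxnode, unsigned flags)
{
        nodemask_t nodes;

        if (maxnode > MAX_NUMNODES)     /* sketch; real code needs more checks */
                return -EINVAL;
        nodes_clear(nodes);
        if (copy_from_user(nodes_addr(nodes), nmask,
                           BITS_TO_LONGS(maxnode) * sizeof(long)))
                return -EFAULT;
        return do_mbind(start, len, mode, &nodes, flags);
}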

The long term impact of this missing functionality is already showing 
in the number of workarounds that I have seen at various sites.

The code is currently complex and difficult to handle because of some of the 
issues mentioned above. We need to fix this in order to have clean code 
and in order to control future complexity.

* Re: NUMA policy interface
  2005-08-04 17:34                     ` NUMA policy interface Christoph Lameter
@ 2005-08-04 21:14                       ` Andi Kleen
  2005-08-04 21:21                         ` Christoph Lameter
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2005-08-04 21:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, linux-kernel, linux-mm

> 1. BIND policy implemented in a way that fills up nodes from the lowest 
>    to the highest instead of allocating memory on the local node.

Hmm, there was a patch from PJ for that at some point. Not sure why it 
was not merged. IIRC the first implementation was too complex, but
there was a second, reasonable one.

> 
> 2. No separation between sys_ and do_ functions. Therefore difficult
>    to use from kernel context.

set_fs(KERNEL_DS)
Some policies can even be set without that.

There are already kernel users, BTW, that prove you wrong.
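
For reference, the idiom is something like this (a sketch, error handling
trimmed; the cast is only legitimate while KERNEL_DS is in force):

#include <linux/mempolicy.h>
#include <linux/nodemask.h>
#include <linux/syscalls.h>
#include <asm/uaccess.h>

/* Sketch: temporarily widen the address limit so a syscall entry
 * point accepts kernel pointers. */
static long interleave_over_all_nodes(void)
{
        mm_segment_t old_fs = get_fs();
        long err;

        set_fs(KERNEL_DS);
        err = sys_set_mempolicy(MPOL_INTERLEAVE,
                        (unsigned long __user *)nodes_addr(node_online_map),
                        MAX_NUMNODES);
        set_fs(old_fs);
        return err;
}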


> 3. Functions have weird side effects (e.g. get_nodes updating 
>    and using cpuset policies). Code is therefore difficult 
>    to maintain.

Agreed that should be cleaned up.

> 4. Uses bitmaps instead of nodemask_t.

Should be easy to fix if someone is motivated.  When I wrote the code,
nodemask_t didn't exist yet, and when it was merged the code wasn't 
converted over. Not a big deal.
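
The conversion itself would be mostly mechanical, roughly (a sketch):

#include <linux/nodemask.h>

/* Sketch: typed node masks instead of raw bitmap calls. */
static int first_allowed_node(void)
{
        nodemask_t allowed;

        nodes_clear(allowed);           /* was: bitmap_zero(bits, MAX_NUMNODES) */
        node_set(0, allowed);           /* was: set_bit(0, bits) */
        nodes_and(allowed, allowed, node_online_map);
        return first_node(allowed);     /* was: find_first_bit(bits, ...) */
}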

> 
> 5. No means to figure out where the memory was allocated, although
>    mempolicy.c implements scans over ptes that would allow that 
>    determination.

You lost me here.

>  
> 6. Needs a hook into the page migration layer to move pages either to
>    conform to policy or to move them manually.

Does it really? So far the feedback from all the users I have talked to is
that they only use a small subset of the functionality; even what is there
is too complex. Nobody with a real app so far has asked me for page migration.

There was one implementation of simple page migration in Steve L.'s patches,
but that was just because it was too hard to handle one corner case
otherwise.


> The long term impact of this missing functionality is already showing 
> in the number of workarounds that I have seen at various sites.

Examples? 

-Andi


* Re: NUMA policy interface
  2005-08-04 21:14                       ` Andi Kleen
@ 2005-08-04 21:21                         ` Christoph Lameter
  2005-08-04 21:41                           ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2005-08-04 21:21 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Paul Jackson, linux-kernel, linux-mm

On Thu, 4 Aug 2005, Andi Kleen wrote:

> > 1. BIND policy implemented in a way that fills up nodes from the lowest 
> >    to the highest instead of allocating memory on the local node.
> 
> Hmm, there was a patch from PJ for that at some point. Not sure why it 
> was not merged. IIRC the first implementation was too complex, but
> there was a second, reasonable one.

Yes, he mentioned that patch earlier in this thread.

> > 5. No means to figure out where the memory was allocated, although
> >    mempolicy.c implements scans over ptes that would allow that 
> >    determination.
> 
> You lost me here.

There is this scan over the page table that verifies that all pages are 
allocated according to the policy. That scan could easily be used to 
provide a map to the application (and to /proc/<pid>/smap) of where the
memory was allocated.
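
Roughly like this (a sketch with a hypothetical helper, not the actual
mempolicy.c scan; the caller is assumed to hold mmap_sem, and
page_to_nid() is assumed here -- older trees spell it via page_zone()):

#include <linux/mm.h>

/* Sketch: count resident pages per node for one vma.  follow_page()
 * in this era also wants page_table_lock for a race-free answer. */
static void count_pages_per_node(struct vm_area_struct *vma,
                                 unsigned long *counts)
{
        unsigned long addr;

        for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
                struct page *page = follow_page(vma->vm_mm, addr, 0);

                if (page)
                        counts[page_to_nid(page)]++;
        }
}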
 
> > 6. Needs a hook into the page migration layer to move pages either to
> >    conform to policy or to move them manually.
> 
> Does it really? So far the feedback from all the users I have talked to is
> that they only use a small subset of the functionality; even what is there
> is too complex. Nobody with a real app so far has asked me for page migration.

Maybe we have different customers. My feedback is consistently that this 
is a very urgently needed feature.
 
> There was one implementation of simple page migration in Steve L.'s patches,
> but that was just because it was too hard to handle one corner case
> otherwise.

There is a page migration implementation in the hotplug patchset.

> > The long term impact of this missing functionality is already showing 
> > in the number of workarounds that I have seen at various sites.
> 
> Examples? 

Two of the high profile ones are NASA and APA. One person from the APA 
posted in one of our earlier discussions.

* Re: NUMA policy interface
  2005-08-04 21:21                         ` Christoph Lameter
@ 2005-08-04 21:41                           ` Andi Kleen
  2005-08-04 22:19                             ` Christoph Lameter
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2005-08-04 21:41 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Paul Jackson, linux-kernel, linux-mm

On Thu, Aug 04, 2005 at 02:21:09PM -0700, Christoph Lameter wrote:
> Yes, he mentioned that patch earlier in this thread.
> 
> > > 5. No means to figure out where the memory was allocated, although
> > >    mempolicy.c implements scans over ptes that would allow that 
> > >    determination.
> > 
> > You lost me here.
> 
> There is this scan over the page table that verifies that all pages are 
> allocated according to the policy. That scan could easily be used to 
> provide a map to the application (and to /proc/<pid>/smap) of where the

The application can already get it. But it's an ugly feature
that I only used for debugging, and I was actually considering
removing it.

Doing it for external users is a completely different thing though.
I still think those have no business messing with other people's
virtual addresses. In addition I expect it will cause problems
longer term. (Did you ever look at why mmap on /proc/*/mem is not
allowed? It used to be allowed long ago, but it was impossible to make
it work race free, and before that it was always a gaping security
hole.)

> > > The long term impact of this missing functionality is already showing 
> > > in the number of workarounds that I have seen at various sites.
> > 
> > Examples? 
> 
> Two of the high profile ones are NASA and APA. One person from the APA 
> posted in one of our earlier discussions.

Ok. I think for those the per-process swapoff is the right solution,
because it is the simplest and easiest one. No complex patch sets needed,
just some changes to an existing code path.

If they cannot afford enough disk space it might be possible
to do the page migration in swap cache like Hugh proposed.

-Andi


* Re: NUMA policy interface
  2005-08-04 21:41                           ` Andi Kleen
@ 2005-08-04 22:19                             ` Christoph Lameter
  2005-08-04 22:44                               ` Mike Kravetz
  2005-08-04 23:40                               ` Andi Kleen
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Lameter @ 2005-08-04 22:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Paul Jackson, linux-kernel, linux-mm

On Thu, 4 Aug 2005, Andi Kleen wrote:

> > There is this scan over the page table that verifies that all pages are 
> > allocated according to the policy. That scan could easily be used to 
> > provide a map to the application (and to /proc/<pid>/smap) of where the
> 
> The application can already get it. But it's an ugly feature
> that I only used for debugging, and I was actually considering
> removing it.
> 
> Doing it for external users is a completely different thing though.
> I still think those have no business messing with other people's
> virtual addresses. In addition I expect it will cause problems
> longer term. (Did you ever look at why mmap on /proc/*/mem is not
> allowed? It used to be allowed long ago, but it was impossible to make
> it work race free, and before that it was always a gaping security
> hole.)

The proc stuff is fake anyway. I would not worry about that. The biggest 
worry is the locking mechanism needed to make this clean.

There are three possibilities:

1. Do what cpusets is doing: versioning.

2. Have the task notifier access the task_struct information.
See http://lwn.net/Articles/145232/ "A new path to the refrigerator"

3. Maybe the easiest: require mmap_sem to be taken for all policy 
accesses. Currently it is only required for vma policies. Then we need
to make a copy of the policy at some point so that alloc_pages can
access policy information lock free. This may also allow us to fix
the BIND issue if we, for example, keep a bitmap in the task_struct or (ab)use 
the cpusets map.
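
The locking shape for option 3 would be roughly this (a sketch that
ignores mempolicy refcounting and freeing of the old policy):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/mempolicy.h>

/* Sketch: writers install a new policy under mmap_sem; the hot
 * allocation path keeps using the copy it already loaded, so it
 * stays lock free. */
static void install_task_policy(struct task_struct *task,
                                struct mempolicy *new)
{
        down_write(&task->mm->mmap_sem);
        task->mempolicy = new;          /* old policy leaks in this sketch */
        up_write(&task->mm->mmap_sem);
}

static struct mempolicy *read_task_policy(struct task_struct *task)
{
        struct mempolicy *pol;

        down_read(&task->mm->mmap_sem);
        pol = task->mempolicy;          /* real code must take a reference
                                           before dropping the lock */
        up_read(&task->mm->mmap_sem);
        return pol;
}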

> If they cannot afford enough disk space it might be possible
> to do the page migration in swap cache like Hugh proposed.

This code already exists in the memory hotplug code base, and Ray already 
had a working implementation of page migration. The migration code will 
also be necessary in order to relocate pages with ECC single-bit failures 
that Russ is working on (of course that will only work for some pages) and
for Mel Gorman's defragmentation approach (if we ever get the split into 
different types of memory chunks in).

* Re: NUMA policy interface
  2005-08-04 22:19                             ` Christoph Lameter
@ 2005-08-04 22:44                               ` Mike Kravetz
  2005-08-04 23:40                               ` Andi Kleen
  1 sibling, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2005-08-04 22:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Paul Jackson, linux-kernel, linux-mm

On Thu, Aug 04, 2005 at 03:19:52PM -0700, Christoph Lameter wrote:
> This code already exists in the memory hotplug code base, and Ray already 
> had a working implementation of page migration. The migration code will 
> also be necessary in order to relocate pages with ECC single-bit failures 
> that Russ is working on (of course that will only work for some pages) and
> for Mel Gorman's defragmentation approach (if we ever get the split into 
> different types of memory chunks in).

Yup, we need page migration for memory hotplug.  However, for hotplug
we are not too concerned about where the pages are migrated to.  Our
primary concern is to move them out of the block/section that we want
to offline.  I suspect this is the same for pages with ECC single-bit
failures.  In fact, this is one possible use of the hotplug code:
notice a failure, migrate all pages off the containing DIMM, offline
the section corresponding to the DIMM, replace the DIMM, and online
that section again.  Of course, your hardware needs to be able to do
this.
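
In pseudo-C the sequence is roughly this (every helper name here is
hypothetical -- the hotplug patchset spells them differently):

/* Sketch of the ECC-failure recovery sequence; only the ordering
 * matters, the helpers are made up. */
static int replace_failing_dimm(unsigned long section_nr)
{
        int err;

        err = migrate_pages_out_of_section(section_nr); /* hypothetical */
        if (err)
                return err;
        err = offline_memory_section(section_nr);       /* hypothetical */
        if (err)
                return err;
        /* the operator physically swaps the DIMM here */
        return online_memory_section(section_nr);       /* hypothetical */
}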

-- 
Mike

* Re: NUMA policy interface
  2005-08-04 22:19                             ` Christoph Lameter
  2005-08-04 22:44                               ` Mike Kravetz
@ 2005-08-04 23:40                               ` Andi Kleen
  2005-08-04 23:49                                 ` Christoph Lameter
  1 sibling, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2005-08-04 23:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Paul Jackson, linux-kernel, linux-mm

On Thu, Aug 04, 2005 at 03:19:52PM -0700, Christoph Lameter wrote:
> There are three possibilities:
> 
> 1. Do what cpusets is doing: versioning.
> 
> 2. Have the task notifier access the task_struct information.
> See http://lwn.net/Articles/145232/ "A new path to the refrigerator"
> 
> 3. Maybe the easiest: require mmap_sem to be taken for all policy 
> accesses. Currently it is only required for vma policies. Then we need
> to make a copy of the policy at some point so that alloc_pages can
> access policy information lock free. This may also allow us to fix
> the BIND issue if we, for example, keep a bitmap in the task_struct or (ab)use 
> the cpusets map.

None of them seem very attractive to me.  I would prefer to just
not support external accesses, keeping things lean and fast.


> > If they cannot afford enough disk space it might be possible
> > to do the page migration in swap cache like Hugh proposed.
> 
> This code already exists in the memory hotplug code base, and Ray already 
> had a working implementation of page migration. The migration code will 
> also be necessary in order to relocate pages with ECC single-bit failures 
> that Russ is working on (of course that will only work for some pages) and
> for Mel Gorman's defragmentation approach (if we ever get the split into 
> different types of memory chunks in).

Individual physical page migration is quite different from
address space migration.

-Andi

* Re: NUMA policy interface
  2005-08-04 23:40                               ` Andi Kleen
@ 2005-08-04 23:49                                 ` Christoph Lameter
  2005-08-05  9:16                                   ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2005-08-04 23:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Paul Jackson, linux-kernel, linux-mm

On Fri, 5 Aug 2005, Andi Kleen wrote:

> None of them seem very attractive to me.  I would prefer to just
> not support external accesses, keeping things lean and fast.

That is a surprising statement given what we just discussed. Things 
are not lean and fast but weirdly screwed up. The policy layer has been 
significantly shaped by historical contingencies rather than designed in 
a clean way. It cannot even deliver the functionality it was designed to 
deliver (see BIND).

> Individual physical page migration is quite different from
> address space migration.

Address space migration? That is something new in this discussion. So 
could you explain what you mean by that? I have looked at page migration 
in a variety of contexts and could not see much difference.

* Re: NUMA policy interface
  2005-08-04 23:49                                 ` Christoph Lameter
@ 2005-08-05  9:16                                   ` Andi Kleen
  2005-08-05 14:52                                     ` Christoph Lameter
  2005-08-05 14:58                                     ` Christoph Lameter
  0 siblings, 2 replies; 11+ messages in thread
From: Andi Kleen @ 2005-08-05  9:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Paul Jackson, linux-kernel, linux-mm

On Thu, Aug 04, 2005 at 04:49:33PM -0700, Christoph Lameter wrote:
> On Fri, 5 Aug 2005, Andi Kleen wrote:
> 
> > None of them seem very attractive to me.  I would prefer to just
> > not support external accesses, keeping things lean and fast.
> 
> That is a surprising statement given what we just discussed. Things 
> are not lean and fast but weirdly screwed up. The policy layer has been 
> significantly shaped by historical contingencies rather than designed in 
> a clean way. It cannot even deliver the functionality it was designed to 
> deliver (see BIND).

That seems like an unfair description to me. While things are not
perfect, they are definitely not as bad as you're trying to paint them.

> 
> > Individual physical page migration is quite different from
> > address space migration.
> 
> Address space migration? That is something new in this discussion. So 
> could you explain what you mean by that? I have looked at page migration 
> in a variety of contexts and could not see much difference.

MCE page migration just moves a physical page somewhere else.
Memory hotplug migration does the same for multiple pages from
different processes.

Page migration like you're asking for migrates whole processes.

-Andi


* Re: NUMA policy interface
  2005-08-05  9:16                                   ` Andi Kleen
@ 2005-08-05 14:52                                     ` Christoph Lameter
  2005-08-05 14:58                                     ` Christoph Lameter
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2005-08-05 14:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Paul Jackson, linux-kernel, linux-mm

On Fri, 5 Aug 2005, Andi Kleen wrote:

> > Address space migration? That is something new in this discussion. So 
> > could you explain what you mean by that? I have looked at page migration 
> > in a variety of contexts and could not see much difference.
> 
> MCE page migration just moves a physical page somewhere else.
> Memory hotplug migration does the same for multiple pages from
> different processes.
> 
> Page migration like you're asking for migrates whole processes.

No, I am asking for the migration of parts of a process. Hotplug migration 
and MCE page migration do the same.



* Re: NUMA policy interface
  2005-08-05  9:16                                   ` Andi Kleen
  2005-08-05 14:52                                     ` Christoph Lameter
@ 2005-08-05 14:58                                     ` Christoph Lameter
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2005-08-05 14:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Paul Jackson, linux-kernel, linux-mm

On Fri, 5 Aug 2005, Andi Kleen wrote:

> > a clean way. It cannot even deliver the functionality it was designed to 
> > deliver (see BIND).
> 
> That seems like an unfair description to me. While things are not
> perfect, they are definitely not as bad as you're trying to paint them.

Sorry, this went too far in the heat of the discussion. But 
the BIND functionality is truly not where it's supposed to be.
