* manual page migration -- issue list
@ 2005-02-15 23:52 Ray Bryant
From: Ray Bryant @ 2005-02-15 23:52 UTC (permalink / raw)
To: linux-mm
Cc: Paul Jackson, Robin Holt, Andi Kleen, Dave Hansen,
Marcello Tosatti, Steve Longerbeam, Peter Chubb
I've been asked to repost this to the list with cc:'s to the interested
parties. The content below is the same as before; it's the email headers
that are more interesting. :-) (I hope I didn't miss anyone.)
==============================REPOSTING================================
The following is an attempt to summarize the issues that have
been raised thus far in this discussion. I'm hoping that this
list can help us resolve the issues in a (somewhat) organized
manner:
(1) Should there be a new API or should/can the migration functionality
be folded under the NUMA API?
(2) If we decide to make a new API, then what parameters should
that system call take? Proposals have been made for all of
the following:
-- pid, va_start, va_end, count, old_nodes, new_nodes
-- pid, va_start, va_end, old_node_mask, new_node_mask
-- pid, va_start, va_end, old_node, new_node
-- the same variations as above without the va_start/va_end
arguments (see the sketch of these candidate prototypes after
this list)
(3) If we make a new API, then how does that new API interact with the
NUMA API?
-- e.g., what happens when we migrate a VMA that has a mempolicy
associated with it?
(4) If we make a new API, how does it interact with the rest
of the VM system? For example, when we migrate part of a VMA,
do we split the VMA or not? (See also (5) below, since if we
decide that the migration interface needs to be able to migrate
processes without stopping them, the whole concept of talking
about such ephemeral things as VMAs becomes pointless.)
(5) How general a migration model are we supporting?
-- migration where old and new set of nodes might not be disjoint
-- migration of general processes (without suspension) or just
of suspended processes
-- how general a migration model is necessary to attract enough
users (more than SGI, say) to improve the chances of getting
the facility merged into the kernel?
(6) How do we determine which VMAs to migrate? (Subquestion: is
this done by the kernel or in user space?)
-- original idea: reference counts in /proc/pid/maps
-- newer idea: exclusion lists either set by marking the
file in some special way or by an explicit list
-- if we mark files as non-migratable, where is this information
stored?
(7) How does the migration API (in whatever form it takes) interact
with cpusets?
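For reference, the proposals under issue (2) map to roughly the
following prototypes. These are hypothetical sketches for discussion
only -- none of these calls exists, and all names and types here are
illustrative, not proposed kernel code:

    #include <sys/types.h>

    /* pid, va_start, va_end, count, old_nodes, new_nodes */
    long sys_page_migrate(pid_t pid, unsigned long va_start,
                          unsigned long va_end, int count,
                          const int *old_nodes, const int *new_nodes);

    /* pid, va_start, va_end, old_node_mask, new_node_mask
     * (mask form, in the style of mbind(2)'s nodemask argument) */
    long sys_page_migrate_mask(pid_t pid, unsigned long va_start,
                               unsigned long va_end,
                               const unsigned long *old_node_mask,
                               const unsigned long *new_node_mask,
                               unsigned long maxnode);

    /* pid, va_start, va_end, old_node, new_node */
    long sys_page_migrate_pair(pid_t pid, unsigned long va_start,
                               unsigned long va_end,
                               int old_node, int new_node);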
So first off, is this the complete list of issues? Can anyone suggest
an issue that isn't covered here?
--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 0:09 UTC (permalink / raw)
To: Ray Bryant; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

One minor issue not mentioned:

Is it more typical to pass the address range as a start and end
address, or as a start address and a length?

I suspect the latter.

Though, by the time we are done with all this, who knows whether that
minor issue will even apply any more.
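Both conventions appear in existing kernel interfaces; madvise(2) and
mbind(2), for example, take a start address and a length. A minimal
sketch contrasting the two candidate shapes (hypothetical prototypes,
not real calls):

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical prototypes, shown only to contrast the two
     * address-range conventions under discussion. */
    long migrate_range_end(pid_t pid, unsigned long va_start,
                           unsigned long va_end,
                           int old_node, int new_node);

    long migrate_range_len(pid_t pid, unsigned long va_start,
                           size_t length,
                           int old_node, int new_node);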
* Re: manual page migration -- issue list
From: Ray Bryant @ 2005-02-16 0:28 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

Paul Jackson wrote:
> One minor issue not mentioned:
>
> Is it more typical to pass the address range as a start and end
> address, or as a start address and a length?
>
> I suspect the latter.
>
> Though, by the time we are done with all this, who knows whether that
> minor issue will even apply any more.

Added to the list, for now at least.
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 0:51 UTC (permalink / raw)
To: Ray Bryant; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

Robin, replying to pj, from the earlier thread also on lkml:

> On Tue, Feb 15, 2005 at 08:35:29AM -0800, Paul Jackson wrote:
> > What about ...
> >
> >     sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
> >
> > to:
> >
> >     sys_page_migrate(pid, va_start, va_end, old_node, new_node);
> > ...
>
> Migration could be done in most cases and would only fall apart when
> there are overlapping node lists and no nodes available as temp space
> and we are not moving large chunks of data.

Given the <va_start, va_end>, which could be reduced to the granularity
of a single page if need be, there should not be an issue with
overlapping node lists. My "node as temp space" suggestion was insane -
never mind that one.

If worst comes to worst, you can handle any combination of nodes,
overlapping or not, with no extra or temporary copies, just by doing
one page at a time.

So this seems to boil down to whether it makes more sense to move
several nodes' worth of pages to several corresponding nodes in a
single call, or in several calls, roughly one call for each
<old, new> node pair.

The working example I was carrying around in my mind was of a job that
had one thread per cpu, and that had explicitly used some numa policy
(mbind, mempolicy or cpusets) to place memory on the node local to its
cpu.

Earlier today on the lkml thread, Robin described how a typical MPI job
works. It seems to rely on some startup code running in each thread,
carefully touching each page that should be local to that cpu before
any other thread touches said page, and requiring no particular memory
policy facility beyond first touch. Seems to me that the memory
migration requirements here are the same as they were for the example I
had in mind: each task has some currently allocated memory pages in its
address space that are on the local node of that task, and that memory
must stay local to that task after the migration.

Looking at such an MPI job as a whole, there seem to be pages scattered
across several nodes, where the only place it is 'encoded' how to place
them is in the job startup code that first touched each page.

A successful migration must replicate that memory placement, page for
page, just changing the nodes. From that perspective, it makes sense to
think of it as an array of old nodes and a corresponding array of new
nodes, where each page on an old node is to be migrated to the
corresponding new node.

However, since each thread allocated its memory locally, this factors
into N separate migrations, each of one task, one old node, and one new
node. Such a call doesn't migrate all physical pages in the target
task's memory, rather just the pages that are on the specified old
node.

The one thing not trivially covered in such a one task, one node pair
at a time factoring is memory that is placed on a node that is remote
from any of the tasks which map that memory. Let me call this 'remote
placement.' Offhand, I don't know why anyone would do this. If such
were rare, the one task, one node pair at a time factoring can still
migrate it easily enough, so long as it knows to do so, and issue
another system call for the necessary task and remote nodes (old and
new). If such remote placement were used in abundance, the one task,
one node pair at a time factoring would become inefficient. I don't
anticipate that remote placement will be used in abundance.

By the way, what happens if you're doing a move where the to and from
node sets overlap, and the kernel scans in the wrong order, and ends up
trying to put new pages onto a node that is in that overlap, before
pulling the old pages off it, running out of memory on that node?
Perhaps the smarts to avoid that should be in user space ;). This can
be avoided using the one task, one node pair at a time factored API,
because user space can control the order in which memory is migrated,
to avoid temporarily overloading the memory on any one node.

With this, I am now more convinced than I was earlier that passing a
single old node, new node pair, rather than the arrays of old and new
nodes, is just as good (hardly any more system calls in actual usage).
And it is better in one regard: it avoids the risk of the kernel
overloading the memory on some node during the migration if it scans
in the wrong order when doing an overlapped migration.
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 1:17 UTC (permalink / raw)
To: Paul Jackson
Cc: raybry, linux-mm, holt, ak, haveblue, marcello, stevel, peterc

As a straw man, let me push the factored migration call to the extreme,
and propose a call:

    sys_page_migrate(pid, oldnode, newnode)

that moves any physical page in the address space of pid that is
currently located on oldnode to newnode.

Won't this come about as close as we are going to get to replicating
the physical memory layout of a job, if we just call it once for each
task in that job? Oops - make that one call for each node in use by the
job - see the following ...

Earlier I (pj) wrote:
> The one thing not trivially covered in such a one task, one node pair
> at a time factoring is memory that is placed on a node that is remote
> from any of the tasks which map that memory. Let me call this 'remote
> placement.' Offhand, I don't know why anyone would do this.

Well - one case - headless nodes. These are memory-only nodes.

Typically one sys_page_migrate() call will be needed for each such
node, specifying some task in the job that has all the relevant memory
on that node mapped, specifying that (old) node, and specifying which
new node that memory should be migrated to.
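Driven from user space, the straw man amounts to a loop over the job's
nodes. A minimal sketch, with the syscall stubbed out since no such
call exists; the pid and node numbers are assumed values:

    #include <stdio.h>
    #include <sys/types.h>

    /* Hypothetical wrapper for the straw-man call above, stubbed out
     * here; a real wrapper would trap into the kernel. */
    static long sys_page_migrate(pid_t pid, int oldnode, int newnode)
    {
        printf("pid %d: move pages on node %d to node %d\n",
               (int)pid, oldnode, newnode);
        return 0;
    }

    int main(void)
    {
        pid_t pid = 1234;                /* any one task in the job */
        int old_nodes[] = { 4, 5, 6 };   /* nodes the job occupies */
        int new_nodes[] = { 8, 9, 10 };  /* disjoint destination set */
        int i;

        for (i = 0; i < 3; i++)          /* one call per node in use */
            sys_page_migrate(pid, old_nodes[i], new_nodes[i]);
        return 0;
    }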
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 2:01 UTC (permalink / raw)
To: Paul Jackson
Cc: raybry, linux-mm, holt, ak, haveblue, marcello, stevel, peterc

On Tue, Feb 15, 2005 at 05:17:09PM -0800, Paul Jackson wrote:
> As a straw man, let me push the factored migration call to the
> extreme, and propose a call:
>
>     sys_page_migrate(pid, oldnode, newnode)

Go look at the mappings in /proc/<pid>/maps once and you will see how
painful this can make things, especially for applications with shared
mappings. Overlapping nodes with the above will make a complete mess of
your memory placement.

Robin
* Re: manual page migration -- issue list
From: Ray Bryant @ 2005-02-16 4:04 UTC (permalink / raw)
To: Robin Holt; +Cc: Paul Jackson, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin Holt wrote:
> Go look at the mappings in /proc/<pid>/maps once and you will see how
> painful this can make things, especially for applications with shared
> mappings. Overlapping nodes with the above will make a complete mess
> of your memory placement.

So let's address that issue again, since I think that is now the heart
of the matter. Exactly why do we need to support the case where the set
of old nodes and new nodes overlap? I agree it is more general, but if
we drop that, I think we are one step closer to getting agreement as to
what the page migration system call interface should be.

Do we have a case, say from IRIX, of why supporting this kind of
migration is necessary?
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 4:28 UTC (permalink / raw)
To: Ray Bryant; +Cc: holt, linux-mm, ak, haveblue, marcello, stevel, peterc

Ray wrote:
> Exactly why do we need to support the case where the set of old
> nodes and new nodes overlap?

Actually, I think they can overlap, just so long as the set of old
nodes is not identical to the set of new nodes. It's this "perfect
shuffle, in place" that can't be done without the infamous insane
temporary node.

But that's likely beside the point, as I have already adequately
demonstrated that there is some requirement here that Robin knows and I
don't. Yet, anyway.
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 4:24 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> Overlapping nodes with the above will make
> a complete mess of your memory placement.

I agree we don't want to overlap nodes. I don't yet understand why my
simple (simplistic?) version of this system call leads us to overlapped
nodes.
* Re: manual page migration -- issue list
From: Ray Bryant @ 2005-02-16 3:55 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

Paul Jackson wrote:
> As a straw man, let me push the factored migration call to the
> extreme, and propose a call:
>
>     sys_page_migrate(pid, oldnode, newnode)
>
> that moves any physical page in the address space of pid that is
> currently located on oldnode to newnode.
>
> Won't this come about as close as we are going to get to replicating
> the physical memory layout of a job, if we just call it once for each
> task in that job? Oops - make that one call for each node in use by
> the job - see the following ...
>
> Well - one case - headless nodes. These are memory-only nodes.
>
> Typically one sys_page_migrate() call will be needed for each such
> node, specifying some task in the job that has all the relevant memory
> on that node mapped, specifying that (old) node, and specifying which
> new node that memory should be migrated to.

This works provided you get Robin and Jack and all to drop the
requirement that my page migration facility support overlapping sets of
origin and destination nodes. Otherwise, this is a non-starter. So,
let's go back to that one.

Robin, can you provide me with a concrete (not hypothetical) example of
a case where the from and to sets of nodes are overlapping?
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 1:56 UTC (permalink / raw)
To: Paul Jackson
Cc: Ray Bryant, linux-mm, holt, ak, haveblue, marcello, stevel, peterc

On Tue, Feb 15, 2005 at 04:51:06PM -0800, Paul Jackson wrote:
> Earlier today on the lkml thread, Robin described how a typical MPI
> job works. It seems to rely on some startup code running in each
> thread, carefully touching each page that should be local to that cpu
> before any other thread touches said page, and requiring no particular
> memory policy facility beyond first touch. Seems to me that the memory
> migration requirements here are the same as they were for the example
> I had in mind: each task has some currently allocated memory pages in
> its address space that are on the local node of that task, and that
> memory must stay local to that task after the migration.

One important point I probably forgot to make is that there is
typically a very large shared anonymous mapping created before the
initial fork. This will result in many processes sharing the vma
discussed below.

> Looking at such an MPI job as a whole, there seem to be pages
> scattered across several nodes, where the only place it is 'encoded'
> how to place them is in the job startup code that first touched each
> page.
>
> A successful migration must replicate that memory placement, page for
> page, just changing the nodes. From that perspective, it makes sense
> to think of it as an array of old nodes and a corresponding array of
> new nodes, where each page on an old node is to be migrated to the
> corresponding new node.

And given the large single mapping and two arrays corresponding to
old/new nodes, a single call would handle the migration, even with
overlapping regions, in a single call and pass over the ptes.

> However, since each thread allocated its memory locally, this factors
> into N separate migrations, each of one task, one old node, and one
> new node. Such a call doesn't migrate all physical pages in the target
> task's memory, rather just the pages that are on the specified old
> node.

If you do that for each job with the shared mapping and have
overlapping node lists, you end up combining two nodes and not being
able to separate them. Oh sure, we could add in a page flag indicating
that the page is going to be migrated, add a syscall which you call on
the VMA first to set all the flags, and then, as pages are moved with
the one-for-one syscalls, clear the flag. Oh yeah, we also need to add
an additional syscall to clear any flags for pages that did not get
migrated because they were not in the old list at all.

> The one thing not trivially covered in such a one task, one node pair
> at a time factoring is memory that is placed on a node that is remote
> from any of the tasks which map that memory. Let me call this 'remote
> placement.' Offhand, I don't know why anyone would do this. If such
> were rare, the one task, one node pair at a time factoring can still
> migrate it easily enough, so long as it knows to do so, and issue
> another system call for the necessary task and remote nodes (old and
> new). If such remote placement were used in abundance, the one task,
> one node pair at a time factoring would become inefficient. I don't
> anticipate that remote placement will be used in abundance.

Unfortunately it does happen often for stuff like shared file mappings
that a different job is using in conjunction with this job. There are
other considerations as well, such as shared libraries, but we can
minimize that noise in this discussion for the time being.

> By the way, what happens if you're doing a move where the to and from
> node sets overlap, and the kernel scans in the wrong order, and ends
> up trying to put new pages onto a node that is in that overlap, before
> pulling the old pages off it, running out of memory on that node?
> Perhaps the smarts to avoid that should be in user space ;). This can
> be avoided using the one task, one node pair at a time factored API,
> because user space can control the order in which memory is migrated,
> to avoid temporarily overloading the memory on any one node.

Unfortunately, userspace can not avoid this easily, as it does not know
which pages in the virtual address space are on which nodes. It could
do some kludge work and only call for va ranges that are smaller than
the most available memory on any of the destination nodes, but that
might make things sort of hackish. Alternatively, the syscall handler
could do some work to find chunks of memory that are being used by that
node, process that chunk, and then return. Makes stuff ugly, but is a
possibility as well.

> With this, I am now more convinced than I was earlier that passing a
> single old node, new node pair, rather than the arrays of old and new
> nodes, is just as good (hardly any more system calls in actual usage).
> And it is better in one regard: it avoids the risk of the kernel
> overloading the memory on some node during the migration if it scans
> in the wrong order when doing an overlapped migration.

Shared mappings and overlapping regions make the node arrays necessary.
A single old/new pair _DOES_ result in more system calls and therefore
more scans over the ptes. It does result in problems with overlapping
old/new node lists. It does not help with out-of-memory issues. It
accomplishes nothing other than making the syscall interface different.

Thanks,
Robin
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 4:22 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> If you do that for each job with the shared mapping and have
> overlapping node lists, you end up combining two nodes and not being
> able to separate them.

I don't see the problem. Just don't move a task onto a node until you
have moved the one that was already there, if any, off.

Say, for example, you want to move a job from nodes 4, 5 and 6 to nodes
5, 6 and 7, respectively. First move 6 to 7, then 5 to 6, then 4 to 5.
Or save some migration, and just move what's on 4 to 7, leaving 5 and 6
as is.

At any point, either there is at least one new node not currently
occupied by some not yet migrated task, or else you're just reshuffling
a set of tasks on the same set of nodes, which I presume would be
without purpose and so we don't need to support it. If we did need to
support shuffling a job on its current node set, I'd have to plead
insanity and reintroduce the temporary node hack.

> Unfortunately it does happen often for stuff like shared file mappings
> that a different job is using in conjunction with this job.

This might be the essential detail I'm missing. I'm not sure what you
mean here (see P.S. at end), but it seems that you are telling me you
must have the ability to avoid moving parts of a job. That for a given
task, pinned to a given cpu, with various physical pages on the node
local to that cpu, some of those pages must not move, because they are
used in conjunction with some other job that is not being migrated at
this time.

If that's the case, aren't you pretty much guaranteeing the migrated
job will not run as well as before the migration - some of the pages it
was using that were local are now remote?

And if that's the case, I take it you are presuming that the server
process doing the migration has intimate knowledge of the tasks being
migrated, and of the various factors that determine which pages of
those tasks should migrate and which should not. Uggh.

I am working from the idea that you've got some job, running on some
nodes, and that you just want to jack up that job and put it back down
on an isomorphic set of nodes - same number of nodes, same (or at least
sufficient) amount of memory on the nodes, possibly an overlapping set
of nodes, just not the self-same identical set of nodes. I was
presuming that everything in the address spaces of the tasks in the job
should move, and should end up placed the same, relative to the tasks
in the job, as before, just on different node numbers.

Even shared library pages can move - if this job happened to be the one
that paged that portion of the library in, then perhaps this job has
the most use for that page. That, or it's just a popular page left over
from the dawn of time and it doesn't matter much which node holds it.

Perhaps I have the wrong idea here?

> Unfortunately, userspace can not avoid this easily, as it does not
> know which pages in the virtual address space are on which nodes.

Userspace doesn't need to know that. It only needs to know that at
least one node in the set of new nodes is not still occupied by an
unmigrated task in the job. See the example above.

> Oh sure, we could add in ...
> Oh yeah, we also need to add ...
> It could do some kludge work and only call ...

No need to spend too much effort elaborating such additions ... the
mere fact that you find them necessary means that either it's not as
simple as I think, or it's simpler than you think. In other words, one
of us (most likely me) doesn't understand the real requirements here.

P.S. - or perhaps what you're telling me with the bit about shared file
mappings is not that you must not move any such shared file pages, but
that you'd rather not, as there are perhaps many such pages, and the
time spent moving them would be wasted. Are you saying that you want to
move some subset of a job's pages, as an optimization, because for a
large chunk of pages, such as for some files and libraries shared with
other jobs, the expense of migrating them would not be paid back?
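Paul's 4,5,6 -> 5,6,7 example amounts to ordering the per-pair calls so
that a destination node is drained before it is filled. A minimal
user-space sketch of that scheduling, against the same hypothetical
stubbed syscall as before:

    #include <stdio.h>
    #include <sys/types.h>

    /* Hypothetical per-pair syscall, stubbed for illustration. */
    static long sys_page_migrate(pid_t pid, int oldnode, int newnode)
    {
        printf("pid %d: node %d -> node %d\n",
               (int)pid, oldnode, newnode);
        return 0;
    }

    int main(void)
    {
        pid_t pid = 1234;    /* assumed task in the job */
        int i;

        /* Overlapping move {4,5,6} -> {5,6,7}: work from the high end
         * so each destination is emptied before it is filled:
         * 6->7 first, then 5->6, then 4->5. */
        int old_nodes[] = { 6, 5, 4 };
        int new_nodes[] = { 7, 6, 5 };

        for (i = 0; i < 3; i++)
            sys_page_migrate(pid, old_nodes[i], new_nodes[i]);
        return 0;
    }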
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 9:20 UTC (permalink / raw)
To: Paul Jackson
Cc: Robin Holt, raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

On Tue, Feb 15, 2005 at 08:22:14PM -0800, Paul Jackson wrote:
> Robin wrote:
> > If you do that for each job with the shared mapping and have
> > overlapping node lists, you end up combining two nodes and not
> > being able to separate them.
>
> I don't see the problem. Just don't move a task onto a node until you
> have moved the one that was already there, if any, off.
>
> Say, for example, you want to move a job from nodes 4, 5 and 6 to
> nodes 5, 6 and 7, respectively. First move 6 to 7, then 5 to 6, then
> 4 to 5. Or save some migration, and just move what's on 4 to 7,
> leaving 5 and 6 as is.

Moving 4 to 7 will likely change the node-to-node distance for the
processes within that job. You will probably need to do the 6-7, 5-6,
4-5 sequence to keep relative distances the same. Again, the batch
scheduler will tell us whether a simple 4-7 move is possible or whether
we need to shift each.

I should correct my earlier statement. As long as you have a separate
node in the new list that is not in the old, you could accomplish it in
a one-at-a-time fashion. What that would result in is a syscall for
each non-overlapping vma per node. Multiply that by the number of
nodes, with each system call going over that same shared vma.

For the sake of discussion, let's assume this is a 256p job using 128
nodes and a shared message block of 2GB per task. You will have a 512GB
shared mapping which will have some holes punched in it (no single task
will have the entire mapping unscathed). Again, for the sake of
discussion, let's assume that 96% of the shared buffer is intact for
the process we choose to do the initial migration on. Compare the
single-node method to the array method.

Array method:
1) Call the system call with pid, va_start, va_end, 128,
   [2,3,4,5...], [32,33,34,...]. This will scan the page tables _ONCE_
   and migrate the pages to their new destination.
2) Call the system call on a second pid to cover 1/2 of the remaining
   4% of the address space. Again, a single scan over that portion of
   the address space.
3) Call the system call on a third pid to cover the last portion of
   the address space.

With this, we have made 3 system calls and scanned the entire address
range 1 time.

Single-parameter method:
1) For a single pid, call the system call 128 times with pid,
   va_start, va_end, from, to, which scans the 96% chunk 128 times.
2) Repeat 128 times with the second pid.
3) Repeat 128 times with the third pid.

We have now made the system call 384 times and scanned the entire
address range 128 times.

Do you see why I called this insane? This is all because you don't like
to pass in a complex array of integers. That seems like a very small
thing to ask to save 127 scans of a 512GB address space.

I believe that is what I called insane earlier. I reserve the right to
be wrong.

> At any point, either there is at least one new node not currently
> occupied by some not yet migrated task, or else you're just
> reshuffling a set of tasks on the same set of nodes, which I presume
> would be without purpose and so we don't need to support it. If we did
> need to support shuffling a job on its current node set, I'd have to
> plead insanity and reintroduce the temporary node hack.
>
> > Unfortunately it does happen often for stuff like shared file
> > mappings that a different job is using in conjunction with this job.
>
> This might be the essential detail I'm missing. I'm not sure what you
> mean here (see P.S. at end), but it seems that you are telling me you
> must have the ability to avoid moving parts of a job. That for a given
> task, pinned to a given cpu, with various physical pages on the node
> local to that cpu, some of those pages must not move, because they are
> used in conjunction with some other job that is not being migrated at
> this time.

For the simple case, assume a sysV shared memory segment that was
created by a previous job is being used by this one. The memory
placement for the segment will depend entirely on whether the previous
job touched a particular page and where that job ran. It may get
migrated depending upon whether any other jobs anywhere else on the
system are using it and any of the pages are on the job's old node
list. These types of mappings have always given us issues (Irix as well
as Linux) and are difficult to handle.

The one additional nice feature of having an external migration
facility is that we might be able to use this type of thing from a
command line to move the shared memory segment over to nodes that the
job is using. This has just been off-the-cuff thinking lately and
hasn't been fully thought through.

> P.S. - or perhaps what you're telling me with the bit about shared
> file mappings is not that you must not move any such shared file
> pages, but that you'd rather not, as there are perhaps many such
> pages, and the time spent moving them would be wasted. Are you saying
> that you want to move some subset of a job's pages, as an
> optimization, because for a large chunk of pages, such as for some
> files and libraries shared with other jobs, the expense of migrating
> them would not be paid back?

I believe Ray's proposed userland piece would migrate shared libraries
used exclusively by this job. Was that right, Ray?

Here is my real question: how much opposition is there to the array of
integers? This does not seem like a risky interface to me. If there is
not a lot of opposition to the arrays, can we discuss the rest of the
proposal and accept the arrays for the time being? The array can be
addressed once we know that the syscall-for-migrating idea is sound.

Thanks,
Robin
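For concreteness, step (1) of the array method above might look like
the following. A minimal sketch with the syscall stubbed out (no such
call exists; the pid, address range, and node numbering are assumed
from Robin's example, and a 64-bit platform is assumed):

    #include <stdio.h>
    #include <sys/types.h>

    /* Hypothetical array-API wrapper, stubbed for illustration. */
    static long sys_page_migrate(pid_t pid, unsigned long va_start,
                                 unsigned long va_end, int count,
                                 const int *old_nodes,
                                 const int *new_nodes)
    {
        printf("pid %d: %d node pairs, one scan of [%#lx, %#lx)\n",
               (int)pid, count, va_start, va_end);
        return 0;
    }

    int main(void)
    {
        int old_nodes[128], new_nodes[128], i;

        for (i = 0; i < 128; i++) {
            old_nodes[i] = 2 + i;     /* nodes 2..129 (assumed) */
            new_nodes[i] = 32 + i;    /* nodes 32..159 (assumed) */
        }
        /* One call covers the 512GB shared mapping in a single
         * page-table scan, all 128 node pairs at once. */
        sys_page_migrate(1234, 0x2000000000UL, 0xa000000000UL,
                         128, old_nodes, new_nodes);
        return 0;
    }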
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 10:20 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> What that would result in is a syscall for each
> non-overlapping vma per node.

My latest, most radical, proposal did not take an address range. It was
simply:

    sys_page_migrate(pid, oldnode, newnode)

It would be called once per node. In your example, this would be 128
calls. Nothing "for each non-overlapping vma". Just per node.

Until I drove you to near distraction, and you spelled out the details
of an example that migrated 96% of the address space in the first call,
and only needed 3 calls total, I would have presumed that the API:

    sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes)

would have required one call per pid, or 256 calls, for your example.

My method did not look insanely worse to me; indeed it would have
looked better in this example with two tasks per node, since I did one
call per node, and I thought you did one per task.

... However, I see now that you can routinely get by with dramatically
fewer calls than the number of tasks, by noticing what portions of the
typically huge shared address space have already been covered, and not
covering them again.

There is no need to convince me that 384 syscalls and 128 full scans is
insanely worse than 3 syscalls with 1 full scan, and no need to get
frustrated that I cannot see the insanity of it.

However, you might have wanted to allow for the possibility, when you
reduced what you thought I was proposing to insanity, that rather than
my proposing something insane, perhaps we had different numbers ... as
happened here. Your numbers for the array API had 80 times fewer system
calls than I would have expected, and your numbers for the single
parameter call had 3 times _more_ system calls than I had in mind (I
had one call per node, period, not one per node per vma or whatever).

> How much opposition is there to the array of integers?

My opposition to the array was not profound. It needed to provide an
advantage, which I didn't see that it much did.

I now see it provides an advantage, dramatically reducing the number of
system calls and scans in typical cases, to substantially fewer than
either the number of tasks or of nodes.

Ok ... onward. I'll take the node arrays.

The next concern that rises to the top for me was best expressed by
Andi:
> The main reasons for that is that I don't think external
> processes should mess with virtual addresses of another process.
> It just feels unclean and has many drawbacks (parsing /proc/*/maps
> needs complicated user code, racy, locking difficult).
>
> In kernel space handling full VMs is much easier and safer due to
> better locking facilities.

I share Andi's concerns, but I don't see what to do about this. Andi's
recommendations seem to be about memory policies (which guide future
allocations), and not about migration of already allocated physical
pages. So for now at least, his recommendations don't seem like answers
to me.
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 11:30 UTC (permalink / raw)
To: Paul Jackson
Cc: Robin Holt, raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

On Wed, Feb 16, 2005 at 02:20:09AM -0800, Paul Jackson wrote:
> The next concern that rises to the top for me was best expressed by
> Andi:
> > The main reasons for that is that I don't think external
> > processes should mess with virtual addresses of another process.
> > It just feels unclean and has many drawbacks (parsing /proc/*/maps
> > needs complicated user code, racy, locking difficult).
> >
> > In kernel space handling full VMs is much easier and safer due to
> > better locking facilities.
>
> I share Andi's concerns, but I don't see what to do about this. Andi's
> recommendations seem to be about memory policies (which guide future
> allocations), and not about migration of already allocated physical
> pages. So for now at least, his recommendations don't seem like
> answers to me.

If we had the ability to change the vendor-provided software to meet
our needs, that would be wonderful. Unfortunately, most of this type of
code runs on _MANY_ different OSs and architectures. If you could get
the NUMA API into everything from AIX to Windows XP, I think you would
have a very good chance of convincing ISVs to start converting. Until
then, there is no clear win over first touch for their type of
application. With that in mind, we are left with doing things from the
outside in.

Heck, if we could get them to change their code, cpusets would be
irrelevant as well ;)

Thanks,
Robin
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 15:45 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> Until then, there is no clear win over first
> touch for their type of application.

Huh? So what was the point of this rant? <grin>

You seem to explain why first touch is used instead of the Linux 2.6
numa placement calls mbind/mempolicy, in some third party code that
runs on multiple operating systems. But I thought this was the page
migration thread, not the placement policy thread. Now I am as
mystified with your latest comments as I was with Andi's discussion of
using these memory policy calls.

Regardless of what mechanisms we use to guide future allocations to
their proper nodes, how best can we provide a facility to migrate
already allocated physical memory pages to other nodes? That's the
question, or so I thought, on this thread.

To repeat myself ...
> I share Andi's concerns, but I don't see what to do about this.

Perhaps a part of the answer is that we aren't messing with (as in
"changing") the virtual addresses of other processes. The migration
call is only reading these addresses. What it messes with is the
_physical_ addresses ;). Though this proposed call still seems to have
some of the same drawbacks.

One of my motivations for pursuing the no-array version of this call
that you loved so much was that it (my latest variant, anyway) didn't
pass any virtual address ranges in, further simplifying what crossed
the user-kernel boundary and leaving details of parsing the virtual
address layout of tasks strictly to the kernel (no need to read
/proc/*/maps).

But it seems that if we are going to achieve the fairly significant
optimizations you enumerated in your example a few hours ago, we at
least have to parse the /proc/*/maps files.

Hmmm ... wait just a minute ... isn't parsing the maps files in /proc
really scanning the virtual addresses of tasks? In your example of a
few hours ago, which seemed to only require 3 system calls and one full
scan of any task address space, did you read all the /proc/*/maps
files, for all 256 of the tasks involved? I would think you would have
to have done so, or else one of these tasks could be holding onto some
private memory of its own that we would need to migrate.

Are the stack pages and any per-thread private data on pages visible to
all the threads, or are some of these pages private to each thread?
Does anything prevent a thread from having additional private pages
invisible to the other threads?

Could you redo your example, including the scans implied by reading
maps files, and including the system calls needed to do those reads,
and needed to migrate any private pages they might have? Perhaps your
preferred API doesn't have such an insane advantage after all.

I'm fixing soon to consider another variant of this call, one that
takes an _array_ of pids, along with the old and new arrays of nodes,
but takes no virtual address range. The kernel would scan each pid in
the array, migrating anything found on any old node to the
corresponding new node, all in one system call.

If my speculations above are right, this does the minimum of scans, one
per pid, and the minimum number of system calls - one. And it does so
without involving the user space code in racy maps-file reading to
determine what to call (though the kernel code would probably still
have more than its share of races to fuss over).
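The variant Paul floats here might look roughly like this - a
hypothetical prototype only, never merged; the name and types are
illustrative:

    #include <sys/types.h>

    /*
     * Hypothetical: for each pid in pids[0..npids-1], scan that
     * task's address space once, migrating any page found on
     * old_nodes[i] to new_nodes[i] (0 <= i < count). No virtual
     * address ranges cross the user-kernel boundary.
     */
    long page_migrate_pids(const pid_t *pids, int npids,
                           const int *old_nodes, const int *new_nodes,
                           int count);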
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 16:08 UTC (permalink / raw)
To: Paul Jackson
Cc: Robin Holt, raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

On Wed, Feb 16, 2005 at 07:45:50AM -0800, Paul Jackson wrote:
> Hmmm ... wait just a minute ... isn't parsing the maps files in /proc
> really scanning the virtual addresses of tasks? In your example of a
> few hours ago, which seemed to only require 3 system calls and one
> full scan of any task address space, did you read all the
> /proc/*/maps files, for all 256 of the tasks involved? I would think
> you would have to have

Reading /proc/<pid>/maps just scans through the vmas and not the
address space. Very different things!

> Could you redo your example, including the scans implied by reading
> maps files, and including the system calls needed to do those reads,
> and needed to migrate any private pages they might have? Perhaps your
> preferred API doesn't have such an insane advantage after all.

Ray, do you have your userland stuff in anywhere close to presentable
condition? If so, that might be the best for this part of the
discussion.

Robin
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 19:23 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> Reading /proc/<pid>/maps just scans through the vmas and not the
> address space.

Yes - you're right.

So the number of system calls in your example of a few hours ago, using
your preferred array API, if you include the reads of each task's
/proc/<pid>/maps file, is about equal to the number of tasks, right?

And I take it that the user code you asked Ray about looks at these
maps files for each of the tasks to be migrated, identifies each mapped
range of each mapped object (mapped file or whatever), and calculates a
fairly minimum set of tasks and virtual address ranges therein,
sufficient to cover all the mapped objects that should be migrated,
thus minimizing the amount of scanning that needs to be done of
individual pages.

And further I take it that you recommend the above described code [to
find a fairly minimum set of tasks and address ranges to scan that will
cover any page of interest] be put in user space, not in the kernel (a
quite reasonable recommendation).

Why didn't your example have some writable private pages? Wouldn't such
pages be commonplace, and wouldn't they have to be migrated for each
thread, resulting in at least N calls to the new sys_page_migrate()
system call, for N tasks, rather than the 3 calls in your example?
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 19:56 UTC (permalink / raw)
To: Paul Jackson
Cc: Robin Holt, raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

On Wed, Feb 16, 2005 at 11:23:35AM -0800, Paul Jackson wrote:
> So the number of system calls in your example of a few hours ago,
> using your preferred array API, if you include the reads of each
> task's /proc/<pid>/maps file, is about equal to the number of tasks,
> right?
>
> And I take it that the user code you asked Ray about looks at these
> maps files for each of the tasks to be migrated, identifies each
> mapped range of each mapped object (mapped file or whatever), and
> calculates a fairly minimum set of tasks and virtual address ranges
> therein, sufficient to cover all the mapped objects that should be
> migrated, thus minimizing the amount of scanning that needs to be done
> of individual pages.
>
> And further I take it that you recommend the above described code [to
> find a fairly minimum set of tasks and address ranges to scan that
> will cover any page of interest] be put in user space, not in the
> kernel (a quite reasonable recommendation).

I think user space, for a few reasons. The code in the kernel will be
much easier to digest and to ensure is as bug-free as possible. If bugs
are found or issues arise in the portions that are in userland, we are
left with a maximum amount of flexibility to correct the issue without
needing a kernel code change.

In a different direction, if I am a support person trying to figure out
why an application is performing poorly, I can try migrating portions
of the application's address space to a node closer to the cpu and
hopefully see a performance improvement.

> Why didn't your example have some writable private pages? Wouldn't
> such pages be commonplace, and wouldn't they have to be migrated for
> each thread, resulting in at least N calls to the new
> sys_page_migrate() system call, for N tasks, rather than the 3 calls
> in your example?

You are right about everything above. The calls to migrate the private
regions will be small in comparison to the typical large shared
mapping. The real workhorse is always going to be walking the page
tables, and that will take time.

I am advocating a system call which covers the needs and also remains
flexible enough to correct shortcomings in our thinking about all the
possible permutations of user virtual address spaces.

Thanks,
Robin
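A minimal sketch of the userland side being discussed - walking
/proc/<pid>/maps to collect VMA ranges. The "start-end" range format at
the head of each maps line is real; everything else (coalescing ranges
across the job's tasks, honoring an exclusion list) is left out:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Print each VMA range of a task, as read from /proc/<pid>/maps.
     * A real migration tool would coalesce these ranges across the
     * job's tasks and skip mappings on an exclusion list. */
    static int print_vmas(pid_t pid)
    {
        char path[64], line[512];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
        f = fopen(path, "r");
        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            unsigned long start, end;

            if (sscanf(line, "%lx-%lx", &start, &end) == 2)
                printf("vma %#lx-%#lx (%lu KB)\n",
                       start, end, (end - start) >> 10);
        }
        fclose(f);
        return 0;
    }

    int main(void)
    {
        return print_vmas(getpid());    /* demo on ourselves */
    }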
* Re: manual page migration -- issue list
  2005-02-16 10:20           ` Paul Jackson
  2005-02-16 11:30             ` Robin Holt
@ 2005-02-16 23:08             ` Ray Bryant
  1 sibling, 0 replies; 24+ messages in thread
From: Ray Bryant @ 2005-02-16 23:08 UTC (permalink / raw)
To: Paul Jackson; +Cc: Robin Holt, linux-mm, ak, haveblue, marcello, stevel, peterc

Paul Jackson wrote:
> Robin wrote:
>>What that would result in is a syscall for each
>>non-overlapping vma per node.
>
> My latest, most radical, proposal did not take an address range. It was
> simply:
>
>     sys_page_migrate(pid, oldnode, newnode)
>
> It would be called once per node. In your example, this would be 128
> calls. Nothing "for each non-overlapping vma". Just per node.
>
> Until I drove you to near distraction, and you spelled out the details
> of an example that migrated 96% of the address space in the first call,
> and only needed 3 calls total, I would have presumed that the API:
>
>     sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes)
>
> would have required one call per pid, or 256 calls, for your example.
>
> My method did not look insanely worse to me; indeed, it would have looked
> better in this example with two tasks per node, since I did one call per
> node, and I thought you did one per task.
>
> ... However, I see now that you can routinely get by with dramatically
> fewer calls than the number of tasks, by noticing what portions of the
> typically huge shared address space have already been covered, and not
> covering them again.

Right, that was our original plan. So you only had to make as many system
calls as there were address ranges that needed to be migrated, more or
less. This assumes we have stopped the processes and can read and make
sense of /proc/*/maps.

> There is no need to convince me that 384 syscalls and 128 full scans
> are insanely worse than 3 syscalls with 1 full scan, and no need to
> get frustrated that I cannot see the insanity of it.
>
> However, you might have wanted to allow for the possibility, when you
> reduced what you thought I was proposing to insanity, that rather than
> my proposing something insane, perhaps we had different numbers ... as
> happened here. Your numbers for the array API had 80 times fewer system
> calls than I would have expected, and your numbers for the single
> parameter call had 3 times _more_ system calls than I had in mind (I had
> one call per node, period, not one per node per vma or whatever).
>
>>How much opposition is there to the array of integers?
>
> My opposition to the array was not profound. It needed to provide
> an advantage, which I didn't see that it did.
>
> I now see it provides an advantage, dramatically reducing the number of
> system calls and scans in typical cases, to substantially fewer than
> either the number of tasks or the number of nodes.
>
> Ok ... onward. I'll take the node arrays.
>
> The next concern that rises to the top for me was best expressed by Andi:
>
>>The main reason for that is that I don't think external
>>processes should mess with virtual addresses of another process.
>>It just feels unclean and has many drawbacks (parsing /proc/*/maps
>>needs complicated user code, racy, locking difficult).
>>
>>In kernel space handling full VMs is much easier and safer due to better
>>locking facilities.
>
> I share Andi's concerns, but I don't see what to do about this. Andi's
> recommendations seem to be about memory policies (which guide future
> allocations), and not about migration of already allocated physical
> pages. So for now at least, his recommendations don't seem like answers
> to me.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
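For reference, the node-array form that Paul accepts above, written out as
a C prototype. This is only the shape under discussion in this thread --
no such syscall exists in the kernel -- so the types and parameter names
here are illustrative:

	#include <sys/types.h>

	/*
	 * Proposed interface only.  For pages of <pid> in
	 * [va_start, va_end) found on old_nodes[i], one pass over the
	 * page tables moves them to new_nodes[i].
	 */
	long sys_page_migrate(pid_t pid,
			      unsigned long va_start,  /* first address to scan  */
			      unsigned long va_end,    /* one past the last      */
			      int count,               /* entries in both arrays */
			      const int *old_nodes,    /* pages found here ...   */
			      const int *new_nodes);   /* ... are moved here     */

The point of carrying all count node pairs in one call is that the
expensive part -- the page-table walk -- happens once, however long the
node lists are.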
* Re: manual page migration -- issue list
  2005-02-16  9:20         ` Robin Holt
  2005-02-16 10:20           ` Paul Jackson
@ 2005-02-16 23:05         ` Ray Bryant
  2005-02-17  0:28           ` Paul Jackson
  1 sibling, 1 reply; 24+ messages in thread
From: Ray Bryant @ 2005-02-16 23:05 UTC (permalink / raw)
To: Robin Holt; +Cc: Paul Jackson, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin Holt wrote:
> On Tue, Feb 15, 2005 at 08:22:14PM -0800, Paul Jackson wrote:
>
>>Robin wrote:
>>
>>>If you do that for each job with the shared mapping and have overlapping
>>>node lists, you end up combining two nodes and not being able to separate
>>>them.
>>
>>I don't see the problem. Just don't move a task onto a node
>>until you have moved the one that was already there, if any, off.
>>
>>Say, for example, you want to move a job from nodes 4, 5 and 6 to nodes
>>5, 6 and 7, respectively. First move 6 to 7, then 5 to 6, then 4 to 5.
>>Or save some migration, and just move what's on 4 to 7, leaving 5 and
>>6 as is.

The customers I have talked to about this tell me that they never imagine
having a set of old and new nodes overlap. I agree it is more general to
allow this, but resistance to the original system call I proposed appears
to be somewhat stiff.

> Moving 4 to 7 will likely change the node-to-node distance for the
> processes within that job. You will probably need to do the 6-7, 5-6, 4-5
> sequence to keep relative distances the same. Again, the batch scheduler
> will tell us whether a simple 4-7 move is possible or whether we need to
> shift each.
>
> I should correct my earlier add. As long as you have a separate node
> in the new list that is not in the old, you could accomplish it in a
> one-at-a-time fashion. What that would result in is a syscall for each
> non-overlapping vma per node. Multiply that by the number of nodes, with
> each system call going over that same shared vma.
>
> For the sake of discussion, let's assume this is a 256p job using 128 nodes
> and a shared message block of 2GB per task. You will have a 512GB shared
> mapping which will have some holes punched in it (no single task will
> have the entire mapping unscathed). Again, for the sake of discussion,
> let's assume that 96% of the shared buffer is intact for the process we
> choose to do the initial migration on. Compare the single node method
> to the array method.

Would we really ever migrate something that big? I had the same concerns
about large address spaces and the like, but it just seems to me that if
something is that big, we'd leave it alone. :-)

> Array method:
> 1) Call the system call with pid, va_start, va_end, 128, [2,3,4,5...],
>    [32,33,34,...]. This will scan the page tables _ONCE_ and migrate the
>    pages to their new destination.
> 2) Call the system call on the second pid to cover 1/2 of the remaining
>    4% of the address space. Again, a single scan over that portion of
>    the address space.
> 3) Call the system call on the third pid to cover the last portion of
>    the address space.
>
> With this, we have made 3 system calls and scanned the entire address
> range 1 time.
>
> Single parameter method:
> 1) For a single pid, call the system call 128 times with (pid, va_start,
>    va_end, from, to), which scans the 96% chunk 128 times.
> 2) Repeat 128 times with the second pid.
> 3) Repeat 128 times with the third pid.
>
> We have now made the system call 384 times and scanned the entire address
> range 128 times.
>
> Do you see why I called this insane? This is all because you don't like
> to pass in a complex array of integers. That seems like a very small
> thing to ask to save 127 scans of a 512GB address space.

I agree, it sounds like a lot of work. Perhaps we should try this with my
prototype code and see how long it takes. But I really think this is a
contrived example. I don't think anyone would migrate a job that big. To
my way of thinking, the largest job we would ever migrate would be on the
order of 1/8th to 1/4 of the machine. Not 1/2. If it is 1/2 of the
machine, let's just leave the darn thing where it is. :-) (I always try
to let large sleeping dogs lie...)

> I believe that is what I called insane earlier. I reserve the right to
> be wrong.
>
>>At any point, either there is at least one new node not currently
>>occupied by some not yet migrated task, or else you're just reshuffling
>>a set of tasks on the same set of nodes, which I presume would be
>>without purpose and so need not be supported. If we did need to
>>support shuffling a job on its current node set, I'd have to plead
>>insanity, and reintroduce the temporary node hack.
>>
>>>Unfortunately it does happen often for stuff like shared file mappings
>>>that a different job is using in conjunction with this job.
>>
>>This might be the essential detail I'm missing. I'm not sure what you
>>mean here (see P.S., at end), but it seems that you are telling me you
>>must have the ability to avoid moving parts of a job. That for a given
>>task, pinned to a given cpu, with various physical pages on the node
>>local to that cpu, some of those pages must not move, because they are
>>used in conjunction with some other job that is not being migrated at
>>this time.
>
> For the simple case, assume a sysV shared memory segment that was created
> by a previous job and is being used by this one. The memory placement for
> the segment will depend entirely on whether the previous job touched a
> particular page and where that job ran. It may get migrated depending
> on whether any other jobs anywhere else on the system are using it and
> whether any of its pages are on the job's old node list.
>
> These types of mappings have always given us issues (Irix as well as
> Linux) and are difficult to handle. The one additional nice feature of
> having an external migration facility is that we might be able to use it
> from the command line to move the shared memory segment over to the nodes
> that the job is using. This is just off-the-cuff thinking and hasn't
> been fully thought through.
>
>>P.S. - or perhaps what you're telling me with the bit about shared file
>>mappings is not that you must not move any such shared file pages as
>>well, but that you'd rather not, as there are perhaps many such pages,
>>and the time spent moving them would be wasted. Are you saying that you
>>want to move some subset of a job's pages, as an optimization, because
>>for a large chunk of pages, such as for some files and libraries shared
>>with other jobs, the expense of migrating them would not be paid back?
>
> I believe Ray's proposed userland piece would migrate shared libraries
> used exclusively by this job. Was that right, Ray?

Yes, that was the intent.

> Here is my real question. How much opposition is there to the array
> of integers? This does not seem like a risky interface to me. If there
> is not a lot of opposition to the arrays, can we discuss the rest of
> the proposal and accept the arrays for the time being? The array can
> be addressed once we know that the syscall-for-migration idea is sound.
>
> Thanks,
> Robin

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
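Robin's call-count arithmetic is easier to see laid out as code. Both
prototypes below are the hypothetical interfaces from this thread (the
single-pair variant gets a distinct name here only because C cannot
overload the array form), and the node numbers are the ones from his
example:

	#include <sys/types.h>

	/* Proposed interfaces, not real syscalls; "_pair" is an
	 * invented name for the single old/new-node variant. */
	extern long sys_page_migrate(pid_t pid, unsigned long va_start,
				     unsigned long va_end, int count,
				     const int *old_nodes, const int *new_nodes);
	extern long sys_page_migrate_pair(pid_t pid, unsigned long va_start,
					  unsigned long va_end,
					  int old_node, int new_node);

	static void array_method(pid_t pid, unsigned long start, unsigned long end)
	{
		int from[128], to[128], i;

		for (i = 0; i < 128; i++) {
			from[i] = 2 + i;   /* job now on nodes 2..129 (say) */
			to[i]   = 32 + i;  /* destinations 32..159          */
		}
		/* One page-table scan covers all 128 pairs; in the example,
		 * two more calls on other pids mop up the remaining 4%. */
		sys_page_migrate(pid, start, end, 128, from, to);
	}

	static void pair_method(pid_t pid, unsigned long start, unsigned long end)
	{
		int i;

		/* The same 96%-intact shared mapping is rescanned once per
		 * node pair: 128 scans here, 384 calls over the three pids. */
		for (i = 0; i < 128; i++)
			sys_page_migrate_pair(pid, start, end, 2 + i, 32 + i);
	}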
* Re: manual page migration -- issue list
  2005-02-16 23:05         ` Ray Bryant
@ 2005-02-17  0:28           ` Paul Jackson
  0 siblings, 0 replies; 24+ messages in thread
From: Paul Jackson @ 2005-02-17 0:28 UTC (permalink / raw)
To: Ray Bryant; +Cc: holt, linux-mm, ak, haveblue, marcello, stevel, peterc

Ray wrote:
> resistance to the original system call I
> proposed appears to be somewhat stiff.

Do not confuse the thickness of my skull with the profundity of my
thought. As you might notice in some other posts, Robin succeeded, after
a few frustrating moments, in educating me to the true brilliance of
your original system call proposal.

<grin>

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
* Re: manual page migration -- issue list
  2005-02-15 23:52 manual page migration -- issue list Ray Bryant
  2005-02-16  0:09 ` Paul Jackson
  2005-02-16  0:51 ` Paul Jackson
@ 2005-02-16  1:41 ` Paul Jackson
  2005-02-16  3:56   ` Ray Bryant
  2 siblings, 1 reply; 24+ messages in thread
From: Paul Jackson @ 2005-02-16 1:41 UTC (permalink / raw)
To: Ray Bryant; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

A couple of comments in response to Andi's earlier post on the
related lkml thread ...

Andi wrote:
> Sorry, but the only real difference between your API and mbind is that
> yours has a pid argument.

One other difference shouts out at me. I am unsure of my reading of
Andi's post, so I can't tell if (1) it was so obvious Andi didn't
bother mentioning it, or (2) he doesn't see it as a difference.

That difference is this.

The various numa mechanisms, such as mbind, set_mempolicy and cpusets,
as well as the simple first touch that MPI jobs rely on, are all about
setting a policy for where future allocations should go.

This page migration mechanism is all about changing the placement of
physical pages of ram that are currently allocated.

At any point in time, numa policy guides future allocations, and page
migration redoes past allocations.

Andi wrote:
> My thinking is the simplest way to handle that is to have a call that just
> migrates everything.

I might have ended up at the same place, not sure, when I just suggested
in my previous post:

pj wrote:
> As a straw man, let me push the factored migration call to the
> extreme, and propose a call:
>
>     sys_page_migrate(pid, oldnode, newnode)
>
> that moves any physical page in the address space of pid that is
> currently located on oldnode to newnode.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
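Paul's policy-versus-migration distinction, in code form: mbind(2) is a
real NUMA API of the day (from libnuma's <numaif.h>, link with -lnuma)
and only steers future faults, while the straw-man migration call is the
hypothetical proposal from this thread. A minimal sketch under those
assumptions:

	#include <numaif.h>
	#include <sys/types.h>

	/* Hypothetical -- the straw-man call proposed above. */
	extern long sys_page_migrate(pid_t pid, int oldnode, int newnode);

	static void policy_vs_migration(void *addr, unsigned long len, pid_t pid)
	{
		unsigned long mask = 1UL << 7;	/* nodemask naming node 7 */

		/* Policy: future faults in [addr, addr+len) allocate on
		 * node 7.  Pages already first-touched elsewhere stay put. */
		mbind(addr, len, MPOL_BIND, &mask, 8 * sizeof(mask), 0);

		/* Migration: redo past allocations -- anything of <pid>
		 * resident on node 4 moves to node 7. */
		sys_page_migrate(pid, 4, 7);
	}

This is also the shape of Ray's first-touch question below: mbind only
speaks for ranges that carry a policy, while most HPC pages are placed by
first touch and carry none.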
* Re: manual page migration -- issue list
  2005-02-16  1:41 ` Paul Jackson
@ 2005-02-16  3:56   ` Ray Bryant
  0 siblings, 0 replies; 24+ messages in thread
From: Ray Bryant @ 2005-02-16 3:56 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

Paul Jackson wrote:
> A couple of comments in response to Andi's earlier post on the
> related lkml thread ...
>
> Andi wrote:
>>Sorry, but the only real difference between your API and mbind is that
>>yours has a pid argument.
>
> One other difference shouts out at me. I am unsure of my reading of
> Andi's post, so I can't tell if (1) it was so obvious Andi didn't
> bother mentioning it, or (2) he doesn't see it as a difference.
>
> That difference is this.
>
> The various numa mechanisms, such as mbind, set_mempolicy and cpusets,
> as well as the simple first touch that MPI jobs rely on, are all about
> setting a policy for where future allocations should go.
>
> This page migration mechanism is all about changing the placement of
> physical pages of ram that are currently allocated.
>
> At any point in time, numa policy guides future allocations, and page
> migration redoes past allocations.

Very nicely said, thanks. And the concern I have been trying to raise
with Andi is: How does that page migration mechanism redo a past
allocation using a memory policy if the original allocation was not done
with a memory policy, but instead done via first touch?

> Andi wrote:
>>My thinking is the simplest way to handle that is to have a call that just
>>migrates everything.
>
> I might have ended up at the same place, not sure, when I just suggested
> in my previous post:
>
> pj wrote:
>>As a straw man, let me push the factored migration call to the
>>extreme, and propose a call:
>>
>>    sys_page_migrate(pid, oldnode, newnode)
>>
>>that moves any physical page in the address space of pid that is
>>currently located on oldnode to newnode.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
end of thread, other threads:[~2005-02-17  0:28 UTC | newest]

Thread overview: 24+ messages
2005-02-15 23:52 manual page migration -- issue list Ray Bryant
2005-02-16  0:09 ` Paul Jackson
2005-02-16  0:28 ` Ray Bryant
2005-02-16  0:51 ` Paul Jackson
2005-02-16  1:17 ` Paul Jackson
2005-02-16  2:01 ` Robin Holt
2005-02-16  4:04 ` Ray Bryant
2005-02-16  4:28 ` Paul Jackson
2005-02-16  4:24 ` Paul Jackson
2005-02-16  3:55 ` Ray Bryant
2005-02-16  1:56 ` Robin Holt
2005-02-16  4:22 ` Paul Jackson
2005-02-16  9:20 ` Robin Holt
2005-02-16 10:20 ` Paul Jackson
2005-02-16 11:30 ` Robin Holt
2005-02-16 15:45 ` Paul Jackson
2005-02-16 16:08 ` Robin Holt
2005-02-16 19:23 ` Paul Jackson
2005-02-16 19:56 ` Robin Holt
2005-02-16 23:08 ` Ray Bryant
2005-02-16 23:05 ` Ray Bryant
2005-02-17  0:28 ` Paul Jackson
2005-02-16  1:41 ` Paul Jackson
2005-02-16  3:56 ` Ray Bryant