* manual page migration -- issue list
@ 2005-02-15 23:52 Ray Bryant
From: Ray Bryant @ 2005-02-15 23:52 UTC (permalink / raw)
To: linux-mm
Cc: Paul Jackson, Robin Holt, Andi Kleen, Dave Hansen,
Marcello Tosatti, Steve Longerbeam, Peter Chubb
I've been asked to repost this to the list with cc:'s to the interested
parties. The content below is the same as before; it's the email headers
that are more interesting. :-) (I hope I didn't miss anyone.)
==============================REPOSTING================================
The following is an attempt to summarize the issues that have
been raised thus far in this discussion. I'm hoping that this
list can help us resolve the issues in a (somewhat) organized
manner:
(1) Should there be a new API or should/can the migration functionality
be folded under the NUMA API?
(2) If we decide to make a new API, then what parameters should
that system call take? Proposals have been made for all of
the following:
-- pid, va_start, va_end, count, old_nodes, new_nodes
-- pid, va_start, va_end, old_node_mask, new_node_mask
-- pid, va_start, va_end, old_node, new_node
-- the same variations as above without the va_start/va_end
arguments (see the sketch of these candidate prototypes after
this list)
(3) If we make a new API, then how does that new API interact with the
NUMA API?
-- e.g., what happens when we migrate a VMA that has a mempolicy
associated with it?
(4) If we make a new API, how does it interact with the rest
of the VM system? For example, when we migrate part of a VMA,
do we split the VMA or not? (See also (5) below, since if we
decide that the migration interface needs to be able to migrate
processes without stopping them, the whole concept of talking
about such ephemeral things as VMAs becomes pointless.)
(5) How general a migration model are we supporting?
-- migration where old and new set of nodes might not be disjoint
-- migration of general processes (without suspension) or just
of suspended processes
-- how general a migration model is necessary to attract enough
users (more than SGI, say) to improve the chances of getting
the facility merged into the kernel?
(6) How do we determine which VMAs to migrate? (Subquestion: is
this done by the kernel or in user space?)
-- original idea: reference counts in /proc/pid/maps
-- newer idea: exclusion lists either set by marking the
file in some special way or by an explicit list
-- if we mark files as non-migratable, where is this information
stored?
(7) How does the migration API (in whatever form it takes) interact
with cpusets?
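For reference, the proposals under issue (2) map to roughly the
following prototypes. These are hypothetical sketches for discussion
only -- none of these calls exists, and all names and types here are
illustrative, not proposed kernel code:

    #include <sys/types.h>

    /* pid, va_start, va_end, count, old_nodes, new_nodes */
    long sys_page_migrate(pid_t pid, unsigned long va_start,
                          unsigned long va_end, int count,
                          const int *old_nodes, const int *new_nodes);

    /* pid, va_start, va_end, old_node_mask, new_node_mask
     * (mask form, in the style of mbind(2)'s nodemask argument) */
    long sys_page_migrate_mask(pid_t pid, unsigned long va_start,
                               unsigned long va_end,
                               const unsigned long *old_node_mask,
                               const unsigned long *new_node_mask,
                               unsigned long maxnode);

    /* pid, va_start, va_end, old_node, new_node */
    long sys_page_migrate_pair(pid_t pid, unsigned long va_start,
                               unsigned long va_end,
                               int old_node, int new_node);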
So first off, is this the complete list of issues? Can anyone suggest
an issue that isn't covered here?
--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 0:09 UTC (permalink / raw)
To: Ray Bryant; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

One minor issue not mentioned:

Is it more typical to pass the address range as a start and end
address, or as a start address and a length?

I suspect the latter.

Though, by the time we are done with all this, who knows whether that
minor issue will even apply any more.
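Both conventions appear in existing kernel interfaces; madvise(2) and
mbind(2), for example, take a start address and a length. A minimal
sketch contrasting the two candidate shapes (hypothetical prototypes,
not real calls):

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical prototypes, shown only to contrast the two
     * address-range conventions under discussion. */
    long migrate_range_end(pid_t pid, unsigned long va_start,
                           unsigned long va_end,
                           int old_node, int new_node);

    long migrate_range_len(pid_t pid, unsigned long va_start,
                           size_t length,
                           int old_node, int new_node);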
* Re: manual page migration -- issue list
From: Ray Bryant @ 2005-02-16 0:28 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

Paul Jackson wrote:
> One minor issue not mentioned:
>
> Is it more typical to pass the address range as a start and end
> address, or as a start address and a length?
>
> I suspect the latter.
>
> Though, by the time we are done with all this, who knows whether that
> minor issue will even apply any more.

Added to the list, for now at least.
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 0:51 UTC (permalink / raw)
To: Ray Bryant; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

Robin, replying to pj, from the earlier thread also on lkml:

> On Tue, Feb 15, 2005 at 08:35:29AM -0800, Paul Jackson wrote:
> > What about ...
> >
> >     sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
> >
> > to:
> >
> >     sys_page_migrate(pid, va_start, va_end, old_node, new_node);
> > ...
>
> Migration could be done in most cases and would only fall apart when
> there are overlapping node lists and no nodes available as temp space
> and we are not moving large chunks of data.

Given the <va_start, va_end>, which could be reduced to the granularity
of a single page if need be, there should not be an issue with
overlapping node lists. My "node as temp space" suggestion was insane -
never mind that one.

If worst comes to worst, you can handle any combination of nodes,
overlapping or not, with no extra or temporary copies, just by doing
one page at a time.

So this seems to boil down to whether it makes more sense to move
several nodes' worth of pages to several corresponding nodes in a
single call, or in several calls, roughly one call for each
<old, new> node pair.

The working example I was carrying around in my mind was of a job that
had one thread per cpu, and that had explicitly used some numa policy
(mbind, mempolicy or cpusets) to place memory on the node local to its
cpu.

Earlier today on the lkml thread, Robin described how a typical MPI job
works. It seems to rely on some startup code running in each thread,
carefully touching each page that should be local to that cpu before
any other thread touches said page, and requiring no particular memory
policy facility beyond first touch. Seems to me that the memory
migration requirements here are the same as they were for the example I
had in mind: each task has some currently allocated memory pages in its
address space that are on the local node of that task, and that memory
must stay local to that task after the migration.

Looking at such an MPI job as a whole, there seem to be pages scattered
across several nodes, where the only place it is 'encoded' how to place
them is in the job startup code that first touched each page.

A successful migration must replicate that memory placement, page for
page, just changing the nodes. From that perspective, it makes sense to
think of it as an array of old nodes and a corresponding array of new
nodes, where each page on an old node is to be migrated to the
corresponding new node.

However, since each thread allocated its memory locally, this factors
into N separate migrations, each of one task, one old node, and one new
node. Such a call doesn't migrate all physical pages in the target
task's memory, rather just the pages that are on the specified old
node.

The one thing not trivially covered in such a one task, one node pair
at a time factoring is memory that is placed on a node that is remote
from any of the tasks which map that memory. Let me call this 'remote
placement.' Offhand, I don't know why anyone would do this. If such
were rare, the one task, one node pair at a time factoring can still
migrate it easily enough, so long as it knows to do so, and issue
another system call for the necessary task and remote nodes (old and
new). If such remote placement were used in abundance, the one task,
one node pair at a time factoring would become inefficient. I don't
anticipate that remote placement will be used in abundance.

By the way, what happens if you're doing a move where the to and from
node sets overlap, and the kernel scans in the wrong order, and ends up
trying to put new pages onto a node that is in that overlap, before
pulling the old pages off it, running out of memory on that node?
Perhaps the smarts to avoid that should be in user space ;). This can
be avoided using the one task, one node pair at a time factored API,
because user space can control the order in which memory is migrated,
to avoid temporarily overloading the memory on any one node.

With this, I am now more convinced than I was earlier that passing a
single old node, new node pair, rather than the arrays of old and new
nodes, is just as good (hardly any more system calls in actual usage).
And it is better in one regard: it avoids the risk of the kernel
overloading the memory on some node during the migration if it scans
in the wrong order when doing an overlapped migration.
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 1:17 UTC (permalink / raw)
To: Paul Jackson
Cc: raybry, linux-mm, holt, ak, haveblue, marcello, stevel, peterc

As a straw man, let me push the factored migration call to the extreme,
and propose a call:

    sys_page_migrate(pid, oldnode, newnode)

that moves any physical page in the address space of pid that is
currently located on oldnode to newnode.

Won't this come about as close as we are going to get to replicating
the physical memory layout of a job, if we just call it once for each
task in that job? Oops - make that one call for each node in use by the
job - see the following ...

Earlier I (pj) wrote:
> The one thing not trivially covered in such a one task, one node pair
> at a time factoring is memory that is placed on a node that is remote
> from any of the tasks which map that memory. Let me call this 'remote
> placement.' Offhand, I don't know why anyone would do this.

Well - one case - headless nodes. These are memory-only nodes.

Typically one sys_page_migrate() call will be needed for each such
node, specifying some task in the job that has all the relevant memory
on that node mapped, specifying that (old) node, and specifying which
new node that memory should be migrated to.
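Driven from user space, the straw man amounts to a loop over the job's
nodes. A minimal sketch, with the syscall stubbed out since no such
call exists; the pid and node numbers are assumed values:

    #include <stdio.h>
    #include <sys/types.h>

    /* Hypothetical wrapper for the straw-man call above, stubbed out
     * here; a real wrapper would trap into the kernel. */
    static long sys_page_migrate(pid_t pid, int oldnode, int newnode)
    {
        printf("pid %d: move pages on node %d to node %d\n",
               (int)pid, oldnode, newnode);
        return 0;
    }

    int main(void)
    {
        pid_t pid = 1234;                /* any one task in the job */
        int old_nodes[] = { 4, 5, 6 };   /* nodes the job occupies */
        int new_nodes[] = { 8, 9, 10 };  /* disjoint destination set */
        int i;

        for (i = 0; i < 3; i++)          /* one call per node in use */
            sys_page_migrate(pid, old_nodes[i], new_nodes[i]);
        return 0;
    }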
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 2:01 UTC (permalink / raw)
To: Paul Jackson
Cc: raybry, linux-mm, holt, ak, haveblue, marcello, stevel, peterc

On Tue, Feb 15, 2005 at 05:17:09PM -0800, Paul Jackson wrote:
> As a straw man, let me push the factored migration call to the
> extreme, and propose a call:
>
>     sys_page_migrate(pid, oldnode, newnode)

Go look at the mappings in /proc/<pid>/maps once and you will see how
painful this can make things, especially for applications with shared
mappings. Overlapping nodes with the above will make a complete mess of
your memory placement.

Robin
* Re: manual page migration -- issue list
From: Ray Bryant @ 2005-02-16 4:04 UTC (permalink / raw)
To: Robin Holt; +Cc: Paul Jackson, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin Holt wrote:
> Go look at the mappings in /proc/<pid>/maps once and you will see how
> painful this can make things, especially for applications with shared
> mappings. Overlapping nodes with the above will make a complete mess
> of your memory placement.

So let's address that issue again, since I think that is now the heart
of the matter. Exactly why do we need to support the case where the set
of old nodes and new nodes overlap? I agree it is more general, but if
we drop that, I think we are one step closer to getting agreement as to
what the page migration system call interface should be.

Do we have a case, say from IRIX, of why supporting this kind of
migration is necessary?
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 4:28 UTC (permalink / raw)
To: Ray Bryant; +Cc: holt, linux-mm, ak, haveblue, marcello, stevel, peterc

Ray wrote:
> Exactly why do we need to support the case where the set of old
> nodes and new nodes overlap?

Actually, I think they can overlap, just so long as the set of old
nodes is not identical to the set of new nodes. It's this "perfect
shuffle, in place" that can't be done without the infamous insane
temporary node.

But that's likely beside the point, as I have already adequately
demonstrated that there is some requirement here that Robin knows and I
don't. Yet, anyway.
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 4:24 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> Overlapping nodes with the above will make
> a complete mess of your memory placement.

I agree we don't want to overlap nodes. I don't yet understand why my
simple (simplistic?) version of this system call leads us to overlapped
nodes.
* Re: manual page migration -- issue list
From: Ray Bryant @ 2005-02-16 3:55 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

Paul Jackson wrote:
> As a straw man, let me push the factored migration call to the
> extreme, and propose a call:
>
>     sys_page_migrate(pid, oldnode, newnode)
>
> that moves any physical page in the address space of pid that is
> currently located on oldnode to newnode.
>
> Won't this come about as close as we are going to get to replicating
> the physical memory layout of a job, if we just call it once for each
> task in that job? Oops - make that one call for each node in use by
> the job - see the following ...
>
> Well - one case - headless nodes. These are memory-only nodes.
>
> Typically one sys_page_migrate() call will be needed for each such
> node, specifying some task in the job that has all the relevant memory
> on that node mapped, specifying that (old) node, and specifying which
> new node that memory should be migrated to.

This works provided you get Robin and Jack and all to drop the
requirement that my page migration facility support overlapping sets of
origin and destination nodes. Otherwise, this is a non-starter. So,
let's go back to that one.

Robin, can you provide me with a concrete (not hypothetical) example of
a case where the from and to sets of nodes are overlapping?
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 1:56 UTC (permalink / raw)
To: Paul Jackson
Cc: Ray Bryant, linux-mm, holt, ak, haveblue, marcello, stevel, peterc

On Tue, Feb 15, 2005 at 04:51:06PM -0800, Paul Jackson wrote:
> Earlier today on the lkml thread, Robin described how a typical MPI
> job works. It seems to rely on some startup code running in each
> thread, carefully touching each page that should be local to that cpu
> before any other thread touches said page, and requiring no particular
> memory policy facility beyond first touch. Seems to me that the memory
> migration requirements here are the same as they were for the example
> I had in mind: each task has some currently allocated memory pages in
> its address space that are on the local node of that task, and that
> memory must stay local to that task after the migration.

One important point I probably forgot to make is that there is
typically a very large shared anonymous mapping created before the
initial fork. This will result in many processes sharing the vma
discussed below.

> Looking at such an MPI job as a whole, there seem to be pages
> scattered across several nodes, where the only place it is 'encoded'
> how to place them is in the job startup code that first touched each
> page.
>
> A successful migration must replicate that memory placement, page for
> page, just changing the nodes. From that perspective, it makes sense
> to think of it as an array of old nodes and a corresponding array of
> new nodes, where each page on an old node is to be migrated to the
> corresponding new node.

And given the large single mapping and two arrays corresponding to
old/new nodes, a single call would handle the migration, even with
overlapping regions, in a single call and pass over the ptes.

> However, since each thread allocated its memory locally, this factors
> into N separate migrations, each of one task, one old node, and one
> new node. Such a call doesn't migrate all physical pages in the target
> task's memory, rather just the pages that are on the specified old
> node.

If you do that for each job with the shared mapping and have
overlapping node lists, you end up combining two nodes and not being
able to separate them. Oh sure, we could add in a page flag indicating
that the page is going to be migrated, add a syscall which you call on
the VMA first to set all the flags, and then, as pages are moved with
the one-for-one syscalls, clear the flag. Oh yeah, we also need to add
an additional syscall to clear any flags for pages that did not get
migrated because they were not in the old list at all.

> The one thing not trivially covered in such a one task, one node pair
> at a time factoring is memory that is placed on a node that is remote
> from any of the tasks which map that memory. Let me call this 'remote
> placement.' Offhand, I don't know why anyone would do this. If such
> were rare, the one task, one node pair at a time factoring can still
> migrate it easily enough, so long as it knows to do so, and issue
> another system call for the necessary task and remote nodes (old and
> new). If such remote placement were used in abundance, the one task,
> one node pair at a time factoring would become inefficient. I don't
> anticipate that remote placement will be used in abundance.

Unfortunately it does happen often for stuff like shared file mappings
that a different job is using in conjunction with this job. There are
other considerations as well, such as shared libraries, but we can
minimize that noise in this discussion for the time being.

> By the way, what happens if you're doing a move where the to and from
> node sets overlap, and the kernel scans in the wrong order, and ends
> up trying to put new pages onto a node that is in that overlap, before
> pulling the old pages off it, running out of memory on that node?
> Perhaps the smarts to avoid that should be in user space ;). This can
> be avoided using the one task, one node pair at a time factored API,
> because user space can control the order in which memory is migrated,
> to avoid temporarily overloading the memory on any one node.

Unfortunately, userspace can not avoid this easily, as it does not know
which pages in the virtual address space are on which nodes. It could
do some kludge work and only call for va ranges that are smaller than
the most available memory on any of the destination nodes, but that
might make things sort of hackish. Alternatively, the syscall handler
could do some work to find chunks of memory that are being used by that
node, process that chunk, and then return. Makes stuff ugly, but is a
possibility as well.

> With this, I am now more convinced than I was earlier that passing a
> single old node, new node pair, rather than the arrays of old and new
> nodes, is just as good (hardly any more system calls in actual usage).
> And it is better in one regard: it avoids the risk of the kernel
> overloading the memory on some node during the migration if it scans
> in the wrong order when doing an overlapped migration.

Shared mappings and overlapping regions make the node arrays necessary.
A single old/new pair _DOES_ result in more system calls and therefore
more scans over the ptes. It does result in problems with overlapping
old/new node lists. It does not help with out-of-memory issues. It
accomplishes nothing other than making the syscall interface different.

Thanks,
Robin
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 4:22 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> If you do that for each job with the shared mapping and have
> overlapping node lists, you end up combining two nodes and not being
> able to separate them.

I don't see the problem. Just don't move a task onto a node until you
have moved the one that was already there, if any, off.

Say, for example, you want to move a job from nodes 4, 5 and 6 to nodes
5, 6 and 7, respectively. First move 6 to 7, then 5 to 6, then 4 to 5.
Or save some migration, and just move what's on 4 to 7, leaving 5 and 6
as is.

At any point, either there is at least one new node not currently
occupied by some not yet migrated task, or else you're just reshuffling
a set of tasks on the same set of nodes, which I presume would be
without purpose and so we don't need to support it. If we did need to
support shuffling a job on its current node set, I'd have to plead
insanity and reintroduce the temporary node hack.

> Unfortunately it does happen often for stuff like shared file mappings
> that a different job is using in conjunction with this job.

This might be the essential detail I'm missing. I'm not sure what you
mean here (see P.S. at end), but it seems that you are telling me you
must have the ability to avoid moving parts of a job. That for a given
task, pinned to a given cpu, with various physical pages on the node
local to that cpu, some of those pages must not move, because they are
used in conjunction with some other job that is not being migrated at
this time.

If that's the case, aren't you pretty much guaranteeing the migrated
job will not run as well as before the migration - some of the pages it
was using that were local are now remote?

And if that's the case, I take it you are presuming that the server
process doing the migration has intimate knowledge of the tasks being
migrated, and of the various factors that determine which pages of
those tasks should migrate and which should not. Uggh.

I am working from the idea that you've got some job, running on some
nodes, and that you just want to jack up that job and put it back down
on an isomorphic set of nodes - same number of nodes, same (or at least
sufficient) amount of memory on the nodes, possibly an overlapping set
of nodes, just not the self-same identical set of nodes. I was
presuming that everything in the address spaces of the tasks in the job
should move, and should end up placed the same, relative to the tasks
in the job, as before, just on different node numbers.

Even shared library pages can move - if this job happened to be the one
that paged that portion of the library in, then perhaps this job has
the most use for that page. That, or it's just a popular page left over
from the dawn of time and it doesn't matter much which node holds it.

Perhaps I have the wrong idea here?

> Unfortunately, userspace can not avoid this easily, as it does not
> know which pages in the virtual address space are on which nodes.

Userspace doesn't need to know that. It only needs to know that at
least one node in the set of new nodes is not still occupied by an
unmigrated task in the job. See the example above.

> Oh sure, we could add in ...
> Oh yeah, we also need to add ...
> It could do some kludge work and only call ...

No need to spend too much effort elaborating such additions ... the
mere fact that you find them necessary means that either it's not as
simple as I think, or it's simpler than you think. In other words, one
of us (most likely me) doesn't understand the real requirements here.

P.S. - or perhaps what you're telling me with the bit about shared file
mappings is not that you must not move any such shared file pages, but
that you'd rather not, as there are perhaps many such pages, and the
time spent moving them would be wasted. Are you saying that you want to
move some subset of a job's pages, as an optimization, because for a
large chunk of pages, such as for some files and libraries shared with
other jobs, the expense of migrating them would not be paid back?
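Paul's 4,5,6 -> 5,6,7 example amounts to ordering the per-pair calls so
that a destination node is drained before it is filled. A minimal
user-space sketch of that scheduling, against the same hypothetical
stubbed syscall as before:

    #include <stdio.h>
    #include <sys/types.h>

    /* Hypothetical per-pair syscall, stubbed for illustration. */
    static long sys_page_migrate(pid_t pid, int oldnode, int newnode)
    {
        printf("pid %d: node %d -> node %d\n",
               (int)pid, oldnode, newnode);
        return 0;
    }

    int main(void)
    {
        pid_t pid = 1234;    /* assumed task in the job */
        int i;

        /* Overlapping move {4,5,6} -> {5,6,7}: work from the high end
         * so each destination is emptied before it is filled:
         * 6->7 first, then 5->6, then 4->5. */
        int old_nodes[] = { 6, 5, 4 };
        int new_nodes[] = { 7, 6, 5 };

        for (i = 0; i < 3; i++)
            sys_page_migrate(pid, old_nodes[i], new_nodes[i]);
        return 0;
    }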
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 9:20 UTC (permalink / raw)
To: Paul Jackson
Cc: Robin Holt, raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

On Tue, Feb 15, 2005 at 08:22:14PM -0800, Paul Jackson wrote:
> Robin wrote:
> > If you do that for each job with the shared mapping and have
> > overlapping node lists, you end up combining two nodes and not
> > being able to separate them.
>
> I don't see the problem. Just don't move a task onto a node until you
> have moved the one that was already there, if any, off.
>
> Say, for example, you want to move a job from nodes 4, 5 and 6 to
> nodes 5, 6 and 7, respectively. First move 6 to 7, then 5 to 6, then
> 4 to 5. Or save some migration, and just move what's on 4 to 7,
> leaving 5 and 6 as is.

Moving 4 to 7 will likely change the node-to-node distance for the
processes within that job. You will probably need to do the 6-7, 5-6,
4-5 sequence to keep relative distances the same. Again, the batch
scheduler will tell us whether a simple 4-7 move is possible or whether
we need to shift each.

I should correct my earlier statement. As long as you have a separate
node in the new list that is not in the old, you could accomplish it in
a one-at-a-time fashion. What that would result in is a syscall for
each non-overlapping vma per node. Multiply that by the number of
nodes, with each system call going over that same shared vma.

For the sake of discussion, let's assume this is a 256p job using 128
nodes and a shared message block of 2GB per task. You will have a 512GB
shared mapping which will have some holes punched in it (no single task
will have the entire mapping unscathed). Again, for the sake of
discussion, let's assume that 96% of the shared buffer is intact for
the process we choose to do the initial migration on. Compare the
single-node method to the array method.

Array method:
1) Call the system call with pid, va_start, va_end, 128,
   [2,3,4,5...], [32,33,34,...]. This will scan the page tables _ONCE_
   and migrate the pages to their new destination.
2) Call the system call on a second pid to cover 1/2 of the remaining
   4% of the address space. Again, a single scan over that portion of
   the address space.
3) Call the system call on a third pid to cover the last portion of
   the address space.

With this, we have made 3 system calls and scanned the entire address
range 1 time.

Single-parameter method:
1) For a single pid, call the system call 128 times with pid,
   va_start, va_end, from, to, which scans the 96% chunk 128 times.
2) Repeat 128 times with the second pid.
3) Repeat 128 times with the third pid.

We have now made the system call 384 times and scanned the entire
address range 128 times.

Do you see why I called this insane? This is all because you don't like
to pass in a complex array of integers. That seems like a very small
thing to ask to save 127 scans of a 512GB address space.

I believe that is what I called insane earlier. I reserve the right to
be wrong.

> At any point, either there is at least one new node not currently
> occupied by some not yet migrated task, or else you're just
> reshuffling a set of tasks on the same set of nodes, which I presume
> would be without purpose and so we don't need to support it. If we did
> need to support shuffling a job on its current node set, I'd have to
> plead insanity and reintroduce the temporary node hack.
>
> > Unfortunately it does happen often for stuff like shared file
> > mappings that a different job is using in conjunction with this job.
>
> This might be the essential detail I'm missing. I'm not sure what you
> mean here (see P.S. at end), but it seems that you are telling me you
> must have the ability to avoid moving parts of a job. That for a given
> task, pinned to a given cpu, with various physical pages on the node
> local to that cpu, some of those pages must not move, because they are
> used in conjunction with some other job that is not being migrated at
> this time.

For the simple case, assume a sysV shared memory segment that was
created by a previous job is being used by this one. The memory
placement for the segment will depend entirely on whether the previous
job touched a particular page and where that job ran. It may get
migrated depending upon whether any other jobs anywhere else on the
system are using it and any of the pages are on the job's old node
list. These types of mappings have always given us issues (Irix as well
as Linux) and are difficult to handle.

The one additional nice feature of having an external migration
facility is that we might be able to use this type of thing from a
command line to move the shared memory segment over to nodes that the
job is using. This has just been off-the-cuff thinking lately and
hasn't been fully thought through.

> P.S. - or perhaps what you're telling me with the bit about shared
> file mappings is not that you must not move any such shared file
> pages, but that you'd rather not, as there are perhaps many such
> pages, and the time spent moving them would be wasted. Are you saying
> that you want to move some subset of a job's pages, as an
> optimization, because for a large chunk of pages, such as for some
> files and libraries shared with other jobs, the expense of migrating
> them would not be paid back?

I believe Ray's proposed userland piece would migrate shared libraries
used exclusively by this job. Was that right, Ray?

Here is my real question: how much opposition is there to the array of
integers? This does not seem like a risky interface to me. If there is
not a lot of opposition to the arrays, can we discuss the rest of the
proposal and accept the arrays for the time being? The array can be
addressed once we know that the syscall-for-migrating idea is sound.

Thanks,
Robin
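For concreteness, step (1) of the array method above might look like
the following. A minimal sketch with the syscall stubbed out (no such
call exists; the pid, address range, and node numbering are assumed
from Robin's example, and a 64-bit platform is assumed):

    #include <stdio.h>
    #include <sys/types.h>

    /* Hypothetical array-API wrapper, stubbed for illustration. */
    static long sys_page_migrate(pid_t pid, unsigned long va_start,
                                 unsigned long va_end, int count,
                                 const int *old_nodes,
                                 const int *new_nodes)
    {
        printf("pid %d: %d node pairs, one scan of [%#lx, %#lx)\n",
               (int)pid, count, va_start, va_end);
        return 0;
    }

    int main(void)
    {
        int old_nodes[128], new_nodes[128], i;

        for (i = 0; i < 128; i++) {
            old_nodes[i] = 2 + i;     /* nodes 2..129 (assumed) */
            new_nodes[i] = 32 + i;    /* nodes 32..159 (assumed) */
        }
        /* One call covers the 512GB shared mapping in a single
         * page-table scan, all 128 node pairs at once. */
        sys_page_migrate(1234, 0x2000000000UL, 0xa000000000UL,
                         128, old_nodes, new_nodes);
        return 0;
    }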
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 10:20 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> What that would result in is a syscall for each
> non-overlapping vma per node.

My latest, most radical, proposal did not take an address range. It was
simply:

    sys_page_migrate(pid, oldnode, newnode)

It would be called once per node. In your example, this would be 128
calls. Nothing "for each non-overlapping vma". Just per node.

Until I drove you to near distraction, and you spelled out the details
of an example that migrated 96% of the address space in the first call,
and only needed 3 calls total, I would have presumed that the API:

    sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes)

would have required one call per pid, or 256 calls, for your example.

My method did not look insanely worse to me; indeed it would have
looked better in this example with two tasks per node, since I did one
call per node, and I thought you did one per task.

... However, I see now that you can routinely get by with dramatically
fewer calls than the number of tasks, by noticing what portions of the
typically huge shared address space have already been covered, and not
covering them again.

There is no need to convince me that 384 syscalls and 128 full scans is
insanely worse than 3 syscalls with 1 full scan, and no need to get
frustrated that I cannot see the insanity of it.

However, you might have wanted to allow for the possibility, when you
reduced what you thought I was proposing to insanity, that rather than
my proposing something insane, perhaps we had different numbers ... as
happened here. Your numbers for the array API had 80 times fewer system
calls than I would have expected, and your numbers for the single
parameter call had 3 times _more_ system calls than I had in mind (I
had one call per node, period, not one per node per vma or whatever).

> How much opposition is there to the array of integers?

My opposition to the array was not profound. It needed to provide an
advantage, which I didn't see that it much did.

I now see it provides an advantage, dramatically reducing the number of
system calls and scans in typical cases, to substantially fewer than
either the number of tasks or of nodes.

Ok ... onward. I'll take the node arrays.

The next concern that rises to the top for me was best expressed by
Andi:
> The main reasons for that is that I don't think external
> processes should mess with virtual addresses of another process.
> It just feels unclean and has many drawbacks (parsing /proc/*/maps
> needs complicated user code, racy, locking difficult).
>
> In kernel space handling full VMs is much easier and safer due to
> better locking facilities.

I share Andi's concerns, but I don't see what to do about this. Andi's
recommendations seem to be about memory policies (which guide future
allocations), and not about migration of already allocated physical
pages. So for now at least, his recommendations don't seem like answers
to me.
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 11:30 UTC (permalink / raw)
To: Paul Jackson
Cc: Robin Holt, raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

On Wed, Feb 16, 2005 at 02:20:09AM -0800, Paul Jackson wrote:
> The next concern that rises to the top for me was best expressed by
> Andi:
> > The main reasons for that is that I don't think external
> > processes should mess with virtual addresses of another process.
> > It just feels unclean and has many drawbacks (parsing /proc/*/maps
> > needs complicated user code, racy, locking difficult).
> >
> > In kernel space handling full VMs is much easier and safer due to
> > better locking facilities.
>
> I share Andi's concerns, but I don't see what to do about this. Andi's
> recommendations seem to be about memory policies (which guide future
> allocations), and not about migration of already allocated physical
> pages. So for now at least, his recommendations don't seem like
> answers to me.

If we had the ability to change the vendor-provided software to meet
our needs, that would be wonderful. Unfortunately, most of this type of
code runs on _MANY_ different OSs and architectures. If you could get
the NUMA API into everything from AIX to Windows XP, I think you would
have a very good chance of convincing ISVs to start converting. Until
then, there is no clear win over first touch for their type of
application. With that in mind, we are left with doing things from the
outside in.

Heck, if we could get them to change their code, cpusets would be
irrelevant as well ;)

Thanks,
Robin
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 15:45 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> Until then, there is no clear win over first
> touch for their type of application.

Huh? So what was the point of this rant? <grin>

You seem to explain why first touch is used instead of the Linux 2.6
numa placement calls mbind/mempolicy, in some third party code that
runs on multiple operating systems. But I thought this was the page
migration thread, not the placement policy thread. Now I am as
mystified with your latest comments as I was with Andi's discussion of
using these memory policy calls.

Regardless of what mechanisms we use to guide future allocations to
their proper nodes, how best can we provide a facility to migrate
already allocated physical memory pages to other nodes? That's the
question, or so I thought, on this thread.

To repeat myself ...
> I share Andi's concerns, but I don't see what to do about this.

Perhaps a part of the answer is that we aren't messing with (as in
"changing") the virtual addresses of other processes. The migration
call is only reading these addresses. What it messes with is the
_physical_ addresses ;). Though this proposed call still seems to have
some of the same drawbacks.

One of my motivations for pursuing the no-array version of this call
that you loved so much was that it (my latest variant, anyway) didn't
pass any virtual address ranges in, further simplifying what crossed
the user-kernel boundary and leaving details of parsing the virtual
address layout of tasks strictly to the kernel (no need to read
/proc/*/maps).

But it seems that if we are going to achieve the fairly significant
optimizations you enumerated in your example a few hours ago, we at
least have to parse the /proc/*/maps files.

Hmmm ... wait just a minute ... isn't parsing the maps files in /proc
really scanning the virtual addresses of tasks? In your example of a
few hours ago, which seemed to only require 3 system calls and one full
scan of any task address space, did you read all the /proc/*/maps
files, for all 256 of the tasks involved? I would think you would have
to have done so, or else one of these tasks could be holding onto some
private memory of its own that we would need to migrate.

Are the stack pages and any per-thread private data on pages visible to
all the threads, or are some of these pages private to each thread?
Does anything prevent a thread from having additional private pages
invisible to the other threads?

Could you redo your example, including the scans implied by reading
maps files, and including the system calls needed to do those reads,
and needed to migrate any private pages they might have? Perhaps your
preferred API doesn't have such an insane advantage after all.

I'm fixing soon to consider another variant of this call, one that
takes an _array_ of pids, along with the old and new arrays of nodes,
but takes no virtual address range. The kernel would scan each pid in
the array, migrating anything found on any old node to the
corresponding new node, all in one system call.

If my speculations above are right, this does the minimum of scans, one
per pid, and the minimum number of system calls - one. And it does so
without involving the user space code in racy maps-file reading to
determine what to call (though the kernel code would probably still
have more than its share of races to fuss over).
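The variant Paul floats here might look roughly like this - a
hypothetical prototype only, never merged; the name and types are
illustrative:

    #include <sys/types.h>

    /*
     * Hypothetical: for each pid in pids[0..npids-1], scan that
     * task's address space once, migrating any page found on
     * old_nodes[i] to new_nodes[i] (0 <= i < count). No virtual
     * address ranges cross the user-kernel boundary.
     */
    long page_migrate_pids(const pid_t *pids, int npids,
                           const int *old_nodes, const int *new_nodes,
                           int count);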
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 16:08 UTC (permalink / raw)
To: Paul Jackson
Cc: Robin Holt, raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

On Wed, Feb 16, 2005 at 07:45:50AM -0800, Paul Jackson wrote:
> Hmmm ... wait just a minute ... isn't parsing the maps files in /proc
> really scanning the virtual addresses of tasks? In your example of a
> few hours ago, which seemed to only require 3 system calls and one
> full scan of any task address space, did you read all the
> /proc/*/maps files, for all 256 of the tasks involved? I would think
> you would have to have

Reading /proc/<pid>/maps just scans through the vmas and not the
address space. Very different things!

> Could you redo your example, including the scans implied by reading
> maps files, and including the system calls needed to do those reads,
> and needed to migrate any private pages they might have? Perhaps your
> preferred API doesn't have such an insane advantage after all.

Ray, do you have your userland stuff in anywhere close to presentable
condition? If so, that might be the best for this part of the
discussion.

Robin
* Re: manual page migration -- issue list
From: Paul Jackson @ 2005-02-16 19:23 UTC (permalink / raw)
To: Robin Holt; +Cc: raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin wrote:
> Reading /proc/<pid>/maps just scans through the vmas and not the
> address space.

Yes - you're right.

So the number of system calls in your example of a few hours ago, using
your preferred array API, if you include the reads of each task's
/proc/<pid>/maps file, is about equal to the number of tasks, right?

And I take it that the user code you asked Ray about looks at these
maps files for each of the tasks to be migrated, identifies each mapped
range of each mapped object (mapped file or whatever), and calculates a
fairly minimum set of tasks and virtual address ranges therein,
sufficient to cover all the mapped objects that should be migrated,
thus minimizing the amount of scanning that needs to be done of
individual pages.

And further I take it that you recommend the above described code [to
find a fairly minimum set of tasks and address ranges to scan that will
cover any page of interest] be put in user space, not in the kernel (a
quite reasonable recommendation).

Why didn't your example have some writable private pages? Wouldn't such
pages be commonplace, and wouldn't they have to be migrated for each
thread, resulting in at least N calls to the new sys_page_migrate()
system call, for N tasks, rather than the 3 calls in your example?
* Re: manual page migration -- issue list
From: Robin Holt @ 2005-02-16 19:56 UTC (permalink / raw)
To: Paul Jackson
Cc: Robin Holt, raybry, linux-mm, ak, haveblue, marcello, stevel, peterc

On Wed, Feb 16, 2005 at 11:23:35AM -0800, Paul Jackson wrote:
> So the number of system calls in your example of a few hours ago,
> using your preferred array API, if you include the reads of each
> task's /proc/<pid>/maps file, is about equal to the number of tasks,
> right?
>
> And I take it that the user code you asked Ray about looks at these
> maps files for each of the tasks to be migrated, identifies each
> mapped range of each mapped object (mapped file or whatever), and
> calculates a fairly minimum set of tasks and virtual address ranges
> therein, sufficient to cover all the mapped objects that should be
> migrated, thus minimizing the amount of scanning that needs to be done
> of individual pages.
>
> And further I take it that you recommend the above described code [to
> find a fairly minimum set of tasks and address ranges to scan that
> will cover any page of interest] be put in user space, not in the
> kernel (a quite reasonable recommendation).

I think user space, for a few reasons. The code in the kernel will be
much easier to digest and to ensure is as bug-free as possible. If bugs
are found or issues arise in the portions that are in userland, we are
left with a maximum amount of flexibility to correct the issue without
needing a kernel code change.

In a different direction, if I am a support person trying to figure out
why an application is performing poorly, I can try migrating portions
of the application's address space to a node closer to the cpu and
hopefully see a performance improvement.

> Why didn't your example have some writable private pages? Wouldn't
> such pages be commonplace, and wouldn't they have to be migrated for
> each thread, resulting in at least N calls to the new
> sys_page_migrate() system call, for N tasks, rather than the 3 calls
> in your example?

You are right about everything above. The calls to migrate the private
regions will be small in comparison to the typical large shared
mapping. The real workhorse is always going to be walking the page
tables, and that will take time.

I am advocating a system call which covers the needs and also remains
flexible enough to correct shortcomings in our thinking about all the
possible permutations of user virtual address spaces.

Thanks,
Robin
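A minimal sketch of the userland side being discussed - walking
/proc/<pid>/maps to collect VMA ranges. The "start-end" range format at
the head of each maps line is real; everything else (coalescing ranges
across the job's tasks, honoring an exclusion list) is left out:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Print each VMA range of a task, as read from /proc/<pid>/maps.
     * A real migration tool would coalesce these ranges across the
     * job's tasks and skip mappings on an exclusion list. */
    static int print_vmas(pid_t pid)
    {
        char path[64], line[512];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
        f = fopen(path, "r");
        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            unsigned long start, end;

            if (sscanf(line, "%lx-%lx", &start, &end) == 2)
                printf("vma %#lx-%#lx (%lu KB)\n",
                       start, end, (end - start) >> 10);
        }
        fclose(f);
        return 0;
    }

    int main(void)
    {
        return print_vmas(getpid());    /* demo on ourselves */
    }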
* Re: manual page migration -- issue list
  2005-02-16 10:20           ` Paul Jackson
  2005-02-16 11:30             ` Robin Holt
@ 2005-02-16 23:08             ` Ray Bryant
  1 sibling, 0 replies; 24+ messages in thread
From: Ray Bryant @ 2005-02-16 23:08 UTC (permalink / raw)
To: Paul Jackson; +Cc: Robin Holt, linux-mm, ak, haveblue, marcello, stevel, peterc

Paul Jackson wrote:
> Robin wrote:
>>What that would result in is a syscall for each
>>non-overlapping vma per node.
>
> My latest, most radical, proposal did not take an address range. It was
> simply:
>
>     sys_page_migrate(pid, oldnode, newnode)
>
> It would be called once per node. In your example, this would be 128
> calls. Nothing "for each non-overlapping vma". Just per node.
>
> Until I drove you to near distraction, and you spelled out the details
> of an example that migrated 96% of the address space in the first call,
> and only needed 3 calls total, I would have presumed that the API:
>
>     sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes)
>
> would have required one call per pid, or 256 calls, for your example.
>
> My method did not look insanely worse to me; indeed, it would have looked
> better in this example with two tasks per node, since I did one call per
> node, and I thought you did one per task.
>
> ... However, I see now that you can routinely get by with dramatically
> fewer calls than the number of tasks, by noticing what portions of the
> typically huge shared address space have already been covered, and not
> covering them again.

Right, that was our original plan. So you only had to make as many system
calls as there were address ranges that needed to be migrated, more or
less. This assumes we have stopped the processes and can read and make
sense of /proc/*/maps.

> There is no need to convince me that 384 syscalls and 128 full scans
> are insanely worse than 3 syscalls with 1 full scan, and no need to
> get frustrated that I cannot see the insanity of it.
>
> However, you might have wanted to allow for the possibility, when you
> reduced what you thought I was proposing to insanity, that rather than
> my proposing something insane, perhaps we had different numbers ... as
> happened here. Your numbers for the array API had 80 times fewer system
> calls than I would have expected, and your numbers for the single
> parameter call had 3 times _more_ system calls than I had in mind (I had
> one call per node, period, not one per node per vma or whatever).
>
>>How much opposition is there to the array of integers?
>
> My opposition to the array was not profound. It needed to provide
> an advantage, which I didn't see that it did.
>
> I now see it provides an advantage, dramatically reducing the number of
> system calls and scans in typical cases, to substantially fewer than
> either the number of tasks or the number of nodes.
>
> Ok ... onward. I'll take the node arrays.
>
> The next concern that rises to the top for me was best expressed by Andi:
>
>>The main reason for that is that I don't think external
>>processes should mess with virtual addresses of another process.
>>It just feels unclean and has many drawbacks (parsing /proc/*/maps
>>needs complicated user code, racy, locking difficult).
>>
>>In kernel space handling full VMs is much easier and safer due to better
>>locking facilities.
>
> I share Andi's concerns, but I don't see what to do about this. Andi's
> recommendations seem to be about memory policies (which guide future
> allocations), and not about migration of already allocated physical
> pages. So for now at least, his recommendations don't seem like answers
> to me.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
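For reference, the node-array form that Paul accepts above, written out as
a C prototype. This is only the shape under discussion in this thread --
no such syscall exists in the kernel -- so the types and parameter names
here are illustrative:

	#include <sys/types.h>

	/*
	 * Proposed interface only.  For pages of <pid> in
	 * [va_start, va_end) found on old_nodes[i], one pass over the
	 * page tables moves them to new_nodes[i].
	 */
	long sys_page_migrate(pid_t pid,
			      unsigned long va_start,  /* first address to scan  */
			      unsigned long va_end,    /* one past the last      */
			      int count,               /* entries in both arrays */
			      const int *old_nodes,    /* pages found here ...   */
			      const int *new_nodes);   /* ... are moved here     */

The point of carrying all count node pairs in one call is that the
expensive part -- the page-table walk -- happens once, however long the
node lists are.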
* Re: manual page migration -- issue list
  2005-02-16  9:20         ` Robin Holt
  2005-02-16 10:20           ` Paul Jackson
@ 2005-02-16 23:05         ` Ray Bryant
  2005-02-17  0:28           ` Paul Jackson
  1 sibling, 1 reply; 24+ messages in thread
From: Ray Bryant @ 2005-02-16 23:05 UTC (permalink / raw)
To: Robin Holt; +Cc: Paul Jackson, linux-mm, ak, haveblue, marcello, stevel, peterc

Robin Holt wrote:
> On Tue, Feb 15, 2005 at 08:22:14PM -0800, Paul Jackson wrote:
>
>>Robin wrote:
>>
>>>If you do that for each job with the shared mapping and have overlapping
>>>node lists, you end up combining two nodes and not being able to separate
>>>them.
>>
>>I don't see the problem. Just don't move a task onto a node
>>until you have moved the one that was already there, if any, off.
>>
>>Say, for example, you want to move a job from nodes 4, 5 and 6 to nodes
>>5, 6 and 7, respectively. First move 6 to 7, then 5 to 6, then 4 to 5.
>>Or save some migration, and just move what's on 4 to 7, leaving 5 and
>>6 as is.

The customers I have talked to about this tell me that they never imagine
having a set of old and new nodes overlap. I agree it is more general to
allow this, but resistance to the original system call I proposed appears
to be somewhat stiff.

> Moving 4 to 7 will likely change the node-to-node distance for the
> processes within that job. You will probably need to do the 6-7, 5-6, 4-5
> sequence to keep relative distances the same. Again, the batch scheduler
> will tell us whether a simple 4-7 move is possible or whether we need to
> shift each.
>
> I should correct my earlier add. As long as you have a separate node
> in the new list that is not in the old, you could accomplish it in a
> one-at-a-time fashion. What that would result in is a syscall for each
> non-overlapping vma per node. Multiply that by the number of nodes, with
> each system call going over that same shared vma.
>
> For the sake of discussion, let's assume this is a 256p job using 128 nodes
> and a shared message block of 2GB per task. You will have a 512GB shared
> mapping which will have some holes punched in it (no single task will
> have the entire mapping unscathed). Again, for the sake of discussion,
> let's assume that 96% of the shared buffer is intact for the process we
> choose to do the initial migration on. Compare the single node method
> to the array method.

Would we really ever migrate something that big? I had the same concerns
about large address spaces and the like, but it just seems to me that if
something is that big, we'd leave it alone. :-)

> Array method:
> 1) Call the system call with pid, va_start, va_end, 128, [2,3,4,5...],
>    [32,33,34,...]. This will scan the page tables _ONCE_ and migrate the
>    pages to their new destination.
> 2) Call the system call on the second pid to cover 1/2 of the remaining
>    4% of the address space. Again, a single scan over that portion of
>    the address space.
> 3) Call the system call on the third pid to cover the last portion of
>    the address space.
>
> With this, we have made 3 system calls and scanned the entire address
> range 1 time.
>
> Single parameter method:
> 1) For a single pid, call the system call 128 times with (pid, va_start,
>    va_end, from, to), which scans the 96% chunk 128 times.
> 2) Repeat 128 times with the second pid.
> 3) Repeat 128 times with the third pid.
>
> We have now made the system call 384 times and scanned the entire address
> range 128 times.
>
> Do you see why I called this insane? This is all because you don't like
> to pass in a complex array of integers. That seems like a very small
> thing to ask to save 127 scans of a 512GB address space.

I agree, it sounds like a lot of work. Perhaps we should try this with my
prototype code and see how long it takes. But I really think this is a
contrived example. I don't think anyone would migrate a job that big. To
my way of thinking, the largest job we would ever migrate would be on the
order of 1/8th to 1/4 of the machine. Not 1/2. If it is 1/2 of the
machine, let's just leave the darn thing where it is. :-) (I always try
to let large sleeping dogs lie...)

> I believe that is what I called insane earlier. I reserve the right to
> be wrong.
>
>>At any point, either there is at least one new node not currently
>>occupied by some not yet migrated task, or else you're just reshuffling
>>a set of tasks on the same set of nodes, which I presume would be
>>without purpose and so need not be supported. If we did need to
>>support shuffling a job on its current node set, I'd have to plead
>>insanity, and reintroduce the temporary node hack.
>>
>>>Unfortunately it does happen often for stuff like shared file mappings
>>>that a different job is using in conjunction with this job.
>>
>>This might be the essential detail I'm missing. I'm not sure what you
>>mean here (see P.S., at end), but it seems that you are telling me you
>>must have the ability to avoid moving parts of a job. That for a given
>>task, pinned to a given cpu, with various physical pages on the node
>>local to that cpu, some of those pages must not move, because they are
>>used in conjunction with some other job that is not being migrated at
>>this time.
>
> For the simple case, assume a sysV shared memory segment that was created
> by a previous job and is being used by this one. The memory placement for
> the segment will depend entirely on whether the previous job touched a
> particular page and where that job ran. It may get migrated depending
> on whether any other jobs anywhere else on the system are using it and
> whether any of its pages are on the job's old node list.
>
> These types of mappings have always given us issues (Irix as well as
> Linux) and are difficult to handle. The one additional nice feature of
> having an external migration facility is that we might be able to use it
> from the command line to move the shared memory segment over to the nodes
> that the job is using. This is just off-the-cuff thinking and hasn't
> been fully thought through.
>
>>P.S. - or perhaps what you're telling me with the bit about shared file
>>mappings is not that you must not move any such shared file pages as
>>well, but that you'd rather not, as there are perhaps many such pages,
>>and the time spent moving them would be wasted. Are you saying that you
>>want to move some subset of a job's pages, as an optimization, because
>>for a large chunk of pages, such as for some files and libraries shared
>>with other jobs, the expense of migrating them would not be paid back?
>
> I believe Ray's proposed userland piece would migrate shared libraries
> used exclusively by this job. Was that right, Ray?

Yes, that was the intent.

> Here is my real question. How much opposition is there to the array
> of integers? This does not seem like a risky interface to me. If there
> is not a lot of opposition to the arrays, can we discuss the rest of
> the proposal and accept the arrays for the time being? The array can
> be addressed once we know that the syscall-for-migration idea is sound.
>
> Thanks,
> Robin

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
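Robin's call-count arithmetic is easier to see laid out as code. Both
prototypes below are the hypothetical interfaces from this thread (the
single-pair variant gets a distinct name here only because C cannot
overload the array form), and the node numbers are the ones from his
example:

	#include <sys/types.h>

	/* Proposed interfaces, not real syscalls; "_pair" is an
	 * invented name for the single old/new-node variant. */
	extern long sys_page_migrate(pid_t pid, unsigned long va_start,
				     unsigned long va_end, int count,
				     const int *old_nodes, const int *new_nodes);
	extern long sys_page_migrate_pair(pid_t pid, unsigned long va_start,
					  unsigned long va_end,
					  int old_node, int new_node);

	static void array_method(pid_t pid, unsigned long start, unsigned long end)
	{
		int from[128], to[128], i;

		for (i = 0; i < 128; i++) {
			from[i] = 2 + i;   /* job now on nodes 2..129 (say) */
			to[i]   = 32 + i;  /* destinations 32..159          */
		}
		/* One page-table scan covers all 128 pairs; in the example,
		 * two more calls on other pids mop up the remaining 4%. */
		sys_page_migrate(pid, start, end, 128, from, to);
	}

	static void pair_method(pid_t pid, unsigned long start, unsigned long end)
	{
		int i;

		/* The same 96%-intact shared mapping is rescanned once per
		 * node pair: 128 scans here, 384 calls over the three pids. */
		for (i = 0; i < 128; i++)
			sys_page_migrate_pair(pid, start, end, 2 + i, 32 + i);
	}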
* Re: manual page migration -- issue list
  2005-02-16 23:05         ` Ray Bryant
@ 2005-02-17  0:28           ` Paul Jackson
  0 siblings, 0 replies; 24+ messages in thread
From: Paul Jackson @ 2005-02-17 0:28 UTC (permalink / raw)
To: Ray Bryant; +Cc: holt, linux-mm, ak, haveblue, marcello, stevel, peterc

Ray wrote:
> resistance to the original system call I
> proposed appears to be somewhat stiff.

Do not confuse the thickness of my skull with the profundity of my
thought. As you might notice in some other posts, Robin succeeded, after
a few frustrating moments, in educating me to the true brilliance of
your original system call proposal.

<grin>

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
* Re: manual page migration -- issue list
  2005-02-15 23:52 manual page migration -- issue list Ray Bryant
  2005-02-16  0:09 ` Paul Jackson
  2005-02-16  0:51 ` Paul Jackson
@ 2005-02-16  1:41 ` Paul Jackson
  2005-02-16  3:56   ` Ray Bryant
  2 siblings, 1 reply; 24+ messages in thread
From: Paul Jackson @ 2005-02-16 1:41 UTC (permalink / raw)
To: Ray Bryant; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

A couple of comments in response to Andi's earlier post on the
related lkml thread ...

Andi wrote:
> Sorry, but the only real difference between your API and mbind is that
> yours has a pid argument.

One other difference shouts out at me. I am unsure of my reading of
Andi's post, so I can't tell if (1) it was so obvious Andi didn't
bother mentioning it, or (2) he doesn't see it as a difference.

That difference is this.

The various numa mechanisms, such as mbind, set_mempolicy and cpusets,
as well as the simple first touch that MPI jobs rely on, are all about
setting a policy for where future allocations should go.

This page migration mechanism is all about changing the placement of
physical pages of ram that are currently allocated.

At any point in time, numa policy guides future allocations, and page
migration redoes past allocations.

Andi wrote:
> My thinking is the simplest way to handle that is to have a call that just
> migrates everything.

I might have ended up at the same place, not sure, when I just suggested
in my previous post:

pj wrote:
> As a straw man, let me push the factored migration call to the
> extreme, and propose a call:
>
>     sys_page_migrate(pid, oldnode, newnode)
>
> that moves any physical page in the address space of pid that is
> currently located on oldnode to newnode.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
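Paul's policy-versus-migration distinction, in code form: mbind(2) is a
real NUMA API of the day (from libnuma's <numaif.h>, link with -lnuma)
and only steers future faults, while the straw-man migration call is the
hypothetical proposal from this thread. A minimal sketch under those
assumptions:

	#include <numaif.h>
	#include <sys/types.h>

	/* Hypothetical -- the straw-man call proposed above. */
	extern long sys_page_migrate(pid_t pid, int oldnode, int newnode);

	static void policy_vs_migration(void *addr, unsigned long len, pid_t pid)
	{
		unsigned long mask = 1UL << 7;	/* nodemask naming node 7 */

		/* Policy: future faults in [addr, addr+len) allocate on
		 * node 7.  Pages already first-touched elsewhere stay put. */
		mbind(addr, len, MPOL_BIND, &mask, 8 * sizeof(mask), 0);

		/* Migration: redo past allocations -- anything of <pid>
		 * resident on node 4 moves to node 7. */
		sys_page_migrate(pid, 4, 7);
	}

This is also the shape of Ray's first-touch question below: mbind only
speaks for ranges that carry a policy, while most HPC pages are placed by
first touch and carry none.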
* Re: manual page migration -- issue list
  2005-02-16  1:41 ` Paul Jackson
@ 2005-02-16  3:56   ` Ray Bryant
  0 siblings, 0 replies; 24+ messages in thread
From: Ray Bryant @ 2005-02-16 3:56 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, holt, ak, haveblue, marcello, stevel, peterc

Paul Jackson wrote:
> A couple of comments in response to Andi's earlier post on the
> related lkml thread ...
>
> Andi wrote:
>>Sorry, but the only real difference between your API and mbind is that
>>yours has a pid argument.
>
> One other difference shouts out at me. I am unsure of my reading of
> Andi's post, so I can't tell if (1) it was so obvious Andi didn't
> bother mentioning it, or (2) he doesn't see it as a difference.
>
> That difference is this.
>
> The various numa mechanisms, such as mbind, set_mempolicy and cpusets,
> as well as the simple first touch that MPI jobs rely on, are all about
> setting a policy for where future allocations should go.
>
> This page migration mechanism is all about changing the placement of
> physical pages of ram that are currently allocated.
>
> At any point in time, numa policy guides future allocations, and page
> migration redoes past allocations.

Very nicely said, thanks. And the concern I have been trying to raise
with Andi is: How does that page migration mechanism redo a past
allocation using a memory policy if the original allocation was not done
with a memory policy, but instead done via first touch?

> Andi wrote:
>>My thinking is the simplest way to handle that is to have a call that just
>>migrates everything.
>
> I might have ended up at the same place, not sure, when I just suggested
> in my previous post:
>
> pj wrote:
>>As a straw man, let me push the factored migration call to the
>>extreme, and propose a call:
>>
>>    sys_page_migrate(pid, oldnode, newnode)
>>
>>that moves any physical page in the address space of pid that is
>>currently located on oldnode to newnode.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
end of thread, other threads:[~2005-02-17  0:28 UTC | newest]

Thread overview: 24+ messages
2005-02-15 23:52 manual page migration -- issue list Ray Bryant
2005-02-16  0:09 ` Paul Jackson
2005-02-16  0:28 ` Ray Bryant
2005-02-16  0:51 ` Paul Jackson
2005-02-16  1:17 ` Paul Jackson
2005-02-16  2:01 ` Robin Holt
2005-02-16  4:04 ` Ray Bryant
2005-02-16  4:28 ` Paul Jackson
2005-02-16  4:24 ` Paul Jackson
2005-02-16  3:55 ` Ray Bryant
2005-02-16  1:56 ` Robin Holt
2005-02-16  4:22 ` Paul Jackson
2005-02-16  9:20 ` Robin Holt
2005-02-16 10:20 ` Paul Jackson
2005-02-16 11:30 ` Robin Holt
2005-02-16 15:45 ` Paul Jackson
2005-02-16 16:08 ` Robin Holt
2005-02-16 19:23 ` Paul Jackson
2005-02-16 19:56 ` Robin Holt
2005-02-16 23:08 ` Ray Bryant
2005-02-16 23:05 ` Ray Bryant
2005-02-17  0:28 ` Paul Jackson
2005-02-16  1:41 ` Paul Jackson
2005-02-16  3:56 ` Ray Bryant