Date: Tue, 15 Feb 2005 16:51:06 -0800
From: Paul Jackson
Subject: Re: manual page migration -- issue list
Message-Id: <20050215165106.61fd4954.pj@sgi.com>
In-Reply-To: <42128B25.9030206@sgi.com>
References: <42128B25.9030206@sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Ray Bryant
Cc: linux-mm@kvack.org, holt@sgi.com, ak@muc.de, haveblue@us.ibm.com,
    marcello@cyclades.com, stevel@mwwireless.net, peterc@gelato.unsw.edu.au
List-ID:

Robin, replying to pj, from the earlier thread also on lkml:

> On Tue, Feb 15, 2005 at 08:35:29AM -0800, Paul Jackson wrote:
> > What about ...
> >
> >     sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
> >
> > to:
> >
> >     sys_page_migrate(pid, va_start, va_end, old_node, new_node);
> >
> > ...
>
> Migration could be done in most cases and would only fall apart when
> there are overlapping node lists and no nodes available as temp space
> and we are not moving large chunks of data.

Given the va_start to va_end range, which could be reduced to the
granularity of a single page if need be, there should not be an issue
with overlapping node lists.

My "node as temp space" suggestion was insane - never mind that one.
If worst comes to worst, you can handle any combination of nodes,
overlapping or not, with no extra or temporary copies, just by doing
one page at a time.

So this seems to boil down to whether it makes more sense to move
several nodes' worth of stuff to several corresponding nodes in a
single call, or in several calls, roughly one call for each node pair.

The working example I was carrying around in my mind was of a job that
had one thread per cpu, and that had explicitly used some numa policy
(mbind, mempolicy or cpusets) to place memory on the node local to its
cpu.

Earlier today on the lkml thread, Robin described how a typical MPI
job works.  It relies on some startup code running in each thread,
carefully touching each page that should be local to that cpu before
any other thread touches said page, and requires no particular memory
policy facility beyond first touch.

Seems to me that the memory migration requirements here are the same
as they were for the example I had in mind.  Each task has some
currently allocated memory pages in its address space that are on the
local node of that task, and that memory must stay local to that task
after the migration.

Looking at such an MPI job as a whole, there seem to be pages
scattered across several nodes, where the only place that placement is
'encoded' is in the job startup code that first touched each page.  A
successful migration must replicate that memory placement, page for
page, changing only the nodes.

From that perspective, it makes sense to think of it as an array of
old nodes and a corresponding array of new nodes, where each page on
an old node is to be migrated to the corresponding new node.  However,
since each thread allocated its memory locally, this factors into N
separate migrations, each of one task, one old node, and one new node.
Such a call doesn't migrate all physical pages in the target task's
memory, just the pages that are on the specified old node.

The one thing not trivially covered by such a one task, one node pair
at a time factoring is memory placed on a node that is remote from
every task which maps that memory.  Let me call this 'remote
placement.'  Offhand, I don't know why anyone would do this.
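To make that factoring concrete, here is a rough sketch of the user
side, purely illustrative and untested.  The single node pair call is
still just a proposal, so the syscall number below is a placeholder,
and treating the range (0, ~0UL) as "the whole address space" is an
assumption of mine, not part of the proposal:

    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define __NR_page_migrate -1   /* placeholder - no number assigned yet */

    /* Proposed single node pair form; today this just fails with ENOSYS. */
    static int page_migrate(pid_t pid, unsigned long va_start,
                            unsigned long va_end, int old_node, int new_node)
    {
            return syscall(__NR_page_migrate, pid, va_start, va_end,
                           old_node, new_node);
    }

    /*
     * One call per (task, old node, new node) triple covers the N thread
     * MPI style job above.  The rare remote placement is handled by
     * appending one more triple, naming the remote old and new nodes.
     */
    static int migrate_job(const pid_t *pids, const int *old_nodes,
                           const int *new_nodes, int n)
    {
            int i;

            for (i = 0; i < n; i++)
                    if (page_migrate(pids[i], 0, ~0UL,
                                     old_nodes[i], new_nodes[i]) < 0)
                            return -1;
            return 0;
    }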
If such remote placement were rare, the one task, one node pair at a
time factoring could still migrate it easily enough, so long as it
knew to do so and issued another system call, for the necessary task
and remote node pair (old and new).  If remote placement were used in
abundance, this factoring would become inefficient.  I don't
anticipate that it will be.

By the way, what happens if you're doing a move where the to and from
node sets overlap, and the kernel scans in the wrong order, and ends
up trying to put new pages onto a node that is in that overlap before
pulling the old pages off it, running out of memory on that node?
Perhaps the smarts to avoid that should be in user space ;).  With the
one task, one node pair at a time factored API they can be, because
user space controls the order in which memory is migrated, and can
avoid temporarily overloading the memory on any one node.

With this, I am now more convinced than I was earlier that passing a
single old node, new node pair, rather than arrays of old and new
nodes, is just as good (hardly any more system calls in actual usage).
And it is better in one regard: it avoids the risk that the kernel,
scanning in the wrong order during an overlapped migration, overloads
the memory on some node.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson 1.650.933.1373, 1.925.600.0401
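P.S.  A sketch of those user space ordering smarts, again purely
illustrative and untested: run the (old, new) node pairs in an order
such that any node which is both a destination and a source is
drained before it is filled.

    /*
     * Reorder the pairs in place so that no pair fills a node that a
     * later pair still needs to drain.  O(n^2) greedy search; plenty
     * for a handful of nodes.
     */
    static int safe_order(int *old_nodes, int *new_nodes, int n)
    {
            int i, j, k, t;

            for (i = 0; i < n; i++) {
                    /* find a pair whose destination no one still drains */
                    for (j = i; j < n; j++) {
                            for (k = i; k < n; k++)
                                    if (k != j && old_nodes[k] == new_nodes[j])
                                            break;
                            if (k == n)
                                    break;  /* pair j is safe to run next */
                    }
                    if (j == n)
                            return -1;      /* cycle - no safe order exists */

                    t = old_nodes[i]; old_nodes[i] = old_nodes[j]; old_nodes[j] = t;
                    t = new_nodes[i]; new_nodes[i] = new_nodes[j]; new_nodes[j] = t;
            }
            return 0;
    }

A cycle (say, two nodes swapping their contents) has no safe order at
all, which is exactly the case that needs the one page at a time
fallback mentioned above.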