[PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview
@ 2005-05-11  4:37 Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch Ray Bryant
                   ` (7 more replies)
  0 siblings, 8 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:37 UTC (permalink / raw)
  To: Hirokazu Takahashi, Andi Kleen, Dave Hansen, Marcelo Tosatti
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

Summary
-------

This set of patches is a more or less complete implementation
of the manual page migration facility that I proposed in February
and that was discussed on the linux-mm mailing list.  This overview
is relatively short since the overview is effectively unchanged
from what I submitted on April 6, 2005.  For details, see the
overview I sent out then at:

http://marc.theaimsgroup.com/?l=linux-mm&m=111276123522952&w=2

This patch set differs from the previous patchset in the following:

(1)  The previous patch was based on 2.6.12-rc1-mhp3, this patchset
     is based on patch-2.6.12-rc3-mhp1-pm.gz from www.sr71.net/patches/
     2.6.12 (of the Memory Hotplug project patchset maintained by
     Dave Hansen).

(2)  This patchset has been divided into a few more, smaller
     patches for easier digestion.

(3)  This patch set includes a permissions check, whereas the
     previous patch set did not.  The permissions check is
     effectively the following:  A process can issue the migrate_pages()
     system call against another process if the either (a) the
     first process has the permission to send a non-trivial signal
     to the second process or if (b) the calling process has the
     capability CAP_SYS_ADMIN.  See the last patch of this set
     (8/8) titled: sys_migrate_pages-permissions-check.patch
     for further details.

(4)  This patch set includes a patch (again due to Nathan Scott
     of SGI) that provides a migrate_page method for XFS.
     This patch is necessary to provide acceptable performance
     for migrations involving mapped files in XFS.  EXT2 and 
     EXT3 already have migrate_page methods.

(5)  Changes suggested by Dave Hansen have been incorporated
     into this patch set.

If this patch is acceptable to the Memory Hotplug Team, I'd like
to see it added to the page migration sequence of patches in
the memory hotplug patch.  (Hirokazu, in particular, could you
review the patches?  Thanks.)

As always, suggestions, flames, etc should be directed to me
at raybry@sgi.com.

Unresolved issues
-----------------

(1)  This version of migrate_pages() works reliably only when the
     process to be migrated has been stopped (e. g., using SIGSTOP)
     before the migrate_pages() system call is executed.  I am
     working on eliminating that restriction, however, at the
     present time the system call does not work reliably without
     the process first being stopped.  (The system doesn't crash
     or oops, but sometimes the process being migrated will be
     "Killed by VM" when it starts up again.  There may be a few
     messages put into the log as well at that time.)

     Of course, I could check to make sure the target process is
     stopped in the migrate_pages() system call, but there is no
     good way to make sure that the process remains stopped for
     the duration of the system call (at least that I am aware
     of) so that is a partial fix, at best.

(2)  I'm still using system call #1279.  On ia64 this is the
     last slot in the system call table.  A system call number
     needs to be assigned to migrate_pages().

(3)  This patch changes adds a parameter to try_to_migrate_pages().
     For other calls to try_to_migrate_pages() in the memory hotplug
     patch, the additional parameter may be passed in as NULL and
     the existing behavior will occur.  Later, I'll send out a
     patch updates the rest of the memory hotplug patch in accordance
     with this change.

(4)  system.migration extended attribute support should also be
     provided for other file systems.  This can be done incrementally.

(5)  As part of the discussion with Andi Kleen, we agreed to
     provide some memory migration support under MPOL_MF_STRICT.
     Currently, if one calls mbind() with the flag MPOL_MF_STRICT
     set, and pages are found that don't follow the memory policy,
     then the mbind() will return -EIO.  Andi would like to be
     able cause those pages to be migrated to the correct nodes.
     This feature is not yet part of this patchset and will
     probably be added as a distinct set of patches.

Description of the patches in this patchset
-------------------------------------------

Recall that all of these patches apply to 2.6.12-rc3 with the
following patch applied first:

http://sr71.net/patches/2.6.12/2.6.12-rc3-mhp1/patch-2.6.12-rc3-mhp1.gz

In addition, to get things to compile (for Altix), I applied:

http://sr71.net/patches/2.6.12/2.6.12-rc3-mhp1/broken-out/fudge-patch-2.patch

and manually added CONFIG_NEED_MULTIPLE_NODES=y to my .config file.

Patch 1: xfs_extended_attributes-rc2.patch
	 This patch, due to Nathan Scott at SGI, adds support to
	 XFS for the system.migration extended attribute.

Patch 2: xfs-migrate-page-rc2.patch
	 This patch, also due to Nathan Scott, provides a migrate_
	 page method for XFS.  EXT2 and EXT3 already have such methods.

Patch 3: add-node_map-arg-to-try_to_migrate_pages-rc2.patch
         This patch adds an additional argument to try_to_migrate_pages().
	 The additional argument controls where pages found on specific
	 nodes in the page_list passed into try_to_migrate_pages() are
	 migrated to.

Patch 4: add-sys_migrate_pages-rc2.patch
	 This is the patch that adds the migrate_pages() system call.
	 This patch provides is a simple version of the system call that
	 migrates all pages associated with a particular process, so
	 is really only useful for programs that are statically linked
	 (i. e. that don't map in any shared libraries).

Patch 5: sys_migrate_pages-xattr-support-rc2.patch
         This patch queries the system.migration extended attribute
	 associated with each mapped file.  The result of this query
	 is used to control which pages are migrated (e. g. if the
	 extended attribute has the value "libr", then it is assumed
	 the mapped file is a shared library, and shared pages are
	 not migrated.)

Patch 6: sys_migrate_pages-mempolicy-migration-rc2.patch
         This patch updates the memory policy data structures
	 as they are encountered in accordance with the migration
	 request.

Patch 7: sys_migrate_pages-cpuset-support.patch
         This patch makes migrate_pages() cooperate better with
	 cpusets.

Patch 8: sys_migrate_pages-permissions-check.patch
         This is the permissions check discussed earlier.

Background
----------

The purpose of this set of patches is to introduce the necessary kernel
infrastructure to support "manual page migration".  That phrase is
intended to describe a facility whereby some user program (most likely
a batch scheduler) is given the responsibility of managing where jobs
run on a large NUMA system.  If it turns out that a job needs to be
run on a different set of nodes from where it is running now, then that
user program would invoke this facility to move the job to the new set
of nodes.

We use the word "manual" here to indicate that the facility is invoked
in a way that the kernel is told where to move things; we distinguish
this approach from "automatic page migration" facilities which have been
proposed in the past.  To us, "automatic page migration" implies using
hardware counters to determine where pages should reside and having the
O/S automatically move misplaced pages.  The utility of such facilities,
for example, on IRIX has, been mixed, and we are not currently proposing
such a facility for Linux.

The normal sequence of events would be as follows:  A job is running
on, say nodes 5-8, and a higher priority job arrives and the only place
it can be run, for whatever reason, is nodes 5-8.  Then the scheduler
would suspend the processes of the existing job (by, for example sending
them a SIGSTOP) and start the new job on those nodes.  At some point in
the future, other nodes become available for use, and at this point the
batch scheduler would invoke the manual page migration facility to move
the processes of the suspended job from nodes 5-8 to the new set of nodes.

Note that not all of the pages of all of the processes will need to (or
should) be moved.  For example, pages of shared libraries are likely to be
shared by many processes in the system; these pages should not be moved
merely because a few processes using these libraries have been migrated.
As discussed above, we use the extended attribute system.migration with
value "lib" to identify such files.  If a shared library file does not
have this attribute set, or the shared library is stored in a file system
that does not support extended attributes (e. g. NFS), then the entire
shared library will be migrated.

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
@ 2005-05-11  4:38 ` Ray Bryant
  2005-05-11  7:15   ` Christoph Hellwig
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 2/8] mm: manual page migration-rc2 -- xfs-migrate-page-rc2.patch Ray Bryant
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:38 UTC (permalink / raw)
  To: Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen, Dave Hansen
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

This patch is from Nathan Scott of SGI and adds the extended
attribute system.migration for xfs.  At the moment, there is
no protection checking being done here (according to Nathan),
so this would have to be added if we finally agree to go this
way. 

Signed-off-by: Ray Bryant <raybry@sgi.com>

 xfs_attr.c |   65 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 xfs_attr.h |    4 ---
 2 files changed, 62 insertions(+), 7 deletions(-)

Index: linux/fs/xfs/xfs_attr.c
===================================================================
--- linux/fs/xfs.orig/xfs_attr.c
+++ linux/fs/xfs/xfs_attr.c
@@ -2398,7 +2398,53 @@
 	return xfs_acl_vhasacl_default(vp);
 }
 
-struct attrnames posix_acl_access = {
+#define _MIGRATE	"SGI_NUMA_MIGRATE"
+
+STATIC int
+numa_migration_set(
+	vnode_t	*vp, char *name, void *data, size_t size, int xflags)
+{
+	int error;
+
+	VOP_ATTR_SET(vp, _MIGRATE, data, size, ATTR_ROOT, sys_cred, error);
+	return -error;
+}
+
+STATIC int
+numa_migration_get(
+	vnode_t *vp, char *name, void *data, size_t size, int xflags)
+{
+	int error, flags = ATTR_ROOT;
+
+	if (!size)
+		flags |= ATTR_KERNOVAL;
+	VOP_ATTR_GET(vp, _MIGRATE, data, &size, flags, sys_cred, error);
+	if (!error)
+		return size;
+	return -error;
+}
+
+STATIC int
+numa_migration_remove(
+	struct vnode *vp, char *name, int xflags)
+{
+	int error;
+
+	VOP_ATTR_REMOVE(vp, _MIGRATE, ATTR_ROOT, sys_cred, error);
+	return (error == ENOATTR) ? 0 : -error;
+}
+
+STATIC int
+numa_migration_exists(
+	vnode_t *vp)
+{
+	int	error, len, flags = ATTR_ROOT|ATTR_KERNOVAL;
+
+	VOP_ATTR_GET(vp, _MIGRATE, NULL, &len, flags, sys_cred, error);
+	return (error == 0);
+}
+
+STATIC struct attrnames posix_acl_access = {
 	.attr_name	= "posix_acl_access",
 	.attr_namelen	= sizeof("posix_acl_access") - 1,
 	.attr_get	= posix_acl_access_get,
@@ -2407,7 +2453,7 @@
 	.attr_exists	= posix_acl_access_exists,
 };
 
-struct attrnames posix_acl_default = {
+STATIC struct attrnames posix_acl_default = {
 	.attr_name	= "posix_acl_default",
 	.attr_namelen	= sizeof("posix_acl_default") - 1,
 	.attr_get	= posix_acl_default_get,
@@ -2416,8 +2462,19 @@
 	.attr_exists	= posix_acl_default_exists,
 };
 
-struct attrnames *attr_system_names[] =
-	{ &posix_acl_access, &posix_acl_default };
+STATIC struct attrnames numa_migration = {
+	.attr_name	= "migration",
+	.attr_namelen	= sizeof("migration") - 1,
+	.attr_get	= numa_migration_get,
+	.attr_set	= numa_migration_set,
+	.attr_remove	= numa_migration_remove,
+	.attr_exists	= numa_migration_exists,
+};
+
+struct attrnames *attr_system_names[] = {
+	&posix_acl_access, &posix_acl_default,
+	&numa_migration,
+};
 
 
 /*========================================================================
Index: linux/fs/xfs/xfs_attr.h
===================================================================
--- linux/fs/xfs.orig/xfs_attr.h
+++ linux/fs/xfs/xfs_attr.h
@@ -76,9 +76,7 @@
 extern struct attrnames attr_trusted;
 extern struct attrnames *attr_namespaces[ATTR_NAMECOUNT];
 
-#define ATTR_SYSCOUNT	2
-extern struct attrnames posix_acl_access;
-extern struct attrnames posix_acl_default;
+#define ATTR_SYSCOUNT	3
 extern struct attrnames *attr_system_names[ATTR_SYSCOUNT];
 
 extern attrnames_t *attr_lookup_namespace(char *, attrnames_t **, int);


-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch Ray Bryant
@ 2005-05-11  7:15   ` Christoph Hellwig
  2005-05-11 12:10     ` [Lhms-devel] " Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2005-05-11  7:15 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen, Dave Hansen,
	Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel

On Tue, May 10, 2005 at 09:38:02PM -0700, Ray Bryant wrote:
> This patch is from Nathan Scott of SGI and adds the extended
> attribute system.migration for xfs.  At the moment, there is
> no protection checking being done here (according to Nathan),
> so this would have to be added if we finally agree to go this
> way. 

As Nathan and I told you this is not acceptable at all.  Imigration policies
don't belong into filesystem metadata.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11  7:15   ` Christoph Hellwig
@ 2005-05-11 12:10     ` Ray Bryant
  2005-05-11 12:59       ` Andi Kleen
  2005-05-11 19:50       ` Christoph Hellwig
  0 siblings, 2 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11 12:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel

Christoph Hellwig wrote:
> On Tue, May 10, 2005 at 09:38:02PM -0700, Ray Bryant wrote:
> 
>>This patch is from Nathan Scott of SGI and adds the extended
>>attribute system.migration for xfs.  At the moment, there is
>>no protection checking being done here (according to Nathan),
>>so this would have to be added if we finally agree to go this
>>way. 
> 
> 
> As Nathan and I told you this is not acceptable at all.  Imigration policies
> don't belong into filesystem metadata.
> 
> 
> 

Is the issue here the use of extended attributes, period, or the use of the
system name space?  If it is the latter, we could readily convert to using
a name in the user name space instead and that would eliminate the need
for this xfs patch, which is certainly desirable and would eliminate the
need for other fs patches of a similar sort.  Of course, there is no way
to guarentee that this name is not already used (or will be used in the
future) in the user name space for a different purpose.  But we can
probably live with that.

I would observe that after lengthy discussion on this topic in February
with Andi, use of extended attributes was agreed upon as the preferred
solution.  Your alternative (As I recall: modifying the dynamic loader
to mark mapped files in memory as shared libraries, and requiring a new
mmap() flag to mark files as non-migratable) strikes me as more
complicated and even harder to get accepted by the community, since it
touches not only the kernel, but glibc as well.

But, we do such things by consensus and I am willing to try to implement
whatever convention we all agree on.  I would like to have an agreement
from all parties before I proceed with an alternative implementation.
I will pursue the ld.so changes with the glibc-developers and see what
the reaction is.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 12:10     ` [Lhms-devel] " Ray Bryant
@ 2005-05-11 12:59       ` Andi Kleen
  2005-05-11 18:43         ` Ray Bryant
  2005-05-11 19:50       ` Christoph Hellwig
  1 sibling, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2005-05-11 12:59 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Christoph Hellwig, Ray Bryant, Hirokazu Takahashi,
	Marcelo Tosatti, Andi Kleen, Dave Hansen, linux-mm, Nathan Scott,
	Ray Bryant, lhms-devel

> But, we do such things by consensus and I am willing to try to implement
> whatever convention we all agree on.  I would like to have an agreement
> from all parties before I proceed with an alternative implementation.
> I will pursue the ld.so changes with the glibc-developers and see what
> the reaction is.


I think Christoph's reaction mostly comes from trying to do this
in file system specific code. Rather it should be some independent
piece of code that just uses the EA interfaces offered by the FS
to do this.

-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 12:59       ` Andi Kleen
@ 2005-05-11 18:43         ` Ray Bryant
  2005-05-11 19:32           ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-11 18:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Ray Bryant, Hirokazu Takahashi,
	Marcelo Tosatti, Dave Hansen, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Jes Sorensen

Andi Kleen wrote:
>>But, we do such things by consensus and I am willing to try to implement
>>whatever convention we all agree on.  I would like to have an agreement
>>from all parties before I proceed with an alternative implementation.
>>I will pursue the ld.so changes with the glibc-developers and see what
>>the reaction is.
> 
> 
> 
> I think Christoph's reaction mostly comes from trying to do this
> in file system specific code. Rather it should be some independent
> piece of code that just uses the EA interfaces offered by the FS
> to do this.
> 
> -Andi
> 

If we are going to use a "system" attribute, as near as I can tell, this
requires a change in the file system specific code.  If we use a "user"
attribute, then no fs change is required.  However, in the latter case
there is also no way to reserve a name that users can't overwrite or usurp.

However, I think that a "user" attribute might be workable.  For most
files that we would be marking this way (e. g. /lib and /usr/lib), a
non-root user can't change the user attributes anyway, since normal
protection rules apply.  For mapped files in other places, the chances
of a collision on the user.migration attribute are sufficiently small,
I would think, that we could live with that.  (A user would have to use
the same name and the same values that the kernel is looking for to
have an effect.)

The only remaining issue is the use of a "user" attribute to communicate
with the kernel.  That makes me uneasy as I don't know if this would
follow the normal conventions for extended attribute usage.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 18:43         ` Ray Bryant
@ 2005-05-11 19:32           ` Andi Kleen
  2005-05-11 20:00             ` Christoph Hellwig
  2005-05-12 10:45             ` Christoph Hellwig
  0 siblings, 2 replies; 46+ messages in thread
From: Andi Kleen @ 2005-05-11 19:32 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Andi Kleen, Christoph Hellwig, Ray Bryant, Hirokazu Takahashi,
	Marcelo Tosatti, Dave Hansen, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Jes Sorensen

> If we are going to use a "system" attribute, as near as I can tell, this
> requires a change in the file system specific code.  If we use a "user"
> attribute, then no fs change is required.  However, in the latter case
> there is also no way to reserve a name that users can't overwrite or usurp.

A minor change for that is probably ok, as long as the actual logic
who uses this is generic. 

hch: if you still are against this please reread the original thread
with me and Ray and see why we decided that ld.so changes are not
a good idea.

-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 19:32           ` Andi Kleen
@ 2005-05-11 20:00             ` Christoph Hellwig
  2005-05-11 22:04               ` Ray Bryant
  2005-05-12 10:45             ` Christoph Hellwig
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2005-05-11 20:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel,
	Jes Sorensen

On Wed, May 11, 2005 at 09:32:07PM +0200, Andi Kleen wrote:
> hch: if you still are against this please reread the original thread
> with me and Ray and see why we decided that ld.so changes are not
> a good idea.

please send a pointer to the discussion.

Not that I think it matters a lot.  What Ray implemented is a very
special cased hack for migration policies that only applies to shared
libraries.  Doing it generically is about the same amount of code and
a lot cleaner.

Note that I'm not against storing information in the file so that
shared libraries get the proper treatment, but the proper place for
that is an additional ELF header or magic section, similar to the
noexec stack changes.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 20:00             ` Christoph Hellwig
@ 2005-05-11 22:04               ` Ray Bryant
  0 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11 22:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andi Kleen, Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel,
	Jes Sorensen

Christoph Hellwig wrote:
> On Wed, May 11, 2005 at 09:32:07PM +0200, Andi Kleen wrote:
> 
>>hch: if you still are against this please reread the original thread
>>with me and Ray and see why we decided that ld.so changes are not
>>a good idea.
> 
> 
> please send a pointer to the discussion.
> 

The thread starts here:

http://marc.theaimsgroup.com/?l=linux-mm&m=110817907931126&w=2

> Not that I think it matters a lot.  What Ray implemented is a very
> special cased hack for migration policies that only applies to shared
> libraries.  Doing it generically is about the same amount of code and
> a lot cleaner.
> 
> Note that I'm not against storing information in the file so that
> shared libraries get the proper treatment, but the proper place for
> that is an additional ELF header or magic section, similar to the
> noexec stack changes.
> 

Can you help me get the glibc developers to buy into the necessary
changes to ld.so?  (Isn't that where such changes would end up being
made?)

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 19:32           ` Andi Kleen
  2005-05-11 20:00             ` Christoph Hellwig
@ 2005-05-12 10:45             ` Christoph Hellwig
  2005-05-17  4:22               ` Ray Bryant
  2005-05-20 22:26               ` Ray Bryant
  1 sibling, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2005-05-12 10:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, Christoph Hellwig, Ray Bryant, Hirokazu Takahashi,
	Marcelo Tosatti, Dave Hansen, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Jes Sorensen

On Wed, May 11, 2005 at 09:32:07PM +0200, Andi Kleen wrote:
> A minor change for that is probably ok, as long as the actual logic
> who uses this is generic. 
> 
> hch: if you still are against this please reread the original thread
> with me and Ray and see why we decided that ld.so changes are not
> a good idea.

So reading through the thread I think using mempolicies to mark shared
libraries is better than the mmap flag I proposed.  I still don't think
xattrs interpreted by the kernel is a good way to store them.  Setting
up libraries is the job of the dynamic linker, and reading pre-defined
memory policies from an ELF header fits the approach we do for related
things.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-12 10:45             ` Christoph Hellwig
@ 2005-05-17  4:22               ` Ray Bryant
  2005-05-18  6:20                 ` Paul Jackson
  2005-05-20 22:26               ` Ray Bryant
  1 sibling, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-17  4:22 UTC (permalink / raw)
  To: Christoph Hellwig, Andi Kleen
  Cc: Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti, Dave Hansen,
	linux-mm, Nathan Scott, Ray Bryant, lhms-devel, Jes Sorensen

Christoph Hellwig wrote:
> On Wed, May 11, 2005 at 09:32:07PM +0200, Andi Kleen wrote:
> 
>>A minor change for that is probably ok, as long as the actual logic
>>who uses this is generic. 
>>
>>hch: if you still are against this please reread the original thread
>>with me and Ray and see why we decided that ld.so changes are not
>>a good idea.
> 
> 
> So reading through the thread I think using mempolicies to mark shared
> libraries is better than the mmap flag I proposed.  I still don't think
> xattrs interpreted by the kernel is a good way to store them.  Setting
> up libraries is the job of the dynamic linker, and reading pre-defined
> memory policies from an ELF header fits the approach we do for related
> things.
> 

Andi and hch,

OK, I've been off chasing down what the possibilities are in that area.
I'm also looking at Steve Longerbeam's patches to see if that will help
us out here.

However, I've come across a minor issue that has complicated my thinking
on this:  If one were to use madvise() or mbind() to apply the migration
policy flags (e. g. the three policies we basically need are:  migrate, 
migrate_shared, and migrated_none, used for normal files, libraries,
and shared binaries, respectively) then when madvise() (let us say)
is called, it isn't good enough to mark the vma that the address and
length point to, it's necessary to reach down to a common subobject,
(such as the file struct, address space struct, or inode) and mark
that.

If the vma is all that is marked, then when migrate_pages() is called
and as a result some other address space than the current one is examined,
it won't see the flags.

(Remember that the migrate_pages() system call takes a pid, a count,
and a list of old and new node so that this process is allowed to
migrate that process over there, which is what the batch manager needs
to do.  Running madvise() in the current process's address space doesn't
help much unless it marks something deeper in the address space hierarchy
than a vma.)

This is something quite a bit different than what madvise() or mbind()
do today.  (They just manipulate vma's AFAIK.)

Does that observation change y'all's thinking on this in any way?

> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by Oracle Space Sweepstakes
> Want to be the first software developer in space?
> Enter now for the Oracle Space Sweepstakes!
> http://ads.osdn.com/?ad_id=7393&alloc_id=16281&op=click
> _______________________________________________
> Lhms-devel mailing list
> Lhms-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/lhms-devel
> 

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-17  4:22               ` Ray Bryant
@ 2005-05-18  6:20                 ` Paul Jackson
  2005-05-18 14:49                   ` Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Paul Jackson @ 2005-05-18  6:20 UTC (permalink / raw)
  To: Ray Bryant
  Cc: hch, ak, raybry, taka, marcelo.tosatti, haveblue, linux-mm,
	nathans, raybry, lhms-devel, jes

Ray wrote:
> If one were to use madvise() or mbind() to apply the migration
> policy flags ... then ... it's necessary to reach down to a common subobject,
> (such as the file struct, address space struct, or inode) and mark
> that.

As I wrote Ray offline earlier today (before noticing this thread), I
suggested considering adding another flag to fcntl(), not madvise/mbind,
since fcntl() is sometimes used to make changes to the underlying
in-core inode.

My rough idea was to have a S_SKIPMIGRATE flag in the in-core
inode->i_flag's, defaulting to off, which could be set and gotten via
fcntl on any file descriptor open on that inode.  The value set would
persist so long as some file held that dentry->inode open, or until
changed again.  Just before invoking the call to migrate a task, user
code could examine each named file mapped into the task, and by opening
each said file and performing the appropriate fcntl(), mark whether or
not pages from that mapped file should be migrated.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-18  6:20                 ` Paul Jackson
@ 2005-05-18 14:49                   ` Ray Bryant
  0 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-18 14:49 UTC (permalink / raw)
  To: Paul Jackson
  Cc: hch, ak, raybry, taka, marcelo.tosatti, haveblue, linux-mm,
	nathans, raybry, lhms-devel, jes

Paul Jackson wrote:
> Ray wrote:
> 
>>If one were to use madvise() or mbind() to apply the migration
>>policy flags ... then ... it's necessary to reach down to a common subobject,
>>(such as the file struct, address space struct, or inode) and mark
>>that.
> 
> 
> As I wrote Ray offline earlier today (before noticing this thread), I
> suggested considering adding another flag to fcntl(), not madvise/mbind,
> since fcntl() is sometimes used to make changes to the underlying
> in-core inode.
> 
> My rough idea was to have a S_SKIPMIGRATE flag in the in-core
> inode->i_flag's, defaulting to off, which could be set and gotten via
> fcntl on any file descriptor open on that inode.  The value set would
> persist so long as some file held that dentry->inode open, or until
> changed again.  Just before invoking the call to migrate a task, user
> code could examine each named file mapped into the task, and by opening
> each said file and performing the appropriate fcntl(), mark whether or
> not pages from that mapped file should be migrated.
> 

Actually, we need two flags:  S_SKIPMIGRATE, and S_MIGRATE_NON_SHARED.
I personally prefer using mbind() to set these attributes in the
address space object that maps a particular virtual address range,

I guess I don't see the advantage of invoking file system operations
to describe how memory should be dealt with in a migration operation.
So I'd prefer to keep all of the operations in the memory space, hence
mbind().

This makes sense to me since the migration operation is a policy issue,
and mbind() is in the business of setting memory policy.
-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-12 10:45             ` Christoph Hellwig
  2005-05-17  4:22               ` Ray Bryant
@ 2005-05-20 22:26               ` Ray Bryant
  2005-05-23 17:50                 ` Steve Longerbeam
  1 sibling, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-20 22:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andi Kleen, Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel,
	Jes Sorensen, Steve Longerbeam

Christoph Hellwig wrote:
> On Wed, May 11, 2005 at 09:32:07PM +0200, Andi Kleen wrote:
> 
>>A minor change for that is probably ok, as long as the actual logic
>>who uses this is generic. 
>>
>>hch: if you still are against this please reread the original thread
>>with me and Ray and see why we decided that ld.so changes are not
>>a good idea.
> 
> 
> So reading through the thread I think using mempolicies to mark shared
> libraries is better than the mmap flag I proposed.  I still don't think
> xattrs interpreted by the kernel is a good way to store them.  Setting
> up libraries is the job of the dynamic linker, and reading pre-defined
> memory policies from an ELF header fits the approach we do for related
> things.
> 
> 
> 

Christoph and Andi,

OK, here are the alternatives I have figured out, I'd appreciate feedback
on which of these would be acceptable.  (In each case, the migration
attributes being set are either:  MIGRATE_NONE to indicate that nothing
in this mapped file should be migrated, or MIGRATE_NS to indicate that
the non-shared pages should be migrated, this is the normal setting for
shared library files.  And, since madvise() is mostly about I/O related
things, I'm assuming here that I extend mbind() to set the migration
attributes.):

(1)  Use mbind() to set "shallow" vm attributes.  (I use shallow
versus deep here to indicate whether or not other processes that map
the same object can see the attributes -- this basically also maps
to whether we put the attributes in the vma [shallow] or in the
memory_object [deep].)

In the shallow case, mbind() has to be called in
each address space in order to properly set the migration flags the
same way in each address space that maps a shared object.  So, we
basically have to call mbind() from ld.so.

As far as I am concerned this is a fundamental show stopper, since we
without broad glibc support, we will never get the changes
into ld.so for just Altix and page migration.  It also doesn't handle
the case of shared, mapped r/o data files.   We can leverage Steve
Longerbeam's work here, but he also doesn't have a time frame as to
when his ld.so changes might be accepted by the glibc developers.

It does allow one to mark anonymous memory with migration policy.
However, any use of that  I've been able to think of (e. g. marking
some anonymous pages as MIGRATE_NONE and then calling migrat_pages())
could equally well be handled by mbind(.., MPOL_MF_STRICT | MPOL_MF_MOVE) 
(MPOL_MF_MOVE is in Steve Longerbeams patch and says to move the pages
that don't match the memory policy -- we plan to hook this up to the migration
code at some point in the future.)

(2)  Use mbind() to set "deep" vm attributes.  There appear to be
two places where the deep attributes could be set:  in the
address space object vma->file->f_mapping or in the inode
vma->file->f_mapping->host.  Some upper order bits of address_space.
flags could be used, but there appear to be concurrency issues
there.  Bits in inode.i_flags also appear to be available.

The advantage of setting "deep" vm attributes is that this interface
could be used by ld.so, but in advance of getting the changes
accepted there, we could also set the deep attributes in a migration
library before calling migrate_pages().  (deep attrbutes are be
seen from any address space that maps the object.)  Then when ld.so
changes are in, we can reduce the work done by the migration library.

(3)  The problem with (2) is that to set a deep attribute, one has
to do 4 system calls: open, mmap, mbind, munmap.  If we add the
migration attributes to fcntl() [such as Paul Jackson has suggested],
then it we could set them directly in the inode with one system call.
Perhaps not a big deal, but something to think about.  It's also
simpler, easier to maintain code.

(4)  Then there is the original, extended attribute approach.  I'm
including this one last time just to observe that:
      (i)  This correctly handles regular data (non-elf) files.
     (ii)  If one wants to migrate just a portion of anonymous
           memory, one could still use mbind(...MPOL_MF_STRICT | MPOL_MF_MOVE)
    (iii)  How to set the migration policy is based on how a shared file
           is mapped in multiple address spaces.  It is not so much
           a characterstic of an individual address space's usage of
           the file.  So, it seems natural to associate these with
           the file and not the particular instance in one address space
           (that is alternative (1)).
       If using a system attribute is too much change to fs code,
then let's use a user attribute here.  It's not perfect, but it is
doable, and doesn't require any fs changes.  (We'll just not support
migration policy in file systems that don't have extended attributes.)

In short, as near as I can tell, alternative (1) really doesn't do
what we want, and is the hardest to implement and get into a production
kernel.  I still like (4) best, but I can live with (2) or (3).
Both (2) and (3) have interim approaches that can be made to work
until Steve Longerbeam's stuff makes it into ld.so, at which point
I can easily merge my required changes in with his.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-20 22:26               ` Ray Bryant
@ 2005-05-23 17:50                 ` Steve Longerbeam
  2005-05-24  4:53                   ` Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Steve Longerbeam @ 2005-05-23 17:50 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Christoph Hellwig, Andi Kleen, Ray Bryant, Hirokazu Takahashi,
	Marcelo Tosatti, Dave Hansen, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Jes Sorensen, Steve Longerbeam


Ray Bryant wrote:

> Christoph Hellwig wrote:
>
>> On Wed, May 11, 2005 at 09:32:07PM +0200, Andi Kleen wrote:
>>
>>> A minor change for that is probably ok, as long as the actual logic
>>> who uses this is generic.
>>> hch: if you still are against this please reread the original thread
>>> with me and Ray and see why we decided that ld.so changes are not
>>> a good idea.
>>
>>
>>
>> So reading through the thread I think using mempolicies to mark shared
>> libraries is better than the mmap flag I proposed.  I still don't think
>> xattrs interpreted by the kernel is a good way to store them.  Setting
>> up libraries is the job of the dynamic linker, and reading pre-defined
>> memory policies from an ELF header fits the approach we do for related
>> things.
>>
>>
>>
>
> Christoph and Andi,
>
> OK, here are the alternatives I have figured out, I'd appreciate feedback
> on which of these would be acceptable.  (In each case, the migration
> attributes being set are either:  MIGRATE_NONE to indicate that nothing
> in this mapped file should be migrated, or MIGRATE_NS to indicate that
> the non-shared pages should be migrated, this is the normal setting for
> shared library files.  And, since madvise() is mostly about I/O related
> things, I'm assuming here that I extend mbind() to set the migration
> attributes.):
>
> (1)  Use mbind() to set "shallow" vm attributes.  (I use shallow
> versus deep here to indicate whether or not other processes that map
> the same object can see the attributes -- this basically also maps
> to whether we put the attributes in the vma [shallow] or in the
> memory_object [deep].)
>
> In the shallow case, mbind() has to be called in
> each address space in order to properly set the migration flags the
> same way in each address space that maps a shared object.  So, we
> basically have to call mbind() from ld.so.

>
> As far as I am concerned this is a fundamental show stopper, since we
> without broad glibc support, we will never get the changes
> into ld.so for just Altix and page migration.  It also doesn't handle
> the case of shared, mapped r/o data files.   We can leverage Steve
> Longerbeam's work here, but he also doesn't have a time frame as to
> when his ld.so changes might be accepted by the glibc developers.
>
> It does allow one to mark anonymous memory with migration policy.
> However, any use of that  I've been able to think of (e. g. marking
> some anonymous pages as MIGRATE_NONE and then calling migrat_pages())
> could equally well be handled by mbind(.., MPOL_MF_STRICT | 
> MPOL_MF_MOVE) (MPOL_MF_MOVE is in Steve Longerbeams patch and says to 
> move the pages
> that don't match the memory policy -- we plan to hook this up to the 
> migration
> code at some point in the future.)


right, but I should add that mbind(MPOL_MF_MOVE) will also attempt to
migrate shared pages (in the page cache), not just anonymous pages. It uses
invalidate_mapping_pages() in mm/truncate.c to do this. As this routine
states, it will not remove pages from the cache which are dirty, locked, 
under
writeback or mapped into pagetables, so the shared page migration is
rudimentary. It doesn't actually migrate the shared page, it just frees it
from the pagecache, which is fine because later a page fault will 
re-allocate
a new page for the cache using the new mempolicy, which completes the
migration.

Also, invalidate_mapping_pages() should always succeed when used
with the patch to load_elf_binary() and ld.so that search for and apply
mempolicies from a special ELF .note.numapolicy PT_NOTE in executables
and shared libraries (which Ray mentioned above). That's because the
policy will be applied before any process has ever touched the shared
page (which means that, if a page was located in the cache for that
address_space and that page offset, it must be left over from an earlier
regular file I/O on the object, and not from a mapping).

I agree that we should hook up SGI's migration code with MPOL_MF_MOVE.
The only additional feature I see that mbind() would need to complete SGI's
migration API is the ability to migrate pages for other process spaces 
besides
the calling process.



>
> (2)  Use mbind() to set "deep" vm attributes.  There appear to be
> two places where the deep attributes could be set:  in the
> address space object vma->file->f_mapping or in the inode
> vma->file->f_mapping->host.  Some upper order bits of address_space.
> flags could be used, but there appear to be concurrency issues
> there.  Bits in inode.i_flags also appear to be available.
>
> The advantage of setting "deep" vm attributes is that this interface
> could be used by ld.so, but in advance of getting the changes
> accepted there, we could also set the deep attributes in a migration
> library before calling migrate_pages().  (deep attrbutes are be
> seen from any address space that maps the object.)  Then when ld.so
> changes are in, we can reduce the work done by the migration library.


it seems to me that, since you are marking a _shared_ object with
migration attributes that all processes that map that object need
to see, it makes sense to make the attributes "deep", ie. put them
in address_space. It should not go in a private object (vma) that
only one process can see. So (1) above just isn't the correct way
to go.

>
> (3)  The problem with (2) is that to set a deep attribute, one has
> to do 4 system calls: open, mmap, mbind, munmap.  If we add the
> migration attributes to fcntl() [such as Paul Jackson has suggested],
> then it we could set them directly in the inode with one system call.
> Perhaps not a big deal, but something to think about.  It's also
> simpler, easier to maintain code.
>
> (4)  Then there is the original, extended attribute approach.  I'm
> including this one last time just to observe that:
>      (i)  This correctly handles regular data (non-elf) files.
>     (ii)  If one wants to migrate just a portion of anonymous
>           memory, one could still use mbind(...MPOL_MF_STRICT | 
> MPOL_MF_MOVE)
>    (iii)  How to set the migration policy is based on how a shared file
>           is mapped in multiple address spaces.  It is not so much
>           a characterstic of an individual address space's usage of
>           the file.  So, it seems natural to associate these with
>           the file and not the particular instance in one address space
>           (that is alternative (1)).
>       If using a system attribute is too much change to fs code,
> then let's use a user attribute here.  It's not perfect, but it is
> doable, and doesn't require any fs changes.  (We'll just not support
> migration policy in file systems that don't have extended attributes.)
>
> In short, as near as I can tell, alternative (1) really doesn't do
> what we want, and is the hardest to implement and get into a production
> kernel.  I still like (4) best, but I can live with (2) or (3).
> Both (2) and (3) have interim approaches that can be made to work
> until Steve Longerbeam's stuff makes it into ld.so, at which point
> I can easily merge my required changes in with his.


I like (2), and I agree that (1) is out.

We're still going to try getting out glibc patch accepted. The glibc
maintainers are very reluctant to add new interfaces that are not
backed by any widely-used standards (such as POSIX, or in this case,
the Solaris and SVR4 ELF documentation). However our patch doesn't
add a new libc interface - it's a new ELF note, which is parsed and
dealt with internally by ld.so without changing or adding any libc APIs.

I have a question about the migration attributes. Are these attributes
needed because your migration code is not _capable_ of migrating
shared pages? Or is it that you just want to selectively choose which
shared object memory should and should not be migrated?

Steve
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-23 17:50                 ` Steve Longerbeam
@ 2005-05-24  4:53                   ` Ray Bryant
  2005-05-24 20:59                     ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-24  4:53 UTC (permalink / raw)
  To: Steve Longerbeam
  Cc: Christoph Hellwig, Andi Kleen, Ray Bryant, Hirokazu Takahashi,
	Marcelo Tosatti, Dave Hansen, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Jes Sorensen

Steve Longerbeam wrote:
> 
> 
>
> 
> I have a question about the migration attributes. Are these attributes
> needed because your migration code is not _capable_ of migrating
> shared pages? Or is it that you just want to selectively choose which
> shared object memory should and should not be migrated?
> 
> Steve
> 

Hi Steve,

The reason the migration attributes are required is that from inside the
kernel, whilst looking at VMA's in the migration code, all mapped files
look alike.  There is no way to tell the difference (AFAIK) between:

(1)  A regular mapped file that the user mapped in and is shared
      among the processes that are being migrated.
(2)  A mapped file that maps a shared library.
(3)  A mapped file that maps a shared executable (e. g. /bin/bash).

We need to take a different migration action based on which case we
are in:

(1)  Migrate all of the pages.
(2)  Migrate the non-shared pages.
(3)  Migrate none of the pages.

So we need some external way for the kernel to be told which kind of
mapped file this is.  That is why we need some kind of interface for
the user (or admininistrator) to tell us how to classify each shared
mapped file.

(Obviously, we could make some assumptions about file names and catch
a lot of these, but then you would need a configerable interfac to
the kernel to control those names, and that seems like a problem.)

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-24  4:53                   ` Ray Bryant
@ 2005-05-24 20:59                     ` Christoph Lameter
  2005-05-24 21:04                       ` Martin J. Bligh
  2005-05-25  6:42                       ` Ray Bryant
  0 siblings, 2 replies; 46+ messages in thread
From: Christoph Lameter @ 2005-05-24 20:59 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Steve Longerbeam, Christoph Hellwig, Andi Kleen, Ray Bryant,
	Hirokazu Takahashi, Marcelo Tosatti, Dave Hansen, linux-mm,
	Nathan Scott, Ray Bryant, lhms-devel, Jes Sorensen

On Mon, 23 May 2005, Ray Bryant wrote:

> We need to take a different migration action based on which case we
> are in:
> 
> (1)  Migrate all of the pages.
> (2)  Migrate the non-shared pages.
> (3)  Migrate none of the pages.
> 
> So we need some external way for the kernel to be told which kind of
> mapped file this is.  That is why we need some kind of interface for
> the user (or admininistrator) to tell us how to classify each shared
> mapped file.

Sorry I am a bit late to the party and I know you must have said this
before but what is the reason again not to use the page reference count to 
determine if a page is shared? Maybe its possible to live with some 
restrictions that the use of the page reference count brings.

Seems that touching a filesystem and the ELF headers is way off from the 
vm.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-24 20:59                     ` Christoph Lameter
@ 2005-05-24 21:04                       ` Martin J. Bligh
  2005-05-25  6:42                       ` Ray Bryant
  1 sibling, 0 replies; 46+ messages in thread
From: Martin J. Bligh @ 2005-05-24 21:04 UTC (permalink / raw)
  To: Christoph Lameter, Ray Bryant
  Cc: Steve Longerbeam, Christoph Hellwig, Andi Kleen, Ray Bryant,
	Hirokazu Takahashi, Marcelo Tosatti, Dave Hansen, linux-mm,
	Nathan Scott, Ray Bryant, lhms-devel, Jes Sorensen

--Christoph Lameter <christoph@lameter.com> wrote (on Tuesday, May 24, 2005 13:59:30 -0700):

> On Mon, 23 May 2005, Ray Bryant wrote:
> 
>> We need to take a different migration action based on which case we
>> are in:
>> 
>> (1)  Migrate all of the pages.
>> (2)  Migrate the non-shared pages.
>> (3)  Migrate none of the pages.
>> 
>> So we need some external way for the kernel to be told which kind of
>> mapped file this is.  That is why we need some kind of interface for
>> the user (or admininistrator) to tell us how to classify each shared
>> mapped file.
> 
> Sorry I am a bit late to the party and I know you must have said this
> before but what is the reason again not to use the page reference count to 
> determine if a page is shared? Maybe its possible to live with some 
> restrictions that the use of the page reference count brings.
> 
> Seems that touching a filesystem and the ELF headers is way off from the 
> vm.

I forget the context of this conversation, but in general if you want to
find out if the page is shared, you want mapcount, not the reference
count. Is so much easier now that we have that field ...

M.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-24 20:59                     ` Christoph Lameter
  2005-05-24 21:04                       ` Martin J. Bligh
@ 2005-05-25  6:42                       ` Ray Bryant
  2005-05-28  8:40                         ` Christoph Hellwig
  1 sibling, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-25  6:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steve Longerbeam, Christoph Hellwig, Andi Kleen, Ray Bryant,
	Hirokazu Takahashi, Marcelo Tosatti, Dave Hansen, linux-mm,
	Nathan Scott, Ray Bryant, lhms-devel, Jes Sorensen

Christoph Lameter wrote:
> On Mon, 23 May 2005, Ray Bryant wrote:
> 
> 
>>We need to take a different migration action based on which case we
>>are in:
>>
>>(1)  Migrate all of the pages.
>>(2)  Migrate the non-shared pages.
>>(3)  Migrate none of the pages.
>>
>>So we need some external way for the kernel to be told which kind of
>>mapped file this is.  That is why we need some kind of interface for
>>the user (or admininistrator) to tell us how to classify each shared
>>mapped file.
> 
> 
> Sorry I am a bit late to the party and I know you must have said this
> before but what is the reason again not to use the page reference count to 
> determine if a page is shared? Maybe its possible to live with some 
> restrictions that the use of the page reference count brings.
> 
> Seems that touching a filesystem and the ELF headers is way off from the 
> vm.
> 
> 
> 

Christoph,

I assume you are suggesting that we use the page_mapcount() to detect
pages that shared and then not migrating those pages?  The problem with
that is what do you do about the case where a user program mmap()'s a data
file, then forks a bunch of times, and then we need to migrate that job.
The data file belongs only to the processes that we are migrating, but
it will have a page_mapcount() equal to the number of processes.   So
it should be migrated.

Now a workable solution might be that we decide to not migrate shared
pages that are executable (or part of a vma marked executable).  That
would handle the shared library and shared (system) executable case
quite nicely.  It wouldn't handle the case of a shared user executable
that is only used by the processes being migrated, since it will be
shared an executable, but should be migrated, and we will decide by
the above rule not to migrate it.

Executables are small potatoes in most cases, however, the real issue
is to get the data pages to be migrated.

Another irritating corner case is that of r/o shared data files mapped
in from /usr/lib (National language support does this).  We could say
that if the vma is read only and the page shared, we would not migrate
it either.  User mapped files will typically be read/write, I would
tbink.

Combining all of this with some fixed heuristics (i. e. files in
/bin and /usr/bin are shared executables) might be workable.  It
always bothers me, in those cases though, to imagine having such
a fixed list of directories encoded in the kernel and not configurable.
I suppose one could have a /proc file where one put such a directory/
file list, but that seems messy.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-25  6:42                       ` Ray Bryant
@ 2005-05-28  8:40                         ` Christoph Hellwig
  2005-05-28 16:12                           ` Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2005-05-28  8:40 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Christoph Lameter, Steve Longerbeam, Christoph Hellwig,
	Andi Kleen, Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel,
	Jes Sorensen

On Wed, May 25, 2005 at 01:42:37AM -0500, Ray Bryant wrote:
> Now a workable solution might be that we decide to not migrate shared
> pages that are executable (or part of a vma marked executable).  That
> would handle the shared library and shared (system) executable case
> quite nicely.  It wouldn't handle the case of a shared user executable
> that is only used by the processes being migrated, since it will be
> shared an executable, but should be migrated, and we will decide by
> the above rule not to migrate it.

I don't think that's a good idea.  It would place arbitrary policy into
the kernel, something we try to avoid.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-28  8:40                         ` Christoph Hellwig
@ 2005-05-28 16:12                           ` Ray Bryant
  0 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-28 16:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Lameter, Steve Longerbeam, Andi Kleen, Ray Bryant,
	Hirokazu Takahashi, Marcelo Tosatti, Dave Hansen, linux-mm,
	Nathan Scott, Ray Bryant, lhms-devel, Jes Sorensen

Christoph Hellwig wrote:
> On Wed, May 25, 2005 at 01:42:37AM -0500, Ray Bryant wrote:
> 
>>Now a workable solution might be that we decide to not migrate shared
>>pages that are executable (or part of a vma marked executable).  That
>>would handle the shared library and shared (system) executable case
>>quite nicely.  It wouldn't handle the case of a shared user executable
>>that is only used by the processes being migrated, since it will be
>>shared an executable, but should be migrated, and we will decide by
>>the above rule not to migrate it.
> 
> 
> I don't think that's a good idea.  It would place arbitrary policy into
> the kernel, something we try to avoid.
> 
> 
> 

In general, I agree, but I would argue this is less of a policy decision than
it is a performance optimization.  That's because part of the fix is to also
allow user space to override the migration decision using mbind() to explictly
set the migration attributes of each memory object.  So, we could make
the user do that for each object (no kernel "policy" at all), or we could fix
it so that the kernel does the right thing in most cases, and then user space
only has to use the mbind() system call to fix up the cases that the
kernel doesn't get right.

In the current instantiation of this, that means that user space only has
to use the mbind() call to fix the migration attributes for:

(1)  The user executable.
(2)  User data files mapped read-only.

All of the other vma's will be handled correctly.  (It is expected that the
user space migration library will deal with these cases.)

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 12:10     ` [Lhms-devel] " Ray Bryant
  2005-05-11 12:59       ` Andi Kleen
@ 2005-05-11 19:50       ` Christoph Hellwig
  2005-05-11 21:30         ` Ray Bryant
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2005-05-11 19:50 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Christoph Hellwig, Ray Bryant, Hirokazu Takahashi,
	Marcelo Tosatti, Andi Kleen, Dave Hansen, linux-mm, Nathan Scott,
	Ray Bryant, lhms-devel

On Wed, May 11, 2005 at 07:10:56AM -0500, Ray Bryant wrote:
> Is the issue here the use of extended attributes, period, or the use of the
> system name space?

The former.

> I would observe that after lengthy discussion on this topic in February
> with Andi, use of extended attributes was agreed upon as the preferred
> solution.  Your alternative (As I recall: modifying the dynamic loader
> to mark mapped files in memory as shared libraries, and requiring a new
> mmap() flag to mark files as non-migratable) strikes me as more
> complicated and even harder to get accepted by the community, since it
> touches not only the kernel, but glibc as well.

But it's the right thing to do.  Non-migratability is not an attribute
of a file but a memory region.  Being able to set it for individual
mappings and possible even modifying it with a new MADVISE subcall
makes sense.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 19:50       ` Christoph Hellwig
@ 2005-05-11 21:30         ` Ray Bryant
  2005-05-12  9:55           ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-11 21:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel,
	Jes Sorensen

Christoph Hellwig wrote:

> 
> But it's the right thing to do.  Non-migratability is not an attribute
> of a file but a memory region.  Being able to set it for individual
> mappings and possible even modifying it with a new MADVISE subcall
> makes sense.
> 
> 

I guess we have a different world view on this.  It seems to me that
migratability is a long term property of the file itself (and how it
is commonly used) rather than a short term property (i. e. how the
file is used this particular time it got mapped in).

It seems to me the system administrator needs the ability to specify
that certain files, based on long term usage patterns in the system,
should be treated as migratable libraries or non-migratable files.
(It may be the case that certain shared libraries are so infrequently
used that they can be migrated with a process, I suppose.)

So I think the natural place to put this information is in the file
system.  That doesn't mean that a new MADVISE() call isn't useful,
it's just that I don't want to have to make this call every time
the file is mapped.

Hiding this call in ld-linux.so would probably be ok provided we can
get the glibc developers to buy off on such a change.  I'd much
prefer to contain the changes in one open source project rather than
two.  :-)  Similarly, having to create a new command to mark files
as migratable/not, rather than using the existing setfattr/getfattr
commands makes the whole memory migration facility that much harder
to get accepted into the system and to use.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-11 21:30         ` Ray Bryant
@ 2005-05-12  9:55           ` Christoph Hellwig
  2005-05-12 15:47             ` Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2005-05-12  9:55 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel,
	Jes Sorensen

On Wed, May 11, 2005 at 04:30:55PM -0500, Ray Bryant wrote:
> I guess we have a different world view on this.  It seems to me that
> migratability is a long term property of the file itself (and how it
> is commonly used) rather than a short term property (i. e. how the
> file is used this particular time it got mapped in).

When you talk about files you're already in the special casing business.
Only few vmas are file-backed and it makes lots of sense to mark an
anonymous vma non-migratable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch
  2005-05-12  9:55           ` Christoph Hellwig
@ 2005-05-12 15:47             ` Ray Bryant
  0 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-12 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel,
	Jes Sorensen

Christoph Hellwig wrote:

> 
> When you talk about files you're already in the special casing business.
> Only few vmas are file-backed and it makes lots of sense to mark an
> anonymous vma non-migratable.
> 
> 

I disagree.  Here's a couple of random cases:

(1)  /bin/bash.  /proc/pid maps shows it has 39 vmas.  30 of them are file
      backed.

(2)  blastwaves (a sample CFD code).  /proc/pid maps shows 49 vmas.  33 of
      them are file backed.

So, it seems to me that most vmas you encounter (by count) are mapped files.
On the other hand, based on size, most pages would be mapped by anonymous vmas 
with the obvious exception being large mapped files.

In all the thinking we've done about page migration, we have never once
come across a case where anonymous vmas shouldn't be migrated.  Can you
describe to me an example where it would be useful to not migrate an
anonymous vma?

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2.6.12-rc3 2/8] mm: manual page migration-rc2 -- xfs-migrate-page-rc2.patch
  2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch Ray Bryant
@ 2005-05-11  4:38 ` Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 3/8] mm: manual page migration-rc2 -- add-node_map-arg-to-try_to_migrate_pages-rc2.patch Ray Bryant
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:38 UTC (permalink / raw)
  To: Hirokazu Takahashi, Dave Hansen, Marcelo Tosatti, Andi Kleen
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

Nathan Scott of SGI provided this patch for XFS that supports
the migrate_page method in the address_space operations vector.
It is basically the same as what is in ext2_migrate_page().
However, the routine "xfs_skip_migrate_page()" is added to
disallow migration of xfs metadata.

Signed-off-by: Ray Bryant <raybry@sgi.com>

 xfs_aops.c |   10 ++++++++++
 xfs_buf.c  |    7 +++++++
 2 files changed, 17 insertions(+)

Index: linux-2.6.12-rc3-mhp1-page-migration/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration.orig/fs/xfs/linux-2.6/xfs_aops.c	2005-04-20 17:03:16.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration/fs/xfs/linux-2.6/xfs_aops.c	2005-05-04 14:29:10.000000000 -0700
@@ -54,6 +54,7 @@
 #include "xfs_iomap.h"
 #include <linux/mpage.h>
 #include <linux/writeback.h>
+#include <linux/mmigrate.h>
 
 STATIC void xfs_count_page_state(struct page *, int *, int *, int *);
 STATIC void xfs_convert_page(struct inode *, struct page *, xfs_iomap_t *,
@@ -1262,6 +1263,14 @@ linvfs_prepare_write(
 	return block_prepare_write(page, from, to, linvfs_get_block);
 }
 
+STATIC int
+linvfs_migrate_page(
+	struct page		*from,
+	struct page		*to)
+{
+	return generic_migrate_page(from, to, migrate_page_buffer);
+}
+
 struct address_space_operations linvfs_aops = {
 	.readpage		= linvfs_readpage,
 	.readpages		= linvfs_readpages,
@@ -1272,4 +1281,5 @@ struct address_space_operations linvfs_a
 	.commit_write		= generic_commit_write,
 	.bmap			= linvfs_bmap,
 	.direct_IO		= linvfs_direct_IO,
+	.migrate_page		= linvfs_migrate_page,
 };
Index: linux-2.6.12-rc3-mhp1-page-migration/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration.orig/fs/xfs/linux-2.6/xfs_buf.c	2005-04-20 17:03:16.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration/fs/xfs/linux-2.6/xfs_buf.c	2005-05-04 14:29:10.000000000 -0700
@@ -1626,6 +1626,12 @@ xfs_setsize_buftarg(
 }
 
 STATIC int
+xfs_skip_migrate_page(struct page *from, struct page *to)
+{
+	return -EBUSY;
+}
+
+STATIC int
 xfs_mapping_buftarg(
 	xfs_buftarg_t		*btp,
 	struct block_device	*bdev)
@@ -1635,6 +1641,7 @@ xfs_mapping_buftarg(
 	struct address_space	*mapping;
 	static struct address_space_operations mapping_aops = {
 		.sync_page = block_sync_page,
+		.migrate_page = xfs_skip_migrate_page,
 	};
 
 	inode = new_inode(bdev->bd_inode->i_sb);

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2.6.12-rc3 3/8] mm: manual page migration-rc2 -- add-node_map-arg-to-try_to_migrate_pages-rc2.patch
  2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 2/8] mm: manual page migration-rc2 -- xfs-migrate-page-rc2.patch Ray Bryant
@ 2005-05-11  4:38 ` Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch Ray Bryant
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:38 UTC (permalink / raw)
  To: Hirokazu Takahashi, Andi Kleen, Dave Hansen, Marcelo Tosatti
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

This patch changes the interface to try_to_migrate_pages() so that one
can specify the nodes where the pages are to be migrated to.  This is
done by adding a "node_map" argument to try_to_migrate_pages(), node_map
is of type "short *".

If this argument is NULL, then try_to_migrate_pages() behaves exactly
as before and this is the interface the rest of the memory hotplug
patch should use.  (Note:  This patchset does not include the changes
for the rest of the memory hotplug patch that will be necessary to use
this new interface [if it is accepted].  Those chagnes will be provided
as a distinct patch.)

If the argument is non-NULL, the node_map points at an array of shorts
of size MAX_NUMNODES.   node_map[N] is either the id of an online node
or -1.  If node_map[N] >=0 then pages found in the page list passed
to try_to_migrate_pages() that are found on node N are migrated to node
node_map[N].  if node_map[N] == -1, then pages found on node N are left
where they are.

This change depends on previous changes to migrate_onepage()
that support migrating a page to a specified node.  These changes
are already part of the memory migration sub-patch of the memory
hotplug patch.

Signed-off-by:  Ray Bryant <raybry@sgi.com>

 include/linux/mmigrate.h |   11 ++++++++++-
 mm/mmigrate.c            |   10 ++++++----
 2 files changed, 16 insertions(+), 5 deletions(-)

Index: linux-2.6.12-rc1-mhp3-page-migration/include/linux/mmigrate.h
===================================================================
--- linux-2.6.12-rc1-mhp3-page-migration.orig/include/linux/mmigrate.h	2005-03-28 22:10:27.000000000 -0800
+++ linux-2.6.12-rc1-mhp3-page-migration/include/linux/mmigrate.h	2005-03-28 22:20:37.000000000 -0800
@@ -16,7 +16,16 @@ extern int migrate_page_buffer(struct pa
 extern int page_migratable(struct page *, struct page *, int,
 					struct list_head *);
 extern struct page * migrate_onepage(struct page *, int nodeid);
-extern int try_to_migrate_pages(struct list_head *);
+extern int try_to_migrate_pages(struct list_head *, short *);
+
+static inline struct page *node_migrate_onepage(struct page *page, short *node_map)
+{
+	if (node_map)
+		return migrate_onepage(page, node_map[page_to_nid(page)]);
+	else
+		return migrate_onepage(page, MIGRATE_NODE_ANY);
+
+}
 
 #else
 static inline int generic_migrate_page(struct page *page, struct page *newpage,
Index: linux-2.6.12-rc1-mhp3-page-migration/mm/mmigrate.c
===================================================================
--- linux-2.6.12-rc1-mhp3-page-migration.orig/mm/mmigrate.c	2005-03-28 22:10:25.000000000 -0800
+++ linux-2.6.12-rc1-mhp3-page-migration/mm/mmigrate.c	2005-03-28 22:20:18.000000000 -0800
@@ -501,9 +501,11 @@ out_unlock:
 /*
  * This is the main entry point to migrate pages in a specific region.
  * If a page is inactive, the page may be just released instead of
- * migration.
+ * migration.  node_map is supplied in those cases (on NUMA systems)
+ * where the caller wishes to specify to which nodes the pages are
+ * migrated.  If node_map is null, the target node is MIGRATE_NODE_ANY.
  */
-int try_to_migrate_pages(struct list_head *page_list)
+int try_to_migrate_pages(struct list_head *page_list, short *node_map)
 {
 	struct page *page, *page2, *newpage;
 	LIST_HEAD(pass1_list);
@@ -541,7 +543,7 @@ int try_to_migrate_pages(struct list_hea
 	list_for_each_entry_safe(page, page2, &pass1_list, lru) {
 		list_del(&page->lru);
 		if (PageLocked(page) || PageWriteback(page) ||
-		    IS_ERR(newpage = migrate_onepage(page, MIGRATE_NODE_ANY))) {
+		    IS_ERR(newpage = node_migrate_onepage(page, node_map))) {
 			if (page_count(page) == 1) {
 				/* the page is already unused */
 				putback_page_to_lru(page_zone(page), page);
@@ -559,7 +561,7 @@ int try_to_migrate_pages(struct list_hea
 	 */
 	list_for_each_entry_safe(page, page2, &pass2_list, lru) {
 		list_del(&page->lru);
-		if (IS_ERR(newpage = migrate_onepage(page, MIGRATE_NODE_ANY))) {
+		if (IS_ERR(newpage = node_migrate_onepage(page, node_map))) {
 			if (page_count(page) == 1) {
 				/* the page is already unused */
 				putback_page_to_lru(page_zone(page), page);

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
                   ` (2 preceding siblings ...)
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 3/8] mm: manual page migration-rc2 -- add-node_map-arg-to-try_to_migrate_pages-rc2.patch Ray Bryant
@ 2005-05-11  4:38 ` Ray Bryant
  2005-05-11  8:24   ` Christoph Hellwig
  2005-05-11 13:23   ` Hirokazu Takahashi
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 5/8] mm: manual page migration-rc2 -- sys_migrate_pages-xattr-support-rc2.patch Ray Bryant
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:38 UTC (permalink / raw)
  To: Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen, Dave Hansen
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

This is the main patch that creates the migrate_pages() system
call.  Note that in this case, the system call number was more
or less arbitrarily assigned at 1279.  This number needs to
allocated.

This patch sits on top of the page migration patches from
the Memory Hotplug project.  This particular patchset is built
on top of:

http://www.sr71.net/patches/2.6.12/2.6.12-rc3-mhp1/page_migration/patch-2.6.12-rc3-mhp1-pm.gz

but it may appy on subsequent page migration patches as well.

This patch migrates all pages in the specified process (including
shared libraries.)

See the patch sys_migrate_pages-xattr-support.patch where the
extended attribute "system.migration" is used to identify shared
libraries and non-migratable files.

Signed-off-by: Ray Bryant <raybry@sgi.com>

 arch/ia64/kernel/entry.S |    6 +
 mm/mmigrate.c            |  182 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 187 insertions(+), 1 deletion(-)

Index: linux-2.6.12-rc3-mhp1-page-migration-export/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/arch/ia64/kernel/entry.S	2005-04-20 17:03:12.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/arch/ia64/kernel/entry.S	2005-05-10 10:22:24.000000000 -0700
@@ -1582,6 +1582,10 @@ sys_call_table:
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
-	data8 sys_ni_syscall
+#ifdef CONFIG_MEMORY_MIGRATE
+	data8 sys_migrate_pages			// 1279
+#else
+	data8 sys_ni_syscall			// 1279
+#endif
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/mm/mmigrate.c	2005-05-10 10:22:24.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c	2005-05-10 10:40:35.000000000 -0700
@@ -5,6 +5,9 @@
  *
  *  Authors:	IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
  *		Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * sys_migrate_pages() added by Ray Bryant <raybry@sgi.com>
+ * Copyright (C) 2005, Silicon Graphics, Inc.
  */
 
 #include <linux/config.h>
@@ -21,6 +24,9 @@
 #include <linux/rmap.h>
 #include <linux/mmigrate.h>
 #include <linux/delay.h>
+#include <linux/blkdev.h>
+#include <linux/nodemask.h>
+#include <asm/bitops.h>
 
 /*
  * The concept of memory migration is to replace a target page with
@@ -587,6 +593,182 @@ int try_to_migrate_pages(struct list_hea
 	return nr_busy;
 }
 
+static int
+migrate_vma(struct task_struct *task, struct mm_struct *mm,
+	struct vm_area_struct *vma, short *node_map)
+{
+	struct page *page;
+	struct zone *zone;
+	unsigned long vaddr;
+	int count = 0, nid, pass = 0, nr_busy = 0;
+	LIST_HEAD(page_list);
+
+	/* can't migrate mlock()'d pages */
+	if (vma->vm_flags & VM_LOCKED)
+		return 0;
+
+	/*
+	 * gather all of the pages to be migrated from this vma into page_list
+	 */
+	spin_lock(&mm->page_table_lock);
+ 	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
+		page = follow_page(mm, vaddr, 0);
+		/*
+		 * follow_page has been known to return pages with zero mapcount
+		 * and NULL mapping.  Skip those pages as well
+		 */
+		if (page && page_mapcount(page)) {
+			nid = page_to_nid(page);
+			if (node_map[nid] >= 0) {
+				zone = page_zone(page);
+				if (PageLRU(page) &&
+				    steal_page_from_lru(zone, page, &page_list))
+					count++;
+				else
+					BUG();
+			}
+		}
+	}
+	spin_unlock(&mm->page_table_lock);
+
+retry:
+
+	/* call the page migration code to move the pages */
+	if (!list_empty(&page_list))
+		nr_busy = try_to_migrate_pages(&page_list, node_map);
+
+	if (nr_busy > 0) {
+		pass++;
+		if (pass > 10)
+			return -EAGAIN;
+		/* wait until some I/O completes and try again */
+		blk_congestion_wait(WRITE, HZ/10);
+		goto retry;
+	} else if (nr_busy < 0)
+		return nr_busy;
+
+	return count;
+}
+
+void lru_add_drain_per_cpu(void *info)
+{
+	lru_add_drain();
+}
+
+asmlinkage long
+sys_migrate_pages(const pid_t pid, const int count,
+	caddr_t old_nodes, caddr_t new_nodes)
+{
+	int i, ret = 0, migrated = 0;
+	short *tmp_old_nodes;
+	short *tmp_new_nodes;
+	short *node_map;
+	struct task_struct *task;
+	struct mm_struct *mm = 0;
+	size_t size = count * sizeof(tmp_old_nodes[0]);
+	struct vm_area_struct *vma;
+	nodemask_t nodes;
+
+	if ((count < 1) || (count > MAX_NUMNODES))
+		return -EINVAL;
+
+	tmp_old_nodes = kmalloc(size, GFP_KERNEL);
+	tmp_new_nodes = kmalloc(size, GFP_KERNEL);
+	node_map = kmalloc(MAX_NUMNODES*sizeof(node_map[0]), GFP_KERNEL);
+
+	if (!tmp_old_nodes || !tmp_new_nodes || !node_map) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (copy_from_user(tmp_old_nodes, old_nodes, size) ||
+	    copy_from_user(tmp_new_nodes, new_nodes, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* Make sure node arguments are within valid limits */
+	for(i = 0; i < count; i++)
+		if ((tmp_old_nodes[i] < 0) 		||
+		    (tmp_old_nodes[i] >= MAX_NUMNODES)  ||
+		    (tmp_new_nodes[i] < 0) 		||
+		    (tmp_new_nodes[i] >= MAX_NUMNODES)) {
+		    	ret = -EINVAL;
+			goto out;
+		}
+
+	/* disallow migration to an off-line node */
+	for(i = 0; i < count; i++)
+		if (!node_online(tmp_new_nodes[i])) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+	/*
+	 * old_nodes and new_nodes must be disjoint
+	 */
+	nodes_clear(nodes);
+	for(i=0; i<count; i++)
+		node_set(tmp_old_nodes[i], nodes);
+	for(i=0; i<count; i++)
+		if(node_isset(tmp_new_nodes[i], nodes)) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+	/* find the task and mm_structs for this process */
+	read_lock(&tasklist_lock);
+	task = find_task_by_pid(pid);
+	if (task) {
+		task_lock(task);
+		mm = task->mm;
+		if (mm)
+			atomic_inc(&mm->mm_users);
+		task_unlock(task);
+	} else {
+		ret = -ESRCH;
+		read_unlock(&tasklist_lock);
+		goto out;
+	}
+	read_unlock(&tasklist_lock);
+	if (!mm) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* set up the node_map array */
+	for(i = 0; i < MAX_NUMNODES; i++)
+		node_map[i] = -1;
+	for(i = 0; i < count; i++)
+		node_map[tmp_old_nodes[i]] = tmp_new_nodes[i];
+
+	/* prepare for lru list manipulation */
+  	smp_call_function(&lru_add_drain_per_cpu, NULL, 0, 1);
+	lru_add_drain();
+
+	/* actually do the migration */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		/* migrate the pages of this vma */
+		ret = migrate_vma(task, mm, vma, node_map);
+		if (ret >= 0)
+			migrated += ret;
+		else
+			goto out_dec;
+	}
+	ret = migrated;
+
+out_dec:
+	atomic_dec(&mm->mm_users);
+
+out:
+	kfree(tmp_old_nodes);
+	kfree(tmp_new_nodes);
+	kfree(node_map);
+
+	return ret;
+
+}
+
 EXPORT_SYMBOL(generic_migrate_page);
 EXPORT_SYMBOL(migrate_page_common);
 EXPORT_SYMBOL(migrate_page_buffer);

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch Ray Bryant
@ 2005-05-11  8:24   ` Christoph Hellwig
  2005-05-18 19:07     ` Ray Bryant
  2005-05-11 13:23   ` Hirokazu Takahashi
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2005-05-11  8:24 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen, Dave Hansen,
	Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel

> +#ifdef CONFIG_MEMORY_MIGRATE
> +	data8 sys_migrate_pages			// 1279
> +#else
> +	data8 sys_ni_syscall			// 1279
> +#endif

never ifdef syscall slots like this, use cond_syscall instead.

> +	if (nr_busy > 0) {
> +		pass++;
> +		if (pass > 10)
> +			return -EAGAIN;
> +		/* wait until some I/O completes and try again */
> +		blk_congestion_wait(WRITE, HZ/10);
> +		goto retry;

this is a layering violation.  How to wait is up to the implementor
of the address_space

> +asmlinkage long
> +sys_migrate_pages(const pid_t pid, const int count,
> +	caddr_t old_nodes, caddr_t new_nodes)

please avoid const quilifiers in syscall prototypes, they're rather
pointless anyway.  Also don't use caddr_t but rather pointers to what's
actually pointed to.  Here that would be shorts which is rather uncommon
in kernel interface.  Please use explicitly sized types (__u16 if you
want to keep it as-is or maybe __u32).

> +	struct mm_struct *mm = 0;

Please use NULL as the null pointer.  if you ran your code through sparse
it would have cought it.

> +	for(i = 0; i < count; i++)

please follow documented kernel style.

> +	atomic_dec(&mm->mm_users);

shouldn't this be an mmput()?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-11  8:24   ` Christoph Hellwig
@ 2005-05-18 19:07     ` Ray Bryant
  2005-05-28  9:14       ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-18 19:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen,
	Dave Hansen, linux-mm, Nathan Scott, Ray Bryant, lhms-devel

Christoph Hellwig wrote:

> 
>>+	if (nr_busy > 0) {
>>+		pass++;
>>+		if (pass > 10)
>>+			return -EAGAIN;
>>+		/* wait until some I/O completes and try again */
>>+		blk_congestion_wait(WRITE, HZ/10);
>>+		goto retry;
> 
> 
> this is a layering violation.  How to wait is up to the implementor
> of the address_space
> 

Christoph,

I've done the other changes you suggested, but am a little confused
by this one.  Is your suggestion that I should be calling:

vma->vm_file->f_mapping->a_ops->writepages()

(assuming this exists)

instead of doing the blk_congestion_wait()?  There is no "wait"
function defined in the aops vector as near as I can tell.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-18 19:07     ` Ray Bryant
@ 2005-05-28  9:14       ` Christoph Hellwig
  2005-05-28 15:53         ` Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2005-05-28  9:14 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Christoph Hellwig, Ray Bryant, Hirokazu Takahashi,
	Marcelo Tosatti, Andi Kleen, Dave Hansen, linux-mm, Nathan Scott,
	Ray Bryant, lhms-devel

On Wed, May 18, 2005 at 02:07:21PM -0500, Ray Bryant wrote:
> Christoph Hellwig wrote:
> 
> >
> >>+	if (nr_busy > 0) {
> >>+		pass++;
> >>+		if (pass > 10)
> >>+			return -EAGAIN;
> >>+		/* wait until some I/O completes and try again */
> >>+		blk_congestion_wait(WRITE, HZ/10);
> >>+		goto retry;
> >
> >
> >this is a layering violation.  How to wait is up to the implementor
> >of the address_space
> >
> 
> Christoph,
> 
> I've done the other changes you suggested, but am a little confused
> by this one.  Is your suggestion that I should be calling:
> 
> vma->vm_file->f_mapping->a_ops->writepages()
> 
> (assuming this exists)
> 
> instead of doing the blk_congestion_wait()?  There is no "wait"
> function defined in the aops vector as near as I can tell.

I looked over the code again and most of the migration code isn't added
in the patchkit but expected to exist already, thus I'm not sure what's
going on at all.

address_space_operations are the wrong abstraction here, you're operating
on VMAs, thus any vectoring should happen at the vm_operations_struct
level.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-28  9:14       ` Christoph Hellwig
@ 2005-05-28 15:53         ` Ray Bryant
  0 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-28 15:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ray Bryant, Ray Bryant, Hirokazu Takahashi, Marcelo Tosatti,
	Andi Kleen, Dave Hansen, linux-mm, Nathan Scott, lhms-devel

Christoph Hellwig wrote:

>
>I looked over the code again and most of the migration code isn't added
>in the patchkit but expected to exist already, thus I'm not sure what's
>going on at all.
>  
>
Yes, as discussed in the overview, the manual page migration code 
depends on the page migration
code from the memory hotplug patch.  The plan is to merge the manual 
page migration code into
the page migration subpatch of the memory hotplug code and then I will 
work on merging that
page migration code itself.

>address_space_operations are the wrong abstraction here, you're operating
>on VMAs, thus any vectoring should happen at the vm_operations_struct
>level.
>
>  
>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch Ray Bryant
  2005-05-11  8:24   ` Christoph Hellwig
@ 2005-05-11 13:23   ` Hirokazu Takahashi
  2005-05-11 13:26     ` Hirokazu Takahashi
  2005-05-11 14:06     ` Ray Bryant
  1 sibling, 2 replies; 46+ messages in thread
From: Hirokazu Takahashi @ 2005-05-11 13:23 UTC (permalink / raw)
  To: raybry
  Cc: marcelo.tosatti, ak, haveblue, hch, linux-mm, nathans, raybry,
	lhms-devel

Hi Ray,

> This is the main patch that creates the migrate_pages() system
> call.  Note that in this case, the system call number was more
> or less arbitrarily assigned at 1279.  This number needs to
> allocated.
> 
> This patch sits on top of the page migration patches from
> the Memory Hotplug project.  This particular patchset is built
> on top of:
> 
> http://www.sr71.net/patches/2.6.12/2.6.12-rc3-mhp1/page_migration/patch-2.6.12-rc3-mhp1-pm.gz

I found there exited a race condition between migrate_vma() and
the swap code. The following code may cause Oops if the swap code
takes the page from the LRU list before calling steal_page_from_lru().

migrate_vma()
{
               :
	if (PageLRU(page) &&
	    steal_page_from_lru(zone, page, &page_list))
		count++;
	else
		BUG();
               :
}

Ok, I should make steal_page_from_lru() check PageLRU(page) with
holding zone->lru_lock. Then migrate_vma() can just call
steal_page_from_lru().

static inline int
steal_page_from_lru(struct zone *zone, struct page *page)
{
        int ret = 0;
        spin_lock_irq(&zone->lru_lock);
	if (PageLRU(page))
                ret = __steal_page_from_lru(zone, page);
        spin_unlock_irq(&zone->lru_lock);
        return ret;
}

migrate_vma()
{
               :
	if (steal_page_from_lru(zone, page, &page_list)
		count++;
               :
}


BTW, I'm not sure whether it's enough that migrate_vma() can only
migrate currently mapped pages. This may leave some pages in the
page-cache if they're not mapped to the process address spaces yet.

Thanks,
Hirokazu Takahashi.


> Index: linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c
> ===================================================================
> --- linux-2.6.12-rc3-mhp1-page-migration-export.orig/mm/mmigrate.c	2005-05-10 10:22:24.000000000 -0700
> +++ linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c	2005-05-10 10:40:35.000000000 -0700
> @@ -5,6 +5,9 @@
>   *
>   *  Authors:	IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
>   *		Hirokazu Takahashi <taka@valinux.co.jp>
> + *
> + * sys_migrate_pages() added by Ray Bryant <raybry@sgi.com>
> + * Copyright (C) 2005, Silicon Graphics, Inc.
>   */
>  
>  #include <linux/config.h>
> @@ -21,6 +24,9 @@
>  #include <linux/rmap.h>
>  #include <linux/mmigrate.h>
>  #include <linux/delay.h>
> +#include <linux/blkdev.h>
> +#include <linux/nodemask.h>
> +#include <asm/bitops.h>
>  
>  /*
>   * The concept of memory migration is to replace a target page with
> @@ -587,6 +593,182 @@ int try_to_migrate_pages(struct list_hea
>  	return nr_busy;
>  }
>  
> +static int
> +migrate_vma(struct task_struct *task, struct mm_struct *mm,
> +	struct vm_area_struct *vma, short *node_map)
> +{
> +	struct page *page;
> +	struct zone *zone;
> +	unsigned long vaddr;
> +	int count = 0, nid, pass = 0, nr_busy = 0;
> +	LIST_HEAD(page_list);
> +
> +	/* can't migrate mlock()'d pages */
> +	if (vma->vm_flags & VM_LOCKED)
> +		return 0;
> +
> +	/*
> +	 * gather all of the pages to be migrated from this vma into page_list
> +	 */
> +	spin_lock(&mm->page_table_lock);
> + 	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
> +		page = follow_page(mm, vaddr, 0);
> +		/*
> +		 * follow_page has been known to return pages with zero mapcount
> +		 * and NULL mapping.  Skip those pages as well
> +		 */
> +		if (page && page_mapcount(page)) {
> +			nid = page_to_nid(page);
> +			if (node_map[nid] >= 0) {
> +				zone = page_zone(page);
> +				if (PageLRU(page) &&
> +				    steal_page_from_lru(zone, page, &page_list))
> +					count++;
> +				else
> +					BUG();
> +			}
> +		}
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +retry:
> +
> +	/* call the page migration code to move the pages */
> +	if (!list_empty(&page_list))
> +		nr_busy = try_to_migrate_pages(&page_list, node_map);
> +
> +	if (nr_busy > 0) {
> +		pass++;
> +		if (pass > 10)
> +			return -EAGAIN;
> +		/* wait until some I/O completes and try again */
> +		blk_congestion_wait(WRITE, HZ/10);
> +		goto retry;
> +	} else if (nr_busy < 0)
> +		return nr_busy;
> +
> +	return count;
> +}






--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-11 13:23   ` Hirokazu Takahashi
@ 2005-05-11 13:26     ` Hirokazu Takahashi
  2005-05-11 14:06     ` Ray Bryant
  1 sibling, 0 replies; 46+ messages in thread
From: Hirokazu Takahashi @ 2005-05-11 13:26 UTC (permalink / raw)
  To: raybry
  Cc: marcelo.tosatti, ak, haveblue, hch, linux-mm, nathans, raybry,
	lhms-devel

Hi Ray,

> > This is the main patch that creates the migrate_pages() system
> > call.  Note that in this case, the system call number was more
> > or less arbitrarily assigned at 1279.  This number needs to
> > allocated.
> > 
> > This patch sits on top of the page migration patches from
> > the Memory Hotplug project.  This particular patchset is built
> > on top of:
> > 
> > http://www.sr71.net/patches/2.6.12/2.6.12-rc3-mhp1/page_migration/patch-2.6.12-rc3-mhp1-pm.gz
> 
> I found there exited a race condition between migrate_vma() and
                ^^^^^^
                existed

> the swap code. The following code may cause Oops if the swap code
> takes the page from the LRU list before calling steal_page_from_lru().
> 
> migrate_vma()
> {
>                :
> 	if (PageLRU(page) &&
> 	    steal_page_from_lru(zone, page, &page_list))
> 		count++;
> 	else
> 		BUG();
>                :
> }
> 
> Ok, I should make steal_page_from_lru() check PageLRU(page) with
> holding zone->lru_lock. Then migrate_vma() can just call
> steal_page_from_lru().
> 
> static inline int
> steal_page_from_lru(struct zone *zone, struct page *page)
> {
>         int ret = 0;
>         spin_lock_irq(&zone->lru_lock);
> 	if (PageLRU(page))
>                 ret = __steal_page_from_lru(zone, page);
>         spin_unlock_irq(&zone->lru_lock);
>         return ret;
> }
> 
> migrate_vma()
> {
>                :
> 	if (steal_page_from_lru(zone, page, &page_list)
> 		count++;
>                :
> }
> 
> 
> BTW, I'm not sure whether it's enough that migrate_vma() can only
> migrate currently mapped pages. This may leave some pages in the
> page-cache if they're not mapped to the process address spaces yet.
> 
> Thanks,
> Hirokazu Takahashi.
> 
> 
> > Index: linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c
> > ===================================================================
> > --- linux-2.6.12-rc3-mhp1-page-migration-export.orig/mm/mmigrate.c	2005-05-10 10:22:24.000000000 -0700
> > +++ linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c	2005-05-10 10:40:35.000000000 -0700
> > @@ -5,6 +5,9 @@
> >   *
> >   *  Authors:	IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
> >   *		Hirokazu Takahashi <taka@valinux.co.jp>
> > + *
> > + * sys_migrate_pages() added by Ray Bryant <raybry@sgi.com>
> > + * Copyright (C) 2005, Silicon Graphics, Inc.
> >   */
> >  
> >  #include <linux/config.h>
> > @@ -21,6 +24,9 @@
> >  #include <linux/rmap.h>
> >  #include <linux/mmigrate.h>
> >  #include <linux/delay.h>
> > +#include <linux/blkdev.h>
> > +#include <linux/nodemask.h>
> > +#include <asm/bitops.h>
> >  
> >  /*
> >   * The concept of memory migration is to replace a target page with
> > @@ -587,6 +593,182 @@ int try_to_migrate_pages(struct list_hea
> >  	return nr_busy;
> >  }
> >  
> > +static int
> > +migrate_vma(struct task_struct *task, struct mm_struct *mm,
> > +	struct vm_area_struct *vma, short *node_map)
> > +{
> > +	struct page *page;
> > +	struct zone *zone;
> > +	unsigned long vaddr;
> > +	int count = 0, nid, pass = 0, nr_busy = 0;
> > +	LIST_HEAD(page_list);
> > +
> > +	/* can't migrate mlock()'d pages */
> > +	if (vma->vm_flags & VM_LOCKED)
> > +		return 0;
> > +
> > +	/*
> > +	 * gather all of the pages to be migrated from this vma into page_list
> > +	 */
> > +	spin_lock(&mm->page_table_lock);
> > + 	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
> > +		page = follow_page(mm, vaddr, 0);
> > +		/*
> > +		 * follow_page has been known to return pages with zero mapcount
> > +		 * and NULL mapping.  Skip those pages as well
> > +		 */
> > +		if (page && page_mapcount(page)) {
> > +			nid = page_to_nid(page);
> > +			if (node_map[nid] >= 0) {
> > +				zone = page_zone(page);
> > +				if (PageLRU(page) &&
> > +				    steal_page_from_lru(zone, page, &page_list))
> > +					count++;
> > +				else
> > +					BUG();
> > +			}
> > +		}
> > +	}
> > +	spin_unlock(&mm->page_table_lock);
> > +
> > +retry:
> > +
> > +	/* call the page migration code to move the pages */
> > +	if (!list_empty(&page_list))
> > +		nr_busy = try_to_migrate_pages(&page_list, node_map);
> > +
> > +	if (nr_busy > 0) {
> > +		pass++;
> > +		if (pass > 10)
> > +			return -EAGAIN;
> > +		/* wait until some I/O completes and try again */
> > +		blk_congestion_wait(WRITE, HZ/10);
> > +		goto retry;
> > +	} else if (nr_busy < 0)
> > +		return nr_busy;
> > +
> > +	return count;
> > +}
> 
> 
> 
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
> 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-11 13:23   ` Hirokazu Takahashi
  2005-05-11 13:26     ` Hirokazu Takahashi
@ 2005-05-11 14:06     ` Ray Bryant
  2005-05-12  6:41       ` Hirokazu Takahashi
  1 sibling, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-11 14:06 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: raybry, marcelo.tosatti, ak, haveblue, hch, linux-mm, nathans,
	raybry, lhms-devel

Hi Hirokazu,

Hirokazu Takahashi wrote:

<snip>

> 
> I found there exited a race condition between migrate_vma() and
> the swap code. The following code may cause Oops if the swap code
> takes the page from the LRU list before calling steal_page_from_lru().
> 
> migrate_vma()
> {
>                :
> 	if (PageLRU(page) &&
> 	    steal_page_from_lru(zone, page, &page_list))
> 		count++;
> 	else
> 		BUG();
>                :
> }

Ah, good point.  Perhaps this is cause of the race I am seeing.  Let me check.

I used to take the zone->lru_lock explicitly before __steal_page_from_lru()
but saw the other interface and switched to it (a little too quickly, I
now gather...)  Perhaps I should just go back to that.  That way there is
no chance of a race.

> 
> Ok, I should make steal_page_from_lru() check PageLRU(page) with
> holding zone->lru_lock. Then migrate_vma() can just call
> steal_page_from_lru().
> 
> static inline int
> steal_page_from_lru(struct zone *zone, struct page *page)
> {
>         int ret = 0;
>         spin_lock_irq(&zone->lru_lock);
> 	if (PageLRU(page))
>                 ret = __steal_page_from_lru(zone, page);
>         spin_unlock_irq(&zone->lru_lock);
>         return ret;
> }
> 
> migrate_vma()
> {
>                :
> 	if (steal_page_from_lru(zone, page, &page_list)
> 		count++;
>                :
> }
> 
> 
> BTW, I'm not sure whether it's enough that migrate_vma() can only
> migrate currently mapped pages. This may leave some pages in the
> page-cache if they're not mapped to the process address spaces yet.
> 
> Thanks,
> Hirokazu Takahashi.

If the page isn't mapped, there is no good way to match it up with
a particular process id, is there?   :-)

We've handled that separately in the actual migration application,
by sync'ing the system and  then freeing clean page cache pages
before the migrate_pages() system call is invoked.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-11 14:06     ` Ray Bryant
@ 2005-05-12  6:41       ` Hirokazu Takahashi
  2005-05-12 16:41         ` Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Hirokazu Takahashi @ 2005-05-12  6:41 UTC (permalink / raw)
  To: raybry
  Cc: raybry, marcelo.tosatti, ak, haveblue, hch, linux-mm, nathans,
	raybry, lhms-devel

Hi,

> > BTW, I'm not sure whether it's enough that migrate_vma() can only
> > migrate currently mapped pages. This may leave some pages in the
> > page-cache if they're not mapped to the process address spaces yet.
> > 
> > Thanks,
> > Hirokazu Takahashi.
> 
> If the page isn't mapped, there is no good way to match it up with
> a particular process id, is there?   :-)

I just thought of the page, belonging to some file which is
mmap()ed to the target process to be migrated. The page may
not be accessed and the associated PTE isn't set yet.
if vma->vm_file->f_mapping equals page_mapping(page), the page
should be migrated. 

Pages in the swap-cache have the same problem since the related
PTEs may be clean.

But these cases may be rare and your approach seems to be good
enough in most cases.

> We've handled that separately in the actual migration application,
> by sync'ing the system and  then freeing clean page cache pages
> before the migrate_pages() system call is invoked.
> 
> -- 
> Best Regards,
> Ray

Thanks,
Hirokazu Takahashi.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-12  6:41       ` Hirokazu Takahashi
@ 2005-05-12 16:41         ` Ray Bryant
  2005-05-12 23:50           ` Hirokazu Takahashi
  0 siblings, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-12 16:41 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: raybry, marcelo.tosatti, ak, haveblue, hch, linux-mm, nathans,
	raybry, lhms-devel

Hirokazu Takahashi wrote:

> 
> I just thought of the page, belonging to some file which is
> mmap()ed to the target process to be migrated. The page may
> not be accessed and the associated PTE isn't set yet.
> if vma->vm_file->f_mapping equals page_mapping(page), the page
> should be migrated. 
> 
> Pages in the swap-cache have the same problem since the related
> PTEs may be clean.
> 
> But these cases may be rare and your approach seems to be good
> enough in most cases.
> 
> 

Well, what could be done would be the following, I suppose:

If follow_page() returns NULL and the vma maps a file, we could
lookup the page in the radix tree, and if we find it, and if it
is on a node that we are migrating from, we could add the page
to the set of pages to be migrated.

The disadvantage of this is that we could do a LOT of radix
tree lookups and find relatively few pages.  (Our approach of
releasing free page cache pages first makes such pages just
"go away".  But we don't have "release free page cache pages"
in the mainline yet.  :-( )

Similarly, if we modified follow_page() (e. g. follow_page_ex())
to return the pte, check to see if it is a swap pte (!pte_none()
&& !pte_file()), if so then use pte_to_swap_entry() to get
the swap entry, and then use that to look up the page in the
swapper space radix tree.  Then handle it as above (hmmm...
will your migration code handle a page in the swap cache?)

Once again, this could lead to lots of lookups.  This is
especially a concern for a large multithreaded app, since the
address spaces are the same for each process id, hence we look
up the same info over and over in each page table scan.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-12 16:41         ` Ray Bryant
@ 2005-05-12 23:50           ` Hirokazu Takahashi
  2005-05-13  9:59             ` [Lhms-devel] " Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Hirokazu Takahashi @ 2005-05-12 23:50 UTC (permalink / raw)
  To: raybry
  Cc: raybry, marcelo.tosatti, ak, haveblue, hch, linux-mm, nathans,
	raybry, lhms-devel

Hi Ray,

> > I just thought of the page, belonging to some file which is
> > mmap()ed to the target process to be migrated. The page may
> > not be accessed and the associated PTE isn't set yet.
> > if vma->vm_file->f_mapping equals page_mapping(page), the page
> > should be migrated. 
> > 
> > Pages in the swap-cache have the same problem since the related
> > PTEs may be clean.
> > 
> > But these cases may be rare and your approach seems to be good
> > enough in most cases.
> > 
> > 
> 
> Well, what could be done would be the following, I suppose:
> 
> If follow_page() returns NULL and the vma maps a file, we could
> lookup the page in the radix tree, and if we find it, and if it
> is on a node that we are migrating from, we could add the page
> to the set of pages to be migrated.
> 
> The disadvantage of this is that we could do a LOT of radix
> tree lookups and find relatively few pages.  (Our approach of


How about find_get_pages() for whole mmap()'ed ranges?
With it, you may not need to call follow_page().


> releasing free page cache pages first makes such pages just
> "go away".  But we don't have "release free page cache pages"
> in the mainline yet.  :-( )
>
> Similarly, if we modified follow_page() (e. g. follow_page_ex())
> to return the pte, check to see if it is a swap pte (!pte_none()
> && !pte_file()), if so then use pte_to_swap_entry() to get
> the swap entry, and then use that to look up the page in the
> swapper space radix tree.  Then handle it as above (hmmm...
> will your migration code handle a page in the swap cache?)


Sure. The code can also migrate pages in the swap cache.


> Once again, this could lead to lots of lookups.  This is
> especially a concern for a large multithreaded app, since the
> address spaces are the same for each process id, hence we look
> up the same info over and over in each page table scan.
> 
> -- 
> Best Regards,
> Ray

Thanks,
Hirokazu Takahashi.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch
  2005-05-12 23:50           ` Hirokazu Takahashi
@ 2005-05-13  9:59             ` Ray Bryant
  0 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-13  9:59 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: raybry, marcelo.tosatti, ak, haveblue, hch, linux-mm, nathans,
	raybry, lhms-devel

Hirokazu Takahashi wrote:
> Hi Ray,
> 
> 

>>
>>Well, what could be done would be the following, I suppose:
>>
>>If follow_page() returns NULL and the vma maps a file, we could
>>lookup the page in the radix tree, and if we find it, and if it
>>is on a node that we are migrating from, we could add the page
>>to the set of pages to be migrated.
>>
>>The disadvantage of this is that we could do a LOT of radix
>>tree lookups and find relatively few pages.  (Our approach of
> 
> 
> 
> How about find_get_pages() for whole mmap()'ed ranges?
> With it, you may not need to call follow_page().
> 

No, you need to call follow_page() to get, for example, the
read-write pages of a shared library that are process private.

For the moment, I would propose punting on this issue.  I don't
think there will be many pages in this category, and if they do
show up later, we have a couple of approaches to deal with them.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2.6.12-rc3 5/8] mm: manual page migration-rc2 -- sys_migrate_pages-xattr-support-rc2.patch
  2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
                   ` (3 preceding siblings ...)
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch Ray Bryant
@ 2005-05-11  4:38 ` Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 6/8] mm: manual page migration-rc2 -- sys_migrate_pages-mempolicy-migration-rc2.patch Ray Bryant
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:38 UTC (permalink / raw)
  To: Hirokazu Takahashi, Dave Hansen, Marcelo Tosatti, Andi Kleen
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

This patch inspects the extended attribute "system.migration" of
each mapped file to determine which pages should be migrated, as
follows:

(1)  If the attribute is missing or does not have the value "libr"
     or "none", then all pages of the mapped file (on the eligible
     nodes) are migrated.

(2)  If the attribute has the value "none", then no pages of the
     mapped file are migrated.

(3)  If the attribute has the value "libr", then shared pages of
     the mapped file are not migrated.  Pages that have been
     modified by the process (and whose COW sharing with the library
     file have been broken) are migrated.  (These files have
     PageAnon() set.)

Signed-off-by: Ray Bryant <raybry@sgi.com>

 include/linux/mmigrate.h |    5 ++++
 mm/mmigrate.c            |   54 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

Index: linux-2.6.12-rc3-mhp1-page-migration-export/include/linux/mmigrate.h
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/include/linux/mmigrate.h	2005-05-10 10:59:46.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/include/linux/mmigrate.h	2005-05-10 11:09:07.000000000 -0700
@@ -6,6 +6,11 @@
 
 #define MIGRATE_NODE_ANY -1
 
+#define MIGRATION_XATTR_NAME		"system.migration"
+#define MIGRATION_XATTR_LIBRARY 	"libr"
+#define MIGRATION_XATTR_NOMIGRATE	"none"
+#define MIGRATION_XATTR_LENGTH		4
+
 #ifdef CONFIG_MEMORY_MIGRATE
 extern int generic_migrate_page(struct page *, struct page *,
 		int (*)(struct page *, struct page *, struct list_head *));
Index: linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/mm/mmigrate.c	2005-05-10 10:59:46.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c	2005-05-10 11:09:07.000000000 -0700
@@ -593,6 +593,31 @@ int try_to_migrate_pages(struct list_hea
 	return nr_busy;
 }
 
+static int get_migration_xattr(struct file *file, char *xattr)
+{
+	int rc;
+
+	if (!file->f_mapping->host->i_op->getxattr ||
+	    !file->f_dentry)
+		return 0;
+
+   	rc = file->f_mapping->host->i_op->getxattr(file->f_dentry,
+		MIGRATION_XATTR_NAME, xattr, MIGRATION_XATTR_LENGTH);
+
+	return rc;
+
+}
+
+static inline int is_migration_xattr_libr(char *x)
+{
+	return strncmp(x, MIGRATION_XATTR_LIBRARY, MIGRATION_XATTR_LENGTH) == 0;
+}
+
+static inline int is_migration_xattr_none(char *x)
+{
+	return strncmp(x, MIGRATION_XATTR_NOMIGRATE, MIGRATION_XATTR_LENGTH) == 0;
+}
+
 static int
 migrate_vma(struct task_struct *task, struct mm_struct *mm,
 	struct vm_area_struct *vma, short *node_map)
@@ -600,14 +625,39 @@ migrate_vma(struct task_struct *task, st
 	struct page *page;
 	struct zone *zone;
 	unsigned long vaddr;
-	int count = 0, nid, pass = 0, nr_busy = 0;
+	int count = 0, nid, pass = 0, nr_busy = 0, library, rc;
 	LIST_HEAD(page_list);
+	char xattr[MIGRATION_XATTR_LENGTH];
 
 	/* can't migrate mlock()'d pages */
 	if (vma->vm_flags & VM_LOCKED)
 		return 0;
 
 	/*
+	 * if the vma is an anon vma, it is migratable.
+	 * if the vma maps a file, then:
+	 *
+	 * system.migration     PageAnon(page)     Migrate?
+	 * ----------------     --------------     --------
+	 *  "none"                not checked          No
+	 * not present            not checked         Yes
+  	 *  "libr"                    0                No
+  	 *  "libr"                    1               Yes
+	 * any other value        not checked         Yes
+	 */
+
+	library = 0;
+	if (vma->vm_file) {
+	        rc = get_migration_xattr(vma->vm_file, xattr);
+		if (rc > 0) {
+			if (is_migration_xattr_none(xattr))
+				return 0;
+			if (is_migration_xattr_libr(xattr))
+				library = 1;
+		}
+	}
+
+	/*
 	 * gather all of the pages to be migrated from this vma into page_list
 	 */
 	spin_lock(&mm->page_table_lock);
@@ -618,6 +668,8 @@ migrate_vma(struct task_struct *task, st
 		 * and NULL mapping.  Skip those pages as well
 		 */
 		if (page && page_mapcount(page)) {
+			if (library && !PageAnon(page))
+				continue;
 			nid = page_to_nid(page);
 			if (node_map[nid] >= 0) {
 				zone = page_zone(page);

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2.6.12-rc3 6/8] mm: manual page migration-rc2 -- sys_migrate_pages-mempolicy-migration-rc2.patch
  2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
                   ` (4 preceding siblings ...)
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 5/8] mm: manual page migration-rc2 -- sys_migrate_pages-xattr-support-rc2.patch Ray Bryant
@ 2005-05-11  4:38 ` Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 7/8] mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch Ray Bryant
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 8/8] mm: manual page migration-rc2 -- sys_migrate_pages-permissions-check-rc2.patch Ray Bryant
  7 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:38 UTC (permalink / raw)
  To: Hirokazu Takahashi, Andi Kleen, Dave Hansen, Marcelo Tosatti
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

This patch adds code that translates the memory policy structures
as they are encountered to that they continue to represent where
memory should be allocated after the page migration has completed.

Signed-off-by: Ray Bryant <raybry@sgi.com>

 include/linux/mempolicy.h |    2 
 mm/mempolicy.c            |  114 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/mmigrate.c             |   10 ++++
 3 files changed, 126 insertions(+)

Index: linux-2.6.12-rc3-mhp1-page-migration-export/include/linux/mempolicy.h
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/include/linux/mempolicy.h	2005-05-10 11:17:55.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/include/linux/mempolicy.h	2005-05-10 11:18:24.000000000 -0700
@@ -152,6 +152,8 @@ struct mempolicy *mpol_shared_policy_loo
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
+extern int migrate_process_policy(struct task_struct *, short *);
+extern int migrate_vma_policy(struct vm_area_struct *, short *);
 
 #else
 
Index: linux-2.6.12-rc3-mhp1-page-migration-export/mm/mempolicy.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/mm/mempolicy.c	2005-05-10 11:17:55.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/mm/mempolicy.c	2005-05-10 11:18:24.000000000 -0700
@@ -1136,3 +1136,117 @@ void numa_default_policy(void)
 {
 	sys_set_mempolicy(MPOL_DEFAULT, NULL, 0);
 }
+
+/*
+ * update a node mask according to a migration request
+ */
+static void migrate_node_mask(unsigned long *new_node_mask,
+			      unsigned long *old_node_mask,
+			      short *nodemap)
+{
+	int i;
+
+	bitmap_zero(new_node_mask, MAX_NUMNODES);
+
+	i = find_first_bit(old_node_mask, MAX_NUMNODES);
+	while(i < MAX_NUMNODES) {
+		if (nodemap[i] >= 0)
+			set_bit(nodemap[i], new_node_mask);
+		else
+			set_bit(i, new_node_mask);
+		i = find_next_bit(old_node_mask, MAX_NUMNODES, i+1);
+	}
+}
+
+/*
+ * update a process or vma mempolicy according to a migration request
+ */
+static struct mempolicy *migrate_policy(struct mempolicy *old, short *nodemap)
+{
+	struct mempolicy *new;
+	DECLARE_BITMAP(old_nodes, MAX_NUMNODES);
+	DECLARE_BITMAP(new_nodes, MAX_NUMNODES);
+	struct zone *z;
+	int i;
+
+	new = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+	atomic_set(&new->refcnt, 0);
+	switch(old->policy) {
+	case MPOL_DEFAULT:
+		BUG();
+	case MPOL_INTERLEAVE:
+		migrate_node_mask(new->v.nodes, old->v.nodes, nodemap);
+		break;
+	case MPOL_PREFERRED:
+		if (old->v.preferred_node>=0 && (nodemap[old->v.preferred_node] >= 0))
+			new->v.preferred_node = nodemap[old->v.preferred_node];
+		else
+			new->v.preferred_node = old->v.preferred_node;
+		break;
+	case MPOL_BIND:
+		bitmap_zero(old_nodes, MAX_NUMNODES);
+		for (i = 0; (z = old->v.zonelist->zones[i]) != NULL; i++)
+			set_bit(z->zone_pgdat->node_id, old_nodes);
+		migrate_node_mask(new_nodes, old_nodes, nodemap);
+		new->v.zonelist = bind_zonelist(new_nodes);
+		if (!new->v.zonelist) {
+			kmem_cache_free(policy_cache, new);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+	new->policy = old->policy;
+	return new;
+}
+
+/*
+ * update a process mempolicy based on a migration request
+ */
+int migrate_process_policy(struct task_struct *task, short *nodemap)
+{
+	struct mempolicy *new, *old = task->mempolicy;
+	int tmp;
+
+	if ((!old) || (old->policy == MPOL_DEFAULT))
+		return 0;
+
+	new = migrate_policy(task->mempolicy, nodemap);
+	if (IS_ERR(new))
+		return (PTR_ERR(new));
+
+	mpol_get(new);
+	task->mempolicy = new;
+	mpol_free(old);
+
+	if (task->mempolicy->policy == MPOL_INTERLEAVE) {
+		/*
+		 * If the task is still running and allocating storage, this
+		 * is racy, but there is not much that can be done about it.
+		 */
+		tmp = task->il_next;
+		if (nodemap[tmp] >= 0)
+			task->il_next = nodemap[tmp];
+	}
+
+	return 0;
+
+}
+
+/*
+ * update a vma mempolicy based on a migration request
+ */
+int migrate_vma_policy(struct vm_area_struct *vma, short *nodemap)
+{
+
+	struct mempolicy *new;
+
+	if (!vma->vm_policy || vma->vm_policy->policy == MPOL_DEFAULT)
+		return 0;
+
+	new = migrate_policy(vma->vm_policy, nodemap);
+	if (IS_ERR(new))
+		return (PTR_ERR(new));
+
+	return(policy_vma(vma, new));
+}
Index: linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/mm/mmigrate.c	2005-05-10 11:18:20.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c	2005-05-10 11:18:24.000000000 -0700
@@ -26,6 +26,7 @@
 #include <linux/delay.h>
 #include <linux/blkdev.h>
 #include <linux/nodemask.h>
+#include <linux/mempolicy.h>
 #include <asm/bitops.h>
 
 /*
@@ -798,8 +799,17 @@ sys_migrate_pages(const pid_t pid, const
   	smp_call_function(&lru_add_drain_per_cpu, NULL, 0, 1);
 	lru_add_drain();
 
+	/* update the process mempolicy, if needed */
+	ret = migrate_process_policy(task, node_map);
+	if (ret)
+		goto out_dec;
+
 	/* actually do the migration */
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		/* update the vma mempolicy, if needed */
+		ret = migrate_vma_policy(vma, node_map);
+		if (ret)
+			goto out_dec;
 		/* migrate the pages of this vma */
 		ret = migrate_vma(task, mm, vma, node_map);
 		if (ret >= 0)

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2.6.12-rc3 7/8] mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch
  2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
                   ` (5 preceding siblings ...)
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 6/8] mm: manual page migration-rc2 -- sys_migrate_pages-mempolicy-migration-rc2.patch Ray Bryant
@ 2005-05-11  4:38 ` Ray Bryant
  2005-05-11 12:37   ` Paul Jackson
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 8/8] mm: manual page migration-rc2 -- sys_migrate_pages-permissions-check-rc2.patch Ray Bryant
  7 siblings, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:38 UTC (permalink / raw)
  To: Hirokazu Takahashi, Marcelo Tosatti, Andi Kleen, Dave Hansen
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

This patch adds cpuset support to the migrate_pages() system call.

The idea of this patch is that in order to do a migration:

(1)  The target task nees to be able to allocate pages on the
     nodes that are being migrated to.

(2)  However, the actual allocation of pages is not done by
     the target task.  Allocation is done by the task that is
     running the migrate_pages() system call.  Since it is 
     expected that the migration will be done by a batch manager
     of some kind that is authorized to control the jobs running
     in an enclosing cpuset, we make the requirement that the
     current task ALSO must be able to allocate pages on the
     nodes that are being migrated to.

Note well that if cpusets are not configured, both of these tests
become noops.

Signed-off-by: Ray Bryant <raybry@sgi.com>

 include/linux/cpuset.h |    8 +++++++-
 kernel/cpuset.c        |   24 +++++++++++++++++++++++-
 mm/mmigrate.c          |   17 +++++++++++++++++
 3 files changed, 47 insertions(+), 2 deletions(-)

Index: linux-2.6.12-rc3-mhp1-page-migration-export/include/linux/cpuset.h
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/include/linux/cpuset.h	2005-04-20 17:03:16.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/include/linux/cpuset.h	2005-05-10 11:20:57.000000000 -0700
@@ -4,7 +4,7 @@
  *  cpuset interface
  *
  *  Copyright (C) 2003 BULL SA
- *  Copyright (C) 2004 Silicon Graphics, Inc.
+ *  Copyright (C) 2004-2005 Silicon Graphics, Inc.
  *
  */
 
@@ -24,6 +24,7 @@ void cpuset_update_current_mems_allowed(
 void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 int cpuset_zone_allowed(struct zone *z);
+extern const nodemask_t cpuset_mems_allowed(const struct task_struct *tsk);
 extern struct file_operations proc_cpuset_operations;
 extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
 
@@ -53,6 +54,11 @@ static inline int cpuset_zone_allowed(st
 	return 1;
 }
 
+static inline nodemask_t cpuset_mems_allowed(const struct task_struct *tsk)
+{
+	return node_possible_map;
+}
+
 static inline char *cpuset_task_status_allowed(struct task_struct *task,
 							char *buffer)
 {
Index: linux-2.6.12-rc3-mhp1-page-migration-export/kernel/cpuset.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/kernel/cpuset.c	2005-04-20 17:03:16.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/kernel/cpuset.c	2005-05-10 11:20:57.000000000 -0700
@@ -4,7 +4,7 @@
  *  Processor and Memory placement constraints for sets of tasks.
  *
  *  Copyright (C) 2003 BULL SA.
- *  Copyright (C) 2004 Silicon Graphics, Inc.
+ *  Copyright (C) 2004-2005 Silicon Graphics, Inc.
  *
  *  Portions derived from Patrick Mochel's sysfs code.
  *  sysfs is Copyright (c) 2001-3 Patrick Mochel
@@ -1500,6 +1500,28 @@ int cpuset_zone_allowed(struct zone *z)
 		node_isset(z->zone_pgdat->node_id, current->mems_allowed);
 }
 
+/**
+ * cpuset_mems_allowed - return mems_allowed mask from a tasks cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed.
+ *
+ * Description: Returns the nodemask_t mems_allowed of the cpuset
+ * attached to the specified @tsk.
+ *
+ * requires either tsk==current or the tsk's task_lock to be held
+ * by the caller.
+ **/
+
+const nodemask_t cpuset_mems_allowed(const struct task_struct *tsk)
+{
+	nodemask_t mask;
+
+	down(&cpuset_sem);
+	guarantee_online_mems(tsk->cpuset, &mask);
+	up(&cpuset_sem);
+
+	return mask;
+}
+
 /*
  * proc_cpuset_show()
  *  - Print tasks cpuset path into seq_file.
Index: linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/mm/mmigrate.c	2005-05-10 11:18:40.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c	2005-05-10 11:21:39.000000000 -0700
@@ -27,6 +27,7 @@
 #include <linux/blkdev.h>
 #include <linux/nodemask.h>
 #include <linux/mempolicy.h>
+#include <linux/cpuset.h>
 #include <asm/bitops.h>
 
 /*
@@ -777,6 +778,7 @@ sys_migrate_pages(const pid_t pid, const
 		mm = task->mm;
 		if (mm)
 			atomic_inc(&mm->mm_users);
+		nodes = cpuset_mems_allowed(task);
 		task_unlock(task);
 	} else {
 		ret = -ESRCH;
@@ -789,6 +791,21 @@ sys_migrate_pages(const pid_t pid, const
 		goto out;
 	}
 
+	/* check to make sure the target task can allocate on new_nodes */
+	for(i = 0; i < count; i++)
+		if (!node_isset(tmp_new_nodes[i], nodes)) {
+			ret = -EINVAL;
+			goto out_dec;
+		}
+
+	/* the current task must also be able to allocate on new_nodes */
+	nodes = cpuset_mems_allowed(current);
+	for(i = 0; i < count; i++)
+		if (!node_isset(tmp_new_nodes[i], nodes)) {
+			ret = -EINVAL;
+			goto out_dec;
+		}
+
 	/* set up the node_map array */
 	for(i = 0; i < MAX_NUMNODES; i++)
 		node_map[i] = -1;

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 7/8] mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 7/8] mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch Ray Bryant
@ 2005-05-11 12:37   ` Paul Jackson
  2005-05-11 14:20     ` Ray Bryant
  0 siblings, 1 reply; 46+ messages in thread
From: Paul Jackson @ 2005-05-11 12:37 UTC (permalink / raw)
  To: Ray Bryant
  Cc: taka, marcelo.tosatti, ak, haveblue, hch, linux-mm, nathans,
	raybry, lhms-devel

My apologies for not considering some of the following code review
comments earlier.  I was on vacation, more-or-less, for the last month.

Ray wrote:
+static inline nodemask_t cpuset_mems_allowed(const struct task_struct *tsk)
+{
+	return node_possible_map;
+}

I would have expected node_online_map here, not node_possible_map.
But I am not entirely sure that I am right in this expectation.

It is mildly unfortunate that your new cpuset_mems_allowed(const struct
task_struct *tsk) requires the task lock to be held on tsk before the
call, but the corresponding cpuset_cpus_allowed() expects to be called
without the task lock held, and locks the task temporarilly, during the
call.  This difference may trip someone up someday.

The abuse of the local variable 'nodes', changing its meaning towards
the end of a long sequence of code that knows it as the target tasks
nodes, to become the current tasks nodes, is unfortunate.  Egads - it is
worse than that.  The nodemask_t 'nodes' starts out being set to the
nodes in tmp_old_nodes[], and then is set to the nodes allowed to the
target task, and then is set to nodes allowed to the current task.
Perhaps this confusion can be reduced by using the suggested code below
with two nodemask_t's old_node_mask and new_node_mask.

The existing several loops over count nodes in this sys_migrate_pages()
code:

        for(i=0; i<count; i++)
                if(node_isset(tmp_new_nodes[i], nodes)) {

as well as the two new such loops that your cpuset check adds might
better be replaced with bitmap based operations on nodemasks, as in:

asmlinkage long
sys_migrate_pages(const pid_t pid, const int count,
        caddr_t old_nodes, caddr_t new_nodes)
{
	...
	nodemask_t old_node_mask, new_node_mask;

	...
	nodes_clear(old_node_mask);
	nodes_clear(new_node_mask);
	ret = -EINVAL;
	for (i = 0; i < count; i++) {
		int n;

		n = tmp_old_nodes[i];
		if (n < 0 || n >= MAX_NUMNODES)
			goto out;
		node_set(n, old_node_mask);

		n = tmp_new_nodes[i];
		if (n < 0 || n >= MAX_NUMNODES)
			goto out;
		node_set(n, new_node_mask);
	}

	/* old_nodes and new_nodes must be disjoint */
	if (nodes_intersects(old_node_mask, new_node_mask)
		goto out;
	...

With the above code, there is only a single loop over count, and then
various node_*() operations on the old and new node_mask should suffice
in more or less all cases.  Well ... two loops ... looks like at least
the loop to "set up the node_map array" is still needed, near the end.

Note also the more traditional Linux kernel for-loop spacing I used:
	for (i = 0; i < count; i++)
not:
	for(i=0; i<count; i++)
nor:
	for(i = 0; i < count; i++)

The existing sys_migrate_pages() is inconsistent in its spacing of
for-loops, both with itself and with the recommended kernel coding
style.

And note that I put the "old_nodes and new_nodes must be disjoint"
comment on one line, not three.

It doesn't look like tmp_old_nodes and tmp_new_nodes are initialized to
NULL in their declaration at the top of this sys_migrate_pages()
routine.  If say the kmalloc() of the first of these failed, then the
closing code might try to kfree() the second, which would still have
random stack junk in it.

Could you avoid the use of two labels at the end, 'out' and 'out_dec',
by making the decrement conditional on mm being set, as in:

out:
	if (mm)
		atomic_dec(&mm->mm_users);
	kfree(...);

This bit of code:

                /* migrate the pages of this vma */
                ret = migrate_vma(task, mm, vma, node_map);
                if (ret >= 0)
                        migrated += ret;
                else
                        goto out_dec;

might better be written as:

                /* migrate the pages of this vma */
                ret = migrate_vma(task, mm, vma, node_map);
                if (ret < 0)
			goto out;
                migrated += ret;

It is more common I think to dispense with the error case first,
and then unconditionally present the successful mainline case.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2.6.12-rc3 7/8] mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch
  2005-05-11 12:37   ` Paul Jackson
@ 2005-05-11 14:20     ` Ray Bryant
  2005-05-11 18:55       ` [Lhms-devel] " Paul Jackson
  0 siblings, 1 reply; 46+ messages in thread
From: Ray Bryant @ 2005-05-11 14:20 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Ray Bryant, linux-mm, raybry, lhms-devel

<trimming the cc list a bit...>

Paul Jackson wrote:
> My apologies for not considering some of the following code review
> comments earlier.  I was on vacation, more-or-less, for the last month.
> 

No problem.

> Ray wrote:
> +static inline nodemask_t cpuset_mems_allowed(const struct task_struct *tsk)
> +{
> +	return node_possible_map;
> +}
> 
> I would have expected node_online_map here, not node_possible_map.
> But I am not entirely sure that I am right in this expectation.

Hmmm.... I think I put there what you suggested to me, before, but never
mind.  node_online_map makes more sense, I think.

> 
> It is mildly unfortunate that your new cpuset_mems_allowed(const struct
> task_struct *tsk) requires the task lock to be held on tsk before the
> call, but the corresponding cpuset_cpus_allowed() expects to be called
> without the task lock held, and locks the task temporarilly, during the
> call.  This difference may trip someone up someday.
> 

I guess I could fix cpuset_mems_allowed() to work the same was as the
other code.  It would mean taking the task lock again, but I suppose
we can deal with that.

> The abuse of the local variable 'nodes', changing its meaning towards
> the end of a long sequence of code that knows it as the target tasks
> nodes, to become the current tasks nodes, is unfortunate.  Egads - it is
> worse than that.  The nodemask_t 'nodes' starts out being set to the
> nodes in tmp_old_nodes[], and then is set to the nodes allowed to the
> target task, and then is set to nodes allowed to the current task.
> Perhaps this confusion can be reduced by using the suggested code below
> with two nodemask_t's old_node_mask and new_node_mask.
>

I was trying to save 64 bytes of stack space (right?  Each nodemask_t
is 8 x 64 bit words on Altix....)

> The existing several loops over count nodes in this sys_migrate_pages()
> code:
> 
>         for(i=0; i<count; i++)
>                 if(node_isset(tmp_new_nodes[i], nodes)) {
> 
> as well as the two new such loops that your cpuset check adds might
> better be replaced with bitmap based operations on nodemasks, as in:
> 
> asmlinkage long
> sys_migrate_pages(const pid_t pid, const int count,
>         caddr_t old_nodes, caddr_t new_nodes)
> {
> 	...
> 	nodemask_t old_node_mask, new_node_mask;
> 
> 	...
> 	nodes_clear(old_node_mask);
> 	nodes_clear(new_node_mask);
> 	ret = -EINVAL;
> 	for (i = 0; i < count; i++) {
> 		int n;
> 
> 		n = tmp_old_nodes[i];
> 		if (n < 0 || n >= MAX_NUMNODES)
> 			goto out;
> 		node_set(n, old_node_mask);
> 
> 		n = tmp_new_nodes[i];
> 		if (n < 0 || n >= MAX_NUMNODES)
> 			goto out;
> 		node_set(n, new_node_mask);
> 	}
> 
> 	/* old_nodes and new_nodes must be disjoint */
> 	if (nodes_intersects(old_node_mask, new_node_mask)
> 		goto out;
> 	...
> 
> With the above code, there is only a single loop over count, and then
> various node_*() operations on the old and new node_mask should suffice
> in more or less all cases.  Well ... two loops ... looks like at least
> the loop to "set up the node_map array" is still needed, near the end.
> 

Yeah, the incremental nature of the development of this thing led to
lots of small loops.

> Note also the more traditional Linux kernel for-loop spacing I used:
> 	for (i = 0; i < count; i++)
> not:
> 	for(i=0; i<count; i++)
> nor:
> 	for(i = 0; i < count; i++)
>

Finally, I see the difference.  I was missing that first space all
along.

> The existing sys_migrate_pages() is inconsistent in its spacing of
> for-loops, both with itself and with the recommended kernel coding
> style.
> 

Not surprising given the above.

> And note that I put the "old_nodes and new_nodes must be disjoint"
> comment on one line, not three.
> 
> It doesn't look like tmp_old_nodes and tmp_new_nodes are initialized to
> NULL in their declaration at the top of this sys_migrate_pages()
> routine.  If say the kmalloc() of the first of these failed, then the
> closing code might try to kfree() the second, which would still have
> random stack junk in it.
> 

That's not good.  Will fix.

> Could you avoid the use of two labels at the end, 'out' and 'out_dec',
> by making the decrement conditional on mm being set, as in:
> 
> out:
> 	if (mm)
> 		atomic_dec(&mm->mm_users);
> 	kfree(...);
> 

Good point.

> This bit of code:
> 
>                 /* migrate the pages of this vma */
>                 ret = migrate_vma(task, mm, vma, node_map);
>                 if (ret >= 0)
>                         migrated += ret;
>                 else
>                         goto out_dec;
> 
> might better be written as:
> 
>                 /* migrate the pages of this vma */
>                 ret = migrate_vma(task, mm, vma, node_map);
>                 if (ret < 0)
> 			goto out;
>                 migrated += ret;
> 
> It is more common I think to dispense with the error case first,
> and then unconditionally present the successful mainline case.
> 

ok.

Thanks,

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 7/8] mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch
  2005-05-11 14:20     ` Ray Bryant
@ 2005-05-11 18:55       ` Paul Jackson
  0 siblings, 0 replies; 46+ messages in thread
From: Paul Jackson @ 2005-05-11 18:55 UTC (permalink / raw)
  To: Ray Bryant; +Cc: raybry, linux-mm, raybry, lhms-devel

Ray wrote:
> <trimming the cc list a bit...>

argh ... please don't

> I guess I could fix cpuset_mems_allowed() to work the same was as the
> other code.  It would mean taking the task lock again, but I suppose
> we can deal with that.

Unless this is on a hot path, I am more concerned with ensuring that
code has the fewest surprises (unexpected differences) and the fewest
not trivially obvious order sensitive conditions than I am with the cost
of a non-essential task lock.

> I was trying to save 64 bytes of stack space

A worthwhile goal, but not at the cost of such order sensitive abuse of
a variable three different ways over 149 lines of code, unless this is a
known stack size critical routine, which I am not aware that it is.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2.6.12-rc3 8/8] mm: manual page migration-rc2 -- sys_migrate_pages-permissions-check-rc2.patch
  2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
                   ` (6 preceding siblings ...)
  2005-05-11  4:38 ` [PATCH 2.6.12-rc3 7/8] mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch Ray Bryant
@ 2005-05-11  4:38 ` Ray Bryant
  7 siblings, 0 replies; 46+ messages in thread
From: Ray Bryant @ 2005-05-11  4:38 UTC (permalink / raw)
  To: Hirokazu Takahashi, Dave Hansen, Marcelo Tosatti, Andi Kleen
  Cc: Christoph Hellwig, linux-mm, Nathan Scott, Ray Bryant,
	lhms-devel, Ray Bryant

Add permissions checking to migrate_pages() system call.  The basic
idea is that if the calling process could send an arbitary signal to a
process then you are allowed to migrate that process, or if the calling
process has capability CAP_SYS_ADMIN.  The permissions check is based
on that in check_kill_permission() in kernel/signal.c.

Signed-off-by: Ray Bryant <raybry@sgi.com>

 include/linux/capability.h |    2 ++
 mm/mmigrate.c              |   14 ++++++++++++++
 2 files changed, 16 insertions(+)

Index: linux-2.6.12-rc3-mhp1-page-migration-export/include/linux/capability.h
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/include/linux/capability.h	2005-05-10 12:29:49.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/include/linux/capability.h	2005-05-10 12:31:16.000000000 -0700
@@ -233,6 +233,8 @@ typedef __u32 kernel_cap_t;
 /* Allow enabling/disabling tagged queuing on SCSI controllers and sending
    arbitrary SCSI commands */
 /* Allow setting encryption key on loopback filesystem */
+/* Allow using the migrate_pages() system call to migrate a process's pages
+   from one set of NUMA nodes to another */
 
 #define CAP_SYS_ADMIN        21
 
Index: linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c
===================================================================
--- linux-2.6.12-rc3-mhp1-page-migration-export.orig/mm/mmigrate.c	2005-05-10 12:29:49.000000000 -0700
+++ linux-2.6.12-rc3-mhp1-page-migration-export/mm/mmigrate.c	2005-05-10 12:54:26.000000000 -0700
@@ -15,6 +15,8 @@
 #include <linux/module.h>
 #include <linux/swap.h>
 #include <linux/pagemap.h>
+#include <linux/sched.h>
+#include <linux/capability.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/writeback.h>
@@ -775,6 +777,18 @@ sys_migrate_pages(const pid_t pid, const
 	task = find_task_by_pid(pid);
 	if (task) {
 		task_lock(task);
+		/*
+		 * does this task have permission to migrate that task?
+		 * (ala check_kill_permission() )
+		 */
+	        if ((current->euid ^ task->suid) && (current->euid ^ task->uid)
+	           && (current->uid ^ task->suid) && (current->uid ^ task->uid)
+	           && !capable(CAP_SYS_ADMIN)) {
+		   	ret = -EPERM;
+			task_unlock(task);
+			read_unlock(&tasklist_lock);
+			goto out;
+		}
 		mm = task->mm;
 		if (mm)
 			atomic_inc(&mm->mm_users);

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2005-05-28 16:12 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-11  4:37 [PATCH 2.6.12-rc3 0/8] mm: manual page migration-rc2 -- overview Ray Bryant
2005-05-11  4:38 ` [PATCH 2.6.12-rc3 1/8] mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch Ray Bryant
2005-05-11  7:15   ` Christoph Hellwig
2005-05-11 12:10     ` [Lhms-devel] " Ray Bryant
2005-05-11 12:59       ` Andi Kleen
2005-05-11 18:43         ` Ray Bryant
2005-05-11 19:32           ` Andi Kleen
2005-05-11 20:00             ` Christoph Hellwig
2005-05-11 22:04               ` Ray Bryant
2005-05-12 10:45             ` Christoph Hellwig
2005-05-17  4:22               ` Ray Bryant
2005-05-18  6:20                 ` Paul Jackson
2005-05-18 14:49                   ` Ray Bryant
2005-05-20 22:26               ` Ray Bryant
2005-05-23 17:50                 ` Steve Longerbeam
2005-05-24  4:53                   ` Ray Bryant
2005-05-24 20:59                     ` Christoph Lameter
2005-05-24 21:04                       ` Martin J. Bligh
2005-05-25  6:42                       ` Ray Bryant
2005-05-28  8:40                         ` Christoph Hellwig
2005-05-28 16:12                           ` Ray Bryant
2005-05-11 19:50       ` Christoph Hellwig
2005-05-11 21:30         ` Ray Bryant
2005-05-12  9:55           ` Christoph Hellwig
2005-05-12 15:47             ` Ray Bryant
2005-05-11  4:38 ` [PATCH 2.6.12-rc3 2/8] mm: manual page migration-rc2 -- xfs-migrate-page-rc2.patch Ray Bryant
2005-05-11  4:38 ` [PATCH 2.6.12-rc3 3/8] mm: manual page migration-rc2 -- add-node_map-arg-to-try_to_migrate_pages-rc2.patch Ray Bryant
2005-05-11  4:38 ` [PATCH 2.6.12-rc3 4/8] mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch Ray Bryant
2005-05-11  8:24   ` Christoph Hellwig
2005-05-18 19:07     ` Ray Bryant
2005-05-28  9:14       ` Christoph Hellwig
2005-05-28 15:53         ` Ray Bryant
2005-05-11 13:23   ` Hirokazu Takahashi
2005-05-11 13:26     ` Hirokazu Takahashi
2005-05-11 14:06     ` Ray Bryant
2005-05-12  6:41       ` Hirokazu Takahashi
2005-05-12 16:41         ` Ray Bryant
2005-05-12 23:50           ` Hirokazu Takahashi
2005-05-13  9:59             ` [Lhms-devel] " Ray Bryant
2005-05-11  4:38 ` [PATCH 2.6.12-rc3 5/8] mm: manual page migration-rc2 -- sys_migrate_pages-xattr-support-rc2.patch Ray Bryant
2005-05-11  4:38 ` [PATCH 2.6.12-rc3 6/8] mm: manual page migration-rc2 -- sys_migrate_pages-mempolicy-migration-rc2.patch Ray Bryant
2005-05-11  4:38 ` [PATCH 2.6.12-rc3 7/8] mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch Ray Bryant
2005-05-11 12:37   ` Paul Jackson
2005-05-11 14:20     ` Ray Bryant
2005-05-11 18:55       ` [Lhms-devel] " Paul Jackson
2005-05-11  4:38 ` [PATCH 2.6.12-rc3 8/8] mm: manual page migration-rc2 -- sys_migrate_pages-permissions-check-rc2.patch Ray Bryant

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox