linux-mm.kvack.org archive mirror
* manual page migration and madvise/mbind
@ 2005-05-17 16:44 Ray Bryant
  2005-05-17 16:55 ` Ray Bryant
  2005-05-18  1:26 ` Andi Kleen
  0 siblings, 2 replies; 5+ messages in thread
From: Ray Bryant @ 2005-05-17 16:44 UTC (permalink / raw)
  To: Christoph Hellwig, Andi Kleen; +Cc: linux-mm, lhms

Andi and hch,

Resending to make sure you see this.

-------- Original Message --------
Subject: Re: [Lhms-devel] Re: [PATCH 2.6.12-rc3 1/8] mm: manual page 
migration-rc2 -- xfs-extended-attributes-rc2.patch
Date: Mon, 16 May 2005 23:22:50 -0500
From: Ray Bryant <raybry@engr.sgi.com>
To: Christoph Hellwig <hch@infradead.org>, Andi Kleen <ak@suse.de>
CC: Ray Bryant <raybry@sgi.com>, Hirokazu Takahashi <taka@valinux.co.jp>,
    Marcelo Tosatti <marcelo.tosatti@cyclades.com>,
    Dave Hansen <haveblue@us.ibm.com>, linux-mm <linux-mm@kvack.org>,
    Nathan Scott <nathans@sgi.com>, Ray Bryant <raybry@austin.rr.com>,
    lhms-devel@lists.sourceforge.net, Jes Sorensen <jes@wildopensource.com>
References: <20050511043756.10876.72079.60115@jackhammer.engr.sgi.com> 
<20050511043802.10876.60521.51027@jackhammer.engr.sgi.com> 
<20050511071538.GA23090@infradead.org> <4281F650.2020807@engr.sgi.com> 
<20050511125932.GW25612@wotan.suse.de> <42825236.1030503@engr.sgi.com> 
<20050511193207.GE11200@wotan.suse.de> <20050512104543.GA14799@infradead.org>

Christoph Hellwig wrote:
> On Wed, May 11, 2005 at 09:32:07PM +0200, Andi Kleen wrote:
> 
>>A minor change for that is probably ok, as long as the actual logic
>>that uses this is generic.
>>
>>hch: if you still are against this please reread the original thread
>>with me and Ray and see why we decided that ld.so changes are not
>>a good idea.
> 
> 
> So reading through the thread I think using mempolicies to mark shared
> libraries is better than the mmap flag I proposed.  I still don't think
> xattrs interpreted by the kernel is a good way to store them.  Setting
> up libraries is the job of the dynamic linker, and reading pre-defined
> memory policies from an ELF header fits the approach we do for related
> things.
> 

Andi and hch,

OK, I've been off chasing down what the possibilities are in that area.
I'm also looking at Steve Longerbeam's patches to see if that will help
us out here.

However, I've come across a minor issue that has complicated my thinking
on this.  Suppose one uses madvise() or mbind() to apply the migration
policy flags (the three policies we basically need are migrate,
migrate_non_shared, and migrate_none, used for normal files, libraries,
and shared binaries, respectively).  Then when madvise() (let us say)
is called, it isn't good enough to mark the vma that the address and
length point to; it's necessary to reach down to a common subobject
(such as the file struct, address space struct, or inode) and mark
that.

If the vma is all that is marked, then when migrate_pages() is called
and as a result some other address space than the current one is examined,
it won't see the flags.
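
A toy model of the problem, in Python rather than kernel C.  The class
names (AddressSpace, Vma, Process), the flag strings and the library path
are stand-ins made up for illustration; only the idea that a vma is
per-process while the backing object is shared comes from the discussion
above.  The "per-vma mark wins" lookup order is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AddressSpace:
    """Stand-in for the per-file object shared by every mapping of that file."""
    path: str
    migration_flag: Optional[str] = None


@dataclass
class Vma:
    """Stand-in for one process's private view of a mapped range."""
    backing: AddressSpace
    migration_flag: Optional[str] = None


@dataclass
class Process:
    pid: int
    vmas: Dict[str, Vma] = field(default_factory=dict)

    def map_file(self, aspace: AddressSpace) -> Vma:
        vma = Vma(backing=aspace)
        self.vmas[aspace.path] = vma
        return vma


def flag_seen_by(proc: Process, path: str) -> Optional[str]:
    """What a migration pass over `proc` would see for `path`."""
    vma = proc.vmas[path]
    # Assumed lookup order: a per-vma mark wins, else the shared object.
    return vma.migration_flag or vma.backing.migration_flag


libc = AddressSpace("/lib/libc.so.6")          # illustrative path
caller, target = Process(100), Process(200)    # illustrative pids
caller.map_file(libc)
target.map_file(libc)

# Marking only the caller's vma is invisible when `target` is examined.
caller.vmas[libc.path].migration_flag = "migrate_non_shared"
vma_only = flag_seen_by(target, libc.path)

# Marking the shared object is visible from every process mapping the file.
libc.migration_flag = "migrate_non_shared"
shared = flag_seen_by(target, libc.path)
```

Here vma_only is None, because the target process has its own vma for the
same file, while shared picks up the flag through the common backing object.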

(Remember that the migrate_pages() system call takes a pid, a count,
and lists of old and new nodes, so that this process is allowed to
migrate that process over there, which is what the batch manager needs
to do.  Running madvise() in the current process's address space doesn't
help much unless it marks something deeper in the address space hierarchy
than a vma.)
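
A rough Python model of what such a call does with its node lists.  This
assumes that a page found on old_nodes[i] is moved to new_nodes[i] and that
pages on unlisted nodes stay where they are; the function name
plan_migration and the page-number-to-node dictionary are made up for
illustration and are not the kernel interface.

```python
from typing import Dict, List


def plan_migration(page_nodes: Dict[int, int],
                   old_nodes: List[int],
                   new_nodes: List[int]) -> Dict[int, int]:
    """Return {page: destination node} for the pages that have to move."""
    if len(old_nodes) != len(new_nodes):
        raise ValueError("old and new node lists must have the same count")
    # Assumption: the i-th old node maps to the i-th new node.
    remap = dict(zip(old_nodes, new_nodes))
    return {page: remap[node]
            for page, node in page_nodes.items()
            if node in remap and remap[node] != node}


# Pages of some target process, keyed by page number -> current node.
pages = {0: 0, 1: 1, 2: 5, 3: 1}
moves = plan_migration(pages, old_nodes=[0, 1], new_nodes=[2, 3])
```

Page 2 sits on node 5, which is not in the old node list, so it does not
appear in moves.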

This is something quite a bit different than what madvise() or mbind()
do today.  (They just manipulate vma's AFAIK.)

Does that observation change y'all's thinking on this in any way?


-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>



* Re: manual page migration and madvise/mbind
  2005-05-17 16:44 manual page migration and madvise/mbind Ray Bryant
@ 2005-05-17 16:55 ` Ray Bryant
  2005-05-18  1:26 ` Andi Kleen
  1 sibling, 0 replies; 5+ messages in thread
From: Ray Bryant @ 2005-05-17 16:55 UTC (permalink / raw)
  To: Ray Bryant; +Cc: Christoph Hellwig, Andi Kleen, linux-mm, lhms

Ray Bryant wrote:

> 
> However, I've come across a minor issue that has complicated my thinking
> on this.  Suppose one uses madvise() or mbind() to apply the migration
> policy flags (the three policies we basically need are migrate,
> migrate_non_shared, and migrate_none, used for normal files, libraries,
> and shared binaries, respectively).  Then when madvise() (let us say)
> is called, it isn't good enough to mark the vma that the address and
> length point to; it's necessary to reach down to a common subobject
> (such as the file struct, address space struct, or inode) and mark
> that.
> 
> If the vma is all that is marked, then when migrate_pages() is called
> and as a result some other address space than the current one is examined,
> it won't see the flags.
> 
> (Remember that the migrate_pages() system call takes a pid, a count,
> and lists of old and new nodes, so that this process is allowed to
> migrate that process over there, which is what the batch manager needs
> to do.  Running madvise() in the current process's address space doesn't
> help much unless it marks something deeper in the address space hierarchy
> than a vma.)
> 
> This is something quite a bit different than what madvise() or mbind()
> do today.  (They just manipulate vma's AFAIK.)
> 
> Does that observation change y'all's thinking on this in any way?

Achh.... Nevermind... my bad.

If we do the madvise/mbind in each pid (via execve() and ld.so) then we can
mark just the vma and that is fine.  I was off on some other tangent....

-- 
Best Regards,
Ray

* Re: manual page migration and madvise/mbind
  2005-05-17 16:44 manual page migration and madvise/mbind Ray Bryant
  2005-05-17 16:55 ` Ray Bryant
@ 2005-05-18  1:26 ` Andi Kleen
  2005-05-18  4:02   ` Ray Bryant
  2005-05-28  8:49   ` Christoph Hellwig
  1 sibling, 2 replies; 5+ messages in thread
From: Andi Kleen @ 2005-05-18  1:26 UTC (permalink / raw)
  To: Ray Bryant; +Cc: Christoph Hellwig, linux-mm, lhms

Sorry for the late answer.

On Tue, May 17, 2005 at 11:44:31AM -0500, Ray Bryant wrote:
> (Remember that the migrate_pages() system call takes a pid, a count,
> and lists of old and new nodes, so that this process is allowed to
> migrate that process over there, which is what the batch manager needs
> to do.  Running madvise() in the current process's address space doesn't
> help much unless it marks something deeper in the address space hierarchy
> than a vma.)
> 
> This is something quite a bit different than what madvise() or mbind()
> do today.  (They just manipulate vma's AFAIK.)

Nah, mbind manipulates backing objects too, in particular for shared
memory. It is not currently implemented for files, but that was planned
and Steve L's patches went in that direction with some limitations.

And yes, the state would need to be stored in the address_space, which
is shared.  In my version it was in private backing store objects.
Check Steve's patch.

The main problem I see with the "hack ld.so" approach is that it
doesn't work for non-program files. So if you really want to handle
them you would need a daemon that sets the policies once a file
is mapped, or you would have to hack all the programs to set the policies.
I don't see that as being practicable. OK, you could always add a "sticky"
process policy that actually allocates mempolicies for newly read files
and so marks them using your new flags. But that would seem
somewhat ugly to me and is probably incompatible with your batch manager
anyway.  The only sane way to handle arbitrary files like this
would be the xattr.

If you ignore data files then it would be ok to keep it to 
ELF loaders and ld.so I guess.

-Andi

* Re: manual page migration and madvise/mbind
  2005-05-18  1:26 ` Andi Kleen
@ 2005-05-18  4:02   ` Ray Bryant
  2005-05-28  8:49   ` Christoph Hellwig
  1 sibling, 0 replies; 5+ messages in thread
From: Ray Bryant @ 2005-05-18  4:02 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Hellwig, linux-mm, lhms

Andi Kleen wrote:
> Sorry for the late answer.
> 
> On Tue, May 17, 2005 at 11:44:31AM -0500, Ray Bryant wrote:
> 
>>(Remember that the migrate_pages() system call takes a pid, a count,
>>and lists of old and new nodes, so that this process is allowed to
>>migrate that process over there, which is what the batch manager needs
>>to do.  Running madvise() in the current process's address space doesn't
>>help much unless it marks something deeper in the address space hierarchy
>>than a vma.)
>>
>>This is something quite a bit different than what madvise() or mbind()
>>do today.  (They just manipulate vma's AFAIK.)
> 
> 
> Nah, mbind manipulates backing objects too, in particular for shared
> memory. It is not currently implemented for files, but that was planned
> and Steve L's patches went in that direction with some limitations.
>

That's what I need then.

> And yes, the state would need to be stored in the address_space, which
> is shared.  In my version it was in private backing store objects.
> Check Steve's patch.
> 

I'm in the process of building Steve's stuff for Altix.

> The main problem I see with the "hack ld.so" approach is that it
> doesn't work for non-program files. So if you really want to handle
> them you would need a daemon that sets the policies once a file
> is mapped, or you would have to hack all the programs to set the policies.
> I don't see that as being practicable. OK, you could always add a "sticky"
> process policy that actually allocates mempolicies for newly read files
> and so marks them using your new flags. But that would seem
> somewhat ugly to me and is probably incompatible with your batch manager
> anyway.  The only sane way to handle arbitrary files like this
> would be the xattr.
> 
> If you ignore data files then it would be ok to keep it to 
> ELF loaders and ld.so I guess.
> 
> -Andi
> 
This turns out to be problematic.  For example, if I look at /proc/pid/maps
for /bin/tcsh, the following files are mapped in, but they are not ELF files
at all; they are just data files (at least according to the "file" command):

/usr/lib/gconv/gconv-modules.cache
/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
/usr/lib/locale/en_US.utf8/LC_MEASUREMENT
/usr/lib/locale/en_US.utf8/LC_TELEPHONE
/usr/lib/locale/en_US.utf8/LC_ADDRESS
/usr/lib/locale/en_US.utf8/LC_NAME
/usr/lib/locale/en_US.utf8/LC_PAPER
/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
/usr/lib/locale/en_US.utf8/LC_MONETARY
/usr/lib/locale/en_US.utf8/LC_COLLATE
/usr/lib/locale/en_US.utf8/LC_TIME
/usr/lib/locale/en_US.utf8/LC_NUMERIC
/usr/lib/locale/en_US.utf8/LC_CTYPE

Admittedly, most of this is National Language stuff, so not all programs
map this in, but nonetheless it raises the question of how to mark such
files as not-migratable, or at least migrate-non-shared, since that is how
they are marked now (we typically mark these files with the extended attribute
value "libr").  So we want to migrate the anonymous pages found in those
vmas, but not the shared pages.

Also the files are all small (a couple of pages each), so migrating them all
the time would not be such a problem, but it seems untidy to do so.
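
For reference, a list like the one above can be produced with a short
script.  The following is a minimal Python sketch, assuming the usual
/proc/<pid>/maps line layout (address, perms, offset, dev, inode, optional
pathname) and treating "starts with the ELF magic bytes" as a stand-in for
what the "file" command reports.  The function names and the sample lines
(addresses, devices, inode numbers) are made up for illustration.

```python
from typing import Iterable, List

ELF_MAGIC = b"\x7fELF"


def mapped_files(maps_lines: Iterable[str]) -> List[str]:
    """Return the distinct file paths named in /proc/<pid>/maps lines."""
    seen: List[str] = []
    for line in maps_lines:
        # Layout: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname column
        path = fields[5].strip()
        # Skip pseudo-entries such as [stack] or [heap].
        if path.startswith("/") and path not in seen:
            seen.append(path)
    return seen


def is_elf(path: str) -> bool:
    """True if the file can be read and starts with the ELF magic number."""
    try:
        with open(path, "rb") as f:
            return f.read(4) == ELF_MAGIC
    except OSError:
        return False


def non_elf_mappings(pid: int) -> List[str]:
    """Mapped files of a live process that do not look like ELF objects."""
    with open(f"/proc/{pid}/maps") as f:
        return [p for p in mapped_files(f) if not is_elf(p)]


# Made-up sample lines in the maps layout, so the parser can be tried offline.
sample = [
    "00400000-0045a000 r-xp 00000000 08:01 1234 /bin/tcsh",
    "2a95556000-2a9555d000 r--s 00000000 08:01 5678 "
    "/usr/lib/gconv/gconv-modules.cache",
    "2a9555d000-2a9555e000 rw-p 00000000 00:00 0 ",
    "7fbfffe000-7fc0000000 rw-p 00000000 00:00 0 [stack]",
]
paths = mapped_files(sample)
```

Calling non_elf_mappings() on the pid of a running tcsh would then give the
data files; a path the caller cannot read is reported as non-ELF, which is a
simplification.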

What could be done (beware: ugly hack follows) would be for the migration
application (e.g. the batch manager) to look at /proc/pid/maps for each
process to be migrated and examine the file names specified there.  The
batch manager could then use whatever algorithm it liked (it is just user code)
to determine whether, say, /usr/lib/locale/en_US.utf8/LC_TIME is
migratable.  It could then use a modified mbind() system call to reach
into the kernel and set a bit (or 2) in the address_space object.  (It would
have to map the file in first, but that is no big deal.)  Those bits could
be used to control the migration the way we do it now with extended
attributes.  There is a small performance hit here (mapping all of those files
in just before migration time and doing the mbind() system calls) but it is
probably doable and will be trivial in comparison to the actual time required
to do the migration.  (I suppose the batch manager could keep a cache of files
it has mapped in and marked and not have to do this every time a migration
call is made.)
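
The user-space half of that hack, including the cache, might look roughly
like the Python sketch below.  The modified mbind() does not exist, so it is
represented by a caller-supplied mark_file callback; the class name
PolicyMarker, the choose_policy rules and the /tmp/data.bin path are all
hypothetical.  Only the three policy names come from this thread.

```python
from typing import Callable, Dict, Iterable, List

# Policy names from the thread; the selection rules below are invented.
MIGRATE = "migrate"
MIGRATE_NON_SHARED = "migrate_non_shared"
MIGRATE_NONE = "migrate_none"


def choose_policy(path: str) -> str:
    """Hypothetical rule; a real batch manager would apply its own."""
    if path.startswith(("/usr/lib/locale/", "/usr/lib/gconv/")):
        return MIGRATE_NON_SHARED
    return MIGRATE


class PolicyMarker:
    """Marks each file once and remembers it across migration requests."""

    def __init__(self, mark_file: Callable[[str, str], None]):
        # mark_file stands in for "map the file, then call the modified
        # mbind()"; that kernel interface is hypothetical.
        self._mark_file = mark_file
        self._marked: Dict[str, str] = {}

    def prepare(self, paths: Iterable[str]) -> List[str]:
        """Mark files not seen before; return the paths marked this time."""
        newly_marked = []
        for path in paths:
            if path in self._marked:
                continue  # cache hit, nothing to do
            policy = choose_policy(path)
            self._mark_file(path, policy)
            self._marked[path] = policy
            newly_marked.append(path)
        return newly_marked


calls = []
marker = PolicyMarker(lambda path, policy: calls.append((path, policy)))
first = marker.prepare(["/usr/lib/locale/en_US.utf8/LC_TIME", "/tmp/data.bin"])
second = marker.prepare(["/usr/lib/locale/en_US.utf8/LC_TIME"])
```

The second prepare() call marks nothing, which is the point of the cache:
repeated migration requests only pay for files not seen before.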

(Andi -- I know you dislike bringing /proc/pid/maps back into this because of
the raciness of reading that file, but here we are reading it before the
migration operation itself starts.  And reading the file is a performance
assist, not required for correctness of the migration operation.)

This all can be done in preparation for using Steve Longerbeam's patches for
file NUMA policy support and his patches for ld.so etc., which I believe I can
use, with only slight modifications, to support migration policies based on
information in ELF program headers, at which point the ugly hack above can
ignore all ELF files.  The ugly hack, however, lives on forever for data
files, so I am not sure how much simplicity we have bought ourselves through
this whole process.

The ugly hack in some sense replaces the logic in ld.so until we can get a
modified version of ld.so into the glibc trees.  The kernel interface
remains the same regardless of whether we use a modified ld.so or not.

(My personal preference is still just to set the user.migration extended
attribute on such data files; it just seems so much simpler than all of
the other approaches that have been suggested.)
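
For comparison, here is roughly what the extended-attribute approach looks
like from user space, as a Python sketch.  The attribute name user.migration
and the value "libr" are the ones mentioned in this thread; treating a
missing attribute as plain "migrate" is an assumption, and the helper names
are made up.  os.getxattr is available on Linux only, so the getter can be
passed in for testing elsewhere.

```python
import os
from typing import Callable, Optional

ATTR_NAME = "user.migration"   # attribute name used in this thread


def read_migration_attr(
        path: str,
        getxattr: Optional[Callable[[str, str], bytes]] = None
) -> Optional[str]:
    """Return the attribute value as text, or None if it cannot be read."""
    if getxattr is None:
        getxattr = os.getxattr   # resolved lazily; Linux-only
    try:
        return getxattr(path, ATTR_NAME).decode("ascii")
    except OSError:
        # Attribute not set, file missing, or no xattr support.
        return None


def policy_for(path: str,
               getxattr: Optional[Callable[[str, str], bytes]] = None) -> str:
    """Map the attribute to a policy name.

    "libr" -> migrate_non_shared follows the thread; falling back to
    "migrate" when the attribute is absent is an assumption.
    """
    if read_migration_attr(path, getxattr) == "libr":
        return "migrate_non_shared"
    return "migrate"
```

Marking a data file is then a one-off administrative step, for example
setfattr -n user.migration -v libr <file>, or os.setxattr() from Python.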

Christoph, would the ugly hack above be acceptable or is that worse than the
original approach?

-- 
Best Regards,
Ray

* Re: manual page migration and madvise/mbind
  2005-05-18  1:26 ` Andi Kleen
  2005-05-18  4:02   ` Ray Bryant
@ 2005-05-28  8:49   ` Christoph Hellwig
  1 sibling, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2005-05-28  8:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ray Bryant, Christoph Hellwig, linux-mm, lhms

On Wed, May 18, 2005 at 03:26:27AM +0200, Andi Kleen wrote:
> > This is something quite a bit different than what madvise() or mbind()
> > do today.  (They just manipulate vma's AFAIK.)
> 
> Nah, mbind manipulates backing objects too, in particular for shared
> memory. It is not currently implemented for files, but that was planned
> and Steve L's patches went in that direction with some limitations.
> 
> And yes, the state would need to be stored in the address_space, which
> is shared.  In my version it was in private backing store objects.
> Check Steve's patch.
> 
> The main problem I see with the "hack ld.so" approach is that it
> doesn't work for non-program files. So if you really want to handle
> them you would need a daemon that sets the policies once a file
> is mapped, or you would have to hack all the programs to set the policies.
> I don't see that as being practicable. OK, you could always add a "sticky"
> process policy that actually allocates mempolicies for newly read files
> and so marks them using your new flags. But that would seem
> somewhat ugly to me and is probably incompatible with your batch manager
> anyway.  The only sane way to handle arbitrary files like this
> would be the xattr.

Storing the full memory policy in the extended attributes seems at least
a little less hackish than what's done currently.  We should read the policy
once at mmap time only instead of letting VM code poke into xattr details,
though.
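
A toy Python model of "read once at mmap time", not kernel code.  The
FileMapping class, the read_policy callback and the dictionary standing in
for the on-disk attributes are all hypothetical; the sketch only shows that
the migration-time check consults the cached policy and never goes back to
the attribute store.

```python
from typing import Callable, List, Optional


class FileMapping:
    """Stand-in for a mapping whose policy is fixed when it is created."""

    def __init__(self, path: str,
                 read_policy: Callable[[str], Optional[str]]):
        self.path = path
        # The only place the attribute store is consulted.
        self.policy = read_policy(path)


def should_migrate_shared_pages(mapping: FileMapping) -> bool:
    """Migration-time check: uses the cached policy, never the xattr."""
    # Assumption: no recorded policy means ordinary "migrate" behaviour.
    return mapping.policy in (None, "migrate")


# A dict stands in for the extended attributes stored with each file.
xattrs = {"/usr/lib/locale/en_US.utf8/LC_TIME": "migrate_non_shared"}
lookups: List[str] = []


def read_policy(path: str) -> Optional[str]:
    lookups.append(path)
    return xattrs.get(path)


m = FileMapping("/usr/lib/locale/en_US.utf8/LC_TIME", read_policy)
before = should_migrate_shared_pages(m)
xattrs.clear()                     # attribute changes after mmap time
after = should_migrate_shared_pages(m)
```

One consequence visible in the model: a change to the attribute after the
file is mapped does not affect existing mappings, since the store was read
exactly once.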

end of thread, other threads:[~2005-05-28  8:49 UTC | newest]

Thread overview: 5+ messages
2005-05-17 16:44 manual page migration and madvise/mbind Ray Bryant
2005-05-17 16:55 ` Ray Bryant
2005-05-18  1:26 ` Andi Kleen
2005-05-18  4:02   ` Ray Bryant
2005-05-28  8:49   ` Christoph Hellwig
