linux-mm.kvack.org archive mirror
* Query re:  mempolicy for page cache pages
@ 2006-05-18 17:49 Lee Schermerhorn
  2006-05-18 18:12 ` Andi Kleen
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Lee Schermerhorn @ 2006-05-18 17:49 UTC (permalink / raw)
  To: linux-mm; +Cc: Christoph Lameter, Andi Kleen, Steve Longerbeam, Andrew Morton

Below I've included an overview of a patch set that I've been working
on.  I submitted a previous version [then called Page Cache Policy] back
~20Apr.  I started working on this because Christoph seemed to consider
this a prerequisite for considering migrate-on-fault/lazy-migration/...
Since the previous post, I have addressed comments [from Christoph] and
kept the series up to date with the -mm tree.  

Just today, I was cleaning up some really old patches on my system and
came across a patch from Steve Longerbeam that passes a page index to
page_cache_alloc_cold()--exactly what my patch series does.  I took a
look back through the -mm archives [Yeah, I should have done this
earlier :-(] and found that back in Oct'04, Steve had posted a patch
[set] that takes essentially the same approach to solve the same
"problem".  

Since Steve's patches never made it into the kernel and don't exist in
the -mm tree either, I'm wondering why they were dropped.  I.e., is
there some fundamental objection to applying shared policy to memory
mapped files and using this policy for page cache allocations?  Rather
than bomb the mailing list with yet another set of dead-end patches, I'm
sending out just this overview, with the following questions:

1) What ever happened to Steve's patch set?

2) Is this even a problem that needs solving, as Christoph seemed to think
at one time?

3) If so, is this the right approach?  I.e., should I post the actual
patches?

4) If you don't agree with this approach, how would you go about it?

Regards,
Lee

P.S., tarballs containing the entire series, along with my "lazy
migration" patches can be found at:
http://free.linux.hp.com/~lts/Patches/PageMigration/ in -rcX-mmY
subdirs.

=====================================================================
Mapped File Policy V0.1 0/7 Overview

Formerly "Page Cache Policy" series.

V0.1 -	renamed and revised the series.  I think this name and
	breakout makes more sense.
	Prevent migration of file backed pages with shared
	policy from private mappings thereof.
	Also, address impact on show_numa_map() of patch #3
	of this series.
	refreshed against 2.6.17-rc4-mm1.

Basic "problem":  currently [2.6.17-rcx], files mmap()ed SHARED
do not follow mem policy applied to the mapped regions.  Instead, 
shared, file backed pages are allocated using the allocating
task's task policy.  This is inconsistent with the way that anon
and shmem pages are handled.

One reason for this is that down where file backed pages are
allocated, the faulting (mm, vma, address) are not
available to compute the policy.  However, we do have the inode
[via the address space] and file index/offset available.  If the
applicable policy could be determined from just this info, the
vma and address would not be required.
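
For illustration only, a minimal sketch of that idea [the ->spolicy
pointer on the address_space is what this series adds; the helper name
here is made up, while mpol_shared_policy_lookup() is the existing
shmem helper]:

        /*
         * Sketch, not the actual patch:  resolve a policy from the
         * (mapping, index) pair that the page cache allocation path
         * already has, with no vma or faulting address in sight.
         * A NULL ->spolicy means "no shared policy"; fall back to
         * the task policy [NULL => system default], as today.
         */
        static struct mempolicy *policy_from_mapping(
                        struct address_space *mapping, pgoff_t index)
        {
                struct mempolicy *pol = NULL;

                if (mapping->spolicy)
                        pol = mpol_shared_policy_lookup(mapping->spolicy,
                                                        index);
                if (!pol)
                        pol = current->mempolicy;
                return pol;
        }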

The following series of patches against 2.6.17-rc4-mm1 implement
numa memory policy for shared, mmap()ed files.   Because files
mmap()ed SHARED are shared between tasks just like shared memory
regions, I've used the shared_policy infrastructure from shmem.
This infrastructure applies policies directly to ranges of a file
using an rb_tree.
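
For reference, the existing shmem infrastructure [from
include/linux/mempolicy.h, roughly as it looks in 2.6.17-rc] is just
an rb_tree of file-offset ranges, each node carrying a policy:

        struct sp_node {
                struct rb_node nd;
                unsigned long start, end;       /* range of file offsets */
                struct mempolicy *policy;
        };

        struct shared_policy {
                struct rb_root root;
                spinlock_t lock;
        };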

These patches result in the following internal and external
semantics:

1) The vma get|set_policy ops handle mem policies on sub-vma
   address ranges for shared, linear mappings [shmem, files]
   without splitting the vmas at the policy boundaries. Private
   and non-linear mappings still split the vma to apply policy.
   However, vma policy is still not visible to the filemap_nopage()
   fault path.  

2) As with shmem segments, the shared policies applied to shared
   file mappings persist as long as the inode remains--i.e., until
   the file is deleted or the inode recycled--whether or not any
   task has the file mapped or even open.  We could, I suppose,
   free the map on last close.

3) Vma policy of private mappings of files applies only when the
   task gets a private copy of the page--i.e., when do_wp_page()
   breaks the COW sharing and allocates a private page.  Private,
   read-only mappings of a file use the shared policy which 
   defaults, as before, to process policy, which itself defaults
   to, well... default policy.  This is how mapped files have
   always behaved.

	Could be addressed by passing vma,addr down to where
	page cache pages are allocated and use different policy
	for shared, linear vs private or nonlinear mappings.
	Worth the effort?

4) mbind(..., MPOL_MF_MOVE*, ...) will not migrate non-anon file backed
   pages in a private mapping if the file has a shared policy.
   Rather, only anon pages that the mapping task has "COWed"
   will be migrated.  If the mapped file does NOT have a shared
   policy or the file is mapped shared, then the pages will be
   migrated, subject to mapcount, as before.  [patch 6]


The patches, to follow, break out as follows:

1 - move-shared-policy-to-inode

	This patch generalizes the shared_policy infrastructure
	for use by generic files.   First, it adds a shared_policy
	pointer to the struct address_space.  This pointer is
	initialized to NULL on inode allocation, indicating the
	process policy.  The shared memory subsystem is then
	modified to use the shared policy struct out of the
	address_space [a.k.a. mapping] instead of explicitly
	using one embedded in the shmem inode info struct.

	Note, however, at this point we still use the embedded
	shared_policy.  We just point the mapping spolicy pointer
	at the embedded struct at init time.

	Tested to ensure shared policies still work for shmem.

2 - alloc-shared-policies

	This patch removes the shared_policy structs embedded in
	the shmem and hugetlbfs inode info structs, and dynamically
	allocates them, from a new kmem cache, when needed.

	Shmem will allocate a shared policy at segment init if
	the superblock [mount] specifies non-default policy.
	Otherwise, the shared_policy struct will only be allocated
	if a task mbind()s a range of the segment.

	Hugetlbfs just leaves the spolicy pointer NULL [default].
	It will be allocated by the shmem set_policy() vm_op if
	a task mbinds a range of the hugetlb segment.

	Note:  because the shared policy pointer in address_space
	is overhead incurred by every inode's address space, we
	only define it if CONFIG_NUMA.  Access it via wrappers
	to avoid excessive #ifdef in .c's.
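
        For example [hypothetical wrapper name--the actual patch may
        differ], something along these lines keeps the #ifdef confined
        to one header:

                #ifdef CONFIG_NUMA
                static inline struct shared_policy *
                mapping_shared_policy(struct address_space *mapping)
                {
                        return mapping->spolicy;
                }
                #else
                static inline struct shared_policy *
                mapping_shared_policy(struct address_space *mapping)
                {
                        return NULL;    /* no policy pointer w/o NUMA */
                }
                #endif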

3 - let-vma-policy-op-handle-subrange-policies

	Only shmem currently has a set_policy op, and it knows how
	to handle subranges via the rb_tree.  So, I'm proposing we
	adopt this semantic:  if a vma has a set_policy() op, it must
	know how to handle subranges and must have a get_policy() op that
	also knows how to handle sub-ranges.  These policy ops will
	ONLY be used for shared mappings [VM_SHARED] because we don't
	want private mappings mucking with the underlying object's
	shared policy.  Also, we can't let the policy ops handle
	it for nonlinear mappings [VM_NONLINEAR] without a lot more
	work.

	One BIG side-effect of this patch:  we no longer split
	vm areas to apply sub-range policies if the vma has
	a set_policy vm_op and is mapped linear, shared.
	However, for private mappings, the vma policy ops will not
	be used, even if they exist, and the vma will be split to
	bind a policy to a sub-range of the vma.
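
        In other words [illustrative only, not the patch's actual code],
        the test for "use the shared policy ops and do not split the
        vma" is roughly:

                static inline int vma_uses_shared_policy_ops(
                                struct vm_area_struct *vma)
                {
                        return vma->vm_ops && vma->vm_ops->set_policy &&
                               (vma->vm_flags & VM_SHARED) &&
                               !(vma->vm_flags & VM_NONLINEAR);
                }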

	Not splitting vma's for shared policies required mods
	to show_numa_map().  Handled by subsequent patch.

	migrate_pages_to() now uses page_address_in_vma() to
	allocate the destination page for each source page.  This is
	needed for shared policy subranges and gives a better
	location for each destination page now that Christoph
	is syncing from and to lists.

4 - generic-file-policy-vm-ops

	This patch clones the shmem set/get_policy vm_ops for use
	by generic mmap()ed files.  The functions are added to the
	generic_file_vm_ops struct. These functions operate on the
	shared_policy rb_tree associated with the inode, allocating
	one if necessary.

	Note:   these turned out to be identical in all but name to
	the shmem '_policy ops.  Maybe eliminate one copy and share?
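
        A sketch of the cloned ops [modeled on the existing
        shmem_set_policy()/shmem_get_policy(); mapping_shared_policy()
        is the assumed accessor from patch 2, and allocation of a
        still-missing shared_policy is omitted here]:

                static int generic_file_set_policy(struct vm_area_struct *vma,
                                                   struct mempolicy *new)
                {
                        struct address_space *mapping = vma->vm_file->f_mapping;

                        return mpol_set_shared_policy(
                                        mapping_shared_policy(mapping),
                                        vma, new);
                }

                static struct mempolicy *generic_file_get_policy(
                                struct vm_area_struct *vma, unsigned long addr)
                {
                        struct address_space *mapping = vma->vm_file->f_mapping;
                        pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT)
                                        + vma->vm_pgoff;

                        return mpol_shared_policy_lookup(
                                        mapping_shared_policy(mapping), idx);
                }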

5 - use-file-policy-for-page-cache

	This patch enhances page_cache_alloc[_cold]() to take an
	offset/index argument.  It uses this to lookup the policy
	using a new function get_file_policy() which is just a
	wrapper around mpol_shared_policy_lookup().  If the inode's
	[mapping's] shared_policy pointer is NULL, it just returns the
	process or default policy.

	Then page_cache_alloc[_cold]() calls a new function,
	alloc_page_pol() to evaluate the policy [at a specified
	offset] and allocate an appropriate page.  alloc_page_pol()
	shares some code with alloc_page_vma(), so this area is
	reworked to minimize duplication.  

	All callers of page_cache_alloc[_cold]() are modified to
	pass the file index/offset for which a page is requested.
	The index/offset is available at all call sites as it will
	be used to insert the page into the mapping's radix tree.
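
        Roughly [the extra argument and the get_file_policy()/
        alloc_page_pol() signatures are assumptions based on the
        description above], the allocators become:

                static inline struct page *
                page_cache_alloc(struct address_space *x, unsigned long idx)
                {
                        struct mempolicy *pol = get_file_policy(x, idx);

                        return alloc_page_pol(mapping_gfp_mask(x), pol, idx);
                }

                static inline struct page *
                page_cache_alloc_cold(struct address_space *x, unsigned long idx)
                {
                        struct mempolicy *pol = get_file_policy(x, idx);

                        return alloc_page_pol(mapping_gfp_mask(x) | __GFP_COLD,
                                              pol, idx);
                }

        Each call site then passes the index it is about to insert into
        the mapping's radix tree, e.g. page_cache_alloc_cold(mapping, index)
        in the readahead path.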

6 - fix migration of privately mapped files

	Prevent migration of non-anon pages in private mappings
	of files with shared policy.  Migration uses the mapping's
	vma policy.  vma policy does not apply to shared mmap()ed
	files.

7 - fix show_numa_map

	This patch fixes show_numa_map to correctly display numa
	maps of shmem and shared file mappings with sub-vma policies.
	The patch provides numa specific wrappers [nm_*] around
	the task map functions [m_*] in fs/proc/task_mmu.c to handle
	submaps and modifies show_numa_map() to use a passed in
	address range, instead of vm_start..vm_end.

Cursory testing with memtoy for shm segments, shared and privately
mapped files; single task and 2 tasks mmap()ing same file.  
Verified the semantics described above.

Tested numa maps with memtoy and multiple, disjoint ranges in 
submap--including situation where the buffer end occurs in the
middle of a submap.

Lots more testing needed--both functional and performance.

Lee Schermerhorn




* Re: Query re:  mempolicy for page cache pages
  2006-05-18 17:49 Query re: mempolicy for page cache pages Lee Schermerhorn
@ 2006-05-18 18:12 ` Andi Kleen
  2006-05-18 18:29   ` Lee Schermerhorn
  2006-05-18 18:14 ` Andrew Morton
  2006-05-18 18:15 ` Christoph Lameter
  2 siblings, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2006-05-18 18:12 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Christoph Lameter, Steve Longerbeam, Andrew Morton

> 1) What ever happened to Steve's patch set?

It needed more work, but he just disappeared at some point.



> 
> 2) Is this even a problem that needs solving, as Christoph seemed to think
> at one time?

The problem that hasn't been worked out is how to add persistent
attributes to files. Steve avoided that by limiting his patches to
ELF executables and using a static header there, but I'm not sure
that is generally useful enough for mainline. Just temporary
attributes for mmaps seem very narrow in usefulness.

And with xattrs it was unclear whether it would be costly or
even worth it.

At least in the general case just interleaving the file cache
based on a global setting or on cpuset seemed to work well enough
for most people.

Let's ask it differently. Do you have a real application that
would be improved by it? 


> 2) As with shmem segments, the shared policies applied to shared
>    file mappings persist as long as the inode remains--i.e., until
>    the file is deleted or the inode recycled--whether or not any
>    task has the file mapped or even open.  We could, I suppose,
>    free the map on last close.

The recycling is the problem. It's basically a lottery whether the
attributes are kept under high memory pressure or not.
Doesn't seem like a robust approach.

-Andi


* Re: Query re:  mempolicy for page cache pages
  2006-05-18 17:49 Query re: mempolicy for page cache pages Lee Schermerhorn
  2006-05-18 18:12 ` Andi Kleen
@ 2006-05-18 18:14 ` Andrew Morton
  2006-05-18 19:10   ` Lee Schermerhorn
  2006-05-18 18:15 ` Christoph Lameter
  2 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2006-05-18 18:14 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, clameter, ak, stevel

Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> 1) What ever happened to Steve's patch set?

They were based on Andi's 4-level-pagetable work.  Then we merged Nick's
4-level-pagetable work instead, so
numa-policies-for-file-mappings-mpol_mf_move.patch broke horridly and I
dropped it.  Steve said he'd redo the patch based on the new pagetable code
and would work with SGI on getting it benchmarked, but that obviously
didn't happen.

I was a bit concerned about the expansion in sizeof(address_space), but we
ended up agreeing that it's numa-only and NUMA machines tend to have lots
of memory anyway.  That being said, it would still be better to have a
pointer to a refcounted shared_policy in the address_space if poss, rather
than aggregating the whole thing.


* Re: Query re:  mempolicy for page cache pages
  2006-05-18 17:49 Query re: mempolicy for page cache pages Lee Schermerhorn
  2006-05-18 18:12 ` Andi Kleen
  2006-05-18 18:14 ` Andrew Morton
@ 2006-05-18 18:15 ` Christoph Lameter
  2006-05-18 19:27   ` Lee Schermerhorn
  2 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:15 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, Andi Kleen, Steve Longerbeam, Andrew Morton

On Thu, 18 May 2006, Lee Schermerhorn wrote:

> Below I've included an overview of a patch set that I've been working
> on.  I submitted a previous version [then called Page Cache Policy] back
> ~20Apr.  I started working on this because Christoph seemed to consider
> this a prerequisite for considering migrate-on-fault/lazy-migration/...
> Since the previous post, I have addressed comments [from Christoph] and
> kept the series up to date with the -mm tree.  

The prerequisite for automatic page migration schemes in the kernel is proof
that these automatic migrations consistently improve performance. We are 
still waiting on data showing that this is the case.

The particular automatic migration scheme that you proposed relies on 
allocating pages according to the memory allocation policy. 

The basic problem is first of all that the memory policies do not
necessarily describe how the user wants memory to be allocated. The user
may temporarily switch task policies to get specific allocation patterns.
So moving memory may misplace memory. We got around that by 
saying that we need to separately enable migration if a user 
wants it.

But even then we have the issue that the memory policies cannot 
describe proper allocation at all since allocation policies are 
ignored for file backed vmas. And this is the issue you are trying to 
address.

I think this is all far too complicated to do in kernel space and still
conceptually unclean. I would like to have all automatic migration schemes 
confined to user space. We will add an API that allows some process
to migrate pages at will.


* Re: Query re:  mempolicy for page cache pages
  2006-05-18 18:12 ` Andi Kleen
@ 2006-05-18 18:29   ` Lee Schermerhorn
  2006-05-18 18:41     ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Lee Schermerhorn @ 2006-05-18 18:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, Christoph Lameter, Steve Longerbeam, Andrew Morton

Thanks, Andi

On Thu, 2006-05-18 at 20:12 +0200, Andi Kleen wrote:
> > 1) What ever happened to Steve's patch set?
> 
> It needed more work, but he just disappeared at some point.

OK.

> > 
> > 2) Is this even a problem that needs solving, as Christoph seemed to think
> > at one time?
> 
> The problem that hasn't been worked out is how to add persistent
> attributes to files. Steve avoided that by limiting his patches to
> ELF executables and using a static header there, but I'm not sure
> that is generally useful enough for mainline. Just temporary
> attributes for mmaps seem very narrow in usefulness.
> 
> And with xattrs it was unclear whether it would be costly or
> even worth it.

I see...  Still, I find it "interesting" that an app doesn't have
explicit control over shared file mappings except via the process
policy.  I suppose if one applies explicit policy to all one's
vmas, then by process of elimination, the process policy would
only apply to its file mappings.

> 
> At least in the general case just interleaving the file cache
> based on a global setting or on cpuset seemed to work well enough
> for most people.

Yes, for not overburdening any single node.  Paul Jackson's 
"spread" patches address this.  Actually, for [some of] our platforms,
we can hardware interleave some % of memory at the cache line level.
This shows up as a memory-only node.  Some folks claim it would be
beneficial to be able to specify a page cache policy to prefer this
hardware interleaved node for the page cache.   I see that Ray
Bryant once proposed a patch to define a separate global and 
optional per process policy to be used for page cache pages. This
also "died on the vine"...

> 
> Let's ask it differently. Do you have a real application that
> would be improved by it? 

Uh, not at this point.  As I said, Christoph said he "wished this were
addressed" before thinking about migrate-on-fault, etc.  Since I wasn't
getting any traction with the migration stuff, and this didn't look too
difficult, I thought I'd look into it.
> 
> 
> > 2) As with shmem segments, the shared policies applied to shared
> >    file mappings persist as long as the inode remains--i.e., until
> >    the file is deleted or the inode recycled--whether or not any
> >    task has the file mapped or even open.  We could, I suppose,
> >    free the map on last close.
> 
> The recycling is the problem. It's basically a lottery whether the
> attributes are kept under high memory pressure or not.
> Doesn't seem like a robust approach.

Unless, of course, the file remains mapped/open, right?  Then aren't
the inode and address_space guaranteed to hang around?

Lee


* Re: Query re:  mempolicy for page cache pages
  2006-05-18 18:29   ` Lee Schermerhorn
@ 2006-05-18 18:41     ` Christoph Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:41 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andi Kleen, linux-mm, Steve Longerbeam, Andrew Morton

On Thu, 18 May 2006, Lee Schermerhorn wrote:

> Yes, for not overburdening any single node.  Paul Jackson's 
> "spread" patches address this.  Actually, for [some of] our platforms,
> we can hardware interleave some % of memory at the cache line level.
> This shows up as a memory-only node.  Some folks claim it would be
> beneficial to be able to specify a page cache policy to prefer this
> hardware interleaved node for the page cache.   I see that Ray
> Bryant once proposed a patch to define a separate global and 
> optional per process policy to be used for page cache pages. This
> also "died on the vine"...

I'd be very interested in some scheme to address the overburdening in a 
simple way. Replication may be useful in addition to spreading to limit 
the traffic on the NUMA interlink.


* Re: Query re:  mempolicy for page cache pages
  2006-05-18 18:14 ` Andrew Morton
@ 2006-05-18 19:10   ` Lee Schermerhorn
  0 siblings, 0 replies; 9+ messages in thread
From: Lee Schermerhorn @ 2006-05-18 19:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, clameter, ak, stevel

On Thu, 2006-05-18 at 11:14 -0700, Andrew Morton wrote:
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> >
> > 1) What ever happened to Steve's patch set?
> 
> They were based on Andi's 4-level-pagetable work.  Then we merged Nick's
> 4-level-pagetable work instead, so
> numa-policies-for-file-mappings-mpol_mf_move.patch broke horridly and I
> dropped it.  Steve said he'd redo the patch based on the new pagetable code
> and would work with SGI on getting it benchmarked, but that obviously
> didn't happen.

Thanks for the info Andrew.

> 
> I was a bit concerned about the expansion in sizeof(address_space), but we
> ended up agreeing that it's numa-only and NUMA machines tend to have lots
> of memory anyway.  That being said, it would still be better to have a
> pointer to a refcounted shared_policy in the address_space if poss, rather
> than aggregating the whole thing.

Yes, I was concerned about that, too.  I do use a pointer to the shared
policy struct in the address space, allocating it only if one actually
applies a policy.  A null pointer results in current behavior:  fall
back to process, then global default, policy.  Even so, the pointer
member would only be included under CONFIG_NUMA.

As far as reference counting:  I didn't think it would be necessary,
because it appears to me that the address space structs are one to one
with the inodes and persist as long as the inode does.  Is this correct?
If so, then the shared policy struct would only be deleted when the
inode goes away.  I may have a race, but I didn't think one could be
doing an insert or lookup w/o holding locks/references on structs that
would prevent the inode from being destroyed.

But, may turn out to be moot, heh?

Lee


* Re: Query re:  mempolicy for page cache pages
  2006-05-18 18:15 ` Christoph Lameter
@ 2006-05-18 19:27   ` Lee Schermerhorn
  2006-05-18 19:53     ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Lee Schermerhorn @ 2006-05-18 19:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Andi Kleen, Steve Longerbeam, Andrew Morton

On Thu, 2006-05-18 at 11:15 -0700, Christoph Lameter wrote:
> On Thu, 18 May 2006, Lee Schermerhorn wrote:
> 
> > Below I've included an overview of a patch set that I've been working
> > on.  I submitted a previous version [then called Page Cache Policy] back
> > ~20Apr.  I started working on this because Christoph seemed to consider
> > this a prerequisite for considering migrate-on-fault/lazy-migration/...
> > Since the previous post, I have addressed comments [from Christoph] and
> > kept the series up to date with the -mm tree.  
> 
> The prequisite for automatic page migration schemes in the kernel is proof 
> that these automatic migrations consistently improve performance. We are 
> still waiting on data showing that this is the case.

So far, all I have is evidence that good locality obtained at process
start up using default policy is broken by internode task migrations.
I have seen this penalty fixed by automatic migration for artificial
benchmarks [McCalpin STREAM] and know that this approach worked well
for TPC-like loads on previous NUMA systems I've worked on.  Currently,
I don't have access to TPC loads in my lab, but we're working on it.

And, if I could get the patches into the mm tree once basic migration
settles down, so as not to complicate your on-going work, then folks
could enable them to test the effects on their NUMA systems, if they
are at all concerned about load balancing upsetting previously
established locality.  You know: the open source, community
collaboration thing.  I guess I thought that's what the mm tree is
for.  Not everything that gets in there makes it to Linus' tree.

> 
> The particular automatic migration scheme that you proposed relies on 
> allocating pages according to the memory allocation policy. 

Makes eminent sense to me...

> 
> The basic problem is first of all that the memory policies do not
> necessarily describe how the user wants memory to be allocated. The user
> may temporarily switch task policies to get specific allocation patterns.
> So moving memory may misplace memory. We got around that by 
> saying that we need to separately enable migration if a user 
> wants it.

I'm aware of this.  I guess I always considered the temporary
switching of policies to achieve desired locality as a stopgap 
measure because of missing capabilities in the kernel.  

But, I agree that since this is the existing behavior and we don't
want to break user space, it should be off by default and
enabled when desired.  My latest patch series, which I haven't posted
for obvious reasons, does support per cpuset enabling of 
migrate-on-fault and auto-migration.

> 
> But even then we have the issue that the memory policies cannot 
> describe proper allocation at all since allocation policies are 
> ignored for file backed vmas. And this is the issue you are trying to 
> address.

Right!  For consistency's sake, I guess.  I always looked at it as
the migrate-on-fault was "correcting" page misplacement at original
fault time.  Said with tongue only slightly in cheek ;-).

> 
> I think this is all far to complicated to do in kernel space and still 
> conceptually unclean. I would like to have all automatic migration schemes 
> confined to user space. We will add an API that allows some process
> to migrate pages at will.

We want pages to migrate when the load balancer decides to move the
process to a new node, away from its memory.  I suppose internode migration
could also be accidentally reuniting a task with its memory footprint,
but the higher the node count, the lower the probability of this.  And,
if we did do some form of page migration on task migration, I think
we'd need to consider the cost of page migration in the decision to
migrate.   I see that previous attempts to consider memory footprint in
internode migration seem to have gone nowhere, tho'.  Probably not worth
it for some nominally NUMA platforms.  Even for platforms where it
might make sense, tuning the algorithms will, I think, require data
that can best be obtained from testing on multiple platforms.  Just
dreaming, I know...

As far as doing it in user space:  I suppose one could deliver a
notification to the process and have it migrate pages at that point.
Sounds a lot more inefficient than just unmapping pages that have 
default policy as the process returns to user space on the new node
[easy to hook in w/o adding new mechanism] and letting the pages fault
over as they are referenced.  After all, we don't know which pages
will be touched before the next internode migration.  I don't think
that the application itself would have a good idea of this at the 
time of the migration.

And, I think, complication and cleanliness is in the eye of the
beholder.
'Nuf said on that point... ;-)

Lee




* Re: Query re:  mempolicy for page cache pages
  2006-05-18 19:27   ` Lee Schermerhorn
@ 2006-05-18 19:53     ` Christoph Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2006-05-18 19:53 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, Andi Kleen, Steve Longerbeam, Andrew Morton

On Thu, 18 May 2006, Lee Schermerhorn wrote:

> So far, all I have is evidence that good locality obtained at process
> start up using default policy is broken by internode task migrations.

Internode task migrations are tightly controlled by the scheduler and by
cpusets.

> > The basic problem is first of all that the memory policies do not
> > necessarily describe how the user wants memory to be allocated. The user
> > may temporarily switch task policies to get specific allocation patterns.
> > So moving memory may misplace memory. We got around that by 
> > saying that we need to separately enable migration if a user 
> > wants it.
> 
> I'm aware of this.  I guess I always considered the temporary
> switching of policies to achieve desired locality as a stopgap 
> measure because of missing capabilities in the kernel.  

Policy switching is part of the design for memory policies. It is not
a stopgap measure.

> > But even then we have the issue that the memory policies cannot 
> > describe proper allocation at all since allocation policies are 
> > ignored for file backed vmas. And this is the issue you are trying to 
> > address.
> 
> Right!  For consistency's sake, I guess.  I always looked at it as
> the migrate-on-fault was "correcting" page misplacement at original
> fault time.  Said with tongue only slightly in cheek ;-).

Consistency in the sense that we would use memory policies as 
allocation restrictions. However, they are really placement methods. Only 
MPOL_BIND is truly an allocation restriction.

> We want pages to migrate when the load balancer decides to move the
> process to a new node, away from its memory.  I suppose internode migration

We can do that from user space with a scheduling daemon that may have a
longer range view of things. Also an execution thread may be temporarily
moved to another node and then come back later. We really need a much more
complex scheduler to take all of this into account, and that, I would say,
also belongs in user space.

> As far as doing it in user space:  I suppose one could deliver a
> notification to the process and have it migrate pages at that point.

Right.

> Sounds a lot more inefficient than just unmapping pages that have 
> default policy as the process returns to user space on the new node
> [easy to hook in w/o adding new mechanism] and letting the pages fault
> over as they are referenced.  After all, we don't know which pages
> will be touched before the next internode migration.  I don't think
> that the application itself would have a good idea of this at the 
> time of the migration.

There would have to be some interface to the scheduler. We never know
which pages are going to be touched next, and in our experience the
automatic page migration schemes frequently make the wrong decisions. You
need to have a huge number of accesses to an off node page in order to justify
migrating a single page. All these experiments had better live in user space.
Migration of a single page requires taking numerous locks. True, the
expense rises with deferring the decision to user space, but user space can
have more sophisticated databases on page use. Maybe that will finally get
us to an automatic page migration scheme that actually improves system
performance.

> And, I think, complication and cleanliness is in the eye of the
> beholder.
> 'Nuf said on that point... ;-)

True. But let's keep the kernel as simple as possible.

