* Query re: mempolicy for page cache pages
@ 2006-05-18 17:49 Lee Schermerhorn
From: Lee Schermerhorn @ 2006-05-18 17:49 UTC
To: linux-mm; +Cc: Christoph Lameter, Andi Kleen, Steve Longerbeam, Andrew Morton
Below I've included an overview of a patch set that I've been working
on. I submitted a previous version [then called Page Cache Policy] back
~20Apr. I started working on this because Christoph seemed to consider
this a prerequisite for considering migrate-on-fault/lazy-migration/...
Since the previous post, I have addressed comments [from Christoph] and
kept the series up to date with the -mm tree.
Just today, I was cleaning up some really old patches on my system and
came across a patch from Steve Longerbeam that passes a page index to
page_cache_alloc_cold()--exactly what my patch series does. I took a
look back through the -mm archives [Yeah, I should have done this
earlier :-(] and found that back in Oct'04, Steve had posted a patch
[set] that takes essentially the same approach to solve the same
"problem".
Since Steve's patches never made it into the kernel and don't exist in
the -mm tree either, I'm wondering why they were dropped. I.e., is
there some fundamental objection to applying shared policy to memory
mapped files and using this policy for page cache allocations? Rather
than bomb the mailing list with yet another set of dead-end patches, I'm
sending out just this overview, with the following questions:
1) What ever happened to Steve's patch set?
2) Is this even a problem that needs solving, as Christoph seemed to think
at one time?
3) If so, is this the right approach? I.e., should I post the actual
patches?
4) If you don't agree with this approach, how would you go about it?
Regards,
Lee
P.S., tarballs containing the entire series, along with my "lazy
migration" patches can be found at:
http://free.linux.hp.com/~lts/Patches/PageMigration/ in -rcX-mmY
subdirs.
=====================================================================
Mapped File Policy V0.1 0/7 Overview
Formerly "Page Cache Policy" series.
V0.1 - renamed and revised the series. I think this name and
breakout makes more sense.
Prevent migration of file backed pages with shared
policy from private mappings thereof.
Also, address impact on show_numa_map() of patch #3
of this series.
refreshed against 2.6.17-rc4-mm1.
Basic "problem": currently [2.6.17-rcx], files mmap()ed SHARED
do not follow mem policy applied to the mapped regions. Instead,
shared, file backed pages are allocated using the allocating
task's task policy. This is inconsistent with the way that anon
and shmem pages are handled.
One reason for this is that down where pages are allocated for
file backed pages, the faulting (mm, vma, address) are not
available to compute the policy. However, we do have the inode
[via the address space] and file index/offset available. If the
applicable policy could be determined from just this info, the
vma and address would not be required.
The following series of patches against 2.6.17-rc4-mm1 implement
numa memory policy for shared, mmap()ed files. Because files
mmap()ed SHARED are shared between tasks just like shared memory
regions, I've used the shared_policy infrastructure from shmem.
This infrastructure applies policies directly to ranges of a file
using an rb_tree.
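To make the range-based lookup concrete, here is a minimal user-space sketch. It is not code from the patch series. It assumes each policy covers a half-open range of file page indexes and, for brevity, swaps the rb_tree for a sorted array with binary search. The names policy_mode, policy_range and range_lookup are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a policy; the kernel uses struct mempolicy. */
enum policy_mode { POL_DEFAULT, POL_PREFERRED, POL_BIND, POL_INTERLEAVE };

/* One policy covering file page indexes [start, end). */
struct policy_range {
	unsigned long start;
	unsigned long end;
	enum policy_mode mode;
};

/*
 * Find the range covering page index idx.
 * Assumption: ranges[] is sorted by start and ranges do not overlap.
 * A sorted array stands in for the rb_tree here.
 * Returns NULL when no range covers idx.
 */
const struct policy_range *
range_lookup(const struct policy_range *ranges, size_t n, unsigned long idx)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (idx < ranges[mid].start)
			hi = mid;		/* idx lies left of this range */
		else if (idx >= ranges[mid].end)
			lo = mid + 1;		/* idx lies right of this range */
		else
			return &ranges[mid];	/* start <= idx < end */
	}
	return NULL;
}
```

The point of the sketch is that only the file offset is needed for the lookup; no vma or faulting address is involved.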
These patches result in the following internal and external
semantics:
1) The vma get|set_policy ops handle mem policies on sub-vma
address ranges for shared, linear mappings [shmem, files]
without splitting the vmas at the policy boundaries. Private
and non-linear mappings still split the vma to apply policy.
However, vma policy is still not visible to the filemap_nopage()
fault path.
2) As with shmem segments, the shared policies applied to shared
file mappings persist as long as the inode remains--i.e., until
the file is deleted or the inode recycled--whether or not any
task has the file mapped or even open. We could, I suppose,
free the map on last close.
3) Vma policy of private mappings of files only applies when the
task gets a private copy of the page--i.e., when do_wp_page()
breaks the COW sharing and allocates a private page. Private,
read-only mappings of a file use the shared policy which
defaults, as before, to process policy, which itself defaults
to, well... default policy. This is how mapped files have
always behaved.
This could be addressed by passing vma,addr down to where
page cache pages are allocated and using a different policy
for shared, linear vs private or nonlinear mappings.
Worth the effort?
4) mbind(... MPOL_MF_MOVE*, ...) will not migrate non-anon file backed
pages in a private mapping if the file has a shared policy.
Rather, only anon pages that the mapping task has "COWed"
will be migrated. If the mapped file does NOT have a shared
policy or the file is mapped shared, then the pages will be
migrated, subject to mapcount, as before. [patch 6]
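The migration rule in 4) can be restated as a small predicate. This is an illustrative user-space sketch, not the patch itself: may_migrate_page and its boolean parameters are hypothetical names, and the mapcount check mentioned above is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the rule in 4):
 *  - anon pages (including pages the task has COWed) may be migrated;
 *  - non-anon file pages in a private mapping are NOT migrated when the
 *    file has a shared policy;
 *  - otherwise the page may be migrated.
 * The real code is additionally subject to mapcount; not modelled here.
 */
bool may_migrate_page(bool page_is_anon, bool mapping_is_shared,
		      bool file_has_shared_policy)
{
	if (page_is_anon)
		return true;
	if (!mapping_is_shared && file_has_shared_policy)
		return false;
	return true;
}
```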
The patches, to follow, break out as follows:
1 - move-shared-policy-to-inode
This patch generalizes the shared_policy infrastructure
for use by generic files. First, it adds a shared_policy
pointer to the struct address_space. This pointer is
initialized to NULL on inode allocation, indicating the
process policy. The shared memory subsystem is then
modified to use the shared policy struct out of the
address_space [a.k.a. mapping] instead of explicitly
using one embedded in the shmem inode info struct.
Note, however, at this point we still use the embedded
shared_policy. We just point the mapping spolicy pointer
at the embedded struct at init time.
Tested to ensure shared policies still work for shmem.
2 - alloc-shared-policies
This patch removes the shared_policy structs embedded in
the shmem and hugetlbfs inode info structs, and dynamically
allocates them, from a new kmem cache, when needed.
Shmem will allocate a shared policy at segment init if
the superblock [mount] specifies non-default policy.
Otherwise, the shared_policy struct will only be allocated
if a task mbind()s a range of the segment.
Hugetlbfs just leaves the spolicy pointer NULL [default].
It will be allocated by the shmem set_policy() vm_op if
a task mbinds a range of the hugetlb segment.
Note: because the shared policy pointer in address_space
is overhead incurred by every inode's address space, we
only define it if CONFIG_NUMA. Access it via wrappers
to avoid excessive #ifdef in .c's.
3 - let-vma-policy-op-handle-subrange-policies
Only shmem currently has a set_policy op, and it knows how
to handle subranges via the rb_tree. So, I'm proposing we
adopt this semantic: if a vma has a set_policy() op, it must
know how to handle subranges and must have a get_policy() op that
also knows how to handle sub-ranges. These policy ops will
ONLY be used for shared mappings [VM_SHARED] because we don't
want private mappings mucking with the underlying object's
shared policy. Also, we can't let the policy ops handle
it for nonlinear mappings [VM_NONLINEAR] without a lot more
work.
One BIG side-effect of this patch: we no longer split
vm areas to apply sub-range policies if the vma has
a set_policy vm_op and is mapped linear, shared.
However, for private mappings, the vma policy ops will not
be used, even if they exist, and the vma will be split to
bind a policy to a sub-range of the vma.
Not splitting vma's for shared policies required mods
to show_numa_map(). Handled by subsequent patch.
migrate_pages_to() now uses page_address_in_vma() to
alloc destination page for each source page. This is
needed for shared policy subranges and gives a better
location for each destination page now that Christoph
is syncing from and to lists.
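The semantic proposed for patch 3 boils down to one test on the vma. The following is a user-space sketch under stated assumptions: vma_model, the MODEL_VM_* flag values and both function names are hypothetical and do not match the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's VM_SHARED/VM_NONLINEAR values differ. */
#define MODEL_VM_SHARED		0x1u
#define MODEL_VM_NONLINEAR	0x2u

/* Hypothetical, stripped-down model of a vma. */
struct vma_model {
	unsigned int flags;
	bool has_set_policy;	/* vm_ops provides set_policy()/get_policy() */
};

/* Policy ops are used only for shared, linear mappings that provide them. */
bool use_policy_ops(const struct vma_model *vma)
{
	return vma->has_set_policy &&
	       (vma->flags & MODEL_VM_SHARED) &&
	       !(vma->flags & MODEL_VM_NONLINEAR);
}

/* Every other case still splits the vma to apply a sub-range policy. */
bool must_split_vma(const struct vma_model *vma)
{
	return !use_policy_ops(vma);
}
```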
4 - generic-file-policy-vm-ops
This patch clones the shmem set/get_policy vm_ops for use
by generic mmap()ed files. The functions are added to the
generic_file_vm_ops struct. These functions operate on the
shared_policy rb_tree associated with the inode, allocating
one if necessary.
Note: these turned out to be identical in all but name to
the shmem set/get_policy ops. Maybe eliminate one copy and share?
5 - use-file-policy-for-page-cache
This patch enhances page_cache_alloc[_cold]() to take an
offset/index argument. It uses this to look up the policy
using a new function get_file_policy() which is just a
wrapper around mpol_shared_policy_lookup(). If the inode's
[mapping's] shared_policy pointer is NULL, it just returns the
process or default policy.
Then page_cache_alloc[_cold]() calls a new function,
alloc_page_pol() to evaluate the policy [at a specified
offset] and allocate an appropriate page. alloc_page_pol()
shares some code with alloc_page_vma(), so this area is
reworked to minimize duplication.
All callers of page_cache_alloc[_cold]() are modified to
pass the file index/offset for which a page is requested.
The index/offset is available at all call sites as it will
be used to insert the page into the mapping's radix tree.
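The fallback order described for patch 5 can be sketched in user space as follows. This is not the patch's code: every type and function name here is hypothetical, a linear scan stands in for the rb_tree lookup, and the fallback for an index that no range covers is an assumption, since the overview only spells out the NULL-pointer case.

```c
#include <assert.h>
#include <stddef.h>

enum policy_mode { POL_DEFAULT, POL_PREFERRED, POL_BIND, POL_INTERLEAVE };

struct policy { enum policy_mode mode; };

/* One policy covering file page indexes [start, end). */
struct policy_range {
	unsigned long start, end;
	struct policy pol;
};

/* Hypothetical model of a per-file shared policy. */
struct shared_policy_model {
	const struct policy_range *ranges;
	size_t nr;
};

/* Hypothetical model of an address_space; spolicy is NULL by default. */
struct mapping_model {
	const struct shared_policy_model *spolicy;
};

static const struct policy default_policy = { POL_DEFAULT };

/*
 * Pick the policy for page index idx of a file:
 *  1. a shared policy range covering idx, if the file has one;
 *  2. otherwise the task (process) policy, if set;
 *  3. otherwise the system default policy.
 */
const struct policy *
file_policy_model(const struct mapping_model *mapping, unsigned long idx,
		  const struct policy *task_policy)
{
	const struct shared_policy_model *sp = mapping->spolicy;

	if (sp) {
		for (size_t i = 0; i < sp->nr; i++)
			if (idx >= sp->ranges[i].start && idx < sp->ranges[i].end)
				return &sp->ranges[i].pol;
	}
	return task_policy ? task_policy : &default_policy;
}
```

Files that never had a policy applied take the NULL branch and behave exactly as before.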
6 - fix migration of privately mapped files
Prevent migration of non-anon pages in private mappings
of files with shared policy. Migration uses the mapping's
vma policy. vma policy does not apply to shared mmap()ed
files.
7 - fix show_numa_map
This patch fixes show_numa_map to correctly display numa
maps of shmem and shared file mappings with sub-vma policies.
The patch provides numa specific wrappers [nm_*] around
the task map functions [m_*] in fs/proc/task_mmu.c to handle
submaps and modifies show_numa_map() to use a passed in
address range, instead of vm_start..vm_end.
Cursory testing with memtoy for shm segments, shared and privately
mapped files; single task and 2 tasks mmap()ing same file.
Verified the semantics described above.
Tested numa maps with memtoy and multiple, disjoint ranges in
submap--including situation where the buffer end occurs in the
middle of a submap.
Lots more testing needed--both functional and performance.
Lee Schermerhorn
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: Query re: mempolicy for page cache pages
From: Andi Kleen @ 2006-05-18 18:12 UTC
To: Lee Schermerhorn
Cc: linux-mm, Christoph Lameter, Steve Longerbeam, Andrew Morton
> 1) What ever happened to Steve's patch set?
It needed more work, but he just disappeared at some point.
>
> 2) Is this even a problem that needs solving, as Christoph seemed to think
> at one time?
The problem that hasn't been worked out is how to add persistent
attributes to files. Steve avoided that by limiting his to only
ELF executables and using a static header there, but I'm not
sure that is generally useful enough for mainline. Just temporary
policy for mmaps seems very narrow in usefulness.
And with xattrs it was unclear whether it would be costly or not and
even worth it.
At least in the general case just interleaving the file cache
based on a global setting or on cpuset seemed to work well enough
for most people.
Let's ask it differently. Do you have a real application that
would be improved by it?
> 2) As with shmem segments, the shared policies applied to shared
> file mappings persist as long as the inode remains--i.e., until
> the file is deleted or the inode recycled--whether or not any
> task has the file mapped or even open. We could, I suppose,
> free the map on last close.
The recycling is the problem. It's basically a lottery whether the
attributes are kept under high memory pressure or not.
Doesn't seem like a robust approach.
-Andi
* Re: Query re: mempolicy for page cache pages
From: Andrew Morton @ 2006-05-18 18:14 UTC
To: Lee Schermerhorn; +Cc: linux-mm, clameter, ak, stevel
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> 1) What ever happened to Steve's patch set?
They were based on Andi's 4-level-pagetable work. Then we merged Nick's
4-level-pagetable work instead, so
numa-policies-for-file-mappings-mpol_mf_move.patch broke horridly and I
dropped it. Steve said he'd redo the patch based on the new pagetable code
and would work with SGI on getting it benchmarked, but that obviously
didn't happen.
I was a bit concerned about the expansion in sizeof(address_space), but we
ended up agreeing that it's numa-only and NUMA machines tend to have lots
of memory anyway. That being said, it would still be better to have a
pointer to a refcounted shared_policy in the address_space if poss, rather
than aggregating the whole thing.
* Re: Query re: mempolicy for page cache pages
From: Christoph Lameter @ 2006-05-18 18:15 UTC
To: Lee Schermerhorn; +Cc: linux-mm, Andi Kleen, Steve Longerbeam, Andrew Morton
On Thu, 18 May 2006, Lee Schermerhorn wrote:
> Below I've included an overview of a patch set that I've been working
> on. I submitted a previous version [then called Page Cache Policy] back
> ~20Apr. I started working on this because Christoph seemed to consider
> this a prerequisite for considering migrate-on-fault/lazy-migration/...
> Since the previous post, I have addressed comments [from Christoph] and
> kept the series up to date with the -mm tree.
The prerequisite for automatic page migration schemes in the kernel is proof
that these automatic migrations consistently improve performance. We are
still waiting on data showing that this is the case.
The particular automatic migration scheme that you proposed relies on
allocating pages according to the memory allocation policy.
The basic problem is first of all that the memory policies do not
necessarily describe how the user wants memory to be allocated. The user
may temporarily switch task policies to get specific allocation patterns.
So moving memory may misplace memory. We got around that by
saying that we need to separately enable migration if a user
wants it.
But even then we have the issue that the memory policies cannot
describe proper allocation at all since allocation policies are
ignored for file backed vmas. And this is the issue you are trying to
address.
I think this is all far too complicated to do in kernel space and still
conceptually unclean. I would like to have all automatic migration schemes
confined to user space. We will add an API that allows some process
to migrate pages at will.
* Re: Query re: mempolicy for page cache pages
From: Lee Schermerhorn @ 2006-05-18 18:29 UTC
To: Andi Kleen; +Cc: linux-mm, Christoph Lameter, Steve Longerbeam, Andrew Morton
Thanks, Andi
On Thu, 2006-05-18 at 20:12 +0200, Andi Kleen wrote:
> > 1) What ever happened to Steve's patch set?
>
> It needed more work, but he just disappeared at some point.
OK.
> >
> > 2) Is this even a problem that needs solving, as Christoph seemed to think
> > at one time?
>
> The problem that hasn't been worked out is how to add persistent
> attributes to files. Steve avoided that by limiting his to only
> ELF executables and using a static header there, but I'm not
> sure that is generally useful enough for mainline. Just temporary
> policy for mmaps seems very narrow in usefulness.
>
> And with xattrs it was unclear whether it would be costly or not and
> even worth it.
I see... Still, I find it "interesting" that an app doesn't have
explicit control over shared file mappings except via the process
policy. I suppose if one applies explicit policy to all one's
vmas, then by process of elimination, the process policy would
only apply to its file mappings.
>
> At least in the general case just interleaving the file cache
> based on a global setting or on cpuset seemed to work well enough
> for most people.
Yes, for not overburdening any single node. Paul Jackson's
"spread" patches address this. Actually, for [some of] our platforms,
we can hardware interleave some % of memory at the cache line level.
This shows up as a memory-only node. Some folks claim it would be
beneficial to be able to specify a page cache policy to prefer this
hardware interleaved node for the page cache. I see that Ray
Bryant once proposed a patch to define a separate global and
optional per process policy to be used for page cache pages. This
also "died on the vine"...
>
> Let's ask it differently. Do you have a real application that
> would be improved by it?
Uh, not at this point. As I said, Christoph said he "wished this were
addressed" before thinking about migrate-on-fault, etc. Since I wasn't
getting any traction with the migration stuff, and this didn't look too
difficult, I thought I'd look into it.
>
>
> > 2) As with shmem segments, the shared policies applied to shared
> > file mappings persist as long as the inode remains--i.e., until
> > the file is deleted or the inode recycled--whether or not any
> > task has the file mapped or even open. We could, I suppose,
> > free the map on last close.
>
> The recycling is the problem. It's basically a lottery whether the
> attributes are kept under high memory pressure or not.
> Doesn't seem like a robust approach.
Unless, of course, the file remains mapped/open, right? Then isn't
the inode and address_space guaranteed to hang around?
Lee
* Re: Query re: mempolicy for page cache pages
From: Christoph Lameter @ 2006-05-18 18:41 UTC
To: Lee Schermerhorn; +Cc: Andi Kleen, linux-mm, Steve Longerbeam, Andrew Morton
On Thu, 18 May 2006, Lee Schermerhorn wrote:
> Yes, for not overburdening any single node. Paul Jackson's
> "spread" patches address this. Actually, for [some of] our platforms,
> we can hardware interleave some % of memory at the cache line level.
> This shows up as a memory-only node. Some folks claim it would be
> beneficial to be able to specify a page cache policy to prefer this
> hardware interleaved node for the page cache. I see that Ray
> Bryant once proposed a patch to define a separate global and
> optional per process policy to be used for page cache pages. This
> also "died on the vine"...
I'd be very interested in some scheme to address the overburdening in a
simple way. Replication may be useful in addition to spreading to limit
the traffic on the NUMA interlink.
* Re: Query re: mempolicy for page cache pages
From: Lee Schermerhorn @ 2006-05-18 19:10 UTC
To: Andrew Morton; +Cc: linux-mm, clameter, ak, stevel
On Thu, 2006-05-18 at 11:14 -0700, Andrew Morton wrote:
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> >
> > 1) What ever happened to Steve's patch set?
>
> They were based on Andi's 4-level-pagetable work. Then we merged Nick's
> 4-level-pagetable work instead, so
> numa-policies-for-file-mappings-mpol_mf_move.patch broke horridly and I
> dropped it. Steve said he'd redo the patch based on the new pagetable code
> and would work with SGI on getting it benchmarked, but that obviously
> didn't happen.
Thanks for the info Andrew.
>
> I was a bit concerned about the expansion in sizeof(address_space), but we
> ended up agreeing that it's numa-only and NUMA machines tend to have lots
> of memory anyway. That being said, it would still be better to have a
> pointer to a refcounted shared_policy in the address_space if poss, rather
> than aggregating the whole thing.
Yes, I was concerned about that, too. I do use a pointer to the shared
policy struct in the address space, allocating it only if one actually
applies a policy. A null pointer results in current behavior: fall back
to process then global default policy. Even so, the pointer member would
only be included under CONFIG_NUMA.
As far as reference counting: I didn't think it would be necessary, because
it appears to me that the address space structs are one to one with the
inodes and persist as long as the inode does. Is this correct? If so,
then the shared policy struct would only be deleted when the inode goes
away. I may have a race, but I didn't think one could be doing an insert
or lookup w/o holding locks/references on structs that would prevent the
inode from being destroyed.
But, may turn out to be moot, heh?
Lee
* Re: Query re: mempolicy for page cache pages
From: Lee Schermerhorn @ 2006-05-18 19:27 UTC
To: Christoph Lameter; +Cc: linux-mm, Andi Kleen, Steve Longerbeam, Andrew Morton
On Thu, 2006-05-18 at 11:15 -0700, Christoph Lameter wrote:
> On Thu, 18 May 2006, Lee Schermerhorn wrote:
>
> > Below I've included an overview of a patch set that I've been working
> > on. I submitted a previous version [then called Page Cache Policy] back
> > ~20Apr. I started working on this because Christoph seemed to consider
> > this a prerequisite for considering migrate-on-fault/lazy-migration/...
> > Since the previous post, I have addressed comments [from Christoph] and
> > kept the series up to date with the -mm tree.
>
> The prerequisite for automatic page migration schemes in the kernel is proof
> that these automatic migrations consistently improve performance. We are
> still waiting on data showing that this is the case.
So far, all I have is evidence that good locality obtained at process
start up using default policy is broken by internode task migrations.
I have seen this penalty fixed by automatic migration for artificial
benchmarks [McCalpin STREAM] and know that this approach worked well
for TPC-like loads on previous NUMA systems I've worked on. Currently,
I don't have access to TPC loads in my lab, but we're working on it.
And, if I could get the patches into the mm tree, once basic
migration settles down so as not to complicate your on-going work,
then folks could enable them to test the effects on their NUMA systems,
if they are at all concerned about load balancing upsetting previously
established locality. You know: the open source, community collaboration
thing. I guess I thought that's what the mm tree is for. Not everything
that gets in there makes it to Linus' tree.
>
> The particular automatic migration scheme that you proposed relies on
> allocating pages according to the memory allocation policy.
Makes eminent sense to me...
>
> The basic problem is first of all that the memory policies do not
> necessarily describe how the user wants memory to be allocated. The user
> may temporarily switch task policies to get specific allocation patterns.
> So moving memory may misplace memory. We got around that by
> saying that we need to separately enable migration if a user
> wants it.
I'm aware of this. I guess I always considered the temporary
switching of policies to achieve desired locality as a stopgap
measure because of missing capabilities in the kernel.
But, I agree that since this is the existing behavior and we don't
want to break user space, that it should be off by default and
enabled when desired. My latest patch series, which I haven't posted
for obvious reasons, does support per cpuset enabling of
migrate-on-fault and auto-migration.
>
> But even then we have the issue that the memory policies cannot
> describe proper allocation at all since allocation policies are
> ignored for file backed vmas. And this is the issue you are trying to
> address.
Right! For consistency's sake, I guess. I always looked at it as
the migrate-on-fault was "correcting" page misplacement at original
fault time. Said with tongue only slightly in cheek ;-).
>
> I think this is all far too complicated to do in kernel space and still
> conceptually unclean. I would like to have all automatic migration schemes
> confined to user space. We will add an API that allows some process
> to migrate pages at will.
We want pages to migrate when the load balancer decides to move the
process to a new node, away from its memory. I suppose internode migration
could also be accidentally reuniting a task with its memory footprint,
but the higher the node count, the lower the probability of this. And,
if we did do some form of page migration on task migration, I think
we'd need to consider the cost of page migration in the decision to
migrate. I see that previous attempts to consider memory footprint in
internode migration seem to have gone nowhere, tho'. Probably not worth
it for some nominally numa platforms. Even for platforms where it
might make sense, tuning the algorithms will, I think, require data
that can best be obtained from testing on multiple platforms. Just
dreaming, I know...
As far as doing it in user space: I suppose one could deliver a
notification to the process and have it migrate pages at that point.
Sounds a lot more inefficient than just unmapping pages that have
default policy as the process returns to user space on the new node
[easy to hook in w/o adding new mechanism] and letting the pages fault
over as they are referenced. After all, we don't know which pages
will be touched before the next internode migration. I don't think
that the application itself would have a good idea of this at the
time of the migration.
And, I think, complication and cleanliness is in the eye of the
beholder. 'Nuf said on that point... ;-)
Lee
* Re: Query re: mempolicy for page cache pages
From: Christoph Lameter @ 2006-05-18 19:53 UTC
On Thu, 18 May 2006, Lee Schermerhorn wrote:
> So far, all I have is evidence that good locality obtained at process
> start up using default policy is broken by internode task migrations.
Internode task migrations are tightly controlled by the scheduler and by
cpusets.
> > The basic problem is first of all that the memory policies do not
> > necessarily describe how the user wants memory to be allocated. The user
> > may temporarily switch task policies to get specific allocation patterns.
> > So moving memory may misplace memory. We got around that by
> > saying that we need to separately enable migration if a user
> > wants it.
>
> I'm aware of this. I guess I always considered the temporary
> switching of policies to achieve desired locality as a stopgap
> measure because of missing capabilities in the kernel.
Policy switching is part of the design for memory policies. It is not
a stopgap measure.
> > But even then we have the issue that the memory policies cannot
> > describe proper allocation at all since allocation policies are
> > ignored for file backed vmas. And this is the issue you are trying to
> > address.
>
> Right! For consistency's sake, I guess. I always looked at it as
> the migrate-on-fault was "correcting" page misplacement at original
> fault time. Said with tongue only slightly in cheek ;-).
Consistency in the sense that we would use memory policies as
allocation restrictions. However, they are really placement methods. Only
MPOL_BIND is truly an allocation restriction.
> We want pages to migrate when the load balancer decides to move the
> process to a new node, away from its memory. I suppose internode migration
We can do that from user space by a scheduling daemon that may have a
longer range view of things. Also an execution thread may be temporarily
moved to another node and then come back later. We really need a much more
complex scheduler to take all of this into account and that I would say
also belongs in user space.
> As far as doing it in user space: I suppose one could deliver a
> notification to the process and have it migrate pages at that point.
Right.
> Sounds a lot more inefficient than just unmapping pages that have
> default policy as the process returns to user space on the new node
> [easy to hook in w/o adding new mechanism] and letting the pages fault
> over as they are referenced. After all, we don't know which pages
> will be touched before the next internode migration. I don't think
> that the application itself would have a good idea of this at the
> time of the migration.
There would have to be some interface to the scheduler. We never
know which pages are going to be touched next and in our experience the
automatic page migration schemes frequently make the wrong decisions. You
need to have a huge number of accesses to an off node page in order to justify
migrating a single page. All these experiments better live in user space.
Migration of a single page requires taking numerous locks. True, the
expense rises with deferring the decision to user space but user space can
have more sophisticated databases on page use. Maybe that will finally get
us to an automatic page migration scheme that is actually improving system
performance.
> And, I think, complication and cleanliness is in the eye of the
> beholder.
> 'Nuf said on that point... ;-)
True. But let's keep the kernel as simple as possible.