* Re: [PATCH 0/2] mm: memory policy for page cache allocation
[not found] ` <fa.ep2m52m.1p0edrq@ifi.uio.no>
@ 2004-09-24 15:43 ` Ray Bryant
2004-09-25 5:40 ` Steve Longerbeam
0 siblings, 1 reply; 6+ messages in thread
From: Ray Bryant @ 2004-09-24 15:43 UTC (permalink / raw)
To: Steve Longerbeam; +Cc: linux-mm, lse-tech, linux-kernel
Steve Longerbeam wrote:
> Ray Bryant wrote:
>
>> Hi Steve,
>>
<snip>
> So in MTA there is only one policy, which is very similar to the BIND
> policy in
> 2.6.8.
>
> MTA requires per mapped file policies. The patch I posted adds a
> shared_policy tree to the address_space object, so that every file
> can have its own policy for page cache allocations. A mapped file
> can have a tree of policies, one for each mapped region of the file,
> for instance, text and initialized data. With the patch, file mapped
> policies would work across all filesystems, and the specific support
> in tmpfs and hugetlbfs can be removed.
>
Just mapped files, not regular files as well? So you don't care about
placement of page cache pages for regular files?
> The goal of MTA is to direct an entire program's resident pages (text
> and data regions of the executable and all its shared libs) to a
> single node or a specific set of nodes. The primary use of MTA (by
> the customer) is to allow portions of memory to be powered off for
> low power modes, and still have critical system applications running.
>
Interesting. Sounds like there is a lot of commonality between what you
want and we want.
> In MTA the executable file's policies are stored in the ELF image.
> There is a utility to add a section containing the list of preferred nodes
> for the executable's text and data regions. That section is parsed by
> load_elf_binary(). The section data is in the form of mnemonic node
> name strings, which load_elf_binary() converts to a node id list.
Above you said "per mapped file policies". So it sounds as if you could have
different policies for different mapped files in a single application. How
do you specify which mapped file gets which policy using the info in the
header? (In particular, how do you match up info in the header with files in
the application? First one opened gets this policy, next gets that one, or
what?) [I guess in this paragraph "policy" == "node list" for your case.]
Or is the policy description more general, i. e. all text pages on nodes 3&5,
all mapped file pages on nodes 4,7,9.
Within a node list, is there any notion of local allocation? That is, if
the current policy puts mapped file pages on nodes 4, 7, 9, and a process
on node 7 touches a page, is there a preference to allocate it on node 7?
>
> MTA also supports policies for the slab allocator.
>
Is that a global or per process policy or is it finer grained than that?
(i. e. by cache type).
>>
>> (Just trying to figure out how to work both of our requirements into
>> the kernel in as simple as possible (but no simpler!) fashion.)
>
>
>
> could we have both a global page cache policy as well as per file
> policies. That is, if a mapped file has a policy, it overrides the
> global policy. That would work fine for MTA.
>
I don't see why not. You could fall back on that if there is no
file policy.
When you are done, is the intent to merge this into the mainline or does
MontaVista intend to maintain a "added value" patch of some kind?
> Steve
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org
* Re: [PATCH 0/2] mm: memory policy for page cache allocation
2004-09-24 15:43 ` [PATCH 0/2] mm: memory policy for page cache allocation Ray Bryant
@ 2004-09-25 5:40 ` Steve Longerbeam
0 siblings, 0 replies; 6+ messages in thread
From: Steve Longerbeam @ 2004-09-25 5:40 UTC (permalink / raw)
To: Ray Bryant; +Cc: linux-mm, lse-tech, linux-kernel
Ray Bryant wrote:
> Steve Longerbeam wrote:
>
>> Ray Bryant wrote:
>>
>>> Hi Steve,
>>>
>
> <snip>
>
>
>> So in MTA there is only one policy, which is very similar to the BIND
>> policy in
>> 2.6.8.
>>
>> MTA requires per mapped file policies. The patch I posted adds a
>> shared_policy tree to the address_space object, so that every file
>> can have its own policy for page cache allocations. A mapped file
>> can have a tree of policies, one for each mapped region of the file,
>> for instance, text and initialized data. With the patch, file mapped
>> policies would work across all filesystems, and the specific support
>> in tmpfs and hugetlbfs can be removed.
>>
>
> Just mapped files, not regular files as well? So you don't care about
> placement of page cache pages for regular files?
MTA cares about placement for both mapped file pages and regular
readahead pages. If readahead on a regular file sees that the region
of the file being read currently has a policy (found by searching the
inode mapping's shared_policy tree by page index), then the readahead
needs to use that policy, because the pages it allocates into the page
cache will also be used by VMAs currently mapping the file.
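The lookup described above can be sketched as follows. This is a
hypothetical userspace model, not the kernel code: the real patch
reportedly hangs a shared_policy tree off the address_space, while a
flat range list is enough here to illustrate the by-page-index search
that readahead would do.

```c
#include <stddef.h>

/* Model of a per-file policy tree: each mapped region of the file
 * carries its own policy, keyed by page index. */
struct policy_range {
    unsigned long start, end;   /* page-index range [start, end) */
    int node;                   /* preferred node for this region */
};

/* Return the node for the region covering pgidx, or -1 if the file
 * has no policy there, in which case the caller falls back to the
 * global page cache policy. */
static int shared_policy_node(const struct policy_range *sp, size_t n,
                              unsigned long pgidx)
{
    for (size_t i = 0; i < n; i++)
        if (pgidx >= sp[i].start && pgidx < sp[i].end)
            return sp[i].node;
    return -1;
}
```

For example, with text pages 0-15 on node 1 and data pages 16-31 on
node 2, readahead of page 20 allocates on node 2, matching what a VMA
mapping the data region would get.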
>
>> The goal of MTA is to direct an entire program's resident pages (text
>> and data regions of the executable and all its shared libs) to a
>> single node or a specific set of nodes. The primary use of MTA (by
>> the customer) is to allow portions of memory to be powered off for
>> low power modes, and still have critical system applications running.
>>
>
> Interesting. Sounds like there is a lot of commonality between what you
> want and we want.
cool, so this should be easy!
>
>> In MTA the executable file's policies are stored in the ELF image.
>> There is a utility to add a section containing the list of preferred
>> nodes
>> for the executable's text and data regions. That section is parsed by
>> load_elf_binary(). The section data is in the form of mnemonic node
>> name strings, which load_elf_binary() converts to a node id list.
>
>
> Above you said "per mapped file policies". So it sounds as if you
> could have different policies for different mapped files in a single
> application.
right. For instance, the bash ELF image could be marked to have its
text go in node 1 and its data go in node 2, while the libc ELF image
could have its text go in node 0 and its data in node 3. That's just a
wild example. Of course, if you wanted ALL of bash's text and data to
go in, say, node 1, then bash and all of the shared libs loaded by
bash would have to be marked to go in node 1.
The usage idea is that critical apps (say bash, daemons, etc.) and
critical libs (i.e. libc) can have their memory reside completely in
nodes that never "go away". Non-critical apps, and libs not used by
critical apps, can be located in nodes that get powered down in low
power modes.
This info might help too. In MTA, on a page fault, there are four
levels of precedence when selecting a policy for a page alloc: file
policy > VMA policy > process policy > default policy. In other words,
if the VMA is a file mapping, use the file's policy if there is one,
otherwise use the VMA's policy if there is one, otherwise use the
process's policy if there is one, otherwise use the default policy. In
the case of a COW for a file mapping though, the COW page is the
process's own private page, so it can start from the VMA policy.
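That precedence order can be sketched as below. The struct and
function names are invented for illustration; only the ordering (file
> VMA > process > default, with COW skipping the file level) comes
from the description.

```c
/* Stand-in for the kernel's mempolicy type. */
struct mempolicy { int id; };

struct alloc_ctx {
    const struct mempolicy *file_pol;    /* from the inode's shared_policy */
    const struct mempolicy *vma_pol;     /* from the faulting VMA */
    const struct mempolicy *task_pol;    /* current->mempolicy */
    const struct mempolicy *default_pol; /* system default, never NULL */
};

static const struct mempolicy *
pick_policy(const struct alloc_ctx *c, int is_cow)
{
    /* A COW page is the process's own private copy, so the file
     * level is skipped and selection starts at the VMA policy. */
    if (!is_cow && c->file_pol)
        return c->file_pol;
    if (c->vma_pol)
        return c->vma_pol;
    if (c->task_pol)
        return c->task_pol;
    return c->default_pol;
}
```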
> How
> do you specify which mapped file gets which policy using the info in
> the header? (In particular, how do you match up info in the header with
> files in
> the application? First one opened gets this policy, next gets that
> one, or what?) [I guess in this paragraph "policy" == "node list" for
> your case.]
the file's policies are simply located in an ELF section of the file.
We have a utility that will add a section to the ELF image that
contains the policy info. That section is parsed by load_elf_binary()
for executables, and by ld.so in libc for shared objects.
A mapped file's mempolicy has to be specified by the file itself, not
by the process mapping the file. For instance, it's not possible for
process A to want libc text in node 0 while process B wants libc text
in node 1, otherwise we have cache aliases. So the page cache alloc
policies have to be either global (which is what you want), or
per-file, or both, with file policy taking precedence over global.
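The mnemonic-name conversion described earlier might look like this
sketch. The node-name table and the NUL-separated section layout are
invented for the example; only the idea (strings in an ELF section
converted to a node id list by the loader) comes from the thread.

```c
#include <string.h>
#include <stddef.h>

/* Invented mnemonic node names for a small NUMA machine. */
static const char *node_names[] = { "mem0", "mem1", "mem2", "mem3" };
#define NR_NODES (sizeof(node_names) / sizeof(node_names[0]))

/* "names" is a NUL-separated list terminated by an empty string,
 * e.g. "mem0\0mem2\0" -> node mask for nodes 0 and 2. */
static unsigned long parse_node_list(const char *names)
{
    unsigned long mask = 0;
    while (*names) {
        for (size_t i = 0; i < NR_NODES; i++)
            if (strcmp(names, node_names[i]) == 0)
                mask |= 1UL << i;
        names += strlen(names) + 1;
    }
    return mask;
}
```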
>
> Or is the policy description more general, i. e. all text pages on
> nodes 3&5,
> all mapped file pages on nodes 4,7,9.
>
> Within a node list, is there any notion of local allocation? That is, if
> the current policy puts mapped file pages on nodes 4, 7, 9, and a process
> on node 7 touches a page, is there a preference to allocate it on node 7?
No, not yet. But I'm not sure if that would make sense. If a file wants
text preferably in node 4, with fallback to node 7, it should try node
4 first, regardless of what some process wants.
>
>>
>> MTA also supports policies for the slab allocator.
>>
>
> Is that a global or per process policy or is it finer grained than that?
> (i. e. by cache type).
the policy is contained in the cache, using a new API called
kmem_cache_create_mempolicy(), so that allocations from that
cache use a policy.
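The thread names the API but not its signature, so the following is a
guess at its shape: a slab cache created with an attached mempolicy,
consulted on every allocation from that cache (i.e. the policy is per
cache type, not per process).

```c
#include <stdlib.h>
#include <stddef.h>

/* Stand-in types; the kernel structures are far richer. */
struct mempolicy { unsigned long nodemask; };

struct kmem_cache {
    const char *name;
    size_t objsize;
    const struct mempolicy *policy;  /* NULL = default placement */
};

/* Hypothetical-signature model of kmem_cache_create_mempolicy(). */
static struct kmem_cache *
kmem_cache_create_mempolicy(const char *name, size_t size,
                            const struct mempolicy *pol)
{
    struct kmem_cache *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    c->name = name;
    c->objsize = size;
    c->policy = pol;   /* every alloc from this cache uses pol */
    return c;
}
```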
>
>>>
>>> (Just trying to figure out how to work both of our requirements into
>>> the kernel in as simple as possible (but no simpler!) fashion.)
>>
>>
>>
>>
>> could we have both a global page cache policy as well as per file
>> policies. That is, if a mapped file has a policy, it overrides the
>> global policy. That would work fine for MTA.
>>
>
> I don't see why not. You could fall back on that if there is no
> file policy.
>
> When you are done, is the intent to merge this into the mainline or does
> MontaVista intend to maintain a "added value" patch of some kind?
it would be cool if most of this draws enough interest to go into
mainline; otherwise MontaVista has to maintain a large patch.
But in summary, the only major addition to the current NUMA mempolicy
for MTA support is policies for mapped files.
Steve
* [PATCH 0/2] mm: memory policy for page cache allocation
@ 2004-09-23 4:32 Ray Bryant
2004-09-23 9:09 ` Andi Kleen
0 siblings, 1 reply; 6+ messages in thread
From: Ray Bryant @ 2004-09-23 4:32 UTC (permalink / raw)
To: Andi Kleen
Cc: William Lee Irwin III, Ray Bryant, linux-mm, Jesse Barnes,
Dan Higgins, lse-tech, Brent Casavant, Nick Piggin,
Martin J. Bligh, linux-kernel, Ray Bryant, Andrew Morton,
Paul Jackson, Dave Hansen
Andi,
You may like the following patchset better. (At least
I hope so...)
It's divided into 3 parts, with this file (the OVERVIEW)
making up the 0th part, and two patches making up parts 1 and 2.
I've tried to address several of your concerns with this
version of the patch:
(1) We dropped the MPOL_ROUNDROBIN patch. Instead, we
use MPOL_INTERLEAVE to spread pages across nodes.
However, rather than use the file offset etc to
calculate the node to allocate the page on, I used
the same mechanism you used in alloc_pages_current()
to calculate the node number (interleave_node()).
That eliminates the need to generate an offset etc
in the routines that call page_cache_alloc() and to
me appears to be a simpler change that still fits
within your design.
We can still go the other way if you want, it matters
not to me, this was just dramatically less code (i. e.
0 lines) to use the existing functionality.
(2) I implemented the sys_set_mempolicy() changes as
suggested -- higher order bits in the mode (first)
argument specify whether or not this request is for
the page allocation policy (your existing policy)
or for the page cache allocation policy. Similarly,
a bit there indicates whether or not we want to set
the process level policy or the system level policy.
These bits are to be set in the flags argument of
sys_mbind().
(3) As before, there is a process level policy and a
system level policy for both regular page allocation
and page cache allocation. The primary rationale
for this is that since that is the way your code
worked for regular page allocation, it was easiest
to piggyback on that and hence you end up with a
per process and system default page allocation policy.
If no-one specifies a process level page cache
allocation policy, the overhead of this is one long
per task struct. Making it otherwise would make
the code less clean, I think.
We continue to believe that we will have applications
that wish to set the page cache allocation policy,
but, we don't have any demonstrable cases of this.
(4) I added a new patch to remove a bias toward node
0 of page allocations. That is because each
new process starts with an il_next = 0. Now, I
set il_next to current->pid % MAX_NUMNODES.
See the 2nd patch for more discussion.
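The node-0 bias fix in (4) amounts to seeding the per-task interleave
rotor from the pid, so new tasks start on different nodes instead of
all starting at node 0. A minimal model (MAX_NUMNODES value is
illustrative):

```c
#define MAX_NUMNODES 8   /* illustrative node count */

/* Seed il_next from the pid rather than starting every task at 0. */
static int initial_il_next(int pid)
{
    return pid % MAX_NUMNODES;
}
```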
I haven't tested this much, it compiles and boots.
More testing will be done once I get your NUMA_API code
converted (perhaps not much needs to be done, don't
know yet) to use the new interface.
Also, I got Steve's patch, and have looked at the overview,
but not the details. If we could create a default policy for
page cache allocation that would be like MPOL_INTERLEAVE,
and then have per file settable policies, I guess we could
live with that, but it seems to me that a process would
likely want all of its pages allocated the same way. That
is, an HPC process would want all of its files allocated
round robin across the cpuset (most likely), while a file
server process would want its page cache pages allocated
locally. It would be pain to have to specify a special
policy for each file opened by a process, I would think,
unless there is some way to cache that in the proces and
have it apply to all files that the process opens, but
then you are effectively emulating a per process policy
in user space, it seems to me.
---------------OVERVIEW--------------------
This is the second working release of this patch.
Changes since the last release
------------------------------
(1) Dropped the MPOL_ROUNDROBIN patch.
(2) Added some new text to the overview (see <new text>)
below.
(3) Changed to use the task struct field: il_next to
control round robin allocation of pages when the
policy is MPOL_INTERLEAVE.
(4) Added code to set and get the additional policy types.
The original policy in Andi Kleen's code is called
POLICY_PAGE, because it deals with data page allocation,
the new policy for page cache pages is called
POLICY_PAGECACHE.
(5) Added a new patch to this series to reduce allocation
bias toward node 0.
Background
----------
In August, Jesse Barnes at SGI proposed a patch to do round robin
allocation of page cache pages on NUMA machines. This got shot down
for a number of reasons (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
and the related thread), but it seemed to me that one of the most
significant issues was that this was a workload dependent optimization.
That is, for an Altix running an HPC workload, it was a good thing,
but for web servers or file servers it was not such a good idea.
So the idea of this patch is the following: it creates a new memory
policy structure (default_pagecache_policy) that is used to control
how storage for page cache pages is allocated. So, for a large Altix
running HPC workloads, we can specify a policy that does round robin
allocations, and for other workloads you can specify the default policy
(which results in page cache pages being allocated locally).
The default_pagecache_policy is override-able on a per process basis, so
that if your application prefers to allocate page cache pages locally,
it can. <new text> In this regard the pagecache policy behaves the same
as the page allocation policy and indeed all of the code to implement
the two is basically the same.
<new text>
The primary rationale for this is that this is the way the existing mempolicy
code works -- there is a per process policy, which is used if it exists,
and if the per process policy is null, then a global, default policy
is used. This patch piggybacks on that existing code, so you get the
per process policy and a global policy for page cache allocations as well.
If the user does not define a per process policy, the extra cost is an
unused pointer in the task struct. We can envision situations where
a per process cache allocation policy may be beneficial, but the real
case for this is that it allows us to use the existing code with only
minor modifications to implement, set and get the page cache mempolicy.
This is all done by making default_policy and current->mempolicy an
array of size 2 and of type "struct mempolicy *". Entry POLICY_PAGE
in these arrays is the old default_policy and process memory policy.
Entry POLICY_PAGECACHE in these arrays contains the system default and
per process page cache allocation policies, respectively.
While one can, in principle, change the global page cache allocation
policy, we think this will be done precisely once per boot by calls from
startup scripts into the NUMA API code. The idea is not so much to allow
the global page cache policy to be easily changeable, but rather allowing
it to be settable by the system admin so that we don't have to compile
separate kernels for file servers and HPC servers. In particular,
changing the page cache allocation policy doesn't cause previously
allocated pages to be moved so that they are now placed correctly
according to the new policy. Over time, they will get replaced and the
system will slowly migrate to a state where most page cache pages are
on the correct nodes for the new policy.
Efficiencies in setting and getting the page cache policy from user
space are also achieved through this approach. The system call
entry points "sys_set_mempolicy", "sys_get_mempolicy" and "sys_mbind"
have been enhanced to support specifying whether the policy that is
being operated on is:
(1) The process-level policy or the default system level policy.
(2) The page allocation policy or the page cache allocation policy.
This is done using higher order bits in the mode (first) argument to
sys_set/get_mempolicy() and the flags word in sys_mbind(). These
bits are defined so that users of the original interface will get
the same results using the old and new implementations of these
routines.
<end new text>
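The mode/flags encoding described above might look like the following
sketch. The constant names and bit positions are invented; only the
scheme comes from the text: low bits carry the ordinary MPOL_* mode,
high bits select which policy is addressed, and all-zero high bits
reproduce the original interface.

```c
#define POLICY_PAGE        0u           /* data page policy slot */
#define POLICY_PAGECACHE   1u           /* page cache policy slot */

#define MPOL_MODE_MASK     0x0000ffffu  /* ordinary MPOL_* mode bits */
#define MPOL_F_PAGECACHE   0x00010000u  /* address page cache policy */
#define MPOL_F_SYSTEMWIDE  0x00020000u  /* address system default policy */

/* Which policy array slot does this request address? */
static unsigned policy_slot(unsigned mode)
{
    return (mode & MPOL_F_PAGECACHE) ? POLICY_PAGECACHE : POLICY_PAGE;
}

/* The ordinary MPOL_* mode, unchanged for old callers. */
static unsigned mpol_mode(unsigned mode)
{
    return mode & MPOL_MODE_MASK;
}
```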
A new worker routine is defined:
alloc_pages_by_policy(gfp, order, policy)
This routine allocates the requested number of pages using the policy
index specified.
alloc_pages_current() and page_cache_alloc() are then defined in terms
of alloc_pages_by_policy().
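A userspace model of how the two-element policy arrays and the worker
routine fit together (names mirror the description above; the kernel
details are not reproduced here):

```c
#define POLICY_PAGE      0
#define POLICY_PAGECACHE 1

struct mempolicy { int mode; };  /* stand-in for the kernel type */

static struct mempolicy *default_policy[2];  /* system defaults */
static struct mempolicy *task_policy[2];     /* models current->mempolicy */

/* Per-process policy overrides the system default, exactly as the
 * existing code already does for the data page policy. */
static struct mempolicy *effective_policy(int which)
{
    return task_policy[which] ? task_policy[which]
                              : default_policy[which];
}
/* alloc_pages_current() would resolve effective_policy(POLICY_PAGE);
 * page_cache_alloc() would resolve effective_policy(POLICY_PAGECACHE),
 * both funneling into alloc_pages_by_policy(gfp, order, policy). */
```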
<new text>
This patch is in two parts. The first part is the page cache policy
patch itself (we dropped the previous first patch). The second
patch is a patch to slightly modify the implementation of policy
MPOL_INTERLEAVE to remove a bias toward allocating on node 0.
Further specific details of these patches are in the patch files,
which follow this email.
<end new text>
Caveats
-------
(1) page_cache_alloc_local() is defined, but is not currently called.
This was added in SGI ProPack to make sure that mmap'd() files were
allocated locally rather than round-robin'd (i. e. to override the
round robin allocation in that case.) This was an SGI MPT requirement.
It may be this is not needed with the current mempolicy code if we can
associate the default mempolicy with mmap()'d files for those MPT users.
(2) alloc_pages_current() is now an inline, but there is no easy way
to do that totally correctly with the current include file order (that I
could figure out at least...) The problem is that alloc_pages_current()
wants to use the #define'd constant POLICY_PAGE, but that is not yet defined.
We know it is zero, so we just use zero. A comment in mempolicy.h
suggests not to change the value of this constant to something other
than zero, and references the file gfp.h.
(3) <new> The code compiles and boots but has not been extensively
tested. The latter will wait for a NUMA API library that supports
the new functionality. My next goal is to get those modifications
done so we can do some serious testing.
(4) I've not thought a bit about locking issues related to changing a
mempolicy whilst the system is actually running. However, now that
the mempolicies themselves are stateless (as per Andi Kleen's original
design) it may be that these issues are not as significant.
(5) It seems there may be a potential conflict between the page cache
mempolicy and a mmap mempolicy (do those exist?). Here's the concern:
If you mmap() a file, and any pages of that file are in the page cache,
then the location of those pages will (have been) dictated by the page
cache mempolicy, which could differ (will likely differ) from the mmap
mempolicy. It seems that the only solution to this is to migrate those
pages (when they are touched) after the mmap().
(6) Testing of this particular patch has been minimal since I don't
yet have a compatible NUMA API. I'm working on that next.
Comments, flames, etc to the undersigned.
Best Regards,
Ray
Ray Bryant <raybry@sgi.com>
* Re: [PATCH 0/2] mm: memory policy for page cache allocation
2004-09-23 4:32 Ray Bryant
@ 2004-09-23 9:09 ` Andi Kleen
0 siblings, 0 replies; 6+ messages in thread
From: Andi Kleen @ 2004-09-23 9:09 UTC (permalink / raw)
To: Ray Bryant
Cc: Andi Kleen, William Lee Irwin III, linux-mm, Jesse Barnes,
Dan Higgins, lse-tech, Brent Casavant, Nick Piggin,
Martin J. Bligh, linux-kernel, Ray Bryant, Andrew Morton,
Paul Jackson, Dave Hansen
> (1) We dropped the MPOL_ROUNDROBIN patch. Instead, we
> use MPOL_INTERLEAVE to spread pages across nodes.
> However, rather than use the file offset etc to
> calculate the node to allocate the page on, I used
> the same mechanism you used in alloc_pages_current()
> to calculate the node number (interleave_node()).
> That eliminates the need to generate an offset etc
> in the routines that call page_cache_alloc() and to
> me appears to be a simpler change that still fits
> within your design.
Hmm, that may lead to uneven balancing because the counter is
per thread. But if it works for you it's ok I guess.
I still think changing the callers to use the offset for
static interleaving would be better. Maybe that could be
done as a follow-on patch.
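The alternative suggested here, static interleaving keyed by the file
offset, can be sketched in a line. This is a simplified model: the
node depends only on the page's offset and the node count, so
placement is balanced and reproducible regardless of which thread
allocates, whereas the real code would map the result through the
policy's nodemask.

```c
/* Static interleave: node chosen from the page offset alone. */
static int offset_interleave_node(unsigned long pgoff, int nnodes)
{
    return (int)(pgoff % (unsigned long)nnodes);
}
```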
>
> (2) I implemented the sys_set_mempolicy() changes as
> suggested -- higher order bits in the mode (first)
> argument specify whether or not this request is for
> the page allocation policy (your existing policy)
> or for the page cache allocation policy. Similarly,
> a bit there indicates whether or not we want to set
> the process level policy or the system level policy.
>
> These bits are to be set in the flags argument of
> sys_mbind().
Ok. If that gets in I would suggest you also document it
in the manpages and send me a patch.
Comments to the patches in other mail.
-Andi
* [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
@ 2004-09-20 19:00 Ray Bryant
2004-09-20 20:55 ` Andi Kleen
0 siblings, 1 reply; 6+ messages in thread
From: Ray Bryant @ 2004-09-20 19:00 UTC (permalink / raw)
To: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
Andi Kleen, Ray Bryant
Cc: linux-mm, Jesse Barnes, Dan Higgins, lse-tech, Brent Casavant,
Nick Piggin, linux-kernel, Ray Bryant, Paul Jackson, Dave Hansen
This is the first working release of this patch. It was previously
proposed as an RFC (see
http://marc.theaimsgroup.com/?l=linux-mm&m=109416852113561&w=2
).
Background
----------
Last month, Jesse Barnes proposed a patch to do round robin
allocation of page cache pages on NUMA machines. This got shot down
for a number of reasons (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
and the related thread), but it seemed to me that one of the most
significant issues was that this was a workload dependent optimization.
That is, for an Altix running an HPC workload, it was a good thing,
but for web servers or file servers it was not such a good idea.
So the idea of this patch is the following: it creates a new memory
policy structure (default_pagecache_policy) that is used to control
how storage for page cache pages is allocated. So, for a large Altix
running HPC workloads, we can specify a policy that does round robin
allocations, and for other workloads you can specify the default policy
(which results in page cache pages being allocated locally).
The default_pagecache_policy is overrideable on a per process basis, so
that if your application prefers to allocate page cache pages locally,
it can.
This is all done by making default_policy and current->mempolicy an array
of size 2 and of type "struct mempolicy *". Entry POLICY_PAGE in these
arrays is the old default_policy and process memory policy, respectively.
Entry POLICY_PAGECACHE in these arrays contains the system default and
per process page cache allocation policies, respectively.
A new worker routine is defined:
alloc_pages_by_policy(gfp, order, policy)
This routine allocates the requested number of pages using the policy
index specified.
alloc_pages_current() and page_cache_alloc() are then defined in terms
of alloc_pages_by_policy().
This patch is in two parts. The first part is Brent Casavant's patch for
MPOL_ROUNDROBIN. We need this because there is no handy offset to use
when you get a call to allocate a page cache page in "page_cache_alloc()",
so MPOL_INTERLEAVE doesn't do what we need.
The second part of the patch is the set of changes to create the
default_pagecache_policy and see that it is used in page_cache_alloc()
as well as the changes to supporting setting a policy given a policy
index.
Caveats
-------
(1) Right now, there is no mechanism to set any of the memory policies
from user space. The NUMA API library will have to be modified to match
the new format of the sys_set/get_mempolicy() system calls (these calls
have an additional integer argument that specifies which policy to set.)
This is work that I will start on once we get agreement with this patch.
(It also appears to me that there is no mechanism to set the default
policies, but perhaps it's there and I am just missing it.)
(I tested this stuff by hard-compiling policies into my test kernel.)
(2) page_cache_alloc_local() is defined, but is not currently called.
This was added in SGI ProPack to make sure that mmap'd() files were
allocated locally rather than round-robin'd (i. e. to override the
round robin allocation in that case.) This was an SGI MPT requirement.
It may be this is not needed with the current mempolicy code if we can
associate the default mempolicy with mmap()'d files for those MPT users.
(3) alloc_pages_current() is now an inline, but there is no easy way
to do that totally correctly with the current include file order (that I
could figure out at least...) The problem is that alloc_pages_current()
wants to use the #define'd constant POLICY_PAGE, but that is not yet
defined. We know it is zero, so we just use zero. A comment in mempolicy.h
suggests not to change the value of this constant to something other
than zero, and references the file gfp.h.
(4) I've not thought a bit about locking issues related to changing a
mempolicy whilst the system is actually running.
(5) It seems there may be a potential conflict between the page cache
mempolicy and a mmap mempolicy (do those exist?). Here's the concern:
If you mmap() a file, and any pages of that file are in the page cache,
then the location of those pages will (have been) dictated by the page
cache mempolicy, which could differ (will likely differ) from the mmap
mempolicy. It seems that the only solution to this is to migrate those
pages (when they are touched) after the mmap().
Comments, flames, etc to the undersigned.
Best Regards,
Ray
PS: Both patches are relative to 2.6.9-rc2-mm1. However, since that
kernel doesn't boot on Altix for me at the moment, the testing was done
using 2.6.9-rc1-mm3.
PPS: This is not a final patch, but lets keep the lawyers happy anyway:
Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Ray Bryant <raybry@sgi.com>
===========================================================================
Index: linux-2.6.9-rc1-mm2-kdb/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc1-mm2-kdb.orig/include/linux/sched.h 2004-08-31 13:32:20.000000000 -0700
+++ linux-2.6.9-rc1-mm2-kdb/include/linux/sched.h 2004-09-02 13:17:45.000000000 -0700
@@ -596,6 +596,7 @@
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next; /* could be shared with used_math */
+ short rr_next;
#endif
#ifdef CONFIG_CPUSETS
struct cpuset *cpuset;
===================================================================
Index: linux-2.6.9-rc1-mm2-kdb/mm/mempolicy.c
===================================================================
--- linux-2.6.9-rc1-mm2-kdb.orig/mm/mempolicy.c 2004-08-31 13:32:20.000000000 -0700
+++ linux-2.6.9-rc1-mm2-kdb/mm/mempolicy.c 2004-09-02 13:17:45.000000000 -0700
@@ -7,10 +7,17 @@
* NUMA policy allows the user to give hints in which node(s) memory should
* be allocated.
*
- * Support four policies per VMA and per process:
+ * Support five policies per VMA and per process:
*
* The VMA policy has priority over the process policy for a page fault.
*
+ * roundrobin Allocate memory round-robined over a set of nodes,
+ * with normal fallback if it fails. The round-robin is
+ * based on a per-thread rotor both to provide predictability
+ * of allocation locations and to avoid cacheline contention
+ * compared to a global rotor. This policy is distinct from
+ * interleave in that it seeks to distribute allocations evenly
+ * across nodes, whereas interleave seeks to maximize bandwidth.
* interleave Allocate memory interleaved over a set of nodes,
* with normal fallback if it fails.
* For VMA based allocations this interleaves based on the
@@ -117,6 +124,7 @@
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_ROUNDROBIN:
/* Preferred will only use the first bit, but allow
more for now. */
if (empty)
@@ -215,6 +223,7 @@
atomic_set(&policy->refcnt, 1);
switch (mode) {
case MPOL_INTERLEAVE:
+ case MPOL_ROUNDROBIN:
bitmap_copy(policy->v.nodes, nodes, MAX_NUMNODES);
break;
case MPOL_PREFERRED:
@@ -406,6 +415,8 @@
current->mempolicy = new;
if (new && new->policy == MPOL_INTERLEAVE)
current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
+ if (new && new->policy == MPOL_ROUNDROBIN)
+ current->rr_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
return 0;
}
@@ -423,6 +434,7 @@
case MPOL_DEFAULT:
break;
case MPOL_INTERLEAVE:
+ case MPOL_ROUNDROBIN:
bitmap_copy(nodes, p->v.nodes, MAX_NUMNODES);
break;
case MPOL_PREFERRED:
@@ -507,6 +519,9 @@
} else if (pol == current->mempolicy &&
pol->policy == MPOL_INTERLEAVE) {
pval = current->il_next;
+ } else if (pol == current->mempolicy &&
+ pol->policy == MPOL_ROUNDROBIN) {
+ pval = current->rr_next;
} else {
err = -EINVAL;
goto out;
@@ -585,6 +600,7 @@
return policy->v.zonelist;
/*FALL THROUGH*/
case MPOL_INTERLEAVE: /* should not happen */
+ case MPOL_ROUNDROBIN: /* should not happen */
case MPOL_DEFAULT:
nd = numa_node_id();
break;
@@ -595,6 +611,21 @@
return NODE_DATA(nd)->node_zonelists + (gfp & GFP_ZONEMASK);
}
+/* Do dynamic round-robin for a process */
+static unsigned roundrobin_nodes(struct mempolicy *policy)
+{
+ unsigned nid, next;
+ struct task_struct *me = current;
+
+ nid = me->rr_next;
+ BUG_ON(nid >= MAX_NUMNODES);
+ next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
+ if (next >= MAX_NUMNODES)
+ next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
+ me->rr_next = next;
+ return nid;
+}
+
/* Do dynamic interleaving for a process */
static unsigned interleave_nodes(struct mempolicy *policy)
{
@@ -646,6 +677,27 @@
return page;
}
+/* Allocate a page in round-robin policy.
+ Own path because first fallback needs to round-robin. */
+static struct page *alloc_page_roundrobin(unsigned gfp, unsigned order, struct mempolicy* policy)
+{
+ struct zonelist *zl;
+ struct page *page;
+ unsigned nid;
+ int i, numnodes = bitmap_weight(policy->v.nodes, MAX_NUMNODES);
+
+ for (i = 0; i < numnodes; i++) {
+ nid = roundrobin_nodes(policy);
+ BUG_ON(!test_bit(nid, (const volatile void *) &node_online_map));
+ zl = NODE_DATA(nid)->node_zonelists + (gfp & GFP_ZONEMASK);
+ page = __alloc_pages(gfp, order, zl);
+ if (page)
+ return page;
+ }
+
+ return NULL;
+}
+
/**
* alloc_page_vma - Allocate a page for a VMA.
*
@@ -671,26 +723,30 @@
struct page *
alloc_page_vma(unsigned gfp, struct vm_area_struct *vma, unsigned long addr)
{
+ unsigned nid;
struct mempolicy *pol = get_vma_policy(vma, addr);
cpuset_update_current_mems_allowed();
- if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
- unsigned nid;
- if (vma) {
- unsigned long off;
- BUG_ON(addr >= vma->vm_end);
- BUG_ON(addr < vma->vm_start);
- off = vma->vm_pgoff;
- off += (addr - vma->vm_start) >> PAGE_SHIFT;
- nid = offset_il_node(pol, vma, off);
- } else {
- /* fall back to process interleaving */
- nid = interleave_nodes(pol);
- }
- return alloc_page_interleave(gfp, 0, nid);
+ switch (pol->policy) {
+ case MPOL_INTERLEAVE:
+ if (vma) {
+ unsigned long off;
+ BUG_ON(addr >= vma->vm_end);
+ BUG_ON(addr < vma->vm_start);
+ off = vma->vm_pgoff;
+ off += (addr - vma->vm_start) >> PAGE_SHIFT;
+ nid = offset_il_node(pol, vma, off);
+ } else {
+ /* fall back to process interleaving */
+ nid = interleave_nodes(pol);
+ }
+ return alloc_page_interleave(gfp, 0, nid);
+ case MPOL_ROUNDROBIN:
+ return alloc_page_roundrobin(gfp, 0, pol);
+ default:
+ return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
}
/**
@@ -716,8 +772,11 @@
cpuset_update_current_mems_allowed();
if (!pol || in_interrupt())
pol = &default_policy;
- if (pol->policy == MPOL_INTERLEAVE)
+ if (pol->policy == MPOL_INTERLEAVE) {
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
+ } else if (pol->policy == MPOL_ROUNDROBIN) {
+ return alloc_page_roundrobin(gfp, order, pol);
+ }
return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
}
EXPORT_SYMBOL(alloc_pages_current);
@@ -754,6 +813,7 @@
case MPOL_DEFAULT:
return 1;
case MPOL_INTERLEAVE:
+ case MPOL_ROUNDROBIN:
return bitmap_equal(a->v.nodes, b->v.nodes, MAX_NUMNODES);
case MPOL_PREFERRED:
return a->v.preferred_node == b->v.preferred_node;
@@ -798,6 +858,8 @@
return pol->v.zonelist->zones[0]->zone_pgdat->node_id;
case MPOL_INTERLEAVE:
return interleave_nodes(pol);
+ case MPOL_ROUNDROBIN:
+ return roundrobin_nodes(pol);
case MPOL_PREFERRED:
return pol->v.preferred_node >= 0 ?
pol->v.preferred_node : numa_node_id();
@@ -815,6 +877,7 @@
case MPOL_PREFERRED:
case MPOL_DEFAULT:
case MPOL_INTERLEAVE:
+ case MPOL_ROUNDROBIN:
return 1;
case MPOL_BIND: {
struct zone **z;
===================================================================
Index: linux-2.6.9-rc1-mm2-kdb/include/linux/mempolicy.h
===================================================================
--- linux-2.6.9-rc1-mm2-kdb.orig/include/linux/mempolicy.h 2004-08-27 10:06:15.000000000 -0700
+++ linux-2.6.9-rc1-mm2-kdb/include/linux/mempolicy.h 2004-09-02 13:19:38.000000000 -0700
@@ -13,6 +13,7 @@
#define MPOL_PREFERRED 1
#define MPOL_BIND 2
#define MPOL_INTERLEAVE 3
+#define MPOL_ROUNDROBIN 4
-#define MPOL_MAX MPOL_INTERLEAVE
+#define MPOL_MAX MPOL_ROUNDROBIN
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant raybry@sgi.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org
* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
2004-09-20 19:00 [PATCH 2.6.9-rc2-mm1 " Ray Bryant
@ 2004-09-20 20:55 ` Andi Kleen
2004-09-20 23:48 ` Steve Longerbeam
0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2004-09-20 20:55 UTC (permalink / raw)
To: Ray Bryant
Cc: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
Andi Kleen, Ray Bryant, linux-mm, Jesse Barnes, Dan Higgins,
lse-tech, Brent Casavant, Nick Piggin, linux-kernel,
Paul Jackson, Dave Hansen, stevel
On Mon, Sep 20, 2004 at 12:00:33PM -0700, Ray Bryant wrote:
> Background
> ----------
>
> Last month, Jesse Barnes proposed a patch to do round robin
> allocation of page cache pages on NUMA machines. This got shot down
> for a number of reasons (see
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
> and the related thread), but it seemed to me that one of the most
> significant issues was that this was a workload dependent optimization.
> That is, for an Altix running an HPC workload, it was a good thing,
> but for web servers or file servers it was not such a good idea.
>
> So the idea of this patch is the following: it creates a new memory
> policy structure (default_pagecache_policy) that is used to control
> how storage for page cache pages is allocated. So, for a large Altix
> running HPC workloads, we can specify a policy that does round robin
> allocations, and for other workloads you can specify the default policy
> (which results in page cache pages being allocated locally).
>
> The default_pagecache_policy is overrideable on a per process basis, so
> that if your application prefers to allocate page cache pages locally,
> it can.
I'm not sure this really makes sense. Do you have some clear use
case where having so much flexibility is needed?
I would prefer to have a global setting somewhere for the page
cache (sysctl or sysfs or what you prefer) and some special handling for
text pages.
This would keep the per thread bloat low.
Also I must say I got a patch submitted to do policy per
file from Steve Longerbeam.
It so far only supports this for ELF executables, but
it has most of the infrastructure to do individual policy
per file. Maybe it would be better to go in this direction;
the only thing missing is a nice way to declare policy for
arbitrary files. Even in this case a global default would be useful.
I haven't done anything with this patch yet due to lack of time,
and there were a few small issues to resolve, but I hope it
can eventually be integrated.
[Steve, perhaps you can repost the patch to lse-tech for more
wider review?]
> MPOL_ROUNDROBIN. We need this because there is no handy offset to use
> when you get a call to allocate a page cache page in "page_cache_alloc()",
> so MPOL_INTERLEAVE doesn't do what we need.
Well, you just have to change the callers to pass it in. I think
computing the interleaving on an offset and perhaps another file
identifier is better than having the global counter.
> (It also appears to me that there is no mechanism to set the default
> policies, but perhaps its there and I am just missing it.)
Not sure which default policies you mean?
> (3) alloc_pages_current() is now an inline, but there is no easy way
> to do that totally correctly with the current include file order (that I
> could figure out at least...). The problem is that alloc_pages_current()
> wants to use the #define constant POLICY_PAGE, but that is not defined yet.
> We know it is zero, so we just use zero. A comment in mempolicy.h
> suggests not to change the value of this constant to something other
> than zero, and references the file gfp.h.
I'm pretty sure the code I wrote didn't have a "POLICY_PAGE" ;-)
Not sure where you got it from, but you could ask whoever
wrote that comment in your patch.
>
> (4) I've not thought a bit about locking issues related to changing a
> mempolicy whilst the system is actually running.
You need some kind of lock. Normally mempolicies are either
protected by being thread local or by the mmsem together
with the atomic reference count.
This only applies to modifications, for reading they are completely
stateless and don't need any locking.
Your new RR policy will break this though. It works for process
policy, but for VMA policy it will either require a lock per
policy or some other complicated locking. Not nice.
I think doing it stateless is much better because it will scale
much better and should IMHO also have better behaviour longer term.
I went over several design iterations with this and think the
current lockless design is very preferable.
> (5) It seems there may be a potential conflict between the page cache
> mempolicy and a mmap mempolicy (do those exist?). Here's the concern:
They exist for tmpfs/shmfs/hugetlbfs pages.
With Steve's page cache patch it can exist for all pages.
Normally NUMA API resolves this by prefering the more specific
policy (VMA over process) or sharing policies (for shmfs)
Haven't read your patch in details yet, sorry, just design comments.
-Andi
* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
2004-09-20 20:55 ` Andi Kleen
@ 2004-09-20 23:48 ` Steve Longerbeam
2004-09-23 15:54 ` [PATCH " Ray Bryant
0 siblings, 1 reply; 6+ messages in thread
From: Steve Longerbeam @ 2004-09-20 23:48 UTC (permalink / raw)
To: linux-mm, lse-tech, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 4331 bytes --]
Andi Kleen wrote:
>On Mon, Sep 20, 2004 at 12:00:33PM -0700, Ray Bryant wrote:
>
>
>>Background
>>----------
>>
>>Last month, Jesse Barnes proposed a patch to do round robin
>>allocation of page cache pages on NUMA machines. This got shot down
>>for a number of reasons (see
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
>>and the related thread), but it seemed to me that one of the most
>>significant issues was that this was a workload dependent optimization.
>>That is, for an Altix running an HPC workload, it was a good thing,
>>but for web servers or file servers it was not such a good idea.
>>
>>So the idea of this patch is the following: it creates a new memory
>>policy structure (default_pagecache_policy) that is used to control
>>how storage for page cache pages is allocated. So, for a large Altix
>>running HPC workloads, we can specify a policy that does round robin
>>allocations, and for other workloads you can specify the default policy
>>(which results in page cache pages being allocated locally).
>>
>>The default_pagecache_policy is overrideable on a per process basis, so
>>that if your application prefers to allocate page cache pages locally,
>>it can.
>>
>>
>
>I'm not sure this really makes sense. Do you have some clear use
>case where having so much flexibility is needed?
>
>I would prefer to have a global setting somewhere for the page
>cache (sysctl or sysfs or what you prefer) and some special handling for
>text pages.
>
>This would keep the per thread bloat low.
>
>Also I must say I got a patch submitted to do policy per
>file from Steve Longerbeam.
>
>It so far only supports this for ELF executables, but
>it has most of the infrastructure to do individual policy
>per file. Maybe it would be better to go into this direction,
>only thing missing is a nice way to declare policy for
>arbitary files. Even in this case a global default would be useful.
>
>I haven't done anything with this patch yet due to missing time
>and there were a few small issues to resolve, but i hope it
>can be eventually integrated.
>
>[Steve, perhaps you can repost the patch to lse-tech for more
>wider review?]
>
>
Sure, patch is attached. Also, here is a reposting of my original email to
you (Andi) describing the patch. Btw, I received your comments on the
patch, I will reply to your points separately. Sorry I haven't replied
sooner,
I'm in the middle of switching jobs :-)
-------- original email follows ----------
Hi Andi,
I'm working on adding the features to NUMA mempolicy
necessary to support MontaVista's MTA.
Attached is the first of those features, support for
global page allocation policy for mapped files. Here's
what the patch is doing:
1. add a shared_policy tree to the address_space object in fs.h.
2. modify page_cache_alloc() in pagemap.h to take an address_space
object and page offset, and use those to allocate a page for the
page cache using the policy in the address_space object.
3. modify filemap.c to pass the additional {mapping, page offset} pair
to page_cache_alloc().
4. Also in filemap.c, implement generic file {set|get}_policy() methods and
add those to generic_file_vm_ops.
5. In filemap_nopage(), verify that any existing page located in the cache
   is in a node that satisfies the file's policy. If it's not, it must be
   because the page was allocated before the file had any policies. If it's
   unused, free it and goto retry_find (which will allocate a new page using
   the file's policy). Note that a similar operation is done in
   exec.c:setup_arg_pages() for stack pages.
6. Init the file's shared policy in alloc_inode(), and free the shared
   policy in destroy_inode().
I'm working on the remaining features needed for MTA. They are:
- support for policies contained in ELF images, for text and data regions.
- support for do_mmap_mempolicy() and do_brk_mempolicy(). do_mmap() can
  allocate pages to the region before the function exits, such as when
  pages are locked for the region, so it's necessary in that case to set
  the VMA's policy within do_mmap() before those pages are allocated.
- system calls for mmap_mempolicy and brk_mempolicy.
Let me know your thoughts on the filemap policy patch.
Thanks,
Steve
[-- Attachment #2: filemap-policy.patch --]
[-- Type: text/plain, Size: 10821 bytes --]
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/fs/exec.c 2.6.8-rc3/fs/exec.c
--- 2.6.8-rc3.orig/fs/exec.c 2004-08-10 15:18:07.000000000 -0700
+++ 2.6.8-rc3/fs/exec.c 2004-09-01 21:53:25.000000000 -0700
@@ -439,6 +439,25 @@
for (i = 0 ; i < MAX_ARG_PAGES ; i++) {
struct page *page = bprm->page[i];
if (page) {
+#ifdef CONFIG_NUMA
+ if (!mpol_node_valid(page_to_nid(page), mpnt, 0)) {
+ void *from, *to;
+ struct page * new_page =
+ alloc_pages_current(GFP_HIGHUSER, 0);
+ if (!new_page) {
+ up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, mpnt);
+ return -ENOMEM;
+ }
+ from = kmap(page);
+ to = kmap(new_page);
+ copy_page(to, from);
+ kunmap(page);
+ kunmap(new_page);
+ put_page(page);
+ page = new_page;
+ }
+#endif
bprm->page[i] = NULL;
install_arg_page(mpnt, page, stack_base);
}
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/fs/inode.c 2.6.8-rc3/fs/inode.c
--- 2.6.8-rc3.orig/fs/inode.c 2004-08-10 15:18:07.000000000 -0700
+++ 2.6.8-rc3/fs/inode.c 2004-09-01 11:40:44.000000000 -0700
@@ -150,6 +150,7 @@
mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
+ mpol_shared_policy_init(&mapping->policy);
/*
* If the block_device provides a backing_dev_info for client
@@ -177,11 +178,12 @@
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
- else
+ else {
+ mpol_free_shared_policy(&inode->i_mapping->policy);
kmem_cache_free(inode_cachep, (inode));
+ }
}
-
/*
* These are initializations that only need to be done
* once, because the fields are idempotent across use
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/fs.h 2.6.8-rc3/include/linux/fs.h
--- 2.6.8-rc3.orig/include/linux/fs.h 2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/fs.h 2004-09-01 21:08:37.000000000 -0700
@@ -18,6 +18,7 @@
#include <linux/cache.h>
#include <linux/prio_tree.h>
#include <linux/kobject.h>
+#include <linux/mempolicy.h>
#include <asm/atomic.h>
struct iovec;
@@ -339,6 +340,7 @@
atomic_t truncate_count; /* Cover race condition with truncate */
unsigned long flags; /* error bits/gfp mask */
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+ struct shared_policy policy; /* page alloc policy */
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/mempolicy.h 2.6.8-rc3/include/linux/mempolicy.h
--- 2.6.8-rc3.orig/include/linux/mempolicy.h 2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/mempolicy.h 2004-09-01 21:54:34.000000000 -0700
@@ -152,6 +152,8 @@
void mpol_free_shared_policy(struct shared_policy *p);
struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
unsigned long idx);
+struct page *alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+ unsigned long idx);
extern void numa_default_policy(void);
extern void numa_policy_init(void);
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/pagemap.h 2.6.8-rc3/include/linux/pagemap.h
--- 2.6.8-rc3.orig/include/linux/pagemap.h 2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/pagemap.h 2004-09-01 11:04:24.000000000 -0700
@@ -50,14 +50,24 @@
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
-static inline struct page *page_cache_alloc(struct address_space *x)
+
+static inline struct page *__page_cache_alloc(struct address_space *x,
+ unsigned long idx,
+ unsigned int gfp_mask)
+{
+ return alloc_page_shared_policy(gfp_mask, &x->policy, idx);
+}
+
+static inline struct page *page_cache_alloc(struct address_space *x,
+ unsigned long idx)
{
- return alloc_pages(mapping_gfp_mask(x), 0);
+ return __page_cache_alloc(x, idx, mapping_gfp_mask(x));
}
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *x,
+ unsigned long idx)
{
- return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
+ return __page_cache_alloc(x, idx, mapping_gfp_mask(x)|__GFP_COLD);
}
typedef int filler_t(void *, struct page *);
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/filemap.c 2.6.8-rc3/mm/filemap.c
--- 2.6.8-rc3.orig/mm/filemap.c 2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/filemap.c 2004-09-01 21:52:06.000000000 -0700
@@ -534,7 +534,8 @@
page = find_lock_page(mapping, index);
if (!page) {
if (!cached_page) {
- cached_page = alloc_page(gfp_mask);
+ cached_page = __page_cache_alloc(mapping, index,
+ gfp_mask);
if (!cached_page)
return NULL;
}
@@ -627,7 +628,7 @@
return NULL;
}
gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
- page = alloc_pages(gfp_mask, 0);
+ page = __page_cache_alloc(mapping, index, gfp_mask);
if (page && add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
page_cache_release(page);
page = NULL;
@@ -789,7 +790,7 @@
* page..
*/
if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
+ cached_page = page_cache_alloc_cold(mapping, index);
if (!cached_page) {
desc->error = -ENOMEM;
goto out;
@@ -1050,7 +1051,7 @@
struct page *page;
int error;
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, offset);
if (!page)
return -ENOMEM;
@@ -1070,6 +1071,7 @@
return error == -EEXIST ? 0 : error;
}
+
#define MMAP_LOTSAMISS (100)
/*
@@ -1090,7 +1092,7 @@
struct page *page;
unsigned long size, pgoff, endoff;
int did_readaround = 0, majmin = VM_FAULT_MINOR;
-
+
pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
@@ -1162,6 +1164,38 @@
goto no_cached_page;
}
+#ifdef CONFIG_NUMA
+ if (!mpol_node_valid(page_to_nid(page), area, 0)) {
+ /*
+ * the page in the cache is not in any of the nodes this
+ * VMA's policy wants it to be in. Can we remove it?
+ */
+ lock_page(page);
+ if (page_count(page) - !!PagePrivate(page) == 2) {
+ /*
+ * This page isn't being used by any mappings,
+ * so we can safely remove it. It must be left
+ * over from an earlier file IO readahead when
+ * there was no page allocation policy associated
+ * with the file.
+ */
+ spin_lock(&mapping->tree_lock);
+ __remove_from_page_cache(page);
+ spin_unlock(&mapping->tree_lock);
+ page_cache_release(page); /* pagecache ref */
+ unlock_page(page);
+ page_cache_release(page); /* us */
+ goto retry_find;
+ } else {
+ /*
+ * darn, the page is being used by other mappings.
+ * We'll just have to leave the page in this node.
+ */
+ unlock_page(page);
+ }
+ }
+#endif
+
if (!did_readaround)
ra->mmap_hit++;
@@ -1431,9 +1465,35 @@
return 0;
}
+
+#ifdef CONFIG_NUMA
+int generic_file_set_policy(struct vm_area_struct *vma,
+ struct mempolicy *new)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ return mpol_set_shared_policy(&mapping->policy, vma, new);
+}
+
+struct mempolicy *
+generic_file_get_policy(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ unsigned long idx;
+
+ idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ return mpol_shared_policy_lookup(&mapping->policy, idx);
+}
+#endif
+
+
static struct vm_operations_struct generic_file_vm_ops = {
.nopage = filemap_nopage,
.populate = filemap_populate,
+#ifdef CONFIG_NUMA
+ .set_policy = generic_file_set_policy,
+ .get_policy = generic_file_get_policy,
+#endif
};
/* This is used for a general mmap of a disk file */
@@ -1483,7 +1543,7 @@
page = find_get_page(mapping, index);
if (!page) {
if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
+ cached_page = page_cache_alloc_cold(mapping, index);
if (!cached_page)
return ERR_PTR(-ENOMEM);
}
@@ -1565,7 +1625,7 @@
page = find_lock_page(mapping, index);
if (!page) {
if (!*cached_page) {
- *cached_page = page_cache_alloc(mapping);
+ *cached_page = page_cache_alloc(mapping, index);
if (!*cached_page)
return NULL;
}
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/mempolicy.c 2.6.8-rc3/mm/mempolicy.c
--- 2.6.8-rc3.orig/mm/mempolicy.c 2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/mempolicy.c 2004-09-01 21:49:14.000000000 -0700
@@ -638,6 +638,7 @@
return page;
}
+
/**
* alloc_page_vma - Allocate a page for a VMA.
*
@@ -683,6 +684,7 @@
return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
}
+
/**
* alloc_pages_current - Allocate pages.
*
@@ -1003,6 +1005,28 @@
up(&p->sem);
}
+struct page *
+alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+ unsigned long idx)
+{
+ struct page *page;
+
+ if (sp) {
+ struct vm_area_struct pvma;
+ /* Create a pseudo vma that just contains the policy */
+ memset(&pvma, 0, sizeof(struct vm_area_struct));
+ pvma.vm_end = PAGE_SIZE;
+ pvma.vm_pgoff = idx;
+ pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
+ page = alloc_page_vma(gfp, &pvma, 0);
+ mpol_free(pvma.vm_policy);
+ } else {
+ page = alloc_pages(gfp, 0);
+ }
+
+ return page;
+}
+
/* assumes fs == KERNEL_DS */
void __init numa_policy_init(void)
{
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/readahead.c 2.6.8-rc3/mm/readahead.c
--- 2.6.8-rc3.orig/mm/readahead.c 2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/readahead.c 2004-09-01 20:39:14.000000000 -0700
@@ -246,7 +246,7 @@
continue;
spin_unlock_irq(&mapping->tree_lock);
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, page_offset);
spin_lock_irq(&mapping->tree_lock);
if (!page)
break;
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/shmem.c 2.6.8-rc3/mm/shmem.c
--- 2.6.8-rc3.orig/mm/shmem.c 2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/shmem.c 2004-09-01 11:14:48.000000000 -0700
@@ -824,16 +824,7 @@
shmem_alloc_page(unsigned long gfp, struct shmem_inode_info *info,
unsigned long idx)
{
- struct vm_area_struct pvma;
- struct page *page;
-
- memset(&pvma, 0, sizeof(struct vm_area_struct));
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
- pvma.vm_pgoff = idx;
- pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp, &pvma, 0);
- mpol_free(pvma.vm_policy);
- return page;
+ return alloc_page_shared_policy(gfp, &info->policy, idx);
}
#else
static inline struct page *
* Re: [PATCH 0/2] mm: memory policy for page cache allocation
2004-09-20 23:48 ` Steve Longerbeam
@ 2004-09-23 15:54 ` Ray Bryant
2004-09-23 23:01 ` Steve Longerbeam
0 siblings, 1 reply; 6+ messages in thread
From: Ray Bryant @ 2004-09-23 15:54 UTC (permalink / raw)
To: Steve Longerbeam; +Cc: linux-mm, lse-tech, linux-kernel
Hi Steve,
Steve Longerbeam wrote:
> -------- original email follows ----------
>
> Hi Andi,
>
> I'm working on adding the features to NUMA mempolicy
> necessary to support MontaVista's MTA.
>
> Attached is the first of those features, support for
> global page allocation policy for mapped files. Here's
> what the patch is doing:
>
> 1. add a shared_policy tree to the address_space object in fs.h.
> 2. modify page_cache_alloc() in pagemap.h to take an address_space
> object and page offset, and use those to allocate a page for the
> page cache using the policy in the address_space object.
> 3. modify filemap.c to pass the additional {mapping, page offset} pair
> to page_cache_alloc().
> 4. Also in filemap.c, implement generic file {set|get}_policy() methods and
> add those to generic_file_vm_ops.
> 5. In filemap_nopage(), verify that any existing page located in the cache
> is located in a node that satisfies the file's policy. If it's not in
> a node that
> satisfies the policy, it must be because the page was allocated
> before the
> file had any policies. If it's unused, free it and goto retry_find
> (will allocate
> a new page using the file's policy). Note that a similar operation is
> done in
> exec.c:setup_arg_pages() for stack pages.
> 6. Init the file's shared policy in alloc_inode(), and free the shared
> policy in
> destroy_inode().
>
> I'm working on the remaining features needed for MTA. They are:
>
> - support for policies contained in ELF images, for text and data regions.
> - support for do_mmap_mempolicy() and do_brk_mempolicy(). Do_mmap()
> can allocate pages to the region before the function exits, such as
> when pages
> are locked for the region. So it's necessary in that case to set the
> VMA's policy
> within do_mmap() before those pages are allocated.
> - system calls for mmap_mempolicy and brk_mempolicy.
>
> Let me know your thoughts on the filemap policy patch.
>
> Thanks,
> Steve
>
>
Steve,
I guess I am a little lost on this without understanding what MTA is.
Is there a design/requirements document you can point me at?
Also, can you comment on how the above is related to my page cache
allocation policy patch? Does having a global page cache allocation
policy with a per process override satisfy your requirements at all
or do you specifically have per file policies you want to specify?
(Just trying to figure out how to work both of our requirements into
the kernel in as simple as possible (but no simpler!) fashion.)
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
* Re: [PATCH 0/2] mm: memory policy for page cache allocation
2004-09-23 15:54 ` [PATCH " Ray Bryant
@ 2004-09-23 23:01 ` Steve Longerbeam
0 siblings, 0 replies; 6+ messages in thread
From: Steve Longerbeam @ 2004-09-23 23:01 UTC (permalink / raw)
To: Ray Bryant; +Cc: linux-mm, lse-tech, linux-kernel
Ray Bryant wrote:
> Hi Steve,
>
> Steve Longerbeam wrote:
>
>> -------- original email follows ----------
>>
>> Hi Andi,
>>
>> I'm working on adding the features to NUMA mempolicy
>> necessary to support MontaVista's MTA.
>>
>> Attached is the first of those features, support for
>> global page allocation policy for mapped files. Here's
>> what the patch is doing:
>>
>> 1. add a shared_policy tree to the address_space object in fs.h.
>> 2. modify page_cache_alloc() in pagemap.h to take an address_space
>> object and page offset, and use those to allocate a page for the
>> page cache using the policy in the address_space object.
>> 3. modify filemap.c to pass the additional {mapping, page offset} pair
>> to page_cache_alloc().
>> 4. Also in filemap.c, implement generic file {set|get}_policy()
>> methods and
>> add those to generic_file_vm_ops.
>> 5. In filemap_nopage(), verify that any existing page located in the
>> cache
>> is located in a node that satisfies the file's policy. If it's not
>> in a node that
>> satisfies the policy, it must be because the page was allocated
>> before the
>> file had any policies. If it's unused, free it and goto retry_find
>> (will allocate
>> a new page using the file's policy). Note that a similar operation
>> is done in
>> exec.c:setup_arg_pages() for stack pages.
>> 6. Init the file's shared policy in alloc_inode(), and free the
>> shared policy in
>> destroy_inode().
>>
>> I'm working on the remaining features needed for MTA. They are:
>>
>> - support for policies contained in ELF images, for text and data
>> regions.
>> - support for do_mmap_mempolicy() and do_brk_mempolicy(). Do_mmap()
>> can allocate pages to the region before the function exits, such as
>> when pages
>> are locked for the region. So it's necessary in that case to set
>> the VMA's policy
>> within do_mmap() before those pages are allocated.
>> - system calls for mmap_mempolicy and brk_mempolicy.
>>
>> Let me know your thoughts on the filemap policy patch.
>>
>> Thanks,
>> Steve
>>
>>
>
> Steve,
>
> I guess I am a little lost on this without understanding what MTA is.
> Is there a design/requirements document you can point me at?
Not yet, sorry. There is an internal wiki specification at MontaVista
Software, but it's specific to the 2.4.20 design of MTA.
>
> Also, can you comment on how the above is related to my page cache
> allocation policy patch? Does having a global page cache allocation
> policy with a per process override satisfy your requirements at all
> or do you specifically have per file policies you want to specify?
MTA stands for "Memory Type-based Allocation" (the name was chosen by a
large customer of MontaVista). The idea behind MTA is identical to NUMA
memory policy in 2.6.8, but with extra features. MTA was developed
before NUMA mempolicy (it was originally developed in 2.4.20).
The basic idea of MTA is to allow file-mapped and anonymous VMAs
to contain a preference list of NUMA nodes that a page should be
allocated from.
So in MTA there is only one policy, which is very similar to the BIND
policy in 2.6.8.
MTA requires per mapped file policies. The patch I posted adds a
shared_policy tree to the address_space object, so that every file
can have its own policy for page cache allocations. A mapped file
can have a tree of policies, one for each mapped region of the file,
for instance, text and initialized data. With the patch, file mapped
policies would work across all filesystems, and the specific support
in tmpfs and hugetlbfs can be removed.
The goal of MTA is to direct an entire program's resident pages (text
and data regions of the executable and all its shared libs) to a
single node or a specific set of nodes. The primary use of MTA (by
the customer) is to allow portions of memory to be powered off for
low power modes, and still have critical system applications running.
In MTA the executable file's policies are stored in the ELF image.
There is a utility to add a section containing the list of preferred
nodes for the executable's text and data regions. That section is
parsed by load_elf_binary(). The section data is in the form of
mnemonic node name strings, which load_elf_binary() converts to a
node id list.
MTA also supports policies for the slab allocator.
>
> (Just trying to figure out how to work both of our requirements into
> the kernel in as simple as possible (but no simpler!) fashion.)
Could we have both a global page cache policy and per-file policies?
That is, if a mapped file has a policy, it overrides the global
policy. That would work fine for MTA.
Steve