Andi Kleen wrote:
On Mon, Sep 20, 2004 at 12:00:33PM -0700, Ray Bryant wrote:
  
Background
----------

Last month, Jesse Barnes proposed a patch to do round robin
allocation of page cache pages on NUMA machines.  This got shot down
for a number of reasons (see
  http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
and the related thread), but it seemed to me that one of the most
significant issues was that this was a workload dependent optimization.
That is, for an Altix running an HPC workload, it was a good thing,
but for web servers or file servers it was not such a good idea.

The idea of this patch is the following:  it creates a new memory
policy structure (default_pagecache_policy) that controls how storage
for page cache pages is allocated.  For a large Altix running HPC
workloads, you can specify a policy that does round robin allocations,
and for other workloads you can specify the default policy (which
results in page cache pages being allocated locally).

The default_pagecache_policy is overridable on a per-process basis, so
that if your application prefers to allocate page cache pages locally,
it can.
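
To make the selection order concrete, here is a minimal user-space sketch
of the behaviour described above. It is not taken from the patch: only the
name default_pagecache_policy comes from the description, while
struct pc_policy, struct task, the PC_POLICY_* modes, MAX_NODES and
pagecache_alloc_node() are hypothetical names invented for illustration.
The kernel's real mempolicy structures and modes differ.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4   /* assumed node count, for illustration only */

/* Hypothetical policy modes; the kernel's mempolicy modes are different. */
enum pc_policy_mode { PC_POLICY_LOCAL, PC_POLICY_ROUND_ROBIN };

struct pc_policy {
    enum pc_policy_mode mode;
    unsigned int next_node;          /* round-robin cursor */
};

/* System-wide default: allocate page cache pages on the local node. */
static struct pc_policy default_pagecache_policy = { PC_POLICY_LOCAL, 0 };

/* Hypothetical stand-in for a process. */
struct task {
    int local_node;                      /* node the task is running on */
    struct pc_policy *pagecache_policy;  /* NULL means "use the global default" */
};

/* Pick the node for the next page cache page allocated on behalf of t. */
static int pagecache_alloc_node(struct task *t)
{
    struct pc_policy *pol = t->pagecache_policy ? t->pagecache_policy
                                                : &default_pagecache_policy;
    int node;

    assert(t->local_node >= 0 && t->local_node < MAX_NODES);

    if (pol->mode == PC_POLICY_LOCAL)
        return t->local_node;

    /* Round robin: hand out nodes in turn, wrapping at MAX_NODES. */
    node = (int)(pol->next_node % MAX_NODES);
    pol->next_node = (pol->next_node + 1) % MAX_NODES;
    return node;
}
```

With the global default switched to PC_POLICY_ROUND_ROBIN, a task without
an override is spread across nodes, while a task that points
pagecache_policy at a PC_POLICY_LOCAL policy keeps allocating on its own
node.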
    

I'm not sure this really makes sense. Do you have some clear use 
case where having so much flexibility is needed? 

I would prefer to have a global setting somewhere for the page
cache (sysctl or sysfs or what you prefer) and some special handling for 
text pages. 

This would keep the per thread bloat low. 

Also, I should mention that Steve Longerbeam submitted a patch to me
that does policy per file.

So far it only supports this for ELF executables, but
it has most of the infrastructure to do individual policy
per file. Maybe it would be better to go in this direction;
the only thing missing is a nice way to declare policy for
arbitrary files. Even in this case a global default would be useful.

I haven't done anything with this patch yet for lack of time,
and there were a few small issues to resolve, but I hope it
can eventually be integrated.

[Steve, perhaps you can repost the patch to lse-tech for
wider review?]
  

Sure, patch is attached. Also, here is a reposting of my original email to
you (Andi) describing the patch. Btw, I received your comments on the
patch, and I will reply to your points separately. Sorry I haven't replied
sooner, I'm in the middle of switching jobs  :-)


-------- original email follows ----------

Hi Andi,

I'm working on adding the features to NUMA mempolicy
necessary to support MontaVista's MTA.

Attached is the first of those features, support for
global page allocation policy for mapped files. Here's
what the patch is doing:

1. add a shared_policy tree to the address_space object in fs.h.
2. modify page_cache_alloc() in pagemap.h to take an address_space
    object and page offset, and use those to allocate a page for the
    page cache using the policy in the address_space object.
3. modify filemap.c to pass the additional {mapping, page offset} pair
    to page_cache_alloc().
4. Also in filemap.c, implement generic file {set|get}_policy() methods and
    add those to generic_file_vm_ops.
5. In filemap_nopage(), verify that any existing page located in the cache
    is located in a node that satisfies the file's policy. If it's not in a node that
    satisfies the policy, it must be because the page was allocated before the
    file had any policies. If it's unused, free it and goto retry_find (will allocate
    a new page using the file's policy). Note that a similar operation is done in
    exec.c:setup_arg_pages() for stack pages.
6. Init the file's shared policy in alloc_inode(), and free the shared policy in
    destroy_inode().
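
Steps 2 and 5 above can be sketched in user space roughly as follows. This
is an illustration under stated assumptions, not code from the attached
patch: struct mapping stands in for address_space, a flat array of ranges
stands in for the shared_policy tree, and every name here
(struct policy_range, mapping_policy_node(), page_cache_alloc_node(),
cached_page_satisfies_policy(), NO_NODE) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NO_NODE (-1)   /* no policy covers this page offset */

/* One policy range: pages [start, end) of the file should live on `node`. */
struct policy_range {
    unsigned long start, end;
    int node;
};

/* Stand-in for address_space; an array replaces the shared policy tree. */
struct mapping {
    const struct policy_range *ranges;
    size_t nr_ranges;
};

/* Look up the node the file's policy wants for page offset pgoff. */
static int mapping_policy_node(const struct mapping *m, unsigned long pgoff)
{
    size_t i;

    for (i = 0; i < m->nr_ranges; i++)
        if (pgoff >= m->ranges[i].start && pgoff < m->ranges[i].end)
            return m->ranges[i].node;
    return NO_NODE;
}

/* Step 2: choose the allocation node from {mapping, page offset}. */
static int page_cache_alloc_node(const struct mapping *m,
                                 unsigned long pgoff, int local_node)
{
    int node = mapping_policy_node(m, pgoff);

    return node == NO_NODE ? local_node : node;  /* no policy: stay local */
}

/* Step 5: does a page already in the cache sit on an acceptable node? */
static int cached_page_satisfies_policy(const struct mapping *m,
                                        unsigned long pgoff, int page_node)
{
    int node = mapping_policy_node(m, pgoff);

    return node == NO_NODE || node == page_node;
}
```

In this model, a caller playing the role of filemap_nopage() would drop an
unused cached page for which cached_page_satisfies_policy() returns 0 and
retry, so that the replacement is allocated on the node returned by
page_cache_alloc_node().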

I'm working on the remaining features needed for MTA. They are:

- support for policies contained in ELF images, for text and data regions.
- support for do_mmap_mempolicy() and do_brk_mempolicy(). do_mmap()
   can allocate pages to the region before the function exits, such as when pages
   are locked for the region. So it's necessary in that case to set the VMA's policy
   within do_mmap() before those pages are allocated.
- system calls for mmap_mempolicy and brk_mempolicy.
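
The ordering concern behind do_mmap_mempolicy() can be illustrated with a
small user-space model. This is a sketch, not the planned implementation:
struct region stands in for a VMA, and map_region_with_policy(),
populate(), REGION_PAGES and NODE_UNSET are hypothetical names. The only
point it makes is that the policy has to be in place before a locked
region is populated.

```c
#include <assert.h>
#include <stddef.h>

#define REGION_PAGES 4     /* assumed region size, for illustration only */
#define NODE_UNSET   (-1)  /* page not allocated yet / no policy set */

/* Hypothetical stand-in for a VMA. */
struct region {
    int policy_node;               /* NODE_UNSET means "no policy" */
    int locked;                    /* nonzero: pages are faulted in at map time */
    int page_node[REGION_PAGES];   /* node of each page, NODE_UNSET if absent */
};

/* Allocate every page of the region, honouring the region's policy. */
static void populate(struct region *r, int local_node)
{
    int node = r->policy_node == NODE_UNSET ? local_node : r->policy_node;
    int i;

    for (i = 0; i < REGION_PAGES; i++)
        r->page_node[i] = node;
}

/* Map a region, setting its policy before any page can be allocated. */
static void map_region_with_policy(struct region *r, int locked,
                                   int policy_node, int local_node)
{
    int i;

    for (i = 0; i < REGION_PAGES; i++)
        r->page_node[i] = NODE_UNSET;
    r->locked = locked;

    r->policy_node = policy_node;  /* policy first ... */
    if (locked)
        populate(r, local_node);   /* ... then the locked pages */
}
```

If the policy were applied only after the mapping call returned, the pages
of a locked region would already have been placed on the local node, which
is the case the bullet above describes.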

Let me know your thoughts on the filemap policy patch.

Thanks,
Steve