From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andi Kleen <ak@suse.de>
Subject: Re: [PATCH] Document Linux Memory Policy
Date: Wed, 30 May 2007 12:55:03 -0400	[thread overview]
Message-ID: <1180544104.5850.70.camel@localhost> (raw)
In-Reply-To: <Pine.LNX.4.64.0705291247001.26308@schroedinger.engr.sgi.com>

On Tue, 2007-05-29 at 13:04 -0700, Christoph Lameter wrote:
> On Tue, 29 May 2007, Lee Schermerhorn wrote:
> 
> > +	A task policy applies only to pages allocated after the policy is
> > +	installed.  Any pages already faulted in by the task remain where
> > +	they were allocated based on the policy at the time they were
> > +	allocated.
> 
> You can use cpusets to automatically migrate pages and sys_migrate_pages 
> to manually migrate pages of a process though.

I consider cpusets, and the explicit migration APIs, orthogonal to
mempolicy.  Mempolicy is an application interface, while cpusets are an
administrative interface that restricts what mempolicy can ask for.  And
sys_migrate_pages/sys_move_pages seem to ignore mempolicy altogether.

I would agree, however, that they could be better integrated.  E.g., how
can a NUMA-aware application [one that uses the mempolicy APIs]
determine which memories it's allowed to use?  So far, the only method
I've found is to try each node in the mask and treat the ones that don't
error out as valid.  Seems a bit awkward...
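
Roughly, that probe looks like the following -- an untested sketch,
assuming libnuma's <numa.h>/<numaif.h> wrappers [link with -lnuma] and
fewer than 64 nodes:

/*
 * Probe which nodes the current cpuset allows by trying to bind to each
 * one in turn.  Nodes outside the cpuset's mems_allowed make
 * set_mempolicy() fail with EINVAL.
 */
#include <stdio.h>
#include <errno.h>
#include <numa.h>	/* numa_available(), numa_max_node() */
#include <numaif.h>	/* set_mempolicy(), MPOL_* */

int main(void)
{
	int node, maxnode;

	if (numa_available() < 0)
		return 1;

	maxnode = numa_max_node();
	for (node = 0; node <= maxnode; node++) {
		unsigned long mask = 1UL << node;

		if (set_mempolicy(MPOL_BIND, &mask, sizeof(mask) * 8) == 0)
			printf("node %d: allowed\n", node);
		else if (errno == EINVAL)
			printf("node %d: not allowed (or offline)\n", node);
	}

	/* restore the default task policy -- assumes that's what we started with */
	set_mempolicy(MPOL_DEFAULT, NULL, 0);
	return 0;
}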

> 
> > +    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
> > +    virtual address space.  A task may define a specific policy for a range
> > +    of its virtual address space.  This VMA policy will govern the allocation
> > +    of pages that back this region of the address space.  Any regions of the
> > +    task's address space that don't have an explicit VMA policy will fall back
> > +    to the task policy, which may itself fall back to the system default policy.
> 
> The system default policy is always the same when the system is running. 
> There is no way to configure it. So it would be easier to avoid this layer 
> and say they fall back to node local

What you describe is, indeed, the effect, but I'm trying to explain why
it works that way.  
> 
> 
> > +	VMA policies are shared between all tasks that share a virtual address
> > +	space--a.k.a. threads--independent of when the policy is installed; and
> > +	they are inherited across fork().  However, because VMA policies refer
> > +	to a specific region of a task's address space, and because the address
> > +	space is discarded and recreated on exec*(), VMA policies are NOT
> > +	inheritable across exec().  Thus, only NUMA-aware applications may
> > +	use VMA policies.
> 
> Memory policies require NUMA. Drop the last sentence? You can set the task 
> policy via numactl though.

I disagree about dropping the last sentence.  I can/will define
NUMA-aware as applications that directly call the mempolicy APIs.  You
can run an unmodified, non-NUMA-aware program on a NUMA platform with or
without numactl and take whatever performance you get.  In some cases,
you'll be leaving performance on the table, but that may be a trade-off
some are willing to make to avoid having to modify their existing
applications.
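
For completeness, here's roughly what "using VMA policies" means from
user space -- an untested sketch [node number is made up; assumes
libnuma's <numaif.h>, link with -lnuma]:

#include <string.h>
#include <sys/mman.h>
#include <numaif.h>

#define REGION_SIZE (16UL * 1024 * 1024)

int main(void)
{
	unsigned long nodemask = 1UL << 1;	/* illustrative: node 1 */
	char *region;

	region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return 1;

	/* VMA policy: future faults in [region, region+REGION_SIZE) should
	 * allocate from node 1, for every thread sharing this mm */
	if (mbind(region, REGION_SIZE, MPOL_PREFERRED, &nodemask,
		  sizeof(nodemask) * 8, 0) != 0)
		return 1;

	memset(region, 0, REGION_SIZE);	/* these faults obey the VMA policy */
	munmap(region, REGION_SIZE);
	return 0;
}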

> 
> > +    Shared Policy:  This policy applies to "memory objects" mapped shared into
> > +    one or more tasks' distinct address spaces.  Shared policies are applied
> > +    directly to the shared object.  Thus, all tasks that attach to the object
> > +    share the policy, and all pages allocated for the shared object, by any
> > +    task, will obey the shared policy.
> > +
> > +	Currently [2.6.22], only shared memory segments, created by shmget(),
> > +	support shared policy.  When shared policy support was added to Linux,
> > +	the associated data structures were added to shared hugetlbfs segments.
> > +	However, at the time, hugetlbfs did not support allocation at fault
> > +	time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked
> > +	up" to the shared policy support.  Although hugetlbfs segments now
> > +	support lazy allocation, their support for shared policy has not been
> > +	completed.
> 
> I guess patches would be welcome to complete it. But that may only be 
> relevant if huge pages are shared between processes. We so far have no 
> case in which that support is required.

See response to Andi's mail re:  database use of shmem & hugepages.
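
For reference, since shmget() segments are the one object type that takes
a shared policy today, the user space side looks roughly like this
[untested sketch; node numbers and sizes are illustrative]:

#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <numaif.h>

#define SEG_SIZE (8UL * 1024 * 1024)

int main(void)
{
	unsigned long nodemask = (1UL << 0) | (1UL << 1);  /* nodes 0 and 1 */
	int shmid = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
	char *seg;

	if (shmid < 0)
		return 1;
	seg = shmat(shmid, NULL, 0);
	if (seg == (void *)-1)
		return 1;

	/* Shared policy: stored in the segment itself, so every task that
	 * attaches and faults pages in gets them interleaved over 0 and 1 */
	mbind(seg, SEG_SIZE, MPOL_INTERLEAVE, &nodemask,
	      sizeof(nodemask) * 8, 0);

	memset(seg, 0, SEG_SIZE);	/* allocations follow the shared policy */

	shmdt(seg);
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}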

> 
> > +	Although internal to the kernel shared memory segments are really
> > +	files backed by swap space that have been mmap()ed shared into tasks'
> > +	address spaces, regular files mmap()ed shared do NOT support shared
> > +	policy.  Rather, shared page cache pages, including pages backing
> > +	private mappings that have not yet been written by the task, follow
> > +	task policy, if any, else system default policy.
> 
> Yes. Shared memory segments do not represent file content. The file 
> content of mmap pages may exist before the mmap. Also there may be regular
> buffered I/O going on which will also use the task policy. 

Unix/Posix/Linux semantics are very flexible with respect to file
descriptor access [read, write, et al.] and memory mapped access to
files.  One CAN access files via both of these interfaces, and the
system jumps through hoops backwards [e.g., consider truncation] to make
it work.  However, some applications just access the files via mmap()
and want to control the NUMA placement like any other component of their
address space.   Read/write access to such a file, while I agree it
should work, is, IMO, secondary to load/store access.  In such a case,
the performance of the load/store access shouldn't be sacrificed for the
read/write case, which already has to go through system calls, buffer
copies, ...

> 
> Having no vma policy support ensures that pagecache pages, regardless of 
> whether they are mmapped or not, will get the task policy applied.

Which is fine if that's what you want.  But if you're using a memory-mapped
file as a persistent shared memory area, and you want its pages faulted in
where you specified as you access them, that may not be what you want.  I
guarantee it's not what I want.

However, it seems to me, this is our other discussion.  What I've tried
to do with this patch is document the existing concepts and behavior, as
I understand them.  

> 
> > +   Linux memory policy supports the following 4 modes:
> > +
> > +	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
> > +	context dependent.
> > +
> > +	    The system default policy is hard coded to contain the Default mode.
> > +	    In this context, it means "local" allocation--that is attempt to
> > +	    allocate the page from the node associated with the cpu where the
> > +	    fault occurs.  If the "local" node has no memory, or the node's
> > +	    memory can be exhausted [no free pages available], local allocation
> > +	    memory is exhausted [no free pages available], local allocation
> > +	    list of nodes--called zonelists--built at boot time.
> > +
> > +		TODO:  address runtime rebuild of node/zonelists when
> > +		supported.
> 
> Why?

Because "built at boot time" is then not strictly correct, is it?  
> 
> > +	    When a task/process policy contains the Default mode, it means
> > +	    "fall back to the system default mode".  And, as discussed above,
> > +	    this means use "local" allocation.
> 
> This would be easier if you would drop the system default mode and simply 
> say its node local.

I'm trying to build the reader's mental map.  
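
In code terms, the bottom of that fallback chain is simply [sketch, using
libnuma's <numaif.h> wrapper; the helper name is mine]:

#include <numaif.h>

/* Revert the calling task to the system default policy.  With
 * MPOL_DEFAULT and no nodemask, allocations that have no more specific
 * VMA or shared policy fall back to "local" allocation. */
int revert_to_default_policy(void)
{
	return set_mempolicy(MPOL_DEFAULT, NULL, 0);
}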
> 
> > +	    In the context of a VMA, Default mode means "fall back to task
> > +	    policy"--which may, itself, fall back to system default policy.
> > +	    In the context of shared policies, Default mode means fall back
> > +	    directly to the system default policy.  Note:  the result of this
> > +	    semantic is that if the task policy is something other than Default,
> > +	    it is not possible to specify local allocation for a region of the
> > +	    task's address space using a VMA policy.
> > +
> > +	    The Default mode does not use the optional set of nodes.
> 
> Neither does the preferred node mode.

Actually, it does take the node mask argument.  It just selects the
first node therein.  See response to Andi.

> 
> > +	MPOL_BIND:  This mode specifies that memory must come from the
> > +	set of nodes specified by the policy.  The kernel builds a custom
> > +	zonelist containing just the nodes specified by the Bind policy.
> > +	If the kernel is unable to allocate a page from the first node in the
> > +	custom zonelist, it moves on to the next, and so forth.  If it is unable
> > +	to allocate a page from any of the nodes in this list, the allocation
> > +	will fail.
> > +
> > +	    The memory policy APIs do not specify an order in which the nodes
> > +	    will be searched.  However, unlike the per node zonelists mentioned
> > +	    above, the custom zonelist for the Bind policy does not consider
> > +	    distance between the nodes.  Rather, the lists are built in order
> > +	    of numeric node id.
> 
> Right. TODO: MPOL_BIND needs to pick the best node.
> 
> > +	MPOL_PREFERRED:  This mode specifies that the allocation should be
> > +	attempted from the single node specified in the policy.  If that
> > +	allocation fails, the kernel will search other nodes, exactly as
> > +	it would for a local allocation that started at the preferred node--
> > +	that is, using the per-node zonelists in increasing distance from
> > +	the preferred node.
> > +
> > +	    If the Preferred policy specifies more than one node, the node
> > +	    with the numerically lowest node id will be selected to start
> > +	    the allocation scan.
> 
> AFAIK preferred policy was only intended to specify one node.

Covered in response to Andi.
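
To illustrate the difference being discussed above [untested sketch; node
numbers are made up; assumes libnuma's <numaif.h>]:

#include <numaif.h>

static unsigned long bindmask = (1UL << 0) | (1UL << 2);  /* nodes 0, 2 */
static unsigned long prefmask = 1UL << 2;                 /* node 2 */

void use_bind(void)
{
	/* allocations must come from nodes 0 or 2, else they fail */
	set_mempolicy(MPOL_BIND, &bindmask, sizeof(bindmask) * 8);
}

void use_preferred(void)
{
	/* try node 2 first, then fall back along its zonelist */
	set_mempolicy(MPOL_PREFERRED, &prefmask, sizeof(prefmask) * 8);
}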
> 
> > +	    For allocation of page cache pages, Interleave mode indexes the set
> > +	    of nodes specified by the policy using a node counter maintained
> > +	    per task.  This counter wraps around to the lowest specified node
> > +	    after it reaches the highest specified node.  This will tend to
> > +	    spread the pages out over the nodes specified by the policy based
> > +	    on the order in which they are allocated, rather than based on any
> > +	    page offset into an address range or file.
> 
> Which is particularly important if random pages in a file are used.
> 
> > +Linux supports 3 system calls for controlling memory policy.  These APIs
> > +always affect only the calling task, the calling task's address space, or
> > +some shared object mapped into the calling task's address space.
> 
> These are wrapped by the numactl library. So these are not exposed to the 
> user.
> 
> > +	Note:  the headers that define these APIs and the parameter data types
> > +	for user space applications reside in a package that is not part of
> > +	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
> > +	prefix, are defined in <linux/syscalls.h>; the mode and flag
> > +	definitions are defined in <linux/mempolicy.h>.
> 
> You need to mention the numactl library here.

I'm trying to describe kernel behavior.  I would expect this to be
picked up by the man pages at some point.  As I responded to Andi, I'll
work with the maintainers... when I get the time.
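
For reference, the three calls look like this through libnuma's
<numaif.h> wrappers [values illustrative; the raw sys_* entry points take
the same arguments]:

#include <stdio.h>
#include <numaif.h>

int main(void)
{
	unsigned long mask = 1UL << 0;	/* node 0 */
	int mode;

	/* 1. set_mempolicy(): task policy for all future allocations */
	set_mempolicy(MPOL_PREFERRED, &mask, sizeof(mask) * 8);

	/* 2. mbind(): VMA or shared policy for an address range;
	 *    needs a mapping -- see the earlier mmap()/shmat() examples */

	/* 3. get_mempolicy(): query the task policy back */
	if (get_mempolicy(&mode, NULL, 0, NULL, 0) == 0)
		printf("task policy mode = %d\n", mode);

	return 0;
}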
> 
> > +	'flags' may also contain 'MPOL_F_NODE'.  This flag has been
> > +	described in some get_mempolicy() man pages as "not for application
> > +	use" and subject to change.  Applications are cautioned against
> > +	using it.  However, for completeness and because it is useful for
> > +	testing the kernel memory policy support, current behavior is
> > +	documented here:
> 
> The docs are wrong. This is fully supported.
> 
> > +	    Note:  if the address specifies an anonymous region of the
> > +	    task's address space with no page currently allocated, the
> > +	    resulting "read access fault" will likely just map the shared
> > +	    ZEROPAGE.  It will NOT, for example, allocate a local page in
> > +	    the case of default policy [unless the task happens to be
> > +	    running on the node containing the ZEROPAGE], nor will it obey
> > +	    VMA policy, if any.
> 
> Yes, the intent was for it to be used on a mapped page.

Just pointing out that this might not be what you expect.  E.g., if you
mbind() an anonymous region to some node where the ZEROPAGE does NOT
reside [do we intend to do per-node ZEROPAGEs, or was that idea
dropped?], fault in the pages via read access and then query the page
location, either via get_mempolicy() with MPOL_F_ADDR|MPOL_F_NODE or via
numa_maps, you'll see the pages on some node you don't expect and think
it's broken.  Well, not YOU, but someone not familiar with kernel
internals might.
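
The query in question looks roughly like this [untested sketch; node
number is made up]:

#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
	unsigned long nodemask = 1UL << 1;	/* illustrative: node 1 */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	volatile char c;
	int node;

	if (p == MAP_FAILED)
		return 1;

	/* VMA policy says "node 1"... */
	mbind(p, 4096, MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8, 0);

	c = p[0];	/* read fault: likely maps the shared ZEROPAGE */
	(void)c;

	/* ask which node backs this address; with the zero page mapped,
	 * the answer may be "wherever the zero page lives", not what the
	 * VMA policy would have picked for a real allocation */
	if (get_mempolicy(&node, NULL, 0, p, MPOL_F_ADDR | MPOL_F_NODE) == 0)
		printf("page at %p is on node %d\n", (void *)p, node);

	munmap(p, 4096);
	return 0;
}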

> 
> > +	If the address space range covers an anonymous region or a private
> > +	mapping of a regular file, a VMA policy will be installed in this
> > +	region.  This policy will govern all subsequent allocations of pages
> > +	for that range for all threads in the task.
> 
> Won't it be installed regardless of whether it is anonymous or not?

Yes, I suppose I could reword that and the next paragraph differently.

> 
> > +	If the address space range covers a shared mapping of a regular
> > +	file, a VMA policy will be installed for that range.  This policy
> > +	will be ignored for all page allocations by the calling task or
> > +	by any other task.  Rather, all page allocations in that range will
> > +	be allocated using the faulting task's task policy, if any, else
> > +	the system default policy.
> 
> The policy is going to be used for COW in that range.

You don't get COW if it's a shared mapping.  You use the page cache
pages, which ignore my mbind().  That's my beef!  [;-)]
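
For the record, the case I'm complaining about looks like this [sketch;
the file name and node number are made up]:

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
	unsigned long nodemask = 1UL << 1;	/* illustrative: node 1 */
	int fd = open("/tmp/shared-data", O_RDWR);
	char *p;

	if (fd < 0)
		return 1;
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* recorded on the VMA, but [as of 2.6.22] not consulted when the
	 * page cache pages backing the file are allocated */
	mbind(p, 4096, MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8, 0);

	p[0] = 1;	/* page comes from the page cache, per task policy */

	munmap(p, 4096);
	close(fd);
	return 0;
}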

Lee


