From: mel@skynet.ie (Mel Gorman)
To: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: linux-mm@kvack.org, Christoph Lameter <clameter@sgi.com>,
ak@suse.de, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
akpm@linux-foundation.org, pj@sgi.com,
Michael Kerrisk <mtk-manpages@gmx.net>,
Randy Dunlap <randy.dunlap@oracle.com>,
Eric Whitney <eric.whitney@hp.com>
Subject: Re: [PATCH] Document Linux Memory Policy - V3
Date: Fri, 3 Aug 2007 14:52:53 +0100
Message-ID: <20070803135253.GA20048@skynet.ie>
In-Reply-To: <1185914902.6240.125.camel@localhost>
On (31/07/07 16:48), Lee Schermerhorn didst pronounce:
> [PATCH] Document Linux Memory Policy - V3
>
> V3 -> V2:
> + edits and rework suggested by Randy Dunlap, Mel Gorman and Christoph
> Lameter. N.B., I couldn't make all of the changes exactly as suggested
> and retain what I consider important semantics. Therefore, I tried to
> capture the spirit of the suggestions as best I could.
>
> V1 -> V2:
> + Uh, I forget the details. Rework based on suggestions by Andi Kleen
> and Christoph Lameter. E.g., dropped syscall details and updated
> the man pages, instead.
>
> I couldn't find any memory policy documentation in the Documentation
> directory, so here is my attempt to document it.
>
> There's lots more that could be written about the internal design--including
> data structures, functions, etc. However, if you agree that this is better
> than the nothing that exists now, perhaps it could be merged. This will
> provide a baseline for updates to document the many policy patches that are
> currently being worked.
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
I'm happy with that. It gets lots of useful information out in as clear
a manner as you can get.
Acked-by: Mel Gorman <mel@csn.ul.ie>
> Documentation/vm/memory_policy.txt | 332 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 332 insertions(+)
>
> Index: Linux/Documentation/vm/memory_policy.txt
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ Linux/Documentation/vm/memory_policy.txt 2007-07-31 15:54:50.000000000 -0400
> @@ -0,0 +1,332 @@
> +
> +What is Linux Memory Policy?
> +
> +In the Linux kernel, "memory policy" determines from which node the kernel will
> +allocate memory in a NUMA system or in an emulated NUMA system. Linux has
> +supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
> +The current memory policy support was added to Linux 2.6 around May 2004. This
> +document attempts to describe the concepts and APIs of the 2.6 memory policy
> +support.
> +
> +Memory policies should not be confused with cpusets (Documentation/cpusets.txt)
> +which are an administrative mechanism for restricting the nodes from which
> +memory may be allocated by a set of processes. Memory policies are a
> +programming interface that a NUMA-aware application can take advantage of. When
> +both cpusets and policies are applied to a task, the restrictions of the cpuset
> +take priority. See "MEMORY POLICIES AND CPUSETS" below for more details.
> +
> +MEMORY POLICY CONCEPTS
> +
> +Scope of Memory Policies
> +
> +The Linux kernel supports _scopes_ of memory policy, described here from
> +most general to most specific:
> +
> + System Default Policy: this policy is "hard coded" into the kernel. It
> + is the policy that governs all page allocations that aren't controlled
> + by one of the more specific policy scopes discussed below. When the
> + system is "up and running", the system default policy will use "local
> + allocation" described below. However, during boot up, the system
> + default policy will be set to interleave allocations across all nodes
> + with "sufficient" memory, so as not to overload the initial boot node
> + with boot-time allocations.
> +
> + Task/Process Policy: this is an optional, per-task policy. When defined
> + for a specific task, this policy controls all page allocations made by or
> + on behalf of the task that aren't controlled by a more specific scope.
> + If a task does not define a task policy, then all page allocations that
> + would have been controlled by the task policy "fall back" to the System
> + Default Policy.
> +
> + The task policy applies to the entire address space of a task. Thus,
> + it is inheritable, and indeed is inherited, across both fork()
> + [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
> + to establish the task policy for a child task exec()'d from an
> + executable image that has no awareness of memory policy. See the
> + MEMORY POLICY APIS section, below, for an overview of the system call
> + that a task may use to set/change its task/process policy.
> +
> + In a multi-threaded task, task policies apply only to the thread
> + [Linux kernel task] that installs the policy and any threads
> + subsequently created by that thread. Any sibling threads existing
> + at the time a new task policy is installed retain their current
> + policy.
> +
> + A task policy applies only to pages allocated after the policy is
> + installed. Any pages already faulted in by the task when the task
> + changes its task policy remain where they were allocated based on
> + the policy at the time they were allocated.
> +
> + VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
> + virtual address space. A task may define a specific policy for a range
> + of its virtual address space. See the MEMORY POLICY APIS section,
> + below, for an overview of the mbind() system call used to set a VMA
> + policy.
> +
> + A VMA policy will govern the allocation of pages that back this region of
> + the address space. Any regions of the task's address space that don't
> + have an explicit VMA policy will fall back to the task policy, which may
> + itself fall back to the System Default Policy.
> +
> + VMA policies have a few complicating details:
> +
> + VMA policy applies ONLY to anonymous pages. These include pages
> + allocated for anonymous segments, such as the task stack and heap, and
> + any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
> + If a VMA policy is applied to a file mapping, it will be ignored if
> + the mapping used the MAP_SHARED flag. If the file mapping used the
> + MAP_PRIVATE flag, the VMA policy will only be applied when an
> + anonymous page is allocated on an attempt to write to the mapping--
> + i.e., at Copy-On-Write.
> +
> + VMA policies are shared between all tasks that share a virtual address
> + space--a.k.a. threads--independent of when the policy is installed; and
> + they are inherited across fork(). However, because VMA policies refer
> + to a specific region of a task's address space, and because the address
> + space is discarded and recreated on exec*(), VMA policies are NOT
> + inheritable across exec(). Thus, only NUMA-aware applications may
> + use VMA policies.
> +
> + A task may install a new VMA policy on a sub-range of a previously
> + mmap()ed region. When this happens, Linux splits the existing virtual
> + memory area into 2 or 3 VMAs, each with its own policy.
> +
> + By default, VMA policy applies only to pages allocated after the policy
> + is installed. Any pages already faulted into the VMA range remain
> + where they were allocated based on the policy at the time they were
> + allocated. However, since 2.6.16, Linux supports page migration via
> + the mbind() system call, so that page contents can be moved to match
> + a newly installed policy.
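A minimal sketch of the 2-or-3-way split mentioned above, in case it helps readers picture it. This is not the kernel's code: `struct range` and `split_for_policy()` are hypothetical names, and ranges are assumed to be half-open [start, end).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical half-open address range [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

/*
 * Split 'vma' around the sub-range 'pol' that receives a new policy.
 * Writes the resulting pieces to out[] (room for 3) and returns how
 * many there are: 1 if the policy covers the whole VMA, 2 if it
 * touches one edge, 3 if it lies strictly inside.
 */
size_t split_for_policy(struct range vma, struct range pol,
			struct range out[3])
{
	size_t n = 0;

	/* The policy range must be non-empty and inside the VMA. */
	assert(vma.start <= pol.start && pol.start < pol.end &&
	       pol.end <= vma.end);

	if (vma.start < pol.start)
		out[n++] = (struct range){ vma.start, pol.start };
	out[n++] = pol;		/* the piece that gets the new policy */
	if (pol.end < vma.end)
		out[n++] = (struct range){ pol.end, vma.end };
	return n;
}
```

Only the middle (or edge) piece carries the new policy; the remaining pieces keep whatever policy the original VMA had.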
> +
> + Shared Policy: Conceptually, shared policies apply to "memory objects"
> + mapped shared into one or more tasks' distinct address spaces. An
> + application installs shared policies the same way as VMA policies--using
> + the mbind() system call specifying a range of virtual addresses that map
> + the shared object. However, unlike VMA policies, which can be considered
> + to be an attribute of a range of a task's address space, shared policies
> + apply directly to the shared object. Thus, all tasks that attach to the
> + object share the policy, and all pages allocated for the shared object,
> + by any task, will obey the shared policy.
> +
> + As of 2.6.22, only shared memory segments, created by shmget() or
> + mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
> + policy support was added to Linux, the associated data structures were
> + added to hugetlbfs shmem segments. At the time, hugetlbfs did not
> + support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
> + shmem segments were never "hooked up" to the shared policy support.
> + Although hugetlbfs segments now support lazy allocation, their support
> + for shared policy has not been completed.
> +
> + As mentioned above [re: VMA policies], allocations of page cache
> + pages for regular files mmap()ed with MAP_SHARED ignore any VMA
> + policy installed on the virtual address range backed by the shared
> + file mapping. Rather, shared page cache pages, including pages backing
> + private mappings that have not yet been written by the task, follow
> + task policy, if any, else System Default Policy.
> +
> + The shared policy infrastructure supports different policies on subset
> + ranges of the shared object. However, Linux still splits the VMA of
> + the task that installs the policy for each range of distinct policy.
> + Thus, different tasks that attach to a shared memory segment can have
> + different VMA configurations mapping that one shared object. This
> + can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
> + a shared memory region, when one task has installed shared policy on
> + one or more ranges of the region.
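The fallback chain between scopes could be summarised with something like the following. It is a sketch under stated assumptions, not the kernel's lookup: `struct policy`, `system_default_policy` and `effective_policy()` are hypothetical stand-ins, and a scope with no policy installed is represented by NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct mempolicy. */
struct policy {
	int mode;
};

/* Stand-in for the hard-coded System Default Policy. */
const struct policy system_default_policy = { 0 };

/*
 * Pick the policy that governs one allocation, most specific scope
 * first.  'vma_or_shared' and 'task' are NULL when that scope has no
 * policy installed.
 */
const struct policy *effective_policy(const struct policy *vma_or_shared,
				      const struct policy *task)
{
	if (vma_or_shared)
		return vma_or_shared;
	if (task)
		return task;
	return &system_default_policy;
}
```

A page cache page for a MAP_SHARED file mapping would, per the text above, be looked up with the first argument NULL, since VMA policy is ignored there.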
> +
> +Components of Memory Policies
> +
> + A Linux memory policy is a tuple consisting of a "mode" and an optional set
> + of nodes. The mode determines the behavior of the policy, while the
> + optional set of nodes can be viewed as the arguments to the behavior.
> +
> + Internally, memory policies are implemented by a reference counted
> + structure, struct mempolicy. Details of this structure will be discussed
> + in context, below, as required to explain the behavior.
> +
> + Note: in some functions AND in the struct mempolicy itself, the mode
> + is called "policy". However, to avoid confusion with the policy tuple,
> + this document will continue to use the term "mode".
> +
> + Linux memory policy supports the following 4 behavioral modes:
> +
> + Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
> + context or scope dependent.
> +
> + As mentioned in the Scope of Memory Policies section above, during normal
> + system operation, the System Default Policy is hard coded to
> + contain the Default mode.
> +
> + In this context, default mode means "local" allocation--that is
> + attempt to allocate the page from the node associated with the cpu
> + where the fault occurs. If the "local" node has no memory, or the
> + node's memory is exhausted [no free pages available], local
> + allocation will "fallback to"--attempt to allocate pages from--
> + "nearby" nodes, in order of increasing "distance".
> +
> + Implementation detail -- subject to change: "Fallback" uses
> + a per node list of sibling nodes--called zonelists--built at
> + boot time, or when nodes or memory are added to or removed from
> + the system [memory hotplug]. These per node zonelists are
> + constructed with nodes in order of increasing distance based
> + on information provided by the platform firmware.
> +
> + When a task/process policy or a shared policy contains the Default
> + mode, this also means "local allocation", as described above.
> +
> + In the context of a VMA, Default mode means "fall back to task
> + policy"--which may or may not specify Default mode. Thus, Default
> + mode cannot be counted on to mean local allocation when used
> + on a non-shared region of the address space. However, see
> + MPOL_PREFERRED below.
> +
> + The Default mode does not use the optional set of nodes.
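To illustrate the "increasing distance" ordering used for local allocation fallback, here is a rough sketch. It orders whole nodes only, whereas real zonelists hold zones, and the names `EX_MAX_NODES` and `build_fallback_order()` as well as the distance table layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define EX_MAX_NODES 8	/* hypothetical small limit for the sketch */

/*
 * Fill order[] with all node ids, nearest first, as seen from 'node'.
 * dist[a][b] is the firmware-reported distance from a to b.  Ties keep
 * the lower node id first.  Insertion sort is enough for a handful of
 * nodes.
 */
void build_fallback_order(int node, size_t nnodes,
			  const int dist[][EX_MAX_NODES], int order[])
{
	assert(nnodes > 0 && nnodes <= EX_MAX_NODES);
	assert(node >= 0 && (size_t)node < nnodes);

	for (size_t i = 0; i < nnodes; i++) {
		size_t j = i;

		/* Shift farther nodes right to make room for node i. */
		while (j > 0 && dist[node][order[j - 1]] > dist[node][i]) {
			order[j] = order[j - 1];
			j--;
		}
		order[j] = (int)i;
	}
}
```

Since a node is normally closest to itself, it ends up first in its own list, which matches the "try local, then nearby" behavior described above.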
> +
> + MPOL_BIND: This mode specifies that memory must come from the
> + set of nodes specified by the policy.
> +
> + The memory policy APIs do not specify an order in which the nodes
> + will be searched. However, unlike "local allocation", the Bind
> + policy does not consider the distance between the nodes. Rather,
> + allocations will fall back to the nodes specified by the policy in
> + order of numeric node id. Like everything in Linux, this is subject
> + to change.
> +
> + MPOL_PREFERRED: This mode specifies that the allocation should be
> + attempted from the single node specified in the policy. If that
> + allocation fails, the kernel will search other nodes, exactly as
> + it would for a local allocation that started at the preferred node
> + in increasing distance from the preferred node. "Local" allocation
> + policy can be viewed as a Preferred policy that starts at the node
> + containing the cpu where the allocation takes place.
> +
> + Internally, the Preferred policy uses a single node--the
> + preferred_node member of struct mempolicy. A "distinguished"
> + value of this preferred_node, currently '-1', is interpreted
> + as "the node containing the cpu where the allocation takes
> + place"--local allocation. This is the way to specify
> + local allocation for a specific range of addresses--i.e. for
> + VMA policies.
> +
> + MPOL_INTERLEAVE: This mode specifies that page allocations be
> + interleaved, on a page granularity, across the nodes specified in
> + the policy. This mode also behaves slightly differently, based on
> + the context where it is used:
> +
> + For allocation of anonymous pages and shared memory pages,
> + Interleave mode indexes the set of nodes specified by the policy
> + using the page offset of the faulting address into the segment
> + [VMA] containing the address modulo the number of nodes specified
> + by the policy. It then attempts to allocate a page, starting at
> + the selected node, as if the node had been specified by a Preferred
> + policy or had been selected by a local allocation. That is,
> + allocation will follow the per node zonelist.
> +
> + For allocation of page cache pages, Interleave mode indexes the set
> + of nodes specified by the policy using a node counter maintained
> + per task. This counter wraps around to the lowest specified node
> + after it reaches the highest specified node. This will tend to
> + spread the pages out over the nodes specified by the policy based
> + on the order in which they are allocated, rather than based on any
> + page offset into an address range or file. During system boot up,
> + the temporary interleaved system default policy works in this
> + mode.
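The two Interleave variants described above differ only in how the starting node is picked. A sketch of both, assuming the policy's nodes are given as a plain array and ignoring any file offset of the mapping; `interleave_by_offset()` and `interleave_by_counter()` are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Anonymous/shared memory: index the policy's node list by the page
 * offset of the faulting address within its VMA, modulo the node count.
 */
int interleave_by_offset(const int nodes[], size_t nnodes,
			 unsigned long vma_start, unsigned long addr,
			 unsigned long page_size)
{
	assert(nnodes > 0 && page_size > 0 && addr >= vma_start);
	return nodes[((addr - vma_start) / page_size) % nnodes];
}

/*
 * Page cache: a per-task counter walks the node list and wraps from
 * the last node back to the first.  '*next' is that counter.
 */
int interleave_by_counter(const int nodes[], size_t nnodes, size_t *next)
{
	int node;

	assert(nnodes > 0 && *next < nnodes);
	node = nodes[*next];
	*next = (*next + 1) % nnodes;
	return node;
}
```

In both cases the returned node is only where the allocation starts; if it has no free pages, the allocation follows that node's zonelist as the text says.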
> +
> +MEMORY POLICY APIs
> +
> +Linux supports 3 system calls for controlling memory policy. These APIs
> +always affect only the calling task, the calling task's address space, or
> +some shared object mapped into the calling task's address space.
> +
> + Note: the headers that define these APIs and the parameter data types
> + for user space applications reside in a package that is not part of
> + the Linux kernel. The kernel system call interfaces, with the 'sys_'
> + prefix, are declared in <linux/syscalls.h>; the mode and flag
> + definitions are in <linux/mempolicy.h>.
> +
> +Set [Task] Memory Policy:
> +
> + long set_mempolicy(int mode, const unsigned long *nmask,
> + unsigned long maxnode);
> +
> + Sets the calling task's "task/process memory policy" to the mode
> + specified by the 'mode' argument and the set of nodes defined
> + by 'nmask'. 'nmask' points to a bit mask of node ids containing
> + at least 'maxnode' ids.
> +
> + See the set_mempolicy(2) man page for more details
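For readers unfamiliar with the 'nmask' layout, a sketch of building one: an array of unsigned longs with one bit per node id. `EX_MASK_LONGS` and `build_nodemask()` are hypothetical; the array size is an arbitrary choice for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define EX_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define EX_MASK_LONGS 4	/* hypothetical: room for 4 longs of node bits */

/* Clear the mask, then set one bit per node id listed in ids[]. */
void build_nodemask(unsigned long mask[EX_MASK_LONGS],
		    const unsigned int ids[], size_t nids)
{
	memset(mask, 0, EX_MASK_LONGS * sizeof(mask[0]));
	for (size_t i = 0; i < nids; i++) {
		assert(ids[i] < EX_MASK_LONGS * EX_BITS_PER_LONG);
		mask[ids[i] / EX_BITS_PER_LONG] |=
			1UL << (ids[i] % EX_BITS_PER_LONG);
	}
}
```

The resulting array is what would be passed as 'nmask'; the exact meaning of 'maxnode' relative to the array size is best taken from set_mempolicy(2) rather than from this sketch.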
> +
> +
> +Get [Task] Memory Policy or Related Information
> +
> + long get_mempolicy(int *mode,
> + unsigned long *nmask, unsigned long maxnode,
> + void *addr, int flags);
> +
> + Queries the "task/process memory policy" of the calling task, or
> + the policy or location of a specified virtual address, depending
> + on the 'flags' argument.
> +
> + See the get_mempolicy(2) man page for more details
> +
> +
> +Install VMA/Shared Policy for a Range of Task's Address Space
> +
> + long mbind(void *start, unsigned long len, int mode,
> + const unsigned long *nmask, unsigned long maxnode,
> + unsigned flags);
> +
> + mbind() installs the policy specified by (mode, nmask, maxnode) as
> + a VMA policy for the range of the calling task's address space
> + specified by the 'start' and 'len' arguments. Additional actions
> + may be requested via the 'flags' argument.
> +
> + See the mbind(2) man page for more details.
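One practical detail from mbind(2): the start address has to be a multiple of the page size. A caller holding an arbitrary buffer could widen its range to whole pages first, roughly like this. `page_align_range()` is a hypothetical helper, and the page size is assumed to be a power of two, e.g. the value of sysconf(_SC_PAGESIZE).

```c
#include <assert.h>

/*
 * Widen [addr, addr + len) to whole pages.  page_size must be a
 * non-zero power of two.  The widened range is returned through
 * 'start' and 'out_len'.
 */
void page_align_range(unsigned long addr, unsigned long len,
		      unsigned long page_size,
		      unsigned long *start, unsigned long *out_len)
{
	unsigned long mask = page_size - 1;
	unsigned long end;

	assert(page_size != 0 && (page_size & mask) == 0);
	end = (addr + len + mask) & ~mask;	/* round the end up */
	*start = addr & ~mask;			/* round the start down */
	*out_len = end - *start;
}
```

Widening is harmless for the policy itself, since policy applies to whole pages anyway, but it does mean neighbouring data on the same pages falls under the new policy too.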
> +
> +MEMORY POLICY COMMAND LINE INTERFACE
> +
> +Although not strictly part of the Linux implementation of memory policy,
> +a command line tool, numactl(8), exists that allows one to:
> +
> ++ set the task policy for a specified program via set_mempolicy(2), fork(2) and
> + exec(2)
> +
> ++ set the shared policy for a shared memory segment via mbind(2)
> +
> +The numactl(8) tool is packaged with the run-time version of the library
> +containing the memory policy system call wrappers. Some distributions
> +package the headers and compile-time libraries in a separate development
> +package.
> +
> +
> +MEMORY POLICIES AND CPUSETS
> +
> +Memory policies work within cpusets as described above. For memory policies
> +that require a node or set of nodes, the nodes are restricted to the set of
> +nodes whose memories are allowed by the cpuset constraints. If the
> +intersection of the set of nodes specified for the policy and the set of nodes
> +allowed by the cpuset is the empty set, the policy is considered invalid and
> +cannot be installed.
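The intersection rule in the paragraph above can be sketched as a mask operation. This assumes a single unsigned long holds all node bits, which is only true for small systems; `restrict_to_cpuset()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Restrict a policy's node mask to the nodes the cpuset allows.
 * Returns false, leaving *effective untouched, when nothing is left,
 * i.e. the policy would be rejected.
 */
bool restrict_to_cpuset(unsigned long policy_nodes,
			unsigned long cpuset_mems,
			unsigned long *effective)
{
	unsigned long both = policy_nodes & cpuset_mems;

	assert(effective != NULL);
	if (both == 0)
		return false;
	*effective = both;
	return true;
}
```

For a region shared by tasks in two cpusets, the same check would be applied against the allowed memories of each cpuset in turn.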
> +
> +The interaction of memory policies and cpusets can be problematic for a
> +couple of reasons:
> +
> +1) the memory policy APIs take physical node ids as arguments. However, the
> + memory policy APIs do not provide a way to determine what nodes are valid
> + in the context where the application is running. An application MAY consult
> + the cpuset file system [directly or via an out of tree, and not generally
> + available, libcpuset API] to obtain this information, but then the
> + application must be aware that it is running in a cpuset and use what are
> + intended primarily as administrative APIs.
> +
> + However, as long as the policy specifies at least one node that is valid
> + in the controlling cpuset, the policy can be used.
> +
> +2) when tasks in two cpusets share access to a memory region, such as shared
> + memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
> + MAP_SHARED flags, and any of the tasks install shared policy on the region,
> + only nodes whose memories are allowed in both cpusets may be used in the
> + policies. Again, obtaining this information requires "stepping outside"
> + the memory policy APIs, as well as knowing in what cpusets other tasks might
> + be attaching to the shared region, to use the cpuset information.
> + Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
> + allocation is the only valid policy.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .