From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 27 Jul 2007 11:38:36 -0700
From: Randy Dunlap
Subject: Re: [PATCH] Document Linux Memory Policy - V2
Message-Id: <20070727113836.9471e35e.randy.dunlap@oracle.com>
In-Reply-To: <1185559260.5069.40.camel@localhost>
References: <20070725111646.GA9098@skynet.ie>
	<20070726132336.GA18825@skynet.ie>
	<20070726225920.GA10225@skynet.ie>
	<20070727082046.GA6301@skynet.ie>
	<20070727154519.GA21614@skynet.ie>
	<1185559260.5069.40.camel@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Lee Schermerhorn
Cc: linux-mm@kvack.org, Christoph Lameter, ak@suse.de, Mel Gorman,
	KAMEZAWA Hiroyuki, akpm@linux-foundation.org, pj@sgi.com,
	Michael Kerrisk, Eric Whitney
List-ID:

On Fri, 27 Jul 2007 14:00:59 -0400 Lee Schermerhorn wrote:

> [PATCH] Document Linux Memory Policy - V2
>
>  Documentation/vm/memory_policy.txt |  278 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 278 insertions(+)
>
> Index: Linux/Documentation/vm/memory_policy.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ Linux/Documentation/vm/memory_policy.txt	2007-07-27 13:40:45.000000000 -0400
> @@ -0,0 +1,278 @@
> + ...
> +
> +MEMORY POLICY CONCEPTS
> +
> +Scope of Memory Policies
> +
> +The Linux kernel supports four more or less distinct scopes of memory policy:
> +
> +    System Default Policy:  this policy is "hard coded" into the kernel.  It
> +    is the policy that governs the all page allocations that aren't controlled

	drop ^ "the"

> +    by one of the more specific policy scopes discussed below.

Are these policies listed in order of "less specific scope to more
specific scope"?

> +    Task/Process Policy:  this is an optional, per-task policy.  When defined
> +    for a specific task, this policy controls all page allocations made by or
> +    on behalf of the task that aren't controlled by a more specific scope.
> +    If a task does not define a task policy, then all page allocations that
> +    would have been controlled by the task policy "fall back" to the System
> +    Default Policy.
> +
> + ...
> +
> +    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
> +    virtual adddress space.  A task may define a specific policy for a range
> +    of its virtual address space.  This VMA policy will govern the allocation
> +    of pages that back this region of the address space.  Any regions of the
> +    task's address space that don't have an explicit VMA policy will fall back
> +    to the task policy, which may itself fall back to the system default policy.
> +
> +    VMA policy applies ONLY to anonymous pages.  These include pages
> +    allocated for anonymous segments, such as the task stack and heap, and
> +    any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
> +    Anonymous pages copied from private file mappings [files mmap()ed with
> +    the MAP_PRIVATE flag] also obey VMA policy, if defined.
> +
> +    VMA policies are shared between all tasks that share a virtual address
> +    space--a.k.a. threads--independent of when the policy is installed; and
> +    they are inherited across fork().  However, because VMA policies refer
> +    to a specific region of a task's address space, and because the address
> +    space is discarded and recreated on exec*(), VMA policies are NOT
> +    inheritable across exec().  Thus, only NUMA-aware applications may
> +    use VMA policies.
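
It might be worth adding a tiny example of installing a VMA policy
around here.  Untested sketch; the mbind() wrapper and the MPOL_*
definitions come from the numactl package's <numaif.h> [link with
-lnuma], not from the kernel headers, and node 1 is assumed to exist:

    #include <numaif.h>             /* mbind(), MPOL_* */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
            size_t len = 8 * 1024 * 1024;

            /* anonymous mapping; no pages are allocated yet */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* install a VMA policy:  bind this range to node 1
             * (assumed present) */
            unsigned long nodemask = 1UL << 1;
            if (mbind(p, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, 0)) {
                    perror("mbind");
                    return 1;
            }

            /* pages faulted in from here on come from node 1 */
            ((char *)p)[0] = 1;
            return 0;
    }

That would also demonstrate the "pages allocated after the policy is
installed" point below, since the store happens after mbind().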
> +
> +    A task may install a new VMA policy on a sub-range of a previously
> +    mmap()ed region.  When this happens, Linux splits the existing virtual
> +    memory area into 2 or 3 VMAs, each with it's own policy.

	its

> +
> +    By default, VMA policy applies only to pages allocated after the policy
> +    is installed.  Any pages already faulted into the VMA range remain where
> +    they were allocated based on the policy at the time they were
> +    allocated.  However, since 2.6.16, Linux supports page migration so
> +    that page contents can be moved to match a newly installed policy.
> +
> +    Shared Policy:  This policy applies to "memory objects" mapped shared into
> +    one or more tasks' distinct address spaces.  Shared policies are applied
> +    directly to the shared object.  Thus, all tasks that attach to the object
> +    share the policy, and all pages allocated for the shared object, by any
> +    task, will obey the shared policy.
> +
> +    Currently [2.6.22], only shared memory segments, created by shmget(),
> +    support shared policy.  When shared policy support was added to Linux,
> +    the associated data structures were added to shared hugetlbfs segments.
> +    However, at the time, hugetlbfs did not support allocation at fault
> +    time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked

	a.k.a.

> +    up" to the shared policy support.  Although hugetlbfs segments now
> +    support lazy allocation, their support for shared policy has not been
> +    completed.
> +
> +    Although internal to the kernel shared memory segments are really
> +    files backed by swap space that have been mmap()ed shared into tasks'
> +    address spaces, regular files mmap()ed shared do NOT support shared

confusing sentence, esp. the beginning of it.

> +    policy.  Rather, shared page cache pages, including pages backing
> +    private mappings that have not yet been written by the task, follow
> +    task policy, if any, else system default policy.
> +
> +    The shared policy infrastructure supports different policies on subset
> +    ranges of the shared object.  However, Linux still splits the VMA of
> +    the task that installs the policy for each range of distinct policy.
> +    Thus, different tasks that attach to a shared memory segment can have
> +    different VMA configurations mapping that one shared object.
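
The shared policy discussion could use a short example, too.  Again
untested, same numactl <numaif.h> assumption, and nodes 0 and 1 are
assumed to exist:

    #include <numaif.h>             /* mbind(), MPOL_* */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int main(void)
    {
            size_t len = 4 * 1024 * 1024;
            int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
            void *p;

            if (id == -1) {
                    perror("shmget");
                    return 1;
            }
            p = shmat(id, NULL, 0);
            if (p == (void *)-1) {
                    perror("shmat");
                    return 1;
            }

            /* interleave the segment across nodes 0 and 1; per the
             * text above, the policy attaches to the shared object,
             * so any task that attaches to this segment allocates
             * under the same policy */
            unsigned long nodemask = 0x3;   /* nodes 0 and 1, assumed */
            if (mbind(p, len, MPOL_INTERLEAVE, &nodemask,
                      sizeof(nodemask) * 8, 0))
                    perror("mbind");
            return 0;
    }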
> +
> +Components of Memory Policies
> +
> +    A Linux memory policy is a tuple consisting of a "mode" and an optional set
> +    of nodes.  The mode determine the behavior of the policy, while the optional

	determines

> +    set of nodes can be viewed as the arguments to the behavior.
> +
> +    Internally, memory policies are implemented by a reference counted structure,
> +    struct mempolicy.  Details of this structure will be discussed in context,
> +    below.
> +
> +	Note:  in some functions AND in the struct mempolicy, the mode is
> +	called "policy".  However, to avoid confusion with the policy tuple,
> +	this document will continue to use the term "mode".
> +
> +    Linux memory policy supports the following 4 modes:
> +
> +	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
> +	context dependent.
> +
> +	During normal system operation, the system default policy is hard
> +	coded to contain the Default mode.  During system boot up, the
> +	system default policy is temporarily set to MPOL_INTERLEAVE [see
> +	below] to distribute boot time allocations across all nodes in
> +	the system, instead of using just the node containing the boot cpu.
> +
> +	In this context, default mode means "local" allocation--that is
> +	attempt to allocate the page from the node associated with the cpu
> +	where the fault occurs.  If the "local" node has no memory, or the
> +	node's memory can be exhausted [no free pages available], local
> +	allocation will attempt to allocate pages from "nearby" nodes, using
> +	a per node list of nodes--called zonelists--built at boot time, or
> +	when nodes or memory are added or removed from the system [memory
> +	hotplug].
> +
> +	When a task/process policy or a shared policy contains the Default
> +	mode, this also means local allocation, as described above.
> +
> +	In the context of a VMA, Default mode means "fall back to task
> +	policy"--which may or may not specify Default mode.  Thus, Default
> +	mode can not be counted on to mean local allocation when used

	cannot

> +	on a non-shared region of the address space.  However, see
> +	MPOL_PREFERRED below.
> +
> +	The Default mode does not use the optional set of nodes.
> +
> +	MPOL_BIND:  This mode specifies that memory must come from the
> +	set of nodes specified by the policy.  The kernel builds a custom
> +	zonelist pointed to by the zonelist member of struct mempolicy,
> +	containing just the nodes specified by the Bind policy.  If the kernel
> +	is unable to allocate a page from the first node in the custom zonelist,
> +	it moves on to the next, and so forth.  If it is unable to allocate a
> +	page from any of the nodes in this list, the allocation will fail.
> +
> +	The memory policy APIs do not specify an order in which the nodes
> +	will be searched.  However, unlike the per node zonelists mentioned
> +	above, the custom zonelist for the Bind policy do not consider the

	does not

> +	distance between the nodes.  Rather, the lists are built in order
> +	of numeric node id.
> +
> +	MPOL_PREFERRED:  This mode specifies that the allocation should be
> +	attempted from the single node specified in the policy.  If that
> +	allocation fails, the kernel will search other nodes, exactly as
> +	it would for a local allocation that started at the preferred node--
> +	that is, using the per-node zonelists in increasing distance from
> +	the preferred node.
> +
> +	Internally, the Preferred policy uses a single node--the
> +	preferred_node member of struct mempolicy.
> +
> +	If the Preferred policy node is '-1', then at page allocation time,
> +	the kernel will use the "local node" as the starting point for the
> +	allocation.  This is the way to specify local allocation for a
> +	specific range of addresses--i.e. for VMA policies.
> +
> +	MPOL_INTERLEAVED:  This mode specifies that page allocations be
> +	interleaved, on a page granularity, across the nodes specified in
> +	the policy.  This mode also behaves slightly differently, based on
> +	the context where it is used:
> + ...
> +
> +MEMORY POLICIES AND CPUSETS
> +
> +Memory policies work within cpusets as described above.  For memory policies
> +that require a node or set of nodes, the nodes are restricted to the set of
> +nodes whose memories are allowed by the cpuset constraints.  This can be
> +problematic for 2 reasons:
> +
> +1) the memory policy APIs take physical node id's as arguments.  However, the
> +   memory policy APIs do not provide a way to determine what nodes are valid
> +   in the context where the application is running.  An application MAY consult
> +   the cpuset file system [directly or via an out of tree, and not generally
> +   available, libcpuset API] to obtain this information, but then the
> +   application must be aware that it is running in a cpuset and use what are
> +   intended primarily as administrative APIs.
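
For point 1), maybe show what that consultation looks like?  Untested
sketch; it assumes the cpuset file system is mounted at /dev/cpuset,
which is only a convention, not a guarantee:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char cpuset[256], file[512], mems[256];
            FILE *f = fopen("/proc/self/cpuset", "r");

            if (!f || !fgets(cpuset, sizeof(cpuset), f)) {
                    perror("/proc/self/cpuset");
                    return 1;
            }
            fclose(f);
            cpuset[strcspn(cpuset, "\n")] = '\0';

            /* "mems" lists the node ids this cpuset allows */
            snprintf(file, sizeof(file), "/dev/cpuset%s/mems", cpuset);
            f = fopen(file, "r");
            if (!f || !fgets(mems, sizeof(mems), f)) {
                    perror(file);
                    return 1;
            }
            printf("allowed mems: %s", mems);
            fclose(f);
            return 0;
    }

...which is exactly the "administrative API" dependency that the
paragraph complains about.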
> +
> +2) when tasks in two cpusets share access to a memory region, such as shared
> +   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and

	or (?)

> +   MAP_SHARED flags, only nodes whose memories are allowed in both cpusets
> +   may be used in the policies.  Again, obtaining this information requires
> +   "stepping outside" the memory policy APIs to use the cpuset information.
> +   Furthermore, if the cpusets' "allowed memory" sets are disjoint, "local"
> +   allocation is the only valid policy.
> +
> +MEMORY POLICY APIs
> +
> +Linux supports 3 system calls for controlling memory policy.  These APIS
> +always affect only the calling task, the calling task's address space, or
> +some shared object mapped into the calling task's address space.
> +
> +	Note:  the headers that define these APIs and the parameter data types
> +	for user space applications reside in a package that is not part of
> +	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
> +	prefix, are defined in <linux/syscalls.h>; the mode and flag
> +	definitions are defined in <linux/mempolicy.h>.
> +
> +Set [Task] Memory Policy:
> +
> +	long set_mempolicy(int mode, const unsigned long *nmask,
> +			   unsigned long maxnode);
> +
> +	Set's the calling task's "task/process memory policy" to mode
> +	specified by the 'mode' argument and the set of nodes defined
> +	by 'nmask'.  'nmask' points to a bit mask of node ids containing
> +	at least 'maxnode' ids.
> +
> +	See the set_mempolicy(2) man page for more details.
> +
> +Get [Task] Memory Policy or Related Information
> +
> +	long get_mempolicy(int *mode,
> +			   const unsigned long *nmask, unsigned long maxnode,
> +			   void *addr, int flags);
> +
> +	Queries the "task/process memory policy" of the calling task, or
> +	the policy or location of a specified virtual address, depending
> +	on the 'flags' argument.
> +
> +	See the get_mempolicy(2) man page for more details.
> +
> +Install VMA/Shared Policy for a Range of Task's Address Space
> +
> +	long mbind(void *start, unsigned long len, int mode,
> +		   const unsigned long *nmask, unsigned long maxnode,
> +		   unsigned flags);
> +
> +	mbind() installs the policy specified by (mode, nmask, maxnodes) as
> +	a VMA policy for the range of the calling task's address space
> +	specified by the 'start' and 'len' arguments.  Additional actions
> +	may be requested via the 'flags' argument.
> +
> +	See the mbind(2) man page for more details.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
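
P.S.  If you want a small end-to-end example for the API section, feel
free to crib from this untested sketch [same numactl <numaif.h>
assumption as above; node 0 is hardcoded purely for illustration]:

    #include <numaif.h>             /* set/get_mempolicy(), mbind() */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long nodemask = 1UL << 0;      /* node 0 */
            size_t len = 1024 * 1024;
            int mode = -1;
            void *p;

            /* task policy:  prefer node 0 for future allocations */
            if (set_mempolicy(MPOL_PREFERRED, &nodemask,
                              sizeof(nodemask) * 8))
                    perror("set_mempolicy");

            /* read the task policy back */
            if (get_mempolicy(&mode, NULL, 0, NULL, 0))
                    perror("get_mempolicy");
            else
                    printf("task policy mode: %d\n", mode);

            /* VMA policy:  interleave one mapping across node 0 only
             * [degenerate, but it shows the call] */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            if (mbind(p, len, MPOL_INTERLEAVE, &nodemask,
                      sizeof(nodemask) * 8, 0))
                    perror("mbind");
            return 0;
    }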