linux-mm.kvack.org archive mirror
* [PATCH 0/2] mm: memory policy for page cache allocation
@ 2004-09-23  4:32 Ray Bryant
  2004-09-23  4:32 ` [PATCH 1/2] mm: page cache mempolicy " Ray Bryant
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ray Bryant @ 2004-09-23  4:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: William Lee Irwin III, Ray Bryant, linux-mm, Jesse Barnes,
	Dan Higgins, lse-tech, Brent Casavant, Nick Piggin,
	Martin J. Bligh, linux-kernel, Ray Bryant, Andrew Morton,
	Paul Jackson, Dave Hansen

Andi,

You may like the following patchset better.  (At least
I hope so...)

It's divided into three parts: this file (the OVERVIEW)
is part 0, and the two patches are parts 1 and 2.

I've tried to address several of your concerns with this
version of the patch:

(1)  We dropped the MPOL_ROUNDROBIN patch.  Instead, we
     use MPOL_INTERLEAVE to spread pages across nodes.
     However, rather than use the file offset etc to 
     calculate the node to allocate the page on, I used
     the same mechanism you used in alloc_pages_current()
     to calculate the node number (interleave_node()).
     That eliminates the need to generate an offset etc
     in the routines that call page_cache_alloc() and to
     me appears to be a simpler change that still fits
     within your design.

     We can still go the other way if you want; it matters
     not to me.  This was just dramatically less code (i. e.
     zero new lines) since it reuses the existing functionality.

(2)  I implemented the sys_set_mempolicy() changes as
     suggested -- higher order bits in the mode (first)
     argument specify whether or not this request is for
     the page allocation policy (your existing policy)
     or for the page cache allocation policy.  Similarly,
     a bit there indicates whether or not we want to set
     the process level policy or the system level policy.

     These bits are to be set in the flags argument of
     sys_mbind().

(3)  As before, there is a process level policy and a
     system level policy for both regular page allocation
     and page cache allocation.  The primary rationale
     for this is that since that is the way your code 
     worked for regular page allocation, it was easiest
     to piggyback on that, and hence you end up with a
     per process and a system default page cache allocation policy.
     If no one specifies a process level page cache
     allocation policy, the overhead of this is one long
     per task struct.  Making it otherwise would make
     the code less clean, I think.

     We continue to believe that we will have applications
     that wish to set the page cache allocation policy, 
     but, we don't have any demonstrable cases of this.

(4)  I added a new patch to remove a bias toward node
     0 of page allocations.  That is because each
     new process starts with an il_next = 0.  Now, I
     set il_next to current->pid % MAX_NUMNODES.
     See the 2nd patch for more discussion.

I haven't tested this much, it compiles and boots.
More testing will be done once I get your NUMA_API code
converted (perhaps not much needs to be done, don't 
know yet) to use the new interface.

Also, I got Steve's patch, and have looked at the overview,
but not the details.  If we could create a default policy for
page cache allocation that would be like MPOL_INTERLEAVE,
and then have per file settable policies, I guess we could
live with that, but it seems to me that a process would 
likely want all of its pages allocated the same way.  That
is, an HPC process would want all of its files allocated
round robin across the cpuset (most likely), while a file
server process would want its page cache pages allocated
locally.  It would be a pain to have to specify a special
policy for each file opened by a process, I would think,
unless there is some way to cache that in the process and
have it apply to all files that the process opens; but
then you are effectively emulating a per process policy
in user space, it seems to me.

    ---------------OVERVIEW--------------------

This is the second working release of this patch.

Changes since the last release
------------------------------

(1)  Dropped the MPOL_ROUNDROBIN patch.
(2)  Added some new text to the overview (see the <new text>
     markers below).
(3)  Changed to use the task struct field: il_next to
     control round robin allocation of pages when the
     policy is MPOL_INTERLEAVE.
(4)  Added code to set and get the additional policy types.
     The original policy in Andi Kleen's code is called
     POLICY_PAGE, because it deals with data page allocation,
     the new policy for page cache pages is called
     POLICY_PAGECACHE.
(5)  Added a new patch to this series to reduce allocation
     bias toward node 0.

Background
----------

In August, Jesse Barnes at SGI proposed a patch to do round robin
allocation of page cache pages on NUMA machines.  This got shot down
for a number of reasons (see
  http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
and the related thread), but it seemed to me that one of the most
significant issues was that this was a workload dependent optimization.
That is, for an Altix running an HPC workload, it was a good thing,
but for web servers or file servers it was not such a good idea.

So the idea of this patch is the following:  it creates a new memory
policy structure (default_pagecache_policy) that is used to control
how storage for page cache pages is allocated.  So, for a large Altix
running HPC workloads, we can specify a policy that does round robin
allocations, and for other workloads you can specify the default policy
(which results in page cache pages being allocated locally).

The default_pagecache_policy is override-able on a per process basis, so
that if your application prefers to allocate page cache pages locally,
it can.  <new text>  In this regard the pagecache policy behaves the same
as the page allocation policy and indeed all of the code to implement
the two is basically the same.

<new text>
The primary rationale for this is that this is the way the existing mempolicy
code works -- there is a per process policy, which is used if it exists,
and if the per process policy is null, then a global, default policy
is used.  This patch piggybacks on that existing code, so you get the
per process policy and a global policy for page cache allocations as well.

If the user does not define a per process policy, the extra cost is an
unused pointer in the task struct.  We can envision situations where
a per process cache allocation policy may be beneficial, but the real
case for this is that it allows us to use the existing code with only
minor modifications to implement, set and get the page cache mempolicy.

This is all done by making default_policy and current->mempolicy an
array of size 2 and of type "struct mempolicy *".   Entry POLICY_PAGE
in these arrays is the old default_policy and process memory policy.
Entry POLICY_PAGECACHE in these arrays contains the system default and
per process page cache allocation policies, respectively.

While one can, in principle, change the global page cache allocation
policy, we think this will be done precisely once per boot by calls from
startup scripts into the NUMA API code.  The idea is not so much to
make the global page cache policy easily changeable as to allow the
system admin to set it so that we don't have to compile
separate kernels for file servers and HPC servers.   In particular,
changing the page cache allocation policy doesn't cause previously
allocated pages to be moved so that they are now placed correctly
according to the new policy.  Over time, they will get replaced and the
system will slowly migrate to a state where most page cache pages are
on the correct nodes for the new policy.

Efficiencies in setting and getting the page cache policy from user
space are also achieved through this approach.  The system call
entry points "sys_set_mempolicy", "sys_get_mempolicy" and "sys_mbind"
have been enhanced to support specifying whether the policy that is
being operated on is:

(1)  The process-level policy or the default system level policy.
(2)  The page allocation policy or the page cache allocation policy.

This is done using higher order bits in the mode (first) argument to
sys_set/get_mempolicy() and the flags word in sys_mbind().  These
bits are defined so that users of the original interface will get
the same results using the old and new implementations of these
routines.
<end new text>

A new worker routine is defined:
	alloc_pages_by_policy(gfp, order, policy)
This routine allocates the requested number of pages using the policy
index specified.

alloc_pages_current() and page_cache_alloc() are then defined in terms
of alloc_pages_by_policy().

<new text>
This patch is in two parts.  The first part is the page cache policy
patch itself (we dropped the previous first patch).  The second
patch is a patch to slightly modify the implementation of policy
MPOL_INTERLEAVE to remove a bias toward allocating on node 0.

Further specific details of these patches are in the patch files,
which follow this email.
<end new text>


Caveats
-------

(1)  page_cache_alloc_local() is defined, but is not currently called.
This was added in SGI ProPack to make sure that mmap'd() files were
allocated locally rather than round-robin'd (i. e. to override the
round robin allocation in that case.)  This was an SGI MPT requirement.
It may be this is not needed with the current mempolicy code if we can
associate the default mempolicy with mmap()'d files for those MPT users.

(2)  alloc_pages_current() is now an inline, but there is no easy way
to do that totally correctly with the current include file order (that I
could figure out at least...)  The problem is that alloc_pages_current()
wants to use the define constant POLICY_PAGE, but that is not defined yet.
We know it is zero, so we just use zero.  A comment in mempolicy.h
suggests not to change the value of this constant to something other
than zero, and references the file gfp.h.

(3) <new>  The code compiles and boots but has not been extensively
tested.  The latter will wait for a NUMA API library that supports
the new functionality.    My next goal is to get those modifications
done so we can do some serious testing.

(4)  I've not thought much about locking issues related to changing a
mempolicy whilst the system is actually running.   However, now that
the mempolicies themselves are stateless (as per Andi Kleen's original
design) it may be that these issues are not as significant.

(5)  It seems there may be a potential conflict between the page cache
mempolicy and a mmap mempolicy (do those exist?).  Here's the concern:
If you mmap() a file, and any pages of that file are in the page cache,
then the location of those pages will (have been) dictated by the page
cache mempolicy, which could differ (will likely differ) from the mmap
mempolicy.  It seems that the only solution to this is to migrate those
pages (when they are touched) after the mmap().

(6)  Testing of this particular patch has been minimal since I don't 
yet have a compatible NUMA API.   I'm working on that next.

Comments, flames, etc to the undersigned.

Best Regards,

Ray

Ray Bryant <raybry@sgi.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>


* [PATCH 1/2] mm: page cache mempolicy for page cache allocation
  2004-09-23  4:32 [PATCH 0/2] mm: memory policy for page cache allocation Ray Bryant
@ 2004-09-23  4:32 ` Ray Bryant
  2004-09-23  9:24   ` Andi Kleen
  2004-09-23  4:32 ` [PATCH 2/2] mm: eliminate node 0 bias in MPOL_INTERLEAVE Ray Bryant
  2004-09-23  9:09 ` [PATCH 0/2] mm: memory policy for page cache allocation Andi Kleen
  2 siblings, 1 reply; 10+ messages in thread
From: Ray Bryant @ 2004-09-23  4:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: William Lee Irwin III, Andrew Morton, linux-mm, Jesse Barnes,
	Dan Higgins, Dave Hansen, lse-tech, Brent Casavant, Ray Bryant,
	Martin J. Bligh, linux-kernel, Ray Bryant, Paul Jackson,
	Nick Piggin

This is version 2 of the page cache memory policy patch.

Changes from the previous version:

(1)  This patch no longer requires MPOL_ROUNDROBIN, so that patch
     has been deleted from this series.

(2)  This patch provides a mechanism for setting and getting
     not only the process's policies for allocating pages and
     page cache (if any), but also for getting and setting the
     system-wide default policies for these allocations.  (Admin
     capability is required to set the default policies.)
     Specification of which policy to set and whether it is 
     the page allocation policy or the page cache allocation
     policy is done in the upper bits of the first argument to
     sys_set_mempolicy() and in the flags argument of
     sys_get_mempolicy().  These values are defined so that
     existing users will not see a change.

     See sys_set_mempolicy(), sys_get_mempolicy() and
     include/linux/mempolicy.h for further details.

     It is expected that the default policies will be set during
     boot processing of startup scripts and will not be changed
     thereafter (without quiescing the system and/or flushing the
     page cache).

(3)  This patch uses the existing infrastructure from
     the previous version of alloc_pages_current() to do the
     round robin allocation of page cache pages across nodes
     if the page cache allocation policy is MPOL_INTERLEAVE.
     That is, this patch uses current->il_next and
     interleave_node() to decide what node to allocate the
     current page on. 

     This means that regular pages and page cache pages are
     allocated using the same "rotator" if both policies are
     MPOL_INTERLEAVE and avoids having to pass an offset,
     a dev_t, and an inode into page_cache_alloc().

Signed-off-by: Ray Bryant <raybry@sgi.com>

Index: linux-2.6.9-rc2-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.9-rc2-mm1.orig/include/linux/gfp.h	2004-09-16 12:54:27.000000000 -0700
+++ linux-2.6.9-rc2-mm1/include/linux/gfp.h	2004-09-22 08:48:44.000000000 -0700
@@ -92,7 +92,22 @@ static inline struct page *alloc_pages_n
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *alloc_pages_current(unsigned gfp_mask, unsigned order);
+extern struct page *alloc_pages_by_policy(unsigned gfp, unsigned order, 
+	unsigned policy);
+
+static inline
+struct page *alloc_pages_current(unsigned gfp, unsigned order)
+{
+	/* 
+	 * include order keeps us from including mempolicy.h here
+	 * the following should be:
+	 *    return alloc_pages_by_policy(gfp, order, POLICY_PAGE);
+	 * but POLICY_PAGE is not defined yet.
+	 * We assume here that POLICY_PAGE is defined to be 0
+	 * See include/linux/mempolicy.h.
+	 */
+	return alloc_pages_by_policy(gfp, order, 0);
+}
 
 static inline struct page *
 alloc_pages(unsigned int gfp_mask, unsigned int order)
Index: linux-2.6.9-rc2-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.9-rc2-mm1.orig/include/linux/mempolicy.h	2004-09-16 10:41:23.000000000 -0700
+++ linux-2.6.9-rc2-mm1/include/linux/mempolicy.h	2004-09-22 08:48:44.000000000 -0700
@@ -16,6 +16,29 @@
 
 #define MPOL_MAX MPOL_INTERLEAVE
 
+/* 
+ * Policy indices
+ * These specify the index into either the task->mempolicy array or the
+ * default_policy array to indicate which policy is to be used for a
+ * particular allocation.
+ */
+#define NR_MEM_POLICIES 	2
+/* policy to use for page allocation and the default kernel policy */
+/* this value is hard coded into alloc_pages() in gfp.h do not change it */
+#define POLICY_PAGE		0
+/* policy to use for pagecache allocation */
+#define POLICY_PAGECACHE 	1
+
+/* policy selection bits are passed from user shifted left by this amount */
+#define REQUEST_POLICY_SHIFT	16
+#define REQUEST_POLICY_PAGE     (POLICY_PAGE << REQUEST_POLICY_SHIFT)
+#define REQUEST_POLICY_PAGECACHE (POLICY_PAGECACHE << REQUEST_POLICY_SHIFT)
+#define REQUEST_POLICY_MASK     ((0x3FFF) << REQUEST_POLICY_SHIFT)
+#define REQUEST_MODE_MASK       (0xFFFF)
+/* by default, user requests are for the process policy -- this flag 
+ * informs sys_set_mempolicy() that this request is for the default policy */
+#define REQUEST_POLICY_DEFAULT  ((0x8000) << REQUEST_POLICY_SHIFT)
+
 /* Flags for get_mem_policy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
 #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
@@ -31,6 +54,8 @@
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <asm/semaphore.h>
+#include <linux/sched.h>
+#include <asm/current.h>
 
 struct vm_area_struct;
 
@@ -68,6 +93,9 @@ struct mempolicy {
 	} v;
 };
 
+extern struct page *
+alloc_pages_by_policy(unsigned gfp, unsigned order, unsigned int policy);
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
Index: linux-2.6.9-rc2-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.9-rc2-mm1.orig/include/linux/pagemap.h	2004-09-16 12:54:19.000000000 -0700
+++ linux-2.6.9-rc2-mm1/include/linux/pagemap.h	2004-09-22 08:48:45.000000000 -0700
@@ -50,6 +50,7 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+#ifndef CONFIG_NUMA
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return alloc_pages(mapping_gfp_mask(x), 0);
@@ -59,6 +60,30 @@ static inline struct page *page_cache_al
 {
 	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
 }
+#define page_cache_alloc_local(x) page_cache_alloc(x)
+#else /* CONFIG_NUMA */
+
+struct mempolicy;
+extern struct mempolicy *default_policy[];
+extern struct page *
+alloc_pages_by_policy(unsigned gfp, unsigned order, unsigned policy);
+
+static inline struct page *page_cache_alloc_local(struct address_space *x)
+{
+	return alloc_pages(mapping_gfp_mask(x), 0);
+}
+
+static inline struct page *page_cache_alloc(struct address_space *x)
+{
+	return alloc_pages_by_policy(mapping_gfp_mask(x), 0, POLICY_PAGECACHE);
+}
+
+static inline struct page *page_cache_alloc_cold(struct address_space *x)
+{
+	return alloc_pages_by_policy(mapping_gfp_mask(x)|__GFP_COLD, 0, 
+		POLICY_PAGECACHE);
+}
+#endif
 
 typedef int filler_t(void *, struct page *);
 
Index: linux-2.6.9-rc2-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc2-mm1.orig/include/linux/sched.h	2004-09-16 12:54:41.000000000 -0700
+++ linux-2.6.9-rc2-mm1/include/linux/sched.h	2004-09-22 08:48:45.000000000 -0700
@@ -31,6 +31,8 @@
 #include <linux/pid.h>
 #include <linux/percpu.h>
 
+#include <linux/mempolicy.h>
+
 struct exec_domain;
 
 /*
@@ -588,7 +590,6 @@ int set_current_groups(struct group_info
 
 
 struct audit_context;		/* See audit.c */
-struct mempolicy;
 
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
@@ -743,7 +744,7 @@ struct task_struct {
  */
 	wait_queue_t *io_wait;
 #ifdef CONFIG_NUMA
-  	struct mempolicy *mempolicy;
+  	struct mempolicy *mempolicy[NR_MEM_POLICIES];
   	short il_next;		/* could be shared with used_math */
 #endif
 #ifdef CONFIG_CPUSETS
Index: linux-2.6.9-rc2-mm1/kernel/exit.c
===================================================================
--- linux-2.6.9-rc2-mm1.orig/kernel/exit.c	2004-09-16 12:54:32.000000000 -0700
+++ linux-2.6.9-rc2-mm1/kernel/exit.c	2004-09-22 08:48:45.000000000 -0700
@@ -785,6 +785,7 @@ static void exit_notify(struct task_stru
 asmlinkage NORET_TYPE void do_exit(long code)
 {
 	struct task_struct *tsk = current;
+	int i;
 
 	profile_task_exit(tsk);
 
@@ -830,8 +831,10 @@ asmlinkage NORET_TYPE void do_exit(long 
 	tsk->exit_code = code;
 	exit_notify(tsk);
 #ifdef CONFIG_NUMA
-	mpol_free(tsk->mempolicy);
-	tsk->mempolicy = NULL;
+	for(i=0;i<NR_MEM_POLICIES;i++) {
+		mpol_free(tsk->mempolicy[i]);
+		tsk->mempolicy[i] = NULL;
+	}
 #endif
 	schedule();
 	BUG();
Index: linux-2.6.9-rc2-mm1/kernel/fork.c
===================================================================
--- linux-2.6.9-rc2-mm1.orig/kernel/fork.c	2004-09-22 08:08:18.000000000 -0700
+++ linux-2.6.9-rc2-mm1/kernel/fork.c	2004-09-22 08:48:45.000000000 -0700
@@ -776,7 +776,7 @@ static task_t *copy_process(unsigned lon
 				 int __user *child_tidptr,
 				 int pid)
 {
-	int retval;
+	int retval, i;
 	struct task_struct *p = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
@@ -865,12 +865,14 @@ static task_t *copy_process(unsigned lon
 	p->io_wait = NULL;
 	p->audit_context = NULL;
 #ifdef CONFIG_NUMA
- 	p->mempolicy = mpol_copy(p->mempolicy);
- 	if (IS_ERR(p->mempolicy)) {
- 		retval = PTR_ERR(p->mempolicy);
- 		p->mempolicy = NULL;
- 		goto bad_fork_cleanup;
- 	}
+	for(i=0;i<NR_MEM_POLICIES;i++) {
+		p->mempolicy[i] = mpol_copy(p->mempolicy[i]);
+		if (IS_ERR(p->mempolicy[i])) {
+			retval = PTR_ERR(p->mempolicy[i]);
+			p->mempolicy[i] = NULL;
+			goto bad_fork_cleanup;
+		}
+	}
 #endif
 
 	p->tgid = p->pid;
@@ -1038,7 +1040,8 @@ bad_fork_cleanup_security:
 	security_task_free(p);
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
-	mpol_free(p->mempolicy);
+	for(i=0;i<NR_MEM_POLICIES;i++)
+		mpol_free(p->mempolicy[i]);
 #endif
 bad_fork_cleanup:
 	if (p->binfmt)
Index: linux-2.6.9-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.9-rc2-mm1.orig/mm/mempolicy.c	2004-09-16 12:54:20.000000000 -0700
+++ linux-2.6.9-rc2-mm1/mm/mempolicy.c	2004-09-22 11:46:20.000000000 -0700
@@ -87,11 +87,27 @@ static kmem_cache_t *sn_cache;
    policied. */
 static int policy_zone;
 
-static struct mempolicy default_policy = {
+/*
+ * the default policies for page allocation, page cache allocation
+ */
+static struct mempolicy default_kernel_mempolicy = {
 	.refcnt = ATOMIC_INIT(1), /* never free it */
 	.policy = MPOL_DEFAULT,
 };
 
+struct mempolicy default_pagecache_mempolicy = {
+	.refcnt  = ATOMIC_INIT(1), /* never free it */
+	.policy  = MPOL_DEFAULT,
+};
+
+/*
+ * references to the default policies are via indexes into this array
+ */
+struct mempolicy *default_policy[NR_MEM_POLICIES] = {
+		&default_kernel_mempolicy, 
+		&default_pagecache_mempolicy,
+};
+
 /* Check if all specified nodes are online */
 static int nodes_online(unsigned long *nodes)
 {
@@ -389,23 +405,34 @@ asmlinkage long sys_mbind(unsigned long 
 }
 
 /* Set the process memory policy */
-asmlinkage long sys_set_mempolicy(int mode, unsigned long __user *nmask,
+asmlinkage long sys_set_mempolicy(int request, unsigned long __user *nmask,
 				   unsigned long maxnode)
 {
-	int err;
+	int err, mode, policy, request_policy_default;
 	struct mempolicy *new;
 	DECLARE_BITMAP(nodes, MAX_NUMNODES);
 
-	if (mode > MPOL_MAX)
+	mode = request & REQUEST_MODE_MASK;
+	policy = (request & REQUEST_POLICY_MASK) >> REQUEST_POLICY_SHIFT;
+	request_policy_default= request & REQUEST_POLICY_DEFAULT;
+
+	if ((mode > MPOL_MAX) || (policy >= NR_MEM_POLICIES))
 		return -EINVAL;
+	if (request_policy_default && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
 	err = get_nodes(nodes, nmask, maxnode, mode);
 	if (err)
 		return err;
 	new = mpol_new(mode, nodes);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
-	mpol_free(current->mempolicy);
-	current->mempolicy = new;
+	if (request_policy_default) {
+		mpol_free(default_policy[policy]);
+		default_policy[policy] = new;
+	} else {
+		mpol_free(current->mempolicy[policy]);
+		current->mempolicy[policy] = new;
+	}
 	if (new && new->policy == MPOL_INTERLEAVE)
 		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
 	return 0;
@@ -477,12 +504,29 @@ asmlinkage long sys_get_mempolicy(int __
 	int err, pval;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = NULL;
+	int policy_type, request_policy_default;
 
 	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
 		return -EINVAL;
 	if (nmask != NULL && maxnode < numnodes)
 		return -EINVAL;
+
+	policy_type = (flags & REQUEST_POLICY_MASK) >> REQUEST_POLICY_SHIFT;
+	request_policy_default = (flags & REQUEST_POLICY_DEFAULT);
+	if (policy_type >= NR_MEM_POLICIES)
+		return -EINVAL;
+	if (request_policy_default) {
+		pol = default_policy[policy_type];
+		goto copy_policy_to_user;
+	}
+	if (policy_type>0) {
+		pol = current->mempolicy[policy_type];
+		if (!pol)
+			pol = default_policy[policy_type];
+		goto copy_policy_to_user;
+	}
+
 	if (flags & MPOL_F_ADDR) {
 		down_read(&mm->mmap_sem);
 		vma = find_vma_intersection(mm, addr, addr+1);
@@ -498,7 +542,7 @@ asmlinkage long sys_get_mempolicy(int __
 		return -EINVAL;
 
 	if (!pol)
-		pol = &default_policy;
+		pol = default_policy[policy_type];
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
@@ -506,7 +550,7 @@ asmlinkage long sys_get_mempolicy(int __
 			if (err < 0)
 				goto out;
 			pval = err;
-		} else if (pol == current->mempolicy &&
+		} else if (pol == current->mempolicy[policy_type] &&
 				pol->policy == MPOL_INTERLEAVE) {
 			pval = current->il_next;
 		} else {
@@ -520,6 +564,7 @@ asmlinkage long sys_get_mempolicy(int __
 	if (policy && put_user(pval, policy))
 		goto out;
 
+copy_policy_to_user:
 	err = 0;
 	if (nmask) {
 		DECLARE_BITMAP(nodes, MAX_NUMNODES);
@@ -538,7 +583,8 @@ asmlinkage long sys_get_mempolicy(int __
 asmlinkage long compat_get_mempolicy(int __user *policy,
 				     compat_ulong_t __user *nmask,
 				     compat_ulong_t maxnode,
-				     compat_ulong_t addr, compat_ulong_t flags)
+				     compat_ulong_t addr, compat_ulong_t flags,
+				     compat_uint_t policy_index)
 {
 	long err;
 	unsigned long __user *nm = NULL;
@@ -616,7 +662,7 @@ asmlinkage long compat_mbind(compat_ulon
 static struct mempolicy *
 get_vma_policy(struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = current->mempolicy[POLICY_PAGE];
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy)
@@ -626,7 +672,7 @@ get_vma_policy(struct vm_area_struct *vm
 			pol = vma->vm_policy;
 	}
 	if (!pol)
-		pol = &default_policy;
+		pol = default_policy[POLICY_PAGE];
 	return pol;
 }
 
@@ -758,7 +804,7 @@ alloc_page_vma(unsigned gfp, struct vm_a
 }
 
 /**
- * 	alloc_pages_current - Allocate pages.
+ * 	alloc_pages_by_policy - Allocate pages using a given mempolicy
  *
  *	@gfp:
  *		%GFP_USER   user allocation,
@@ -767,24 +813,31 @@ alloc_page_vma(unsigned gfp, struct vm_a
  *      	%GFP_FS     don't call back into a file system.
  *      	%GFP_ATOMIC don't sleep.
  *	@order: Power of two of allocation size in pages. 0 is a single page.
+ *	@policy:Index of the mempolicy struct to use for this allocation
  *
  *	Allocate a page from the kernel page pool.  When not in
  *	interrupt context and apply the current process NUMA policy.
  *	Returns NULL when no page can be allocated.
  */
-struct page *alloc_pages_current(unsigned gfp, unsigned order)
+struct page *
+alloc_pages_by_policy(unsigned gfp, unsigned order, unsigned policy)
 {
-	struct mempolicy *pol = current->mempolicy;
-
+ 	struct mempolicy *pol;
+  
+ 	if (policy >= NR_MEM_POLICIES)
+ 		BUG();
+ 	pol = current->mempolicy[policy];
+ 	if (!pol)
+ 		pol = default_policy[policy];
 	if (!in_interrupt())
 		cpuset_update_current_mems_allowed();
 	if (!pol || in_interrupt())
-		pol = &default_policy;
+		pol = default_policy[policy];
 	if (pol->policy == MPOL_INTERLEAVE)
 		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
 }
-EXPORT_SYMBOL(alloc_pages_current);
+EXPORT_SYMBOL(alloc_pages_by_policy);
 
 /* Slow path of a mempolicy copy */
 struct mempolicy *__mpol_copy(struct mempolicy *old)
@@ -1093,8 +1146,8 @@ void __init numa_policy_init(void)
 	/* Set interleaving policy for system init. This way not all
 	   the data structures allocated at system boot end up in node zero. */
 
-	if (sys_set_mempolicy(MPOL_INTERLEAVE, nodes_addr(node_online_map),
-							MAX_NUMNODES) < 0)
+	if (sys_set_mempolicy(REQUEST_POLICY_PAGE | MPOL_INTERLEAVE, 
+		nodes_addr(node_online_map), MAX_NUMNODES) < 0)
 		printk("numa_policy_init: interleaving failed\n");
 }
 
@@ -1102,5 +1155,5 @@ void __init numa_policy_init(void)
  * Assumes fs == KERNEL_DS */
 void numa_default_policy(void)
 {
-	sys_set_mempolicy(MPOL_DEFAULT, NULL, 0);
+	sys_set_mempolicy(REQUEST_POLICY_PAGE | MPOL_DEFAULT, NULL, 0);
 }


* [PATCH 2/2] mm: eliminate node 0 bias in MPOL_INTERLEAVE
  2004-09-23  4:32 [PATCH 0/2] mm: memory policy for page cache allocation Ray Bryant
  2004-09-23  4:32 ` [PATCH 1/2] mm: page cache mempolicy " Ray Bryant
@ 2004-09-23  4:32 ` Ray Bryant
  2004-09-23  9:29   ` Andi Kleen
  2004-09-23  9:09 ` [PATCH 0/2] mm: memory policy for page cache allocation Andi Kleen
  2 siblings, 1 reply; 10+ messages in thread
From: Ray Bryant @ 2004-09-23  4:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: William Lee Irwin III, Andrew Morton, linux-mm, Jesse Barnes,
	Dan Higgins, lse-tech, Brent Casavant, Ray Bryant,
	Martin J. Bligh, linux-kernel, Nick Piggin, Ray Bryant,
	Paul Jackson, Dave Hansen

This is a new patch in this series (it does not in any way replace the
MPOL_ROUNDROBIN patch, which has been dropped).

This patch fixes the following problems with MPOL_INTERLEAVE:  In
the existing implementation, every time a new process is created and it
is using MPOL_INTERLEAVE, the interleave "rotator" (current->il_next)
is set to zero.  This biases storage allocation toward lower numbered
nodes (this effect is more apparent on systems with hundreds of nodes.)
This patch fixes this problem by setting il_next to pid % MAX_NUMNODES.

Similarly, in the existing implementation of MPOL_INTERLEAVE, each time
a new policy of type MPOL_INTERLEAVE is created, current->il_next is set
to the lowest numbered node that is in the policy mask policy->v.nodes.
This biases storage allocation toward the lowest numbered node in that
mask.  This is again fixed by setting il_next to pid % MAX_NUMNODES.

Each of these cases potentially breaks the (assumed) invariant of
interleave_nodes(), that is that "bit il_next of the nodemask is set"
(because the value of il_next on entry to interleave_nodes() is returned
as the node to be used for the allocation, and we calculate the next
il_next before returning.)

Solving this requires adding the small bit of code in interleave_nodes()
that checks the invariant and if it is not true, updates the return
value to be the next bit in the nodemask that is set.

Signed-off-by: Ray Bryant <raybry@sgi.com>

Index: linux-2.6.9-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.9-rc2-mm1.orig/mm/mempolicy.c	2004-09-21 16:49:00.000000000 -0700
+++ linux-2.6.9-rc2-mm1/mm/mempolicy.c	2004-09-21 17:44:58.000000000 -0700
@@ -435,7 +435,7 @@ asmlinkage long sys_set_mempolicy(int re
 		default_policy[policy] = new;
 	}
 	if (new && new->policy == MPOL_INTERLEAVE)
-		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
+		current->il_next = current->pid % MAX_NUMNODES;
 	return 0;
 }
 
@@ -714,6 +714,11 @@ static unsigned interleave_nodes(struct 
 
 	nid = me->il_next;
 	BUG_ON(nid >= MAX_NUMNODES);
+	if (!test_bit(nid, policy->v.nodes)) {
+		nid = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
+		if (nid >= MAX_NUMNODES)
+			nid = find_first_bit(policy->v.nodes, MAX_NUMNODES);
+	}
 	next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
 	if (next >= MAX_NUMNODES)
 		next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
Index: linux-2.6.9-rc2-mm1/kernel/fork.c
===================================================================
--- linux-2.6.9-rc2-mm1.orig/kernel/fork.c	2004-09-21 16:24:49.000000000 -0700
+++ linux-2.6.9-rc2-mm1/kernel/fork.c	2004-09-21 17:41:12.000000000 -0700
@@ -873,6 +873,8 @@ static task_t *copy_process(unsigned lon
 			goto bad_fork_cleanup;
 		}
 	}
+	/* randomize placement of first page across nodes */
+	p->il_next = p->pid % MAX_NUMNODES;
 #endif
 
 	p->tgid = p->pid;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>


* Re: [PATCH 0/2] mm: memory policy for page cache allocation
  2004-09-23  4:32 [PATCH 0/2] mm: memory policy for page cache allocation Ray Bryant
  2004-09-23  4:32 ` [PATCH 1/2] mm: page cache mempolicy " Ray Bryant
  2004-09-23  4:32 ` [PATCH 2/2] mm: eliminate node 0 bias in MPOL_INTERLEAVE Ray Bryant
@ 2004-09-23  9:09 ` Andi Kleen
  2 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2004-09-23  9:09 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Andi Kleen, William Lee Irwin III, linux-mm, Jesse Barnes,
	Dan Higgins, lse-tech, Brent Casavant, Nick Piggin,
	Martin J. Bligh, linux-kernel, Ray Bryant, Andrew Morton,
	Paul Jackson, Dave Hansen

> (1)  We dropped the MPOL_ROUNDROBIN patch.  Instead, we
>      use MPOL_INTERLEAVE to spread pages across nodes.
>      However, rather than use the file offset etc to 
>      calculate the node to allocate the page on, I used
>      the same mechanism you used in alloc_pages_current()
>      to calculate the node number (interleave_node()).
>      That eliminates the need to generate an offset etc
>      in the routines that call page_cache_alloc() and to
>      me appears to be a simpler change that still fits
>      within your design.


Hmm, that may lead to uneven balancing because the counter is 
per thread. But if it works for you it's ok I guess.

I still think changing the callers and using the offset for
static interleaving would be better. Maybe that could be
done as a follow-on patch. 
> 
> (2)  I implemented the sys_set_mempolicy() changes as
>      suggested -- higher order bits in the mode (first)
>      argument specify whether or not this request is for
>      the page allocation policy (your existing policy)
>      or for the page cache allocation policy.  Similarly,
>      a bit there indicates whether or not we want to set
>      the process level policy or the system level policy.
> 
>      These bits are to be set in the flags argument of
>      sys_mbind().

Ok.  If that gets in I would suggest you also document it 
in the manpages and send me a patch. 

Comments to the patches in other mail.

-Andi



* Re: [PATCH 1/2] mm: page cache mempolicy for page cache allocation
  2004-09-23  4:32 ` [PATCH 1/2] mm: page cache mempolicy " Ray Bryant
@ 2004-09-23  9:24   ` Andi Kleen
  2004-09-24  4:12     ` Ray Bryant
  0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2004-09-23  9:24 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Andi Kleen, William Lee Irwin III, Andrew Morton, linux-mm,
	Jesse Barnes, Dan Higgins, Dave Hansen, lse-tech, Brent Casavant,
	Martin J. Bligh, linux-kernel, Ray Bryant, Paul Jackson,
	Nick Piggin

> +/* policy selection bits are passed from user shifted left by this amount */
> +#define REQUEST_POLICY_SHIFT	16
> +#define REQUEST_POLICY_PAGE     POLICY_PAGE << REQUEST_POLICY_SHIFT
> +#define REQUEST_POLICY_PAGECACHE POLICY_PAGECACHE << REQUEST_POLICY_SHIFT
> +#define REQUEST_POLICY_MASK     (0x3FFF) << REQUEST_POLICY_SHIFT

Please put brackets around the macros. Putting them around numbers
is not needed though :)


> +#define REQUEST_POLICY_DEFAULT  (0x8000) << REQUEST_POLICY_SHIFT
> +
>  /* Flags for get_mem_policy */
>  #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
>  #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
> @@ -31,6 +54,8 @@
>  #include <linux/slab.h>
>  #include <linux/rbtree.h>
>  #include <asm/semaphore.h>
> +#include <linux/sched.h>
> +#include <asm/current.h>

Why is that needed? I don't see any users for this.  Please avoid this 
if possible, we already have too much include dependency spaghetti.


> --- linux-2.6.9-rc2-mm1.orig/include/linux/sched.h	2004-09-16 12:54:41.000000000 -0700
> +++ linux-2.6.9-rc2-mm1/include/linux/sched.h	2004-09-22 08:48:45.000000000 -0700
> @@ -31,6 +31,8 @@
>  #include <linux/pid.h>
>  #include <linux/percpu.h>
>  
> +#include <linux/mempolicy.h>

I also don't see why this should be needed. Please remove.

> +	for(i=0;i<NR_MEM_POLICIES;i++)

There should be more spaces here (similar in other loops) 


>  	int err, pval;
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = NULL;
> -	struct mempolicy *pol = current->mempolicy;
> +	struct mempolicy *pol = NULL;
> +	int policy_type, request_policy_default;
>  
>  	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>  		return -EINVAL;
>  	if (nmask != NULL && maxnode < numnodes)
>  		return -EINVAL;
> +
> +	policy_type = (flags & REQUEST_POLICY_MASK) >> REQUEST_POLICY_SHIFT;
> +	request_policy_default = (flags & REQUEST_POLICY_DEFAULT);

Why is that not an MPOL_F_* ? 

>  /* Slow path of a mempolicy copy */
>  struct mempolicy *__mpol_copy(struct mempolicy *old)
> @@ -1093,8 +1146,8 @@ void __init numa_policy_init(void)
>  	/* Set interleaving policy for system init. This way not all
>  	   the data structures allocated at system boot end up in node zero. */
>  
> -	if (sys_set_mempolicy(MPOL_INTERLEAVE, nodes_addr(node_online_map),
> -							MAX_NUMNODES) < 0)
> +	if (sys_set_mempolicy(REQUEST_POLICY_PAGE | MPOL_INTERLEAVE, 
> +		nodes_addr(node_online_map), MAX_NUMNODES) < 0)

That's definitely wrong, the boot time interleaving is not for the page
cache but for all allocations. There are not even page cache allocations
that early.

Overall when I look at all the complications you add for the per process
page policy which doesn't even have a demonstrated need I'm not sure
it is really worth it.

>  		printk("numa_policy_init: interleaving failed\n");
>  }
>  
> @@ -1102,5 +1155,5 @@ void __init numa_policy_init(void)
>   * Assumes fs == KERNEL_DS */
>  void numa_default_policy(void)
>  {
> -	sys_set_mempolicy(MPOL_DEFAULT, NULL, 0);
> +	sys_set_mempolicy(REQUEST_POLICY_PAGE | MPOL_DEFAULT, NULL, 0);

Same.

-Andi

* Re: [PATCH 2/2] mm: eliminate node 0 bias in MPOL_INTERLEAVE
  2004-09-23  4:32 ` [PATCH 2/2] mm: eliminate node 0 bias in MPOL_INTERLEAVE Ray Bryant
@ 2004-09-23  9:29   ` Andi Kleen
  2004-09-24  6:33     ` Ray Bryant
  2004-09-24  6:43     ` Ray Bryant
  0 siblings, 2 replies; 10+ messages in thread
From: Andi Kleen @ 2004-09-23  9:29 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Andi Kleen, William Lee Irwin III, Andrew Morton, linux-mm,
	Jesse Barnes, Dan Higgins, lse-tech, Brent Casavant,
	Martin J. Bligh, linux-kernel, Nick Piggin, Ray Bryant,
	Paul Jackson, Dave Hansen

On Wed, Sep 22, 2004 at 11:32:45PM -0500, Ray Bryant wrote:
> Each of these cases potentially breaks the (assumed) invariant of

I would prefer to keep the invariant.

> +++ linux-2.6.9-rc2-mm1/mm/mempolicy.c	2004-09-21 17:44:58.000000000 -0700
> @@ -435,7 +435,7 @@ asmlinkage long sys_set_mempolicy(int re
>  		default_policy[policy] = new;
>  	}
>  	if (new && new->policy == MPOL_INTERLEAVE)
> -		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
> +		current->il_next = current->pid % MAX_NUMNODES;

Please do the find_next/find_first bit here in the slow path. 

Another useful change may be to check if il_next points to a node
that is in the current interleaving mask. If yes don't change it.
This way skew when interleaving policy is set often could be avoided.

>  	return 0;
>  }
>  
> @@ -714,6 +714,11 @@ static unsigned interleave_nodes(struct 
>  
>  	nid = me->il_next;
>  	BUG_ON(nid >= MAX_NUMNODES);
> +	if (!test_bit(nid, policy->v.nodes)) {
> +		nid = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
> +		if (nid >= MAX_NUMNODES)
> +			nid = find_first_bit(policy->v.nodes, MAX_NUMNODES);
> +	}

And remove it here.

>  	next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
>  	if (next >= MAX_NUMNODES)
>  		next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
> Index: linux-2.6.9-rc2-mm1/kernel/fork.c
> ===================================================================
> --- linux-2.6.9-rc2-mm1.orig/kernel/fork.c	2004-09-21 16:24:49.000000000 -0700
> +++ linux-2.6.9-rc2-mm1/kernel/fork.c	2004-09-21 17:41:12.000000000 -0700
> @@ -873,6 +873,8 @@ static task_t *copy_process(unsigned lon
>  			goto bad_fork_cleanup;
>  		}
>  	}
> +	/* randomize placement of first page across nodes */
> +	p->il_next = p->pid % MAX_NUMNODES;

Same here.

-Andi

* Re: [PATCH 1/2] mm: page cache mempolicy for page cache allocation
  2004-09-23  9:24   ` Andi Kleen
@ 2004-09-24  4:12     ` Ray Bryant
  0 siblings, 0 replies; 10+ messages in thread
From: Ray Bryant @ 2004-09-24  4:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, William Lee Irwin III, Andrew Morton, linux-mm,
	Jesse Barnes, Dan Higgins, Dave Hansen, lse-tech, Brent Casavant,
	Martin J. Bligh, linux-kernel, Paul Jackson, Nick Piggin

Andi Kleen wrote:

> 
> Overall when I look at all the complications you add for the per process
> page policy which doesn't even have a demonstrated need I'm not sure
> it is really worth it.
>


Polling people inside SGI, they seem to think that a per file memory policy
is a good thing, but it needs to be settable from outside the process without
changing the header or code of the process (think of an ISV application that
we want to run on Altix).  I can't quite get my head around what that means:
do you specify this externally based on the order in which files are opened
(e.g. file 1 has this policy, file 2 has that policy, etc.), or do you
specify it by type of file (text, mapped file, etc.)?  Does this end up
being effectively a per process policy with a per file override?  (e.g. all
files for this process are managed with policy "this", except for the 5th
file opened [or whatever], which has policy "that".)

Steve -- how does your MTA design handle this?

Anyway, I'm about to throw in the towel on the per process page cache memory
policy.  I can't make a strong enough argument for it.

I assume that is acceptable, Andi?  :-)
-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------


* Re: [PATCH 2/2] mm: eliminate node 0 bias in MPOL_INTERLEAVE
  2004-09-23  9:29   ` Andi Kleen
@ 2004-09-24  6:33     ` Ray Bryant
  2004-09-24  6:43     ` Ray Bryant
  1 sibling, 0 replies; 10+ messages in thread
From: Ray Bryant @ 2004-09-24  6:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, William Lee Irwin III, Andrew Morton, linux-mm,
	Jesse Barnes, Dan Higgins, lse-tech, Brent Casavant,
	Martin J. Bligh, linux-kernel, Nick Piggin, Paul Jackson,
	Dave Hansen

Andi Kleen wrote:
> On Wed, Sep 22, 2004 at 11:32:45PM -0500, Ray Bryant wrote:
> 
>>Each of these cases potentially breaks the (assumed) invariant of
> 
> 
> I would prefer to keep the invariant.
> 

I understand, but read on...

> 
>>+++ linux-2.6.9-rc2-mm1/mm/mempolicy.c	2004-09-21 17:44:58.000000000 -0700
>>@@ -435,7 +435,7 @@ asmlinkage long sys_set_mempolicy(int re
>> 		default_policy[policy] = new;
>> 	}
>> 	if (new && new->policy == MPOL_INTERLEAVE)
>>-		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
>>+		current->il_next = current->pid % MAX_NUMNODES;
> 
> 
> Please do the find_next/find_first bit here in the slow path. 
> 
> Another useful change may be to check if il_next points to a node
> that is in the current interleaving mask. If yes don't change it.
> This way skew when interleaving policy is set often could be avoided.
> 
> 
>> 	return 0;
>> }
>> 
>>@@ -714,6 +714,11 @@ static unsigned interleave_nodes(struct 
>> 
>> 	nid = me->il_next;
>> 	BUG_ON(nid >= MAX_NUMNODES);
>>+	if (!test_bit(nid, policy->v.nodes)) {
>>+		nid = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
>>+		if (nid >= MAX_NUMNODES)
>>+			nid = find_first_bit(policy->v.nodes, MAX_NUMNODES);
>>+	}
> 
> 
> And remove it here.
>

Regardless of whether we remove this or not, we have a potential problem, 
I think.  The reason is that there is a single il_next for all policies.  So 
we get into trouble if both the current process's page allocation policy and
its page cache allocation policy are MPOL_INTERLEAVE, but the node masks for
the two policies are significantly different. To be specific, suppose 
there are 64 nodes, the page allocation policy selects nodes 0-53, and
the page cache allocation policy selects nodes 54-63.  Further suppose that
allocation requests alternate: page, page cache, page, page cache, etc.

Then if il_next starts out at zero, here are the nodes that will be selected:
(I'm assuming here that the code I inserted above is not present.)

request a page, get 0 and using the page allocation mask, next is set to 1

request page cache, get 1 and using the page cache allocation mask, next is 
set to 54

request a page, get 54 and using the page allocation mask, next is set to 0

request page cache, get 0  and using the page cache allocation mask, next is 
set to 54

request a page, get 54 and using the page allocation mask, next is set to 0
etc...

This is not good.  Generally speaking, all of the pages are allocated from the 
1st page cache node and all of the page cache pages are allocated from the 1st 
page allocation node.

I guess I am back to passing an offset etc in via page cache alloc.  Or we
have to have a second il_next for the page cache policy, and that is more
cruft than we are willing to live with, I expect.

I'll look at Steve's patch and see how he handles this.

> 
>> 	next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
>> 	if (next >= MAX_NUMNODES)
>> 		next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
>>Index: linux-2.6.9-rc2-mm1/kernel/fork.c
>>===================================================================
>>--- linux-2.6.9-rc2-mm1.orig/kernel/fork.c	2004-09-21 16:24:49.000000000 -0700
>>+++ linux-2.6.9-rc2-mm1/kernel/fork.c	2004-09-21 17:41:12.000000000 -0700
>>@@ -873,6 +873,8 @@ static task_t *copy_process(unsigned lon
>> 			goto bad_fork_cleanup;
>> 		}
>> 	}
>>+	/* randomize placement of first page across nodes */
>>+	p->il_next = p->pid % MAX_NUMNODES;
> 
> 
> Same here.
> 
> -Andi
> 


-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------


* Re: [PATCH 2/2] mm: eliminate node 0 bias in MPOL_INTERLEAVE
  2004-09-23  9:29   ` Andi Kleen
  2004-09-24  6:33     ` Ray Bryant
@ 2004-09-24  6:43     ` Ray Bryant
  1 sibling, 0 replies; 10+ messages in thread
From: Ray Bryant @ 2004-09-24  6:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, William Lee Irwin III, Andrew Morton, linux-mm,
	Jesse Barnes, Dan Higgins, lse-tech, Brent Casavant,
	Martin J. Bligh, linux-kernel, Nick Piggin, Paul Jackson,
	Dave Hansen

(Resending to remove annoying long lines....)

Andi Kleen wrote:
> On Wed, Sep 22, 2004 at 11:32:45PM -0500, Ray Bryant wrote:
> 
>>Each of these cases potentially breaks the (assumed) invariant of
> 
> 
> I would prefer to keep the invariant.
> 

I understand, but read on.

> 
>>+++ linux-2.6.9-rc2-mm1/mm/mempolicy.c	2004-09-21 17:44:58.000000000 -0700
>>@@ -435,7 +435,7 @@ asmlinkage long sys_set_mempolicy(int re
>> 		default_policy[policy] = new;
>> 	}
>> 	if (new && new->policy == MPOL_INTERLEAVE)
>>-		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
>>+		current->il_next = current->pid % MAX_NUMNODES;
> 
> 
> Please do the find_next/find_first bit here in the slow path. 
> 
> Another useful change may be to check if il_next points to a node
> that is in the current interleaving mask. If yes don't change it.
> This way skew when interleaving policy is set often could be avoided.
> 
> 
>> 	return 0;
>> }
>> 
>>@@ -714,6 +714,11 @@ static unsigned interleave_nodes(struct 
>> 
>> 	nid = me->il_next;
>> 	BUG_ON(nid >= MAX_NUMNODES);
>>+	if (!test_bit(nid, policy->v.nodes)) {
>>+		nid = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
>>+		if (nid >= MAX_NUMNODES)
>>+			nid = find_first_bit(policy->v.nodes, MAX_NUMNODES);
>>+	}
> 
> 
> And remove it here.
> 
> 
Regardless of whether we remove this or not, we have a potential problem,
I think.  The reason is that there is a single il_next for all policies.  So
we get into trouble if both the current process's page allocation policy and
its page cache allocation policy are MPOL_INTERLEAVE, but the node masks for
the two policies are significantly different. To be specific, suppose
there are 64 nodes, the page allocation policy selects nodes 0-53, and
the page cache allocation policy selects nodes 54-63.  Further suppose that
allocation requests alternate: page, page cache, page, page cache, etc.

Then if il_next starts out at zero, here are the nodes that will be selected:
(I'm assuming here that the code I inserted above is not present.)

request a page, get 0 and using the page allocation mask, next is set to 1

request page cache, get 1 and using the page cache allocation mask, next is 
set to 54

request a page, get 54 and using the page allocation mask, next is set to 0

request page cache, get 0  and using the page cache allocation mask, next is 
set to 54

request a page, get 54 and using the page allocation mask, next is set to 0
etc...

This is not good.  Generally speaking, all of the pages are allocated from the
1st page cache node and all of the page cache pages are allocated from the 1st
page allocation node.

I guess I am back to passing an offset etc in via page cache alloc.  Or we
have to have a second il_next for the page cache policy, and that is more
cruft than we are willing to live with, I expect.

I'll look at Steve's patch and see how he handles this.

>> 	next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
>> 	if (next >= MAX_NUMNODES)
>> 		next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
>>Index: linux-2.6.9-rc2-mm1/kernel/fork.c
>>===================================================================
>>--- linux-2.6.9-rc2-mm1.orig/kernel/fork.c	2004-09-21 16:24:49.000000000 -0700
>>+++ linux-2.6.9-rc2-mm1/kernel/fork.c	2004-09-21 17:41:12.000000000 -0700
>>@@ -873,6 +873,8 @@ static task_t *copy_process(unsigned lon
>> 			goto bad_fork_cleanup;
>> 		}
>> 	}
>>+	/* randomize placement of first page across nodes */
>>+	p->il_next = p->pid % MAX_NUMNODES;
> 
> 
> Same here.
> 
> -Andi
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------


* [PATCH 1/2] mm: page cache mempolicy for page cache allocation
@ 2004-09-23 18:56 Ray Bryant
  0 siblings, 0 replies; 10+ messages in thread
From: Ray Bryant @ 2004-09-23 18:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, William Lee Irwin III, Andrew Morton, linux-mm,
	Ray Bryant, Jesse Barnes, Dan Higgins, Dave Hansen, lse-tech,
	Brent Casavant, Martin J. Bligh, linux-kernel, Ray Bryant,
	Paul Jackson, Nick Piggin

Andi Kleen <ak@suse.de> wrote:

>Overall when I look at all the complications you add for the per process
>page policy which doesn't even have a demonstrated need I'm not sure
>it is really worth it.

Andi,

I guess the only argument I can make is that if I special case the page
cache allocation policy to not have a per process component, I'm either
going to have to create a separate set of stuff to get/set/use it, or I am
going to have to gunk up the existing code with logic like the following:

struct page *
alloc_pages_by_policy(unsigned gfp, unsigned order, unsigned policy)
{
	struct mempolicy *pol = NULL;
 
	if (policy >= NR_MEM_POLICIES)
		BUG();
+	if (policy == 0)
+ 		pol = current->mempolicy;
	if (!pol)
		pol = default_policy[policy];
	. . .

All in all, >>I<< think it is a wash either way, but given that I
can't point at an application that uses this requirement, I can't make
a strong argument.  I would observe again that a file server process
on a big HPC machine would likely want to have a different page cache
allocation policy than the HPC applications, but you could get the same
effect by creating a single node cpuset to hold the file server process.

(If we do find such an application, it is going to result in an API
change, assuming we don't support a per process page cache replacement
policy at the present time.)

(Also, what are we going to do if some OTHER policy class comes along
that does have a justifiable need for a per process override? To keep
all of this straight is going to be a mess.)

Just for comparison, I did a patch that removes the per process page
cache policy and annotated it with the changes.  (This patch is on
top of the previous 2 patches I sent.)  This patch can be found below.
Removing support for the per process page cache policy results in a net
change of one line (total) less code; it results in 8 changed lines, most
of which are such things as removing the subscript on current->mempolicy.

Given the above, if you still prefer no per process page cache allocation
policy, I'll merge this patch into the page cache policy patch and send
it out.

(I'm also asking around to see if I can find a suitable justification
for this general per process mempolicy stuff.)

I'll hold off sending out a new version of the patch that includes your
other suggestions until I hear back on this.

======================================================================
Index: linux-2.6.9-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.9-rc2-mm1.orig/mm/mempolicy.c	2004-09-23 10:52:34.000000000 -0700
+++ linux-2.6.9-rc2-mm1/mm/mempolicy.c	2004-09-23 11:03:33.000000000 -0700
@@ -418,6 +417,8 @@ asmlinkage long sys_set_mempolicy(int re
 
 	if ((mode > MPOL_MAX) || (policy >= NR_MEM_POLICIES))
 		return -EINVAL;
+	if (!request_policy_default && (policy > 0))            /* process add 2 */
+		return -EINVAL;
 	if (request_policy_default && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	err = get_nodes(nodes, nmask, maxnode, mode);
@@ -430,8 +431,8 @@ asmlinkage long sys_set_mempolicy(int re
 		mpol_free(default_policy[policy]);
 		default_policy[policy] = new;
 	} else {
-		mpol_free(current->mempolicy[policy]);
-		current->mempolicy[policy] = new;
+		mpol_free(current->mempolicy);			/* process change 2 */
+		current->mempolicy = new;
 	}
 	if (new && new->policy == MPOL_INTERLEAVE)
 		current->il_next = current->pid % MAX_NUMNODES;
@@ -521,9 +522,7 @@ asmlinkage long sys_get_mempolicy(int __
 		goto copy_policy_to_user;
 	}
 	if (policy_type>0) {
-		pol = current->mempolicy[policy_type];
-		if (!pol)
-			pol = default_policy[policy_type];
+		pol = default_policy[policy_type];		/* process del 2 */
 		goto copy_policy_to_user;
 	}
 
@@ -550,7 +549,7 @@ asmlinkage long sys_get_mempolicy(int __
 			if (err < 0)
 				goto out;
 			pval = err;
-		} else if (pol == current->mempolicy[policy_type] &&
+		} else if (pol == current->mempolicy &&		/* process change 1 */
 				pol->policy == MPOL_INTERLEAVE) {
 			pval = current->il_next;
 		} else {
@@ -662,7 +661,7 @@ asmlinkage long compat_mbind(compat_ulon
 static struct mempolicy *
 get_vma_policy(struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = current->mempolicy[POLICY_PAGE];
+	struct mempolicy *pol = current->mempolicy;		/* process change 1 */
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy)
@@ -831,7 +830,8 @@ alloc_pages_by_policy(unsigned gfp, unsi
   
  	if (policy >= NR_MEM_POLICIES)
  		BUG();
- 	pol = current->mempolicy[policy];
+ 	if (policy == POLICY_PAGE)				/* process add 1 */
+ 		pol = current->mempolicy;                       /* process change 1 */
  	if (!pol)
  		pol = default_policy[policy];
 	if (!in_interrupt())
Index: linux-2.6.9-rc2-mm1/kernel/fork.c
===================================================================
--- linux-2.6.9-rc2-mm1.orig/kernel/fork.c	2004-09-23 10:52:34.000000000 -0700
+++ linux-2.6.9-rc2-mm1/kernel/fork.c	2004-09-23 10:58:07.000000000 -0700
@@ -865,14 +865,12 @@ static task_t *copy_process(unsigned lon
 	p->io_wait = NULL;
 	p->audit_context = NULL;
 #ifdef CONFIG_NUMA
-	for(i=0;i<NR_MEM_POLICIES;i++) {
-		p->mempolicy[i] = mpol_copy(p->mempolicy[i]);
-		if (IS_ERR(p->mempolicy[i])) {
-			retval = PTR_ERR(p->mempolicy[i]);
-			p->mempolicy[i] = NULL;
-			goto bad_fork_cleanup;
-		}
-	}
+ 	p->mempolicy = mpol_copy(p->mempolicy);		/* process del 2 */
+ 	if (IS_ERR(p->mempolicy)) {                     /* process change 3 */
+ 		retval = PTR_ERR(p->mempolicy);
+ 		p->mempolicy = NULL;
+ 		goto bad_fork_cleanup;
+ 	}
 	/* randomize placement of first page across nodes */
 	p->il_next = p->pid % MAX_NUMNODES;
 #endif
@@ -1042,8 +1040,7 @@ bad_fork_cleanup_security:
 	security_task_free(p);
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
-	for(i=0;i<NR_MEM_POLICIES;i++)
-		mpol_free(p->mempolicy[i]);
+	mpol_free(p->mempolicy);		/* process del 1 */
 #endif
 bad_fork_cleanup:
 	if (p->binfmt)
Index: linux-2.6.9-rc2-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc2-mm1.orig/include/linux/sched.h	2004-09-23 10:45:48.000000000 -0700
+++ linux-2.6.9-rc2-mm1/include/linux/sched.h	2004-09-23 11:04:33.000000000 -0700
@@ -30,7 +30,7 @@
 #include <linux/completion.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
-#include <linux/mempolicy.h>
+// #include <linux/mempolicy.h>		/* process del 1 */
 
 struct exec_domain;
 
@@ -743,7 +743,7 @@ struct task_struct {
  */
 	wait_queue_t *io_wait;
 #ifdef CONFIG_NUMA
-  	struct mempolicy *mempolicy[NR_MEM_POLICIES];
+  	struct mempolicy *mempolicy;		/* process change 1 */
   	short il_next;		/* could be shared with used_math */
 #endif
 #ifdef CONFIG_CPUSETS
Index: linux-2.6.9-rc2-mm1/kernel/exit.c
===================================================================
--- linux-2.6.9-rc2-mm1.orig/kernel/exit.c	2004-09-23 10:42:02.000000000 -0700
+++ linux-2.6.9-rc2-mm1/kernel/exit.c	2004-09-23 10:57:42.000000000 -0700
@@ -831,10 +831,8 @@ asmlinkage NORET_TYPE void do_exit(long 
 	tsk->exit_code = code;
 	exit_notify(tsk);
 #ifdef CONFIG_NUMA
-	for (i=0; i<NR_MEM_POLICIES; i++) {
-		mpol_free(tsk->mempolicy[i]);
-		tsk->mempolicy[i] = NULL;
-	}
+	mpol_free(tsk->mempolicy);		/* process del 2 */
+	tsk->mempolicy = NULL;			/* process change 2 */
 #endif
 	schedule();
 	BUG();
