[PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
@ 2004-09-20 19:00 Ray Bryant
  2004-09-20 19:00 ` [PATCH 2.6.9-rc2-mm1 1/2] " Ray Bryant
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Ray Bryant @ 2004-09-20 19:00 UTC (permalink / raw)
  To: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
	Andi Kleen, Ray Bryant
  Cc: linux-mm, Jesse Barnes, Dan Higgins, lse-tech, Brent Casavant,
	Nick Piggin, linux-kernel, Ray Bryant, Paul Jackson, Dave Hansen

This is the first working release of this patch.  It was previously
proposed as an RFC (see
  http://marc.theaimsgroup.com/?l=linux-mm&m=109416852113561&w=2
).

Background
----------

Last month, Jesse Barnes proposed a patch to do round robin
allocation of page cache pages on NUMA machines.  This got shot down
for a number of reasons (see
  http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
and the related thread), but it seemed to me that one of the most
significant issues was that this was a workload dependent optimization.
That is, for an Altix running an HPC workload, it was a good thing,
but for web servers or file servers it was not such a good idea.

So the idea of this patch is the following:  it creates a new memory
policy structure (default_pagecache_policy) that is used to control
how storage for page cache pages is allocated.  So, for a large Altix
running HPC workloads, we can specify a policy that does round robin
allocations, and for other workloads you can specify the default policy
(which results in page cache pages being allocated locally).

The default_pagecache_policy is overrideable on a per process basis, so
that if your application prefers to allocate page cache pages locally,
it can.

This is all done by making default_policy and current->mempolicy an array
of size 2 and of type "struct mempolicy *".   Entry POLICY_PAGE in these
arrays is the old default_policy and process memory policy, respectively.
Entry POLICY_PAGECACHE in these arrays contains the system default and
per process page cache allocation policies, respectively.

A new worker routine is defined:
	alloc_pages_by_policy(gfp, order, policy)
This routine allocates the requested number of pages using the policy
index specified.

alloc_pages_current() and page_cache_alloc() are then defined in terms
of alloc_pages_by_policy().

This patch is in two parts.  The first part is Brent Casavant's patch for
MPOL_ROUNDROBIN.  We need this because there is no handy offset to use
when you get a call to allocate a page cache page in "page_cache_alloc()",
so MPOL_INTERLEAVE doesn't do what we need.

The second part of the patch is the set of changes to create the
default_pagecache_policy and see that it is used in page_cache_alloc()
as well as the changes to supporting setting a policy given a policy
index.

Caveats
-------

(1)  Right now, there is no mechanism to set any of the memory policies
from user space.  The NUMA API library will have to be modified to match
the new format of the sys_set/get_mempolicy() system calls (these calls
have an additional integer argument that specifies which policy to set.)
This is work that I will start on once we get agreement with this patch.

(It also appears to me that there is no mechanism to set the default
policies, but perhaps its there and I am just missing it.)

(I tested this stuff by hard compiling policis into my test kernel.)

(2)  page_cache_alloc_local() is defined, but is not currently called.
This was added in SGI ProPack to make sure that mmap'd() files were
allocated locally rather than round-robin'd (i. e. to override the
round robin allocation in that case.)  This was an SGI MPT requirement.
It may be this is not needed with the current mempolicy code if we can
associate the default mempolicy with mmap()'d files for those MPT users.

(3)  alloc_pages_current() is now an inline, but there is no easy way
to do that totally correclty with the current include file order (that I
could figure out at least...)  The problem is that alloc_pages_current()
wants to use the define constant POLICY_PAGE, but that is defined yet.
We know it is zero, so we just use zero.  A comment in mempolicy.h
suggests not to change the value of this constant to something other
than zero, and references the file gfp.h.

(4)  I've not thought a bit about locking issues related to changing a
mempolicy whilst the system is actually running. 

(5)  It seems there may be a potential conflict between the page cache
mempolicy and a mmap mempolicy (do those exist?).  Here's the concern:
If you mmap() a file, and any pages of that file are in the page cache,
then the location of those pages will (have been) dictated by the page
cache mempolicy, which could differ (will likely differ) from the mmap
mempolicy.  It seems that the only solution to this is to migrate those
pages (when they are touched) after the mmap().

Comments, flames, etc to the undersigned.

Best Regards,

Ray

PS:  Both patches are relative to 2.6.9-rc2-mm1.  However, since that
kernel doesn't boot on Altix for me at the moment, the testing was done
using 2.6.9-rc1-mm3.

PPS: This is not a final patch, but lets keep the lawyers happy anyway:

Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Ray Bryant <raybry@sgi.com>

===========================================================================
Index: linux-2.6.9-rc1-mm2-kdb/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc1-mm2-kdb.orig/include/linux/sched.h	2004-08-31 13:32:20.000000000 -0700
+++ linux-2.6.9-rc1-mm2-kdb/include/linux/sched.h	2004-09-02 13:17:45.000000000 -0700
@@ -596,6 +596,7 @@
 #ifdef CONFIG_NUMA
   	struct mempolicy *mempolicy;
   	short il_next;		/* could be shared with used_math */
+	short rr_next;
 #endif
 #ifdef CONFIG_CPUSETS
 	struct cpuset *cpuset;
===================================================================
Index: linux-2.6.9-rc1-mm2-kdb/mm/mempolicy.c
===================================================================
--- linux-2.6.9-rc1-mm2-kdb.orig/mm/mempolicy.c	2004-08-31 13:32:20.000000000 -0700
+++ linux-2.6.9-rc1-mm2-kdb/mm/mempolicy.c	2004-09-02 13:17:45.000000000 -0700
@@ -7,10 +7,17 @@
  * NUMA policy allows the user to give hints in which node(s) memory should
  * be allocated.
  *
- * Support four policies per VMA and per process:
+ * Support five policies per VMA and per process:
  *
  * The VMA policy has priority over the process policy for a page fault.
  *
+ * roundrobin     Allocate memory round-robined over a set of nodes,
+ *                with normal fallback if it fails.  The round-robin is
+ *                based on a per-thread rotor both to provide predictability
+ *                of allocation locations and to avoid cacheline contention
+ *                compared to a global rotor.  This policy is distinct from
+ *                interleave in that it seeks to distribute allocations evenly
+ *                across nodes, whereas interleave seeks to maximize bandwidth.
  * interleave     Allocate memory interleaved over a set of nodes,
  *                with normal fallback if it fails.
  *                For VMA based allocations this interleaves based on the
@@ -117,6 +124,7 @@
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		/* Preferred will only use the first bit, but allow
 		   more for now. */
 		if (empty)
@@ -215,6 +223,7 @@
 	atomic_set(&policy->refcnt, 1);
 	switch (mode) {
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		bitmap_copy(policy->v.nodes, nodes, MAX_NUMNODES);
 		break;
 	case MPOL_PREFERRED:
@@ -406,6 +415,8 @@
 	current->mempolicy = new;
 	if (new && new->policy == MPOL_INTERLEAVE)
 		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
+	if (new && new->policy == MPOL_ROUNDROBIN)
+		current->rr_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
 	return 0;
 }

@@ -423,6 +434,7 @@
 	case MPOL_DEFAULT:
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		bitmap_copy(nodes, p->v.nodes, MAX_NUMNODES);
 		break;
 	case MPOL_PREFERRED:
@@ -507,6 +519,9 @@
 		} else if (pol == current->mempolicy &&
 				pol->policy == MPOL_INTERLEAVE) {
 			pval = current->il_next;
+		} else if (pol == current->mempolicy &&
+				pol->policy == MPOL_ROUNDROBIN) {
+			pval = current->rr_next;
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -585,6 +600,7 @@
 				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
+	case MPOL_ROUNDROBIN: /* should not happen */
 	case MPOL_DEFAULT:
 		nd = numa_node_id();
 		break;
@@ -595,6 +611,21 @@
 	return NODE_DATA(nd)->node_zonelists + (gfp & GFP_ZONEMASK);
 }

+/* Do dynamic round-robin for a process */
+static unsigned roundrobin_nodes(struct mempolicy *policy)
+{
+	unsigned nid, next;
+	struct task_struct *me = current;
+
+	nid = me->rr_next;
+	BUG_ON(nid >= MAX_NUMNODES);
+	next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
+	if (next >= MAX_NUMNODES)
+		next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
+	me->rr_next = next;
+	return nid;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
@@ -646,6 +677,27 @@
 	return page;
 }

+/* Allocate a page in round-robin policy.
+   Own path because first fallback needs to round-robin. */
+static struct page *alloc_page_roundrobin(unsigned gfp, unsigned order, struct mempolicy* policy)
+{
+	struct zonelist *zl;
+	struct page *page;
+	unsigned nid;
+	int i, numnodes = bitmap_weight(policy->v.nodes, MAX_NUMNODES);
+
+	for (i = 0; i < numnodes; i++) {
+		nid = roundrobin_nodes(policy);
+		BUG_ON(!test_bit(nid, (const volatile void *) &node_online_map));
+		zl = NODE_DATA(nid)->node_zonelists + (gfp & GFP_ZONEMASK);
+		page = __alloc_pages(gfp, order, zl);
+		if (page)
+			return page;
+	}
+
+	return NULL;
+}
+
 /**
  * 	alloc_page_vma	- Allocate a page for a VMA.
  *
@@ -671,26 +723,30 @@
 struct page *
 alloc_page_vma(unsigned gfp, struct vm_area_struct *vma, unsigned long addr)
 {
+	unsigned nid;
 	struct mempolicy *pol = get_vma_policy(vma, addr);

 	cpuset_update_current_mems_allowed();

-	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
-		unsigned nid;
-		if (vma) {
-			unsigned long off;
-			BUG_ON(addr >= vma->vm_end);
-			BUG_ON(addr < vma->vm_start);
-			off = vma->vm_pgoff;
-			off += (addr - vma->vm_start) >> PAGE_SHIFT;
-			nid = offset_il_node(pol, vma, off);
-		} else {
-			/* fall back to process interleaving */
-			nid = interleave_nodes(pol);
-		}
-		return alloc_page_interleave(gfp, 0, nid);
+	switch (pol->policy) {
+		case MPOL_INTERLEAVE:
+			if (vma) {
+				unsigned long off;
+				BUG_ON(addr >= vma->vm_end);
+				BUG_ON(addr < vma->vm_start);
+				off = vma->vm_pgoff;
+				off += (addr - vma->vm_start) >> PAGE_SHIFT;
+				nid = offset_il_node(pol, vma, off);
+			} else {
+				/* fall back to process interleaving */
+				nid = interleave_nodes(pol);
+			}
+			return alloc_page_interleave(gfp, 0, nid);
+		case MPOL_ROUNDROBIN:
+			return alloc_page_roundrobin(gfp, 0, pol);
+		default:
+			return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 }

 /**
@@ -716,8 +772,11 @@
 		cpuset_update_current_mems_allowed();
 	if (!pol || in_interrupt())
 		pol = &default_policy;
-	if (pol->policy == MPOL_INTERLEAVE)
+	if (pol->policy == MPOL_INTERLEAVE) {
 		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
+	} else if (pol->policy == MPOL_ROUNDROBIN) {
+		return alloc_page_roundrobin(gfp, order, pol);
+	}
 	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
 }
 EXPORT_SYMBOL(alloc_pages_current);
@@ -754,6 +813,7 @@
 	case MPOL_DEFAULT:
 		return 1;
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		return bitmap_equal(a->v.nodes, b->v.nodes, MAX_NUMNODES);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
@@ -798,6 +858,8 @@
 		return pol->v.zonelist->zones[0]->zone_pgdat->node_id;
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(pol);
+	case MPOL_ROUNDROBIN:
+		return roundrobin_nodes(pol);
 	case MPOL_PREFERRED:
 		return pol->v.preferred_node >= 0 ?
 				pol->v.preferred_node : numa_node_id();
@@ -815,6 +877,7 @@
 	case MPOL_PREFERRED:
 	case MPOL_DEFAULT:
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		return 1;
 	case MPOL_BIND: {
 		struct zone **z;
===================================================================
Index: linux-2.6.9-rc1-mm2-kdb/include/linux/mempolicy.h
===================================================================
--- linux-2.6.9-rc1-mm2-kdb.orig/include/linux/mempolicy.h	2004-08-27 10:06:15.000000000 -0700
+++ linux-2.6.9-rc1-mm2-kdb/include/linux/mempolicy.h	2004-09-02 13:19:38.000000000 -0700
@@ -13,6 +13,7 @@
 #define MPOL_PREFERRED	1
 #define MPOL_BIND	2
 #define MPOL_INTERLEAVE	3
+#define MPOL_ROUNDROBIN 4

 #define MPOL_MAX MPOL_INTERLEAVE

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 2.6.9-rc2-mm1 1/2] mm: memory policy for page cache allocation
  2004-09-20 19:00 [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation Ray Bryant
@ 2004-09-20 19:00 ` Ray Bryant
  2004-09-20 19:00 ` [PATCH 2.6.9-rc2-mm1 2/2] " Ray Bryant
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Ray Bryant @ 2004-09-20 19:00 UTC (permalink / raw)
  To: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
	Andi Kleen, Ray Bryant
  Cc: linux-mm, Jesse Barnes, Dan Higgins, Dave Hansen, lse-tech,
	Brent Casavant, linux-kernel, Ray Bryant, Paul Jackson,
	Nick Piggin

This patch creates MPOL_ROUNDROBIN.  This is like MPOL_INTERLEAVE,
but doesn't require a global offset or index to be specified.

Index: linux-2.6.9-rc1-mm3-kdb-pagecache/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc1-mm3-kdb-pagecache.orig/include/linux/sched.h	2004-09-03 09:45:42.000000000 -0700
+++ linux-2.6.9-rc1-mm3-kdb-pagecache/include/linux/sched.h	2004-09-03 09:47:42.000000000 -0700
@@ -596,6 +596,7 @@
 #ifdef CONFIG_NUMA
   	struct mempolicy *mempolicy;
   	short il_next;		/* could be shared with used_math */
+	short rr_next;
 #endif
 #ifdef CONFIG_CPUSETS
 	struct cpuset *cpuset;
Index: linux-2.6.9-rc1-mm3-kdb-pagecache/mm/mempolicy.c
===================================================================
--- linux-2.6.9-rc1-mm3-kdb-pagecache.orig/mm/mempolicy.c	2004-09-03 09:45:40.000000000 -0700
+++ linux-2.6.9-rc1-mm3-kdb-pagecache/mm/mempolicy.c	2004-09-03 09:47:42.000000000 -0700
@@ -7,10 +7,17 @@
  * NUMA policy allows the user to give hints in which node(s) memory should
  * be allocated.
  *
- * Support four policies per VMA and per process:
+ * Support five policies per VMA and per process:
  *
  * The VMA policy has priority over the process policy for a page fault.
  *
+ * roundrobin     Allocate memory round-robined over a set of nodes,
+ *                with normal fallback if it fails.  The round-robin is
+ *                based on a per-thread rotor both to provide predictability
+ *                of allocation locations and to avoid cacheline contention
+ *                compared to a global rotor.  This policy is distinct from
+ *                interleave in that it seeks to distribute allocations evenly
+ *                across nodes, whereas interleave seeks to maximize bandwidth.
  * interleave     Allocate memory interleaved over a set of nodes,
  *                with normal fallback if it fails.
  *                For VMA based allocations this interleaves based on the
@@ -117,6 +124,7 @@
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		/* Preferred will only use the first bit, but allow
 		   more for now. */
 		if (empty)
@@ -215,6 +223,7 @@
 	atomic_set(&policy->refcnt, 1);
 	switch (mode) {
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		bitmap_copy(policy->v.nodes, nodes, MAX_NUMNODES);
 		break;
 	case MPOL_PREFERRED:
@@ -406,6 +415,8 @@
 	current->mempolicy = new;
 	if (new && new->policy == MPOL_INTERLEAVE)
 		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
+	if (new && new->policy == MPOL_ROUNDROBIN)
+		current->rr_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
 	return 0;
 }
 
@@ -423,6 +434,7 @@
 	case MPOL_DEFAULT:
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		bitmap_copy(nodes, p->v.nodes, MAX_NUMNODES);
 		break;
 	case MPOL_PREFERRED:
@@ -507,6 +519,9 @@
 		} else if (pol == current->mempolicy &&
 				pol->policy == MPOL_INTERLEAVE) {
 			pval = current->il_next;
+		} else if (pol == current->mempolicy &&
+				pol->policy == MPOL_ROUNDROBIN) {
+			pval = current->rr_next;
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -585,6 +600,7 @@
 				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
+	case MPOL_ROUNDROBIN: /* should not happen */
 	case MPOL_DEFAULT:
 		nd = numa_node_id();
 		break;
@@ -595,6 +611,21 @@
 	return NODE_DATA(nd)->node_zonelists + (gfp & GFP_ZONEMASK);
 }
 
+/* Do dynamic round-robin for a process */
+static unsigned roundrobin_nodes(struct mempolicy *policy)
+{
+	unsigned nid, next;
+	struct task_struct *me = current;
+
+	nid = me->rr_next;
+	BUG_ON(nid >= MAX_NUMNODES);
+	next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
+	if (next >= MAX_NUMNODES)
+		next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
+	me->rr_next = next;
+	return nid;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
@@ -646,6 +677,27 @@
 	return page;
 }
 
+/* Allocate a page in round-robin policy.
+   Own path because first fallback needs to round-robin. */
+static struct page *alloc_page_roundrobin(unsigned gfp, unsigned order, struct mempolicy* policy)
+{
+	struct zonelist *zl;
+	struct page *page;
+	unsigned nid;
+	int i, numnodes = bitmap_weight(policy->v.nodes, MAX_NUMNODES);
+
+	for (i = 0; i < numnodes; i++) {
+		nid = roundrobin_nodes(policy);
+		BUG_ON(!test_bit(nid, (const volatile void *) &node_online_map));
+		zl = NODE_DATA(nid)->node_zonelists + (gfp & GFP_ZONEMASK);
+		page = __alloc_pages(gfp, order, zl);
+		if (page)
+			return page;
+	}
+
+	return NULL;
+}
+
 /**
  * 	alloc_page_vma	- Allocate a page for a VMA.
  *
@@ -671,26 +723,30 @@
 struct page *
 alloc_page_vma(unsigned gfp, struct vm_area_struct *vma, unsigned long addr)
 {
+	unsigned nid;
 	struct mempolicy *pol = get_vma_policy(vma, addr);
 
 	cpuset_update_current_mems_allowed();
 
-	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
-		unsigned nid;
-		if (vma) {
-			unsigned long off;
-			BUG_ON(addr >= vma->vm_end);
-			BUG_ON(addr < vma->vm_start);
-			off = vma->vm_pgoff;
-			off += (addr - vma->vm_start) >> PAGE_SHIFT;
-			nid = offset_il_node(pol, vma, off);
-		} else {
-			/* fall back to process interleaving */
-			nid = interleave_nodes(pol);
-		}
-		return alloc_page_interleave(gfp, 0, nid);
+	switch (pol->policy) {
+		case MPOL_INTERLEAVE:
+			if (vma) {
+				unsigned long off;
+				BUG_ON(addr >= vma->vm_end);
+				BUG_ON(addr < vma->vm_start);
+				off = vma->vm_pgoff;
+				off += (addr - vma->vm_start) >> PAGE_SHIFT;
+				nid = offset_il_node(pol, vma, off);
+			} else {
+				/* fall back to process interleaving */
+				nid = interleave_nodes(pol);
+			}
+			return alloc_page_interleave(gfp, 0, nid);
+		case MPOL_ROUNDROBIN:
+			return alloc_page_roundrobin(gfp, 0, pol);
+		default:
+			return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 }
 
 /**
@@ -716,8 +772,11 @@
 		cpuset_update_current_mems_allowed();
 	if (!pol || in_interrupt())
 		pol = &default_policy;
-	if (pol->policy == MPOL_INTERLEAVE)
+	if (pol->policy == MPOL_INTERLEAVE) {
 		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
+	} else if (pol->policy == MPOL_ROUNDROBIN) {
+		return alloc_page_roundrobin(gfp, order, pol);
+	}
 	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
 }
 EXPORT_SYMBOL(alloc_pages_current);
@@ -754,6 +813,7 @@
 	case MPOL_DEFAULT:
 		return 1;
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		return bitmap_equal(a->v.nodes, b->v.nodes, MAX_NUMNODES);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
@@ -798,6 +858,8 @@
 		return pol->v.zonelist->zones[0]->zone_pgdat->node_id;
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(pol);
+	case MPOL_ROUNDROBIN:
+		return roundrobin_nodes(pol);
 	case MPOL_PREFERRED:
 		return pol->v.preferred_node >= 0 ?
 				pol->v.preferred_node : numa_node_id();
@@ -815,6 +877,7 @@
 	case MPOL_PREFERRED:
 	case MPOL_DEFAULT:
 	case MPOL_INTERLEAVE:
+	case MPOL_ROUNDROBIN:
 		return 1;
 	case MPOL_BIND: {
 		struct zone **z;
Index: linux-2.6.9-rc1-mm3-kdb-pagecache/include/linux/mempolicy.h
===================================================================
--- linux-2.6.9-rc1-mm3-kdb-pagecache.orig/include/linux/mempolicy.h	2004-08-27 10:06:15.000000000 -0700
+++ linux-2.6.9-rc1-mm3-kdb-pagecache/include/linux/mempolicy.h	2004-09-16 09:27:08.000000000 -0700
@@ -13,8 +13,9 @@
 #define MPOL_PREFERRED	1
 #define MPOL_BIND	2
 #define MPOL_INTERLEAVE	3
+#define MPOL_ROUNDROBIN 4
 
-#define MPOL_MAX MPOL_INTERLEAVE
+#define MPOL_MAX MPOL_ROUNDROBIN
 
 /* Flags for get_mem_policy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 2.6.9-rc2-mm1 2/2] mm: memory policy for page cache allocation
  2004-09-20 19:00 [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation Ray Bryant
  2004-09-20 19:00 ` [PATCH 2.6.9-rc2-mm1 1/2] " Ray Bryant
@ 2004-09-20 19:00 ` Ray Bryant
  2004-09-20 20:22 ` [PATCH 2.6.9-rc2-mm1 0/2] " Paul Jackson
  2004-09-20 20:55 ` Andi Kleen
  3 siblings, 0 replies; 16+ messages in thread
From: Ray Bryant @ 2004-09-20 19:00 UTC (permalink / raw)
  To: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
	Andi Kleen, Ray Bryant
  Cc: linux-mm, Jesse Barnes, Dan Higgins, Dave Hansen, lse-tech,
	Brent Casavant, linux-kernel, Ray Bryant, Paul Jackson,
	Nick Piggin

This patch creates a separate mempolicy to control page cache allocation.

Index: linux-2.6.9-rc2-mmxx/include/linux/gfp.h
===================================================================
--- linux-2.6.9-rc2-mmxx.orig/include/linux/gfp.h	2004-09-16 20:17:28.000000000 -0700
+++ linux-2.6.9-rc2-mmxx/include/linux/gfp.h	2004-09-20 09:48:56.000000000 -0700
@@ -92,7 +92,22 @@
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *alloc_pages_current(unsigned gfp_mask, unsigned order);
+extern struct page *alloc_pages_by_policy(unsigned gfp, unsigned order, 
+	unsigned policy);
+
+static inline
+struct page *alloc_pages_current(unsigned gfp, unsigned order)
+{
+	/* 
+	 * include order keeps us from including mempolicy.h here
+	 * the following should be:
+	 *    return alloc_pages_by_policy(gfp, order, POLICY_PAGE);
+	 * but POLICY_PAGE is not defined yet.
+	 * We assume here that POLICY_PAGE is defined to be 0
+	 * See include/linux/mempolicy.h.
+	 */
+	return alloc_pages_by_policy(gfp, order, 0);
+}
 
 static inline struct page *
 alloc_pages(unsigned int gfp_mask, unsigned int order)
Index: linux-2.6.9-rc2-mmxx/include/linux/mempolicy.h
===================================================================
--- linux-2.6.9-rc2-mmxx.orig/include/linux/mempolicy.h	2004-09-20 09:21:09.000000000 -0700
+++ linux-2.6.9-rc2-mmxx/include/linux/mempolicy.h	2004-09-20 09:48:56.000000000 -0700
@@ -17,6 +17,19 @@
 
 #define MPOL_MAX MPOL_ROUNDROBIN
 
+/* 
+ * Policy indicies
+ * These specify the index into either the task->mempolicy array or the
+ * default_policy array to indicate which policy is to be used for a
+ * particular allocation.
+ */
+#define NR_MEM_POLICIES 	2
+/* policy to use for page allocation and the default kernel policy */
+/* this value is hard coded into alloc_pages() in gfp.h do not change it */
+#define POLICY_PAGE		0
+/* policy to use for pagecache allocation */
+#define POLICY_PAGECACHE 	1
+
 /* Flags for get_mem_policy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
 #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
@@ -32,6 +45,8 @@
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <asm/semaphore.h>
+#include <linux/sched.h>
+#include <asm/current.h>
 
 struct vm_area_struct;
 
@@ -69,6 +84,9 @@
 	} v;
 };
 
+extern struct page *
+alloc_pages_by_policy(unsigned gfp, unsigned order, unsigned int policy);
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
Index: linux-2.6.9-rc2-mmxx/include/linux/pagemap.h
===================================================================
--- linux-2.6.9-rc2-mmxx.orig/include/linux/pagemap.h	2004-09-16 20:17:28.000000000 -0700
+++ linux-2.6.9-rc2-mmxx/include/linux/pagemap.h	2004-09-20 09:48:56.000000000 -0700
@@ -50,6 +50,7 @@
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+#ifndef CONFIG_NUMA
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return alloc_pages(mapping_gfp_mask(x), 0);
@@ -59,6 +60,30 @@
 {
 	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
 }
+#define page_cache_alloc_local((x)) page_cache_alloc((x))
+#else /* CONFIG_NUMA */
+
+struct mempolicy;
+extern struct mempolicy *default_policy[];
+extern struct page *
+alloc_pages_by_policy(unsigned gfp, unsigned order, unsigned policy);
+
+static inline struct page *page_cache_alloc_local(struct address_space *x)
+{
+	return alloc_pages(mapping_gfp_mask(x), 0);
+}
+
+static inline struct page *page_cache_alloc(struct address_space *x)
+{
+	return alloc_pages_by_policy(mapping_gfp_mask(x), 0, POLICY_PAGECACHE);
+}
+
+static inline struct page *page_cache_alloc_cold(struct address_space *x)
+{
+	return alloc_pages_by_policy(mapping_gfp_mask(x)|__GFP_COLD, 0, 
+		POLICY_PAGECACHE);
+}
+#endif
 
 typedef int filler_t(void *, struct page *);
 
Index: linux-2.6.9-rc2-mmxx/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc2-mmxx.orig/include/linux/sched.h	2004-09-20 09:21:09.000000000 -0700
+++ linux-2.6.9-rc2-mmxx/include/linux/sched.h	2004-09-20 09:48:56.000000000 -0700
@@ -31,6 +31,8 @@
 #include <linux/pid.h>
 #include <linux/percpu.h>
 
+#include <linux/mempolicy.h>
+
 struct exec_domain;
 
 /*
@@ -588,7 +590,6 @@
 
 
 struct audit_context;		/* See audit.c */
-struct mempolicy;
 
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
@@ -743,7 +744,7 @@
  */
 	wait_queue_t *io_wait;
 #ifdef CONFIG_NUMA
-  	struct mempolicy *mempolicy;
+  	struct mempolicy *mempolicy[NR_MEM_POLICIES];
   	short il_next;		/* could be shared with used_math */
 	short rr_next;
 #endif
Index: linux-2.6.9-rc2-mmxx/kernel/exit.c
===================================================================
--- linux-2.6.9-rc2-mmxx.orig/kernel/exit.c	2004-09-16 20:17:29.000000000 -0700
+++ linux-2.6.9-rc2-mmxx/kernel/exit.c	2004-09-20 09:48:56.000000000 -0700
@@ -785,6 +785,7 @@
 asmlinkage NORET_TYPE void do_exit(long code)
 {
 	struct task_struct *tsk = current;
+	int i;
 
 	profile_task_exit(tsk);
 
@@ -830,8 +831,10 @@
 	tsk->exit_code = code;
 	exit_notify(tsk);
 #ifdef CONFIG_NUMA
-	mpol_free(tsk->mempolicy);
-	tsk->mempolicy = NULL;
+	for(i=0;i<NR_MEM_POLICIES;i++) {
+		mpol_free(tsk->mempolicy[i]);
+		tsk->mempolicy[i] = NULL;
+	}
 #endif
 	schedule();
 	BUG();
Index: linux-2.6.9-rc2-mmxx/kernel/fork.c
===================================================================
--- linux-2.6.9-rc2-mmxx.orig/kernel/fork.c	2004-09-16 20:17:29.000000000 -0700
+++ linux-2.6.9-rc2-mmxx/kernel/fork.c	2004-09-20 09:48:56.000000000 -0700
@@ -776,7 +776,7 @@
 				 int __user *child_tidptr,
 				 int pid)
 {
-	int retval;
+	int retval, i;
 	struct task_struct *p = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
@@ -865,12 +865,14 @@
 	p->io_wait = NULL;
 	p->audit_context = NULL;
 #ifdef CONFIG_NUMA
- 	p->mempolicy = mpol_copy(p->mempolicy);
- 	if (IS_ERR(p->mempolicy)) {
- 		retval = PTR_ERR(p->mempolicy);
- 		p->mempolicy = NULL;
- 		goto bad_fork_cleanup;
- 	}
+	for(i=0;i<NR_MEM_POLICIES;i++) {
+		p->mempolicy[i] = mpol_copy(p->mempolicy[i]);
+		if (IS_ERR(p->mempolicy[i])) {
+			retval = PTR_ERR(p->mempolicy[i]);
+			p->mempolicy[i] = NULL;
+			goto bad_fork_cleanup;
+		}
+	}
 #endif
 
 	p->tgid = p->pid;
@@ -1038,7 +1040,8 @@
 	security_task_free(p);
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
-	mpol_free(p->mempolicy);
+	for(i=0;i<NR_MEM_POLICIES;i++)
+		mpol_free(p->mempolicy[i]);
 #endif
 bad_fork_cleanup:
 	if (p->binfmt)
Index: linux-2.6.9-rc2-mmxx/mm/mempolicy.c
===================================================================
--- linux-2.6.9-rc2-mmxx.orig/mm/mempolicy.c	2004-09-20 09:21:09.000000000 -0700
+++ linux-2.6.9-rc2-mmxx/mm/mempolicy.c	2004-09-20 10:21:38.000000000 -0700
@@ -94,11 +94,27 @@
    policied. */
 static int policy_zone;
 
-static struct mempolicy default_policy = {
+/*
+ * the default policies for page allocation, page cache allocation
+ */
+static struct mempolicy default_kernel_mempolicy = {
 	.refcnt = ATOMIC_INIT(1), /* never free it */
 	.policy = MPOL_DEFAULT,
 };
 
+struct mempolicy default_pagecache_mempolicy = {
+	.refcnt  = ATOMIC_INIT(1), /* never free it */
+	.policy  = MPOL_DEFAULT,
+};
+
+/*
+ * references to the default policies are via indexes into this array
+ */
+struct mempolicy *default_policy[NR_MEM_POLICIES] = {
+		&default_kernel_mempolicy, 
+		&default_pagecache_mempolicy,
+};
+
 /* Check if all specified nodes are online */
 static int nodes_online(unsigned long *nodes)
 {
@@ -399,13 +415,13 @@
 
 /* Set the process memory policy */
 asmlinkage long sys_set_mempolicy(int mode, unsigned long __user *nmask,
-				   unsigned long maxnode)
+				   unsigned long maxnode, unsigned int policy)
 {
 	int err;
 	struct mempolicy *new;
 	DECLARE_BITMAP(nodes, MAX_NUMNODES);
 
-	if (mode > MPOL_MAX)
+	if ((mode > MPOL_MAX) || (policy >= NR_MEM_POLICIES))
 		return -EINVAL;
 	err = get_nodes(nodes, nmask, maxnode, mode);
 	if (err)
@@ -413,8 +429,8 @@
 	new = mpol_new(mode, nodes);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
-	mpol_free(current->mempolicy);
-	current->mempolicy = new;
+	mpol_free(current->mempolicy[policy]);
+	current->mempolicy[policy] = new;
 	if (new && new->policy == MPOL_INTERLEAVE)
 		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
 	if (new && new->policy == MPOL_ROUNDROBIN)
@@ -484,12 +500,13 @@
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				  unsigned long __user *nmask,
 				  unsigned long maxnode,
-				  unsigned long addr, unsigned long flags)
+				  unsigned long addr, unsigned long flags,
+				  int policy_type)
 {
 	int err, pval;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = current->mempolicy[policy_type];
 
 	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
 		return -EINVAL;
@@ -510,7 +527,7 @@
 		return -EINVAL;
 
 	if (!pol)
-		pol = &default_policy;
+		pol = default_policy[policy_type];
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
@@ -518,10 +535,10 @@
 			if (err < 0)
 				goto out;
 			pval = err;
-		} else if (pol == current->mempolicy &&
+		} else if (pol == current->mempolicy[policy_type] &&
 				pol->policy == MPOL_INTERLEAVE) {
 			pval = current->il_next;
-		} else if (pol == current->mempolicy &&
+		} else if (pol == current->mempolicy[policy_type] &&
 				pol->policy == MPOL_ROUNDROBIN) {
 			pval = current->rr_next;
 		} else {
@@ -553,7 +570,8 @@
 asmlinkage long compat_get_mempolicy(int __user *policy,
 				     compat_ulong_t __user *nmask,
 				     compat_ulong_t maxnode,
-				     compat_ulong_t addr, compat_ulong_t flags)
+				     compat_ulong_t addr, compat_ulong_t flags,
+				     compat_uint_t policy_index)
 {
 	long err;
 	unsigned long __user *nm = NULL;
@@ -566,7 +584,8 @@
 	if (nmask)
 		nm = compat_alloc_user_space(alloc_size);
 
-	err = sys_get_mempolicy(policy, nm, nr_bits+1, addr, flags);
+	err = sys_get_mempolicy(policy, nm, nr_bits+1, addr, flags, 
+		policy_index);
 
 	if (!err && nmask) {
 		err = copy_from_user(bm, nm, alloc_size);
@@ -579,7 +598,8 @@
 }
 
 asmlinkage long compat_set_mempolicy(int mode, compat_ulong_t __user *nmask,
-				     compat_ulong_t maxnode)
+				     compat_ulong_t maxnode,
+				     compat_uint_t policy_index)
 {
 	long err = 0;
 	unsigned long __user *nm = NULL;
@@ -598,7 +618,7 @@
 	if (err)
 		return -EFAULT;
 
-	return sys_set_mempolicy(mode, nm, nr_bits+1);
+	return sys_set_mempolicy(mode, nm, nr_bits+1, policy_index);
 }
 
 asmlinkage long compat_mbind(compat_ulong_t start, compat_ulong_t len,
@@ -631,7 +651,7 @@
 static struct mempolicy *
 get_vma_policy(struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = current->mempolicy[POLICY_PAGE];
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy)
@@ -641,7 +661,7 @@
 			pol = vma->vm_policy;
 	}
 	if (!pol)
-		pol = &default_policy;
+		pol = default_policy[POLICY_PAGE];
 	return pol;
 }
 
@@ -814,7 +834,7 @@
 }
 
 /**
- * 	alloc_pages_current - Allocate pages.
+ * 	alloc_pages_by_policy - Allocate pages using a given mempolicy
  *
  *	@gfp:
  *		%GFP_USER   user allocation,
@@ -823,19 +843,26 @@
  *      	%GFP_FS     don't call back into a file system.
  *      	%GFP_ATOMIC don't sleep.
  *	@order: Power of two of allocation size in pages. 0 is a single page.
+ *	@policy:Index of the mempolicy struct to use for this allocation
  *
  *	Allocate a page from the kernel page pool.  When not in
  *	interrupt context and apply the current process NUMA policy.
  *	Returns NULL when no page can be allocated.
  */
-struct page *alloc_pages_current(unsigned gfp, unsigned order)
+struct page *
+alloc_pages_by_policy(unsigned gfp, unsigned order, unsigned policy)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol;
 
+	if (policy >= NR_MEM_POLICIES)
+		BUG();
+	pol = current->mempolicy[policy];
+	if (!pol)
+		pol = default_policy[policy];
 	if (!in_interrupt())
 		cpuset_update_current_mems_allowed();
 	if (!pol || in_interrupt())
-		pol = &default_policy;
+		pol = default_policy[POLICY_PAGE];
 	if (pol->policy == MPOL_INTERLEAVE) {
 		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	} else if (pol->policy == MPOL_ROUNDROBIN) {
@@ -843,7 +870,7 @@
 	}
 	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
 }
-EXPORT_SYMBOL(alloc_pages_current);
+EXPORT_SYMBOL(alloc_pages_by_policy);
 
 /* Slow path of a mempolicy copy */
 struct mempolicy *__mpol_copy(struct mempolicy *old)
@@ -1157,7 +1184,7 @@
 	   the data structures allocated at system boot end up in node zero. */
 
 	if (sys_set_mempolicy(MPOL_INTERLEAVE, nodes_addr(node_online_map),
-							MAX_NUMNODES) < 0)
+			      MAX_NUMNODES, POLICY_PAGE) < 0)
 		printk("numa_policy_init: interleaving failed\n");
 }
 
@@ -1165,5 +1192,5 @@
  * Assumes fs == KERNEL_DS */
 void numa_default_policy(void)
 {
-	sys_set_mempolicy(MPOL_DEFAULT, NULL, 0);
+	sys_set_mempolicy(MPOL_DEFAULT, NULL, 0, POLICY_PAGE);
 }

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-20 19:00 [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation Ray Bryant
  2004-09-20 19:00 ` [PATCH 2.6.9-rc2-mm1 1/2] " Ray Bryant
  2004-09-20 19:00 ` [PATCH 2.6.9-rc2-mm1 2/2] " Ray Bryant
@ 2004-09-20 20:22 ` Paul Jackson
  2004-09-20 20:55 ` Andi Kleen
  3 siblings, 0 replies; 16+ messages in thread
From: Paul Jackson @ 2004-09-20 20:22 UTC (permalink / raw)
  To: Ray Bryant
  Cc: wli, mbligh, akpm, ak, raybry, linux-mm, jbarnes, djh, lse-tech,
	bcasavan, piggin, linux-kernel, haveblue

Nits ... 

1) better change this line in mempolicy.h

	#define MPOL_MAX MPOL_INTERLEAVE

   to be instead

	#define MPOL_MAX MPOL_ROUNDROBIN 

2) Why change the alloc_page_vma() code structure for
   MPOL_INTERLEAVE from starting with:

	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {

   to starting with:

	switch (pol->policy) {
		case MPOL_INTERLEAVE:

   Doesn't the original way work just as well (and keep the patch
   smaller)?  Just add another 'if(...MPOL_ROUNDROBIN)' section
   following the MPOL_INTERLEAVE section.  And the other switch
   statements in this file don't indent the case lines a tab further.

3) The following line looks like it could trigger after a
   hotplug node removal (not that I know how to do that yet):

	BUG_ON(!test_bit(nid, (const volatile void *) &node_online_map));

   Should the '(const volatile void *)' cast be a nodes_addr() wrapper? 

   Can this entire line be dropped, or turned into a test:

	if (!node_isset(nid, node_online_map))
		continue;

4) Patches done with the 'diff -p' option are slightly easier to
   read, as they show the procedure the diff seems to appear in.

5) I see no need for the 'else' in:

	-	if (pol->policy == MPOL_INTERLEAVE)
	+	if (pol->policy == MPOL_INTERLEAVE) {
	 		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
	+	} else if (pol->policy == MPOL_ROUNDROBIN) {
	+		return alloc_page_roundrobin(gfp, order, pol);
	+	}

   Couldn't one just have this less intrusive patch instead:

		if (pol->policy == MPOL_INTERLEAVE)
			return alloc_page_interleave(gfp, order, interleave_nodes(pol));
	+	if (pol->policy == MPOL_ROUNDROBIN)
	+		return alloc_page_roundrobin(gfp, order, pol);
		return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
	 }

6) Can the added rr_next in task_struct be shared with il_next?

7) Why doesn't alloc_page_roundrobin() have its own accounting, like
   alloc_page_interleave() does?

8) Could you explain the for loop in alloc_page_roundrobin()?  Won't
   the first call to __alloc_pages() within the loop search down all
   the nodes in the system, in a numa friendly order (closer nodes
   first)?  Why then pick another node to search from, in the next
   pass of the for loop, which will again search down the same nodes,
   using a differently sorted zonelist.  Obviously I'm missing something
   here.

9) Too bad there's not some pseudo-random value floating around somewhere,
   such as a per-node clock or something, that could be used to drive a
   pseudo-uniform distribution, without any need for the additional rr_next
   state?

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-20 19:00 [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation Ray Bryant
                   ` (2 preceding siblings ...)
  2004-09-20 20:22 ` [PATCH 2.6.9-rc2-mm1 0/2] " Paul Jackson
@ 2004-09-20 20:55 ` Andi Kleen
  2004-09-20 22:13   ` Ray Bryant
                     ` (2 more replies)
  3 siblings, 3 replies; 16+ messages in thread
From: Andi Kleen @ 2004-09-20 20:55 UTC (permalink / raw)
  To: Ray Bryant
  Cc: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
	Andi Kleen, Ray Bryant, linux-mm, Jesse Barnes, Dan Higgins,
	lse-tech, Brent Casavant, Nick Piggin, linux-kernel,
	Paul Jackson, Dave Hansen, stevel

On Mon, Sep 20, 2004 at 12:00:33PM -0700, Ray Bryant wrote:
> Background
> ----------
> 
> Last month, Jesse Barnes proposed a patch to do round robin
> allocation of page cache pages on NUMA machines.  This got shot down
> for a number of reasons (see
>   http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
> and the related thread), but it seemed to me that one of the most
> significant issues was that this was a workload dependent optimization.
> That is, for an Altix running an HPC workload, it was a good thing,
> but for web servers or file servers it was not such a good idea.
> 
> So the idea of this patch is the following:  it creates a new memory
> policy structure (default_pagecache_policy) that is used to control
> how storage for page cache pages is allocated.  So, for a large Altix
> running HPC workloads, we can specify a policy that does round robin
> allocations, and for other workloads you can specify the default policy
> (which results in page cache pages being allocated locally).
> 
> The default_pagecache_policy is overrideable on a per process basis, so
> that if your application prefers to allocate page cache pages locally,
> it can.

I'm not sure this really makes sense. Do you have some clear use 
case where having so much flexibility is needed? 

I would prefer to have a global setting somewhere for the page
cache (sysctl or sysfs or what you prefer) and some special handling for 
text pages. 

This would keep the per thread bloat low. 

Also I must say I got a patch submitted to do policy per
file from Steve Longerbeam. 

It so far only supports this for ELF executables, but
it has most of the infrastructure to do individual policy
per file. Maybe it would be better to go into this direction,
only thing missing is a nice way to declare policy for 
arbitary files. Even in this case a global default would be useful.

I haven't done anything with this patch yet due to missing time 
and there were a few small issues to resolve, but i hope it 
can be eventually integrated.

[Steve, perhaps you can repost the patch to lse-tech for more
wider review?]

> MPOL_ROUNDROBIN.  We need this because there is no handy offset to use
> when you get a call to allocate a page cache page in "page_cache_alloc()",
> so MPOL_INTERLEAVE doesn't do what we need.

Well, you just have to change the callers to pass it in. I think
computing the interleaving on a offset and perhaps another file
identifier is better than having the global counter.

> (It also appears to me that there is no mechanism to set the default
> policies, but perhaps its there and I am just missing it.)

No sure what default policies you mean? 

> (3)  alloc_pages_current() is now an inline, but there is no easy way
> to do that totally correclty with the current include file order (that I
> could figure out at least...)  The problem is that alloc_pages_current()
> wants to use the define constant POLICY_PAGE, but that is defined yet.
> We know it is zero, so we just use zero.  A comment in mempolicy.h
> suggests not to change the value of this constant to something other
> than zero, and references the file gfp.h.

I'm pretty sure the code I wrote didn't have a "POLICY_PAGE" ;-)
Not sure where you got it from, but you could ask whoever 
wrote that comment in your patch

> 
> (4)  I've not thought a bit about locking issues related to changing a
> mempolicy whilst the system is actually running. 

You need some kind of lock. Normally mempolicies are either
protected by being thread local or by the mmsem together
with the atomic reference count.
This only applies to modifications, for reading they are completely
stateless and don't need any locking.

Your new RR policy will break this though. It works for process
policy, but for VMA policy it will either require a lock per 
policy or some other complicated locking. Not nice.

I think doing it stateless is much better because it will scale
much better and should IMHO also have better behaviour longer term.
I went over several design iterations with this and think the 
current lockless design is very preferable.

> (5)  It seems there may be a potential conflict between the page cache
> mempolicy and a mmap mempolicy (do those exist?).  Here's the concern:

They exist for tmpfs/shmfs/hugetlbfs pages.

With Steve's page cache patch it can exist for all pages.

Normally NUMA API resolves this by prefering the more specific
policy (VMA over process) or sharing policies (for shmfs) 

Haven't read your patch in details yet, sorry, just design comments.

-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-20 20:55 ` Andi Kleen
@ 2004-09-20 22:13   ` Ray Bryant
  2004-09-20 22:37     ` Andi Kleen
  2004-09-20 22:38   ` Steve Longerbeam
  2004-09-20 23:48   ` Steve Longerbeam
  2 siblings, 1 reply; 16+ messages in thread
From: Ray Bryant @ 2004-09-20 22:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
	Ray Bryant, linux-mm, Jesse Barnes, Dan Higgins, lse-tech,
	Brent Casavant, Nick Piggin, linux-kernel, Paul Jackson,
	Dave Hansen, stevel

Hi Andi,

Andi Kleen wrote:
> On Mon, Sep 20, 2004 at 12:00:33PM -0700, Ray Bryant wrote:
> 
>>Background
>>----------
>>
>>Last month, Jesse Barnes proposed a patch to do round robin
>>allocation of page cache pages on NUMA machines.  This got shot down
>>for a number of reasons (see
>>  http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
>>and the related thread), but it seemed to me that one of the most
>>significant issues was that this was a workload dependent optimization.
>>That is, for an Altix running an HPC workload, it was a good thing,
>>but for web servers or file servers it was not such a good idea.
>>
>>So the idea of this patch is the following:  it creates a new memory
>>policy structure (default_pagecache_policy) that is used to control
>>how storage for page cache pages is allocated.  So, for a large Altix
>>running HPC workloads, we can specify a policy that does round robin
>>allocations, and for other workloads you can specify the default policy
>>(which results in page cache pages being allocated locally).
>>
>>The default_pagecache_policy is overrideable on a per process basis, so
>>that if your application prefers to allocate page cache pages locally,
>>it can.
> 
> 
> I'm not sure this really makes sense. Do you have some clear use 
> case where having so much flexibility is needed? 
>
> I would prefer to have a global setting somewhere for the page
> cache (sysctl or sysfs or what you prefer) and some special handling for 
> text pages. 
> 
> This would keep the per thread bloat low. 
>

Yeah, we can probably live with that until we come up with a good example.

The thread bloat in the currentpatch is one word per task struct, plus some 
extra checks in alloc_pages_by_policy().  The latter are more worrisome than 
the former.

My only concern about a separate sysctl or sysfs is that this really is a 
system wide memory allocation policy issue.  It seems cleaner to me to keep 
that all within the scope of the NUMA API rather than hiding details of it 
here and there in /proc.  And we need the full generality of the NUMA API, to, 
for example:

(1)  Restrict all page cache page pages to some subset of the nodes in the
      system.  We have memory only nodes, and people woule like page cache
      pages (not text pages!) to be confined to that set of nodes.  So we
      need some kind of nodemask to go along with the page cache allocation
      policy.

(2)  We need to fully support cpusets within all of this.  That means that
      page cache allocation requests coming from a node in the cpuset, should
      only be allocated within that cpuset.  And even then, perhaps we may
      want a subset of the nodes in the cpuset set to be designated as the
      places for page cache  pages to live.

If we do this within the scope of the existing mempolicy code, the cpuset
stuff comes along for free.

> Also I must say I got a patch submitted to do policy per
> file from Steve Longerbeam. 
> 
> It so far only supports this for ELF executables, but
> it has most of the infrastructure to do individual policy
> per file. Maybe it would be better to go into this direction,
> only thing missing is a nice way to declare policy for 
> arbitary files. Even in this case a global default would be useful.
> 

Yes, if there were a global default that we could set that would support
round robin page cache allocation, then that would probably work.

> I haven't done anything with this patch yet due to missing time 
> and there were a few small issues to resolve, but i hope it 
> can be eventually integrated.
> 
> [Steve, perhaps you can repost the patch to lse-tech for more
> wider review?]
> 

Since I've not seen it, its hard to evaluate.  :-)

> 
> 
>>MPOL_ROUNDROBIN.  We need this because there is no handy offset to use
>>when you get a call to allocate a page cache page in "page_cache_alloc()",
>>so MPOL_INTERLEAVE doesn't do what we need.
> 
> 
> Well, you just have to change the callers to pass it in. I think
> computing the interleaving on a offset and perhaps another file
> identifier is better than having the global counter.
>

In our case that means changing each and every call to page_cache_alloc()
to include an appropriate offset.  This is a change that richochets through 
the machine independent code and makes this harder to contain in the NUMA
subsystem.

Is there a performance problem with the global counter?  We've been using 
exactly that kind of implementation for our current Altix systems and it seems 
to work fine.  If you use some kind of offset and interleave as you suggest, 
how will you make sure that page cache allocations are evenly balanced across 
the nodes in a system, or the nodes in a cpuset?  Wouldn't it make more sense 
to spread them out dynamically based on actual usage?

For example, let suppose (just to be devious) that on a 2-node system you 
decided (poorly, admittedly) to use the bottom bit of the offset to chose the 
node.  And suppose that the user only touches the even numbered offsets in the 
file.  You'll clobber node 0 with all of the page cache pages, right?

Of course, that is a poor decision.  But, any type of static allocation like 
that based on offset is going to suffer from a similar type of worst case
behavior.  If you allocate the page cache page on the next node in sequence,
then we will smooth out page cache allocation based on actual usage patterns.
Isn't that a better idea?

> 
>>(It also appears to me that there is no mechanism to set the default
>>policies, but perhaps its there and I am just missing it.)
> 
> 
> No sure what default policies you mean? 
> 

Since there is (with this patch) a separate (default) policy to control 
allocation of page cache pages, there now has to be a way to set that policy.
Since the default_policy for regular page allocation can't be changed (it is, 
after all also the policy for allocating pages at interrupt time) there was no 
need for that API in the past.  Now, however, we need a way to set the system 
default page cache allocation policy, since some system administrators will 
want that to be MPOL_LOCAL and some will want that to be MPOL_INTERLEAVE or 
potentially MPOL_ROUNDROBIN depending on the workload that the system is running.

So we need some way to set the default policy for the page cache.  Somethling 
like this has to be there because without a way to round robin the page cache, 
we have no good way to be able to guarantee that when a user on our big boxen 
ask for local storage, there is a good likelyhood they will get it.

> 
>>(3)  alloc_pages_current() is now an inline, but there is no easy way
>>to do that totally correclty with the current include file order (that I
>>could figure out at least...)  The problem is that alloc_pages_current()
>>wants to use the define constant POLICY_PAGE, but that is defined yet.
>>We know it is zero, so we just use zero.  A comment in mempolicy.h
>>suggests not to change the value of this constant to something other
>>than zero, and references the file gfp.h.
> 
> 
> I'm pretty sure the code I wrote didn't have a "POLICY_PAGE" ;-)
> Not sure where you got it from, but you could ask whoever 
> wrote that comment in your patch
>

You're correct.  POLICY_PAGE is new in this patch, and I wrote that comment.
The note above explans why it is hard to get this #define constant in there.

What is going on is that what you used to refer to as default_policy is now 
default_policy[POLICY_PAGE] (Not the best name, but I couldn't think of a 
better one), the default policy for page cache pages is
default_policy[POLICY_PAGECACHE], etc.  (I'm sure we can think up several
others without too much problem.  BTW, these should probably be named
POLICY_CLASS_PAGE, or MEM_CLASS_PAGE or some such, because we are talking 
about allocating different kinds of storage.)

(One of the things we have in mind is POLICY_SLABCACHE; we also need to smooth 
out allocation of buffer heads (or whatever they are in 2.6.) so that when we
run a copy on a particular node, even though we smooth out the allocation of
page cache pages, we can end up with significanlty more storage use on that
node than other nodes, and that storage never goes away.  This is bad, bad for
the big HPC application that comes along and wants to allocate the same amount
of storage on each of 512 nodes, and it can't on node 37, for some reason.
If allocations on one of those nodes spill, then the entire parallel 
computation will be slowed down [typically, such a job runs as slowly as the
slowest processor]).

> 
>>(4)  I've not thought a bit about locking issues related to changing a
>>mempolicy whilst the system is actually running. 
> 
> 
> You need some kind of lock. Normally mempolicies are either
> protected by being thread local or by the mmsem together
> with the atomic reference count.
> This only applies to modifications, for reading they are completely
> stateless and don't need any locking.
> 
> Your new RR policy will break this though. It works for process
> policy, but for VMA policy it will either require a lock per 
> policy or some other complicated locking. Not nice.
> 

I agree.  We'd have to figure out some way around this.

> I think doing it stateless is much better because it will scale
> much better and should IMHO also have better behaviour longer term.
> I went over several design iterations with this and think the 
> current lockless design is very preferable.
> 

No argument there, I like scalable solutions better too.  :-)
Lets sort out some of the other stuff first and come back to this one later.

> 
>>(5)  It seems there may be a potential conflict between the page cache
>>mempolicy and a mmap mempolicy (do those exist?).  Here's the concern:
> 
> 
> They exist for tmpfs/shmfs/hugetlbfs pages.
> 
> With Steve's page cache patch it can exist for all pages.
> 
> Normally NUMA API resolves this by prefering the more specific
> policy (VMA over process) or sharing policies (for shmfs) 
> 
> Haven't read your patch in details yet, sorry, just design comments.
> 

That's fine.  The top level details need as much discussion as anything.

> -Andi
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
> 

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-20 22:13   ` Ray Bryant
@ 2004-09-20 22:37     ` Andi Kleen
  2004-09-20 23:16       ` William Lee Irwin III
  2004-09-21  1:30       ` Ray Bryant
  0 siblings, 2 replies; 16+ messages in thread
From: Andi Kleen @ 2004-09-20 22:37 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Andi Kleen, William Lee Irwin III, Martin J. Bligh,
	Andrew Morton, Ray Bryant, linux-mm, Jesse Barnes, Dan Higgins,
	lse-tech, Brent Casavant, Nick Piggin, linux-kernel,
	Paul Jackson, Dave Hansen, stevel

On Mon, Sep 20, 2004 at 05:13:34PM -0500, Ray Bryant wrote:
> system wide memory allocation policy issue.  It seems cleaner to me to keep 
> that all within the scope of the NUMA API rather than hiding details of it 
> here and there in /proc.  And we need the full generality of the NUMA API, 
> to, for example:

True for cpuset you will need it.

> >Well, you just have to change the callers to pass it in. I think
> >computing the interleaving on a offset and perhaps another file
> >identifier is better than having the global counter.
> >
> 
> In our case that means changing each and every call to page_cache_alloc()
> to include an appropriate offset.  This is a change that richochets through 
> the machine independent code and makes this harder to contain in the NUMA
> subsystem.

I count two callers of page_cache_alloc in 2.6.9rc2 (filemap.c and
XFS pagebuf). Hardly seems like a big issue to change them both. 
Of course getting the offset there might be tricky, but should be 
doable.

> 
> Is there a performance problem with the global counter?  We've been using 

There might be. You will need a global lock. 

> exactly that kind of implementation for our current Altix systems and it 
> seems to work fine.  If you use some kind of offset and interleave as you 
> suggest, how will you make sure that page cache allocations are evenly 
> balanced across the nodes in a system, or the nodes in a cpuset?  Wouldn't 
> it make more sense to spread them out dynamically based on actual usage?
> 
> For example, let suppose (just to be devious) that on a 2-node system you 
> decided (poorly, admittedly) to use the bottom bit of the offset to chose 
> the node.  And suppose that the user only touches the even numbered offsets 
> in the file.  You'll clobber node 0 with all of the page cache pages, right?
> 
> Of course, that is a poor decision.  But, any type of static allocation 
> like that based on offset is going to suffer from a similar type of worst 
> case
> behavior.  If you allocate the page cache page on the next node in sequence,
> then we will smooth out page cache allocation based on actual usage 
> patterns.

Your counter can have the same worst case behaviour, just 
different.  You only have to add freeing into the picture.
Or when you consider getting more memory bandwidth from the interleaving
(I know this is not your primary goal with this) then a sufficient
access pattern could lead to rather uninterleaved allocation 
in the file.

Any allocation algorithm will have such a worst case, so I'm not
too worried. Given ia hash function is not too bad it should
be bearable.

The nice advantage of the static offset is that it makes benchmarks
actually repeatable and is completely lockless

> >
> >
> >No sure what default policies you mean? 
> >
> 
> Since there is (with this patch) a separate (default) policy to control 
> allocation of page cache pages, there now has to be a way to set that 
> policy.
> Since the default_policy for regular page allocation can't be changed (it 
> is, after all also the policy for allocating pages at interrupt time) there 
> was no need for that API in the past.  Now, however, we need a way to set 
> the system default page cache allocation policy, since some system 
> administrators will want that to be MPOL_LOCAL and some will want that to 
> be MPOL_INTERLEAVE or potentially MPOL_ROUNDROBIN depending on the workload 
> that the system is running.

I think I'm still a bit confused by your terminology.
I thought the page cache policy was per process? Now you
are talking about another global unrelated policy?

With the per process policy the only way to change the policy
for the whole system is to change it in init and restart everything.
With a global policy you could change it on the fly, but 
it probably wouldn't make too much sense without a restart
because there would be already too much cache with the wrong
policy.

Anyways, I guess you could just add a high flag bit to the 
mode argument of set_mempolicy. Something like

set_mempolicy(MPOL_PAGECACHE | MPOL_INTERLEAVE, nodemask, len)

That would work for setting the page cache policy of the current
process. 


-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-20 22:37     ` Andi Kleen
@ 2004-09-20 23:16       ` William Lee Irwin III
  2004-09-21  1:30       ` Ray Bryant
  1 sibling, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-09-20 23:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, Andrew Morton, Ray Bryant, linux-mm, Jesse Barnes,
	Dan Higgins, lse-tech, Brent Casavant, Nick Piggin, linux-kernel,
	Paul Jackson, Dave Hansen, stevel

On Tue, Sep 21, 2004 at 12:37:42AM +0200, Andi Kleen wrote:
> Your counter can have the same worst case behaviour, just 
> different.  You only have to add freeing into the picture.
> Or when you consider getting more memory bandwidth from the interleaving
> (I know this is not your primary goal with this) then a sufficient
> access pattern could lead to rather uninterleaved allocation 
> in the file.
> Any allocation algorithm will have such a worst case, so I'm not
> too worried. Given ia hash function is not too bad it should
> be bearable.
> The nice advantage of the static offset is that it makes benchmarks
> actually repeatable and is completely lockless

The hash function looks like choosing the nth node whose corresponding
bit is set in node_online_map such that linear_page_index(vma, address)
(why isn't it using linear_page_index()?) mod num_online_nodes() is n,
which actually appears weak compared to various hash functions I've
seen in use for e.g. page coloring. The hash functions I've seen in use
are not tremendously more expensive than mod, and generally meant to be
computationally cheap as opposed to strong.

The kind of scheme you've employed for MPOL_INTERLEAVE is what would be
called "direct mapped" in the context of page coloring, and Ray Bryant's
would be called "bin hopping" there. A nontrivial (though not
necessarily complex or expensive) hash function mod num_online_nodes()
would be considered hashed, and the last category I see in use
elsewhere is a "best bin" algorithm, which tracks utilization of bins
(for page coloring, colors; here nodes) and chooses one of the least
utilized bins thus far.

I'd expect all 4 alternatives (and maybe even a variety of hash
functions for address hashing) to be useful in various contexts,
though I'm unaware of which kinds of apps want which algorithms most.

I don't have any idea what kind of difference the variations on the
locality domain for Bryant's bin hopping algorithm make; I'd tend to
try to make it similar to the others' precedents, but there may be
other interactions.

-- wli
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-20 22:37     ` Andi Kleen
  2004-09-20 23:16       ` William Lee Irwin III
@ 2004-09-21  1:30       ` Ray Bryant
  2004-09-21  9:13         ` Andi Kleen
  1 sibling, 1 reply; 16+ messages in thread
From: Ray Bryant @ 2004-09-21  1:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
	Ray Bryant, linux-mm, Jesse Barnes, Dan Higgins, lse-tech,
	Brent Casavant, Nick Piggin, linux-kernel, Paul Jackson,
	Dave Hansen, stevel

Andi Kleen wrote:
> On Mon, Sep 20, 2004 at 05:13:34PM -0500, Ray Bryant wrote:
> 
>>system wide memory allocation policy issue.  It seems cleaner to me to keep 
>>that all within the scope of the NUMA API rather than hiding details of it 
>>here and there in /proc.  And we need the full generality of the NUMA API, 
>>to, for example:
> 
> 
> True for cpuset you will need it.
> 
> 
>>>Well, you just have to change the callers to pass it in. I think
>>>computing the interleaving on a offset and perhaps another file
>>>identifier is better than having the global counter.
>>>
>>
>>In our case that means changing each and every call to page_cache_alloc()
>>to include an appropriate offset.  This is a change that richochets through 
>>the machine independent code and makes this harder to contain in the NUMA
>>subsystem.
> 
> 
> I count two callers of page_cache_alloc in 2.6.9rc2 (filemap.c and
> XFS pagebuf). Hardly seems like a big issue to change them both. 
> Of course getting the offset there might be tricky, but should be 
> doable.
> 

Fair enough.  Another option I was thinking of was hiding a global counter
in page_cache_alloc itself and using it to provide a value for the offset
there.

> 
>>Is there a performance problem with the global counter?  We've been using 
> 
> 
> There might be. You will need a global lock. 
>

Oh yeah, I am sorry, we do this so often I forget.  What I really would do
is to have a per cpu counter so that we can increment that without a lock.
(I was being sloppy in terminology -- that's what efficient global counters
mean to me.  Sorry. :-))  With that idea and the above, I think I will be
able to get by without MPOL_ROUNDROBIN.  I'll check with Brent to see what
he can for the tmp fs code in that case.

> 
>>exactly that kind of implementation for our current Altix systems and it 
>>seems to work fine.  If you use some kind of offset and interleave as you 
>>suggest, how will you make sure that page cache allocations are evenly 
>>balanced across the nodes in a system, or the nodes in a cpuset?  Wouldn't 
>>it make more sense to spread them out dynamically based on actual usage?
>>
>>For example, let suppose (just to be devious) that on a 2-node system you 
>>decided (poorly, admittedly) to use the bottom bit of the offset to chose 
>>the node.  And suppose that the user only touches the even numbered offsets 
>>in the file.  You'll clobber node 0 with all of the page cache pages, right?
>>
>>Of course, that is a poor decision.  But, any type of static allocation 
>>like that based on offset is going to suffer from a similar type of worst 
>>case
>>behavior.  If you allocate the page cache page on the next node in sequence,
>>then we will smooth out page cache allocation based on actual usage 
>>patterns.
> 
> 
> Your counter can have the same worst case behaviour, just 
> different.  You only have to add freeing into the picture.
> Or when you consider getting more memory bandwidth from the interleaving
> (I know this is not your primary goal with this) then a sufficient
> access pattern could lead to rather uninterleaved allocation 
> in the file.
> 
> Any allocation algorithm will have such a worst case, so I'm not
> too worried. Given ia hash function is not too bad it should
> be bearable.
> 
> The nice advantage of the static offset is that it makes benchmarks
> actually repeatable and is completely lockless
> 

I can see the advantages of that.  But the state of the page cache is still
something we have to deal with for benchmarks.

> 
>>>
>>>No sure what default policies you mean? 
>>>
>>
>>Since there is (with this patch) a separate (default) policy to control 
>>allocation of page cache pages, there now has to be a way to set that 
>>policy.
>>Since the default_policy for regular page allocation can't be changed (it 
>>is, after all also the policy for allocating pages at interrupt time) there 
>>was no need for that API in the past.  Now, however, we need a way to set 
>>the system default page cache allocation policy, since some system 
>>administrators will want that to be MPOL_LOCAL and some will want that to 
>>be MPOL_INTERLEAVE or potentially MPOL_ROUNDROBIN depending on the workload 
>>that the system is running.
> 
> 
> I think I'm still a bit confused by your terminology.
> I thought the page cache policy was per process? Now you
> are talking about another global unrelated policy?
> 


I'm sorry if this is confusing, personal terminology usually gets in the way.

The idea is that just like for the page allocation policy (your current code), 
if you wanted, you would have a global, default page cache allocation policy, 
probably set at boot time or shortly thereafter, probably before any (or at 
least most) page cache pages have been allocated.  You could also have a per 
process policy setting that would override the global policy, for processes 
that needed it, but I honestly don't have a good case for this except for 
symmetry with the existing code.

> With the per process policy the only way to change the policy
> for the whole system is to change it in init and restart everything.
> With a global policy you could change it on the fly, but 
> it probably wouldn't make too much sense without a restart
> because there would be already too much cache with the wrong
> policy.
> 

Or we could flush the page cache and change the policy.  It wouldn't
be perfect but it could be close enough.

We do need to be able to set the global policy without recompiling the kernel.
So if we set if from init scripts early enough in boot it should be ok,
I would think.  Not perfect, but good enough.  Remember, we worried
about 10-100 GB files here.  A few MB is not a big deal.

> Anyways, I guess you could just add a high flag bit to the 
> mode argument of set_mempolicy. Something like
> 
> set_mempolicy(MPOL_PAGECACHE | MPOL_INTERLEAVE, nodemask, len)
> 
> That would work for setting the page cache policy of the current
> process. 
> 
> 

That's an idea.  Not they way I was planning on doing it, but that would
work.  I was thinking along the lines of:

set_mempolicy(MPOL_INTERLEAVE, nodemask, len, POLICY_PAGECACHE);

but either way can be made to work.

Am I helping make this clearer or is it getting worse?

> -Andi
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
> 


-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-21  1:30       ` Ray Bryant
@ 2004-09-21  9:13         ` Andi Kleen
  2004-09-21  9:33           ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Andi Kleen @ 2004-09-21  9:13 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Andi Kleen, William Lee Irwin III, Martin J. Bligh,
	Andrew Morton, Ray Bryant, linux-mm, Jesse Barnes, Dan Higgins,
	lse-tech, Brent Casavant, Nick Piggin, linux-kernel,
	Paul Jackson, Dave Hansen, stevel

On Mon, Sep 20, 2004 at 08:30:12PM -0500, Ray Bryant wrote:
> Andi Kleen wrote:
> >On Mon, Sep 20, 2004 at 05:13:34PM -0500, Ray Bryant wrote:
> >
> >>system wide memory allocation policy issue.  It seems cleaner to me to 
> >>keep that all within the scope of the NUMA API rather than hiding details 
> >>of it here and there in /proc.  And we need the full generality of the 
> >>NUMA API, to, for example:
> >
> >
> >True for cpuset you will need it.
> >
> >
> >>>Well, you just have to change the callers to pass it in. I think
> >>>computing the interleaving on a offset and perhaps another file
> >>>identifier is better than having the global counter.
> >>>
> >>
> >>In our case that means changing each and every call to page_cache_alloc()
> >>to include an appropriate offset.  This is a change that richochets 
> >>through the machine independent code and makes this harder to contain in 
> >>the NUMA
> >>subsystem.
> >
> >
> >I count two callers of page_cache_alloc in 2.6.9rc2 (filemap.c and
> >XFS pagebuf). Hardly seems like a big issue to change them both. 
> >Of course getting the offset there might be tricky, but should be 
> >doable.
> >
> 
> Fair enough.  Another option I was thinking of was hiding a global counter
> in page_cache_alloc itself and using it to provide a value for the offset
> there.

Please don't. Just use an offset and a hash on (dev_t, inode number) 

> >
> >Any allocation algorithm will have such a worst case, so I'm not
> >too worried. Given ia hash function is not too bad it should
> >be bearable.
> >
> >The nice advantage of the static offset is that it makes benchmarks
> >actually repeatable and is completely lockless
> >
> 
> I can see the advantages of that.  But the state of the page cache is still
> something we have to deal with for benchmarks.

Umounting the file systems with the data files usually works pretty
well.

Or longer term if it's a real issue one could write a workload manager
that can actually change policies for existing pages. But I'm not 
sure how such a beast would really work.

> >>be MPOL_INTERLEAVE or potentially MPOL_ROUNDROBIN depending on the 
> >>workload that the system is running.
> >
> >
> >I think I'm still a bit confused by your terminology.
> >I thought the page cache policy was per process? Now you
> >are talking about another global unrelated policy?
> >
> 
> 
> I'm sorry if this is confusing, personal terminology usually gets in the 
> way.
> 
> The idea is that just like for the page allocation policy (your current 
> code), if you wanted, you would have a global, default page cache 

Having both a per process page cache and a global page cache policy
would seem like overkill to me.

And having both doesn't make much sense anyways, because when the 
system admin wants to change the global policy to free memory
on nodes he would still need to worry about conflicting per process policies 
anyways. So as soon as you have process policy you cannot easily
change global anymore.

> allocation policy, probably set at boot time or shortly thereafter, 
> probably before any (or at least most) page cache pages have been 
> allocated.  You could also have a per process policy setting that would 
> override the global policy, for processes that needed it, but I honestly 
> don't have a good case for this except for symmetry with the existing code.

cpusets was the good case for it that you mentioned. 
Or did I misunderstand you?

> >Anyways, I guess you could just add a high flag bit to the 
> >mode argument of set_mempolicy. Something like
> >
> >set_mempolicy(MPOL_PAGECACHE | MPOL_INTERLEAVE, nodemask, len)
> >
> >That would work for setting the page cache policy of the current
> >process. 
> >
> >
> 
> That's an idea.  Not they way I was planning on doing it, but that would
> work.  I was thinking along the lines of:
> 
> set_mempolicy(MPOL_INTERLEAVE, nodemask, len, POLICY_PAGECACHE);
> 
> but either way can be made to work.

That would be set_mempolicy2() essentially because the existing
users don't pass this additional argument. I think passing the flags
in the first argument is more compatible.

-andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-21  9:13         ` Andi Kleen
@ 2004-09-21  9:33           ` William Lee Irwin III
  2004-09-21 13:10             ` Ray Bryant
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2004-09-21  9:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, Andrew Morton, Ray Bryant, linux-mm, Jesse Barnes,
	Dan Higgins, lse-tech, Brent Casavant, Nick Piggin, linux-kernel,
	Paul Jackson, Dave Hansen, stevel

On Mon, Sep 20, 2004 at 08:30:12PM -0500, Ray Bryant wrote:
>> I'm sorry if this is confusing, personal terminology usually gets in the 
>> way.
>> The idea is that just like for the page allocation policy (your current 
>> code), if you wanted, you would have a global, default page cache 
>
On Tue, Sep 21, 2004 at 11:13:54AM +0200, Andi Kleen wrote:
> Having both a per process page cache and a global page cache policy
> would seem like overkill to me.
> And having both doesn't make much sense anyways, because when the 
> system admin wants to change the global policy to free memory
> on nodes he would still need to worry about conflicting per process policies 
> anyways. So as soon as you have process policy you cannot easily
> change global anymore.

Ray, would being able to change the default policy via kernel command-
line options (and perhaps sysctl) suffice? It seems that a global
default and some global state (e.g. per-cpu state) should largely
capture what you're after. If not, could you clarify where it doesn't?

Also, this switch statement stuff is getting a little hairy; maybe
it's time to bring in mempolicy_ops. Or at least trudging through the
switch () statements is turning into a moderate amount of work for me.


-- wli
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-21  9:33           ` William Lee Irwin III
@ 2004-09-21 13:10             ` Ray Bryant
  0 siblings, 0 replies; 16+ messages in thread
From: Ray Bryant @ 2004-09-21 13:10 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andi Kleen, Andrew Morton, Ray Bryant, linux-mm, Jesse Barnes,
	Dan Higgins, lse-tech, Brent Casavant, Nick Piggin, linux-kernel,
	Paul Jackson, Dave Hansen, stevel

William Lee Irwin III wrote:
> On Mon, Sep 20, 2004 at 08:30:12PM -0500, Ray Bryant wrote:
> 
>>>I'm sorry if this is confusing, personal terminology usually gets in the 
>>>way.
>>>The idea is that just like for the page allocation policy (your current 
>>>code), if you wanted, you would have a global, default page cache 
>>
> On Tue, Sep 21, 2004 at 11:13:54AM +0200, Andi Kleen wrote:
> 
>>Having both a per process page cache and a global page cache policy
>>would seem like overkill to me.
>>And having both doesn't make much sense anyways, because when the 
>>system admin wants to change the global policy to free memory
>>on nodes he would still need to worry about conflicting per process policies 
>>anyways. So as soon as you have process policy you cannot easily
>>change global anymore.
>

(Andi,)

I don't think the requirement is so much to be able to dynamically change
policies while the system is running, I think that the requirement is to be
able to set the global policy at boot time or near then.  We don't want to
have to recompile the kernel to boot it for a webserver or fileserver (likely 
local page cache allocation) versus an HPC system (likely round robin
page cache allocation policy).

The rationale for the local policy is partially implementation (see below),
and thinking that even on your big HPC system, some processes in some cpusets
might be running a file server application, and they would want local
allocation.  Having said that, of course, begs the question as to whether we
need a per cpuset policy, I suppose.  But read on for the implemenation
argument, and if that is not persuasive, then I'll go with a global only
policy and see where that leads.

> 
> Ray, would being able to change the default policy via kernel command-
> line options (and perhaps sysctl) suffice? It seems that a global
> default and some global state (e.g. per-cpu state) should largely
> capture what you're after. If not, could you clarify where it doesn't?
> 

Bill,

That would capture most of our requirements, I think.  Part of the reason
for doing a global and local policy is that is the way that the code works 
now, and all I did was piggyback on that.  So, because there is a global
policy with a per process override for the existing page allocation policy,
you get a similiar structure for the page cache policy.

The overhead is an additional word per task structure, an additional
mempolicy copy (if there is a per process page cache allocation policy)
and structure, plus some code that looks like this in alloc_pages_by_policy():

struct page *
alloc_pages_by_policy(unsigned gfp, unsigned order, unsigned policy)
  {

        struct mempolicy *pol;

        if (policy >= NR_MEM_POLICIES)
                BUG();
        pol = current->mempolicy[policy];
        if (!pol)
                pol = default_policy[policy];
. . .

Som it is elegant that way and readily allows for additional memory
allocation cases (slab cache, anyone?).

> Also, this switch statement stuff is getting a little hairy; maybe
> it's time to bring in mempolicy_ops. Or at least trudging through the
> switch () statements is turning into a moderate amount of work for me.
> 
> 
> -- wli
> 

Yes, it is getting a little out of hand.  However, if we can get by without
the MEMPOL_ROUNDROBIN, we are back to where we were there.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-20 20:55 ` Andi Kleen
  2004-09-20 22:13   ` Ray Bryant
@ 2004-09-20 22:38   ` Steve Longerbeam
  2004-09-20 23:48   ` Steve Longerbeam
  2 siblings, 0 replies; 16+ messages in thread
From: Steve Longerbeam @ 2004-09-20 22:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ray Bryant, William Lee Irwin III, Martin J. Bligh,
	Andrew Morton, Ray Bryant, linux-mm, Jesse Barnes, Dan Higgins,
	lse-tech, Brent Casavant, Nick Piggin, linux-kernel,
	Paul Jackson, Dave Hansen


[-- Attachment #1.1: Type: text/plain, Size: 4331 bytes --]



Andi Kleen wrote:

>On Mon, Sep 20, 2004 at 12:00:33PM -0700, Ray Bryant wrote:
>  
>
>>Background
>>----------
>>
>>Last month, Jesse Barnes proposed a patch to do round robin
>>allocation of page cache pages on NUMA machines.  This got shot down
>>for a number of reasons (see
>>  http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
>>and the related thread), but it seemed to me that one of the most
>>significant issues was that this was a workload dependent optimization.
>>That is, for an Altix running an HPC workload, it was a good thing,
>>but for web servers or file servers it was not such a good idea.
>>
>>So the idea of this patch is the following:  it creates a new memory
>>policy structure (default_pagecache_policy) that is used to control
>>how storage for page cache pages is allocated.  So, for a large Altix
>>running HPC workloads, we can specify a policy that does round robin
>>allocations, and for other workloads you can specify the default policy
>>(which results in page cache pages being allocated locally).
>>
>>The default_pagecache_policy is overrideable on a per process basis, so
>>that if your application prefers to allocate page cache pages locally,
>>it can.
>>    
>>
>
>I'm not sure this really makes sense. Do you have some clear use 
>case where having so much flexibility is needed? 
>
>I would prefer to have a global setting somewhere for the page
>cache (sysctl or sysfs or what you prefer) and some special handling for 
>text pages. 
>
>This would keep the per thread bloat low. 
>
>Also I must say I got a patch submitted to do policy per
>file from Steve Longerbeam. 
>
>It so far only supports this for ELF executables, but
>it has most of the infrastructure to do individual policy
>per file. Maybe it would be better to go into this direction,
>only thing missing is a nice way to declare policy for 
>arbitary files. Even in this case a global default would be useful.
>
>I haven't done anything with this patch yet due to missing time 
>and there were a few small issues to resolve, but i hope it 
>can be eventually integrated.
>
>[Steve, perhaps you can repost the patch to lse-tech for more
>wider review?]
>  
>

Sure, patch is attached. Also, here is a reposting of my original email to
you (Andi) describing the patch. Btw, I received your comments on the
patch, I will reply to your points seperately. Sorry I haven't replied 
sooner,
I'm in the middle of switching jobs  :-)


-------- original email follows ----------

Hi Andi,

I'm working on adding the features to NUMA mempolicy
necessary to support MontaVista's MTA.

Attached is the first of those features, support for
global page allocation policy for mapped files. Here's
what the patch is doing:

1. add a shared_policy tree to the address_space object in fs.h.
2. modify page_cache_alloc() in pagemap.h to take an address_space
    object and page offset, and use those to allocate a page for the
    page cache using the policy in the address_space object.
3. modify filemap.c to pass the additional {mapping, page offset} pair
    to page_cache_alloc().
4. Also in filemap.c, implement generic file {set|get}_policy() methods and
    add those to generic_file_vm_ops.
5. In filemap_nopage(), verify that any existing page located in the cache
    is located in a node that satisfies the file's policy. If it's not 
in a node that
    satisfies the policy, it must be because the page was allocated 
before the
    file had any policies. If it's unused, free it and goto retry_find 
(will allocate
    a new page using the file's policy). Note that a similar operation 
is done in
    exec.c:setup_arg_pages() for stack pages.
6. Init the file's shared policy in alloc_inode(), and free the shared 
policy in
    destroy_inode().

I'm working on the remaining features needed for MTA. They are:

- support for policies contained in ELF images, for text and data regions.
- support for do_mmap_mempolicy() and do_brk_mempolicy(). Do_mmap()
   can allocate pages to the region before the function exits, such as 
when pages
   are locked for the region. So it's necessary in that case to set the 
VMA's policy
   within do_mmap() before those pages are allocated.
- system calls for mmap_mempolicy and brk_mempolicy.

Let me know your thoughts on the filemap policy patch.

Thanks,
Steve


[-- Attachment #1.2: Type: text/html, Size: 5330 bytes --]

[-- Attachment #2: filemap-policy.patch --]
[-- Type: text/plain, Size: 10822 bytes --]

diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/fs/exec.c 2.6.8-rc3/fs/exec.c
--- 2.6.8-rc3.orig/fs/exec.c	2004-08-10 15:18:07.000000000 -0700
+++ 2.6.8-rc3/fs/exec.c	2004-09-01 21:53:25.000000000 -0700
@@ -439,6 +439,25 @@
 	for (i = 0 ; i < MAX_ARG_PAGES ; i++) {
 		struct page *page = bprm->page[i];
 		if (page) {
+#ifdef CONFIG_NUMA
+			if (!mpol_node_valid(page_to_nid(page), mpnt, 0)) {
+				void *from, *to;
+				struct page * new_page =
+					alloc_pages_current(GFP_HIGHUSER, 0);
+				if (!new_page) {
+					up_write(&mm->mmap_sem);
+					kmem_cache_free(vm_area_cachep, mpnt);
+					return -ENOMEM;
+				}
+				from = kmap(page);
+				to = kmap(new_page);
+				copy_page(to, from);
+				kunmap(page);
+				kunmap(new_page);
+				put_page(page);
+				page = new_page;
+			}
+#endif
 			bprm->page[i] = NULL;
 			install_arg_page(mpnt, page, stack_base);
 		}
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/fs/inode.c 2.6.8-rc3/fs/inode.c
--- 2.6.8-rc3.orig/fs/inode.c	2004-08-10 15:18:07.000000000 -0700
+++ 2.6.8-rc3/fs/inode.c	2004-09-01 11:40:44.000000000 -0700
@@ -150,6 +150,7 @@
 		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
+ 		mpol_shared_policy_init(&mapping->policy);
 
 		/*
 		 * If the block_device provides a backing_dev_info for client
@@ -177,11 +178,12 @@
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
-	else
+	else {
+		mpol_free_shared_policy(&inode->i_mapping->policy);
 		kmem_cache_free(inode_cachep, (inode));
+	}
 }
 
-
 /*
  * These are initializations that only need to be done
  * once, because the fields are idempotent across use
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/fs.h 2.6.8-rc3/include/linux/fs.h
--- 2.6.8-rc3.orig/include/linux/fs.h	2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/fs.h	2004-09-01 21:08:37.000000000 -0700
@@ -18,6 +18,7 @@
 #include <linux/cache.h>
 #include <linux/prio_tree.h>
 #include <linux/kobject.h>
+#include <linux/mempolicy.h>
 #include <asm/atomic.h>
 
 struct iovec;
@@ -339,6 +340,7 @@
 	atomic_t		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+	struct shared_policy    policy;         /* page alloc policy */
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/mempolicy.h 2.6.8-rc3/include/linux/mempolicy.h
--- 2.6.8-rc3.orig/include/linux/mempolicy.h	2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/mempolicy.h	2004-09-01 21:54:34.000000000 -0700
@@ -152,6 +152,8 @@
 void mpol_free_shared_policy(struct shared_policy *p);
 struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 					    unsigned long idx);
+struct page *alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+				      unsigned long idx);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/pagemap.h 2.6.8-rc3/include/linux/pagemap.h
--- 2.6.8-rc3.orig/include/linux/pagemap.h	2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/pagemap.h	2004-09-01 11:04:24.000000000 -0700
@@ -50,14 +50,24 @@
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
-static inline struct page *page_cache_alloc(struct address_space *x)
+
+static inline struct page *__page_cache_alloc(struct address_space *x,
+					      unsigned long idx,
+					      unsigned int gfp_mask)
+{
+	return alloc_page_shared_policy(gfp_mask, &x->policy, idx);
+}
+
+static inline struct page *page_cache_alloc(struct address_space *x,
+					    unsigned long idx)
 {
-	return alloc_pages(mapping_gfp_mask(x), 0);
+	return __page_cache_alloc(x, idx, mapping_gfp_mask(x));
 }
 
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *x,
+						 unsigned long idx)
 {
-	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
+	return __page_cache_alloc(x, idx, mapping_gfp_mask(x)|__GFP_COLD);
 }
 
 typedef int filler_t(void *, struct page *);
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/filemap.c 2.6.8-rc3/mm/filemap.c
--- 2.6.8-rc3.orig/mm/filemap.c	2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/filemap.c	2004-09-01 21:52:06.000000000 -0700
@@ -534,7 +534,8 @@
 	page = find_lock_page(mapping, index);
 	if (!page) {
 		if (!cached_page) {
-			cached_page = alloc_page(gfp_mask);
+			cached_page = __page_cache_alloc(mapping, index,
+							 gfp_mask);
 			if (!cached_page)
 				return NULL;
 		}
@@ -627,7 +628,7 @@
 		return NULL;
 	}
 	gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
-	page = alloc_pages(gfp_mask, 0);
+	page = __page_cache_alloc(mapping, index, gfp_mask);
 	if (page && add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
 		page_cache_release(page);
 		page = NULL;
@@ -789,7 +790,7 @@
 		 * page..
 		 */
 		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
+			cached_page = page_cache_alloc_cold(mapping, index);
 			if (!cached_page) {
 				desc->error = -ENOMEM;
 				goto out;
@@ -1050,7 +1051,7 @@
 	struct page *page; 
 	int error;
 
-	page = page_cache_alloc_cold(mapping);
+	page = page_cache_alloc_cold(mapping, offset);
 	if (!page)
 		return -ENOMEM;
 
@@ -1070,6 +1071,7 @@
 	return error == -EEXIST ? 0 : error;
 }
 
+
 #define MMAP_LOTSAMISS  (100)
 
 /*
@@ -1090,7 +1092,7 @@
 	struct page *page;
 	unsigned long size, pgoff, endoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
-
+	
 	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 	endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
@@ -1162,6 +1164,38 @@
 			goto no_cached_page;
 	}
 
+#ifdef CONFIG_NUMA
+	if (!mpol_node_valid(page_to_nid(page), area, 0)) {
+		/*
+		 * the page in the cache is not in any of the nodes this
+		 * VMA's policy wants it to be in. Can we remove it?
+		 */
+		lock_page(page);
+		if (page_count(page) - !!PagePrivate(page) == 2) {
+			/*
+			 * This page isn't being used by any mappings,
+			 * so we can safely remove it. It must be left
+			 * over from an earlier file IO readahead when
+			 * there was no page allocation policy associated
+			 * with the file.
+			 */
+			spin_lock(&mapping->tree_lock);
+			__remove_from_page_cache(page);
+			spin_unlock(&mapping->tree_lock);
+			page_cache_release(page);  /* pagecache ref */
+			unlock_page(page);
+			page_cache_release(page);  /* us */
+			goto retry_find;
+		} else {
+			/*
+			 * darn, the page is being used by other mappings.
+			 * We'll just have to leave the page in this node.
+			 */
+			unlock_page(page);
+		}
+	}
+#endif
+	
 	if (!did_readaround)
 		ra->mmap_hit++;
 
@@ -1431,9 +1465,35 @@
 	return 0;
 }
 
+
+#ifdef CONFIG_NUMA
+int generic_file_set_policy(struct vm_area_struct *vma,
+			    struct mempolicy *new)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	return mpol_set_shared_policy(&mapping->policy, vma, new);
+}
+
+struct mempolicy *
+generic_file_get_policy(struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long idx;
+	
+	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&mapping->policy, idx);
+}
+#endif
+
+
 static struct vm_operations_struct generic_file_vm_ops = {
 	.nopage		= filemap_nopage,
 	.populate	= filemap_populate,
+#ifdef CONFIG_NUMA
+	.set_policy     = generic_file_set_policy,
+	.get_policy     = generic_file_get_policy,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
@@ -1483,7 +1543,7 @@
 	page = find_get_page(mapping, index);
 	if (!page) {
 		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
+			cached_page = page_cache_alloc_cold(mapping, index);
 			if (!cached_page)
 				return ERR_PTR(-ENOMEM);
 		}
@@ -1565,7 +1625,7 @@
 	page = find_lock_page(mapping, index);
 	if (!page) {
 		if (!*cached_page) {
-			*cached_page = page_cache_alloc(mapping);
+			*cached_page = page_cache_alloc(mapping, index);
 			if (!*cached_page)
 				return NULL;
 		}
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/mempolicy.c 2.6.8-rc3/mm/mempolicy.c
--- 2.6.8-rc3.orig/mm/mempolicy.c	2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/mempolicy.c	2004-09-01 21:49:14.000000000 -0700
@@ -638,6 +638,7 @@
 	return page;
 }
 
+
 /**
  * 	alloc_page_vma	- Allocate a page for a VMA.
  *
@@ -683,6 +684,7 @@
 	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 }
 
+
 /**
  * 	alloc_pages_current - Allocate pages.
  *
@@ -1003,6 +1005,28 @@
 	up(&p->sem);
 }
 
+struct page *
+alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+			 unsigned long idx)
+{
+	struct page *page;
+	
+	if (sp) {
+		struct vm_area_struct pvma;
+		/* Create a pseudo vma that just contains the policy */
+		memset(&pvma, 0, sizeof(struct vm_area_struct));
+		pvma.vm_end = PAGE_SIZE;
+		pvma.vm_pgoff = idx;
+		pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
+		page = alloc_page_vma(gfp, &pvma, 0);
+		mpol_free(pvma.vm_policy);
+	} else {
+		page = alloc_pages(gfp, 0);
+	}
+
+	return page;
+}
+
 /* assumes fs == KERNEL_DS */
 void __init numa_policy_init(void)
 {
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/readahead.c 2.6.8-rc3/mm/readahead.c
--- 2.6.8-rc3.orig/mm/readahead.c	2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/readahead.c	2004-09-01 20:39:14.000000000 -0700
@@ -246,7 +246,7 @@
 			continue;
 
 		spin_unlock_irq(&mapping->tree_lock);
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, page_offset);
 		spin_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/shmem.c 2.6.8-rc3/mm/shmem.c
--- 2.6.8-rc3.orig/mm/shmem.c	2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/shmem.c	2004-09-01 11:14:48.000000000 -0700
@@ -824,16 +824,7 @@
 shmem_alloc_page(unsigned long gfp, struct shmem_inode_info *info,
 		 unsigned long idx)
 {
-	struct vm_area_struct pvma;
-	struct page *page;
-
-	memset(&pvma, 0, sizeof(struct vm_area_struct));
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
-	pvma.vm_pgoff = idx;
-	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
-	mpol_free(pvma.vm_policy);
-	return page;
+	return alloc_page_shared_policy(gfp, &info->policy, idx);
 }
 #else
 static inline struct page *


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
  2004-09-20 20:55 ` Andi Kleen
  2004-09-20 22:13   ` Ray Bryant
  2004-09-20 22:38   ` Steve Longerbeam
@ 2004-09-20 23:48   ` Steve Longerbeam
  2004-09-23 15:54     ` [PATCH " Ray Bryant
  2 siblings, 1 reply; 16+ messages in thread
From: Steve Longerbeam @ 2004-09-20 23:48 UTC (permalink / raw)
  To: linux-mm, lse-tech, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4331 bytes --]



Andi Kleen wrote:

>On Mon, Sep 20, 2004 at 12:00:33PM -0700, Ray Bryant wrote:
>  
>
>>Background
>>----------
>>
>>Last month, Jesse Barnes proposed a patch to do round robin
>>allocation of page cache pages on NUMA machines.  This got shot down
>>for a number of reasons (see
>>  http://marc.theaimsgroup.com/?l=linux-kernel&m=109235420329360&w=2
>>and the related thread), but it seemed to me that one of the most
>>significant issues was that this was a workload dependent optimization.
>>That is, for an Altix running an HPC workload, it was a good thing,
>>but for web servers or file servers it was not such a good idea.
>>
>>So the idea of this patch is the following:  it creates a new memory
>>policy structure (default_pagecache_policy) that is used to control
>>how storage for page cache pages is allocated.  So, for a large Altix
>>running HPC workloads, we can specify a policy that does round robin
>>allocations, and for other workloads you can specify the default policy
>>(which results in page cache pages being allocated locally).
>>
>>The default_pagecache_policy is overrideable on a per process basis, so
>>that if your application prefers to allocate page cache pages locally,
>>it can.
>>    
>>
>
>I'm not sure this really makes sense. Do you have some clear use 
>case where having so much flexibility is needed? 
>
>I would prefer to have a global setting somewhere for the page
>cache (sysctl or sysfs or what you prefer) and some special handling for 
>text pages. 
>
>This would keep the per thread bloat low. 
>
>Also I must say I got a patch submitted to do policy per
>file from Steve Longerbeam. 
>
>It so far only supports this for ELF executables, but
>it has most of the infrastructure to do individual policy
>per file. Maybe it would be better to go into this direction,
>only thing missing is a nice way to declare policy for 
>arbitary files. Even in this case a global default would be useful.
>
>I haven't done anything with this patch yet due to missing time 
>and there were a few small issues to resolve, but i hope it 
>can be eventually integrated.
>
>[Steve, perhaps you can repost the patch to lse-tech for more
>wider review?]
>  
>

Sure, patch is attached. Also, here is a reposting of my original email to
you (Andi) describing the patch. Btw, I received your comments on the
patch, I will reply to your points seperately. Sorry I haven't replied 
sooner,
I'm in the middle of switching jobs  :-)


-------- original email follows ----------

Hi Andi,

I'm working on adding the features to NUMA mempolicy
necessary to support MontaVista's MTA.

Attached is the first of those features, support for
global page allocation policy for mapped files. Here's
what the patch is doing:

1. add a shared_policy tree to the address_space object in fs.h.
2. modify page_cache_alloc() in pagemap.h to take an address_space
    object and page offset, and use those to allocate a page for the
    page cache using the policy in the address_space object.
3. modify filemap.c to pass the additional {mapping, page offset} pair
    to page_cache_alloc().
4. Also in filemap.c, implement generic file {set|get}_policy() methods and
    add those to generic_file_vm_ops.
5. In filemap_nopage(), verify that any existing page located in the cache
    is located in a node that satisfies the file's policy. If it's not 
in a node that
    satisfies the policy, it must be because the page was allocated 
before the
    file had any policies. If it's unused, free it and goto retry_find 
(will allocate
    a new page using the file's policy). Note that a similar operation 
is done in
    exec.c:setup_arg_pages() for stack pages.
6. Init the file's shared policy in alloc_inode(), and free the shared 
policy in
    destroy_inode().

I'm working on the remaining features needed for MTA. They are:

- support for policies contained in ELF images, for text and data regions.
- support for do_mmap_mempolicy() and do_brk_mempolicy(). Do_mmap()
   can allocate pages to the region before the function exits, such as 
when pages
   are locked for the region. So it's necessary in that case to set the 
VMA's policy
   within do_mmap() before those pages are allocated.
- system calls for mmap_mempolicy and brk_mempolicy.

Let me know your thoughts on the filemap policy patch.

Thanks,
Steve


[-- Attachment #2: filemap-policy.patch --]
[-- Type: text/plain, Size: 10821 bytes --]

diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/fs/exec.c 2.6.8-rc3/fs/exec.c
--- 2.6.8-rc3.orig/fs/exec.c	2004-08-10 15:18:07.000000000 -0700
+++ 2.6.8-rc3/fs/exec.c	2004-09-01 21:53:25.000000000 -0700
@@ -439,6 +439,25 @@
 	for (i = 0 ; i < MAX_ARG_PAGES ; i++) {
 		struct page *page = bprm->page[i];
 		if (page) {
+#ifdef CONFIG_NUMA
+			if (!mpol_node_valid(page_to_nid(page), mpnt, 0)) {
+				void *from, *to;
+				struct page * new_page =
+					alloc_pages_current(GFP_HIGHUSER, 0);
+				if (!new_page) {
+					up_write(&mm->mmap_sem);
+					kmem_cache_free(vm_area_cachep, mpnt);
+					return -ENOMEM;
+				}
+				from = kmap(page);
+				to = kmap(new_page);
+				copy_page(to, from);
+				kunmap(page);
+				kunmap(new_page);
+				put_page(page);
+				page = new_page;
+			}
+#endif
 			bprm->page[i] = NULL;
 			install_arg_page(mpnt, page, stack_base);
 		}
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/fs/inode.c 2.6.8-rc3/fs/inode.c
--- 2.6.8-rc3.orig/fs/inode.c	2004-08-10 15:18:07.000000000 -0700
+++ 2.6.8-rc3/fs/inode.c	2004-09-01 11:40:44.000000000 -0700
@@ -150,6 +150,7 @@
 		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
+ 		mpol_shared_policy_init(&mapping->policy);
 
 		/*
 		 * If the block_device provides a backing_dev_info for client
@@ -177,11 +178,12 @@
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
-	else
+	else {
+		mpol_free_shared_policy(&inode->i_mapping->policy);
 		kmem_cache_free(inode_cachep, (inode));
+	}
 }
 
-
 /*
  * These are initializations that only need to be done
  * once, because the fields are idempotent across use
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/fs.h 2.6.8-rc3/include/linux/fs.h
--- 2.6.8-rc3.orig/include/linux/fs.h	2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/fs.h	2004-09-01 21:08:37.000000000 -0700
@@ -18,6 +18,7 @@
 #include <linux/cache.h>
 #include <linux/prio_tree.h>
 #include <linux/kobject.h>
+#include <linux/mempolicy.h>
 #include <asm/atomic.h>
 
 struct iovec;
@@ -339,6 +340,7 @@
 	atomic_t		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+	struct shared_policy    policy;         /* page alloc policy */
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/mempolicy.h 2.6.8-rc3/include/linux/mempolicy.h
--- 2.6.8-rc3.orig/include/linux/mempolicy.h	2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/mempolicy.h	2004-09-01 21:54:34.000000000 -0700
@@ -152,6 +152,8 @@
 void mpol_free_shared_policy(struct shared_policy *p);
 struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 					    unsigned long idx);
+struct page *alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+				      unsigned long idx);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/include/linux/pagemap.h 2.6.8-rc3/include/linux/pagemap.h
--- 2.6.8-rc3.orig/include/linux/pagemap.h	2004-08-10 15:18:31.000000000 -0700
+++ 2.6.8-rc3/include/linux/pagemap.h	2004-09-01 11:04:24.000000000 -0700
@@ -50,14 +50,24 @@
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
-static inline struct page *page_cache_alloc(struct address_space *x)
+
+static inline struct page *__page_cache_alloc(struct address_space *x,
+					      unsigned long idx,
+					      unsigned int gfp_mask)
+{
+	return alloc_page_shared_policy(gfp_mask, &x->policy, idx);
+}
+
+static inline struct page *page_cache_alloc(struct address_space *x,
+					    unsigned long idx)
 {
-	return alloc_pages(mapping_gfp_mask(x), 0);
+	return __page_cache_alloc(x, idx, mapping_gfp_mask(x));
 }
 
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *x,
+						 unsigned long idx)
 {
-	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
+	return __page_cache_alloc(x, idx, mapping_gfp_mask(x)|__GFP_COLD);
 }
 
 typedef int filler_t(void *, struct page *);
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/filemap.c 2.6.8-rc3/mm/filemap.c
--- 2.6.8-rc3.orig/mm/filemap.c	2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/filemap.c	2004-09-01 21:52:06.000000000 -0700
@@ -534,7 +534,8 @@
 	page = find_lock_page(mapping, index);
 	if (!page) {
 		if (!cached_page) {
-			cached_page = alloc_page(gfp_mask);
+			cached_page = __page_cache_alloc(mapping, index,
+							 gfp_mask);
 			if (!cached_page)
 				return NULL;
 		}
@@ -627,7 +628,7 @@
 		return NULL;
 	}
 	gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
-	page = alloc_pages(gfp_mask, 0);
+	page = __page_cache_alloc(mapping, index, gfp_mask);
 	if (page && add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
 		page_cache_release(page);
 		page = NULL;
@@ -789,7 +790,7 @@
 		 * page..
 		 */
 		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
+			cached_page = page_cache_alloc_cold(mapping, index);
 			if (!cached_page) {
 				desc->error = -ENOMEM;
 				goto out;
@@ -1050,7 +1051,7 @@
 	struct page *page; 
 	int error;
 
-	page = page_cache_alloc_cold(mapping);
+	page = page_cache_alloc_cold(mapping, offset);
 	if (!page)
 		return -ENOMEM;
 
@@ -1070,6 +1071,7 @@
 	return error == -EEXIST ? 0 : error;
 }
 
+
 #define MMAP_LOTSAMISS  (100)
 
 /*
@@ -1090,7 +1092,7 @@
 	struct page *page;
 	unsigned long size, pgoff, endoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
-
+	
 	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 	endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
@@ -1162,6 +1164,38 @@
 			goto no_cached_page;
 	}
 
+#ifdef CONFIG_NUMA
+	if (!mpol_node_valid(page_to_nid(page), area, 0)) {
+		/*
+		 * the page in the cache is not in any of the nodes this
+		 * VMA's policy wants it to be in. Can we remove it?
+		 */
+		lock_page(page);
+		if (page_count(page) - !!PagePrivate(page) == 2) {
+			/*
+			 * This page isn't being used by any mappings,
+			 * so we can safely remove it. It must be left
+			 * over from an earlier file IO readahead when
+			 * there was no page allocation policy associated
+			 * with the file.
+			 */
+			spin_lock(&mapping->tree_lock);
+			__remove_from_page_cache(page);
+			spin_unlock(&mapping->tree_lock);
+			page_cache_release(page);  /* pagecache ref */
+			unlock_page(page);
+			page_cache_release(page);  /* us */
+			goto retry_find;
+		} else {
+			/*
+			 * darn, the page is being used by other mappings.
+			 * We'll just have to leave the page in this node.
+			 */
+			unlock_page(page);
+		}
+	}
+#endif
+	
 	if (!did_readaround)
 		ra->mmap_hit++;
 
@@ -1431,9 +1465,35 @@
 	return 0;
 }
 
+
+#ifdef CONFIG_NUMA
+int generic_file_set_policy(struct vm_area_struct *vma,
+			    struct mempolicy *new)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	return mpol_set_shared_policy(&mapping->policy, vma, new);
+}
+
+struct mempolicy *
+generic_file_get_policy(struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long idx;
+	
+	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&mapping->policy, idx);
+}
+#endif
+
+
 static struct vm_operations_struct generic_file_vm_ops = {
 	.nopage		= filemap_nopage,
 	.populate	= filemap_populate,
+#ifdef CONFIG_NUMA
+	.set_policy     = generic_file_set_policy,
+	.get_policy     = generic_file_get_policy,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
@@ -1483,7 +1543,7 @@
 	page = find_get_page(mapping, index);
 	if (!page) {
 		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
+			cached_page = page_cache_alloc_cold(mapping, index);
 			if (!cached_page)
 				return ERR_PTR(-ENOMEM);
 		}
@@ -1565,7 +1625,7 @@
 	page = find_lock_page(mapping, index);
 	if (!page) {
 		if (!*cached_page) {
-			*cached_page = page_cache_alloc(mapping);
+			*cached_page = page_cache_alloc(mapping, index);
 			if (!*cached_page)
 				return NULL;
 		}
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/mempolicy.c 2.6.8-rc3/mm/mempolicy.c
--- 2.6.8-rc3.orig/mm/mempolicy.c	2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/mempolicy.c	2004-09-01 21:49:14.000000000 -0700
@@ -638,6 +638,7 @@
 	return page;
 }
 
+
 /**
  * 	alloc_page_vma	- Allocate a page for a VMA.
  *
@@ -683,6 +684,7 @@
 	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 }
 
+
 /**
  * 	alloc_pages_current - Allocate pages.
  *
@@ -1003,6 +1005,28 @@
 	up(&p->sem);
 }
 
+struct page *
+alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+			 unsigned long idx)
+{
+	struct page *page;
+	
+	if (sp) {
+		struct vm_area_struct pvma;
+		/* Create a pseudo vma that just contains the policy */
+		memset(&pvma, 0, sizeof(struct vm_area_struct));
+		pvma.vm_end = PAGE_SIZE;
+		pvma.vm_pgoff = idx;
+		pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
+		page = alloc_page_vma(gfp, &pvma, 0);
+		mpol_free(pvma.vm_policy);
+	} else {
+		page = alloc_pages(gfp, 0);
+	}
+
+	return page;
+}
+
 /* assumes fs == KERNEL_DS */
 void __init numa_policy_init(void)
 {
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/readahead.c 2.6.8-rc3/mm/readahead.c
--- 2.6.8-rc3.orig/mm/readahead.c	2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/readahead.c	2004-09-01 20:39:14.000000000 -0700
@@ -246,7 +246,7 @@
 			continue;
 
 		spin_unlock_irq(&mapping->tree_lock);
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, page_offset);
 		spin_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
diff -Nuar -X /home/stevel/dontdiff 2.6.8-rc3.orig/mm/shmem.c 2.6.8-rc3/mm/shmem.c
--- 2.6.8-rc3.orig/mm/shmem.c	2004-08-10 15:18:35.000000000 -0700
+++ 2.6.8-rc3/mm/shmem.c	2004-09-01 11:14:48.000000000 -0700
@@ -824,16 +824,7 @@
 shmem_alloc_page(unsigned long gfp, struct shmem_inode_info *info,
 		 unsigned long idx)
 {
-	struct vm_area_struct pvma;
-	struct page *page;
-
-	memset(&pvma, 0, sizeof(struct vm_area_struct));
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
-	pvma.vm_pgoff = idx;
-	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
-	mpol_free(pvma.vm_policy);
-	return page;
+	return alloc_page_shared_policy(gfp, &info->policy, idx);
 }
 #else
 static inline struct page *

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] mm: memory policy for page cache allocation
  2004-09-20 23:48   ` Steve Longerbeam
@ 2004-09-23 15:54     ` Ray Bryant
  2004-09-23 23:01       ` Steve Longerbeam
  0 siblings, 1 reply; 16+ messages in thread
From: Ray Bryant @ 2004-09-23 15:54 UTC (permalink / raw)
  To: Steve Longerbeam; +Cc: linux-mm, lse-tech, linux-kernel

Hi Steve,

Steve Longerbeam wrote:

> -------- original email follows ----------
> 
> Hi Andi,
> 
> I'm working on adding the features to NUMA mempolicy
> necessary to support MontaVista's MTA.
> 
> Attached is the first of those features, support for
> global page allocation policy for mapped files. Here's
> what the patch is doing:
> 
> 1. add a shared_policy tree to the address_space object in fs.h.
> 2. modify page_cache_alloc() in pagemap.h to take an address_space
>    object and page offset, and use those to allocate a page for the
>    page cache using the policy in the address_space object.
> 3. modify filemap.c to pass the additional {mapping, page offset} pair
>    to page_cache_alloc().
> 4. Also in filemap.c, implement generic file {set|get}_policy() methods and
>    add those to generic_file_vm_ops.
> 5. In filemap_nopage(), verify that any existing page located in the cache
>    is located in a node that satisfies the file's policy. If it's not in 
> a node that
>    satisfies the policy, it must be because the page was allocated 
> before the
>    file had any policies. If it's unused, free it and goto retry_find 
> (will allocate
>    a new page using the file's policy). Note that a similar operation is 
> done in
>    exec.c:setup_arg_pages() for stack pages.
> 6. Init the file's shared policy in alloc_inode(), and free the shared 
> policy in
>    destroy_inode().
> 
> I'm working on the remaining features needed for MTA. They are:
> 
> - support for policies contained in ELF images, for text and data regions.
> - support for do_mmap_mempolicy() and do_brk_mempolicy(). Do_mmap()
>   can allocate pages to the region before the function exits, such as 
> when pages
>   are locked for the region. So it's necessary in that case to set the 
> VMA's policy
>   within do_mmap() before those pages are allocated.
> - system calls for mmap_mempolicy and brk_mempolicy.
> 
> Let me know your thoughts on the filemap policy patch.
> 
> Thanks,
> Steve
> 
> 

Steve,

I guess I am a little lost on this without understanding what MTA is.
Is there a design/requirements document you can point me at?

Also, can you comment on how the above is related to my page cache
allocation policy patch?   Does having a global page cache allocation
policy with a per process override satisfy your requirements at all
or do you specifically have per file policies you want to specify?

(Just trying to figure out how to work both of our requirements into
the kernel in as simple as possible (but no simpler!) fashion.)

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] mm: memory policy for page cache allocation
  2004-09-23 15:54     ` [PATCH " Ray Bryant
@ 2004-09-23 23:01       ` Steve Longerbeam
  0 siblings, 0 replies; 16+ messages in thread
From: Steve Longerbeam @ 2004-09-23 23:01 UTC (permalink / raw)
  To: Ray Bryant; +Cc: linux-mm, lse-tech, linux-kernel

Ray Bryant wrote:

> Hi Steve,
>
> Steve Longerbeam wrote:
>
>> -------- original email follows ----------
>>
>> Hi Andi,
>>
>> I'm working on adding the features to NUMA mempolicy
>> necessary to support MontaVista's MTA.
>>
>> Attached is the first of those features, support for
>> global page allocation policy for mapped files. Here's
>> what the patch is doing:
>>
>> 1. add a shared_policy tree to the address_space object in fs.h.
>> 2. modify page_cache_alloc() in pagemap.h to take an address_space
>>    object and page offset, and use those to allocate a page for the
>>    page cache using the policy in the address_space object.
>> 3. modify filemap.c to pass the additional {mapping, page offset} pair
>>    to page_cache_alloc().
>> 4. Also in filemap.c, implement generic file {set|get}_policy() 
>> methods and
>>    add those to generic_file_vm_ops.
>> 5. In filemap_nopage(), verify that any existing page located in the 
>> cache
>>    is located in a node that satisfies the file's policy. If it's not 
>> in a node that
>>    satisfies the policy, it must be because the page was allocated 
>> before the
>>    file had any policies. If it's unused, free it and goto retry_find 
>> (will allocate
>>    a new page using the file's policy). Note that a similar operation 
>> is done in
>>    exec.c:setup_arg_pages() for stack pages.
>> 6. Init the file's shared policy in alloc_inode(), and free the 
>> shared policy in
>>    destroy_inode().
>>
>> I'm working on the remaining features needed for MTA. They are:
>>
>> - support for policies contained in ELF images, for text and data 
>> regions.
>> - support for do_mmap_mempolicy() and do_brk_mempolicy(). Do_mmap()
>>   can allocate pages to the region before the function exits, such as 
>> when pages
>>   are locked for the region. So it's necessary in that case to set 
>> the VMA's policy
>>   within do_mmap() before those pages are allocated.
>> - system calls for mmap_mempolicy and brk_mempolicy.
>>
>> Let me know your thoughts on the filemap policy patch.
>>
>> Thanks,
>> Steve
>>
>>
>
> Steve,
>
> I guess I am a little lost on this without understanding what MTA is.
> Is there a design/requirements document you can point me at?

Not yet, sorry. There is an internal wiki specification at MontaVista
Software, but it's specific to the 2.4.20 design of MTA.

>
> Also, can you comment on how the above is related to my page cache
> allocation policy patch?   Does having a global page cache allocation
> policy with a per process override satisfy your requirements at all
> or do you specifically have per file policies you want to specify?

MTA stands for "Memory Type-based Allocation" (the name was chosen by a
large customer of MontaVista). The idea behind MTA is identical to NUMA
memory policy in 2.6.8, but with extra features. MTA was developed
before NUMA mempolicy (it was originally developed in 2.4.20).

The basic idea of MTA is to allow file-mapped and anonymous VMA's
to contain a preference list of NUMA nodes that a page should be 
allocated from.
So in MTA there is only one policy, which is very similar to the BIND 
policy in
2.6.8.

MTA requires per mapped file policies. The patch I posted adds a
shared_policy tree to the address_space object, so that every file
can have it's own policy for page cache allocations. A mapped file
can have a tree of policies, one for each mapped region of the file,
for instance, text and initialized data. With the patch, file mapped
policies would work across all filesystems, and the specific support
in tmpfs and hugetlbfs can be removed.

The goal of MTA is to direct an entire program's resident pages (text
and data regions of the executable and all its shared libs) to a
single node or a specific set of nodes. The primary use of MTA (by
the customer) is to allow portions of memory to be powered off for
low power modes, and still have critical system applications running.

In MTA the executable file's policies are stored in the ELF image.
There is a utility to add a section containing the list of prefered nodes
for the executable's text and data regions. That section is parsed by
load_elf_binary(). The section data is in the form of mnemonic node
name strings, which load_elf_binary() converts to a node id list.

MTA also supports policies for the slab allocator.

>
> (Just trying to figure out how to work both of our requirements into
> the kernel in as simple as possible (but no simpler!) fashion.)

could we have both a global page cache policy as well as per file
policies. That is, if a mapped file has a policy, it overrides the
global policy. That would work fine for MTA.

Steve
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2004-09-23 23:01 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-09-20 19:00 [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation Ray Bryant
2004-09-20 19:00 ` [PATCH 2.6.9-rc2-mm1 1/2] " Ray Bryant
2004-09-20 19:00 ` [PATCH 2.6.9-rc2-mm1 2/2] " Ray Bryant
2004-09-20 20:22 ` [PATCH 2.6.9-rc2-mm1 0/2] " Paul Jackson
2004-09-20 20:55 ` Andi Kleen
2004-09-20 22:13   ` Ray Bryant
2004-09-20 22:37     ` Andi Kleen
2004-09-20 23:16       ` William Lee Irwin III
2004-09-21  1:30       ` Ray Bryant
2004-09-21  9:13         ` Andi Kleen
2004-09-21  9:33           ` William Lee Irwin III
2004-09-21 13:10             ` Ray Bryant
2004-09-20 22:38   ` Steve Longerbeam
2004-09-20 23:48   ` Steve Longerbeam
2004-09-23 15:54     ` [PATCH " Ray Bryant
2004-09-23 23:01       ` Steve Longerbeam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox