* [PATCH/RFC 0/8] Mapped File Policy Overview
From: Lee Schermerhorn @ 2007-05-24 17:28 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, ak, Lee Schermerhorn, clameter

[RFC] Mapped File Policy 0/8 

Against 2.6.22-rc2-mm1

I posted an earlier version of this series around a year ago to
a less than enthusiastic response.  I've maintained the series
out of tree, testing against occasional -mm trees.  Lately, I've
had the privilege of trying to explain how the existing Linux
memory policy works--with all the exceptions and gotchas vis a vis
shared file mappings--to various groups inside and outside of HP.
This experience has convinced me that it's time to post again...

Andrew:  I can fold this long-winded overview into the first patch
if the series ever gets accepted into mm.  I know you don't like
separate overviews, but the patch 1 description is already quite
long.

Nish:  I believe this series works for hugepages, but I've only
tested SHM_HUGETLB.  

Lee Schermerhorn

---

Basic "problem":  currently [~2.6.21], files mmap()ed SHARED
do not follow mem policy applied to the mapped regions.  Instead, 
shared, file backed pages are allocated using the allocating
task's task policy.  This is inconsistent with the way that anon
and shmem pages are handled, violating, for me, the Principle
of Least Astonishment.

One reason for this is that down where file backed pages are
allocated, the faulting (mm, vma, address) tuple is not
available to compute the policy.  However, we do have the
address_space [a.k.a. mapping] and file index/offset available.
If the applicable policy could be determined from just this info,
the vma and address would not be required.
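
	A simplified sketch of that idea [mine, not a hunk from
	this series; patch 1 describes the NULL-spolicy fallback
	and patch 5 declares a get_file_policy() along these lines]:

struct mempolicy *get_file_policy(struct task_struct *task,
		struct address_space *mapping, pgoff_t idx)
{
	/*
	 * per-file shared policy, if any;  the lookup handles
	 * a NULL spolicy pointer
	 */
	struct mempolicy *pol =
		mpol_shared_policy_lookup(mapping->spolicy, idx);

	if (!pol)
		pol = task->mempolicy;	/* task policy, if any */
	if (!pol)
		pol = &default_policy;	/* else system default */
	return pol;
}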

Note that hugepage shmem segments do not follow the vma policy even
tho' the hugetlbfs inode_info contains the shared policy struct.
This situation arises because the hugetlbfs vm_ops do not contain the
shmem_{get|set}_policy ops.  One can't just add these.  If you do,
a read/cat of /proc/<pid>/numa_maps will hang.  I haven't investigated
the reason for the hang.  However, this series does not suffer that
problem.

This series of patches implements NUMA memory policy for shared,
mmap()ed files.   Because files mmap()ed SHARED are shared between
tasks just like shared memory regions, I've used the shared_policy
infrastructure from shmem.  This infrastructure applies policies
directly to ranges of a file using an rb_tree.  The tree is indexed
by the page offset, which we have in page cache allocation contexts.

Note that the method used is similar to one proposed by Steve Longerbeam
quite a few years ago, except that I dynamically allocate the shared
policy struct when needed, rather than embedding it directly in the
inode/address_space.

This series results in the following internal and external semantics
[a userspace usage sketch follows the list]:

1) The vma get|set_policy ops handle memory policies on sub-vma
   address ranges for shared, linear mappings [shmem, files]
   without splitting the vmas at the policy boundaries. Private
   and non-linear mappings still split the vma to apply policy.
   However, vma policy is still not visible to the nopage fault path.  

2) As with shmem segments, the shared policies applied to shared
   file mappings persist as long as the inode remains--i.e., until
   the file is deleted or the inode recycled--whether or not any
   task has the file mapped or even open.  We could, I suppose,
   free the shared policy on last close.

3) Vma policies of private mappings of files apply only when the
   task gets a private copy of the page--i.e., when do_wp_page()
   breaks the COW sharing and allocates a private page.  Private,
   read-only mappings of a file use the shared policy which 
   defaults, as before, to process policy, which itself defaults
   to, well... default policy.  This is how mapped files have
   always behaved.

	Could be addressed by passing vma,addr down to where
	page cache pages are allocated and using different policies
	for shared, linear vs private or nonlinear mappings.
	Worth the effort?

	Or, we could force a COW in the fault path for a read
	fault if the page cache page location does not match
	the private mapping's vma policy.
	Worth the overhead?

4) mbind(... 'MOVE*, ...) will not migrate page cache pages in
   a private mapping if the file has a shared policy.  Rather,
   only anon pages that the mapping task has "COWed" will be
   migrated.  If the mapped file does NOT have a shared policy
   or the file is mapped shared, then the pages will be migrated,
   subject to mapcount, preserving the existing semantics.
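
	To illustrate the external semantics above, a minimal
	userspace sketch [hypothetical file name; error handling
	omitted; link with -lnuma] that binds the first half of a
	shared file mapping to node 1.  With this series, later
	page cache allocations for that range honor the policy:

#include <fcntl.h>
#include <numaif.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 16UL << 20;		/* 16MB region */
	int fd = open("/mnt/data/bigfile", O_RDWR);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	unsigned long nodemask = 1UL << 1;	/* node 1 only */

	/*
	 * Policy attaches to the file's shared policy tree,
	 * indexed by page offset--no vma split--and persists
	 * with the inode after unmap/close [semantics 1 and 2].
	 */
	mbind(p, len / 2, MPOL_BIND, &nodemask,
		sizeof(nodemask) * 8, 0);

	p[0] = 1;	/* fault: page allocated on node 1 */

	munmap(p, len);
	close(fd);
	return 0;
}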

Impact On Kernel Build times

Parallel [-j4] kernel build on 1.8GHz 2 socket/single core/4GB
Opteron blade.  Average and standard deviation of 10 runs, after
an initial warmup run:

		Real	User	System
w/o patch	142.19	247.98	30.0	avg
		  0.73	  0.27	 0.14	std dev'n

w/  patch	142.28	247.57	30.74	avg
		  0.64	  0.36	 0.28	std dev'n

Impact On Kernel Size [2.6.22-rc2-mm1+]:

With CONFIG_NUMA [built, but not tested, w/o 'NUMA]

                      text    data     bss       dec
x86_64 w/o patch:  6280775  855146  612200   7748121
x86_64 w/  patch:  6283071  855146  612264   7750481
x86_64 diff:          2296       0      64      2360

ia64   w/o patch:  9032530 1253999 1431020  11717549
ia64   w/  patch:  9037618 1254031 1431028  11722677
ia64 diff:            5088      32       8      5128

Impact On Inode Size

The series removes the shared policy structure from the shmem
and hugetlb 'inode_info structures, replacing it with a
pointer in the address_space structure, conditional on
CONFIG_NUMA.  This effectively increases the size of the inode
for all systems configured with NUMA support.  However, for the
2 architectures tested [ia64 and x86_64], this did not change
the number of objects/slab.  For 2.6.22-rc1-mm1:

               w/o patch                w/ patch
           inode_size  obj/slab     inode_size  obj/slab
x86_64        736          7           744          7
ia64          584         27           592         27

Similarly, many file system specific inode sizes increased by
one pointer's worth, but the number of objects/slab remained
the same--at least for those that showed up in my slabinfo.



* [PATCH/RFC 1/8] Mapped File Policy: move shared policy to inode/mapping
From: Lee Schermerhorn @ 2007-05-24 17:28 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, clameter, ak, Lee Schermerhorn

Mapped File Policy  1/8 move shared policy to inode/mapping

Against 2.6.22-rc2-mm1

This patch starts the process of generalizing the shmem shared
[mem]policy infrastructure for use with generic mmap()ed files.

1) add a struct shared_policy pointer to struct address_space.
   This effectively adds it to each inode via its embedded i_data.
   get_policy vma ops will locate this via
   vma->vm_file->f_mapping->spolicy.
   Modify [temporarily] mpol_shared_policy_init() to initialize
   via a shared policy pointer.

	A subsequent patch will make this struct dependent
	on CONFIG_NUMA so as not to burden systems that
	don't use numa.  At that point all accesses to
	spolicy will also be made dependent on 'NUMA via
	wrapper functions/macros.  I didn't do that in this
	patch because I'd just have to change the wrappers
	in the next patch where I dynamically alloc shared
	policies.

2) create a shared_policy.h header and move the shared policy
   support from mempolicy.h to shared_policy.h.

3) modify mpol_shared_policy_lookup() to return NULL if
   the spolicy pointer is NULL.  get_vma_policy() will
   substitute the process policy, if any, else the default
   policy.  Same for get_file_policy(), added by a later
   patch.

4) modify shmem, the only existing user of shared policy
   infrastructure, to work with changes above.  At this
   point, just use the shared_policy embedded in the shmem
   inode info struct.  A later patch will dynamically
   allocate the struct when needed.

   Actually, hugetlbfs inodes also contain a shared policy, but
   the vma's get|set_policy ops are not hooked up.  This patch
   modifies hugetlbfs_get_inode() to initialize the shared
   policy struct embedded in its info struct via the i_mapping's
   spolicy pointer.  A later patch will "hook up" hugetlb
   mappings to the get|set_policy ops.

5) some miscellaneous cleanup to use "sp" for shared policy
   in routines that take it as an arg.  Prior use of "info"
   seemed misleading, and use of "p" was just plain 
   inconsistent.

	Could be broken out into a separate cleanup patch if
	I were more ambitious or forced into it.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/hugetlbfs/inode.c          |    4 +-
 include/linux/fs.h            |    3 +
 include/linux/mempolicy.h     |   54 -----------------------------------
 include/linux/shared_policy.h |   64 ++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c                |   27 +++++++++--------
 mm/shmem.c                    |   38 ++++++++++++------------
 6 files changed, 104 insertions(+), 86 deletions(-)

Index: Linux/include/linux/fs.h
===================================================================
--- Linux.orig/include/linux/fs.h	2007-05-23 10:57:07.000000000 -0400
+++ Linux/include/linux/fs.h	2007-05-23 11:34:29.000000000 -0400
@@ -465,9 +465,12 @@ struct address_space {
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+
+	struct shared_policy	*spolicy;
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h	2007-05-23 10:57:07.000000000 -0400
+++ Linux/include/linux/mempolicy.h	2007-05-23 11:34:29.000000000 -0400
@@ -30,12 +30,12 @@
 
 #include <linux/mmzone.h>
 #include <linux/slab.h>
-#include <linux/rbtree.h>
 #include <linux/spinlock.h>
 #include <linux/nodemask.h>
 
 struct vm_area_struct;
 struct mm_struct;
+#include <linux/shared_policy.h>
 
 #ifdef CONFIG_NUMA
 
@@ -113,34 +113,6 @@ static inline int mpol_equal(struct memp
 
 #define mpol_set_vma_default(vma) ((vma)->vm_policy = NULL)
 
-/*
- * Tree of shared policies for a shared memory region.
- * Maintain the policies in a pseudo mm that contains vmas. The vmas
- * carry the policy. As a special twist the pseudo mm is indexed in pages, not
- * bytes, so that we can work with shared memory segments bigger than
- * unsigned long.
- */
-
-struct sp_node {
-	struct rb_node nd;
-	unsigned long start, end;
-	struct mempolicy *policy;
-};
-
-struct shared_policy {
-	struct rb_root root;
-	spinlock_t lock;
-};
-
-void mpol_shared_policy_init(struct shared_policy *info, int policy,
-				nodemask_t *nodes);
-int mpol_set_shared_policy(struct shared_policy *info,
-				struct vm_area_struct *vma,
-				struct mempolicy *new);
-void mpol_free_shared_policy(struct shared_policy *p);
-struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
-					    unsigned long idx);
-
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
 extern void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *new);
@@ -200,30 +172,6 @@ static inline struct mempolicy *mpol_cop
 	return NULL;
 }
 
-struct shared_policy {};
-
-static inline int mpol_set_shared_policy(struct shared_policy *info,
-					struct vm_area_struct *vma,
-					struct mempolicy *new)
-{
-	return -EINVAL;
-}
-
-static inline void mpol_shared_policy_init(struct shared_policy *info,
-					int policy, nodemask_t *nodes)
-{
-}
-
-static inline void mpol_free_shared_policy(struct shared_policy *p)
-{
-}
-
-static inline struct mempolicy *
-mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
-{
-	return NULL;
-}
-
 #define vma_policy(vma) NULL
 #define vma_set_policy(vma, pol) do {} while(0)
 
Index: Linux/include/linux/shared_policy.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ Linux/include/linux/shared_policy.h	2007-05-23 11:34:29.000000000 -0400
@@ -0,0 +1,64 @@
+#ifndef _LINUX_SHARED_POLICY_H
+#define _LINUX_SHARED_POLICY_H 1
+
+#include <linux/rbtree.h>
+
+/*
+ * Tree of shared policies for shared memory regions and memory
+ * mapped files.
+ * TODO:  wean the low level shared policies from the notion of vmas;
+ *        just use inode, offset, length.
+ * Maintain the policies in a pseudo mm that contains vmas. The vmas
+ * carry the policy. As a special twist the pseudo mm is indexed in pages, not
+ * bytes, so that we can work with shared memory segments bigger than
+ * unsigned long.
+ */
+
+#ifdef CONFIG_NUMA
+
+struct sp_node {
+	struct rb_node nd;
+	unsigned long start, end;
+	struct mempolicy *policy;
+};
+
+struct shared_policy {
+	struct rb_root root;
+	spinlock_t lock;
+};
+
+void mpol_shared_policy_init(struct shared_policy *, int, nodemask_t *);
+int mpol_set_shared_policy(struct shared_policy *,
+				struct vm_area_struct *,
+				struct mempolicy *);
+void mpol_free_shared_policy(struct shared_policy *);
+struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
+					    unsigned long);
+
+#else /* !NUMA */
+
+struct shared_policy {};
+
+static inline int mpol_set_shared_policy(struct shared_policy *info,
+					struct vm_area_struct *vma,
+					struct mempolicy *new)
+{
+	return -EINVAL;
+}
+static inline void mpol_shared_policy_init(struct shared_policy *info,
+					int policy, nodemask_t *nodes)
+{
+}
+
+static inline void mpol_free_shared_policy(struct shared_policy *p)
+{
+}
+
+static inline struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+	return NULL;
+}
+#endif
+
+#endif /* _LINUX_SHARED_POLICY_H */
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-05-23 10:57:09.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-05-23 11:34:29.000000000 -0400
@@ -1447,7 +1447,7 @@ mpol_shared_policy_lookup(struct shared_
 	struct mempolicy *pol = NULL;
 	struct sp_node *sn;
 
-	if (!sp->root.rb_node)
+	if (!sp || !sp->root.rb_node)
 		return NULL;
 	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
@@ -1529,11 +1529,12 @@ restart:
 	return 0;
 }
 
-void mpol_shared_policy_init(struct shared_policy *info, int policy,
+void mpol_shared_policy_init(struct shared_policy *sp, int policy,
 				nodemask_t *policy_nodes)
 {
-	info->root = RB_ROOT;
-	spin_lock_init(&info->lock);
+
+	sp->root = RB_ROOT;
+	spin_lock_init(&sp->lock);
 
 	if (policy != MPOL_DEFAULT) {
 		struct mempolicy *newpol;
@@ -1547,13 +1548,13 @@ void mpol_shared_policy_init(struct shar
 			memset(&pvma, 0, sizeof(struct vm_area_struct));
 			/* Policy covers entire file */
 			pvma.vm_end = TASK_SIZE;
-			mpol_set_shared_policy(info, &pvma, newpol);
+			mpol_set_shared_policy(sp, &pvma, newpol);
 			mpol_free(newpol);
 		}
 	}
 }
 
-int mpol_set_shared_policy(struct shared_policy *info,
+int mpol_set_shared_policy(struct shared_policy *sp,
 			struct vm_area_struct *vma, struct mempolicy *npol)
 {
 	int err;
@@ -1570,30 +1571,30 @@ int mpol_set_shared_policy(struct shared
 		if (!new)
 			return -ENOMEM;
 	}
-	err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+	err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff+sz, new);
 	if (err && new)
 		kmem_cache_free(sn_cache, new);
 	return err;
 }
 
 /* Free a backing policy store on inode delete. */
-void mpol_free_shared_policy(struct shared_policy *p)
+void mpol_free_shared_policy(struct shared_policy *sp)
 {
 	struct sp_node *n;
 	struct rb_node *next;
 
-	if (!p->root.rb_node)
+	if (!sp->root.rb_node)
 		return;
-	spin_lock(&p->lock);
-	next = rb_first(&p->root);
+	spin_lock(&sp->lock);
+	next = rb_first(&sp->root);
 	while (next) {
 		n = rb_entry(next, struct sp_node, nd);
 		next = rb_next(&n->nd);
-		rb_erase(&n->nd, &p->root);
+		rb_erase(&n->nd, &sp->root);
 		mpol_free(n->policy);
 		kmem_cache_free(sn_cache, n);
 	}
-	spin_unlock(&p->lock);
+	spin_unlock(&sp->lock);
 }
 
 /* assumes fs == KERNEL_DS */
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c	2007-05-23 10:57:09.000000000 -0400
+++ Linux/mm/shmem.c	2007-05-23 11:34:29.000000000 -0400
@@ -1007,7 +1007,7 @@ out:
 	return err;
 }
 
-static struct page *shmem_swapin_async(struct shared_policy *p,
+static struct page *shmem_swapin_async(struct shared_policy *sp,
 				       swp_entry_t entry, unsigned long idx)
 {
 	struct page *page;
@@ -1017,41 +1017,39 @@ static struct page *shmem_swapin_async(s
 	memset(&pvma, 0, sizeof(struct vm_area_struct));
 	pvma.vm_end = PAGE_SIZE;
 	pvma.vm_pgoff = idx;
-	pvma.vm_policy = mpol_shared_policy_lookup(p, idx);
+	pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
 	page = read_swap_cache_async(entry, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
 
-struct page *shmem_swapin(struct shmem_inode_info *info, swp_entry_t entry,
-			  unsigned long idx)
+struct page *shmem_swapin(struct shared_policy *sp, swp_entry_t entry,
+				unsigned long idx)
 {
-	struct shared_policy *p = &info->policy;
 	int i, num;
 	struct page *page;
 	unsigned long offset;
 
 	num = valid_swaphandles(entry, &offset);
 	for (i = 0; i < num; offset++, i++) {
-		page = shmem_swapin_async(p,
+		page = shmem_swapin_async(sp,
 				swp_entry(swp_type(entry), offset), idx);
 		if (!page)
 			break;
 		page_cache_release(page);
 	}
 	lru_add_drain();	/* Push any new pages onto the LRU now */
-	return shmem_swapin_async(p, entry, idx);
+	return shmem_swapin_async(sp, entry, idx);
 }
 
 static struct page *
-shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info,
-		 unsigned long idx)
+shmem_alloc_page(gfp_t gfp, struct shared_policy *sp, unsigned long idx)
 {
 	struct vm_area_struct pvma;
 	struct page *page;
 
 	memset(&pvma, 0, sizeof(struct vm_area_struct));
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
+	pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
 	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
@@ -1065,14 +1063,14 @@ static inline int shmem_parse_mpol(char 
 }
 
 static inline struct page *
-shmem_swapin(struct shmem_inode_info *info,swp_entry_t entry,unsigned long idx)
+shmem_swapin(void *sp, swp_entry_t entry, unsigned long idx)
 {
 	swapin_readahead(entry, 0, NULL);
 	return read_swap_cache_async(entry, NULL, 0);
 }
 
 static inline struct page *
-shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
+shmem_alloc_page(gfp_t gfp, void *sp, unsigned long idx)
 {
 	return alloc_page(gfp | __GFP_ZERO);
 }
@@ -1141,7 +1139,7 @@ repeat:
 				*type = VM_FAULT_MAJOR;
 			}
 			spin_unlock(&info->lock);
-			swappage = shmem_swapin(info, swap, idx);
+			swappage = shmem_swapin(mapping->spolicy, swap, idx);
 			if (!swappage) {
 				spin_lock(&info->lock);
 				entry = shmem_swp_alloc(info, idx, sgp);
@@ -1254,7 +1252,7 @@ repeat:
 		if (!filepage) {
 			spin_unlock(&info->lock);
 			filepage = shmem_alloc_page(mapping_gfp_mask(mapping),
-						    info,
+						    mapping->spolicy,
 						    idx);
 			if (!filepage) {
 				shmem_unacct_blocks(info->flags, 1);
@@ -1420,8 +1418,9 @@ shmem_get_inode(struct super_block *sb, 
 		case S_IFREG:
 			inode->i_op = &shmem_inode_operations;
 			inode->i_fop = &shmem_file_operations;
-			mpol_shared_policy_init(&info->policy, sbinfo->policy,
-							&sbinfo->policy_nodes);
+			inode->i_mapping->spolicy = &info->policy;
+			mpol_shared_policy_init(inode->i_mapping->spolicy,
+					 sbinfo->policy, &sbinfo->policy_nodes);
 			break;
 		case S_IFDIR:
 			inc_nlink(inode);
@@ -1435,8 +1434,9 @@ shmem_get_inode(struct super_block *sb, 
 			 * Must not load anything in the rbtree,
 			 * mpol_free_shared_policy will not be called.
 			 */
-			mpol_shared_policy_init(&info->policy, MPOL_DEFAULT,
-						NULL);
+			inode->i_mapping->spolicy = &info->policy;
+			mpol_shared_policy_init(inode->i_mapping->spolicy,
+					 MPOL_DEFAULT, NULL);
 			break;
 		}
 	} else if (sbinfo->max_inodes) {
@@ -2316,7 +2316,7 @@ static void shmem_destroy_inode(struct i
 {
 	if ((inode->i_mode & S_IFMT) == S_IFREG) {
 		/* only struct inode is valid if it's an inline symlink */
-		mpol_free_shared_policy(&SHMEM_I(inode)->policy);
+		mpol_free_shared_policy(inode->i_mapping->spolicy);
 	}
 	shmem_acl_destroy_inode(inode);
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
Index: Linux/fs/hugetlbfs/inode.c
===================================================================
--- Linux.orig/fs/hugetlbfs/inode.c	2007-05-23 10:56:40.000000000 -0400
+++ Linux/fs/hugetlbfs/inode.c	2007-05-23 11:34:29.000000000 -0400
@@ -360,7 +360,9 @@ static struct inode *hugetlbfs_get_inode
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		INIT_LIST_HEAD(&inode->i_mapping->private_list);
 		info = HUGETLBFS_I(inode);
-		mpol_shared_policy_init(&info->policy, MPOL_DEFAULT, NULL);
+		inode->i_mapping->spolicy = &info->policy;
+		mpol_shared_policy_init(inode->i_mapping->spolicy,
+					 MPOL_DEFAULT, NULL);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);


* [PATCH/RFC 2/8] Mapped File Policy: allocate shared policies as needed
From: Lee Schermerhorn @ 2007-05-24 17:28 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, Lee Schermerhorn, clameter, ak

Mapped File Policy 2/8 dynamically alloc shared policies

Against 2.6.22-rc2-mm1

Remove shared policy structs from shmem and hugetlbfs inode
info structs and dynamically allocate them as needed.  Generic
mmap()ed files will use this capability.

Make shared policy pointer in address_space dependent on
CONFIG_NUMA.  Access [get/set] via wrappers that also depend
on 'NUMA [to avoid excessive #ifdef in .c files].

Initialize shmem and hugetlbfs inode/address_space spolicy
pointer to null, unless superblock [mount] specifies a 
non-default policy.

set_policy() ops must create shared_policy struct from a new
kmem cache when a new policy is installed and no spolicy exists.
mpol_shared_policy_init() is replaced with mpol_shared_policy_new()
to accomplish this.

shmem must create/initialize a shared_policy when an inode is
allocated if the tmpfs super-block/mount point specifies a
non-default policy.

mpol_free_shared_policy() must free the spolicy, if any, when
the inode is destroyed.

	Note:  I considered reference counting the shared
	policy, but I don't think this is necessary because
	they are always 1-for-1 to a given inode and are 
	only/always deleted when the inode is destroyed.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/hugetlbfs/inode.c          |   15 +++++++-------
 fs/inode.c                    |    1 
 include/linux/fs.h            |   20 +++++++++++++++++++
 include/linux/hugetlb.h       |    1 
 include/linux/shared_policy.h |   16 ++++++++-------
 include/linux/shmem_fs.h      |    1 
 mm/mempolicy.c                |   21 ++++++++++++++++++--
 mm/shmem.c                    |   44 +++++++++++++++++++++++++-----------------
 8 files changed, 84 insertions(+), 35 deletions(-)

Index: Linux/include/linux/shared_policy.h
===================================================================
--- Linux.orig/include/linux/shared_policy.h	2007-05-23 11:34:29.000000000 -0400
+++ Linux/include/linux/shared_policy.h	2007-05-23 11:34:36.000000000 -0400
@@ -1,6 +1,7 @@
 #ifndef _LINUX_SHARED_POLICY_H
 #define _LINUX_SHARED_POLICY_H 1
 
+#include <linux/fs.h>
 #include <linux/rbtree.h>
 
 /*
@@ -27,12 +28,12 @@ struct shared_policy {
 	spinlock_t lock;
 };
 
-void mpol_shared_policy_init(struct shared_policy *, int, nodemask_t *);
-int mpol_set_shared_policy(struct shared_policy *,
+extern struct shared_policy *mpol_shared_policy_new(int, nodemask_t *);
+extern int mpol_set_shared_policy(struct shared_policy *,
 				struct vm_area_struct *,
 				struct mempolicy *);
-void mpol_free_shared_policy(struct shared_policy *);
-struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
+extern void mpol_free_shared_policy(struct shared_policy *);
+extern struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
 					    unsigned long);
 
 #else /* !NUMA */
@@ -45,12 +46,13 @@ static inline int mpol_set_shared_policy
 {
 	return -EINVAL;
 }
-static inline void mpol_shared_policy_init(struct shared_policy *info,
-					int policy, nodemask_t *nodes)
+static inline struct shared_policy *mpol_shared_policy_new(int policy,
+					nodemask_t *nodes)
 {
 }
 
-static inline void mpol_free_shared_policy(struct shared_policy *p)
+static inline void mpol_free_shared_policy(struct shared_policy *sp)
 {
 }
 
@@ -59,6 +61,7 @@ mpol_shared_policy_lookup(struct shared_
 {
 	return NULL;
 }
+
 #endif
 
 #endif /* _LINUX_SHARED_POLICY_H */
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-05-23 11:34:29.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-05-23 11:34:36.000000000 -0400
@@ -99,6 +99,7 @@
 #define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
 
 static struct kmem_cache *policy_cache;
+static struct kmem_cache *sp_cache;
 static struct kmem_cache *sn_cache;
 
 #define PDprintk(fmt...)
@@ -1529,10 +1530,17 @@ restart:
 	return 0;
 }
 
-void mpol_shared_policy_init(struct shared_policy *sp, int policy,
+/*
+ * allocate and initialize a shared policy struct
+ */
+struct shared_policy *mpol_shared_policy_new(int policy,
 				nodemask_t *policy_nodes)
 {
+	struct shared_policy *sp;
 
+	sp = kmem_cache_alloc(sp_cache, GFP_KERNEL);
+	if (!sp)
+		return NULL;
 	sp->root = RB_ROOT;
 	spin_lock_init(&sp->lock);
 
@@ -1552,6 +1560,7 @@ void mpol_shared_policy_init(struct shar
 			mpol_free(newpol);
 		}
 	}
+	return sp;
 }
 
 int mpol_set_shared_policy(struct shared_policy *sp,
@@ -1583,8 +1592,11 @@ void mpol_free_shared_policy(struct shar
 	struct sp_node *n;
 	struct rb_node *next;
 
-	if (!sp->root.rb_node)
+	if (!sp || !sp->root.rb_node)
 		return;
+
+	/* TODO:  locking should be unnecessary as we're only
+	 * called when destroying the inode */
 	spin_lock(&sp->lock);
 	next = rb_first(&sp->root);
 	while (next) {
@@ -1595,6 +1607,7 @@ void mpol_free_shared_policy(struct shar
 		kmem_cache_free(sn_cache, n);
 	}
 	spin_unlock(&sp->lock);
+	kmem_cache_free(sp_cache, sp);
 }
 
 /* assumes fs == KERNEL_DS */
@@ -1604,6 +1617,10 @@ void __init numa_policy_init(void)
 					 sizeof(struct mempolicy),
 					 0, SLAB_PANIC, NULL, NULL);
 
+	sp_cache = kmem_cache_create("shared_policy",
+				     sizeof(struct shared_policy),
+				     0, SLAB_PANIC, NULL, NULL);
+
 	sn_cache = kmem_cache_create("shared_policy_node",
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL, NULL);
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c	2007-05-23 11:34:29.000000000 -0400
+++ Linux/mm/shmem.c	2007-05-23 11:34:36.000000000 -0400
@@ -962,7 +962,8 @@ redirty:
 }
 
 #ifdef CONFIG_NUMA
-static inline int shmem_parse_mpol(char *value, int *policy, nodemask_t *policy_nodes)
+static inline int shmem_parse_mpol(char *value, int *policy,
+				 nodemask_t *policy_nodes)
 {
 	char *nodelist = strchr(value, ':');
 	int err = 1;
@@ -1139,7 +1140,8 @@ repeat:
 				*type = VM_FAULT_MAJOR;
 			}
 			spin_unlock(&info->lock);
-			swappage = shmem_swapin(mapping->spolicy, swap, idx);
+			swappage = shmem_swapin(mapping_shared_policy(mapping),
+						swap, idx);
 			if (!swappage) {
 				spin_lock(&info->lock);
 				entry = shmem_swp_alloc(info, idx, sgp);
@@ -1252,8 +1254,8 @@ repeat:
 		if (!filepage) {
 			spin_unlock(&info->lock);
 			filepage = shmem_alloc_page(mapping_gfp_mask(mapping),
-						    mapping->spolicy,
-						    idx);
+				 		mapping_shared_policy(mapping),
+						idx);
 			if (!filepage) {
 				shmem_unacct_blocks(info->flags, 1);
 				shmem_free_blocks(inode, 1);
@@ -1333,18 +1335,27 @@ static struct page *shmem_fault(struct v
 #ifdef CONFIG_NUMA
 int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
 {
-	struct inode *i = vma->vm_file->f_path.dentry->d_inode;
-	return mpol_set_shared_policy(&SHMEM_I(i)->policy, vma, new);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct shared_policy *sp = mapping_shared_policy(mapping);
+
+	if (!sp) {
+		sp = mpol_shared_policy_new(MPOL_DEFAULT, NULL);
+		set_mapping_shared_policy(mapping, sp);
+	}
+	return mpol_set_shared_policy(sp, vma, new);
 }
 
 struct mempolicy *
 shmem_get_policy(struct vm_area_struct *vma, unsigned long addr)
 {
-	struct inode *i = vma->vm_file->f_path.dentry->d_inode;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct shared_policy *sp = mapping_shared_policy(mapping);
 	unsigned long idx;
 
+	if (!sp)
+		return NULL;
 	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);
+	return mpol_shared_policy_lookup(sp, idx);
 }
 #endif
 
@@ -1418,9 +1429,10 @@ shmem_get_inode(struct super_block *sb, 
 		case S_IFREG:
 			inode->i_op = &shmem_inode_operations;
 			inode->i_fop = &shmem_file_operations;
-			inode->i_mapping->spolicy = &info->policy;
-			mpol_shared_policy_init(inode->i_mapping->spolicy,
-					 sbinfo->policy, &sbinfo->policy_nodes);
+			if (sbinfo->policy != MPOL_DEFAULT)
+				set_mapping_shared_policy(inode->i_mapping,
+					mpol_shared_policy_new(sbinfo->policy,
+						 &sbinfo->policy_nodes));
 			break;
 		case S_IFDIR:
 			inc_nlink(inode);
@@ -1431,12 +1443,9 @@ shmem_get_inode(struct super_block *sb, 
 			break;
 		case S_IFLNK:
 			/*
-			 * Must not load anything in the rbtree,
-			 * mpol_free_shared_policy will not be called.
+			 * This case only exists so that we don't attempt
+			 * to call init_special_inode() for sym links.
 			 */
-			inode->i_mapping->spolicy = &info->policy;
-			mpol_shared_policy_init(inode->i_mapping->spolicy,
-					 MPOL_DEFAULT, NULL);
 			break;
 		}
 	} else if (sbinfo->max_inodes) {
@@ -2316,7 +2325,8 @@ static void shmem_destroy_inode(struct i
 {
 	if ((inode->i_mode & S_IFMT) == S_IFREG) {
 		/* only struct inode is valid if it's an inline symlink */
-		mpol_free_shared_policy(inode->i_mapping->spolicy);
+		mpol_free_shared_policy(
+				mapping_shared_policy(inode->i_mapping));
 	}
 	shmem_acl_destroy_inode(inode);
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
Index: Linux/fs/hugetlbfs/inode.c
===================================================================
--- Linux.orig/fs/hugetlbfs/inode.c	2007-05-23 11:34:29.000000000 -0400
+++ Linux/fs/hugetlbfs/inode.c	2007-05-23 11:34:36.000000000 -0400
@@ -350,7 +350,6 @@ static struct inode *hugetlbfs_get_inode
 
 	inode = new_inode(sb);
 	if (inode) {
-		struct hugetlbfs_inode_info *info;
 		inode->i_mode = mode;
 		inode->i_uid = uid;
 		inode->i_gid = gid;
@@ -359,10 +358,9 @@ static struct inode *hugetlbfs_get_inode
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		INIT_LIST_HEAD(&inode->i_mapping->private_list);
-		info = HUGETLBFS_I(inode);
-		inode->i_mapping->spolicy = &info->policy;
-		mpol_shared_policy_init(inode->i_mapping->spolicy,
-					 MPOL_DEFAULT, NULL);
+		/*
+		 * leave i_mapping->spolicy NULL [default policy]
+		 */
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
@@ -375,7 +373,10 @@ static struct inode *hugetlbfs_get_inode
 			inode->i_op = &hugetlbfs_dir_inode_operations;
 			inode->i_fop = &simple_dir_operations;
 
-			/* directory inodes start off with i_nlink == 2 (for "." entry) */
+			/*
+			 * directory inodes start off with i_nlink == 2
+			 * (for "." entry)
+			 */
 			inc_nlink(inode);
 			break;
 		case S_IFLNK:
@@ -542,7 +543,7 @@ static struct inode *hugetlbfs_alloc_ino
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
-	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
+	mpol_free_shared_policy(mapping_shared_policy(inode->i_mapping));
 	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
 }
 
Index: Linux/include/linux/hugetlb.h
===================================================================
--- Linux.orig/include/linux/hugetlb.h	2007-05-23 10:57:07.000000000 -0400
+++ Linux/include/linux/hugetlb.h	2007-05-23 11:34:36.000000000 -0400
@@ -149,7 +149,6 @@ struct hugetlbfs_sb_info {
 
 
 struct hugetlbfs_inode_info {
-	struct shared_policy policy;
 	struct inode vfs_inode;
 };
 
Index: Linux/include/linux/shmem_fs.h
===================================================================
--- Linux.orig/include/linux/shmem_fs.h	2007-04-25 23:08:32.000000000 -0400
+++ Linux/include/linux/shmem_fs.h	2007-05-23 11:34:36.000000000 -0400
@@ -14,7 +14,6 @@ struct shmem_inode_info {
 	unsigned long		alloced;	/* data pages alloced to file */
 	unsigned long		swapped;	/* subtotal assigned to swap */
 	unsigned long		next_index;	/* highest alloced index + 1 */
-	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct page		*i_indirect;	/* top indirect blocks page */
 	swp_entry_t		i_direct[SHMEM_NR_DIRECT]; /* first blocks */
 	struct list_head	swaplist;	/* chain of maybes on swap */
Index: Linux/fs/inode.c
===================================================================
--- Linux.orig/fs/inode.c	2007-05-23 10:57:02.000000000 -0400
+++ Linux/fs/inode.c	2007-05-23 11:34:37.000000000 -0400
@@ -163,6 +163,7 @@ static struct inode *alloc_inode(struct 
 			mapping->backing_dev_info = bdi;
 		}
 		inode->i_private = NULL;
+		set_mapping_shared_policy(mapping, NULL);
 		inode->i_mapping = mapping;
 	}
 	return inode;
Index: Linux/include/linux/fs.h
===================================================================
--- Linux.orig/include/linux/fs.h	2007-05-23 11:34:29.000000000 -0400
+++ Linux/include/linux/fs.h	2007-05-23 11:34:37.000000000 -0400
@@ -470,7 +470,9 @@ struct address_space {
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
 
+#ifdef CONFIG_NUMA
 	struct shared_policy	*spolicy;
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -478,6 +480,24 @@ struct address_space {
 	 * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
 	 */
 
+#ifdef CONFIG_NUMA
+static inline struct shared_policy *
+mapping_shared_policy(struct address_space *mapping)
+{
+	return mapping->spolicy;
+}
+
+static inline void set_mapping_shared_policy(struct address_space *mapping,
+						struct shared_policy *sp)
+{
+	mapping->spolicy = sp;
+}
+
+#else
+#define mapping_shared_policy(M) (NULL)
+#define set_mapping_shared_policy(M, SP)	/* nothing */
+#endif
+
 struct block_device {
 	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
 	struct inode *		bd_inode;	/* will die */


* [PATCH/RFC 3/8] Mapped File Policy: let vma policy ops handle sub-vma policies
From: Lee Schermerhorn @ 2007-05-24 17:28 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, ak, Lee Schermerhorn, clameter

Mapped File Policy  3/8 let vma policy ops handle sub-vma policies

Against 2.6.22-rc2-mm1

Shared policies can handle subranges of an object, so there is no
need to split the vma for these mappings.  So, modify mbind_range()
and policy_vma() to call the set_policy vma op, if one exists, for
shared mappings.  Similarly, modify get_vma_policy() to call the
get_policy op, if one exists, only for shared mappings.

We don't want private mappings mucking with the shared policy of
the mapped file, so use vma policy for private mappings.  We'll
still split vmas for private mappings.  

	Could use rb_tree for vma subrange policies as well, but
	not in this series.

Also, we can't use policy ops for nonlinear mappings because we
don't have a 1-to-1 correspondence between pgoff and vma relative
address. So, continue to split vmas for non-linear mappings.

Note that this is a new semantic for set|get_policy() ops:  they only
get called for linear, shared mappings, and in that case we don't
split the vma.  Only shmem has set|get_policy() ops, so only that
subsystem is affected.  Will use the same semantic for generic file
shared policies.

Now, since the vma start and end addresses no longer specify the
range to which a new policy applies, we need to add start,end address
args to the vma policy ops.  The set_policy op just calls into
mpol_set_shared_policy() to do the real work, so pass the start,
end addresses down to that function.

Notes:

1) Not splitting vmas for sub-vma-range policies affects
show_numa_map().  "Fixed" in a subsequent patch.

2) Added helper function "vma_addr_to_pgoff()".  This is similar
to [linear_]page_index() but takes a shift argument so that it can
be used for calculating page indices for interleaving for both
base pages and huge pages--see the sketch below.  Perhaps this can
be merged with other similar functions?
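
	For example, a caller that works for both base and huge
	page mappings might compute [illustrative sketch, not from
	this series; is_vm_hugetlb_page() and HPAGE_SHIFT as in
	2.6.22-rc2-mm1]:

static pgoff_t policy_index(struct vm_area_struct *vma,
				unsigned long addr)
{
	/* huge page mappings interleave on HPAGE_SHIFT chunks */
	int shift = is_vm_hugetlb_page(vma) ?
				HPAGE_SHIFT : PAGE_SHIFT;
	return vma_addr_to_pgoff(vma, addr, shift);
}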

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mm.h            |   21 ++++++++++--
 include/linux/shared_policy.h |    1 
 ipc/shm.c                     |    5 +-
 mm/mempolicy.c                |   71 +++++++++++++++++++++++++++++++-----------
 mm/shmem.c                    |    7 ++--
 5 files changed, 79 insertions(+), 26 deletions(-)

Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h	2007-05-23 10:57:07.000000000 -0400
+++ Linux/include/linux/mm.h	2007-05-23 11:34:40.000000000 -0400
@@ -235,11 +235,14 @@ struct vm_operations_struct {
 	struct page *(*nopage)(struct vm_area_struct *area,
 			unsigned long address, int *type);
 
-	/* notification that a previously read-only page is about to become
-	 * writable, if an error is returned it will cause a SIGBUS */
+	/*
+	 * notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS
+	 */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
-	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+	int (*set_policy)(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
 					unsigned long addr);
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
@@ -704,7 +707,8 @@ static inline int page_mapped(struct pag
 extern void show_free_areas(void);
 
 #ifdef CONFIG_SHMEM
-int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+int shmem_set_policy(struct vm_area_struct *, unsigned long, unsigned long,
+			 struct mempolicy *);
 struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 					unsigned long addr);
 int shmem_lock(struct file *file, int lock, struct user_struct *user);
@@ -1053,6 +1057,15 @@ extern void si_meminfo_node(struct sysin
 
 #ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
+
+/*
+ * Address to offset for shared mapping policy lookup.
+ */
+static inline pgoff_t vma_addr_to_pgoff(struct vm_area_struct *vma,
+		unsigned long addr, int shift)
+{
+	return ((addr - vma->vm_start) >> shift) + vma->vm_pgoff;
+}
 #else
 static inline void setup_per_cpu_pageset(void) {}
 #endif
Index: Linux/include/linux/shared_policy.h
===================================================================
--- Linux.orig/include/linux/shared_policy.h	2007-05-23 11:34:36.000000000 -0400
+++ Linux/include/linux/shared_policy.h	2007-05-23 11:34:40.000000000 -0400
@@ -31,6 +31,7 @@ struct shared_policy {
 extern struct shared_policy *mpol_shared_policy_new(int, nodemask_t *);
 extern int mpol_set_shared_policy(struct shared_policy *,
 				struct vm_area_struct *,
+				unsigned long, unsigned long,
 				struct mempolicy *);
 extern void mpol_free_shared_policy(struct shared_policy *);
 extern struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-05-23 11:34:36.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-05-23 11:34:40.000000000 -0400
@@ -374,20 +374,28 @@ check_range(struct mm_struct *mm, unsign
 	return first;
 }
 
-/* Apply policy to a single VMA */
-static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
+/*
+ * Apply policy to a single VMA, or a subrange thereof
+ */
+static int policy_vma(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new)
 {
 	int err = 0;
-	struct mempolicy *old = vma->vm_policy;
 
 	PDprintk("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
-		 vma->vm_start, vma->vm_end, vma->vm_pgoff,
+		 start, end, vma_addr_to_pgoff(vma, start, PAGE_SHIFT),
 		 vma->vm_ops, vma->vm_file,
 		 vma->vm_ops ? vma->vm_ops->set_policy : NULL);
 
-	if (vma->vm_ops && vma->vm_ops->set_policy)
-		err = vma->vm_ops->set_policy(vma, new);
-	if (!err) {
+	/*
+	 * set_policy op, if exists, is responsible for policy
+	 * reference counts.
+	 */
+	if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
+		vma->vm_ops && vma->vm_ops->set_policy)
+		err = vma->vm_ops->set_policy(vma, start, end, new);
+	else {
+		struct mempolicy *old = vma->vm_policy;
 		mpol_get(new);
 		vma->vm_policy = new;
 		mpol_free(old);
@@ -404,13 +412,30 @@ static int mbind_range(struct vm_area_st
 
 	err = 0;
 	for (; vma && vma->vm_start < end; vma = next) {
+		unsigned long eend = min(end, vma->vm_end);
 		next = vma->vm_next;
+		if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
+			vma->vm_ops && vma->vm_ops->set_policy) {
+			/*
+			 * set_policy op handles policies on sub-range
+			 * of vma for linear, shared mappings
+			 */
+			err = policy_vma(vma, start, eend, new);
+			if (err)
+				break;
+			continue;
+		}
+
+		/*
+		 * for private mappings and shared mappings of objects without
+		 * a set_policy vma op, split the vma and use vma policy
+		 */
 		if (vma->vm_start < start)
 			err = split_vma(vma->vm_mm, vma, start, 1);
 		if (!err && vma->vm_end > end)
-			err = split_vma(vma->vm_mm, vma, end, 0);
+			err = split_vma(vma->vm_mm, vma, eend, 0);
 		if (!err)
-			err = policy_vma(vma, new);
+			err = policy_vma(vma, start, eend, new);
 		if (err)
 			break;
 	}
@@ -538,7 +563,11 @@ long do_get_mempolicy(int *policy, nodem
 			up_read(&mm->mmap_sem);
 			return -EFAULT;
 		}
-		if (vma->vm_ops && vma->vm_ops->get_policy)
+		/*
+		 * use get_policy op, if any, for linear, shared mappings
+		 */
+		if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
+			vma->vm_ops && vma->vm_ops->get_policy)
 			pol = vma->vm_ops->get_policy(vma, addr);
 		else
 			pol = vma->vm_policy;
@@ -1080,7 +1109,11 @@ static struct mempolicy * get_vma_policy
 	struct mempolicy *pol = task->mempolicy;
 
 	if (vma) {
-		if (vma->vm_ops && vma->vm_ops->get_policy)
+		/*
+		 * use get_policy op, if any, for shared mappings
+		 */
+		if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
+			vma->vm_ops && vma->vm_ops->get_policy)
 			pol = vma->vm_ops->get_policy(vma, addr);
 		else if (vma->vm_policy &&
 				vma->vm_policy->policy != MPOL_DEFAULT)
@@ -1556,7 +1589,8 @@ struct shared_policy *mpol_shared_policy
 			memset(&pvma, 0, sizeof(struct vm_area_struct));
 			/* Policy covers entire file */
 			pvma.vm_end = TASK_SIZE;
-			mpol_set_shared_policy(sp, &pvma, newpol);
+			mpol_set_shared_policy(sp, &pvma, 0UL, pvma.vm_end,
+						 newpol);
 			mpol_free(newpol);
 		}
 	}
@@ -1564,23 +1598,26 @@ struct shared_policy *mpol_shared_policy
 }
 
 int mpol_set_shared_policy(struct shared_policy *sp,
-			struct vm_area_struct *vma, struct mempolicy *npol)
+			struct vm_area_struct *vma,
+			unsigned long start, unsigned long end,
+			struct mempolicy *npol)
 {
 	int err;
 	struct sp_node *new = NULL;
-	unsigned long sz = vma_pages(vma);
+	unsigned long sz = (end - start) >> PAGE_SHIFT;
+	pgoff_t pgoff = vma_addr_to_pgoff(vma, start, PAGE_SHIFT);
 
 	PDprintk("set_shared_policy %lx sz %lu %d %lx\n",
-		 vma->vm_pgoff,
+		 pgoff,
 		 sz, npol? npol->policy : -1,
 		npol ? nodes_addr(npol->v.nodes)[0] : -1);
 
 	if (npol) {
-		new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+		new = sp_alloc(pgoff, pgoff + sz, npol);
 		if (!new)
 			return -ENOMEM;
 	}
-	err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+	err = shared_policy_replace(sp, pgoff, pgoff+sz, new);
 	if (err && new)
 		kmem_cache_free(sn_cache, new);
 	return err;
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c	2007-05-23 11:34:36.000000000 -0400
+++ Linux/mm/shmem.c	2007-05-23 11:34:40.000000000 -0400
@@ -1333,7 +1333,8 @@ static struct page *shmem_fault(struct v
 }
 
 #ifdef CONFIG_NUMA
-int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+int shmem_set_policy(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct shared_policy *sp = mapping_shared_policy(mapping);
@@ -1342,7 +1343,7 @@ int shmem_set_policy(struct vm_area_stru
 		sp = mpol_shared_policy_new(MPOL_DEFAULT, NULL);
 		set_mapping_shared_policy(mapping, sp);
 	}
-	return mpol_set_shared_policy(sp, vma, new);
+	return mpol_set_shared_policy(sp, vma, start, end, new);
 }
 
 struct mempolicy *
@@ -1354,7 +1355,7 @@ shmem_get_policy(struct vm_area_struct *
 
 	if (!sp)
 		return NULL;
-	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	idx = vma_addr_to_pgoff(vma, addr, PAGE_SHIFT);
 	return mpol_shared_policy_lookup(sp, idx);
 }
 #endif
Index: Linux/ipc/shm.c
===================================================================
--- Linux.orig/ipc/shm.c	2007-05-23 10:57:08.000000000 -0400
+++ Linux/ipc/shm.c	2007-05-23 11:34:40.000000000 -0400
@@ -236,13 +236,14 @@ static struct page *shm_fault(struct vm_
 }
 
 #ifdef CONFIG_NUMA
-int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+int shm_set_policy(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new)
 {
 	struct file *file = vma->vm_file;
 	struct shm_file_data *sfd = shm_file_data(file);
 	int err = 0;
 	if (sfd->vm_ops->set_policy)
-		err = sfd->vm_ops->set_policy(vma, new);
+		err = sfd->vm_ops->set_policy(vma, start, end, new);
 	return err;
 }
 


* [PATCH/RFC 4/8] Mapped File Policy: add generic file set/get policy vm ops
From: Lee Schermerhorn @ 2007-05-24 17:28 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, clameter, ak, Lee Schermerhorn

Mapped File Policy  4/8 add generic file set/get policy vm ops

Against 2.6.22-rc2-mm1

Add set/get policy vm ops to generic_file_vm_ops in support of
mmap()ed file memory policies.

Note that these ops are identical in all but name to the shmem
policy vm ops as modified by this series.  So, let's try to
use the generic ones for shmem--but we'll keep the shmem names
around for now.

Hook up hugetlbfs mappings to the shared policy infrastructure
via the generic_file_vm_ops.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mm.h |   12 ++++++++----
 mm/filemap.c       |   28 ++++++++++++++++++++++++++++
 mm/hugetlb.c       |    4 ++++
 mm/shmem.c         |   28 ----------------------------
 4 files changed, 40 insertions(+), 32 deletions(-)

Index: Linux/mm/filemap.c
===================================================================
--- Linux.orig/mm/filemap.c	2007-05-23 10:57:09.000000000 -0400
+++ Linux/mm/filemap.c	2007-05-23 11:34:43.000000000 -0400
@@ -30,6 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
+#include <linux/mempolicy.h>
 #include "filemap.h"
 #include "internal.h"
 
@@ -478,6 +479,29 @@ struct page *__page_cache_alloc(gfp_t gf
 	return alloc_pages(gfp, 0);
 }
 EXPORT_SYMBOL(__page_cache_alloc);
+
+int generic_file_set_policy(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new)
+{
+	struct shared_policy *sp = vma->vm_file->f_mapping->spolicy;
+
+	if (!sp) {
+		sp = mpol_shared_policy_new(MPOL_DEFAULT, NULL);
+		vma->vm_file->f_mapping->spolicy = sp;
+	}
+	return mpol_set_shared_policy(sp, vma, start, end, new);
+}
+
+struct mempolicy *
+generic_file_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct shared_policy *sp = vma->vm_file->f_mapping->spolicy;
+	if (!sp)
+		return NULL;
+
+	return mpol_shared_policy_lookup(sp,
+				 vma_addr_to_pgoff(vma, addr, PAGE_SHIFT));
+}
 #endif
 
 static int __sleep_on_page_lock(void *word)
@@ -1529,6 +1553,10 @@ EXPORT_SYMBOL(filemap_fault);
 
 struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+#ifdef CONFIG_NUMA
+	.set_policy     = generic_file_set_policy,
+	.get_policy     = generic_file_get_policy,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h	2007-05-23 11:34:40.000000000 -0400
+++ Linux/include/linux/mm.h	2007-05-23 11:34:43.000000000 -0400
@@ -707,10 +707,9 @@ static inline int page_mapped(struct pag
 extern void show_free_areas(void);
 
 #ifdef CONFIG_SHMEM
-int shmem_set_policy(struct vm_area_struct *, unsigned long, unsigned long,
-			 struct mempolicy *);
-struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
-					unsigned long addr);
+#define shmem_set_policy generic_file_set_policy
+#define shmem_get_policy generic_file_get_policy
+
 int shmem_lock(struct file *file, int lock, struct user_struct *user);
 #else
 static inline int shmem_lock(struct file *file, int lock,
@@ -1066,6 +1065,11 @@ static inline pgoff_t vma_addr_to_pgoff(
 {
 	return ((addr - vma->vm_start) >> shift) + vma->vm_pgoff;
 }
+
+int generic_file_set_policy(struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, struct mempolicy *new);
+struct mempolicy *generic_file_get_policy(struct vm_area_struct *vma,
+		unsigned long addr);
 #else
 static inline void setup_per_cpu_pageset(void) {}
 #endif
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c	2007-05-23 11:34:40.000000000 -0400
+++ Linux/mm/shmem.c	2007-05-23 11:34:43.000000000 -0400
@@ -1332,34 +1332,6 @@ static struct page *shmem_fault(struct v
 	return page;
 }
 
-#ifdef CONFIG_NUMA
-int shmem_set_policy(struct vm_area_struct *vma, unsigned long start,
-			unsigned long end, struct mempolicy *new)
-{
-	struct address_space *mapping = vma->vm_file->f_mapping;
-	struct shared_policy *sp = mapping_shared_policy(mapping);
-
-	if (!sp) {
-		sp = mpol_shared_policy_new(MPOL_DEFAULT, NULL);
-		set_mapping_shared_policy(mapping, sp);
-	}
-	return mpol_set_shared_policy(sp, vma, start, end, new);
-}
-
-struct mempolicy *
-shmem_get_policy(struct vm_area_struct *vma, unsigned long addr)
-{
-	struct address_space *mapping = vma->vm_file->f_mapping;
-	struct shared_policy *sp = mapping_shared_policy(mapping);
-	unsigned long idx;
-
-	if (!sp)
-		return NULL;
-	idx = vma_addr_to_pgoff(vma, addr, PAGE_SHIFT);
-	return mpol_shared_policy_lookup(sp, idx);
-}
-#endif
-
 int shmem_lock(struct file *file, int lock, struct user_struct *user)
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c	2007-05-23 11:05:09.000000000 -0400
+++ Linux/mm/hugetlb.c	2007-05-23 11:34:43.000000000 -0400
@@ -326,6 +326,10 @@ static struct page *hugetlb_vm_op_fault(
 
 struct vm_operations_struct hugetlb_vm_ops = {
 	.fault = hugetlb_vm_op_fault,
+#ifdef CONFIG_NUMA
+	.set_policy	= generic_file_set_policy,
+	.get_policy	= generic_file_get_policy,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,


* [PATCH/RFC 5/8] Mapped File Policy:  Factor alloc_page_pol routine
From: Lee Schermerhorn @ 2007-05-24 17:28 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, Lee Schermerhorn, clameter, ak

Mapped File Policy 5/8 - Factor alloc_page_pol routine

Against 2.6.22-rc2-mm1

Implement alloc_page_pol() to allocate a page given a policy and
an offset [for interleaving].  No vma nor addr needed.  This
function will be used to allocate page_cache pages given the
policy at a given page offset in a subsequent patch.
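
	In outline [a condensed sketch, not the hunk itself;
	assumes the existing static helpers alloc_page_interleave()
	and zonelist_policy(), plus offset_il_node() as reworked by
	this patch]:

struct page *alloc_page_pol(gfp_t gfp, struct mempolicy *pol,
				pgoff_t off)
{
	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
		/* spread pages over allowed nodes by page offset */
		unsigned nid = offset_il_node(pol, off);
		return alloc_page_interleave(gfp, 0, nid);
	}
	/* bind/preferred/default policies reduce to a zonelist */
	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
}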

Revise alloc_page_vma() to just call alloc_page_pol() after looking
up the policy, to eliminate duplicate code.  This change rippled
into the interleaving functions.  I was able to eliminate
interleave_nid() by computing the offset at the call sites and
calling [modified] offset_il_node() directly.

	Removed the vma arg from offset_il_node(), as it wasn't
	used and is not available when called from
	alloc_page_pol().

Note re: alloc_page_vma():  it can be called with vma == NULL via
read_swap_cache_async() from swapin_readahead().  We can't compute
a page offset in that case, so we pass offset 0; I don't believe
that's an issue, but flagging it for review.
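
For illustration [not part of the patch]:  with the modified
offset_il_node(), an MPOL_INTERLEAVE policy maps a given file offset
to the same node no matter which task faults the page in.  A minimal
sketch, with made-up policy and node numbers:

	/*
	 * Suppose pol interleaves over nodes {0,2,3}.  For page
	 * offset 7:  target = 7 % 3 = 1, i.e. the second node in
	 * the mask, so the page is allocated on node 2.
	 */
	nid = offset_il_node(pol, 7);		/* returns 2 */
	page = alloc_page_interleave(GFP_HIGHUSER, 0, nid);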

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/gfp.h       |    3 +
 include/linux/hugetlb.h   |    9 ++++
 include/linux/mempolicy.h |    2 +
 mm/mempolicy.c            |   89 ++++++++++++++++++++++++++--------------------
 4 files changed, 66 insertions(+), 37 deletions(-)

Index: Linux/include/linux/gfp.h
===================================================================
--- Linux.orig/include/linux/gfp.h	2007-05-23 10:57:07.000000000 -0400
+++ Linux/include/linux/gfp.h	2007-05-23 11:34:46.000000000 -0400
@@ -180,10 +180,13 @@ alloc_pages(gfp_t gfp_mask, unsigned int
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+struct mempolicy;
+extern struct page *alloc_page_pol(gfp_t, struct mempolicy *, pgoff_t);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_pol(gfp_mask, pol, off)  alloc_pages(gfp_mask, 0)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
Index: Linux/include/linux/hugetlb.h
===================================================================
--- Linux.orig/include/linux/hugetlb.h	2007-05-23 11:34:36.000000000 -0400
+++ Linux/include/linux/hugetlb.h	2007-05-23 11:34:46.000000000 -0400
@@ -14,6 +14,14 @@ static inline int is_vm_hugetlb_page(str
 	return vma->vm_flags & VM_HUGETLB;
 }
 
+static inline int vma_page_shift(struct vm_area_struct *vma)
+{
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		return HPAGE_SHIFT;
+	else
+		return PAGE_SHIFT;
+}
+
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
@@ -127,6 +135,7 @@ static inline unsigned long hugetlb_tota
 #define HPAGE_MASK	PAGE_MASK		/* Keep the compiler happy */
 #define HPAGE_SIZE	PAGE_SIZE
 #endif
+#define vma_page_shift(VMA)		PAGE_SHIFT
 
 #endif /* !CONFIG_HUGETLB_PAGE */
 
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h	2007-05-23 11:34:29.000000000 -0400
+++ Linux/include/linux/mempolicy.h	2007-05-23 11:34:46.000000000 -0400
@@ -130,6 +130,8 @@ extern void mpol_fix_fork_child_flag(str
 #endif
 
 extern struct mempolicy default_policy;
+extern struct mempolicy *get_file_policy(struct task_struct *,
+		struct address_space *, pgoff_t);
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 		unsigned long addr, gfp_t gfp_flags);
 extern unsigned slab_node(struct mempolicy *policy);
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-05-23 11:34:40.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-05-23 11:34:46.000000000 -0400
@@ -21,6 +21,7 @@
  *
  * bind           Only allocate memory on a specific set of nodes,
  *                no fallback.
+//TODO:  following still applicable?
  *                FIXME: memory is allocated starting with the first node
  *                to the last. It would be better if bind would truly restrict
  *                the allocation to memory nodes instead
@@ -35,6 +36,7 @@
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
  *
+//TODO:  following needs paragraph rewording.  haven't figured out what to say.
  * The process policy is applied for most non interrupt memory allocations
  * in that process' context. Interrupts ignore the policies and always
  * try to allocate on the local CPU. The VMA policy is only applied for memory
@@ -50,15 +52,18 @@
  * Same with GFP_DMA allocations.
  *
  * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
- * all users and remembered even when nobody has memory mapped.
+ * all users and remembered even when nobody has memory mapped. Shared
+ * policies handle sub-ranges of the object using a red/black tree.
+ *
+ * For mmap()ed files, the policy is shared between all MAP_SHARED mappers
+ * and is remembered as long as the inode exists.  Private mappings
+ * still use vma policy for COWed pages, but use the shared policy
+ * [default, if none] for initial and read-only faults.
  */
 
 /* Notebook:
-   fix mmap readahead to honour policy and enable policy for any page cache
-   object
    statistics for bigpages
-   global policy for page cache? currently it uses process policy. Requires
-   first item above.
+   global policy for page cache?
    handle mremap for shared memory (currently ignored for the policy)
    grows down?
    make bind policy root only? It can trigger oom much faster and the
@@ -1124,6 +1129,22 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/*
+ * Return effective policy for file [address_space] at pgoff
+ */
+struct mempolicy *get_file_policy(struct task_struct *task,
+		struct address_space *x, pgoff_t pgoff)
+{
+	struct shared_policy *sp = x->spolicy;
+	struct mempolicy *pol = task->mempolicy;
+
+	if (sp)
+		pol = mpol_shared_policy_lookup(sp, pgoff);
+	if (!pol)
+		pol = &default_policy;
+	return pol;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1196,9 +1217,8 @@ unsigned slab_node(struct mempolicy *pol
 	}
 }
 
-/* Do static interleaving for a VMA with known offset. */
-static unsigned offset_il_node(struct mempolicy *pol,
-		struct vm_area_struct *vma, unsigned long off)
+/* Do static interleaving for a policy with known offset. */
+static unsigned offset_il_node(struct mempolicy *pol, pgoff_t off)
 {
 	unsigned nnodes = nodes_weight(pol->v.nodes);
 	unsigned target = (unsigned)off % nnodes;
@@ -1213,28 +1233,6 @@ static unsigned offset_il_node(struct me
 	return nid;
 }
 
-/* Determine a node number for interleave */
-static inline unsigned interleave_nid(struct mempolicy *pol,
-		 struct vm_area_struct *vma, unsigned long addr, int shift)
-{
-	if (vma) {
-		unsigned long off;
-
-		/*
-		 * for small pages, there is no difference between
-		 * shift and PAGE_SHIFT, so the bit-shift is safe.
-		 * for huge pages, since vm_pgoff is in units of small
-		 * pages, we need to shift off the always 0 bits to get
-		 * a useful offset.
-		 */
-		BUG_ON(shift < PAGE_SHIFT);
-		off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
-		off += (addr - vma->vm_start) >> shift;
-		return offset_il_node(pol, vma, off);
-	} else
-		return interleave_nodes(pol);
-}
-
 #ifdef CONFIG_HUGETLBFS
 /* Return a zonelist suitable for a huge page allocation. */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
@@ -1245,7 +1243,8 @@ struct zonelist *huge_zonelist(struct vm
 	if (pol->policy == MPOL_INTERLEAVE) {
 		unsigned nid;
 
-		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+		nid = offset_il_node(pol,
+				vma_addr_to_pgoff(vma, addr, HPAGE_SHIFT));
 		return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
 	}
 	return zonelist_policy(GFP_HIGHUSER, pol);
@@ -1267,6 +1266,23 @@ static struct page *alloc_page_interleav
 	return page;
 }
 
+/*
+ * alloc_page_pol() -- allocate a page based on (policy, offset).
+ * Used for mmap()ed file policy allocations where policy is based
+ * on file offset rather than a vma,addr pair
+ */
+struct page *alloc_page_pol(gfp_t gfp, struct mempolicy *pol, pgoff_t pgoff)
+{
+	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
+		unsigned nid;
+
+		nid = offset_il_node(pol, pgoff);
+		return alloc_page_interleave(gfp, 0, nid);
+	}
+	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+}
+EXPORT_SYMBOL(alloc_page_pol);
+
 /**
  * 	alloc_page_vma	- Allocate a page for a VMA.
  *
@@ -1293,16 +1309,15 @@ struct page *
 alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	pgoff_t pgoff = 0;
 
 	cpuset_update_task_memory_state();
 
-	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
-		unsigned nid;
-
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
-		return alloc_page_interleave(gfp, 0, nid);
+	if (likely(vma)) {
+		int shift = vma_page_shift(vma);
+		pgoff = vma_addr_to_pgoff(vma, addr, shift);
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	return alloc_page_pol(gfp, pol, pgoff);
 }
 
 /**


* [PATCH/RFC 6/8] Mapped File Policy: use file policy for page cache allocations
  2007-05-24 17:28 [PATCH/RFC 0/8] Mapped File Policy Overview Lee Schermerhorn
                   ` (4 preceding siblings ...)
  2007-05-24 17:28 ` [PATCH/RFC 5/8] Mapped File Policy: Factor alloc_page_pol routine Lee Schermerhorn
@ 2007-05-24 17:29 ` Lee Schermerhorn
  2007-05-24 17:29 ` [PATCH/RFC 7/8] Mapped File Policy: fix migration of private mappings Lee Schermerhorn
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-24 17:29 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, ak, Lee Schermerhorn, clameter

Mapped File Policy 6/8 - use file policy for page cache allocations

Against 2.6.22-rc2-mm1

This patch implements a "get_file_policy()" function, analogous
to get_vma_policy(), but for a given file [inode/mapping] at
a specified offset, using the shared_policy, if any, in the
file's address_space.  If there is no shared policy, it returns the
process policy of the argument task [to match get_vma_policy() args],
or the default policy if the task has no process policy.

Revert [__]page_cache_alloc() to take a mapping argument, as I need
it to locate the shared policy.  Add pgoff_t and gfp_t modifier
arguments.  Fix up page_cache_alloc() and page_cache_alloc_cold()
in pagemap.h and all direct callers of __page_cache_alloc() accordingly.
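
That is, each caller expresses its old gfp computation as a pair of
masks:  bits to clear from the mapping's gfp and bits to add.  Two
examples taken from the hunks below:

	/* was: __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS) */
	page = __page_cache_alloc(mapping, index, __GFP_FS, 0);

	/* was: __page_cache_alloc(mapping_gfp_mask(x) | __GFP_COLD) */
	page = __page_cache_alloc(mapping, off, 0, __GFP_COLD);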

Modify __page_cache_alloc() to use get_file_policy() and
alloc_page_pol().  

page_cache_alloc*() now take an additional offset/index
argument, available at all call sites, to look up the appropriate
policy.  The patch fixes all in-kernel users of the modified
interfaces.

Re: interaction with cpusets page spread:  if the file has a
shared policy structure attached, that takes precedence over
spreading.
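
The patch also adjusts vma_addr_to_pgoff() [see the mm.h hunk at the
end of this patch] so that, when interleaving hugetlbfs pages,
vm_pgoff [kept in small-page units] is converted to huge-page units.
A worked example, assuming PAGE_SHIFT == 12 and HPAGE_SHIFT == 21
[2MB huge pages]:

	/*
	 * addr - vm_start = 6MB  ->  6MB >> 21       = 3 huge pages
	 * vm_pgoff = 1024 [4MB]  ->  1024 >> (21-12) = 2 huge pages
	 * result:  3 + 2 = 5 -- an offset in huge pages, NOT small pages
	 */
	pgoff = vma_addr_to_pgoff(vma, vma->vm_start + (6UL << 20),
					HPAGE_SHIFT);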

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/ntfs/file.c          |    2 +-
 fs/splice.c             |    4 ++--
 include/linux/mm.h      |    6 +++++-
 include/linux/pagemap.h |   19 ++++++++++++-------
 mm/filemap.c            |   47 ++++++++++++++++++++++++++++++++++++++---------
 mm/readahead.c          |    2 +-
 6 files changed, 59 insertions(+), 21 deletions(-)

Index: Linux/mm/filemap.c
===================================================================
--- Linux.orig/mm/filemap.c	2007-05-23 11:34:43.000000000 -0400
+++ Linux/mm/filemap.c	2007-05-23 12:19:46.000000000 -0400
@@ -470,13 +470,41 @@ int add_to_page_cache_lru(struct page *p
 }
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+/**
+ * __page_cache_alloc - allocate a page cache page
+ * @mapping: address_space for which the page will be allocated
+ * @pgoff:   page index in the mapping -- for mem policy
+ * @gfp_sub: gfp flags to be removed from the mapping's gfp mask
+ * @gfp_add: gfp flags to be added to the mapping's gfp mask
+ *
+ * If the mapping does not contain a shared policy, and page cache spreading
+ * is enabled for the current context's cpuset, allocate a page from the node
+ * indicated by page cache spreading.
+ *
+ * Otherwise, fetch the memory policy at the indicated pgoff and allocate
+ * a page according to that policy.  Note that if the mapping does not
+ * have a shared policy, the allocation will use the task policy, if any,
+ * else the system default policy.
+ *
+ * All allocations will use the mapping's gfp mask, as modified by the
+ * gfp_sub and gfp_add arguments.
+ */
+struct page *__page_cache_alloc(struct address_space *mapping, pgoff_t pgoff,
+					gfp_t gfp_sub, gfp_t gfp_add)
 {
-	if (cpuset_do_page_mem_spread()) {
+	struct mempolicy *pol;
+	gfp_t gfp = (mapping_gfp_mask(mapping) & ~gfp_sub) | gfp_add;
+
+	/*
+	 * Consider spreading only if no shared_policy
+	 */
+	if (!mapping->spolicy && cpuset_do_page_mem_spread()) {
 		int n = cpuset_mem_spread_node();
 		return alloc_pages_node(n, gfp, 0);
 	}
-	return alloc_pages(gfp, 0);
+
+	pol = get_file_policy(current, mapping, pgoff);
+	return alloc_page_pol(gfp, pol, pgoff);
 }
 EXPORT_SYMBOL(__page_cache_alloc);
 
@@ -697,7 +725,8 @@ repeat:
 	if (!page) {
 		if (!cached_page) {
 			cached_page =
-				__page_cache_alloc(gfp_mask);
+				__page_cache_alloc(mapping, index,
+					 ~0, gfp_mask);
 			if (!cached_page)
 				return NULL;
 		}
@@ -833,7 +862,7 @@ grab_cache_page_nowait(struct address_sp
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+	page = __page_cache_alloc(mapping, index, __GFP_FS, 0);
 	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
 		page_cache_release(page);
 		page = NULL;
@@ -1084,7 +1113,7 @@ no_cached_page:
 		 * page..
 		 */
 		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
+			cached_page = page_cache_alloc_cold(mapping, index);
 			if (!cached_page) {
 				desc->error = -ENOMEM;
 				goto out;
@@ -1354,7 +1383,7 @@ static int fastcall page_cache_read(stru
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, offset);
 		if (!page)
 			return -ENOMEM;
 
@@ -1607,7 +1636,7 @@ repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
 		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
+			cached_page = page_cache_alloc_cold(mapping, index);
 			if (!cached_page)
 				return ERR_PTR(-ENOMEM);
 		}
@@ -1721,7 +1750,7 @@ repeat:
 	page = find_lock_page(mapping, index);
 	if (!page) {
 		if (!*cached_page) {
-			*cached_page = page_cache_alloc(mapping);
+			*cached_page = page_cache_alloc(mapping, index);
 			if (!*cached_page)
 				return NULL;
 		}
Index: Linux/include/linux/pagemap.h
===================================================================
--- Linux.orig/include/linux/pagemap.h	2007-05-23 10:56:54.000000000 -0400
+++ Linux/include/linux/pagemap.h	2007-05-23 12:02:53.000000000 -0400
@@ -63,22 +63,27 @@ static inline void mapping_set_gfp_mask(
 void release_pages(struct page **pages, int nr, int cold);
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(struct address_space *, pgoff_t,
+							gfp_t,  gfp_t);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(struct address_space *mapping,
+						pgoff_t off,
+						gfp_t gfp_sub, gfp_t gfp_add)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages((mapping_gfp_mask(mapping) & ~gfp_sub) | gfp_add, 0);
 }
 #endif
 
-static inline struct page *page_cache_alloc(struct address_space *x)
+static inline struct page *page_cache_alloc(struct address_space *mapping,
+						pgoff_t off)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping, off, 0, 0);
 }
 
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *mapping,
+						pgoff_t off)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+	return __page_cache_alloc(mapping, off, 0, __GFP_COLD);
 }
 
 typedef int filler_t(void *, struct page *);
Index: Linux/fs/splice.c
===================================================================
--- Linux.orig/fs/splice.c	2007-05-23 10:57:03.000000000 -0400
+++ Linux/fs/splice.c	2007-05-23 11:34:48.000000000 -0400
@@ -317,7 +317,7 @@ __generic_file_splice_read(struct file *
 			/*
 			 * page didn't exist, allocate one.
 			 */
-			page = page_cache_alloc_cold(mapping);
+			page = page_cache_alloc_cold(mapping, index);
 			if (!page)
 				break;
 
@@ -575,7 +575,7 @@ find_page:
 	page = find_lock_page(mapping, index);
 	if (!page) {
 		ret = -ENOMEM;
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, index);
 		if (unlikely(!page))
 			goto out_ret;
 
Index: Linux/mm/readahead.c
===================================================================
--- Linux.orig/mm/readahead.c	2007-05-23 10:57:09.000000000 -0400
+++ Linux/mm/readahead.c	2007-05-23 11:34:48.000000000 -0400
@@ -168,7 +168,7 @@ __do_page_cache_readahead(struct address
 			continue;
 
 		read_unlock_irq(&mapping->tree_lock);
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, page_offset);
 		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
Index: Linux/fs/ntfs/file.c
===================================================================
--- Linux.orig/fs/ntfs/file.c	2007-05-23 10:57:02.000000000 -0400
+++ Linux/fs/ntfs/file.c	2007-05-23 11:34:48.000000000 -0400
@@ -424,7 +424,7 @@ static inline int __ntfs_grab_cache_page
 		pages[nr] = find_lock_page(mapping, index);
 		if (!pages[nr]) {
 			if (!*cached_page) {
-				*cached_page = page_cache_alloc(mapping);
+				*cached_page = page_cache_alloc(mapping, index);
 				if (unlikely(!*cached_page)) {
 					err = -ENOMEM;
 					goto err_out;
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h	2007-05-23 11:34:43.000000000 -0400
+++ Linux/include/linux/mm.h	2007-05-23 12:15:52.000000000 -0400
@@ -1059,11 +1059,15 @@ extern void setup_per_cpu_pageset(void);
 
 /*
  * Address to offset for shared mapping policy lookup.
+ * When used for interleaving hugetlbfs pages [when shift
+ * == HPAGE_SHIFT], actually returns hugepage offset in
+ * mapping; NOT file page offset.
  */
 static inline pgoff_t vma_addr_to_pgoff(struct vm_area_struct *vma,
 		unsigned long addr, int shift)
 {
-	return ((addr - vma->vm_start) >> shift) + vma->vm_pgoff;
+	return ((addr - vma->vm_start) >> shift) +
+		(vma->vm_pgoff >> (shift - PAGE_SHIFT));
 }
 
 int generic_file_set_policy(struct vm_area_struct *vma,


* [PATCH/RFC 7/8] Mapped File Policy: fix migration of private mappings
  2007-05-24 17:28 [PATCH/RFC 0/8] Mapped File Policy Overview Lee Schermerhorn
                   ` (5 preceding siblings ...)
  2007-05-24 17:29 ` [PATCH/RFC 6/8] Mapped File Policy: use file policy for page cache allocations Lee Schermerhorn
@ 2007-05-24 17:29 ` Lee Schermerhorn
  2007-05-24 17:29 ` [PATCH/RFC 8/8] Mapped File Policy: fix show_numa_maps() Lee Schermerhorn
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-24 17:29 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, clameter, ak, Lee Schermerhorn

Mapped File Policy 7/8 - fix migration of private mappings

Against 2.6.22-rc2-mm1

Don't allow migration of file backed pages mapped with
MAP_PRIVATE if the file has a shared policy.  Rather, only
migrate any private, anon copies that the task has "COWed".

Define a new internal flag that we set in check_range() for
private mappings of files with shared policy.  Then, 
migrate_page_add() will skip non-anon pages when this flag
is set.
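
For example [illustration only; nodemask setup omitted], a task
holding a private mapping of a file that carries a shared policy
would see:

	/*
	 * Moves only the caller's COWed anon pages; the file's page
	 * cache pages stay where the shared policy placed them.
	 */
	mbind(addr, len, MPOL_BIND, nodemask, maxnode, MPOL_MF_MOVE);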

We may also be able to use this flag to force unmapping of
anon pages that may be shared with relatives during automigration
on internode task migration--e.g., by using:
	MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_ANON_ONLY
But that's the subject of a different patch series.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/mempolicy.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-05-23 11:34:46.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-05-23 11:34:50.000000000 -0400
@@ -102,6 +102,7 @@
 #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
 #define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
+#define MPOL_MF_MOVE_ANON_ONLY (MPOL_MF_INTERNAL << 3)
 
 static struct kmem_cache *policy_cache;
 static struct kmem_cache *sp_cache;
@@ -362,13 +363,19 @@ check_range(struct mm_struct *mm, unsign
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
 				vma_migratable(vma)))) {
 			unsigned long endvma = vma->vm_end;
+			unsigned long anononly = 0;
 
 			if (endvma > end)
 				endvma = end;
 			if (vma->vm_start > start)
 				start = vma->vm_start;
+
+			if (vma->vm_file && !(vma->vm_flags & VM_SHARED) &&
+					vma->vm_file->f_mapping->spolicy)
+				anononly = MPOL_MF_MOVE_ANON_ONLY;
+
 			err = check_pgd_range(vma, start, endvma, nodes,
-						flags, private);
+						flags|anononly, private);
 			if (err) {
 				first = ERR_PTR(err);
 				break;
@@ -621,9 +628,11 @@ static void migrate_page_add(struct page
 				unsigned long flags)
 {
 	/*
-	 * Avoid migrating a page that is shared with others.
+	 * Avoid migrating a file backed page in a private mapping or
+	 * a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)
+	if ((!(flags & MPOL_MF_MOVE_ANON_ONLY) || PageAnon(page)) &&
+		((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1))
 		isolate_lru_page(page, pagelist);
 }
 


* [PATCH/RFC 8/8] Mapped File Policy: fix show_numa_maps()
  2007-05-24 17:28 [PATCH/RFC 0/8] Mapped File Policy Overview Lee Schermerhorn
                   ` (6 preceding siblings ...)
  2007-05-24 17:29 ` [PATCH/RFC 7/8] Mapped File Policy: fix migration of private mappings Lee Schermerhorn
@ 2007-05-24 17:29 ` Lee Schermerhorn
  2007-05-24 19:24 ` [PATCH/RFC 0/8] Mapped File Policy Overview Christoph Lameter
  2007-05-24 20:41 ` Andi Kleen
  9 siblings, 0 replies; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-24 17:29 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, nish.aravamudan, Lee Schermerhorn, clameter, ak

Mapped File Policy 8/8 - fix show_numa_maps()

Against 2.6.22-rc2-mm1

This patch updates the procfs numa_maps display to handle multiple
shared policy ranges on a single vma.  numa_maps() still uses the
procfs task maps infrastructure, but provides wrappers around the
maps seq_file ops to handle shared policy "submaps", if any.

Also, this patch fixes a problem with numa_maps for shared mappings:
Before this [mapped file policy] patch series, numa_maps could show
you different results for shared mappings depending on which task you
examined.  A task which has installed shared policies on sub-ranges
of the shared region will show the policies on the sub-ranges, as the
vmas for that task were split when the policies were installed.  
Another task that shares the region, but didn't install any policies,
will show a single [default?] policy for the entire region as it is
mapped by a single vma in such a task.  By displaying the policies 
directly from the shared policy structure, we now see the same info
from each task that maps the segment.

The patch expands the proc_maps_private struct [#ifdef CONFIG_NUMA]
to track the existence of and progress through a submap for the
"current" vma.  For vmas with shared policy submaps, a new 
function--get_numa_submap()--in mm/mempolicy.c allocates and
populates an array of the policy ranges in the shared policy.
To facilitate this, the shared policy struct tracks the number
of ranges [sp_nodes] in the tree.

The nm_* numa_map seq_file wrappers pass the range to be displayed
to show_numa_map() via the saddr and eaddr members added to the
proc_maps_private struct.  The patch modifies show_numa_map() to
use these members, where appropriate, instead of vm_start, vm_end.

As before, once the internal page size buffer is full, seq_read()
suspends the display, drops the mmap_sem and exits the read.
During this time the vma list can change.  However, even within a
single seq_read(), the shared_policy "submap" can be changed by
other mappers.  We could only prevent this by holding the shared
policy spin_lock or otherwise holding off other mappers.  It doesn't
seem worth the effort, as the numa_map is only a snapshot in any
case.  So, this patch makes a best effort [at least as good as
unpatched task map code, I think] to perform a single scan over the
address space, displaying the policies and page state/location
for policy ranges "snapped" under spin lock into the "submap"
array mentioned above.
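
With the patch, a shared mapping carrying two policy ranges shows one
numa_maps line per range.  Hypothetical output [addresses, file name
and page counts are made up]:

	2aaaaaaab000 interleave=0-3 file=/mnt/data/shared.dat mapped=1024 N0=256 N1=256 N2=256 N3=256
	2aaaab0ab000 bind=2 file=/mnt/data/shared.dat mapped=512 N2=512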


Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/proc/task_mmu.c            |  191 ++++++++++++++++++++++++++++++++++++++++--
 include/linux/mempolicy.h     |    5 +
 include/linux/mm.h            |    6 +
 include/linux/proc_fs.h       |   12 ++
 include/linux/shared_policy.h |    3 
 mm/mempolicy.c                |   56 +++++++++++-
 6 files changed, 263 insertions(+), 10 deletions(-)

Index: Linux/include/linux/proc_fs.h
===================================================================
--- Linux.orig/include/linux/proc_fs.h	2007-05-23 10:57:07.000000000 -0400
+++ Linux/include/linux/proc_fs.h	2007-05-23 11:34:51.000000000 -0400
@@ -281,12 +281,24 @@ static inline struct proc_dir_entry *PDE
 	return PROC_I(inode)->pde;
 }
 
+struct mpol_range {
+	unsigned long saddr;
+	unsigned long eaddr;
+};
+
 struct proc_maps_private {
 	struct pid *pid;
 	struct task_struct *task;
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
 #endif
+
+#ifdef CONFIG_NUMA
+	struct vm_area_struct *vma;	/* preserved over seq_reads */
+	unsigned long saddr;
+	unsigned long eaddr;		/* preserved over seq_reads */
+	struct mpol_range *range, *ranges; /* preserved ... */
+#endif
 };
 
 #endif /* _LINUX_PROC_FS_H */
Index: Linux/include/linux/shared_policy.h
===================================================================
--- Linux.orig/include/linux/shared_policy.h	2007-05-23 11:34:40.000000000 -0400
+++ Linux/include/linux/shared_policy.h	2007-05-23 11:34:51.000000000 -0400
@@ -25,7 +25,8 @@ struct sp_node {
 
 struct shared_policy {
 	struct rb_root root;
-	spinlock_t lock;
+	spinlock_t     lock;
+	int            nr_sp_nodes;	/* for numa_maps */
 };
 
 extern struct shared_policy *mpol_shared_policy_new(int, nodemask_t *);
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h	2007-05-23 11:34:48.000000000 -0400
+++ Linux/include/linux/mm.h	2007-05-23 11:34:51.000000000 -0400
@@ -1070,6 +1070,12 @@ static inline pgoff_t vma_addr_to_pgoff(
 		(vma->vm_pgoff >> (shift - PAGE_SHIFT));
 }
 
+static inline unsigned long vma_pgoff_to_addr(struct vm_area_struct *vma,
+		pgoff_t pgoff)
+{
+	return ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
+}
+
 int generic_file_set_policy(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, struct mempolicy *new);
 struct mempolicy *generic_file_get_policy(struct vm_area_struct *vma,
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h	2007-05-23 11:34:46.000000000 -0400
+++ Linux/include/linux/mempolicy.h	2007-05-23 11:34:51.000000000 -0400
@@ -149,6 +149,11 @@ int do_migrate_pages(struct mm_struct *m
 
 extern void *cpuset_being_rebound;	/* Trigger mpol_copy vma rebind */
 
+struct seq_file;
+extern int show_numa_map(struct seq_file *, void *);
+struct mpol_range;
+extern struct mpol_range *get_numa_submap(struct vm_area_struct *);
+
 #else
 
 struct mempolicy {};
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-05-23 11:34:50.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-05-23 11:34:51.000000000 -0400
@@ -1494,6 +1494,7 @@ static void sp_insert(struct shared_poli
 	}
 	rb_link_node(&new->nd, parent, p);
 	rb_insert_color(&new->nd, &sp->root);
+	++sp->nr_sp_nodes;
 	PDprintk("inserting %lx-%lx: %d\n", new->start, new->end,
 		 new->policy ? new->policy->policy : 0);
 }
@@ -1523,6 +1524,7 @@ static void sp_delete(struct shared_poli
 	rb_erase(&n->nd, &sp->root);
 	mpol_free(n->policy);
 	kmem_cache_free(sn_cache, n);
+	--sp->nr_sp_nodes;
 }
 
 struct sp_node *
@@ -1600,6 +1602,7 @@ struct shared_policy *mpol_shared_policy
 		return NULL;
 	sp->root = RB_ROOT;
 	spin_lock_init(&sp->lock);
+	sp->nr_sp_nodes = 0;
 
 	if (policy != MPOL_DEFAULT) {
 		struct mempolicy *newpol;
@@ -1932,9 +1935,9 @@ int show_numa_map(struct seq_file *m, vo
 		return 0;
 
 	mpol_to_str(buffer, sizeof(buffer),
-			    get_vma_policy(priv->task, vma, vma->vm_start));
+			    get_vma_policy(priv->task, vma, priv->saddr));
 
-	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
+	seq_printf(m, "%08lx %s", priv->saddr, buffer);
 
 	if (file) {
 		seq_printf(m, " file=");
@@ -1947,10 +1950,10 @@ int show_numa_map(struct seq_file *m, vo
 	}
 
 	if (is_vm_hugetlb_page(vma)) {
-		check_huge_range(vma, vma->vm_start, vma->vm_end, md);
+		check_huge_range(vma, priv->saddr, priv->eaddr, md);
 		seq_printf(m, " huge");
 	} else {
-		check_pgd_range(vma, vma->vm_start, vma->vm_end,
+		check_pgd_range(vma, priv->saddr, priv->eaddr,
 				&node_online_map, MPOL_MF_STATS, md);
 	}
 
@@ -1990,3 +1993,48 @@ out:
 	return 0;
 }
 
+/*
+ * alloc/populate array of shared policy ranges for show_numa_map()
+ */
+struct mpol_range *get_numa_submap(struct vm_area_struct *vma)
+{
+	struct shared_policy *sp;
+	struct mpol_range *ranges, *range;
+	struct rb_node *rbn;
+	int nranges;
+
+	BUG_ON(!vma->vm_file);
+	sp = mapping_shared_policy(vma->vm_file->f_mapping);
+	if (!sp)
+		return NULL;
+
+	nranges = sp->nr_sp_nodes;
+	if (!nranges)
+		return NULL;
+
+	ranges = kzalloc((nranges + 1) * sizeof(*ranges), GFP_KERNEL);
+	if (!ranges)
+		return NULL;	/* pretend there are none */
+
+	range = ranges;
+	spin_lock(&sp->lock);
+	/*
+	 * # of ranges could have changed since we checked, but that is
+	 * unlikely, so this is close enough [as long as it's safe].
+	 */
+	rbn = rb_first(&sp->root);
+	/*
+	 * stop after nranges entries so the final, zeroed range struct
+	 * remains as a terminator if nodes were added since the check
+	 */
+	while (rbn && nranges--) {
+		struct sp_node *spn = rb_entry(rbn, struct sp_node, nd);
+		range->saddr = vma_pgoff_to_addr(vma, spn->start);
+		range->eaddr = vma_pgoff_to_addr(vma, spn->end);
+		++range;
+		rbn = rb_next(rbn);
+	}
+
+	spin_unlock(&sp->lock);
+	return ranges;
+}
Index: Linux/fs/proc/task_mmu.c
===================================================================
--- Linux.orig/fs/proc/task_mmu.c	2007-05-23 10:57:02.000000000 -0400
+++ Linux/fs/proc/task_mmu.c	2007-05-23 11:34:51.000000000 -0400
@@ -498,7 +498,188 @@ const struct file_operations proc_clear_
 #endif
 
 #ifdef CONFIG_NUMA
-extern int show_numa_map(struct seq_file *m, void *v);
+/*
+ * numa_maps uses procfs task maps file operations, with wrappers
+ * to handle mpol submaps--policy ranges within a vma
+ */
+
+/*
+ * start processing a new vma for show_numa_maps
+ */
+static void nm_vma_start(struct proc_maps_private *priv,
+			struct vm_area_struct *vma)
+{
+	if (!vma)
+		return;
+	priv->vma = vma;	/* saved across read()s */
+
+	priv->saddr = vma->vm_start;
+	if (!(vma->vm_flags & VM_SHARED) || !vma->vm_file ||
+		!vma->vm_file->f_mapping->spolicy) {
+		/*
+		 * usual case:  no submap
+		 */
+		priv->eaddr = vma->vm_end;
+		return;
+	}
+
+	priv->range = priv->ranges = get_numa_submap(vma);
+	if (!priv->range) {
+		priv->eaddr = vma->vm_end;	/* empty shared policy */
+		return;
+	}
+
+	/*
+	 * restart suspended submap where we left off
+	 */
+	while (priv->range->eaddr && priv->range->eaddr < priv->eaddr)
+		++priv->range;
+
+	if (!priv->range->eaddr)
+		priv->eaddr = vma->vm_end;
+	else if (priv->saddr < priv->range->saddr)
+		priv->eaddr = priv->range->saddr; /* show gap [default pol] */
+	else
+		priv->eaddr = priv->range->eaddr; /* show range */
+}
+
+/*
+ * done with numa_maps vma:  reset so we start a new
+ * vma on next seq_read.
+ */
+static void nm_vma_stop(struct proc_maps_private *priv)
+{
+	if (priv->ranges)
+		kfree(priv->ranges);
+	priv->ranges = priv->range = NULL;
+	priv->vma = NULL;
+}
+
+/*
+ * Advance to next vma in mm or next subrange in vma.
+ * mmap_sem held during a single seq_read(), but shared
+ * policy ranges can be modified at any time by other
+ * mappers.  We just continue to display the ranges we
+ * found when we started the vma.
+ */
+static void *nm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma = v;
+
+	if (!priv->range || priv->eaddr >= vma->vm_end) {
+		/*
+		 * usual case:  no submap or end of vma
+		 * re: '>=' -- in case we got here from nm_start()
+		 * and vma @ pos truncated to < priv->eaddr
+		 */
+		nm_vma_stop(priv);
+		vma = m_next(m, v, pos);
+		nm_vma_start(priv, vma);
+		return vma;
+	}
+
+	/*
+	 * Advance to next range in submap
+	 */
+	priv->saddr = priv->eaddr;
+	if (priv->eaddr == priv->range->saddr) {
+		/*
+		 * just processed a gap in the submap
+		 */
+		priv->eaddr = min(priv->range->eaddr, vma->vm_end);
+		return vma;	/* show the range */
+	}
+
+	++priv->range;
+	if (!priv->range->eaddr)
+		priv->eaddr = vma->vm_end;	/* past end of ranges */
+	else if (priv->saddr < priv->range->saddr)
+		priv->eaddr = priv->range->saddr; /* gap in submap */
+	else
+		priv->eaddr = min(priv->range->eaddr, vma->vm_end);
+
+	return vma;
+}
+
+/*
+ * [Re]start scan for new seq_read().
+ * N.B., much could have changed in the mm, as we dropped the mmap_sem
+ * between reads().  Need to call m_start() to find vma at pos.
+ */
+static void *nm_start(struct seq_file *m, loff_t *pos)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma;
+
+	if (!priv->range) {
+		/*
+		 * usual case:  1st after open, or finished prev vma
+		 */
+		vma = m_start(m, pos);
+		nm_vma_start(priv, vma);
+		return vma;
+	}
+
+	/*
+	 * Continue with submap of "current" vma.  However, vma could have
+	 * been unmapped, split, truncated, ... between read()s.
+	 * Reset "last_addr" to simulate seek;  find vma by 'pos'.
+	 */
+	m->version = 0;
+	--(*pos);		/* seq_read() incremented it */
+	vma = m_start(m, pos);
+	if (vma != priv->vma)
+		goto new_vma;
+	/*
+	 * Same vma pointer, but it could have different ranges now, or
+	 * could even be an entirely different [recycled] vma.
+	 */
+	if (vma->vm_start > priv->eaddr)
+		goto new_vma;	/* starts past last range displayed */
+	if (priv->eaddr < vma->vm_end) {
+		/*
+		 * vma at pos still covers eaddr--where we left off.  Submap
+		 * could have changed, but we'll keep reporting ranges we found
+		 * earlier up to vm_end.
+		 * We hope it is very unlikely that submap changed.
+		 */
+		return nm_next(m, vma, pos);
+	}
+
+	/*
+	 * Already reported past end of vma; find next vma past eaddr
+	 */
+	while (vma && vma->vm_end < priv->eaddr)
+		vma = m_next(m, vma, pos);
+
+new_vma:
+	/*
+	 * new vma at pos;  continue from ~ last eaddr
+	 */
+	nm_vma_stop(priv);
+	nm_vma_start(priv, vma);
+	return vma;
+}
+
+/*
+ * Suspend display of numa_map--e.g., buffer full?
+ */
+static void nm_stop(struct seq_file *m, void *v)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma = v;
+
+	if (!vma || priv->eaddr >= vma->vm_end) {
+		nm_vma_stop(priv);
+	}
+	/*
+	 * leave state in priv for nm_start(); but drop the
+	 * mmap_sem and unref the mm
+	 */
+	m_stop(m, v);
+}
+
 
 static int show_numa_map_checked(struct seq_file *m, void *v)
 {
@@ -512,10 +693,10 @@ static int show_numa_map_checked(struct 
 }
 
 static struct seq_operations proc_pid_numa_maps_op = {
-        .start  = m_start,
-        .next   = m_next,
-        .stop   = m_stop,
-        .show   = show_numa_map_checked
+	.start  = nm_start,
+	.next   = nm_next,
+	.stop   = nm_stop,
+	.show   = show_numa_map_checked
 };
 
 static int numa_maps_open(struct inode *inode, struct file *file)


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-24 17:28 [PATCH/RFC 0/8] Mapped File Policy Overview Lee Schermerhorn
                   ` (7 preceding siblings ...)
  2007-05-24 17:29 ` [PATCH/RFC 8/8] Mapped File Policy: fix show_numa_maps() Lee Schermerhorn
@ 2007-05-24 19:24 ` Christoph Lameter
  2007-05-24 20:46   ` Lee Schermerhorn
  2007-05-24 20:41 ` Andi Kleen
  9 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-05-24 19:24 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, akpm, nish.aravamudan, ak

On Thu, 24 May 2007, Lee Schermerhorn wrote:

> Basic "problem":  currently [~2.6.21], files mmap()ed SHARED
> do not follow mem policy applied to the mapped regions.  Instead, 
> shared, file backed pages are allocated using the allocating
> tasks' task policy.  This is inconsistent with the way that anon
> and shmem pages are handled, violating, for me, the Principle
> of Least Astonishment.

Yes, sort of.  Anonymous pages are not shared, and shmem pages are special
anyway.  With file backed pages we are talking about the page cache, right?

It would be great to have policies that can be set for the pagecache.
Preferably per cpuset or so.

> One reason for this is that down where pages are allocated for
> file backed pages, the faulting (mm, vma, address) are not 
> available to compute the policy.  However, we do have the
> address_space [a.k.a. mapping] and file index/offset available.
> If the applicable policy could be determined from just this info,
> the vma and address would not be required.

Hmmm.. Hmmmm... yes logical.

> Note that hugepage shmem segments do not follow the vma policy even
> tho' the hugetlbfs inode_info contains the shared policy struct.
> This situation arises because the hugetlbfs vm_ops do not contain the
> shmem_{get|set}_policy ops.  One can't just add these.  If you do,
> a read/cat of /proc/<pid>/numa_maps will hang.  I haven't investigated
> reason for the hang.  However, this series does not suffer that
> problem.

Could you do a separate patch that fixes the hugetlbfs problem? There are 
more exceptions in mempolicy.c that you may have to deal with.

> This series of patches implements NUMA memory policy for shared,
> mmap()ed files.   Because files mmap()ed SHARED are shared between
> tasks just like shared memory regions, I've used the shared_policy
> infrastructure from shmem.  This infrastructure applies policies
> directly to ranges of a file using an rb_tree.  The tree is indexed
> by the page offset, which we have in page cache allocation contexts.
> 
> Note that the method used is similar to one proposed by Steve Longerbeam
> quite a few years ago, except that I dynamically allocate the shared
> policy struct when needed, rather than embedding it directly in the
> inode/address_space.
> 
> This series result in the following internal and external semantics:
> 
> 1) The vma get|set_policy ops handle memory policies on sub-vma
>    address ranges for shared, linear mappings [shmem, files]
>    without splitting the vmas at the policy boundaries. Private
>    and non-linear mappings still split the vma to apply policy.
>    However, vma policy is still not visible to the nopage fault path.  

So the process may set a policy that is then ignored? Instead we 
use the policy in the inode? Setting a policy sets the policy for the 
complete inode? Can we set policies on segments of pages of an inode?

> 2) As with shmem segments, the shared policies applied to shared
>    file mappings persist as long as the inode remains--i.e., until
>    the file is deleted or the inode recycled--whether or not any
>    task has the file mapped or even open.  We could, I suppose,
>    free the shared policy on last close.

What happens if something evicts the inode (reclaim) and then we need
to reread it later?  Isn't this a bit unpredictable?

> 3) Vma policy of private mappings of files only apply when the 
>    task gets a private copy of the page--i.e., when do_wp_page()
>    breaks the COW sharing and allocates a private page.  Private,
>    read-only mappings of a file use the shared policy which 
>    defaults, as before, to process policy, which itself defaults
>    to, well... default policy.  This is how mapped files have
>    always behaved.

Let it be the way it is?
 
> 4) mbind(... 'MOVE*, ...) will not migrate page cache pages in
>    a private mapping if the file has a shared policy.  Rather,
>    only anon pages that the mapping task has "COWed" will be
>    migrated.  If the mapped file does NOT have a shared policy
>    or the file is mapped shared, then the pages will be migrated,
>    subject to mapcount, preserving the existing semantics.

Hmmm... That could get a bit confusing.

 
> Impact On Kernel Build times

I think what you would like to show is that there is no negative impact on 
concurrent reads and writes on the page cache.  The additional checking
may impact scalability.


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-24 17:28 [PATCH/RFC 0/8] Mapped File Policy Overview Lee Schermerhorn
                   ` (8 preceding siblings ...)
  2007-05-24 19:24 ` [PATCH/RFC 0/8] Mapped File Policy Overview Christoph Lameter
@ 2007-05-24 20:41 ` Andi Kleen
  2007-05-24 21:05   ` Lee Schermerhorn
  9 siblings, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2007-05-24 20:41 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, akpm, nish.aravamudan, clameter

> 
> Basic "problem":  currently [~2.6.21], files mmap()ed SHARED
> do not follow mem policy applied to the mapped regions.  Instead, 
> shared, file backed pages are allocated using the allocating
> tasks' task policy.  This is inconsistent with the way that anon
> and shmem pages are handled, violating, for me, the Principle
> of Least Astonishment.

Do you have some specific use cases? Did this actually improve
some application significantly? 

The basic issue is that the semantics seem weird:  the policy randomly
disappears after everybody closes the file, depending on whether the
system decides to flush the inode or not.  But using EAs or similar
also looked like overkill.

-Andi


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-24 19:24 ` [PATCH/RFC 0/8] Mapped File Policy Overview Christoph Lameter
@ 2007-05-24 20:46   ` Lee Schermerhorn
  0 siblings, 0 replies; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-24 20:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, akpm, nish.aravamudan, ak

On Thu, 2007-05-24 at 12:24 -0700, Christoph Lameter wrote:
> On Thu, 24 May 2007, Lee Schermerhorn wrote:
> 
> > Basic "problem":  currently [~2.6.21], files mmap()ed SHARED
> > do not follow mem policy applied to the mapped regions.  Instead, 
> > shared, file backed pages are allocated using the allocating
> > tasks' task policy.  This is inconsistent with the way that anon
> > and shmem pages are handled, violating, for me, the Principle
> > of Least Astonishment.
> 
> Yes, sort of.  Anonymous pages are not shared, and shmem pages are special
> anyway.  With file backed pages we are talking about the page cache, right?

Anonymous pages are shared between tasks and their descendants until
one task breaks COW sharing.  Shmem pages are special in that they use
the inode/address_space mechanisms for fault handling, but are backed by
swap space.  But, yes, we're talking about page cache pages.  A regular
file mapped shared looks a lot like a shmem area to the mapping task,
except that it ignores any policy installed via mbind().

> 
> It would be great to have policies that can be set for the pagecache.
> Preferably per cpuset or so.

IMO, it would be useful to have those as well.  Ray Bryant posted
patches quite a while back, but that thread died off.  They could
easily be added.  Integration with this series would be quite easy:
instead of get_file_policy() defaulting to the task and then system
policies, it could take another path:  per task page cache policy
[perhaps from its cpuset?] and then a system-wide page cache default
policy -- roughly as sketched below.
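
A minimal sketch of that fallback chain [not code from this series;
task_pagecache_policy() and pagecache_default_policy are hypothetical
stand-ins for the per-task/cpuset and system-wide policies]:

	struct mempolicy *get_file_policy(struct task_struct *task,
			struct address_space *mapping, pgoff_t pgoff)
	{
		struct shared_policy *sp = mapping->spolicy;
		struct mempolicy *pol = NULL;

		if (sp)
			pol = mpol_shared_policy_lookup(sp, pgoff);
		if (!pol)
			pol = task_pagecache_policy(task);  /* hypothetical */
		if (!pol)
			pol = pagecache_default_policy;     /* hypothetical */
		if (!pol)
			pol = &default_policy;
		return pol;
	}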

> 
> > One reason for this is that down where pages are allocated for
> > file backed pages, the faulting (mm, vma, address) are not 
> > available to compute the policy.  However, we do have the
> > address_space [a.k.a. mapping] and file index/offset available.
> > If the applicable policy could be determined from just this info,
> > the vma and address would not be required.
> 
> Hmmm.. Hmmmm... yes logical.
> 
> > Note that hugepage shmem segments do not follow the vma policy even
> > tho' the hugetlbfs inode_info contains the shared policy struct.
> > This situation arises because the hugetlbfs vm_ops do not contain the
> > shmem_{get|set}_policy ops.  One can't just add these.  If you do,
> > a read/cat of /proc/<pid>/numa_maps will hang.  I haven't investigated
> > reason for the hang.  However, this series does not suffer that
> > problem.
> 
> Could you do a separate patch that fixes the hugetlbfs problem? There are 
> more exceptions in mempolicy.c that you may have to deal with.

I can take a look at the numa_maps hang.  I did try a patch that just
hooked up the hugetlbfs vm_ops, but testing showed the hang.  I'll try
to make time to investigate it.  If I can fix that, hooking up the
vm_ops is trivial.

> 
> > This series of patches implements NUMA memory policy for shared,
> > mmap()ed files.   Because files mmap()ed SHARED are shared between
> > tasks just like shared memory regions, I've used the shared_policy
> > infrastructure from shmem.  This infrastructure applies policies
> > directly to ranges of a file using an rb_tree.  The tree is indexed
> > by the page offset, which we have in page cache allocation contexts.
> > 
> > Note that the method used is similar to one proposed by Steve Longerbeam
> > quite a few years ago, except that I dynamically allocate the shared
> > policy struct when needed, rather than embedding it directly in the
> > inode/address_space.
> > 
> > This series result in the following internal and external semantics:
> > 
> > 1) The vma get|set_policy ops handle memory policies on sub-vma
> >    address ranges for shared, linear mappings [shmem, files]
> >    without splitting the vmas at the policy boundaries. Private
> >    and non-linear mappings still split the vma to apply policy.
> >    However, vma policy is still not visible to the nopage fault path.  
> 
> So the process may set a policy that is then ignored? Instead we 
> use the policy in the inode? Setting a policy sets the policy for the 
> complete inode? Can we set policies on segments of pages of an inode?

That happens now--vma policy, set by mbind(), is ignored--with shared
mapped files.  In fact, it happens today with private, read-only
mappings.   If you never write to a page to break COW sharing, you
always use the page cache page which follows task/system default policy.
This is one of those rough edges that surprises people [me, anyway].

> 
> > 2) As with shmem segments, the shared policies applied to shared
> >    file mappings persist as long as the inode remains--i.e., until
> >    the file is deleted or the inode recycled--whether or not any
> >    task has the file mapped or even open.  We could, I suppose,
> >    free the shared policy on last close.
> 
> What happens if something evicts the inode (reclaim) and then we need
> to reread it later?  Isn't this a bit unpredictable?

One can't evict the inode while the file is open or mapped, right?  If
the question is about applying policy directly to the files [as Steve
Longerbeam wanted to do for executables], such a feature would require a
mechanism such as this series implements.  And, in my experience, the
policy applied to a mapped file has as much, if not more, to do with the
applications' usage of the file than with the particular file itself.
So, as long as one task of the application keeps the file mapped, or
open, the policy that it applies will persist.

With the current implementation, if you place a page cache page via task
policy, and then change the policy, the placement of the page will be
unpredictable if it ever gets paged out.  Yeah, I know.  You can lock
the page down.  Not everyone wants to do that; not everyone has the
privileges to do so.   All in all, I think this is an improvement over
the current state of affairs relative to persistence of policy.

The only surprise, I think, would be the policy persisting after all
tasks closed and unmapped the file.  As you say, the policy can then
persist for an unpredictable time in the inode cache.  The next mapper
of the file may be surprised to find that it already has a policy.  As I
mentioned above, I currently don't free the shared policy on last
close/unmap, but I could easily add that.

> 
> > 3) Vma policy of private mappings of files only apply when the 
> >    task gets a private copy of the page--i.e., when do_wp_page()
> >    breaks the COW sharing and allocates a private page.  Private,
> >    read-only mappings of a file use the shared policy which 
> >    defaults, as before, to process policy, which itself defaults
> >    to, well... default policy.  This is how mapped files have
> >    always behaved.
> 
> Let it be the way it is?

Exactly.  I haven't changed anything in this area.  Currently, you get
task policy for shared or private mappings/read-only access and vma
policy when you write to the page, creating a private anon page.  With
my changes, you get the same thing, unless some task has explicitly
installed a shared policy on the file.

I don't know how much effort one should expend on handling concurrent shared
and private mappings to a file.  Certainly, tasks can do this, but I
haven't thought of any compelling use cases.  Haven't thought about it
much at all.  If it were important, one could break COW if the private
mapping's vma policy differed from the shared policy for the file at
that offset.  But, this would add additional checks in the fault path
for what would have been a minor fault.  So far, I can't justify that.

>  
> > 4) mbind(... 'MOVE*, ...) will not migrate page cache pages in
> >    a private mapping if the file has a shared policy.  Rather,
> >    only anon pages that the mapping task has "COWed" will be
> >    migrated.  If the mapped file does NOT have a shared policy
> >    or the file is mapped shared, then the pages will be migrated,
> >    subject to mapcount, preserving the existing semantics.
> 
> Hmmm... That could get a bit confusing.

Uh, yeah.  Again, this is the case of both shared and private mappings
of a single file.  It didn't seem fair to let a task with a private
mapping migrate page cache pages that some other tasks [maybe including
itself?] have placed with a shared policy.  Of course, with the current
migration implementation, we don't migrate pages referenced by multiple
ptes.  I'd like to "fix" that, someday ;-).

> 
>  
> > Impact On Kernel Build times
> 
> I think what you would like to show is that there is no negative impact on 
> concurrent reads and writes on the page cache.  The additional checking
> may impact scalability.

Hmmm.  There shouldn't be additional checks in performance paths.  I think
the only additions are in *page_cache_alloc*(), where we're already in a
major fault and will ultimately have to wait for IO.  However, I'd love
for someone to give it a spin on a really large configuration [hint,
hint ;-)].  Seriously, as posted, the series applies and builds cleanly
against 2.6.22-rc2-mm1 on ia64 and x86_64.  As long as it doesn't conflict
with one's private patches, it should be relatively painless.

Thanks for the feedback,
Lee



* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-24 20:41 ` Andi Kleen
@ 2007-05-24 21:05   ` Lee Schermerhorn
  2007-05-24 21:17     ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-24 21:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, akpm, nish.aravamudan, clameter

On Thu, 2007-05-24 at 22:41 +0200, Andi Kleen wrote:
> > 
> > Basic "problem":  currently [~2.6.21], files mmap()ed SHARED
> > do not follow mem policy applied to the mapped regions.  Instead, 
> > shared, file backed pages are allocated using the allocating
> > tasks' task policy.  This is inconsistent with the way that anon
> > and shmem pages are handled, violating, for me, the Principle
> > of Least Astonishment.
> 
> Do you have some specific use cases? Did this actually improve
> some application significantly? 

Same use cases as for using mbind() at all.  I want to specify the
placement of memory backing any of my address space.  A shared mapping
of a regular file is, IMO, morally equivalent to a shared memory region,
with the added semantic that it is automatically initialized from the
file contents, and any changes persist after the file is closed.  [One
related semantic that Linux is missing is to initialize the shared
mapping from the file, but not writeback any changes--e.g.,
MAP_NOWRITEBACK.  Some "enterprise unix" support this, presumably at
ISV/customer request.]

> 
> The basic issue is that the semantics seem weird:  the policy randomly
> disappears after everybody closes the file, depending on whether the
> system decides to flush the inode or not.  But using EAs or similar
> also looked like overkill.

IMO, it's also weird that the system ignores my mbind()s on regular
files mmap()ed shared, or ignores them just because I only do read
accesses [admittedly I don't address the latter "feature"].  It seems
like a fundamental correctness issue to me.  I guess I'm just used to
applications [and language run-time environments] that evaluate the
topology of the system, or the subset thereof, that they're running on
and mbind() all of the components of the address space accordingly.

I have a more difficult time envisioning a use for policies stored with
the file, except in the case of embedded or other fixed hardware
configurations where you can determine a priori how you want your memory
placed, and don't want to have to add the binding into each application.
I believe that this is the use case that Steve Longerbeam had in mind a
few years ago when he proposed this.  As I responded to Christoph, if we
want to store policy with the file itself and have it automatically
installed when the file is opened or mmap'd, we'd need a mechanism
such as the one I propose underneath it.

Lee


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-24 21:05   ` Lee Schermerhorn
@ 2007-05-24 21:17     ` Christoph Lameter
  2007-05-25 14:55       ` Lee Schermerhorn
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-05-24 21:17 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Thu, 24 May 2007, Lee Schermerhorn wrote:

> Same use cases as for using mbind() at all.  I want to specify the
> placement of memory backing any of my address space.  A shared mapping
> of a regular file is, IMO, morally equivalent to a shared memory region,
> with the added semantic that it is automatically initialized from the
> file contents, and any changes persist after the file is closed.  [One
> related semantic that Linux is missing is to initialize the shared
> mapping from the file, but not writeback any changes--e.g.,
> MAP_NOWRITEBACK.  Some "enterprise unix" systems support this, presumably at
> ISV/customer request.]

I think Andi was looking for an actual problem that is solved by this 
patchset. Any user feedback that triggered this solution?


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-24 21:17     ` Christoph Lameter
@ 2007-05-25 14:55       ` Lee Schermerhorn
  2007-05-25 15:25         ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-25 14:55 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Thu, 2007-05-24 at 14:17 -0700, Christoph Lameter wrote:
> On Thu, 24 May 2007, Lee Schermerhorn wrote:
> 
> > Same use cases as for using mbind() at all.  I want to specify the
> > placement of memory backing any of my address space.  A shared mapping
> > of a regular file is, IMO, morally equivalent to a shared memory region,
> > with the added semantic that it is automatically initialized from the
> > file contents, and any changes persist after the file is closed.  [One
> > related semantic that Linux is missing is to initialize the shared
> > mapping from the file, but not writeback any changes--e.g.,
> > MAP_NOWRITEBACK.  Some "enterprise unix" systems support this, presumably at
> > ISV/customer request.]
> 
> I think Andi was looking for an actual problem that is solved by this 
> patchset. Any user feedback that triggered this solution?

The question usually comes up in the context of migrating customers'
applications or benchmarks from our legacy unix numa APIs to Linux.  I
don't know the exact applications that install explicit policy on
shared mmap()ed files, but on, say, Tru64 Unix it just works.  As a
result, customers and ISVs have used it.  We try to make it easy for
customers to migrate to Linux--providing support, documentation and
such.  Having a one-for-one API replacement makes this easier.  In this
context, it's a glaring hole in Linux today, and I've had to explain to
colleagues that it's a "feature"--at which point they ask me when I
transferred to marketing ;-).

It's easy to fix.  The shared policy support is already there.  We just
need to generalize it for regular files.  In the process,
*page_cache_alloc() obeys "file policy", which will allow additional
features such as you mentioned:  global page cache policy as the default
"file policy".

Now, I understand the concern about any increase in size, even if it's
only ~2K, but I think this is mostly of concern to 32-bit systems, where
I expect the increase will be less than 2K.  I also understand that
there are still a few 32-bit NUMA systems out there [NUMAQ?] and that
some folks use fake NUMA and cpusets on 32-bit systems for
container-like resource management.  For those systems, we could gain
back some of the size increase by making numa_maps configurable.  A
quick test showed that for x86_64, eliminating the /proc/<pid>/numa_maps
makes the kernel with my mapped file policy patches ~1.8K smaller than
the unpatched kernel with numa_maps.  I'm NOT proposing to eliminate
numa_maps, in general, because I find it very useful.  But maybe 32-bit
fake numa systems don't need it?

By the way, I think we need the numa_maps fixes in any case because the
current implementation lies about shmem segments if you look at any task
that didn't install [all of] the policy on the segment, unless it
happens to be a child of the task that did install the policy and that
child was forked after the mbind() calls.  I really dislike all of those
"ifs" and "unlesses"--I found it humorous in the George Carlin routine,
but not in user/programming interface design.

Anyway, I posted the patches in hopes of getting some additional eyes to
look at them and maybe getting some time in -mm to see whether it breaks
anything or impacts performance adversely on systems that I don't have
access to.

Lee



* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 14:55       ` Lee Schermerhorn
@ 2007-05-25 15:25         ` Christoph Lameter
  2007-05-25 16:06           ` Lee Schermerhorn
  2007-05-25 21:03           ` Andi Kleen
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Lameter @ 2007-05-25 15:25 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 25 May 2007, Lee Schermerhorn wrote:

> It's easy to fix.  The shared policy support is already there.  We just
> need to generalize it for regular files.  In the process,
> *page_cache_alloc() obeys "file policy", which will allow additional
> features such as you mentioned:  global page cache policy as the default
> "file policy".

A page cache policy would not need to be file based. It would be enough 
to have a global one or one per cpuset. And it would not suffer from the 
vanishing act of the inodes.

> By the way, I think we need the numa_maps fixes in any case because the
> current implementation lies about shmem segments if you look at any task
> that didn't install [all of] the policy on the segment, unless it
> happens to be a child of the task that did install the policy and that
> child was forked after the mbind() calls.  I really dislike all of those
> "ifs" and "unlesses"--I found it humorous in the George Carlin routine,
> but not in user/programming interface design.

Could you separate out a patch that fixes these issues?


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 15:25         ` Christoph Lameter
@ 2007-05-25 16:06           ` Lee Schermerhorn
  2007-05-25 16:24             ` Christoph Lameter
  2007-05-25 21:03           ` Andi Kleen
  1 sibling, 1 reply; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-25 16:06 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 2007-05-25 at 08:25 -0700, Christoph Lameter wrote:
> On Fri, 25 May 2007, Lee Schermerhorn wrote:
> 
> > It's easy to fix.  The shared policy support is already there.  We just
> > need to generalize it for regular files.  In the process,
> > *page_cache_alloc() obeys "file policy", which will allow additional
> > features such as you mentioned:  global page cache policy as the default
> > "file policy".
> 
> A page cache policy would not need to be file based. It would be enough 
> to have a global one or one per cpuset. And it would not suffer from the 
> vanishing act of the inodes.

True, but shared, mmap'ed file policy does need to be file based, and
that is my objective.  I merely point out that we can easily add the
page cache policy as the fall back when a file has no explicit policy.
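
Roughly, the lookup order I have in mind for *page_cache_alloc*() would
be the following [sketch only; "spolicy" and "page_cache_policy" are
illustrative names, not the exact identifiers in the patches]:

static struct mempolicy *file_alloc_policy(struct address_space *mapping,
                                           pgoff_t index)
{
        struct mempolicy *pol = NULL;

        if (mapping->spolicy)           /* explicit per-file shared policy */
                pol = mpol_shared_policy_lookup(mapping->spolicy, index);
        if (!pol)
                pol = page_cache_policy;        /* global/cpuset default */
        if (!pol)
                pol = current->mempolicy;       /* task policy, as today */
        return pol;
}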

> 
> > By the way, I think we need the numa_maps fixes in any case because the
> > current implementation lies about shmem segments if you look at any task
> > that didn't install [all of] the policy on the segment, unless it
> > happens to be a child of the task that did install the policy and that
> > child was forked after the mbind() calls.  I really dislike all of those
> > "ifs" and "unlesses"--I found it humorous in the George Carlin routine,
> > but not in user/programming interface design.
> 
> Could you separate out a patch that fixes these issues?

Could do, but does that improve the chances for acceptance of this patch
set?  If the patch set is accepted, with whatever corrections might be
required, we get the numa_maps fix.  So, I'm not currently motivated to
post a separate patch.

Lee


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 16:06           ` Lee Schermerhorn
@ 2007-05-25 16:24             ` Christoph Lameter
  2007-05-25 17:37               ` Lee Schermerhorn
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-05-25 16:24 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 25 May 2007, Lee Schermerhorn wrote:

> True, but shared, mmap'ed file policy does need to be file based, and
> that is my objective.  I merely point out that we can easily add the
> page cache policy as the fall back when a file has no explicit policy.

The problem is that you have not given sufficient reason for the 
modifications. Tru64 compatibility is not a valid reason.

> > Could you separate out a patch that fixes these issues?
> 
> Could do, but does that improve the chances for acceptance of this patch
> set?  If the patch set is accepted, with whatever corrections might be
> required, we get the numa_maps fix.  So, I'm not currently motivated to
> post a separate patch.

The patchset as-is is not acceptable since it does not follow the
standards. The fixes should come first. So you have to do this anyway
to get the patchset accepted.


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 16:24             ` Christoph Lameter
@ 2007-05-25 17:37               ` Lee Schermerhorn
  2007-05-25 19:10                 ` Christoph Lameter
  2007-05-25 21:01                 ` Andi Kleen
  0 siblings, 2 replies; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-25 17:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 2007-05-25 at 09:24 -0700, Christoph Lameter wrote:
> On Fri, 25 May 2007, Lee Schermerhorn wrote:
> 
> > True, but shared, mmap'ed file policy does need to be file based, and
> > that is my objective.  I merely point out that we can easily add the
> > page cache policy as the fall back when a file has no explicit policy.
> 
> The problem is that you have not given sufficient reason for the 
> modifications. Tru64 compatibility is not a valid reason.

I knew that!  There is no existing practice.  However, I think it is in
our interests to ease the migration of applications to Linux.  And,
again, [trying to choose words carefully], I see this as a
defect/oversight in the API.  I mean, why provide mbind() at all, and
then say, "Oh, by the way, this only works for anonymous memory, SysV
shared memory and private file mappings. You can't use this if you
mmap() a file shared.  For that you have to twiddle your task policy,
fault in and lock down the pages to make sure they don't get paged out,
because, if they do, and you've changed the task policy to place some
other mapped file that doesn't obey mbind(), the kernel doesn't remember
where you placed them.  Oh, and for those private mappings--be sure to
write to each page in the range because if you just read, the kernel
will ignore your vma policy."

Come on!  
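
For the record, the workaround I'm describing looks roughly like this
[illustrative sketch, binding to node 1; assumes <numaif.h>,
<sys/mman.h> and <unistd.h>]:

static void bind_shared_range_the_hard_way(char *addr, size_t len)
{
        unsigned long nodemask = 1UL << 1;      /* node 1 */
        size_t pgsz = getpagesize(), i;

        /* twiddle the task policy... */
        set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);
        /* ...fault every page in under it... */
        for (i = 0; i < len; i += pgsz)
                (void)((volatile char *)addr)[i];
        /* ...and pin the pages: if they're reclaimed, placement is
         * forgotten */
        mlock(addr, len);
        set_mempolicy(MPOL_DEFAULT, NULL, 0);   /* restore task policy */
}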

> 
> > > Could you separate out a patch that fixes these issues?
> > 
> > Could do, but does that improve the chances for acceptance of this patch
> > set?  If the patch set is accepted, with whatever corrections might be
> > required, we get the numa_maps fix.  So, I'm not currently motivated to
> > post a separate patch.
> 
> The patchset as-is is not acceptable since it does not follow the
> standards. The fixes should come first. So you have to do this anyway
> to get the patchset accepted.

Which standards are we talking about?  I'll happily fix any coding
standard violations.  Is there something wrong with the format of the
patches?  Please tell me, so I can fix them...

And as for fixing the numa_maps behavior, hey, I didn't post the
defective code.  I'm just pointing out that my patches happen to fix
some existing suspect behavior along the way.  But, if some patch
submittal standard exists that says one must fix all known outstanding
bugs before submitting anything else [Andrew would probably support
that ;-)], please point it out to me... and everyone else.  And, as I've
said before, I see this patch set as one big fix to missing/broken
behavior.  

Lee




* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 17:37               ` Lee Schermerhorn
@ 2007-05-25 19:10                 ` Christoph Lameter
  2007-05-25 21:12                   ` Lee Schermerhorn
  2007-05-25 21:01                 ` Andi Kleen
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-05-25 19:10 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 25 May 2007, Lee Schermerhorn wrote:

> I knew that!  There is no existing practice.  However, I think it is in
> our interests to ease the migration of applications to Linux.  And,
> again, [trying to choose words carefully], I see this as a
> defect/oversight in the API.  I mean, why provide mbind() at all, and
> then say, "Oh, by the way, this only works for anonymous memory, SysV
> shared memory and private file mappings. You can't use this if you
> mmap() a file shared.  For that you have to twiddle your task policy,
> fault in and lock down the pages to make sure they don't get paged out,
> because, if they do, and you've changed the task policy to place some
> other mapped file that doesn't obey mbind(), the kernel doesn't remember
> where you placed them.  Oh, and for those private mappings--be sure to
> write to each page in the range because if you just read, the kernel
> will ignore your vma policy."
> 
> Come on!  

Well, if this patch simplified things then I would agree, but it
introduces new corner cases.

The current scheme is logical if you consider the pagecache as something 
separate. It is after all already controlled via the memory spreading flag 
in cpusets. There is already limited control by the process.

Also allowing vma based memory policies to control shared mappings is
problematic because they are shared. Concurrent processes may set 
different policies. This would make sense if the policy could be set at a 
filesystem level.

> And as for fixing the numa_maps behavior, hey, I didn't post the
> defective code.  I'm just pointing out that my patches happen to fix
> some existing suspect behavior along the way.  But, if some patch
> submittal standard exists that says one must fix all known outstanding
> bugs before submitting anything else [Andrew would probably support
> that ;-)], please point it out to me... and everyone else.  And, as I've
> said before, I see this patch set as one big fix to missing/broken
> behavior.  

I still have not found a bug in there....

Convention is that fixes precede enhancements in a patchset.


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 17:37               ` Lee Schermerhorn
  2007-05-25 19:10                 ` Christoph Lameter
@ 2007-05-25 21:01                 ` Andi Kleen
  2007-05-25 21:41                   ` Lee Schermerhorn
  1 sibling, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2007-05-25 21:01 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Christoph Lameter, linux-mm, akpm, nish.aravamudan

> I knew that!  There is no existing practice.  However, I think it is in
> our interests to ease the migration of applications to Linux.  And,
> again, [trying to choose words carefully], I see this as a
> defect/oversight in the API.  I mean, why provide mbind() at all, and
> then say, "Oh, by the way, this only works for anonymous memory, SysV
> shared memory and private file mappings. You can't use this if you
> mmap() a file shared.  For that you have to twiddle your task policy,
> fault in and lock down the pages to make sure they don't get paged out,
> because, if they do, and you've changed the task policy to place some
> other mapped file that doesn't obey mbind(), the kernel doesn't remember
> where you placed them.  Oh, and for those private mappings--be sure to
> write to each page in the range because if you just read, the kernel
> will ignore your vma policy."
> 
> Come on!  

But "you can set policy but we will randomly lose it later" is also
not very convincing, is it?

I would like to only go forward if there are actually convincing
use cases for this.

The Tru64 compat argument doesn't seem too strong to me for this because
I'm sure there are lots of other incompatibilities too.

> And as for fixing the numa_maps behavior, hey, I didn't post the
> defective code.  I'm just pointing out that my patches happen to fix
> some existing suspect behavior along the way.  But, if some patch
> submittal standard exists that says one must fix all known outstanding
> bugs before submitting anything else [Andrew would probably support
> that ;-)], please point it out to me... and everyone else.  And, as I've
> said before, I see this patch set as one big fix to missing/broken
> behavior.  

In Linux the deal is usually kind of: the more you care about general
code maintenance, the more we care about your feature wishlists.
So fixing bugs is usually a good idea.

-Andi


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 15:25         ` Christoph Lameter
  2007-05-25 16:06           ` Lee Schermerhorn
@ 2007-05-25 21:03           ` Andi Kleen
  2007-05-25 21:14             ` Lee Schermerhorn
  1 sibling, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2007-05-25 21:03 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Lee Schermerhorn, linux-mm, akpm, nish.aravamudan

On Friday 25 May 2007 17:25:15 Christoph Lameter wrote:
> On Fri, 25 May 2007, Lee Schermerhorn wrote:
> 
> > It's easy to fix.  The shared policy support is already there.  We just
> > need to generalize it for regular files.  In the process,
> > *page_cache_alloc() obeys "file policy", which will allow additional
> > features such as you mentioned:  global page cache policy as the default
> > "file policy".
> 
> A page cache policy would not need to be file based. It would be enough 
> to have a global one or one per cpuset. And it would not suffer from the 
> vanishing act of the inodes.

I agree. A general page cache policy is probably a good idea and having
it in a cpuset is reasonable too. I've also been toying with the idea
of changing the global default to interleaved for unmapped files.

But in this case there's actually no need to add anything to the
address space. It can all be process policy based.

-Andi


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 19:10                 ` Christoph Lameter
@ 2007-05-25 21:12                   ` Lee Schermerhorn
  2007-05-25 21:43                     ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-25 21:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 2007-05-25 at 12:10 -0700, Christoph Lameter wrote:
> On Fri, 25 May 2007, Lee Schermerhorn wrote:
> 
> > I knew that!  There is no existing practice.  However, I think it is in
> > our interests to ease the migration of applications to Linux.  And,
> > again, [trying to choose words carefully], I see this as a
> > defect/oversight in the API.  I mean, why provide mbind() at all, and
> > then say, "Oh, by the way, this only works for anonymous memory, SysV
> > shared memory and private file mappings. You can't use this if you
> > mmap() a file shared.  For that you have to twiddle your task policy,
> > fault in and lock down the pages to make sure they don't get paged out,
> > because, if they do, and you've changed the task policy to place some
> > other mapped file that doesn't obey mbind(), the kernel doesn't remember
> > where you placed them.  Oh, and for those private mappings--be sure to
> > write to each page in the range because if you just read, the kernel
> > will ignore your vma policy."
> > 
> > Come on!  
> 
> Well, if this patch simplified things then I would agree, but it
> introduces new corner cases.

I don't think this is the case, but I could have missed something.  I've
kept the behavior identical, I think, for the default case when no
explicit shared policy is applied.  And the remaining corner case
involves those funky private mappings.  The behavior there is the same
as the current behavior.  

I have a fix for that, but it involves forcing early COW break when the
private mapping has a vma policy and the page cache page doesn't match
the policy.  I haven't posted that because: 1) it DOES add additional
checks in the nopage fault path and 2) it depends on the misplacement
check in my "migrate on fault" series.  I didn't want to muddy the water
with that yet.

> 
> The current scheme is logical if you consider the pagecache as something 
> separate. It is after all already controlled via the memory spreading flag 
> in cpusets. There is already limited control by the process.

Yes, but I have to treat some parts of my address space [mapped shared
files] differently, when it's unnecessary.

> 
> Also allowing vma based memory policies to control shared mappings is
> problematic because they are shared. Concurrent processes may set 
> different policies. 

But with the shared policy infrastructure, all shared mappers see the
same policy [or policies].  The last one set on any given range of the
underlying file [address_space] is the one that is currently in
effect--just like shmem.  If that wasn't clear from my description, I
need to fix that.
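
For anyone who hasn't read the shmem shared policy code, the mechanism
is roughly this [simplified model using <linux/rbtree.h>; the field and
function names are illustrative, not the kernel's exact ones]:

struct sp_range {                       /* one policy range on a file */
        struct rb_node          nd;
        unsigned long           start, end;  /* page offsets, end exclusive */
        struct mempolicy        *policy;
};

static struct mempolicy *sp_lookup(struct rb_root *root, unsigned long pgoff)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct sp_range *r = rb_entry(n, struct sp_range, nd);

                if (pgoff < r->start)
                        n = n->rb_left;
                else if (pgoff >= r->end)
                        n = n->rb_right;
                else
                        return r->policy;       /* start <= pgoff < end */
        }
        return NULL;    /* no range installed: fall back as today */
}

Installing a new policy over a range replaces whatever ranges it
overlaps, which is why the last mbind() over any given range "wins"
for all mappers.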

> This would make sense if the policy could be set at a 
> filesystem level.

??? Why?  Different processes could set different policies on the file
in the file system.  The last one [before the file was mapped?] would
rule.

> 
> > And as for fixing the numa_maps behavior, hey, I didn't post the
> > defective code.  I'm just pointing out that my patches happen to fix
> > some existing suspect behavior along the way.  But, if some patch
> > submittal standard exists that says one must fix all known outstanding
> > bugs before submitting anything else [Andrew would probably support
> > that ;-)], please point it out to me... and everyone else.  And, as I've
> > said before, I see this patch set as one big fix to missing/broken
> > behavior.  
> 
> I still have not found a bug in there....

I'll send you a memtoy script to demonstrate the issue.  Next week...

> 
> Convention is that fixes precede enhancements in a patchset.

Seems like a lot of extra effort that could be applied to other tasks,
but you've worn me down.  I'll debug the numa_maps hang with hugetlb
shmem segments with shared policy in the current code base, and reorder
the patch set to handle correct display of shmem policy from all tasks
first.  Next week or so.  

Later,
Lee


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 21:03           ` Andi Kleen
@ 2007-05-25 21:14             ` Lee Schermerhorn
  2007-05-25 22:44               ` Andi Kleen
  0 siblings, 1 reply; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-25 21:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, linux-mm, akpm, nish.aravamudan

On Fri, 2007-05-25 at 23:03 +0200, Andi Kleen wrote:
> On Friday 25 May 2007 17:25:15 Christoph Lameter wrote:
> > On Fri, 25 May 2007, Lee Schermerhorn wrote:
> > 
> > > It's easy to fix.  The shared policy support is already there.  We just
> > > need to generalize it for regular files.  In the process,
> > > *page_cache_alloc() obeys "file policy", which will allow additional
> > > features such as you mentioned:  global page cache policy as the default
> > > "file policy".
> > 
> > A page cache policy would not need to be file based. It would be enough 
> > to have a global one or one per cpuset. And it would not suffer from the 
> > vanishing act of the inodes.
> 
> I agree. A general page cache policy is probably a good idea and having
> it in a cpuset is reasonable too. I've been also toying with the idea to 
> change the global default to interleaved for unmapped files.
> 
> But in this case there's actually no need to add anything to the
> address space. It can all be process policy based.

Just so we're clear, I'm talking about "struct address_space", as in the
file's "mapping", not as in "struct mm_struct".


Lee


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 21:01                 ` Andi Kleen
@ 2007-05-25 21:41                   ` Lee Schermerhorn
  2007-05-25 21:46                     ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-25 21:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, linux-mm, akpm, nish.aravamudan

On Fri, 2007-05-25 at 23:01 +0200, Andi Kleen wrote:
> > I knew that!  There is no existing practice.  However, I think it is in
> > our interests to ease the migration of applications to Linux.  And,
> > again, [trying to choose words carefully], I see this as a
> > defect/oversight in the API.  I mean, why provide mbind() at all, and
> > then say, "Oh, by the way, this only works for anonymous memory, SysV
> > shared memory and private file mappings. You can't use this if you
> > mmap() a file shared.  For that you have to twiddle your task policy,
> > fault in and lock down the pages to make sure they don't get paged out,
> > because, if they do, and you've changed the task policy to place some
> > other mapped file that doesn't obey mbind(), the kernel doesn't remember
> > where you placed them.  Oh, and for those private mappings--be sure to
> > write to each page in the range because if you just read, the kernel
> > will ignore your vma policy."
> > 
> > Come on!  
> 
> But "you can set policy but we will randomly lose it later" is also
> not very convincing, is it?

My patches don't randomly lose the policy as long as some application
has the file open/mapped.  Yeah, shmem shared policies are slightly more
persistent--they can hang around with no mappers, but you lose the
shared policy on reboot.  So, the first application to attach after
[re]boot has to mbind().  Same thing for shared mapped files.  The first
task to mmap has to set policy.  Applications with multiple tasks that
share shmem segments or application-specific shared, mmap()ed files
usually have one task that sets up the environment and handles this
sort of thing for the rest of the tasks.

> 
> I would like to only go forward if there are actually convincing
> use cases for this.

Consider it maintenance ;-).

> 
> The Tru64 compat argument doesn't seem too strong to me for this because
> I'm sure there are lots of other incompatibilities too.

I'm not looking for "compatibility" as much as functional parity...  And
we're so close to having sensible semantics.  It could "just work"...

> 
> > And as for fixing the numa_maps behavior, hey, I didn't post the
> > defective code.  I'm just pointing out that my patches happen to fix
> > some existing suspect behavior along the way.  But, if some patch
> > submittal standard exists that says one must fix all known outstanding
> > bugs before submitting anything else [Andrew would probably support
> > that ;-)], please point it out to me... and everyone else.  And, as I've
> > said before, I see this patch set as one big fix to missing/broken
> > behavior.  
> 
> In Linux the deal is usually kind of: the more you care about general
> code maintenance, the more we care about your feature wishlists.
> So fixing bugs is usually a good idea.

As I've said, I view this series as addressing a number of problems,
including the numa_maps hang when displaying hugetlb shmem segments with
shared policy [that one by accident, I admit], the incorrect display of
shmem segment policy from different tasks, and the disconnect between
mbind() and mapped, shared files [one person's defect is another's
feature, or vice versa ;-)].  However, I will look at reordering the
series to fix the hang and incorrect display first.

Lee




* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 21:12                   ` Lee Schermerhorn
@ 2007-05-25 21:43                     ` Christoph Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2007-05-25 21:43 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 25 May 2007, Lee Schermerhorn wrote:

> ??? Why?  Different processes could set different policies on the file
> in the file system.  The last one [before the file was mapped?] would
> rule.

Then the policy would be set on a file and not by processes. So there
would be one way of controlling the memory policy.

> Seems like a lot of extra effort that could be applied to other tasks,
> but you've worn me down.  I'll debug the numa_maps hang with hugetlb
> shmem segments with shared policy in the current code base, and reorder
> the patch set to handle correct display of shmem policy from all tasks
> first.  Next week or so.  

It may be worthwhile to split off the huge tlb pieces and cc those 
interested in huge pages. Maybe they can be treated like shmem?



* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 21:41                   ` Lee Schermerhorn
@ 2007-05-25 21:46                     ` Christoph Lameter
  2007-05-29 13:57                       ` Lee Schermerhorn
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-05-25 21:46 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 25 May 2007, Lee Schermerhorn wrote:

> As I've said, I view this series as addressing a number of problems,
> including the numa_maps hang when displaying hugetlb shmem segments with
> shared policy [that one by accident, I admit], the incorrect display of

That hang exists only if you first add a shared policy, right?

> shmem segment policy from different tasks, and the disconnect between

Ahh.. Never checked that. What is happening with shmem policy display?


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 21:14             ` Lee Schermerhorn
@ 2007-05-25 22:44               ` Andi Kleen
  2007-05-29 14:17                 ` Lee Schermerhorn
  0 siblings, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2007-05-25 22:44 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Christoph Lameter, linux-mm, akpm, nish.aravamudan

> > I agree. A general page cache policy is probably a good idea and having
> > it in a cpuset is reasonable too. I've also been toying with the idea
> > of changing the global default to interleaved for unmapped files.
> > 
> > But in this case there's actually no need to add anything to the
> > address space. It can all be process policy based.
> 
> Just so we're clear, I'm talking about "struct address_space", as in the
> file's "mapping", not as in "struct mm_struct".

I'm talking about the same thing. Process/current cpuset policy doesn't
need anything in struct address_space.

-Andi


* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 21:46                     ` Christoph Lameter
@ 2007-05-29 13:57                       ` Lee Schermerhorn
  0 siblings, 0 replies; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-29 13:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, akpm, nish.aravamudan

On Fri, 2007-05-25 at 14:46 -0700, Christoph Lameter wrote:
> On Fri, 25 May 2007, Lee Schermerhorn wrote:
> 
> > As I've said, I view this series as addressing a number of problems,
> > including the numa_maps hang when displaying hugetlb shmem segments with
> > shared policy [that one by accident, I admit], the incorrect display of
> 
> That hang exists only if you first add a shared policy, right?

hugetlbfs inodes already have a shared policy struct in their private
info struct.  These get initialized when you create a SHM_HUGETLB
segment, but the .{get|set}_policy vm_ops are not "hooked up".  If I
just hook them up, then hugetlb segments DO obey the policy, as seen
using get_mempolicy() with the MPOL_F_NODE|MPOL_F_ADDR construct.
However, a display of the numa_maps for the task hangs.  This is
without any of my shared policy patches.  With my patch series,
everything works fine--for my definition of fine.
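
To be explicit, "hooking them up" means something like the following
[untested sketch; HUGETLBFS_I() and the embedded shared_policy are
real, the wrapper name is mine]--and this naive change alone is what
triggers the hang:

static int hugetlbfs_set_policy(struct vm_area_struct *vma,
                                struct mempolicy *new)
{
        struct inode *inode = vma->vm_file->f_path.dentry->d_inode;

        /* store the policy in the shared_policy embedded in the
         * hugetlbfs inode info, keyed by file offset */
        return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy,
                                      vma, new);
}

/* ...plus .set_policy/.get_policy entries in hugetlb_vm_ops */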

> 
> > shmem segment policy from different tasks, and the disconnect between
> 
> Ahh.. Never checked that. What is happening with shmem policy display?

I've included a memtoy script below that illustrates what's happening.
You can grab the latest memtoy from:

http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

To build you'll need the GNU readline and history packages and the
libnuma headers [numactl-devel package?].  These aren't loaded by
default on SLES10, and they're on the SDK iso, which I don't have, so I
haven't built there.  Did build on RHEL5, but I had to load the
numactl-devel package from the install image to get the libnuma headers.

Run script with "memtoy -v <path-to-script>".

Lee

# memtoy script to test shmem policy & numa maps
#
# 1) create a 64 page shmem segment -- shmget() internally
shmem s1 64p
#
# now memtoy has the shmem id in its internal segment table.
show
#
# 2) now, before mapping [attaching to] the segment, fork a
#    child process.  The child will inherit the [unattached]
#    segment in its segment table.
child c1
#
/c1 show
#
# 3) map/attach the segment in the parent and apply shared 
#    memory policy to different ranges of the segment, as
#    supported by existing shared policy infrastructure.
#    Using just 2 nodes [0 and 1] because I tested on a
#    2 socket AMD x86_64 blade.  It's also the minimum
#    "interesting" config.
map s1
mbind s1 0p 8p default
mbind s1 8p 16p interleave 0,1
mbind s1 16p 16p bind 0
mbind s1 32p 32p bind 1
#
# 4) now touch the segment to fault in pages.  With a shmem
#    segment, it shouldn't matter whether we touch with a read
#    or write, as it will fault in a page based on the shared
#    policy.  [It DOES matter for anon pages -- read faults
#    on previously unpopulated pages don't obey vma policy
#    installed by mbind()--unless we now have per node 
#    ZEROPAGE?]
touch s1 
#
# 5) Where did it land?  Does it obey the policies installed
#    above?
where s1
#
# 6) Tell the child to attach the segment and see where it
#    thinks it lives.  Child should see the same thing
/c1 map s1
/c1 where s1
#
# 7) pause memtoy.  In another window [or after pushing the
#    paused task to the background], examine the numa_maps
#    of the parent and child.   The pids were displayed when
#    memtoy started and when the child was created, but I'll
#    display them again here.
pid
/c1 pid
pause
#
# What did you see???
# 
# SIGINT [^C ?] to continue/exit






* Re: [PATCH/RFC 0/8] Mapped File Policy Overview
  2007-05-25 22:44               ` Andi Kleen
@ 2007-05-29 14:17                 ` Lee Schermerhorn
  0 siblings, 0 replies; 30+ messages in thread
From: Lee Schermerhorn @ 2007-05-29 14:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, linux-mm, akpm, nish.aravamudan

On Sat, 2007-05-26 at 00:44 +0200, Andi Kleen wrote:
> > > I agree. A general page cache policy is probably a good idea and having
> > > it in a cpuset is reasonable too. I've been also toying with the idea to 
> > > change the global default to interleaved for unmapped files.
> > > 
> > > But in this case it's actually not needed to add something to the
> > > address space. It can be all process policy based.
> > 
> > Just so we're clear, I'm talking about "struct address_space", as in the
> > file's "mapping", not as in "struct mm_struct".
> 
> I'm talking about the same thing. Process/current cpuset policy doesn't
> need anything in struct address_space.

Yes, but for shared policy, that seems like the most natural place to
put it--along with the radix tree that contains the page offset to
memory page [struct] mapping.  Also note that process policy is
potentially transient--certainly so if you use it to place different
files in different locations.  And, it has the same problem that
Christoph noted with using vma policy on shared file mappings--you [can]
get different locations depending on which task faults the page in.

Lee


end of thread, other threads:[~2007-05-29 14:17 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-24 17:28 [PATCH/RFC 0/8] Mapped File Policy Overview Lee Schermerhorn
2007-05-24 17:28 ` [PATCH/RFC 1/8] Mapped File Policy: move shared policy to inode/mapping Lee Schermerhorn
2007-05-24 17:28 ` [PATCH/RFC 2/8] Mapped File Policy: allocate shared policies as needed Lee Schermerhorn
2007-05-24 17:28 ` [PATCH/RFC 3/8] Mapped File Policy: let vma policy ops handle sub-vma policies Lee Schermerhorn
2007-05-24 17:28 ` [PATCH/RFC 4/8] Mapped File Policy: add generic file set/get policy vm ops Lee Schermerhorn
2007-05-24 17:28 ` [PATCH/RFC 5/8] Mapped File Policy: Factor alloc_page_pol routine Lee Schermerhorn
2007-05-24 17:29 ` [PATCH/RFC 6/8] Mapped File Policy: use file policy for page cache allocations Lee Schermerhorn
2007-05-24 17:29 ` [PATCH/RFC 7/8] Mapped File Policy: fix migration of private mappings Lee Schermerhorn
2007-05-24 17:29 ` [PATCH/RFC 8/8] Mapped File Policy: fix show_numa_maps() Lee Schermerhorn
2007-05-24 19:24 ` [PATCH/RFC 0/8] Mapped File Policy Overview Christoph Lameter
2007-05-24 20:46   ` Lee Schermerhorn
2007-05-24 20:41 ` Andi Kleen
2007-05-24 21:05   ` Lee Schermerhorn
2007-05-24 21:17     ` Christoph Lameter
2007-05-25 14:55       ` Lee Schermerhorn
2007-05-25 15:25         ` Christoph Lameter
2007-05-25 16:06           ` Lee Schermerhorn
2007-05-25 16:24             ` Christoph Lameter
2007-05-25 17:37               ` Lee Schermerhorn
2007-05-25 19:10                 ` Christoph Lameter
2007-05-25 21:12                   ` Lee Schermerhorn
2007-05-25 21:43                     ` Christoph Lameter
2007-05-25 21:01                 ` Andi Kleen
2007-05-25 21:41                   ` Lee Schermerhorn
2007-05-25 21:46                     ` Christoph Lameter
2007-05-29 13:57                       ` Lee Schermerhorn
2007-05-25 21:03           ` Andi Kleen
2007-05-25 21:14             ` Lee Schermerhorn
2007-05-25 22:44               ` Andi Kleen
2007-05-29 14:17                 ` Lee Schermerhorn
