* [PATCH/RFC 0/11] Shared Policy Overview
From: Lee Schermerhorn @ 2007-06-25 19:52 UTC
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
[RFC] Shared Policy Fixes and Mapped File Policy 0/11
Against 2.6.22-rc4-mm2
This is my former "Mapped File Policy" patch set, reordered to
move the fixes to numa_maps and the "hook up" of hugetlb shmem
policies to before the hook up of shared policy to shared mmap()ed
regular files. ["Fixes first" per Christoph L.]
The 2 patches to fix up current behavior issues [#s 4 & 5] sit
atop 3 "cleanup" patches. The clean up patches simplify the fixes
and, yes, support the generic mapped file policy patches [#s 6-11].
With patches 1-3 applied, external behavior is, AFAICT, exactly
the same as current behavior. The internal differences are that
shared policy is now a pointer in the address_space structure.
A NULL value [the default] indicates default policy. The shared
policy is allocated on demand--when one mbind()s a virtual
address range backed by a shmem memory object.
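For concreteness, a minimal userspace sketch of the triggering case.
The segment size, policy mode and node mask below are arbitrary;
mbind() comes from numaif.h [link with -lnuma]:

#include <numaif.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	int id = shmget(IPC_PRIVATE, 16UL << 20, IPC_CREAT | 0600);
	char *p = shmat(id, NULL, 0);
	unsigned long mask = 1UL << 0;		/* node 0 */

	/*
	 * installing a policy on part of the segment is what triggers
	 * the on-demand allocation of the shared_policy hanging off
	 * the segment's address_space
	 */
	mbind(p, 4UL << 20, MPOL_BIND, &mask, sizeof(mask) * 8, 0);
	return 0;
}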
Patch #3 eliminates the need for a pseudo-vma on the stack to
initialize policies for tmpfs inodes when the superblock has
a non-default policy by changing the interface to
mpol_set_shared_policy() to take a page offset and size in pages,
computed in the shmem set_policy vm_op. This cleanup addresses
one complaint about the current shared policy infrastructure.
The other internal difference is that linear mappings that support
the 'set_policy' vm_op are mapped by a single VMA--not split on
policy boundaries. numa_maps needs to be able to handle this
anyway because a task that attaches a shmem segment on which
another task has already installed multiple shared policies will
have a single vma mapping the entire segment. Patch #4 fixes
numa_maps to display these properly.
Patch #5 hooks up SHM_HUGETLB segments to use the shmem get/set
policy vm_ops. This "just works" with the fixes to numa_maps
in patch #4. Without the numa_maps fixes, a cat of the numa_maps
of a task with a hugetlb shmem segment with shared policy attached
would hang.
Again, patches 6-11 define the generic file shared policy support.
They also prevent a private file mapping from affecting any shared
policy installed via a shared mapping, including preventing migration
of the shared pages to follow the private, per-address-space policy.
Policies installed via a private mapping apply only to the calling
task's address space--current behavior.
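To make the semantics concrete, a hedged userspace sketch [the file
path and sizes are arbitrary; link with -lnuma]:

#include <fcntl.h>
#include <numaif.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("/tmp/data", O_RDWR);	/* arbitrary file */
	unsigned long mask = 1UL << 1;		/* node 1 */

	/*
	 * shared mapping: with patches 6-11 applied [and the feature
	 * enabled--see patch #10], this mbind() installs a shared
	 * policy on the file's address_space
	 */
	char *shr = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	mbind(shr, 1UL << 20, MPOL_BIND, &mask, sizeof(mask) * 8, 0);

	/*
	 * private mapping: the policy lands in this task's vma only
	 * and never touches the file's shared policy
	 */
	char *prv = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, fd, 0);
	mbind(prv, 1UL << 20, MPOL_PREFERRED, &mask,
	      sizeof(mask) * 8, 0);
	return 0;
}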
Patches 6-8 add support for shared policy on regular files, factoring
alloc_page_vma() into vma policy lookup and allocation of a page
given a policy--alloc_page_pol(). Then, the page cache alloc
function can look up the shared file policy via page offset and use
the same alloc_page_pol() to allocate the page based on that policy.
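Roughly, the factoring looks like the sketch below. Patches 6-8 are
not reproduced in this overview, so the exact names and signatures
[alloc_page_pol(), page_cache_alloc_pol()] are illustrative only:

/* allocate a page for 'off' under an already-resolved policy */
struct page *alloc_page_pol(gfp_t gfp, struct mempolicy *pol,
				unsigned long off);

struct page *alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma,
				unsigned long addr)
{
	/* factored: policy lookup, then policy-based allocation */
	struct mempolicy *pol = get_vma_policy(current, vma, addr);
	return alloc_page_pol(gfp, pol, addr);
}

struct page *page_cache_alloc_pol(struct address_space *mapping,
				pgoff_t pgoff)
{
	/* lookup returns NULL [default policy] if no shared policy */
	struct mempolicy *pol = mpol_shared_policy_lookup(
				mapping_shared_policy(mapping), pgoff);
	return alloc_page_pol(mapping_gfp_mask(mapping), pol, pgoff);
}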
Patch #9 defines an initial persistence model for shared policies on
shared mmap()ed files: a shared policy can only be installed on generic
files via a shared mmap()ing. Such a policy will persist as long as
any shared mmap()ings exist. Shared mappings of a file are tracked
by the i_mmap_writable count in the struct address_space. Patch #9
removes any existing policy when the i_mmap_writable count goes to zero.
Note that the existing shared policy persistence model for shmem segments
is different. Once installed, the shared policies persist until the segment
is destroyed. Because shmem goes through the same unmap path, shared
policies on shmem segments are marked with a SPOL_F_PERSIST flag to
prevent them from being removed on last detach [unmap]--i.e., to preserve
existing behavior.
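A sketch of that teardown check--the real hook and its locking live
in patch #9, which isn't quoted here, so the function name and the
shared_policy_persistent() helper are hypothetical stand-ins for the
SPOL_F_PERSIST test:

static void shared_policy_last_unmap(struct address_space *mapping)
{
	if (mapping->i_mmap_writable)
		return;		/* still mapped shared somewhere */

	/* shmem policies carry SPOL_F_PERSIST and survive unmap */
	if (mapping->spolicy &&
	    !shared_policy_persistent(mapping->spolicy))
		mpol_free_shared_policy(mapping);
}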
Also note that because we can remove a shared policy from a "live"
inode, we need to handle potential races with another task performing
a get_file_policy() on the same file via a file descriptor access
[read()/write()/...]. Patch #9 handles this by defining an RCU reader
critical region in get_file_policy() and by synchronizing with this
in mpol_free_shared_policy().
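The shape of that synchronization, sketched [patch #9 has the real
code; the get_file_policy() placement here is illustrative]:

struct mempolicy *get_file_policy(struct address_space *mapping,
				  pgoff_t pgoff)
{
	struct shared_policy *sp;
	struct mempolicy *pol = NULL;

	rcu_read_lock();
	sp = rcu_dereference(mapping->spolicy);
	if (sp)
		pol = mpol_shared_policy_lookup(sp, pgoff);
	rcu_read_unlock();
	/* lookup took a reference, so pol stays valid after unlock */
	return pol;
}

and, on the teardown side, in mpol_free_shared_policy():

	rcu_assign_pointer(mapping->spolicy, NULL);
	synchronize_rcu();	/* wait out get_file_policy() readers */
	/* now safe to tear down the rb tree and free sp */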
[I hope patch #9 will alleviate Andi's concerns about an unspecified
persistence model. Note that the model implemented by patch #9 could
easily be enhanced to persist beyond the last shared mapping--e.g.,
via some additional mbind() flags, such as MPOL_MF_[NO]PERSIST--and
possibly enhancements to numactl to set/remove shared policy on files.
I didn't want to pursue that in this patch set because I don't have a
use for it, and it will require some tool to list files with persistent
shared policy--perhaps an enhancement to lsof(8).]
Patch #10 adds a per cpuset control file--shared_file_policy--to
explicitly enable/disable shared policy on shared file mappings.
Default is disabled--current behavior. That is, even with all 11
patches applied, you'll have to explicitly enable shared file policy,
else the kernel will continue to ignore mbind() of address ranges backed
by a shared regular file mapping. This preserves existing behavior for
applications that might currently be installing memory policies on
shared regular file mappings, not realizing that they are ignored.
Such applications might break or behave unexpectedly if the kernel
suddenly starts using the shared policy. With the per cpuset control
defaulting to current behavior, an explicit action by a privileged
user is required to enable the new behavior.
[I hope patch #10 alleviates Christoph's concern about unexpected
interaction of shared policies on mmap()ed files in one cpuset with
file descriptor access from another cpuset. This can only happen if
the user/administrator explicitly enables shared file policies for an
application.]
Finally, patch #11 adds the generic file set|get_policy vm_ops to
actually hook up shared file mappings to memory policy. Without this
patch, the shared policy infrastructure enhancements in the previous
patches remain mostly unused, except for the existing shmem and added
hugetlbfs usage.
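The hookup itself is small--something like the sketch below. Patch
#11 isn't quoted in this message, so the generic_file_*_policy names
are illustrative [they mirror the shmem ops from patch #3]:

#ifdef CONFIG_NUMA
static int generic_file_set_policy(struct vm_area_struct *vma,
		unsigned long start, unsigned long end,
		struct mempolicy *new)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	struct shared_policy *sp = mapping_shared_policy(mapping);

	if (!sp) {
		sp = mpol_shared_policy_new(mapping, MPOL_DEFAULT, NULL);
		if (IS_ERR(sp))
			return PTR_ERR(sp);
	}
	return mpol_set_shared_policy(sp,
			vma_addr_to_pgoff(vma, start, PAGE_SHIFT),
			(end - start) >> PAGE_SHIFT, new);
}

static struct mempolicy *generic_file_get_policy(
		struct vm_area_struct *vma, unsigned long addr)
{
	struct shared_policy *sp =
			mapping_shared_policy(vma->vm_file->f_mapping);

	if (!sp)
		return NULL;	/* == default policy */
	return mpol_shared_policy_lookup(sp,
			vma_addr_to_pgoff(vma, addr, PAGE_SHIFT));
}
#endif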
---
Note: testing/code sizes/... covered in previous posting:
http://marc.info/?l=linux-mm&m=118002773528224&w=4
No sense in repeating this until we decide to go forward.
However, this series has been tested with 22-rc4-mm2 on ia64 and
x86_64 platforms.
Lee
* [PATCH/RFC 1/11] Shared Policy: move shared policy to inode/mapping
From: Lee Schermerhorn @ 2007-06-25 19:52 UTC
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Policy Infrastructure 1/11 move shared policy to inode/mapping
Against 2.6.22-rc4-mm2
This patch starts the process of cleaning up the shmem shared
[mem]policy infrastructure for use with hugetlb shmem segments
and eventually, I hope, for use with generic mmap()ed files.
1) add a struct shared_policy pointer to struct address_space
This effectively adds it to each inode in i_data. get_policy
vma ops will locate this via vma->vm_file->f_mapping->spolicy.
Modify [temporarily] mpol_shared_policy_init() to initialize
via a shared policy pointer.
A subsequent patch will make this struct dependent
on CONFIG_NUMA so as not to burden systems that
don't use NUMA. At that point all accesses to
spolicy will also be made dependent on 'NUMA via
wrapper functions/macros. I didn't do that in this
patch because I'd just have to change the wrappers
in the next patch where I dynamically alloc shared
policies.
2) create a shared_policy.h header and move the shared policy
support from mempolicy.h to shared_policy.h.
3) modify mpol_shared_policy_lookup() to return NULL if
spolicy pointer contains NULL. get_vma_policy() will
substitute the process policy, if any, else the default
policy.
4) modify shmem, the only existing user of shared policy
infrastructure, to work with changes above. At this
point, just use the shared_policy embedded in the shmem
inode info struct. A later patch will dynamically
allocate the struct when needed.
Actually, hugetlbfs inodes also contain a shared policy, but
the vma's get|set_policy ops are not hooked up. This patch
modifies hugetlbfs_get_inode() to initialize the shared
policy struct embedded in its info struct via the i_mapping's
spolicy pointer. A later patch will "hook up" hugetlb
mappings to the get|set_policy ops.
5) some miscellaneous cleanup to use "sp" for shared policy
in routines that take it as an arg. Prior use of "info"
seemed misleading, and use of "p" was just plain
inconsistent.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
fs/hugetlbfs/inode.c | 4 +-
include/linux/fs.h | 3 +
include/linux/mempolicy.h | 54 -----------------------------------
include/linux/shared_policy.h | 64 ++++++++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 27 +++++++++--------
mm/shmem.c | 38 ++++++++++++------------
6 files changed, 104 insertions(+), 86 deletions(-)
Index: Linux/include/linux/fs.h
===================================================================
--- Linux.orig/include/linux/fs.h 2007-06-22 13:07:48.000000000 -0400
+++ Linux/include/linux/fs.h 2007-06-22 13:10:30.000000000 -0400
@@ -523,9 +523,12 @@ struct address_space {
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+
+ struct shared_policy *spolicy;
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-06-22 13:07:48.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-06-22 13:10:30.000000000 -0400
@@ -30,12 +30,12 @@
#include <linux/mmzone.h>
#include <linux/slab.h>
-#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/nodemask.h>
struct vm_area_struct;
struct mm_struct;
+#include <linux/shared_policy.h>
#ifdef CONFIG_NUMA
@@ -113,34 +113,6 @@ static inline int mpol_equal(struct memp
#define mpol_set_vma_default(vma) ((vma)->vm_policy = NULL)
-/*
- * Tree of shared policies for a shared memory region.
- * Maintain the policies in a pseudo mm that contains vmas. The vmas
- * carry the policy. As a special twist the pseudo mm is indexed in pages, not
- * bytes, so that we can work with shared memory segments bigger than
- * unsigned long.
- */
-
-struct sp_node {
- struct rb_node nd;
- unsigned long start, end;
- struct mempolicy *policy;
-};
-
-struct shared_policy {
- struct rb_root root;
- spinlock_t lock;
-};
-
-void mpol_shared_policy_init(struct shared_policy *info, int policy,
- nodemask_t *nodes);
-int mpol_set_shared_policy(struct shared_policy *info,
- struct vm_area_struct *vma,
- struct mempolicy *new);
-void mpol_free_shared_policy(struct shared_policy *p);
-struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
- unsigned long idx);
-
extern void numa_default_policy(void);
extern void numa_policy_init(void);
extern void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *new);
@@ -192,30 +164,6 @@ static inline struct mempolicy *mpol_cop
return NULL;
}
-struct shared_policy {};
-
-static inline int mpol_set_shared_policy(struct shared_policy *info,
- struct vm_area_struct *vma,
- struct mempolicy *new)
-{
- return -EINVAL;
-}
-
-static inline void mpol_shared_policy_init(struct shared_policy *info,
- int policy, nodemask_t *nodes)
-{
-}
-
-static inline void mpol_free_shared_policy(struct shared_policy *p)
-{
-}
-
-static inline struct mempolicy *
-mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
-{
- return NULL;
-}
-
#define vma_policy(vma) NULL
#define vma_set_policy(vma, pol) do {} while(0)
Index: Linux/include/linux/shared_policy.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ Linux/include/linux/shared_policy.h 2007-06-22 13:10:30.000000000 -0400
@@ -0,0 +1,64 @@
+#ifndef _LINUX_SHARED_POLICY_H
+#define _LINUX_SHARED_POLICY_H 1
+
+#include <linux/rbtree.h>
+
+/*
+ * Tree of shared policies for shared memory regions and memory
+ * mapped files.
+ * TODO: wean the low level shared policies from the notion of vmas;
+ *	 just use inode, offset, length.
+ * Maintain the policies in a pseudo mm that contains vmas. The vmas
+ * carry the policy. As a special twist the pseudo mm is indexed in pages, not
+ * bytes, so that we can work with shared memory segments bigger than
+ * unsigned long.
+ */
+
+#ifdef CONFIG_NUMA
+
+struct sp_node {
+ struct rb_node nd;
+ unsigned long start, end;
+ struct mempolicy *policy;
+};
+
+struct shared_policy {
+ struct rb_root root;
+ spinlock_t lock; /* protects rb tree */
+};
+
+void mpol_shared_policy_init(struct shared_policy *, int, nodemask_t *);
+int mpol_set_shared_policy(struct shared_policy *,
+ struct vm_area_struct *,
+ struct mempolicy *);
+void mpol_free_shared_policy(struct shared_policy *);
+struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
+ unsigned long);
+
+#else /* !NUMA */
+
+struct shared_policy {};
+
+static inline int mpol_set_shared_policy(struct shared_policy *info,
+ struct vm_area_struct *vma,
+ struct mempolicy *new)
+{
+ return -EINVAL;
+}
+static inline void mpol_shared_policy_init(struct shared_policy *info,
+ int policy, nodemask_t *nodes)
+{
+}
+
+static inline void mpol_free_shared_policy(struct shared_policy *p)
+{
+}
+
+static inline struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+ return NULL;
+}
+#endif
+
+#endif /* _LINUX_SHARED_POLICY_H */
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-22 13:07:48.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-22 13:10:30.000000000 -0400
@@ -1446,7 +1446,7 @@ mpol_shared_policy_lookup(struct shared_
struct mempolicy *pol = NULL;
struct sp_node *sn;
- if (!sp->root.rb_node)
+ if (!sp || !sp->root.rb_node)
return NULL;
spin_lock(&sp->lock);
sn = sp_lookup(sp, idx, idx+1);
@@ -1528,11 +1528,12 @@ restart:
return 0;
}
-void mpol_shared_policy_init(struct shared_policy *info, int policy,
+void mpol_shared_policy_init(struct shared_policy *sp, int policy,
nodemask_t *policy_nodes)
{
- info->root = RB_ROOT;
- spin_lock_init(&info->lock);
+
+ sp->root = RB_ROOT;
+ spin_lock_init(&sp->lock);
if (policy != MPOL_DEFAULT) {
struct mempolicy *newpol;
@@ -1546,13 +1547,13 @@ void mpol_shared_policy_init(struct shar
memset(&pvma, 0, sizeof(struct vm_area_struct));
/* Policy covers entire file */
pvma.vm_end = TASK_SIZE;
- mpol_set_shared_policy(info, &pvma, newpol);
+ mpol_set_shared_policy(sp, &pvma, newpol);
mpol_free(newpol);
}
}
}
-int mpol_set_shared_policy(struct shared_policy *info,
+int mpol_set_shared_policy(struct shared_policy *sp,
struct vm_area_struct *vma, struct mempolicy *npol)
{
int err;
@@ -1569,30 +1570,30 @@ int mpol_set_shared_policy(struct shared
if (!new)
return -ENOMEM;
}
- err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+ err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff+sz, new);
if (err && new)
kmem_cache_free(sn_cache, new);
return err;
}
/* Free a backing policy store on inode delete. */
-void mpol_free_shared_policy(struct shared_policy *p)
+void mpol_free_shared_policy(struct shared_policy *sp)
{
struct sp_node *n;
struct rb_node *next;
- if (!p->root.rb_node)
+ if (!sp->root.rb_node)
return;
- spin_lock(&p->lock);
- next = rb_first(&p->root);
+ spin_lock(&sp->lock);
+ next = rb_first(&sp->root);
while (next) {
n = rb_entry(next, struct sp_node, nd);
next = rb_next(&n->nd);
- rb_erase(&n->nd, &p->root);
+ rb_erase(&n->nd, &sp->root);
mpol_free(n->policy);
kmem_cache_free(sn_cache, n);
}
- spin_unlock(&p->lock);
+ spin_unlock(&sp->lock);
}
int mpol_parse_options(char *value, int *policy, nodemask_t *policy_nodes)
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c 2007-06-22 13:07:48.000000000 -0400
+++ Linux/mm/shmem.c 2007-06-22 13:10:30.000000000 -0400
@@ -962,7 +962,7 @@ redirty:
}
#ifdef CONFIG_NUMA
-static struct page *shmem_swapin_async(struct shared_policy *p,
+static struct page *shmem_swapin_async(struct shared_policy *sp,
swp_entry_t entry, unsigned long idx)
{
struct page *page;
@@ -972,41 +972,39 @@ static struct page *shmem_swapin_async(s
memset(&pvma, 0, sizeof(struct vm_area_struct));
pvma.vm_end = PAGE_SIZE;
pvma.vm_pgoff = idx;
- pvma.vm_policy = mpol_shared_policy_lookup(p, idx);
+ pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
page = read_swap_cache_async(entry, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
-struct page *shmem_swapin(struct shmem_inode_info *info, swp_entry_t entry,
- unsigned long idx)
+struct page *shmem_swapin(struct shared_policy *sp, swp_entry_t entry,
+ unsigned long idx)
{
- struct shared_policy *p = &info->policy;
int i, num;
struct page *page;
unsigned long offset;
num = valid_swaphandles(entry, &offset);
for (i = 0; i < num; offset++, i++) {
- page = shmem_swapin_async(p,
+ page = shmem_swapin_async(sp,
swp_entry(swp_type(entry), offset), idx);
if (!page)
break;
page_cache_release(page);
}
lru_add_drain(); /* Push any new pages onto the LRU now */
- return shmem_swapin_async(p, entry, idx);
+ return shmem_swapin_async(sp, entry, idx);
}
static struct page *
-shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info,
- unsigned long idx)
+shmem_alloc_page(gfp_t gfp, struct shared_policy *sp, unsigned long idx)
{
struct vm_area_struct pvma;
struct page *page;
memset(&pvma, 0, sizeof(struct vm_area_struct));
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
+ pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
@@ -1015,14 +1013,14 @@ shmem_alloc_page(gfp_t gfp, struct shmem
}
#else
static inline struct page *
-shmem_swapin(struct shmem_inode_info *info,swp_entry_t entry,unsigned long idx)
+shmem_swapin(void *sp, swp_entry_t entry, unsigned long idx)
{
swapin_readahead(entry, 0, NULL);
return read_swap_cache_async(entry, NULL, 0);
}
static inline struct page *
-shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
+shmem_alloc_page(gfp_t gfp, void *sp, unsigned long idx)
{
return alloc_page(gfp | __GFP_ZERO);
}
@@ -1091,7 +1089,7 @@ repeat:
*type = VM_FAULT_MAJOR;
}
spin_unlock(&info->lock);
- swappage = shmem_swapin(info, swap, idx);
+ swappage = shmem_swapin(mapping->spolicy, swap, idx);
if (!swappage) {
spin_lock(&info->lock);
entry = shmem_swp_alloc(info, idx, sgp);
@@ -1204,7 +1202,7 @@ repeat:
if (!filepage) {
spin_unlock(&info->lock);
filepage = shmem_alloc_page(mapping_gfp_mask(mapping),
- info,
+ mapping->spolicy,
idx);
if (!filepage) {
shmem_unacct_blocks(info->flags, 1);
@@ -1370,8 +1368,9 @@ shmem_get_inode(struct super_block *sb,
case S_IFREG:
inode->i_op = &shmem_inode_operations;
inode->i_fop = &shmem_file_operations;
- mpol_shared_policy_init(&info->policy, sbinfo->policy,
- &sbinfo->policy_nodes);
+ inode->i_mapping->spolicy = &info->policy;
+ mpol_shared_policy_init(inode->i_mapping->spolicy,
+ sbinfo->policy, &sbinfo->policy_nodes);
break;
case S_IFDIR:
inc_nlink(inode);
@@ -1385,8 +1384,9 @@ shmem_get_inode(struct super_block *sb,
* Must not load anything in the rbtree,
* mpol_free_shared_policy will not be called.
*/
- mpol_shared_policy_init(&info->policy, MPOL_DEFAULT,
- NULL);
+ inode->i_mapping->spolicy = &info->policy;
+ mpol_shared_policy_init(inode->i_mapping->spolicy,
+ MPOL_DEFAULT, NULL);
break;
}
} else if (sbinfo->max_inodes) {
@@ -2287,7 +2287,7 @@ static void shmem_destroy_inode(struct i
{
if ((inode->i_mode & S_IFMT) == S_IFREG) {
/* only struct inode is valid if it's an inline symlink */
- mpol_free_shared_policy(&SHMEM_I(inode)->policy);
+ mpol_free_shared_policy(inode->i_mapping->spolicy);
}
shmem_acl_destroy_inode(inode);
kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
Index: Linux/fs/hugetlbfs/inode.c
===================================================================
--- Linux.orig/fs/hugetlbfs/inode.c 2007-06-22 13:07:48.000000000 -0400
+++ Linux/fs/hugetlbfs/inode.c 2007-06-22 13:10:30.000000000 -0400
@@ -364,7 +364,9 @@ static struct inode *hugetlbfs_get_inode
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
INIT_LIST_HEAD(&inode->i_mapping->private_list);
info = HUGETLBFS_I(inode);
- mpol_shared_policy_init(&info->policy, MPOL_DEFAULT, NULL);
+ inode->i_mapping->spolicy = &info->policy;
+ mpol_shared_policy_init(inode->i_mapping->spolicy,
+ MPOL_DEFAULT, NULL);
switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
* [PATCH/RFC 2/11] Shared Policy: allocate shared policies as needed
From: Lee Schermerhorn @ 2007-06-25 19:52 UTC
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Policy Infrastructure 2/11 dynamically alloc shared policies
Against 2.6.22-rc4-mm2
Remove shared policy structs from shmem and hugetlbfs inode
info structs and dynamically allocate them as needed.
Make shared policy pointer in address_space dependent on
CONFIG_NUMA. Access [get/set] via wrappers that also depend
on 'NUMA [to avoid excessive #ifdef in .c files].
Initialize shmem and hugetlbfs inode/address_space spolicy
pointer to null, unless superblock [mount] specifies a
non-default policy.
set_policy() ops must create a shared_policy struct from a new
kmem cache when a new policy is installed and no spolicy exists.
mpol_shared_policy_init() is replaced with mpol_shared_policy_new()
to accomplish this.
shmem must create/initialize a shared_policy when an inode is
allocated if the tmpfs super-block/mount point specifies a
non-default policy.
mpol_free_shared_policy() must free the spolicy, if any, when
the inode is destroyed.
Note: I considered reference counting the shared
policy, but I don't think this is necessary because
they are always 1-for-1 to a given inode and are
only/always deleted when the inode is destroyed.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
fs/hugetlbfs/inode.c | 15 +++++++------
fs/inode.c | 1
include/linux/fs.h | 20 +++++++++++++++++
include/linux/hugetlb.h | 1
include/linux/shared_policy.h | 17 ++++++++-------
include/linux/shmem_fs.h | 1
mm/mempolicy.c | 43 +++++++++++++++++++++++++++++++++-----
mm/shmem.c | 47 +++++++++++++++++++++++++++---------------
8 files changed, 108 insertions(+), 37 deletions(-)
Index: Linux/include/linux/shared_policy.h
===================================================================
--- Linux.orig/include/linux/shared_policy.h 2007-06-22 13:10:30.000000000 -0400
+++ Linux/include/linux/shared_policy.h 2007-06-22 13:10:34.000000000 -0400
@@ -1,6 +1,7 @@
#ifndef _LINUX_SHARED_POLICY_H
#define _LINUX_SHARED_POLICY_H 1
+#include <linux/fs.h>
#include <linux/rbtree.h>
/*
@@ -27,12 +28,13 @@ struct shared_policy {
spinlock_t lock; /* protects rb tree */
};
-void mpol_shared_policy_init(struct shared_policy *, int, nodemask_t *);
-int mpol_set_shared_policy(struct shared_policy *,
+extern struct shared_policy *mpol_shared_policy_new(struct address_space *,
+ int, nodemask_t *);
+extern int mpol_set_shared_policy(struct shared_policy *,
struct vm_area_struct *,
struct mempolicy *);
-void mpol_free_shared_policy(struct shared_policy *);
-struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
+extern void mpol_free_shared_policy(struct address_space *);
+extern struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
unsigned long);
#else /* !NUMA */
@@ -45,12 +47,12 @@ static inline int mpol_set_shared_policy
{
return -EINVAL;
}
-static inline void mpol_shared_policy_init(struct shared_policy *info,
- int policy, nodemask_t *nodes)
+static inline struct shared_policy *mpol_shared_policy_new(
+			struct address_space *mapping, int policy,
+			nodemask_t *nodes)
{
+	return NULL;
}
-static inline void mpol_free_shared_policy(struct shared_policy *p)
+static inline void mpol_free_shared_policy(struct shared_policy *sp)
{
}
@@ -59,6 +61,7 @@ mpol_shared_policy_lookup(struct shared_
{
return NULL;
}
+
#endif
#endif /* _LINUX_SHARED_POLICY_H */
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-22 13:10:30.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-22 13:10:34.000000000 -0400
@@ -99,6 +99,7 @@
#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */
static struct kmem_cache *policy_cache;
+static struct kmem_cache *sp_cache;
static struct kmem_cache *sn_cache;
#define PDprintk(fmt...)
@@ -1528,10 +1529,17 @@ restart:
return 0;
}
-void mpol_shared_policy_init(struct shared_policy *sp, int policy,
- nodemask_t *policy_nodes)
+/*
+ * allocate and initialize a shared policy struct
+ */
+struct shared_policy *mpol_shared_policy_new(struct address_space *mapping,
+ int policy, nodemask_t *policy_nodes)
{
+ struct shared_policy *sp, *spx;
+ sp = kmem_cache_alloc(sp_cache, GFP_KERNEL);
+ if (!sp)
+ return ERR_PTR(-ENOMEM);
sp->root = RB_ROOT;
spin_lock_init(&sp->lock);
@@ -1551,6 +1559,20 @@ void mpol_shared_policy_init(struct shar
mpol_free(newpol);
}
}
+
+ /*
+ * resolve potential set/set race
+ */
+ spin_lock(&mapping->i_mmap_lock);
+ spx = mapping->spolicy;
+ if (!spx)
+ mapping->spolicy = sp;
+ else {
+ kmem_cache_free(sp_cache, sp);
+ sp = spx;
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ return sp;
}
int mpol_set_shared_policy(struct shared_policy *sp,
@@ -1576,14 +1598,20 @@ int mpol_set_shared_policy(struct shared
return err;
}
-/* Free a backing policy store on inode delete. */
-void mpol_free_shared_policy(struct shared_policy *sp)
+/*
+ * Free a backing policy store on inode delete.
+ */
+void mpol_free_shared_policy(struct address_space *mapping)
{
+ struct shared_policy *sp = mapping->spolicy;
struct sp_node *n;
struct rb_node *next;
- if (!sp->root.rb_node)
+ if (!sp)
return;
+
+ mapping->spolicy = NULL;
+
spin_lock(&sp->lock);
next = rb_first(&sp->root);
while (next) {
@@ -1594,6 +1622,7 @@ void mpol_free_shared_policy(struct shar
kmem_cache_free(sn_cache, n);
}
spin_unlock(&sp->lock);
+ kmem_cache_free(sp_cache, sp);
}
int mpol_parse_options(char *value, int *policy, nodemask_t *policy_nodes)
@@ -1668,6 +1697,10 @@ void __init numa_policy_init(void)
sizeof(struct mempolicy),
0, SLAB_PANIC, NULL, NULL);
+ sp_cache = kmem_cache_create("shared_policy",
+ sizeof(struct shared_policy),
+ 0, SLAB_PANIC, NULL, NULL);
+
sn_cache = kmem_cache_create("shared_policy_node",
sizeof(struct sp_node),
0, SLAB_PANIC, NULL, NULL);
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c 2007-06-22 13:10:30.000000000 -0400
+++ Linux/mm/shmem.c 2007-06-22 13:10:34.000000000 -0400
@@ -1089,7 +1089,8 @@ repeat:
*type = VM_FAULT_MAJOR;
}
spin_unlock(&info->lock);
- swappage = shmem_swapin(mapping->spolicy, swap, idx);
+ swappage = shmem_swapin(mapping_shared_policy(mapping),
+ swap, idx);
if (!swappage) {
spin_lock(&info->lock);
entry = shmem_swp_alloc(info, idx, sgp);
@@ -1202,8 +1203,8 @@ repeat:
if (!filepage) {
spin_unlock(&info->lock);
filepage = shmem_alloc_page(mapping_gfp_mask(mapping),
- mapping->spolicy,
- idx);
+ mapping_shared_policy(mapping),
+ idx);
if (!filepage) {
shmem_unacct_blocks(info->flags, 1);
shmem_free_blocks(inode, 1);
@@ -1283,18 +1284,28 @@ static struct page *shmem_fault(struct v
#ifdef CONFIG_NUMA
int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
{
- struct inode *i = vma->vm_file->f_path.dentry->d_inode;
- return mpol_set_shared_policy(&SHMEM_I(i)->policy, vma, new);
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct shared_policy *sp = mapping_shared_policy(mapping);
+
+ if (!sp) {
+ sp = mpol_shared_policy_new(mapping, MPOL_DEFAULT, NULL);
+ if (IS_ERR(sp))
+ return PTR_ERR(sp);
+ }
+ return mpol_set_shared_policy(sp, vma, new);
}
struct mempolicy *
shmem_get_policy(struct vm_area_struct *vma, unsigned long addr)
{
- struct inode *i = vma->vm_file->f_path.dentry->d_inode;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct shared_policy *sp = mapping_shared_policy(mapping);
unsigned long idx;
+ if (!sp)
+ return NULL; /* == default policy */
idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
- return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);
+ return mpol_shared_policy_lookup(sp, idx);
}
#endif
@@ -1368,9 +1379,16 @@ shmem_get_inode(struct super_block *sb,
case S_IFREG:
inode->i_op = &shmem_inode_operations;
inode->i_fop = &shmem_file_operations;
- inode->i_mapping->spolicy = &info->policy;
- mpol_shared_policy_init(inode->i_mapping->spolicy,
- sbinfo->policy, &sbinfo->policy_nodes);
+ if (sbinfo->policy != MPOL_DEFAULT) {
+ struct address_space * mapping;
+ struct shared_policy *sp;
+ mapping = inode->i_mapping;
+ sp = mpol_shared_policy_new(mapping,
+ sbinfo->policy,
+ &sbinfo->policy_nodes);
+ if (!IS_ERR(sp))
+ set_mapping_shared_policy(mapping, sp);
+ }
break;
case S_IFDIR:
inc_nlink(inode);
@@ -1381,12 +1399,9 @@ shmem_get_inode(struct super_block *sb,
break;
case S_IFLNK:
/*
- * Must not load anything in the rbtree,
- * mpol_free_shared_policy will not be called.
+ * This case only exists so that we don't attempt
+ * to call init_special_inode() for sym links.
*/
- inode->i_mapping->spolicy = &info->policy;
- mpol_shared_policy_init(inode->i_mapping->spolicy,
- MPOL_DEFAULT, NULL);
break;
}
} else if (sbinfo->max_inodes) {
@@ -2287,7 +2302,7 @@ static void shmem_destroy_inode(struct i
{
if ((inode->i_mode & S_IFMT) == S_IFREG) {
/* only struct inode is valid if it's an inline symlink */
- mpol_free_shared_policy(inode->i_mapping->spolicy);
+ mpol_free_shared_policy(inode->i_mapping);
}
shmem_acl_destroy_inode(inode);
kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
Index: Linux/fs/hugetlbfs/inode.c
===================================================================
--- Linux.orig/fs/hugetlbfs/inode.c 2007-06-22 13:10:30.000000000 -0400
+++ Linux/fs/hugetlbfs/inode.c 2007-06-22 13:10:34.000000000 -0400
@@ -354,7 +354,6 @@ static struct inode *hugetlbfs_get_inode
inode = new_inode(sb);
if (inode) {
- struct hugetlbfs_inode_info *info;
inode->i_mode = mode;
inode->i_uid = uid;
inode->i_gid = gid;
@@ -363,10 +362,9 @@ static struct inode *hugetlbfs_get_inode
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
INIT_LIST_HEAD(&inode->i_mapping->private_list);
- info = HUGETLBFS_I(inode);
- inode->i_mapping->spolicy = &info->policy;
- mpol_shared_policy_init(inode->i_mapping->spolicy,
- MPOL_DEFAULT, NULL);
+ /*
+ * leave i_mapping->spolicy NULL [default policy]
+ */
switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
@@ -379,7 +377,10 @@ static struct inode *hugetlbfs_get_inode
inode->i_op = &hugetlbfs_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
- /* directory inodes start off with i_nlink == 2 (for "." entry) */
+ /*
+ * directory inodes start off with i_nlink == 2
+ * (for "." entry)
+ */
inc_nlink(inode);
break;
case S_IFLNK:
@@ -546,7 +547,7 @@ static struct inode *hugetlbfs_alloc_ino
static void hugetlbfs_destroy_inode(struct inode *inode)
{
hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
- mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
+ mpol_free_shared_policy(inode->i_mapping);
kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
}
Index: Linux/include/linux/hugetlb.h
===================================================================
--- Linux.orig/include/linux/hugetlb.h 2007-06-22 13:07:48.000000000 -0400
+++ Linux/include/linux/hugetlb.h 2007-06-22 13:10:34.000000000 -0400
@@ -149,7 +149,6 @@ struct hugetlbfs_sb_info {
struct hugetlbfs_inode_info {
- struct shared_policy policy;
struct inode vfs_inode;
};
Index: Linux/include/linux/shmem_fs.h
===================================================================
--- Linux.orig/include/linux/shmem_fs.h 2007-06-22 13:07:48.000000000 -0400
+++ Linux/include/linux/shmem_fs.h 2007-06-22 13:10:34.000000000 -0400
@@ -14,7 +14,6 @@ struct shmem_inode_info {
unsigned long alloced; /* data pages alloced to file */
unsigned long swapped; /* subtotal assigned to swap */
unsigned long next_index; /* highest alloced index + 1 */
- struct shared_policy policy; /* NUMA memory alloc policy */
struct page *i_indirect; /* top indirect blocks page */
swp_entry_t i_direct[SHMEM_NR_DIRECT]; /* first blocks */
struct list_head swaplist; /* chain of maybes on swap */
Index: Linux/fs/inode.c
===================================================================
--- Linux.orig/fs/inode.c 2007-06-22 13:07:48.000000000 -0400
+++ Linux/fs/inode.c 2007-06-22 13:10:34.000000000 -0400
@@ -163,6 +163,7 @@ static struct inode *alloc_inode(struct
mapping->backing_dev_info = bdi;
}
inode->i_private = NULL;
+ set_mapping_shared_policy(mapping, NULL);
inode->i_mapping = mapping;
}
return inode;
Index: Linux/include/linux/fs.h
===================================================================
--- Linux.orig/include/linux/fs.h 2007-06-22 13:10:30.000000000 -0400
+++ Linux/include/linux/fs.h 2007-06-22 13:10:34.000000000 -0400
@@ -528,7 +528,9 @@ struct address_space {
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+#ifdef CONFIG_NUMA
struct shared_policy *spolicy;
+#endif
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
@@ -536,6 +538,24 @@ struct address_space {
* of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
*/
+#ifdef CONFIG_NUMA
+static inline struct shared_policy *
+mapping_shared_policy(struct address_space *mapping)
+{
+ return mapping->spolicy;
+}
+
+static inline void set_mapping_shared_policy(struct address_space *mapping,
+ struct shared_policy *sp)
+{
+ mapping->spolicy = sp;
+}
+
+#else
+#define mapping_shared_policy(M) (NULL)
+#define set_mapping_shared_policy(M, SP) /* nothing */
+#endif
+
struct block_device {
dev_t bd_dev; /* not a kdev_t - it's a search key */
struct inode * bd_inode; /* will die */
* [PATCH/RFC 3/11] Shared Policy: let vma policy ops handle sub-vma policies
From: Lee Schermerhorn @ 2007-06-25 19:52 UTC
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Policy Infrastructure 3/11 let vma policy ops handle sub-vma policies
Against 2.6.22-rc4-mm2
Shared policies can handle subranges of an object, so no need to
split the vma for these mappings. So, modify mbind_range() and
policy_vma() to call the set_policy vma op, if one exists, for
shared mappings. Similarly, modify get_vma_policy() to call the
get_policy() op, if one exists, only for shared mappings. This will
simplify the fix to numa_maps to properly handle display of
shared policies from different tasks.
We don't want private mappings mucking with the shared policy of
the mapped file, if any, so use vma policy for private mappings.
We'll still split vmas for private mappings.
Could use rb_tree for vma subrange policies as well, but
not in this series.
Also, we can't use the same policy ops for nonlinear mappings because
we don't have a 1-to-1 correspondence between pgoff and vma relative
address. So, continue to split vmas for non-linear mappings.
At some point, one could provide policy vm_ops support
for non-linear mappings if one had a use for this. Might
get a bit messy...
As a result, this patch enforces a defined semantic for set|get_policy()
ops: they only get called for linear, shared mappings, and in that
case we don't split the vma. Only shmem currently has set|get_policy()
ops, and this seems an appropriate semantic for shared objects.
Now, since the vma start and end addresses no longer specify the
range to which a new policy applies, we need to add start/end address
args to the vma policy ops. The set_policy op/handler just calls into
mpol_set_shared_policy() to do the real work, so we could just pass
the start and end address, along with the vma, down to that function.
However, to eliminate the need for the pseudo-vma on the stack when
initializing the shared policy for an inode with non-default "superblock
policy", we change the interface to mpol_set_shared_policy() to take a
page offset and size in pages. We compute the page offset and size in
the shmem set_policy handler from the vma and the address range.
Note: Added helper function "vma_addr_to_pgoff()" for readability.
This is similar to [linear_]page_index() but takes a shift argument
so that it can be used for calculating page indices for interleaving
for both base pages and huge pages. Perhaps this can be merged with
other similar functions?
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/mm.h | 21 +++++++++--
include/linux/shared_policy.h | 7 +--
ipc/shm.c | 5 +-
mm/mempolicy.c | 74 +++++++++++++++++++++++++++++-------------
mm/shmem.c | 9 +++--
5 files changed, 81 insertions(+), 35 deletions(-)
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h 2007-06-22 13:07:48.000000000 -0400
+++ Linux/include/linux/mm.h 2007-06-22 13:10:35.000000000 -0400
@@ -234,11 +234,14 @@ struct vm_operations_struct {
struct page *(*nopage)(struct vm_area_struct *area,
unsigned long address, int *type);
- /* notification that a previously read-only page is about to become
- * writable, if an error is returned it will cause a SIGBUS */
+ /*
+ * notification that a previously read-only page is about to become
+ * writable, if an error is returned it will cause a SIGBUS
+ */
int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
#ifdef CONFIG_NUMA
- int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+ int (*set_policy)(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct mempolicy *new);
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
unsigned long addr);
int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
@@ -703,7 +706,8 @@ static inline int page_mapped(struct pag
extern void show_free_areas(void);
#ifdef CONFIG_SHMEM
-int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+int shmem_set_policy(struct vm_area_struct *, unsigned long, unsigned long,
+ struct mempolicy *);
struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
unsigned long addr);
int shmem_lock(struct file *file, int lock, struct user_struct *user);
@@ -1051,6 +1055,15 @@ extern void si_meminfo_node(struct sysin
#ifdef CONFIG_NUMA
extern void setup_per_cpu_pageset(void);
+
+/*
+ * Address to offset for shared mapping policy lookup.
+ */
+static inline pgoff_t vma_addr_to_pgoff(struct vm_area_struct *vma,
+ unsigned long addr, int shift)
+{
+ return ((addr - vma->vm_start) >> shift) + vma->vm_pgoff;
+}
#else
static inline void setup_per_cpu_pageset(void) {}
#endif
Index: Linux/include/linux/shared_policy.h
===================================================================
--- Linux.orig/include/linux/shared_policy.h 2007-06-22 13:10:34.000000000 -0400
+++ Linux/include/linux/shared_policy.h 2007-06-22 13:10:35.000000000 -0400
@@ -30,9 +30,8 @@ struct shared_policy {
extern struct shared_policy *mpol_shared_policy_new(struct address_space *,
int, nodemask_t *);
-extern int mpol_set_shared_policy(struct shared_policy *,
- struct vm_area_struct *,
- struct mempolicy *);
+extern int mpol_set_shared_policy(struct shared_policy *, pgoff_t,
+ unsigned long, struct mempolicy *);
extern void mpol_free_shared_policy(struct address_space *);
extern struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
unsigned long);
@@ -42,7 +41,7 @@ extern struct mempolicy *mpol_shared_pol
struct shared_policy {};
static inline int mpol_set_shared_policy(struct shared_policy *info,
- struct vm_area_struct *vma,
+ pgoff_t pgoff, unsigned long sz,
struct mempolicy *new)
{
return -EINVAL;
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-22 13:10:34.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-22 13:10:35.000000000 -0400
@@ -374,20 +374,28 @@ check_range(struct mm_struct *mm, unsign
return first;
}
-/* Apply policy to a single VMA */
-static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
+/*
+ * Apply policy to a single VMA, or a subrange thereof
+ */
+static int policy_vma(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct mempolicy *new)
{
int err = 0;
- struct mempolicy *old = vma->vm_policy;
PDprintk("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
- vma->vm_start, vma->vm_end, vma->vm_pgoff,
+ start, end, vma_addr_to_pgoff(vma, start, PAGE_SHIFT),
vma->vm_ops, vma->vm_file,
vma->vm_ops ? vma->vm_ops->set_policy : NULL);
- if (vma->vm_ops && vma->vm_ops->set_policy)
- err = vma->vm_ops->set_policy(vma, new);
- if (!err) {
+ /*
+ * set_policy op, if exists, is responsible for policy
+ * reference counts.
+ */
+ if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
+ vma->vm_ops && vma->vm_ops->set_policy)
+ err = vma->vm_ops->set_policy(vma, start, end, new);
+ else {
+ struct mempolicy *old = vma->vm_policy;
mpol_get(new);
vma->vm_policy = new;
mpol_free(old);
@@ -404,13 +412,30 @@ static int mbind_range(struct vm_area_st
err = 0;
for (; vma && vma->vm_start < end; vma = next) {
+ unsigned long eend = min(end, vma->vm_end);
next = vma->vm_next;
+ if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
+ vma->vm_ops && vma->vm_ops->set_policy) {
+ /*
+ * set_policy op handles policies on sub-range
+ * of vma for linear, shared mappings
+ */
+ err = policy_vma(vma, start, eend, new);
+ if (err)
+ break;
+ continue;
+ }
+
+ /*
+ * for private mappings and shared mappings of objects without
+ * a set_policy vma op, split the vma and use vma policy
+ */
if (vma->vm_start < start)
err = split_vma(vma->vm_mm, vma, start, 1);
if (!err && vma->vm_end > end)
- err = split_vma(vma->vm_mm, vma, end, 0);
+ err = split_vma(vma->vm_mm, vma, eend, 0);
if (!err)
- err = policy_vma(vma, new);
+ err = policy_vma(vma, start, eend, new);
if (err)
break;
}
@@ -538,7 +563,11 @@ long do_get_mempolicy(int *policy, nodem
up_read(&mm->mmap_sem);
return -EFAULT;
}
- if (vma->vm_ops && vma->vm_ops->get_policy)
+ /*
+ * use get_policy op, if any, for linear, shared mappings
+ */
+ if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
+ vma->vm_ops && vma->vm_ops->get_policy)
pol = vma->vm_ops->get_policy(vma, addr);
else
pol = vma->vm_policy;
@@ -1080,7 +1109,11 @@ static struct mempolicy * get_vma_policy
struct mempolicy *pol = task->mempolicy;
if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy)
+ /*
+ * use get_policy op, if any, for shared mappings
+ */
+ if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
+ vma->vm_ops && vma->vm_ops->get_policy)
pol = vma->vm_ops->get_policy(vma, addr);
else if (vma->vm_policy &&
vma->vm_policy->policy != MPOL_DEFAULT)
@@ -1549,13 +1582,10 @@ struct shared_policy *mpol_shared_policy
/* Falls back to MPOL_DEFAULT on any error */
newpol = mpol_new(policy, policy_nodes);
if (!IS_ERR(newpol)) {
- /* Create pseudo-vma that contains just the policy */
- struct vm_area_struct pvma;
-
- memset(&pvma, 0, sizeof(struct vm_area_struct));
/* Policy covers entire file */
- pvma.vm_end = TASK_SIZE;
- mpol_set_shared_policy(sp, &pvma, newpol);
+ mpol_set_shared_policy(sp,
+ 0UL, TASK_SIZE >> PAGE_SHIFT,
+ newpol);
mpol_free(newpol);
}
}
@@ -1576,23 +1606,23 @@ struct shared_policy *mpol_shared_policy
}
int mpol_set_shared_policy(struct shared_policy *sp,
- struct vm_area_struct *vma, struct mempolicy *npol)
+ pgoff_t pgoff, unsigned long sz,
+ struct mempolicy *npol)
{
int err;
struct sp_node *new = NULL;
- unsigned long sz = vma_pages(vma);
PDprintk("set_shared_policy %lx sz %lu %d %lx\n",
- vma->vm_pgoff,
+ pgoff,
sz, npol? npol->policy : -1,
npol ? nodes_addr(npol->v.nodes)[0] : -1);
if (npol) {
- new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+ new = sp_alloc(pgoff, pgoff + sz, npol);
if (!new)
return -ENOMEM;
}
- err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+ err = shared_policy_replace(sp, pgoff, pgoff+sz, new);
if (err && new)
kmem_cache_free(sn_cache, new);
return err;
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c 2007-06-22 13:10:34.000000000 -0400
+++ Linux/mm/shmem.c 2007-06-22 13:10:35.000000000 -0400
@@ -1282,17 +1282,20 @@ static struct page *shmem_fault(struct v
}
#ifdef CONFIG_NUMA
-int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+int shmem_set_policy(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct mempolicy *new)
{
struct address_space *mapping = vma->vm_file->f_mapping;
struct shared_policy *sp = mapping_shared_policy(mapping);
+ unsigned long sz = (end - start) >> PAGE_SHIFT;
+ pgoff_t pgoff = vma_addr_to_pgoff(vma, start, PAGE_SHIFT);
if (!sp) {
sp = mpol_shared_policy_new(mapping, MPOL_DEFAULT, NULL);
if (IS_ERR(sp))
return PTR_ERR(sp);
}
- return mpol_set_shared_policy(sp, vma, new);
+ return mpol_set_shared_policy(sp, pgoff, sz, new);
}
struct mempolicy *
@@ -1304,7 +1307,7 @@ shmem_get_policy(struct vm_area_struct *
if (!sp)
return NULL; /* == default policy */
- idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ idx = vma_addr_to_pgoff(vma, addr, PAGE_SHIFT);
return mpol_shared_policy_lookup(sp, idx);
}
#endif
Index: Linux/ipc/shm.c
===================================================================
--- Linux.orig/ipc/shm.c 2007-06-22 13:07:48.000000000 -0400
+++ Linux/ipc/shm.c 2007-06-22 13:10:35.000000000 -0400
@@ -236,13 +236,14 @@ static struct page *shm_fault(struct vm_
}
#ifdef CONFIG_NUMA
-int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+int shm_set_policy(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct mempolicy *new)
{
struct file *file = vma->vm_file;
struct shm_file_data *sfd = shm_file_data(file);
int err = 0;
if (sfd->vm_ops->set_policy)
- err = sfd->vm_ops->set_policy(vma, new);
+ err = sfd->vm_ops->set_policy(vma, start, end, new);
return err;
}
* [PATCH/RFC 4/11] Shared Policy: fix show_numa_maps()
From: Lee Schermerhorn @ 2007-06-25 19:52 UTC
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Policy Infrastructure 4/11 fix show_numa_maps()
Against 2.6.22-rc4-mm2
This patch updates the procfs numa_maps display to handle multiple
shared policy ranges on a single vma. numa_maps() still uses the
procfs task maps infrastructure, but provides wrappers around the
maps seq_file ops to handle shared policy "submaps", if any.
This fixes a problem with numa_maps for shared mappings:
Before this [mapped file policy] patch series, numa_maps could show
you different results for shared mappings depending on which task you
examined. A task which has installed shared policies on sub-ranges
of the shared region will show the policies on the sub-ranges, as the
vmas for that task were split when the policies were installed.
Another task that shares the region, but didn't install any policies,
or installed policies on a different region or set of regions will
show a different policy/range or set thereof, based on the VMAs
of that task. By displaying the policies directly from the shared
policy structure, we now see the same info from each task that maps
the segment.
The patch expands the proc_maps_private struct [#ifdef CONFIG_NUMA]
to track the existence of and progress through a submap for the
"current" vma. For vmas with shared policy submaps, a new
function--get_numa_submap()--in mm/mempolicy.c allocates and
populates an array of the policy ranges in the shared policy.
To facilitate this, the shared policy struct tracks the number
of ranges [sp_nodes] in the tree.
The nm_* numa_map seq_file wrappers pass the range to be displayed
to show_numa_map() via the saddr and eaddr members added to the
proc_maps_private struct. The patch modifies show_numa_map() to
use these members, where appropriate, instead of vm_start, vm_end.
As before, once the internal page-sized buffer is full, seq_read()
suspends the display, drops the mmap_sem and exits the read.
During this time the vma list can change. However, even within a
single seq_read(), the shared_policy "submap" can be changed by
other mappers. We could prevent this by holding the shared policy
spin_lock or otherwise holding off other mappers. That would also
hold off other tasks faulting in pages, attempting to look up the
policy for that offset, unless we convert the lock to reader/writer.
It doesn't seem worth the effort, as the numa_map is only a snapshot
in any case. So, this patch makes a best effort [at least as good as
unpatched task map code, I think] to perform a single scan over the
address space, displaying the policies and page state/location
for policy ranges "snapped" under spin lock into the "submap"
array mentioned above.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
fs/proc/task_mmu.c | 191 ++++++++++++++++++++++++++++++++++++++++--
include/linux/mempolicy.h | 5 +
include/linux/mm.h | 6 +
include/linux/proc_fs.h | 12 ++
include/linux/shared_policy.h | 3
mm/mempolicy.c | 57 +++++++++++-
6 files changed, 264 insertions(+), 10 deletions(-)
Index: Linux/include/linux/proc_fs.h
===================================================================
--- Linux.orig/include/linux/proc_fs.h 2007-06-22 13:07:48.000000000 -0400
+++ Linux/include/linux/proc_fs.h 2007-06-22 13:10:39.000000000 -0400
@@ -281,12 +281,24 @@ static inline struct proc_dir_entry *PDE
return PROC_I(inode)->pde;
}
+struct mpol_range {
+ unsigned long saddr;
+ unsigned long eaddr;
+};
+
struct proc_maps_private {
struct pid *pid;
struct task_struct *task;
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
#endif
+
+#ifdef CONFIG_NUMA
+ struct vm_area_struct *vma; /* preserved over seq_reads */
+ unsigned long saddr;
+ unsigned long eaddr; /* preserved over seq_reads */
+ struct mpol_range *range, *ranges; /* preserved ... */
+#endif
};
#endif /* _LINUX_PROC_FS_H */
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h 2007-06-22 13:10:35.000000000 -0400
+++ Linux/include/linux/mm.h 2007-06-22 13:10:39.000000000 -0400
@@ -1064,6 +1064,12 @@ static inline pgoff_t vma_addr_to_pgoff(
{
return ((addr - vma->vm_start) >> shift) + vma->vm_pgoff;
}
+
+static inline unsigned long vma_pgoff_to_addr(struct vm_area_struct *vma,
+ pgoff_t pgoff)
+{
+ return ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
+}
#else
static inline void setup_per_cpu_pageset(void) {}
#endif
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-06-22 13:10:30.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-06-22 13:10:39.000000000 -0400
@@ -139,6 +139,11 @@ static inline void check_highest_zone(en
int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
+struct seq_file;
+extern int show_numa_map(struct seq_file *, void *);
+struct mpol_range;
+extern struct mpol_range *get_numa_submap(struct vm_area_struct *);
+
#else
struct mempolicy {};
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-22 13:10:35.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-22 13:11:15.000000000 -0400
@@ -1469,6 +1469,7 @@ static void sp_insert(struct shared_poli
}
rb_link_node(&new->nd, parent, p);
rb_insert_color(&new->nd, &sp->root);
+ ++sp->nr_sp_nodes;
PDprintk("inserting %lx-%lx: %d\n", new->start, new->end,
new->policy ? new->policy->policy : 0);
}
@@ -1498,6 +1499,7 @@ static void sp_delete(struct shared_poli
rb_erase(&n->nd, &sp->root);
mpol_free(n->policy);
kmem_cache_free(sn_cache, n);
+ --sp->nr_sp_nodes;
}
struct sp_node *
@@ -1575,6 +1577,7 @@ struct shared_policy *mpol_shared_policy
return ERR_PTR(-ENOMEM);
sp->root = RB_ROOT;
spin_lock_init(&sp->lock);
+ sp->nr_sp_nodes = 0;
if (policy != MPOL_DEFAULT) {
struct mempolicy *newpol;
@@ -1989,9 +1992,9 @@ int show_numa_map(struct seq_file *m, vo
return 0;
mpol_to_str(buffer, sizeof(buffer),
- get_vma_policy(priv->task, vma, vma->vm_start));
+ get_vma_policy(priv->task, vma, priv->saddr));
- seq_printf(m, "%08lx %s", vma->vm_start, buffer);
+ seq_printf(m, "%08lx %s", priv->saddr, buffer);
if (file) {
seq_printf(m, " file=");
@@ -2004,10 +2007,10 @@ int show_numa_map(struct seq_file *m, vo
}
if (is_vm_hugetlb_page(vma)) {
- check_huge_range(vma, vma->vm_start, vma->vm_end, md);
+ check_huge_range(vma, priv->saddr, priv->eaddr, md);
seq_printf(m, " huge");
} else {
- check_pgd_range(vma, vma->vm_start, vma->vm_end,
+ check_pgd_range(vma, priv->saddr, priv->eaddr,
&node_online_map, MPOL_MF_STATS, md);
}
@@ -2046,3 +2049,49 @@ out:
m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
return 0;
}
+
+/*
+ * alloc/populate array of shared policy ranges for show_numa_map()
+ */
+struct mpol_range *get_numa_submap(struct vm_area_struct *vma)
+{
+ struct shared_policy *sp;
+ struct mpol_range *ranges, *range;
+ struct rb_node *rbn;
+ int nranges;
+
+ BUG_ON(!vma->vm_file);
+ sp = mapping_shared_policy(vma->vm_file->f_mapping);
+ if (!sp)
+ return NULL;
+
+ nranges = sp->nr_sp_nodes;
+ if (!nranges)
+ return NULL;
+
+ ranges = kzalloc((nranges + 1) * sizeof(*ranges), GFP_KERNEL);
+ if (!ranges)
+ return NULL; /* pretend there are none */
+
+ range = ranges;
+ spin_lock(&sp->lock);
+ /*
+ * # of ranges could have changed since we checked, but that is
+ * unlikely, so this is close enough [as long as it's safe].
+ */
+ rbn = rb_first(&sp->root);
+ /*
+ * count nodes to ensure we leave one empty range struct
+ * in case node added between check and alloc
+ */
+ while (rbn && nranges--) {
+ struct sp_node *spn = rb_entry(rbn, struct sp_node, nd);
+ range->saddr = vma_pgoff_to_addr(vma, spn->start);
+ range->eaddr = vma_pgoff_to_addr(vma, spn->end);
+ ++range;
+ rbn = rb_next(rbn);
+ }
+
+ spin_unlock(&sp->lock);
+ return ranges;
+}
Index: Linux/fs/proc/task_mmu.c
===================================================================
--- Linux.orig/fs/proc/task_mmu.c 2007-06-22 13:07:48.000000000 -0400
+++ Linux/fs/proc/task_mmu.c 2007-06-22 13:10:39.000000000 -0400
@@ -498,7 +498,188 @@ const struct file_operations proc_clear_
#endif
#ifdef CONFIG_NUMA
-extern int show_numa_map(struct seq_file *m, void *v);
+/*
+ * numa_maps uses procfs task maps file operations, with wrappers
+ * to handle mpol submaps--policy ranges within a vma
+ */
+
+/*
+ * start processing a new vma for show_numa_maps
+ */
+static void nm_vma_start(struct proc_maps_private *priv,
+ struct vm_area_struct *vma)
+{
+ if (!vma)
+ return;
+ priv->vma = vma; /* saved across read()s */
+
+ priv->saddr = vma->vm_start;
+ if (!(vma->vm_flags & VM_SHARED) || !vma->vm_file ||
+ !vma->vm_file->f_mapping->spolicy) {
+ /*
+ * usual case: no submap
+ */
+ priv->eaddr = vma->vm_end;
+ return;
+ }
+
+ priv->range = priv->ranges = get_numa_submap(vma);
+ if (!priv->range) {
+ priv->eaddr = vma->vm_end; /* empty shared policy */
+ return;
+ }
+
+ /*
+ * restart suspended submap where we left off
+ */
+ while (priv->range->eaddr && priv->range->eaddr < priv->eaddr)
+ ++priv->range;
+
+ if (!priv->range->eaddr)
+ priv->eaddr = vma->vm_end;
+ else if (priv->saddr < priv->range->saddr)
+ priv->eaddr = priv->range->saddr; /* show gap [default pol] */
+ else
+ priv->eaddr = priv->range->eaddr; /* show range */
+}
+
+/*
+ * done with numa_maps vma: reset so we start a new
+ * vma on next seq_read.
+ */
+static void nm_vma_stop(struct proc_maps_private *priv)
+{
+ if (priv->ranges)
+ kfree(priv->ranges);
+ priv->ranges = priv->range = NULL;
+ priv->vma = NULL;
+}
+
+/*
+ * Advance to next vma in mm or next subrange in vma.
+ * mmap_sem held during a single seq_read(), but shared
+ * policy ranges can be modified at any time by other
+ * mappers. We just continue to display the ranges we
+ * found when we started the vma.
+ */
+static void *nm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma = v;
+
+ if (!priv->range || priv->eaddr >= vma->vm_end) {
+ /*
+ * usual case: no submap or end of vma
+ * re: '>=' -- in case we got here from nm_start()
+ * and vma @ pos truncated to < priv->eaddr
+ */
+ nm_vma_stop(priv);
+ vma = m_next(m, v, pos);
+ nm_vma_start(priv, vma);
+ return vma;
+ }
+
+ /*
+ * Advance to next range in submap
+ */
+ priv->saddr = priv->eaddr;
+ if (priv->eaddr == priv->range->saddr) {
+ /*
+ * just processed a gap in the submap
+ */
+ priv->eaddr = min(priv->range->eaddr, vma->vm_end);
+ return vma; /* show the range */
+ }
+
+ ++priv->range;
+ if (!priv->range->eaddr)
+ priv->eaddr = vma->vm_end; /* past end of ranges */
+ else if (priv->saddr < priv->range->saddr)
+ priv->eaddr = priv->range->saddr; /* gap in submap */
+ else
+ priv->eaddr = min(priv->range->eaddr, vma->vm_end);
+
+ return vma;
+}
+
+/*
+ * [Re]start scan for new seq_read().
+ * N.B., much could have changed in the mm, as we dropped the mmap_sem
+ * between read()s. Need to call m_start() to find vma at pos.
+ */
+static void *nm_start(struct seq_file *m, loff_t *pos)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma;
+
+ if (!priv->range) {
+ /*
+ * usual case: 1st after open, or finished prev vma
+ */
+ vma = m_start(m, pos);
+ nm_vma_start(priv, vma);
+ return vma;
+ }
+
+ /*
+ * Continue with submap of "current" vma. However, vma could have
+ * been unmapped, split, truncated, ... between read()s.
+ * Reset "last_addr" to simulate seek; find vma by 'pos'.
+ */
+ m->version = 0;
+ --(*pos); /* seq_read() incremented it */
+ vma = m_start(m, pos);
+ if (vma != priv->vma)
+ goto new_vma;
+ /*
+ * Same vma address as where we left off, but could have different
+ * ranges or could be entirely different vma.
+ */
+ if (vma->vm_start > priv->eaddr)
+ goto new_vma; /* starts past last range displayed */
+ if (priv->eaddr < vma->vm_end) {
+ /*
+ * vma at pos still covers eaddr--where we left off. Submap
+ * could have changed, but we'll keep reporting ranges we found
+ * earlier up to vm_end.
+ * We hope it is very unlikely that submap changed.
+ */
+ return nm_next(m, vma, pos);
+ }
+
+ /*
+ * Already reported past end of vma; find next vma past eaddr
+ */
+ while (vma && vma->vm_end < priv->eaddr)
+ vma = m_next(m, vma, pos);
+
+new_vma:
+ /*
+ * new vma at pos; continue from ~ last eaddr
+ */
+ nm_vma_stop(priv);
+ nm_vma_start(priv, vma);
+ return vma;
+}
+
+/*
+ * Suspend display of numa_map--e.g., buffer full?
+ */
+static void nm_stop(struct seq_file *m, void *v)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma = v;
+
+ if (!vma || priv->eaddr >= vma->vm_end) {
+ nm_vma_stop(priv);
+ }
+ /*
+ * leave state in priv for nm_start(); but drop the
+ * mmap_sem and unref the mm
+ */
+ m_stop(m, v);
+}
+
static int show_numa_map_checked(struct seq_file *m, void *v)
{
@@ -512,10 +693,10 @@ static int show_numa_map_checked(struct
}
static struct seq_operations proc_pid_numa_maps_op = {
- .start = m_start,
- .next = m_next,
- .stop = m_stop,
- .show = show_numa_map_checked
+ .start = nm_start,
+ .next = nm_next,
+ .stop = nm_stop,
+ .show = show_numa_map_checked
};
static int numa_maps_open(struct inode *inode, struct file *file)
Index: Linux/include/linux/shared_policy.h
===================================================================
--- Linux.orig/include/linux/shared_policy.h 2007-06-22 13:10:35.000000000 -0400
+++ Linux/include/linux/shared_policy.h 2007-06-22 13:10:39.000000000 -0400
@@ -25,7 +25,8 @@ struct sp_node {
struct shared_policy {
struct rb_root root;
- spinlock_t lock; /* protects rb tree */
+ spinlock_t lock; /* protects rb tree */
+ int nr_sp_nodes; /* for numa_maps */
};
extern struct shared_policy *mpol_shared_policy_new(struct address_space *,
* [PATCH/RFC 5/11] Shared Policy: Add hugepage shmem policy vm_ops
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (3 preceding siblings ...)
2007-06-25 19:52 ` [PATCH/RFC 4/11] Shared Policy: fix show_numa_maps() Lee Schermerhorn
@ 2007-06-25 19:52 ` Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 6/11] Shared Policy: Factor alloc_page_pol routine Lee Schermerhorn
` (7 subsequent siblings)
12 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-25 19:52 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Policy Hugetlb Shmem 5/11 Add hugepage shmem policy vm_ops
Against 2.6.22-rc4-mm2
This patch hooks up the hugepage shmem segment's
{set|get}_policy vm_ops so that shmem segments created with
the SHM_HUGETLB flag will install policies specified via the
mbind() syscall into the shared policy of the shared segment.
This capability is possible now that hugetlb pages are faulted
in on demand.
Huge page shmem segments are used by enterprise class database
managers to achieve better performance. Some DBMs are
NUMA aware on enterprise unix[tm] systems and will enable that
support on Linux when all of the pieces are in place. This is
one of those pieces.
The shared policy infrastructure maintains memory policies on
"base page size" ranges. To ensure that policies installed on
a hugetlb shmem segment cover entire huge pages, this patch
enhances do_mbind() to enforce huge page alignment if the policy
range starts within a hugetlb segment. The enforcement is done
in check_range() because we need the vma to determine whether or
not the range starts in a hugetlb segment.
Note: we could just silently round the start address
down to a hugepage alignment. This would be safe and
convenient for the application programmer, but
inconsistent with the treatment of base page ranges
which MUST be page aligned.
This patch depends on the numa_maps fixes and related shared
policy infrastructure clean up to prevent hangs when displaying
[via cat] the numa_maps of a task that has attached a huge page
shmem segment.
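For illustration, a minimal userspace sketch of the intended usage.
The huge page size [2MB here], segment size and node mask are
assumptions for the example, and error handling is omitted; mbind()
comes from libnuma's <numaif.h>.

#include <sys/shm.h>
#include <numaif.h>

#define HUGE_SZ	(2UL << 20)	/* assumed huge page size */

static int bind_huge_shm(void)
{
	unsigned long nodemask = 0x3;	/* interleave across nodes 0,1 */
	int shmid = shmget(IPC_PRIVATE, 8 * HUGE_SZ,
			   SHM_HUGETLB | IPC_CREAT | 0600);
	char *addr = shmat(shmid, NULL, 0);

	/*
	 * start and length must be huge page multiples, else the
	 * check_range() enforcement above returns -EINVAL
	 */
	return mbind(addr, 2 * HUGE_SZ, MPOL_INTERLEAVE,
		     &nodemask, 8 * sizeof(nodemask), 0);
}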
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/hugetlb.c | 4 ++++
mm/mempolicy.c | 11 +++++++++++
2 files changed, 15 insertions(+)
Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c 2007-06-22 14:33:03.000000000 -0400
+++ Linux/mm/hugetlb.c 2007-06-22 14:34:16.000000000 -0400
@@ -317,6 +317,10 @@ static struct page *hugetlb_vm_op_fault(
struct vm_operations_struct hugetlb_vm_ops = {
.fault = hugetlb_vm_op_fault,
+#ifdef CONFIG_NUMA
+ .set_policy = shmem_set_policy,
+ .get_policy = shmem_get_policy,
+#endif
};
static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-22 14:33:03.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-22 16:35:07.000000000 -0400
@@ -344,6 +344,17 @@ check_range(struct mm_struct *mm, unsign
first = find_vma(mm, start);
if (!first)
return ERR_PTR(-EFAULT);
+
+ /*
+ * need vma for hugetlb check
+ */
+ if (is_vm_hugetlb_page(first)) {
+ if (start & ~HPAGE_MASK)
+ return ERR_PTR(-EINVAL);
+ if (end < first->vm_end)
+ end = (end + ~HPAGE_MASK) & HPAGE_MASK;
+ }
+
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
if (!(flags & MPOL_MF_DISCONTIG_OK)) {
* [PATCH/RFC 6/11] Shared Policy: Factor alloc_page_pol routine
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (4 preceding siblings ...)
2007-06-25 19:52 ` [PATCH/RFC 5/11] Shared Policy: Add hugepage shmem policy vm_ops Lee Schermerhorn
@ 2007-06-25 19:53 ` Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 7/11] Shared Policy: use shared policy for page cache allocations Lee Schermerhorn
` (6 subsequent siblings)
12 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-25 19:53 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Mapped File Policy 6/11 Factor alloc_page_pol routine
Against 2.6.22-rc4-mm2
Implement alloc_page_pol() to allocate a page given a policy and
an offset [for interleaving]. No vma nor addr needed. This
function will be used to allocate page_cache pages given the
policy at a given page offset in a subsequent patch.
Revise alloc_page_vma() to just call alloc_page_pol() after looking
up the vma policy, to eliminate duplicate code. This change rippled
into the interleaving functions. I was able to eliminate
interleave_nid() by computing the offset at the call sites and
calling [modified] offset_il_node() directly.
removed vma arg from offset_il_node(), as it wasn't
used and is not available when called from
alloc_page_pol().
Note: re: alloc_page_vma() -- can be called w/ vma == NULL via
read_swap_cache_async() from swapin_readahead(). Can't compute
a page offset in this case. This means that pages read by swap
readahead don't/can't follow vma policy. This is current
behavior.
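To make the offset-based interleave concrete, here is a plain C sketch
of what offset_il_node() computes--the target'th set bit of the policy's
node mask, with target = offset % weight. It uses an unsigned long in
place of nodemask_t and a gcc builtin in place of nodes_weight(); a
stand-in for illustration, not the kernel code verbatim.

static unsigned pick_interleave_node(unsigned long nodemask, unsigned long off)
{
	unsigned nnodes = __builtin_popcountl(nodemask);  /* nodes_weight() */
	unsigned target = (unsigned)off % nnodes;
	unsigned nid, hit = 0;

	/* assumes a non-empty mask, as mempolicies guarantee */
	for (nid = 0; ; nid++) {
		if (!(nodemask & (1UL << nid)))
			continue;
		if (hit++ == target)
			break;	/* reached the target'th allowed node */
	}
	return nid;
}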
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/gfp.h | 3 +
include/linux/hugetlb.h | 9 ++++
include/linux/mempolicy.h | 2 +
include/linux/mm.h | 6 ++-
mm/mempolicy.c | 89 ++++++++++++++++++++++++++--------------------
5 files changed, 71 insertions(+), 38 deletions(-)
Index: Linux/include/linux/gfp.h
===================================================================
--- Linux.orig/include/linux/gfp.h 2007-06-25 14:58:25.000000000 -0400
+++ Linux/include/linux/gfp.h 2007-06-25 14:58:57.000000000 -0400
@@ -192,10 +192,13 @@ alloc_pages(gfp_t gfp_mask, unsigned int
}
extern struct page *alloc_page_vma(gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr);
+struct mempolicy;
+extern struct page *alloc_page_pol(gfp_t, struct mempolicy *, pgoff_t);
#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_pol(gfp_mask, pol, off) alloc_pages(gfp_mask, 0)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
Index: Linux/include/linux/hugetlb.h
===================================================================
--- Linux.orig/include/linux/hugetlb.h 2007-06-25 14:58:25.000000000 -0400
+++ Linux/include/linux/hugetlb.h 2007-06-25 14:58:57.000000000 -0400
@@ -14,6 +14,14 @@ static inline int is_vm_hugetlb_page(str
return vma->vm_flags & VM_HUGETLB;
}
+static inline int vma_page_shift(struct vm_area_struct *vma)
+{
+ if (unlikely(is_vm_hugetlb_page(vma)))
+ return HPAGE_SHIFT;
+ else
+ return PAGE_SHIFT;
+}
+
int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
@@ -127,6 +135,7 @@ static inline unsigned long hugetlb_tota
#define HPAGE_MASK PAGE_MASK /* Keep the compiler happy */
#define HPAGE_SIZE PAGE_SIZE
#endif
+#define vma_page_shift(VMA) PAGE_SHIFT
#endif /* !CONFIG_HUGETLB_PAGE */
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-06-25 14:58:25.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-06-25 14:58:57.000000000 -0400
@@ -124,6 +124,8 @@ extern int mpol_parse_options(char *valu
nodemask_t *policy_nodes);
extern struct mempolicy default_policy;
+extern struct mempolicy *get_file_policy(struct task_struct *,
+ struct address_space *, pgoff_t);
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags);
extern unsigned slab_node(struct mempolicy *policy);
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h 2007-06-25 14:58:25.000000000 -0400
+++ Linux/include/linux/mm.h 2007-06-25 14:58:57.000000000 -0400
@@ -1058,11 +1058,15 @@ extern void setup_per_cpu_pageset(void);
/*
* Address to offset for shared mapping policy lookup.
+ * When used for interleaving hugetlbfs pages [when shift
+ * == HPAGE_SHIFT], actually returns hugepage offset in
+ * mapping; NOT file page offset.
*/
static inline pgoff_t vma_addr_to_pgoff(struct vm_area_struct *vma,
unsigned long addr, int shift)
{
- return ((addr - vma->vm_start) >> shift) + vma->vm_pgoff;
+ return ((addr - vma->vm_start) >> shift) +
+ (vma->vm_pgoff >> (shift - PAGE_SHIFT));
}
static inline unsigned long vma_pgoff_to_addr(struct vm_area_struct *vma,
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-25 14:58:25.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-25 14:58:57.000000000 -0400
@@ -21,6 +21,7 @@
*
* bind Only allocate memory on a specific set of nodes,
* no fallback.
+//TODO: following still applicable?
* FIXME: memory is allocated starting with the first node
* to the last. It would be better if bind would truly restrict
* the allocation to memory nodes instead
@@ -35,6 +36,7 @@
* use the process policy. This is what Linux always did
* in a NUMA aware kernel and still does by, ahem, default.
*
+//TODO: following needs paragraph rewording. haven't figured out what to say.
* The process policy is applied for most non interrupt memory allocations
* in that process' context. Interrupts ignore the policies and always
* try to allocate on the local CPU. The VMA policy is only applied for memory
@@ -50,15 +52,18 @@
* Same with GFP_DMA allocations.
*
* For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
- * all users and remembered even when nobody has memory mapped.
+ * all users and remembered even when nobody has memory mapped. Shared
+ * policies handle sub-ranges of the object using a red/black tree.
+ *
+ * For mmap()ed files, the policy is shared between all 'SHARED mappers
+ * and is remembered as long as the inode exists. Private mappings
+ * still use vma policy for COWed pages, but use the shared policy
+ * [default, if none] for initial and read-only faults.
*/
/* Notebook:
- fix mmap readahead to honour policy and enable policy for any page cache
- object
statistics for bigpages
- global policy for page cache? currently it uses process policy. Requires
- first item above.
+ global policy for page cache?
handle mremap for shared memory (currently ignored for the policy)
grows down?
make bind policy root only? It can trigger oom much faster and the
@@ -1135,6 +1140,22 @@ static struct mempolicy * get_vma_policy
return pol;
}
+/*
+ * Return effective policy for file [address_space] at pgoff
+ */
+struct mempolicy *get_file_policy(struct task_struct *task,
+ struct address_space *x, pgoff_t pgoff)
+{
+ struct shared_policy *sp = x->spolicy;
+ struct mempolicy *pol = task->mempolicy;
+
+ if (sp)
+ pol = mpol_shared_policy_lookup(sp, pgoff);
+ if (!pol)
+ pol = &default_policy;
+ return pol;
+}
+
/* Return a zonelist representing a mempolicy */
static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
{
@@ -1207,9 +1228,8 @@ unsigned slab_node(struct mempolicy *pol
}
}
-/* Do static interleaving for a VMA with known offset. */
-static unsigned offset_il_node(struct mempolicy *pol,
- struct vm_area_struct *vma, unsigned long off)
+/* Do static interleaving for a policy with known offset. */
+static unsigned offset_il_node(struct mempolicy *pol, pgoff_t off)
{
unsigned nnodes = nodes_weight(pol->v.nodes);
unsigned target = (unsigned)off % nnodes;
@@ -1224,28 +1244,6 @@ static unsigned offset_il_node(struct me
return nid;
}
-/* Determine a node number for interleave */
-static inline unsigned interleave_nid(struct mempolicy *pol,
- struct vm_area_struct *vma, unsigned long addr, int shift)
-{
- if (vma) {
- unsigned long off;
-
- /*
- * for small pages, there is no difference between
- * shift and PAGE_SHIFT, so the bit-shift is safe.
- * for huge pages, since vm_pgoff is in units of small
- * pages, we need to shift off the always 0 bits to get
- * a useful offset.
- */
- BUG_ON(shift < PAGE_SHIFT);
- off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
- off += (addr - vma->vm_start) >> shift;
- return offset_il_node(pol, vma, off);
- } else
- return interleave_nodes(pol);
-}
-
#ifdef CONFIG_HUGETLBFS
/* Return a zonelist suitable for a huge page allocation. */
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
@@ -1256,7 +1254,8 @@ struct zonelist *huge_zonelist(struct vm
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;
- nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+ nid = offset_il_node(pol,
+ vma_addr_to_pgoff(vma, addr, HPAGE_SHIFT));
return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
}
return zonelist_policy(GFP_HIGHUSER, pol);
@@ -1278,6 +1277,23 @@ static struct page *alloc_page_interleav
return page;
}
+/*
+ * alloc_page_pol() -- allocate a page based on policy,offset.
+ * Used for mmap()ed file policy allocations where policy is based
+ * on file offset rather than a vma,addr pair
+ */
+struct page *alloc_page_pol(gfp_t gfp, struct mempolicy *pol, pgoff_t pgoff)
+{
+ if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
+ unsigned nid;
+
+ nid = offset_il_node(pol, pgoff);
+ return alloc_page_interleave(gfp, 0, nid);
+ }
+ return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+}
+EXPORT_SYMBOL(alloc_page_pol);
+
/**
* alloc_page_vma - Allocate a page for a VMA.
*
@@ -1304,16 +1320,15 @@ struct page *
alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ pgoff_t pgoff = 0;
cpuset_update_task_memory_state();
- if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
- unsigned nid;
-
- nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
- return alloc_page_interleave(gfp, 0, nid);
+ if (likely(vma)) {
+ int shift = vma_page_shift(vma);
+ pgoff = vma_addr_to_pgoff(vma, addr, shift);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ return alloc_page_pol(gfp, pol, pgoff);
}
/**
* [PATCH/RFC 7/11] Shared Policy: use shared policy for page cache allocations
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (5 preceding siblings ...)
2007-06-25 19:53 ` [PATCH/RFC 6/11] Shared Policy: Factor alloc_page_pol routine Lee Schermerhorn
@ 2007-06-25 19:53 ` Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 8/11] Shared Policy: fix migration of private mappings Lee Schermerhorn
` (5 subsequent siblings)
12 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-25 19:53 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Mapped File Policy 7/11 use shared policy for page cache allocations
Against 2.6.22-rc4-mm2
This patch implements a "get_file_policy()" function, analogous
to get_vma_policy(), but for a given file [inode/mapping] at
a specified offset, using the shared_policy, if any, in the
file's address_space. If there is no shared policy, it returns the
process policy of the argument task [to match get_vma_policy() args],
or the default policy if there is no process policy.
Note that for a file policy to exist the file must currently
be mmap()ed into a task's address space with MAP_SHARED,
with the policy installed via mbind().
A later patch will hook up the generic file mempolicy
vm_ops and define a per cpuset control file to enable
this semantic. The default will be the same as current behavior--
no policy on shared file mappings.
Revert [__]page_cache_alloc() to take a mapping argument, as I need
that to locate the shared policy. Add pgoff_t and gfp_t modifier
arguments. Fix up page_cache_alloc() and page_cache_alloc_cold()
in pagemap.h and all direct callers of __page_cache_alloc accordingly.
Modify __page_cache_alloc() to use get_file_policy() and
alloc_page_pol().
page_cache_alloc*() now take an additional offset/index
argument, available at all call sites, to look up the appropriate
policy. The patch fixes all in-kernel users of the modified
interfaces.
Re: interaction with cpusets page spread: if the file has a
shared policy structure attached, that policy takes precedence
over spreading.
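For reference, a sketch of how the gfp modifier arguments compose with
the mapping's gfp mask, with the per-caller values used by this patch
at the converted call sites:

	gfp_t gfp = (mapping_gfp_mask(mapping) & ~gfp_sub) | gfp_add;

	/* page_cache_alloc():       gfp_sub = 0,        gfp_add = 0          */
	/* page_cache_alloc_cold():  gfp_sub = 0,        gfp_add = __GFP_COLD */
	/* grab_cache_page_nowait(): gfp_sub = __GFP_FS, gfp_add = 0          */
	/* find_or_create_page():    gfp_sub = ~0,       gfp_add = gfp_mask   */

i.e., find_or_create_page() replaces the mapping's mask entirely with
the caller's gfp_mask.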
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
fs/ntfs/file.c | 2 +-
fs/splice.c | 2 +-
include/linux/pagemap.h | 19 ++++++++++++-------
mm/filemap.c | 48 +++++++++++++++++++++++++++++++++++++++---------
mm/readahead.c | 2 +-
5 files changed, 54 insertions(+), 19 deletions(-)
Index: Linux/mm/filemap.c
===================================================================
--- Linux.orig/mm/filemap.c 2007-06-25 15:00:39.000000000 -0400
+++ Linux/mm/filemap.c 2007-06-25 15:03:25.000000000 -0400
@@ -31,6 +31,8 @@
#include <linux/syscalls.h>
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
+#include <linux/mempolicy.h>
+
#include "internal.h"
/*
@@ -469,13 +471,41 @@ int add_to_page_cache_lru(struct page *p
}
#ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+/**
+ * __page_cache_alloc - allocate a page cache page
+ * @mapping - address_space for which page will be allocated
+ * @pgoff - page index in mapping -- for mem policy
+ * @gfp_sub - gfp flags to be removed from mapping's gfp
+ * @gfp_add - gfp flags to be added to mapping's gfp
+ *
+ * If the mapping does not contain a shared policy, and page cache spreading
+ * is enabled for the current context's cpuset, allocate a page from the node
+ * indicated by page cache spreading.
+ *
+ * Otherwise, fetch the memory policy at the indicated pgoff and allocate
+ * a page according to that policy. Note that if the mapping does not
+ * have a shared policy, the allocation will use the task policy, if any,
+ * else the system default policy.
+ *
+ * All allocations will use the mapping's gfp mask, as modified by the
+ * gfp_sub and gfp_add arguments.
+ */
+struct page *__page_cache_alloc(struct address_space *mapping, pgoff_t pgoff,
+ gfp_t gfp_sub, gfp_t gfp_add)
{
- if (cpuset_do_page_mem_spread()) {
+ struct mempolicy *pol;
+ gfp_t gfp = (mapping_gfp_mask(mapping) & ~gfp_sub) | gfp_add;
+
+ /*
+ * Consider spreading only if no shared_policy
+ */
+ if (!mapping->spolicy && cpuset_do_page_mem_spread()) {
int n = cpuset_mem_spread_node();
return alloc_pages_node(n, gfp, 0);
}
- return alloc_pages(gfp, 0);
+
+ pol = get_file_policy(current, mapping, pgoff);
+ return alloc_page_pol(gfp, pol, pgoff);
}
EXPORT_SYMBOL(__page_cache_alloc);
#endif
@@ -671,7 +701,7 @@ struct page *find_or_create_page(struct
repeat:
page = find_lock_page(mapping, index);
if (!page) {
- page = __page_cache_alloc(gfp_mask);
+ page = __page_cache_alloc(mapping, index, ~0, gfp_mask);
if (!page)
return NULL;
err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
@@ -804,7 +834,7 @@ grab_cache_page_nowait(struct address_sp
page_cache_release(page);
return NULL;
}
- page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+ page = __page_cache_alloc(mapping, index, __GFP_FS, 0);
if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
page_cache_release(page);
page = NULL;
@@ -1052,7 +1082,7 @@ no_cached_page:
* Ok, it wasn't cached, so we need to create a new
* page..
*/
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, index);
if (!page) {
desc->error = -ENOMEM;
goto out;
@@ -1318,7 +1348,7 @@ static int fastcall page_cache_read(stru
int ret;
do {
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, offset);
if (!page)
return -ENOMEM;
@@ -1566,7 +1596,7 @@ static struct page *__read_cache_page(st
repeat:
page = find_get_page(mapping, index);
if (!page) {
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, index);
if (!page)
return ERR_PTR(-ENOMEM);
err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
@@ -2058,7 +2088,7 @@ repeat:
if (likely(page))
return page;
- page = page_cache_alloc(mapping);
+ page = page_cache_alloc(mapping, index);
if (!page)
return NULL;
status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
Index: Linux/include/linux/pagemap.h
===================================================================
--- Linux.orig/include/linux/pagemap.h 2007-06-25 15:00:39.000000000 -0400
+++ Linux/include/linux/pagemap.h 2007-06-25 15:02:17.000000000 -0400
@@ -63,22 +63,27 @@ static inline void mapping_set_gfp_mask(
void release_pages(struct page **pages, int nr, int cold);
#ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(struct address_space *, pgoff_t,
+ gfp_t, gfp_t);
#else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(struct address_space *mapping,
+ pgoff_t off,
+ gfp_t gfp_sub, gfp_t gfp_add)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages((mapping_gfp_mask(mapping) & ~gfp_sub) | gfp_add, 0);
}
#endif
-static inline struct page *page_cache_alloc(struct address_space *x)
+static inline struct page *page_cache_alloc(struct address_space *mapping,
+ pgoff_t off)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return __page_cache_alloc(mapping, off, 0, 0);
}
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *mapping,
+ pgoff_t off)
{
- return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+ return __page_cache_alloc(mapping, off, 0, __GFP_COLD);
}
typedef int filler_t(void *, struct page *);
Index: Linux/fs/splice.c
===================================================================
--- Linux.orig/fs/splice.c 2007-06-25 15:00:39.000000000 -0400
+++ Linux/fs/splice.c 2007-06-25 15:02:17.000000000 -0400
@@ -318,7 +318,7 @@ __generic_file_splice_read(struct file *
/*
* page didn't exist, allocate one.
*/
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, index);
if (!page)
break;
Index: Linux/mm/readahead.c
===================================================================
--- Linux.orig/mm/readahead.c 2007-06-25 15:00:39.000000000 -0400
+++ Linux/mm/readahead.c 2007-06-25 15:02:17.000000000 -0400
@@ -160,7 +160,7 @@ __do_page_cache_readahead(struct address
continue;
read_unlock_irq(&mapping->tree_lock);
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, page_offset);
read_lock_irq(&mapping->tree_lock);
if (!page)
break;
Index: Linux/fs/ntfs/file.c
===================================================================
--- Linux.orig/fs/ntfs/file.c 2007-06-25 15:00:39.000000000 -0400
+++ Linux/fs/ntfs/file.c 2007-06-25 15:02:17.000000000 -0400
@@ -424,7 +424,7 @@ static inline int __ntfs_grab_cache_page
pages[nr] = find_lock_page(mapping, index);
if (!pages[nr]) {
if (!*cached_page) {
- *cached_page = page_cache_alloc(mapping);
+ *cached_page = page_cache_alloc(mapping, index);
if (unlikely(!*cached_page)) {
err = -ENOMEM;
goto err_out;
* [PATCH/RFC 8/11] Shared Policy: fix migration of private mappings
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (6 preceding siblings ...)
2007-06-25 19:53 ` [PATCH/RFC 7/11] Shared Policy: use shared policy for page cache allocations Lee Schermerhorn
@ 2007-06-25 19:53 ` Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 9/11] Shared Policy: mapped file policy persistence model Lee Schermerhorn
` (4 subsequent siblings)
12 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-25 19:53 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Mapped File Policy 8/11 fix migration of private mappings
Against 2.6.22-rc4-mm2
Don't allow migration of file backed pages mapped with
MAP_PRIVATE if the file has a shared policy. Rather, only
migrate any private, anon copies that the task has "COWed".
Define a new internal flag that we set in check_range() for
private mappings of files with shared policy. Then,
migrate_page_add() will skip non-anon pages when this flag
is set.
May also be able to use this flag to force unmapping of
anon pages that may be shared with relatives during automigrate
on internode task migration--e.g., by using:
MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_ANON_ONLY
But, that's the subject of a different patch series.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-22 14:34:17.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-22 14:34:23.000000000 -0400
@@ -102,6 +102,7 @@
#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */
+#define MPOL_MF_MOVE_ANON_ONLY (MPOL_MF_INTERNAL << 3)
static struct kmem_cache *policy_cache;
static struct kmem_cache *sp_cache;
@@ -373,13 +374,19 @@ check_range(struct mm_struct *mm, unsign
((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
vma_migratable(vma)))) {
unsigned long endvma = vma->vm_end;
+ unsigned long anononly = 0;
if (endvma > end)
endvma = end;
if (vma->vm_start > start)
start = vma->vm_start;
+
+ if (vma->vm_file && !(vma->vm_flags & VM_SHARED) &&
+ vma->vm_file->f_mapping->spolicy)
+ anononly = MPOL_MF_MOVE_ANON_ONLY;
+
err = check_pgd_range(vma, start, endvma, nodes,
- flags, private);
+ flags|anononly, private);
if (err) {
first = ERR_PTR(err);
break;
@@ -632,9 +639,11 @@ static void migrate_page_add(struct page
unsigned long flags)
{
/*
- * Avoid migrating a page that is shared with others.
+ * Avoid migrating a file backed page in a private mapping or
+ * a page that is shared with others.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)
+ if ((!(flags & MPOL_MF_MOVE_ANON_ONLY) || PageAnon(page)) &&
+ ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1))
isolate_lru_page(page, pagelist);
}
* [PATCH/RFC 9/11] Shared Policy: mapped file policy persistence model
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (7 preceding siblings ...)
2007-06-25 19:53 ` [PATCH/RFC 8/11] Shared Policy: fix migration of private mappings Lee Schermerhorn
@ 2007-06-25 19:53 ` Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control Lee Schermerhorn
` (3 subsequent siblings)
12 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-25 19:53 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Mapped File Policy 9/11 define mapped file policy persistence model
Against 2.6.22-rc4-mm2
Mapped file policy applies to a memory mapped file mmap()ed with the
MAP_SHARED flag. Therefore, retain the shared policy until the last
shared mapping is removed.
Shmem segments [including SHM_HUGETLB segments] look like shared
mapped files to the shared policy infrastructure. The policy
persistence model for shmem segments is that once a shared policy
is applied, it remains as long as the segment exists. To retain this
model, define a shared policy persistence flag--SPOL_F_PERSIST--and
set this flag when allocating a shared policy for a shmem segment.
Free any shmem persistent shared policy when the segment is deleted
in the common inode cleanup path; this preserves current behavior.
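A userspace sketch of the resulting lifetime for a regular file. The
path, length and node mask are illustrative, error handling is omitted,
and shared_file_policy [patch 10] is assumed enabled for the caller's
cpuset.

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>

static void file_policy_lifetime(void)
{
	unsigned long nodes = 0x1;	/* bind to node 0 */
	size_t len = 1UL << 20;
	int fd = open("/some/file", O_RDWR);
	char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);

	/* installs a shared policy on the inode's address_space */
	mbind(a, len, MPOL_BIND, &nodes, 8 * sizeof(nodes), 0);

	/*
	 * last shared mapping goes away: i_mmap_writable drops to
	 * zero and __remove_shared_vm_struct() frees the policy;
	 * a shmem segment would keep its policy [SPOL_F_PERSIST]
	 */
	munmap(a, len);
	close(fd);
}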
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
fs/hugetlbfs/inode.c | 1
fs/inode.c | 9 ++++++
include/linux/shared_policy.h | 11 +++++--
mm/mempolicy.c | 63 ++++++++++++++++++++++++++++++++----------
mm/mmap.c | 13 ++++++++
mm/shmem.c | 5 ---
6 files changed, 80 insertions(+), 22 deletions(-)
Index: Linux/fs/hugetlbfs/inode.c
===================================================================
--- Linux.orig/fs/hugetlbfs/inode.c 2007-06-25 14:53:17.000000000 -0400
+++ Linux/fs/hugetlbfs/inode.c 2007-06-25 15:03:48.000000000 -0400
@@ -547,7 +547,6 @@ static struct inode *hugetlbfs_alloc_ino
static void hugetlbfs_destroy_inode(struct inode *inode)
{
hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
- mpol_free_shared_policy(inode->i_mapping);
kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
}
Index: Linux/fs/inode.c
===================================================================
--- Linux.orig/fs/inode.c 2007-06-25 14:53:17.000000000 -0400
+++ Linux/fs/inode.c 2007-06-25 15:03:48.000000000 -0400
@@ -22,6 +22,7 @@
#include <linux/bootmem.h>
#include <linux/inotify.h>
#include <linux/mount.h>
+#include <linux/shared_policy.h>
/*
* This is needed for the following functions:
@@ -173,6 +174,14 @@ void destroy_inode(struct inode *inode)
{
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
+
+ /*
+ * free any shared policy
+ */
+ if ((inode->i_mode & S_IFMT) == S_IFREG) {
+ mpol_free_shared_policy(inode->i_mapping);
+ }
+
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
else
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c 2007-06-25 14:53:17.000000000 -0400
+++ Linux/mm/shmem.c 2007-06-25 15:03:48.000000000 -0400
@@ -1294,6 +1294,7 @@ int shmem_set_policy(struct vm_area_stru
sp = mpol_shared_policy_new(mapping, MPOL_DEFAULT, NULL);
if (IS_ERR(sp))
return PTR_ERR(sp);
+ sp->sp_flags |= SPOL_F_PERSIST;
}
return mpol_set_shared_policy(sp, pgoff, sz, new);
}
@@ -2303,10 +2304,6 @@ static struct inode *shmem_alloc_inode(s
static void shmem_destroy_inode(struct inode *inode)
{
- if ((inode->i_mode & S_IFMT) == S_IFREG) {
- /* only struct inode is valid if it's an inline symlink */
- mpol_free_shared_policy(inode->i_mapping);
- }
shmem_acl_destroy_inode(inode);
kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
}
Index: Linux/mm/mmap.c
===================================================================
--- Linux.orig/mm/mmap.c 2007-06-25 14:53:17.000000000 -0400
+++ Linux/mm/mmap.c 2007-06-25 15:07:07.000000000 -0400
@@ -188,11 +188,24 @@ EXPORT_SYMBOL(__vm_enough_memory);
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
{
+
if (vma->vm_flags & VM_DENYWRITE)
atomic_inc(&file->f_path.dentry->d_inode->i_writecount);
if (vma->vm_flags & VM_SHARED)
mapping->i_mmap_writable--;
+ if (!mapping->i_mmap_writable) {
+ /*
+ * shared mmap()ed file policy persistence model:
+ * remove policy when removing last shared mapping,
+ * unless marked as persistent--e.g., shmem
+ */
+ struct shared_policy *sp = mapping_shared_policy(mapping);
+ if (sp && !(sp->sp_flags & SPOL_F_PERSIST)) {
+ mpol_free_shared_policy(mapping);
+ }
+ }
+
flush_dcache_mmap_lock(mapping);
if (unlikely(vma->vm_flags & VM_NONLINEAR))
list_del_init(&vma->shared.vm_set.list);
Index: Linux/include/linux/shared_policy.h
===================================================================
--- Linux.orig/include/linux/shared_policy.h 2007-06-25 14:53:17.000000000 -0400
+++ Linux/include/linux/shared_policy.h 2007-06-25 15:03:48.000000000 -0400
@@ -3,6 +3,7 @@
#include <linux/fs.h>
#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
/*
* Tree of shared policies for shared memory regions and memory
@@ -24,11 +25,15 @@ struct sp_node {
};
struct shared_policy {
- struct rb_root root;
- spinlock_t lock; /* protects rb tree */
- int nr_sp_nodes; /* for numa_maps */
+ struct rb_root root;
+ spinlock_t lock; /* protects rb tree, nr_sp_nodes */
+ int nr_sp_nodes; /* for numa_maps */
+ int sp_flags; /* persistence, ... */
+ struct rcu_head sp_rcu; /* deferred reclaim */
};
+#define SPOL_F_PERSIST 0x01 /* for shmem use */
+
extern struct shared_policy *mpol_shared_policy_new(struct address_space *,
int, nodemask_t *);
extern int mpol_set_shared_policy(struct shared_policy *, pgoff_t,
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-25 15:03:39.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-25 15:03:48.000000000 -0400
@@ -1155,11 +1155,14 @@ static struct mempolicy * get_vma_policy
struct mempolicy *get_file_policy(struct task_struct *task,
struct address_space *x, pgoff_t pgoff)
{
- struct shared_policy *sp = x->spolicy;
+ struct shared_policy *sp;
struct mempolicy *pol = task->mempolicy;
+ rcu_read_lock();
+ sp = rcu_dereference(x->spolicy);
if (sp)
pol = mpol_shared_policy_lookup(sp, pgoff);
+ rcu_read_unlock();
if (!pol)
pol = &default_policy;
return pol;
@@ -1601,6 +1604,9 @@ restart:
/*
* allocate and initialize a shared policy struct
+ * Locking: mapping->spolicy stabilized by current->mm->mmap_sem.
+ * Can't remove last shared mapping while we hold the sem; can't
+ * remove inode/shared policy while inode is mmap()ed shared.
*/
struct shared_policy *mpol_shared_policy_new(struct address_space *mapping,
int policy, nodemask_t *policy_nodes)
@@ -1634,7 +1640,7 @@ struct shared_policy *mpol_shared_policy
spin_lock(&mapping->i_mmap_lock);
spx = mapping->spolicy;
if (!spx)
- mapping->spolicy = sp;
+ rcu_assign_pointer(mapping->spolicy, sp);
else {
kmem_cache_free(sp_cache, sp);
sp = spx;
@@ -1643,6 +1649,12 @@ struct shared_policy *mpol_shared_policy
return sp;
}
+/*
+ * set/replace shared policy on specified address range
+ * Locking: mapping->spolicy stabilized by current->mm->mmap_sem.
+ * Can't remove last shared mapping while we hold the sem; can't
+ * remove inode/shared policy while inode is mmap()ed shared.
+ */
int mpol_set_shared_policy(struct shared_policy *sp,
pgoff_t pgoff, unsigned long sz,
struct mempolicy *npol)
@@ -1668,31 +1680,54 @@ int mpol_set_shared_policy(struct shared
/*
* Free a backing policy store on inode delete.
+ * Locking: only free shared policy on inode deletion [shmem] or
+ * removal of last shared mmap()ing. Can only delete inode when no
+ * more references. Removal of last shared mmap()ing protected by
+ * mmap_sem [and mapping->i_mmap_lock]. Still a potential race with
+ * shared policy lookups from page cache on behalf of file descriptor
+ * access to pages. Use deferred RCU to protect readers [in get_file_policy()]
+ * from shared policy free on removal of last shared mmap()ing.
*/
-void mpol_free_shared_policy(struct address_space *mapping)
+static void __mpol_free_shared_policy(struct rcu_head *rhp)
{
- struct shared_policy *sp = mapping->spolicy;
- struct sp_node *n;
+ struct shared_policy *sp = container_of(rhp, struct shared_policy,
+ sp_rcu);
struct rb_node *next;
-
- if (!sp)
- return;
-
- mapping->spolicy = NULL;
-
- spin_lock(&sp->lock);
+ /*
+ * Now, we can safely tear down the shared policy tree
+ */
next = rb_first(&sp->root);
while (next) {
- n = rb_entry(next, struct sp_node, nd);
+ struct sp_node *n = rb_entry(next, struct sp_node, nd);
next = rb_next(&n->nd);
rb_erase(&n->nd, &sp->root);
mpol_free(n->policy);
kmem_cache_free(sn_cache, n);
}
- spin_unlock(&sp->lock);
kmem_cache_free(sp_cache, sp);
}
+void mpol_free_shared_policy(struct address_space *mapping)
+{
+ struct shared_policy *sp = mapping->spolicy;
+
+ if (!sp)
+ return;
+
+ rcu_assign_pointer(mapping->spolicy, NULL);
+
+ /*
+ * Presence of the SPOL_F_PERSIST flag means we're freeing the
+ * shared policy in the inode destruction path. No
+ * need for RCU synchronization.
+ */
+ if (sp->sp_flags & SPOL_F_PERSIST)
+ __mpol_free_shared_policy(&sp->sp_rcu);
+ else
+ call_rcu(&sp->sp_rcu, __mpol_free_shared_policy);
+
+}
+
int mpol_parse_options(char *value, int *policy, nodemask_t *policy_nodes)
{
char *nodelist = strchr(value, ':');
* [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (8 preceding siblings ...)
2007-06-25 19:53 ` [PATCH/RFC 9/11] Shared Policy: mapped file policy persistence model Lee Schermerhorn
@ 2007-06-25 19:53 ` Lee Schermerhorn
2007-06-25 21:10 ` Paul Jackson
2007-06-25 19:53 ` [PATCH/RFC 11/11] Shared Policy: add generic file set/get policy vm ops Lee Schermerhorn
` (2 subsequent siblings)
12 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-25 19:53 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Mapped File Policy 10/11 per cpuset shared file policy control
Against 2.6.22-rc4-mm2
Add a per cpuset "shared_file_policy" control file to enable
shared file policy for tasks in the cpuset. Default is disabled,
resulting in the old behavior--i.e., we continue to ignore
mbind() on address ranges backed by shared file mappings.
The "shared_file_policy" file depends on CONFIG_NUMA.
Subsequent patch that "hooks up" generic file .{set|get}_policy
vm_ops will only install a shared policy on a memory mapped file
if the capability has been enabled for the caller's cpuset.
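Usage sketch: assuming the cpuset filesystem is mounted at /dev/cpuset
and a cpuset "myset" exists [both paths illustrative], a privileged
task enables the new semantic with:

#include <fcntl.h>
#include <unistd.h>

static int enable_shared_file_policy(void)
{
	int fd = open("/dev/cpuset/myset/shared_file_policy", O_WRONLY);

	if (fd < 0)
		return -1;
	write(fd, "1", 1);	/* writing '0' restores the default */
	return close(fd);
}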
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/sched.h | 1 +
kernel/cpuset.c | 42 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)
Index: Linux/include/linux/sched.h
===================================================================
--- Linux.orig/include/linux/sched.h 2007-06-22 14:33:03.000000000 -0400
+++ Linux/include/linux/sched.h 2007-06-22 14:34:28.000000000 -0400
@@ -1119,6 +1119,7 @@ struct task_struct {
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next;
+ short shared_file_policy_enabled;
#endif
#ifdef CONFIG_CPUSETS
nodemask_t mems_allowed;
Index: Linux/kernel/cpuset.c
===================================================================
--- Linux.orig/kernel/cpuset.c 2007-06-22 14:33:03.000000000 -0400
+++ Linux/kernel/cpuset.c 2007-06-22 14:34:28.000000000 -0400
@@ -121,6 +121,7 @@ typedef enum {
CS_MEMORY_MIGRATE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+ CS_SHARED_FILE_POLICY,
} cpuset_flagbits_t;
/* convenient tests for these bits */
@@ -149,6 +150,13 @@ static inline int is_spread_slab(const s
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}
+#ifdef CONFIG_NUMA
+static inline int is_shared_file_policy(const struct cpuset *cs)
+{
+ return test_bit(CS_SHARED_FILE_POLICY, &cs->flags);
+}
+#endif
+
/*
* Increment this integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
@@ -409,6 +417,12 @@ void cpuset_update_task_memory_state(voi
tsk->flags |= PF_SPREAD_SLAB;
else
tsk->flags &= ~PF_SPREAD_SLAB;
+#ifdef CONFIG_NUMA
+ if (is_shared_file_policy(cs))
+ tsk->shared_file_policy_enabled = 1;
+ else
+ tsk->shared_file_policy_enabled = 0;
+#endif
task_unlock(tsk);
mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
@@ -923,6 +937,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_SHARED_FILE_POLICY,
} cpuset_filetype_t;
static ssize_t cpuset_common_file_write(struct container *cont,
@@ -987,6 +1002,12 @@ static ssize_t cpuset_common_file_write(
retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
cs->mems_generation = cpuset_mems_generation++;
break;
+#ifdef CONFIG_NUMA
+ case FILE_SHARED_FILE_POLICY:
+ retval = update_flag(CS_SHARED_FILE_POLICY, cs, buffer);
+ cs->mems_generation = cpuset_mems_generation++;
+ break;
+#endif
default:
retval = -EINVAL;
goto out2;
@@ -1080,6 +1101,11 @@ static ssize_t cpuset_common_file_read(s
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
+#ifdef CONFIG_NUMA
+ case FILE_SHARED_FILE_POLICY:
+ *s++ = is_shared_file_policy(cs) ? '1' : '0';
+ break;
+#endif
default:
retval = -EINVAL;
goto out;
@@ -1163,6 +1189,14 @@ static struct cftype cft_spread_slab = {
.private = FILE_SPREAD_SLAB,
};
+#ifdef CONFIG_NUMA
+static struct cftype cft_shared_file_policy = {
+ .name = "shared_file_policy",
+ .read = cpuset_common_file_read,
+ .write = cpuset_common_file_write,
+ .private = FILE_SHARED_FILE_POLICY,
+};
+#endif
int cpuset_populate(struct container_subsys *ss, struct container *cont)
{
int err;
@@ -1183,6 +1217,10 @@ int cpuset_populate(struct container_sub
return err;
if ((err = container_add_file(cont, &cft_spread_slab)) < 0)
return err;
+#ifdef CONFIG_NUMA
+ if ((err = container_add_file(cont, &cft_shared_file_policy)) < 0)
+ return err;
+#endif
/* memory_pressure_enabled is in root cpuset only */
if (err == 0 && !cont->parent)
err = container_add_file(cont, &cft_memory_pressure_enabled);
@@ -1221,6 +1259,10 @@ int cpuset_create(struct container_subsy
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);
+#ifdef CONFIG_NUMA
+ if (is_shared_file_policy(parent))
+ set_bit(CS_SHARED_FILE_POLICY, &cs->flags);
+#endif
cs->cpus_allowed = CPU_MASK_NONE;
cs->mems_allowed = NODE_MASK_NONE;
cs->mems_generation = cpuset_mems_generation++;
* [PATCH/RFC 11/11] Shared Policy: add generic file set/get policy vm ops
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (9 preceding siblings ...)
2007-06-25 19:53 ` [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control Lee Schermerhorn
@ 2007-06-25 19:53 ` Lee Schermerhorn
2007-06-26 22:17 ` [PATCH/RFC 0/11] Shared Policy Overview Christoph Lameter
2007-06-26 22:21 ` Christoph Lameter
12 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-25 19:53 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, nacc, ak, Lee Schermerhorn, clameter
Shared Mapped File Policy 11/11 add generic file set/get policy vm ops
Against 2.6.22-rc4-mm2
Add set/get policy vm ops to generic_file_vm_ops in support of
mmap()ed file memory policies. This patch effectively "hooks up"
shared file mappings to the NUMA shared policy infrastructure.
NOTE: we could return an error on an attempt to mbind() a shared,
mapped file when shared_file_policy is disabled, instead of just ignoring it.
This would change existing behavior in the default case--something
I've tried to avoid--but would let the application/programmer know
that the operation is unsupported.
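A userspace sketch exercising both new vm_ops [addresses and node mask
are illustrative; assumes shared_file_policy is enabled]: mbind()
reaches generic_file_set_policy(), and get_mempolicy(MPOL_F_ADDR) reads
the same shared policy back via generic_file_get_policy(), so the
policy is visible to every sharer of the mapping.

#include <sys/mman.h>
#include <numaif.h>

/* 'a' is a page aligned address within a MAP_SHARED file mapping */
static int policy_roundtrip(char *a, size_t pgsz)
{
	unsigned long set = 0x2, got = 0;	/* prefer node 1 */
	int mode = -1;

	mbind(a, 4 * pgsz, MPOL_PREFERRED, &set, 8 * sizeof(set), 0);
	get_mempolicy(&mode, &got, 8 * sizeof(got), a, MPOL_F_ADDR);

	return mode == MPOL_PREFERRED;
}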
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/filemap.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
Index: Linux/mm/filemap.c
===================================================================
--- Linux.orig/mm/filemap.c 2007-06-25 15:03:25.000000000 -0400
+++ Linux/mm/filemap.c 2007-06-25 15:04:37.000000000 -0400
@@ -30,6 +30,7 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/mempolicy.h>
@@ -508,6 +509,42 @@ struct page *__page_cache_alloc(struct a
return alloc_page_pol(gfp, pol, pgoff);
}
EXPORT_SYMBOL(__page_cache_alloc);
+
+static int generic_file_set_policy(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct mempolicy *new)
+{
+ struct address_space *mapping;
+ struct shared_policy *sp;
+ unsigned long sz;
+ pgoff_t pgoff;
+
+ if (!current->shared_file_policy_enabled)
+ return 0; /* could [should?] be -EINVAL */
+
+ mapping = vma->vm_file->f_mapping;
+ sp = mapping->spolicy;
+ if (!sp) {
+ sp = mpol_shared_policy_new(mapping, MPOL_DEFAULT, NULL);
+ if (IS_ERR(sp))
+ return PTR_ERR(sp);
+ }
+
+ sz = (end - start) >> PAGE_SHIFT;
+ pgoff = vma_addr_to_pgoff(vma, start, PAGE_SHIFT);
+ return mpol_set_shared_policy(sp, pgoff, sz, new);
+}
+
+static struct mempolicy *
+generic_file_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct shared_policy *sp = vma->vm_file->f_mapping->spolicy;
+ if (!sp)
+ return NULL;
+
+ return mpol_shared_policy_lookup(sp,
+ vma_addr_to_pgoff(vma, addr, PAGE_SHIFT));
+}
#endif
static int __sleep_on_page_lock(void *word)
@@ -1547,6 +1584,10 @@ EXPORT_SYMBOL(filemap_fault);
struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+#ifdef CONFIG_NUMA
+ .set_policy = generic_file_set_policy,
+ .get_policy = generic_file_get_policy,
+#endif
};
/* This is used for a general mmap of a disk file */
* Re: [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control
2007-06-25 19:53 ` [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control Lee Schermerhorn
@ 2007-06-25 21:10 ` Paul Jackson
2007-06-27 17:33 ` Lee Schermerhorn
0 siblings, 1 reply; 48+ messages in thread
From: Paul Jackson @ 2007-06-25 21:10 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, nacc, ak, clameter
Lee wrote:
> +#ifdef CONFIG_NUMA
Hmmm ... our very first ifdef CONFIG_NUMA in kernel/cpuset.c,
and the second ifdef ever in that file. (And I doubt that
the first ifdef, on CONFIG_MEMORY_HOTPLUG, is necessary.)
How about we just remove these ifdef CONFIG_NUMA's, and
let that per-cpuset 'shared_file_policy' always be present?
It just won't do a heck of a lot on non-NUMA systems.
No sense in breaking code that happens to access that file,
just because we're running on a system where it's useless.
It seems better to just simply, consistently, always have
that file present.
And I don't like ifdef's in kernel/cpuset.c. If necessary,
put them in some header file, related to whatever piece of
code has to shrink down to nothingness when not configured.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (10 preceding siblings ...)
2007-06-25 19:53 ` [PATCH/RFC 11/11] Shared Policy: add generic file set/get policy vm ops Lee Schermerhorn
@ 2007-06-26 22:17 ` Christoph Lameter
2007-06-27 13:43 ` Lee Schermerhorn
2007-06-26 22:21 ` Christoph Lameter
12 siblings, 1 reply; 48+ messages in thread
From: Christoph Lameter @ 2007-06-26 22:17 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, nacc, ak
On Mon, 25 Jun 2007, Lee Schermerhorn wrote:
> With patches 1-3 applied, external behavior is, AFAICT, exactly
> the same as current behavior. The internal differences are that
> shared policy is now a pointer in the address_space structure.
> A NULL value [the default] indicates default policy. The shared
> policy is allocated on demand--when one mbind()s a virtual
> address range backed by a shmem memory object.
The cleanup patches introduce new fields in structures etc. And we
do not have a clean way of handling policies that are modified from the
task containing the policy itself without the later patches. Are the
fixes really first? Looks like you first introduce new features.
> Patch #3 eliminates the need for a pseudo-vma on the stack to
> initialize policies for tmpfs inodes when the superblock has
> a non-default policy by changing the interface to
> mpol_set_shared_policy() to take a page offset and size in pages,
> computed in the shmem set_policy vm_op. This cleanup addresses
> one complaint about the current shared policy infrastructure.
Introducing more locking problems? The policy comes from the context;
that is why the vma on the stack is needed. If you remove the vma, then
what protects the policy structures?
> The other internal difference is that linear mappings that support
> the 'set_policy' vm_op are mapped by a single VMA--not split on
> policy boundaries. numa_maps needs to be able to handle this
> anyway because a task that attaches a shmem segment on which
> another task has already installed multiple shared policies will
> have a single vma mapping the entire segment. Patch #4 fixes
> numa_maps to display these properly.
Hmmmm.... Does this display the ranges of pages in a shmem area if
they have distinct policies? Looks like a mess: you need to scan
vmas for most vmas, and then for some vmas you also need to scan the
shared memory policy tree.
How much overhead is added by having to lookup memory policies in the
policy tree?
> [I hope patch #9 will alleviate Andi's concerns about an unspecified
> persistence model. Note that the model implemented by patch #9 could
> easily be enhanced to persist beyond the last shared mapping--e.g.,
> via some additional mbind() flags, such as MPOL_MF_[NO]PERSIST--and
> possibly enhancements to numactl to set/remove shared policy on files.
> I didn't want to pursue that in this patch set because I don't have a
> use for it, and it will require some tool to list files with persistent
> shared policy--perhaps an enhancement to lsof(8).]
It's still nondeterministic behavior: a process cannot know when all the
other processes sharing the mapping terminate. Why would a policy continue
to exist when the process that established it terminates? That is a
surprising change in semantics.
> Patch #10 adds a per cpuset control file--shared_file_policy--to
> explicitly enable/disable shared policy on shared file mappings.
> Default is disabled--current behavior. That is, even with all 11
> patches applied, you'll have to explicitly enable shared file policy,
> else the kernel will continue to ignore mbind() of address ranges backed
> by a shared regular file mapping. This preserves existing behavior for
> applications that might currently be installing memory policies on
> shared regular file mappings, not realizing that they are ignored.
> Such applications might break or behave unexpectedly if the kernel
> suddenly starts using the shared policy. With the per cpuset control
> defaulting to current behavior, an explicit action by a privileged
> user is required to enable the new behavior.
Hmmmm... This only works in a cpuset?
> [I hope patch #10 alleviates Christoph's concern about unexpected
> interaction of shared policies on mmap()ed files in one cpuset with
> file descriptor access from another cpuset. This can only happen if
> the user/administrator explicitly enables shared file policies for an
> application.]
But a global switch for this? Isn't there some way to come up with clean
policy semantics that cover all of this? It's complex enough.
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
` (11 preceding siblings ...)
2007-06-26 22:17 ` [PATCH/RFC 0/11] Shared Policy Overview Christoph Lameter
@ 2007-06-26 22:21 ` Christoph Lameter
2007-06-26 22:42 ` Andi Kleen
2007-06-27 18:14 ` Lee Schermerhorn
12 siblings, 2 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-06-26 22:21 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, nacc, ak
On Mon, 25 Jun 2007, Lee Schermerhorn wrote:
> Also note that because we can remove a shared policy from a "live"
> inode, we need to handle potential races with another task performing
> a get_file_policy() on the same file via a file descriptor access
> [read()/write()/...]. Patch #9 handles this by defining an RCU reader
> critical region in get_file_policy() and by synchronizing with this
> in mpol_free_shared_policy().
You are sure that this works? Just by looking at the description: It
cannot work. Any allocator use of a memory policy must use rcu locks
otherwise the memory policy can vanish from under us while allocating a
page. This means you need to add this to alloc_pages_current
and alloc_pages_node. Possibly all of __alloc_pages must be handled
under RCU. This is a significant increase of RCU use.
If we can make this work then RCU should be used for all policies so that
we can get rid of the requirement that policies can only be modified from
the task context that created it.
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-26 22:21 ` Christoph Lameter
@ 2007-06-26 22:42 ` Andi Kleen
2007-06-27 3:25 ` Christoph Lameter
2007-06-27 18:14 ` Lee Schermerhorn
1 sibling, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2007-06-26 22:42 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, linux-mm, akpm, nacc
On Wednesday 27 June 2007 00:21, Christoph Lameter wrote:
> On Mon, 25 Jun 2007, Lee Schermerhorn wrote:
> > Also note that because we can remove a shared policy from a "live"
> > inode, we need to handle potential races with another task performing
> > a get_file_policy() on the same file via a file descriptor access
> > [read()/write()/...]. Patch #9 handles this by defining an RCU reader
> > critical region in get_file_policy() and by synchronizing with this
> > in mpol_free_shared_policy().
>
> You are sure that this works? Just by looking at the description: It
> cannot work. Any allocator use of a memory policy must use rcu locks
> otherwise the memory policy can vanish from under us while allocating a
> page. This means you need to add this to alloc_pages_current
> and alloc_pages_node. Possibly all of __alloc_pages must be handled
> under RCU. This is a significant increase of RCU use.
I've been actually looking at using RCUs for the shared policies
too to plug the recent reference count issue. I don't think it's a problem
because the RCU use can be limited to when policies are actually
used. Besides rcu_read_lock() is a nop on non preemptible kernels
anyways and users of preemptible kernels will probably not notice
it among all the other overhead they have anyways.
> If we can make this work then RCU should be used for all policies so that
> we can get rid of the requirement that policies can only be modified from
> the task context that created it.
Huh? RCU doesn't give you locking against multiple writers. Just existence
guarantees. And you can have those already by just holding the reference
count.
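In sketch form, using the existing mpol_get()/mpol_free() helpers
[comments reflect the 2.6.22-era implementation]:

	mpol_get(pol);	/* atomic_inc(&pol->refcnt) -- pol can't vanish */
	/* ... use pol, possibly sleeping in the allocator ... */
	mpol_free(pol);	/* __mpol_free(): atomic_dec_and_test(), freed at zero */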
-Andi
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-26 22:42 ` Andi Kleen
@ 2007-06-27 3:25 ` Christoph Lameter
2007-06-27 20:14 ` Lee Schermerhorn
0 siblings, 1 reply; 48+ messages in thread
From: Christoph Lameter @ 2007-06-27 3:25 UTC (permalink / raw)
To: Andi Kleen; +Cc: Lee Schermerhorn, linux-mm, akpm, nacc
On Wed, 27 Jun 2007, Andi Kleen wrote:
> > You are sure that this works? Just by looking at the description: It
> > cannot work. Any allocator use of a memory policy must use rcu locks
> > otherwise the memory policy can vanish from under us while allocating a
> > page. This means you need to add this to alloc_pages_current
> > and alloc_pages_node. Possibly all of __alloc_pages must be handled
> > under RCU. This is a significant increase of RCU use.
>
> I've been actually looking at using RCUs for the shared policies
> too to plug the recent reference count issue. I don't think it's a problem
> because the RCU use can be limited to when policies are actually
> used. Besides rcu_read_lock() is a nop on non preemptible kernels
> anyways and users of preemptible kernels will probably not notice
> it among all the other overhead they have anyways.
If a system policy is set then it will be used all of the time.
Could be a significant increase in RCU use.
> > If we can make this work then RCU should be used for all policies so that
> > we can get rid of the requirement that policies can only be modified from
> > the task context that created it.
>
> Huh? RCU doesn't give you locking against multiple writers. Just existence
> guarantees. And you can have those already by just holding the reference
> count.
If you want to replace one policy by another then RCU ensures
that the old policy can still be used for the remainder of the RCU grace period.
If RCU is not used then the updating of a policy is not possible since
there is currently no locking and there may be concurrent uses of the
policy or the zonelist generated by a policy. One thread may acquire
the pointer to a policy while another changes the policy.
If the old policy is immediately freed then the first thread may access
invalid data.
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-26 22:17 ` [PATCH/RFC 0/11] Shared Policy Overview Christoph Lameter
@ 2007-06-27 13:43 ` Lee Schermerhorn
0 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-27 13:43 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, akpm, nacc, ak
On Tue, 2007-06-26 at 15:17 -0700, Christoph Lameter wrote:
> On Mon, 25 Jun 2007, Lee Schermerhorn wrote:
>
> > With patches 1-3 applied, external behavior is, AFAICT, exactly
> > the same as current behavior. The internal differences are that
> > shared policy is now a pointer in the address_space structure.
> > A NULL value [the default] indicates default policy. The shared
> > policy is allocated on demand--when one mbind()s a virtual
> > address range backed by a shmem memory object.
>
> The cleanup patches introduce new fields in structures etc. And we
> do not have a clean way of handling policies that are modified from the
> task containing the policy itself without the later patches. Are the
> fixes really first? Looks like you first introduce new features.
As I said, the "fixes" are atop the "cleanup" patches. Yes, the first
three patches move the shared policy from being embedded in the shmem
and hugetlbfs inodes to being dynamically allocated as needed, anchored
by a pointer in the inode's address_space. However, these patches
introduce no new behavior from the application's viewpoint. They just
generalize the infrastructure to simplify the numa_maps fix and make the
hugetlbfs shmem hook up trivial.
The default case [no shared policy] is even faster because we only need
to note that no shared policy exists on the inode to fall back to task
or system policy, rather than do an rb-tree lookup for the single
default policy.
Not sure what you mean about "handling policies that are modified from
the task containing the policy itself". Shared policies are not
"contained" by any task. They are attached directly to the "shared
memory object" represented by the inode/address_space. As far as
locking, see next response...
>
> > Patch #3 eliminates the need for a pseudo-vma on the stack to
> > initialize policies for tmpfs inodes when the superblock has
> > a non-default policy by changing the interface to
> > mpol_set_shared_policy() to take a page offset and size in pages,
> > computed in the shmem set_policy vm_op. This cleanup addresses
> > one complaint about the current shared policy infrastructure.
>
> Introducing more locking problems? The policy comes from the context; that
> is why the vma on the stack is needed. If you remove the vma then what
> protects the policy structures?
The vma was only used to define the start/end address of the range to
which the policy applied. No longer needed, with patch #3. Shared
policies, once they exist, are protected by their embedded spin lock.
Once we can delete a shared policy before the inode is deleted [patch
#9], we have to protect the shared policy pointer via RCU. I don't
think I've introduced any NEW locking problems.
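For concreteness, here is roughly what the reworked shmem set_policy
vm_op boils down to. This is only a sketch of the idea, not the patch
itself; everything beyond the mpol_set_shared_policy() interface change
described above is illustrative:

	static int shmem_set_policy(struct vm_area_struct *vma,
				    struct mempolicy *new)
	{
		struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
		pgoff_t pgoff = vma->vm_pgoff;
		unsigned long sz = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

		/* policy covers pages [pgoff, pgoff + sz) -- no pseudo-vma */
		return mpol_set_shared_policy(inode->i_mapping, pgoff, sz, new);
	}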
>
> > The other internal difference is that linear mappings that support
> > the 'set_policy' vm_op are mapped by a single VMA--not split on
> > policy boundaries. numa_maps needs to be able to handle this
> > anyway because a task that attaches a shmem segment on which
> > another task has already installed multiple shared policies will
> > have a single vma mapping the entire segment. Patch #4 fixes
> > numa_maps to display these properly.
>
> Hmmmm.... Does this display the ranges of pages in a shmem area if
> they have distinct policies? Looks like a mess: You need to scan
> vmas for most vmas and then for some vmas you need to scan the shared
> memory policy tree.
Yes, each policy range is displayed separately. This is how numa_maps
worked for the task that installed the policies [assuming only one task
installed policies on the segment]. If you looked at another task that
attached a shmem segment with shared policy, numa_maps would only show the
policy for the single vma mapping the entire segment. I made numa_maps
aware of the fact that a single vma can map a shared object with
multiple policy ranges.
>
> How much overhead is added by having to lookup memory policies in the
> policy tree?
"None," he says glibly. It already does this for shmem segments. It's
just that now, instead of a new vma for each policy [at least for the
task that installed the policies], numa_maps gets a list of ranges that
have different policies for the object and iterates over those.
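In sketch form [assuming the existing shared_policy/sp_node layout in
mm/mempolicy.c; the helper names here are made up]:

	static void show_object_policy_ranges(struct shared_policy *sp)
	{
		struct rb_node *nd;

		spin_lock(&sp->lock);
		for (nd = rb_first(&sp->root); nd; nd = rb_next(nd)) {
			struct sp_node *n = rb_entry(nd, struct sp_node, nd);

			/* each node covers page offsets [start, end) */
			display_one_range(n->start, n->end, n->policy);
		}
		spin_unlock(&sp->lock);
	}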
>
> > [I hope patch #9 will alleviate Andi's concerns about an unspecified
> > persistence model. Note that the model implemented by patch #9 could
> > easily be enhanced to persist beyond the last shared mapping--e.g.,
> > via some additional mbind() flags, such as MPOL_MF_[NO]PERSIST--and
> > possibly enhancements to numactl to set/remove shared policy on files.
> > I didn't want to pursue that in this patch set because I don't have a
> > use for it, and it will require some tool to list files with persistent
> > shared policy--perhaps an enhancement to lsof(8).]
>
> It's still nondeterministic behavior: a process cannot know when all the
> other processes sharing the mapping terminate. Why would a policy continue
> to exist when the process that established it terminates? That is a
> surprising change in semantics.
Because all of the processes mapping the file shared are cooperating. The
one that applies the policy may actually terminate after the actual
"worker bee" processes are started. I would agree that, if unrelated
processes started mmap()ing files shared and installing policy for their
own purposes with no cooperation, some processes might be surprised by
the policy, if they noticed. This can be avoided by only enabling the
feature [patch #10] for systems [or cpusets] running well-behaved,
cooperating processes. The latter are, I think, well served by this
capability.
I'm not averse to supporting a shared policy on regular files that
persists beyond any mappings--e.g., via the mbind() MPOL_MF_PERSIST flag
that I mentioned. The persistent policy itself can add a reference to
the inode that pins it until the policy is removed and no processes have
the file open/mmap()ed. As you and I discussed, this would argue for
some mechanism to list files hanging around only because they have a
persistent policy--something like an enhanced lsof command. As I
mentioned, I don't have a use case for this type of persistence, but I'd
be happy to help implement it, if anyone does.
>
> > Patch #10 adds a per cpuset control file--shared_file_policy--to
> > explicitly enable/disable shared policy on shared file mappings.
> > Default is disabled--current behavior. That is, even with all 11
> > patches applied, you'll have to explicitly enable shared file policy,
> > else the kernel will continue to ignore mbind() of address ranges backed
> > by a shared regular file mapping. This preserves existing behavior for
> > applications that might currently be installing memory policies on
> > shared regular file mappings, not realizing that they are ignored.
> > Such applications might break or behave unexpectedly if the kernel
> > suddenly starts using the shared policy. With the per cpuset control
> > defaulting to current behavior, an explicit action by a privileged
> > user is required to enable the new behavior.
>
> Hmmmm... This only works in a cpuset?
At this point yes. It's a convenient task grouping facility. I figured
that platforms where locality, and therefore memory policy, really
matters are likely to have cpusets anyway, and one might not want to
enable the facility for the entire system. However, I have thought
about what to do for systems with cpusets/containers configured out:
Note that the "shared_file_policy_enabled" flag is dependent on NUMA, as
is memory policy. I need to add a way to initialize this task struct
field for the init task for systems w/o cpusets, and then all tasks will
inherit this state. I can add a kernel command line option to control
this. Not sure it's worth a run-time sysctl separate from the cpusets.
What do you think?
>
> > [I hope patch #10 alleviates Christoph's concern about unexpected
> > interaction of shared policies on mmap()ed files in one cpuset with
> > file descriptor access from another cpuset. This can only happen if
> > the user/administrator explicitly enables shared file policies for an
> > application.]
>
> But a global switch for this? Isn't there some way to come up with clean
> policy semantics that cover all of this? It's complex enough.
Global? Well, could be, if you set it in the top level cpuset; or if I
provide the command line option for !CPUSET configs.
I think we'll always have some tension between containers/cpusets and
memory policies--because I think they serve different purposes for
different audiences. But, I think we can come up with some additional
semantics that address some [but not all] of the issues we've
discussed--e.g., objects shared between disjoint cpusets, and backup
tasks evicting an application's working set because policy restricts a
mmap'd file's pages to the application's cpuset. I think part of the
problem with this scenario is in the page cache reclaim area. Let's
discuss in the NUMA BOF and go forward from there.
* Re: [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control
2007-06-25 21:10 ` Paul Jackson
@ 2007-06-27 17:33 ` Lee Schermerhorn
2007-06-27 19:52 ` Paul Jackson
0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-27 17:33 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, akpm, nacc, ak, clameter
On Mon, 2007-06-25 at 14:10 -0700, Paul Jackson wrote:
> Lee wrote:
> > +#ifdef CONFIG_NUMA
>
> Hmmm ... our very first ifdef CONFIG_NUMA in kernel/cpuset.c,
> and the second ifdef ever in that file. (And I doubt that
> the first ifdef, on CONFIG_MEMORY_HOTPLUG, is necessary.)
Yeah, I was expecting this comment. ;-) more below...
>
> How about we just remove these ifdef CONFIG_NUMA's, and
> let that per-cpuset 'shared_file_policy' always be present?
> It just won't do a heck of a lot on non-NUMA systems.
If my patches eventually go in, I'd agree with this. I was trying to be
a good doobee and not add code that wasn't needed.
>
> No sense in breaking code that happens to access that file,
> just because we're running on a system where it's useless.
> It seems better to just simply, consistently, always have
> that file present.
I guess I wouldn't expect much code to access that file other than some
cpuset setup script [maybe program] that enables shared file policy. In
my various NUMA patch sets [shared policy, lazy/automigration, ...], I
created quite a few additional control files like "shared_file_policy".
I've written scripts to set up cpusets for testing these features. I
usually code something like:
[[ ! -f $cpuset/shared_file_policy ]] || echo 1 >$cpuset/...
so they don't break if the file is missing--just don't do anything.
>
> And I don't like ifdef's in kernel/cpuset.c. If necessary,
> put them in some header file, related to whatever piece of
> code has to shrink down to nothingness when not configured.
I understand about #ifdef's in kernel code. I would have implemented a
number of static inline functions or macros in a header, but in some
places, I need to add a case to a switch statement. That's harder to do
with macros and static inline functions. I wasn't sure that a macro
that defines an additional case statement would make it past the
"readability nazis" [;-)].
My experience here has made me think that the cpuset implementation for
adding additional control files conditionally could be made more "data
driven" [like procfs?] so that I only need to add a single array-element
initialization and any supporting functions under #ifdef; plus a few
conditionally defined static inline functions for things like
"update_task_memory_state" and such. We'd still need some ifdefs, but
not within individual functions.
Thoughts?
Lee
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-26 22:21 ` Christoph Lameter
2007-06-26 22:42 ` Andi Kleen
@ 2007-06-27 18:14 ` Lee Schermerhorn
2007-06-27 21:37 ` Christoph Lameter
1 sibling, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-27 18:14 UTC (permalink / raw)
To: Christoph Lameter, Paul E. McKenney; +Cc: linux-mm, akpm, nacc, ak
On Tue, 2007-06-26 at 15:21 -0700, Christoph Lameter wrote:
> On Mon, 25 Jun 2007, Lee Schermerhorn wrote:
>
> > Also note that because we can remove a shared policy from a "live"
> > inode, we need to handle potential races with another task performing
> > a get_file_policy() on the same file via a file descriptor access
> > [read()/write()/...]. Patch #9 handles this by defining an RCU reader
> > critical region in get_file_policy() and by synchronizing with this
> > in mpol_free_shared_policy().
>
> You are sure that this works?
Well, I DO need to ask Dr. RCU [Paul McK.] to take a look at the patch,
but this is how I understand RCU to work...
Paul: could you take a look at patch #9 of the Shared Policy series?
> Just by looking at the description: It
> cannot work. Any allocator use of a memory policy must use rcu locks
> otherwise the memory policy can vanish from under us while allocating a
> page.
The only place we need to worry about is "get_file_policy()"--that
is the only place one can attempt to lookup a shared policy w/o holding
the [user virtual] address space locked [mmap_sem] which pins the shared
mapping of the file, so the i_mmap_writable count can't go to zero, so
we can't attempt to free the policy. And even then, it's only an issue
for file descriptor accessed page cache allocs. Lookups called from the
fault path do have the user vas locked during the fault, so the policy
can't go away. But, because __page_cache_alloc() calls
get_file_policy() to lookup the policy at the faulting page offset, it
uses RCU on the read side, anyway. I should probably write up the
entire locking picture for this, huh?
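The read side amounts to something like this [a simplified sketch of the
patch #9 logic, assuming the spolicy pointer in the address_space]:

	struct shared_policy *sp;
	struct mempolicy *pol = NULL;

	rcu_read_lock();
	sp = rcu_dereference(mapping->spolicy);
	if (sp)
		pol = mpol_shared_policy_lookup(sp, pgoff); /* takes a ref */
	rcu_read_unlock();
	/* NULL pol => fall back to task or system default policy */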
> This means you need to add this to alloc_pages_current
> and alloc_pages_node. Possibly all of __alloc_pages must be handled
> under RCU. This is a significant increase of RCU use.
alloc_pages_current() doesn't look up shared policy--not even vma
policy. It just grabs the task's current policy, falling back to the
[statically defined] system default_policy if no task/process policy.
alloc_pages_node() doesn't use policy at all. Just looks up the
zonelist based on the nid and the gfp_zone()--sort of an abbreviated,
in-line zonelist_policy() call.
But, I think RCU could be used to access/free the task policy and allow
changes to the policy from outside the task. Probably for vma policies
as well.
>
> If we can make this work then RCU should be used for all policies so that
> we can get rid of the requirement that policies can only be modified from
> the task context that created it.
Yeah, I think that's possible...
Lee
* Re: [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control
2007-06-27 17:33 ` Lee Schermerhorn
@ 2007-06-27 19:52 ` Paul Jackson
2007-06-27 20:22 ` Lee Schermerhorn
0 siblings, 1 reply; 48+ messages in thread
From: Paul Jackson @ 2007-06-27 19:52 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, nacc, ak, clameter
> If my patches eventually go in, I'd agree with this. I was trying to be
> a good doobee and not add code that wasn't needed.
The ifdef's are added code -- added source code.
For a body of code that's as big as the Linux kernel, and changing
at the speed of Andrew's Enter key, I worry more about keeping the
source code as easy to read as possible, than I do about the last
few bytes of kernel text size.
The success of Linux is far more constrained by the limitations of
human neurons than by the limitations of dynamic RAM chips.
> [[ ! -f $cpuset/shared_file_policy ]] || echo 1 >$cpuset/...
Sure - you can code that - that 'shared_file_policy' file is your baby,
and you know how best to care for it.
But it leads to others writing code that doesn't have this [[ ! -f
... ]] guard, which code works fine ... for a while. Works long
enough to get good and buried in three layers of cruft, leading to
a problem costing someone hours or days to unravel, when it finally
hits a machine lacking that file.
I'd sure like to see that ifdef gone. I wish I had the time now to
go stamp out that other ifdef in kernel/cpuset.c as well.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 3:25 ` Christoph Lameter
@ 2007-06-27 20:14 ` Lee Schermerhorn
0 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-27 20:14 UTC (permalink / raw)
To: Andi Kleen, Christoph Lameter; +Cc: linux-mm, akpm, nacc
On Tue, 2007-06-26 at 20:25 -0700, Christoph Lameter wrote:
> On Wed, 27 Jun 2007, Andi Kleen wrote:
>
> > > You are sure that this works? Just by looking at the description: It
> > > cannot work. Any allocator use of a memory policy must use rcu locks
> > > otherwise the memory policy can vanish from under us while allocating a
> > > page. This means you need to add this to alloc_pages_current
> > > and alloc_pages_node. Possibly all of __alloc_pages must be handled
> > > under RCU. This is a significant increase of RCU use.
> >
> > I've been actually looking at using RCUs for the shared policies
> > too to plug the recent reference count issue. I don't think it's a problem
> > because the RCU use can be limited to when policies are actually
> > used. Besides rcu_read_lock() is a nop on non preemptible kernels
> > anyways and users of preemptible kernels will probably not notice
> > it among all the other overhead they have anyways.
Hi, Andi:
I see that Christoph has already responded, so I'll respond in the
context of his message.
>
> If a system policy is set then it will be used all of the time.
> Could be a significant increase in RCU use.
Generally, I don't think you need to use RCU for the system policy, as
it is statically allocated. Now, if "default_policy" were changed to a
pointer to the actual policy, AND you could replace the pointer at
run-time, there might be a use for RCU.
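Something like the following, perhaps--purely hypothetical, since
default_policy is a static struct today; readers would still have to take
a reference [or avoid sleeping] across their use of the old policy:

	static struct mempolicy *default_policy_ptr = &default_policy;

	void set_system_default_policy(struct mempolicy *new)
	{
		struct mempolicy *old = default_policy_ptr;

		rcu_assign_pointer(default_policy_ptr, new);
		synchronize_rcu();	/* wait out readers still using 'old' */
		mpol_free(old);
	}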
>
> > > If we can make this work then RCU should be used for all policies so that
> > > we can get rid of the requirement that policies can only be modified from
> > > the task context that created it.
> >
> > Huh? RCU doesn't give you locking against multiple writers. Just existence
> > guarantees. And you can have those already by just holding the reference
> > count.
Right. It only works for shared policies, because shared policies have
a spin lock that protects the rb-tree from concurrent updates. [And the
policies stored in the tree seem to be reference counted properly.]
However, I think RCU could be used for changing, including deleting
[more below], the task/process policy and a given VMA policy in a
similar fashion to the way I'm deleting shared file policy on removal of
the last shared mapping.
RE: deleting: it occurs to me that installing a "DEFAULT" policy could
actually delete the corresponding policy without changing semantics. I
plan on looking at this after OLS.
>
> If you want to replace one policy by another then RCU ensures
> that the old policy can still be used for the remainder of the RCU grace period.
As Christoph indicates, I'm using RCU to replace the shared policy on a
regular file with NULL [== default!] on unmap of the last shared mapping. I
need to protect against any references that come from other than a
shared mapping. This includes accesses via regular file system IO and
faults from private mappings of the file. Unlike shared mappings,
which as you say are protected from disappearing by reference counts,
page cache allocations to satisfy normal file descriptor based IO or
faults in private mappings can't guarantee that the policy won't go away
when some other task removes the last shared mapping.
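The writer side of that is essentially [a simplified sketch of the
patch #9 teardown; the locking around the pointer update is elided]:

	struct shared_policy *sp = mapping->spolicy;

	rcu_assign_pointer(mapping->spolicy, NULL);
	synchronize_rcu();	/* drain get_file_policy() readers */
	if (sp) {
		mpol_free_shared_policy(sp);	/* empty the rb-tree */
		kfree(sp);	/* sp is separately allocated in this series */
	}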
>
> If RCU is not used then the updating of a policy is not possible since
> there is currently no locking and there may be concurrent uses of the
> policy or the zonelist generated by a policy. One thread may acquire
> the pointer to a policy while another changes the policy.
>
> If the old policy is immediately freed then the first thread may access
> invalid data.
Yep. So, deleting a policy or replacing it with another can be done
safely under RCU, assuming everyone who gains access to the policy takes
a proper reference [AND releases it when finished...].
Lee
* Re: [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control
2007-06-27 19:52 ` Paul Jackson
@ 2007-06-27 20:22 ` Lee Schermerhorn
2007-06-27 20:36 ` Paul Jackson
0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-27 20:22 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, akpm, nacc, ak, clameter
On Wed, 2007-06-27 at 12:52 -0700, Paul Jackson wrote:
> > If my patches eventually go in, I'd agree with this. I was trying to be
> > a good doobee and not add code that wasn't needed.
>
> The ifdef's are added code -- added source code.
>
> For a body of code that's as big as the Linux kernel, and changing
> at the speed of Andrew's Enter key, I worry more about keeping the
> source code as easy to read as possible, than I do about the last
> few bytes of kernel text size.
Point taken.
>
> The success of Linux is far more constrained by the limitations of
> human neurons than by the limitations of dynamic RAM chips.
>
> > [[ ! -f $cpuset/shared_file_policy ]] || echo 1 >$cpuset/...
>
> Sure - you can code that - that 'shared_file_policy' file is your baby,
> and you know how best to care for it.
>
> But it leads to others writing code that doesn't have this [[ ! -f
> ... ]] guard, which code works fine ... for a while. Works long
> enough to get good and buried in three layers of cruft, leading to
> a problem costing someone hours or days to unravel, when it finally
> hits a machine lacking that file.
>
> I'd sure like to see that ifdef gone. I wish I had the time now to
> go stamp out that other ifdef in kernel/cpuset.c as well.
Well, I'm sure I've got a few more spins to go on this patch set ;-). I
can easily just remove the ifdefs and see what folks think. I'll need
to add a couple of conditionally defined functions/macros to handle the
update/test of "shared_file_policy_enabled" when the system is
configured w/o NUMA. The resulting #ifdefs will be in the header. Are
you "OK" with that?
Lee
* Re: [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control
2007-06-27 20:22 ` Lee Schermerhorn
@ 2007-06-27 20:36 ` Paul Jackson
0 siblings, 0 replies; 48+ messages in thread
From: Paul Jackson @ 2007-06-27 20:36 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, nacc, ak, clameter
> The resulting #ifdefs will be in the header. Are
> you "OK" with that?
Yes - headers are the right place for this.
No need to re-release your patch set just for this;
I'll be quite happy if it makes your next round.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 18:14 ` Lee Schermerhorn
@ 2007-06-27 21:37 ` Christoph Lameter
2007-06-27 22:01 ` Andi Kleen
2007-06-27 23:36 ` Lee Schermerhorn
0 siblings, 2 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-06-27 21:37 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Paul E. McKenney, linux-mm, akpm, nacc, ak
On Wed, 27 Jun 2007, Lee Schermerhorn wrote:
> Well, I DO need to ask Dr. RCU [Paul McK.] to take a look at the patch,
> but this is how I understand RCU to work...
RCU is not in doubt here.
> > Just by looking at the description: It
> > cannot work. Any allocator use of a memory policy must use rcu locks
> > otherwise the memory policy can vanish from under us while allocating a
> > page.
>
> The only place we need to worry about is "get_file_policy()"--that
> is the only place one can attempt to lookup a shared policy w/o holding
> the [user virtual] address space locked [mmap_sem] which pins the shared
> mapping of the file, so the i_mmap_writable count can't go to zero, so
> we can't attempt to free the policy. And even then, it's only an issue
> for file descriptor accessed page cache allocs. Lookups called from the
> fault path do have the user vas locked during the fault, so the policy
> can't go away. But, because __page_cache_alloc() calls
> get_file_policy() to lookup the policy at the faulting page offset, it
> uses RCU on the read side, anyway. I should probably write up the
> entire locking picture for this, huh?
The zonelist from MPOL_BIND is passed to __alloc_pages. As a result the
RCU lock must be held over the call into the page allocator with reclaim
etc etc. Note that the zonelist is part of the policy structure.
> > If we can make this work then RCU should be used for all policies so that
> > we can get rid of the requirement that policies can only be modified from
> > the task context that created it.
>
> Yeah, I think that's possible...
Great if you can make that work.
I just looked at the shmem implementation. Without RCU you must increment
a refcount in the policy structure. That is done on every single
allocation, which will create yet another bouncing cacheline if you
do concurrent allocations from the same shmem segment. Performance did not
seem to have been such a concern for shmem policies, since this was a
one-off. Again, this is a hack that you are trying to generalize. There is
trouble all over the place if you do that.
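See mpol_shared_policy_lookup(), abridged from mm/mempolicy.c:

	spin_lock(&sp->lock);
	sn = sp_lookup(sp, idx, idx + 1);
	if (sn) {
		mpol_get(sn->policy);	/* refcnt bounce on every allocation */
		pol = sn->policy;
	}
	spin_unlock(&sp->lock);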
I think one prerequisite to memory policy uses like this is to work out how a
memory policy can be handled by the page allocator in such a way that
1. The use is lightweight and does not impact performance.
2. The policy that is passed to the allocators is context independent.
I.e. it needs to be independent of the cpuset context and the process
context. That would allow one, e.g., to store a policy and then apply it to
readahead. AFAIK this means that the policy struct needs to contain
the memory policy plus the cpuset and the current node.
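Concretely, something like this [illustrative only--no such structure
exists today]:

	struct alloc_context {
		struct mempolicy *mpol;	/* the memory policy proper */
		struct cpuset *cs;	/* cpuset constraint, captured at save time */
		int node;		/* "local" node for interleave/preferred */
	};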
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 21:37 ` Christoph Lameter
@ 2007-06-27 22:01 ` Andi Kleen
2007-06-27 22:08 ` Christoph Lameter
` (2 more replies)
2007-06-27 23:36 ` Lee Schermerhorn
1 sibling, 3 replies; 48+ messages in thread
From: Andi Kleen @ 2007-06-27 22:01 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, Paul E. McKenney, linux-mm, akpm, nacc
> The zonelist from MPOL_BIND is passed to __alloc_pages. As a result the
> RCU lock must be held over the call into the page allocator with reclaim
> etc etc. Note that the zonelist is part of the policy structure.
Yes I realized this at some point too. RCU doesn't work here because
__alloc_pages can sleep. Have to use the reference counts even though
it adds atomic operations.
> I think one prerequisite to memory policy uses like this is to work out how a
> memory policy can be handled by the page allocator in such a way that
>
> 1. The use is lightweight and does not impact performance.
The current mempolicies are all lightweight and zero cost in the main
allocator path.
The only outlier is still cpusets which does strange stuff, but you
can't blame mempolicies for that.
-Andi
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 22:01 ` Andi Kleen
@ 2007-06-27 22:08 ` Christoph Lameter
2007-06-27 23:46 ` Paul E. McKenney
2007-06-28 13:42 ` Lee Schermerhorn
2 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-06-27 22:08 UTC (permalink / raw)
To: Andi Kleen; +Cc: Lee Schermerhorn, Paul E. McKenney, linux-mm, akpm, nacc
On Thu, 28 Jun 2007, Andi Kleen wrote:
> > I think one prerequisite to memory policy uses like this is to work out how a
> > memory policy can be handled by the page allocator in such a way that
> >
> > 1. The use is lightweight and does not impact performance.
>
> The current mempolicies are all lightweight and zero cost in the main
> allocator path.
Right, but with incrementing the policy refcount on each allocation we are
no longer lightweight.
> The only outlier is still cpusets which does strange stuff, but you
> can't blame mempolicies for that.
What strange stuff does cpusets do? It would be good if further work could
integrate all allocation constraints / special behavior of
containers/cpusets/memory policies, etc.
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 21:37 ` Christoph Lameter
2007-06-27 22:01 ` Andi Kleen
@ 2007-06-27 23:36 ` Lee Schermerhorn
2007-06-29 1:41 ` Christoph Lameter
1 sibling, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-27 23:36 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Paul E. McKenney, linux-mm, akpm, nacc, ak
On Wed, 2007-06-27 at 14:37 -0700, Christoph Lameter wrote:
> On Wed, 27 Jun 2007, Lee Schermerhorn wrote:
>
> > Well, I DO need to ask Dr. RCU [Paul McK.] to take a look at the patch,
> > but this is how I understand RCU to work...
>
> RCU is not in doubt here.
>
> > > Just by looking at the description: It
> > > cannot work. Any allocator use of a memory policy must use rcu locks
> > > otherwise the memory policy can vanish from under us while allocating a
> > > page.
> >
> > The only place we need to worry about is "get_file_policy()"--that
> > is the only place one can attempt to lookup a shared policy w/o holding
> > the [user virtual] address space locked [mmap_sem] which pins the shared
> > mapping of the file, so the i_mmap_writable count can't go to zero, so
> > we can't attempt to free the policy. And even then, it's only an issue
> > for file descriptor accessed page cache allocs. Lookups called from the
> > fault path do have the user vas locked during the fault, so the policy
> > can't go away. But, because __page_cache_alloc() calls
> > get_file_policy() to lookup the policy at the faulting page offset, it
> > uses RCU on the read side, anyway. I should probably write up the
> > entire locking picture for this, huh?
>
> The zonelist from MPOL_BIND is passed to __alloc_pages. As a result the
> RCU lock must be held over the call into the page allocator with reclaim
> etc etc. Note that the zonelist is part of the policy structure.
OK, I see your issue now. Policies that are looked up in a shared
policy are automatically reference counted on lookup. But, as I've seen
discussed in the other policy reference counting thread, I'm not
decrementing the count. I think this will be easy to add into my
factored "alloc_page_pol"--the mpol_free(), that is. However, it will
require that we actually take a reference on all the other policies when
we acquire them for allocation, so that we can free the reference when
the allocation completes. Something you'd like to avoid, but I don't
see how we can for non-atomic allocations. Might be able to special
case the system default policy and not reference count that, as it can
never go away--for now, anyway...
>
> > > If we can make this work then RCU should be used for all policies so that
> > > we can get rid of the requirement that policies can only be modified from
> > > the task context that created it.
> >
> > Yeah, I think that's possible...
>
> Great if you can make that work.
I was only considering the replacement of the pointer. The indefinite
sleep in the allocation is a killer, tho'.
>
> I just looked at the shmem implementation. Without RCU you must increment
> a refcount in the policy structure. That is done on every single
> allocation, which will create yet another bouncing cacheline if you
> do concurrent allocations from the same shmem segment. Performance did not
> seem to have been such a concern for shmem policies, since this was a
> one-off. Again, this is a hack that you are trying to generalize. There is
> trouble all over the place if you do that.
As I mentioned, the increment is already there and always was. Just no
decrement.
And I don't think that reference counting a shared object is a hack.
It's standard procedure. If it weren't for the possibility of sleeping
indefinitely in allocation/reclaim [and reclaim delays are REALLY
indefinite!], you could use a deferred free, like RCU. But, the only
time you know that the allocation is finished is when you return from
the alloc call, so you need to release the reference there.
As far as bouncing cache lines during an allocation: for shared object
policy, either this [bouncing] dies out when all pages of the object are
finally allocated--i.e., it's start-up overhead, or we're constantly
recycling pages because they don't all fit in memory. In the latter
case, the cache line bounce will be small compared to the reclaim and
rereading of the page from the file system or swap [shmem case].
Again, we may be able to special case the system default policy, and
task policy is private to a task/thread, so I don't think that's too
much of a problem, right?
>
> I think one prerequisite to memory policy uses like this is to work out how a
> memory policy can be handled by the page allocator in such a way that
>
> 1. The use is lightweight and does not impact performance.
I agree that use of memory policies should not cause a net decrease in
performance. However, nothing is for free. It's a tradeoff. If you
don't need policies or if they hurt worse than they help, don't use
them. No performance impact. If locality matters and policies help
more than they cost, use them.
>
> 2. The policy that is passed to the allocators is context independent.
> I.e. it needs to be independent of the cpuset context and the process
> context. That would allow one, e.g., to store a policy and then apply it to
> readahead. AFAIK this means that the policy struct needs to contain
> the memory policy plus the cpuset and the current node.
Maybe. Or maybe something different. Laudable goals, anyway. Let's
discuss in the NUMA BOF.
Lee
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 22:01 ` Andi Kleen
2007-06-27 22:08 ` Christoph Lameter
@ 2007-06-27 23:46 ` Paul E. McKenney
2007-06-28 0:14 ` Andi Kleen
2007-06-29 21:47 ` Lee Schermerhorn
2007-06-28 13:42 ` Lee Schermerhorn
2 siblings, 2 replies; 48+ messages in thread
From: Paul E. McKenney @ 2007-06-27 23:46 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, akpm, nacc
On Thu, Jun 28, 2007 at 12:01:16AM +0200, Andi Kleen wrote:
>
> > The zonelist from MPOL_BIND is passed to __alloc_pages. As a result the
> > RCU lock must be held over the call into the page allocator with reclaim
> > etc etc. Note that the zonelist is part of the policy structure.
>
> Yes I realized this at some point too. RCU doesn't work here because
> __alloc_pages can sleep. Have to use the reference counts even though
> it adds atomic operations.
Any reason SRCU wouldn't work here? From a quick glance at the patch,
it seems possible to me.
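Unlike plain RCU, SRCU readers may sleep, so the read-side critical
section could span the allocation. Roughly [a sketch; the srcu_struct
would need init_srcu_struct() at boot]:

	static struct srcu_struct mpol_srcu;

	int idx = srcu_read_lock(&mpol_srcu);
	/* ... look up the shared policy, call __alloc_pages() -- may sleep ... */
	srcu_read_unlock(&mpol_srcu, idx);

	/* writer: synchronize_srcu(&mpol_srcu) before freeing the policy */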
Thanx, Paul
> > I think one prerequisite to memory policy uses like this is to work out how a
> > memory policy can be handled by the page allocator in such a way that
> >
> > 1. The use is lightweight and does not impact performance.
>
> The current mempolicies are all lightweight and zero cost in the main
> allocator path.
>
> The only outlier is still cpusets which does strange stuff, but you
> can't blame mempolicies for that.
>
> -Andi
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 23:46 ` Paul E. McKenney
@ 2007-06-28 0:14 ` Andi Kleen
2007-06-29 21:47 ` Lee Schermerhorn
1 sibling, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2007-06-28 0:14 UTC (permalink / raw)
To: paulmck; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, akpm, nacc
On Thursday 28 June 2007 01:46:34 Paul E. McKenney wrote:
> On Thu, Jun 28, 2007 at 12:01:16AM +0200, Andi Kleen wrote:
> >
> > > The zonelist from MPOL_BIND is passed to __alloc_pages. As a result the
> > > RCU lock must be held over the call into the page allocator with reclaim
> > > etc etc. Note that the zonelist is part of the policy structure.
> >
> > Yes I realized this at some point too. RCU doesn't work here because
> > __alloc_pages can sleep. Have to use the reference counts even though
> > it adds atomic operations.
>
> Any reason SRCU wouldn't work here? From a quick glance at the patch,
> it seems possible to me.
We have reference counts anyway that can be used, so it's not needed.
-Andi
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 22:01 ` Andi Kleen
2007-06-27 22:08 ` Christoph Lameter
2007-06-27 23:46 ` Paul E. McKenney
@ 2007-06-28 13:42 ` Lee Schermerhorn
2007-06-28 22:02 ` Andi Kleen
2007-06-29 1:39 ` Christoph Lameter
2 siblings, 2 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-28 13:42 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Paul E. McKenney, linux-mm, akpm, nacc
On Thu, 2007-06-28 at 00:01 +0200, Andi Kleen wrote:
> > The zonelist from MPOL_BIND is passed to __alloc_pages. As a result the
> > RCU lock must be held over the call into the page allocator with reclaim
> > etc etc. Note that the zonelist is part of the policy structure.
>
> Yes I realized this at some point too. RCU doesn't work here because
> __alloc_pages can sleep. Have to use the reference counts even though
> it adds atomic operations.
>
> > I think one prerequisite to memory policy uses like this is to work out how a
> > memory policy can be handled by the page allocator in such a way that
> >
> > 1. The use is lightweight and does not impact performance.
>
> The current mempolicies are all lightweight and zero cost in the main
> allocator path.
>
> The only outlier is still cpusets which does strange stuff, but you
> can't blame mempolicies for that.
Andi, Christoph:
Here is a proposed approach for reference counting based on my factoring
of alloc_page_vma() into get_vma_policy() and alloc_page_pol(). I've
created a patch that would slot into my shared policy series after
#6--the factoring mentioned above. I've tried to avoid taking a
reference count in the common cases of default system policy and the
current task's mempolicy. I think it's safe and, I hope, less costly to
do the tests and avoid the ref than to go ahead and acquire the cache
line for write.
I'm not sure that the check for current task's policy is necessary in
get_file_policy() because it is always called in the context of the
current task--i.e., task == current. I'm not even sure that we need the
task argument to get_file_policy. I included it to match the call
to get_vma_policy(). Could [should?] probably be removed.
Note, I've updated my series slightly since last post, to avoid using a
pseudo-vma in shmem_alloc_page(); adding some additional
documentation, ... You'll see this in the patch below. I can send the
revised patch #6 if you like. I still need to think about
shmem_swapin_async() and read_swap_cache_async(). With some work, I
think I can avoid the pseudo vma there as well. Later, tho'.
Here's the patch--untested. Thoughts on this approach?
Shared Mapped File Policy "6.1/11" fix policy reference counts
This patch acquires a reference count on vma policy and on task policy
when acquired from a task with a different policy--e.g., from
show_numa_map()--and frees that reference after allocating a page or
after converting the policy to a displayable string in show_numa_map().
Avoid taking the reference count on the system default policy or the
current task's task policy. Note that if show_numa_map() is called from
the context of a relative of the target task with the same task mempolicy,
we won't take an extra reference either. This is safe, because the policy
remains referenced by the calling task during the mpol_to_str() processing.
Call __mpol_free() [a.k.a. the "slow path"] directly from alloc_page_pol()
and show_numa_map(), where we know we have non-NULL policy, if policy is
not the system default policy and not the current task's policy. By
calling __mpol_free() directly in these two places, we avoid the extra
check for null policy in mpol_free() [admittedly a "cheap check"].
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
 mm/mempolicy.c |   35 ++++++++++++++++++++++++++++-------
 mm/shmem.c     |    1 -
 2 files changed, 28 insertions(+), 8 deletions(-)

Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-06-28 09:14:39.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-06-28 09:34:20.000000000 -0400
@@ -1124,25 +1124,32 @@ asmlinkage long compat_sys_mbind(compat_
  * @task - fall back to this task's policy if no vma policy at @addr
  * @vma - vma struct containing @addr and possible policy
  * @addr - virtual address in @vma for which to get policy
+ * Note: policy returned with an extra reference if the VMA has a non-NULL,
+ *	non-DEFAULT policy or the policy is the task policy for a task other
+ *	than "current".
  */
 static struct mempolicy * get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol = task->mempolicy;
+	int shared_pol = 0;
 
 	if (vma) {
 		/*
 		 * use get_policy op, if any, for shared mappings
 		 */
 		if ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED &&
-				vma->vm_ops && vma->vm_ops->get_policy)
+				vma->vm_ops && vma->vm_ops->get_policy) {
 			pol = vma->vm_ops->get_policy(vma, addr);
-		else if (vma->vm_policy &&
+			shared_pol = 1;	/* if non-NULL, that is */
+		} else if (vma->vm_policy &&
 				vma->vm_policy->policy != MPOL_DEFAULT)
 			pol = vma->vm_policy;
 	}
 	if (!pol)
 		pol = &default_policy;
+	else if (!shared_pol && pol != current->mempolicy)
+		mpol_get(pol);
 
 	return pol;
 }
@@ -1158,11 +1165,17 @@ struct mempolicy *get_file_policy(struct
 {
 	struct shared_policy *sp = x->spolicy;
 	struct mempolicy *pol = task->mempolicy;
+	int shared_pol = 0;
 
-	if (sp)
+	if (sp) {
 		pol = mpol_shared_policy_lookup(sp, pgoff);
+		shared_pol = 1;	/* if non-NULL, that is */
+	}
+
 	if (!pol)
 		pol = &default_policy;
+	else if (!shared_pol && pol != current->mempolicy)
+		mpol_get(pol);
 
 	return pol;
 }
@@ -1290,18 +1303,23 @@ static struct page *alloc_page_interleav
 
 /*
  * alloc_page_pol() -- allocate a page based on policy,offset.
  * @gfp - gfp mask [flags + zone] for allocation
- * @pol - policy to use for allocation
+ * @pol - policy to use for allocation; must mpol_free()
  * @pgoff - page offset for interleaving -- used only if interleave policy
  */
 struct page *alloc_page_pol(gfp_t gfp, struct mempolicy *pol, pgoff_t pgoff)
 {
+	struct page *page;
+
 	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
 		unsigned nid;
 
 		nid = offset_il_node(pol, pgoff);
 		return alloc_page_interleave(gfp, 0, nid);
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	page = __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	if (pol != &default_policy && pol != current->mempolicy)
+		__mpol_free(pol);
+	return page;
 }
 EXPORT_SYMBOL(alloc_page_pol);
@@ -2018,6 +2036,7 @@ int show_numa_map(struct seq_file *m, vo
 	struct numa_maps *md;
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
+	struct mempolicy *pol;
 	int n;
 	char buffer[50];
 
@@ -2028,8 +2047,10 @@ int show_numa_map(struct seq_file *m, vo
 	if (!md)
 		return 0;
 
-	mpol_to_str(buffer, sizeof(buffer),
-			get_vma_policy(priv->task, vma, priv->saddr));
+	pol = get_vma_policy(priv->task, vma, priv->saddr);
+	mpol_to_str(buffer, sizeof(buffer), pol);
+	if (pol != &default_policy && pol != current->mempolicy)
+		__mpol_free(pol);
 
 	seq_printf(m, "%08lx %s", priv->saddr, buffer);
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c	2007-06-28 09:14:57.000000000 -0400
+++ Linux/mm/shmem.c	2007-06-28 09:27:33.000000000 -0400
@@ -1005,7 +1005,6 @@ shmem_alloc_page(gfp_t gfp, struct share
 
 	pol = mpol_shared_policy_lookup(sp, idx);
 	page = alloc_page_pol(gfp | __GFP_ZERO, pol, idx);
-	mpol_free(pol);
 	return page;
 }
 #else
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-28 13:42 ` Lee Schermerhorn
@ 2007-06-28 22:02 ` Andi Kleen
2007-06-29 17:14 ` Lee Schermerhorn
2007-06-29 1:39 ` Christoph Lameter
1 sibling, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2007-06-28 22:02 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Christoph Lameter, Paul E. McKenney, linux-mm, akpm, nacc
> -	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> +	page = __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> +	if (pol != &default_policy && pol != current->mempolicy)
> +		__mpol_free(pol);
That destroyed the tail call in the fast path. I would prefer if it
was preserved at least for the default_policy case. This means handling
this in a separate if path.
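Roughly this shape, assuming the alloc_page_pol() factoring from this
series (a sketch only, not the final patch):

	struct zonelist *zl = zonelist_policy(gfp, pol);

	if (pol != &default_policy && pol != current->mempolicy) {
		/* slow path: ref counted policy -- shared or vma */
		struct page *page = __alloc_pages(gfp, 0, zl);
		__mpol_free(pol);
		return page;
	}
	/* fast path: tail call preserved for default and task policy */
	return __alloc_pages(gfp, 0, zl);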
Other than that it looks reasonable and we probably want something
like this for .22.
-Andi
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-28 13:42 ` Lee Schermerhorn
2007-06-28 22:02 ` Andi Kleen
@ 2007-06-29 1:39 ` Christoph Lameter
2007-06-29 9:01 ` Andi Kleen
2007-06-29 13:22 ` Lee Schermerhorn
1 sibling, 2 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-06-29 1:39 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Paul E. McKenney, linux-mm, akpm, nacc
On Thu, 28 Jun 2007, Lee Schermerhorn wrote:
> Avoid taking the reference count on the system default policy or the
> current task's task policy. Note that if show_numa_map() is called from
> the context of a relative of the target task with the same task mempolicy,
> we won't take an extra reference either. This is safe, because the policy
> remains referenced by the calling task during the mpol_to_str() processing.
I still do not see the rationale for this patchset. This adds more special
casing. So if we have a vma policy then we suck again?
This all still falls under the category of messing up a bad situation even
more. It's first necessary to come up with a way to consistently handle
memory policies and improve the interaction with other methods to
constrain allocations (cpusets, node restrictions for hugetlb etc etc). It
should improve the situation and not increase special casing or make the
system more unpredictable.
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 23:36 ` Lee Schermerhorn
@ 2007-06-29 1:41 ` Christoph Lameter
2007-06-29 13:30 ` Lee Schermerhorn
0 siblings, 1 reply; 48+ messages in thread
From: Christoph Lameter @ 2007-06-29 1:41 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Paul E. McKenney, linux-mm, akpm, nacc, ak
On Wed, 27 Jun 2007, Lee Schermerhorn wrote:
> > 1. The use is lightweight and does not impact performance.
>
> I agree that use of memory policies should not have a net decrease in
> performance. However, nothing is for free. It's a tradeoff. If you
> don't need policies or if they hurt worse than they help, don't use
> them. No performance impact. If locality matters and policies help
> more than they cost, use them.
Well, the current situation seems better AFAIK. Why trade off
anything for less performance and more inconsistencies?
> Maybe. Or maybe something different. Laudable goals, anyway. Let's
> discuss in the NUMA BOF.
Would be good. I keep failing to see the point of all of this.
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 1:39 ` Christoph Lameter
@ 2007-06-29 9:01 ` Andi Kleen
2007-06-29 14:05 ` Christoph Lameter
2007-06-29 13:22 ` Lee Schermerhorn
1 sibling, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2007-06-29 9:01 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, Paul E. McKenney, linux-mm, akpm, nacc
On Friday 29 June 2007 03:39:52 Christoph Lameter wrote:
> On Thu, 28 Jun 2007, Lee Schermerhorn wrote:
>
> > Avoid taking the reference count on the system default policy or the
> > current task's task policy. Note that if show_numa_map() is called from
> > the context of a relative of the target task with the same task mempolicy,
> > we won't take an extra reference either. This is safe, because the policy
> > remains referenced by the calling task during the mpol_to_str() processing.
>
> I still do not see the rationale for this patchset. This adds more special
> casing.
The reference count change at least is a good idea.
> So if we have a vma policy then we suck again?
An additional reference count inc/dec is not exactly "suck". We try to
avoid it because it's a little slow on some obsolete CPUs we support, but
even on those it is not that bad and will probably only show up
in extreme microbenchmarking. Still, it's normally good to avoid
making the default path slower.
>
> This all still falls under the category of messing up a bad situation even
> more.
I think you're exaggerating.
-Andi
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 1:39 ` Christoph Lameter
2007-06-29 9:01 ` Andi Kleen
@ 2007-06-29 13:22 ` Lee Schermerhorn
2007-06-29 14:18 ` Christoph Lameter
1 sibling, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 13:22 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, Paul E. McKenney, linux-mm, akpm, nacc
On Thu, 2007-06-28 at 18:39 -0700, Christoph Lameter wrote:
> On Thu, 28 Jun 2007, Lee Schermerhorn wrote:
>
> > Avoid taking the reference count on the system default policy or the
> > current task's task policy. Note that if show_numa_map() is called from
> > the context of a relative of the target task with the same task mempolicy,
> > we won't take an extra reference either. This is safe, because the policy
> > remains referenced by the calling task during the mpol_to_str() processing.
>
> I still do not see the rationale for this patchset. This adds more special
> casing. So if we have a vma policy then we suck again?
I'm not sure what you mean by "rationale for this patchset" in the
context of this reference counting patch. We've already gone over the
rationale for shared policy on shared file mappings--over and over...
Here, I'm just trying to show how we could handle the reference counting
problem in the context of my patch series, where I've factored out the
"allocate a page given a policy and offset [for interleaving]" path, and
it's convenient to release the reference there. I'm trying to help ;-).
Will referencing a vma policy suck? Maybe for a while, on a
multi-threaded program with a huge anon area with vma policy and
multiple tasks/threads all faulting different pages at the same time.
But, this activity HAS to die out when the entire region has been
faulted in--UNLESS the entire region won't fit in memory. Then, you'll
be swapping your brains out and the reference count on the policy will
be the least of your problems performance-wise.
>
> This all still falls under the category of messing up a bad situation even
> more. Its first necessary to come up with way to consistently handle
> memory policies and improve the interaction with other methods to
> constrain allocations (cpusets, node restrictions for hugetlb etc etc). It
> should improve the situation and not increase special casing or make the
> system more unpreditable.
I understand that this is your opinion. It does seem orthogonal to
correctly reference counting shared data structures, tho'. In that
respect, I think it is "improving the situation". As far as special
casing: I was just trying to minimize the effect of reference counting
on the common case of system default and task policy.
Lee
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 1:41 ` Christoph Lameter
@ 2007-06-29 13:30 ` Lee Schermerhorn
2007-06-29 14:20 ` Andi Kleen
0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 13:30 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Paul E. McKenney, linux-mm, akpm, nacc, ak
On Thu, 2007-06-28 at 18:41 -0700, Christoph Lameter wrote:
> On Wed, 27 Jun 2007, Lee Schermerhorn wrote:
>
> > > 1. The use is lightweight and does not impact performance.
> >
> > I agree that use of memory policies should not have a net decrease in
> > performance. However, nothing is for free. It's a tradeoff. If you
> > don't need policies or if they hurt worse than they help, don't use
> > them. No performance impact. If locality matters and policies help
> > more than they cost, use them.
>
> Well, the current situation seems better AFAIK. Why trade off
> anything for less performance and more inconsistencies?
Firstly, the "current situation" is deficient for applications that I,
on behalf of our customers, care about.
Secondly, I disagree with the "more inconsistencies" bit, as we've
discussed.
Finally, as far as trading off performance, we're still at the
theoretical stage here. I don't recall that you've ever tried my
patches on one of your problematic workloads to show that it has any
negative impact. I don't see any in my tests, but I don't have access
to systems of the size that you do.
>
> > Maybe. Or maybe something different. Laudable goals, anyway. Let's
> > discuss in the NUMA BOF.
>
> Would be good. I keep failing to see the point of all of this.
Apparently so... :-(
Lee
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 9:01 ` Andi Kleen
@ 2007-06-29 14:05 ` Christoph Lameter
2007-06-29 17:41 ` Lee Schermerhorn
0 siblings, 1 reply; 48+ messages in thread
From: Christoph Lameter @ 2007-06-29 14:05 UTC (permalink / raw)
To: Andi Kleen; +Cc: Lee Schermerhorn, Paul E. McKenney, linux-mm, akpm, nacc
On Fri, 29 Jun 2007, Andi Kleen wrote:
> > I still do not see the rationale for this patchset. This adds more special
> > casing.
>
> The reference count change at least is a good idea.
All right, let's split that out and look at it?
> > This all still falls under the category of messing up a bad situation even
> > more.
>
> I think you're exaggerating.
We are creating more weird interactions between processes. The Unix
model is to isolate processes from each other. As a result of
this patch, modifications to allocation policy in the address space of one
process can suddenly show up in surprising ways in another.
This in itself is bad enough. It gets worse, since there does not seem to
be a way to prohibit this.
The NUMA policy layer is already difficult enough to comprehend for the
end user. We already have people not using it because it is too difficult
to understand. This will kick it into even weirder areas.
One bad case is that two processes run in different cpusets. Let's say
process A is running in cpuset X on nodes 1 and 2. Process B is running in
Y on nodes 3 and 4. Both mmap() FILE into distinct address ranges.
Now process A sets a MPOL_BIND policy for FILE to only come from node 1.
If process B now follows that policy then process B will allocate outside
of the cpuset it is contained in. B will likely OOM since it is not
allowed to access node 1. This is rather surprising for the person trying
to run the processes in Y because he is unaware of what happens in X. He
will likely be unable to debug the situation. I will likely see a flood of
bug reports if this goes in.
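For concreteness, a sketch of process A's side -- the file path and
length here are invented:

	#include <fcntl.h>
	#include <numaif.h>		/* mbind(), MPOL_BIND */
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1UL << 20;			/* invented length */
		int fd = open("/shared/FILE", O_RDWR);	/* invented path */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		unsigned long nodemask = 1UL << 1;	/* node 1 only */

		/* Under the proposed patches this BIND attaches to the
		 * file's address_space, not just to A's own vma, so B's
		 * later faults on FILE are steered to node 1 as well. */
		mbind(p, len, MPOL_BIND, &nodemask,
		      sizeof(nodemask) * 8, 0);
		return 0;
	}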
This could be fixed by storing the complete context information with the
file in memory. If the policy contained the cpuset, then process B
could just be allowed to allocate in cpuset X despite being part of Y.
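Purely as a sketch -- the extra field is invented for illustration:

	/*
	 * Hypothetical: remember the installer's allocation context so
	 * that tasks in other cpusets may legitimately allocate there.
	 * The install_cs field does not exist; it is this sketch's
	 * stand-in for "complete context information".
	 */
	struct shared_policy {
		struct rb_root	root;
		spinlock_t	lock;
		struct cpuset	*install_cs;	/* installer's cpuset */
	};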
All of this points to significant conceptual breakage if we do this.
There has to be some major semantic change to the memory policy layer in
order to make shared policies work. But then I do not have any problem
reports that require shared policies. The issues that I know about are
complaints that the vma policies of a process are not applied to page
cache pages like they are for anonymous pages. They are asking for a fix
for this and not for shared policies.
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 13:22 ` Lee Schermerhorn
@ 2007-06-29 14:18 ` Christoph Lameter
0 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-06-29 14:18 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Paul E. McKenney, linux-mm, akpm, nacc
On Fri, 29 Jun 2007, Lee Schermerhorn wrote:
> I'm not sure what you mean by "rationale for this patchset" in the
> context of this reference counting patch. We've already gone over the
> rationale for shared policy on shared file mappings--over and over...
Yes, and it's still not clear to me what the point is. I think sharing
policies that have so far had per-process semantics will break things and
cause a lot of difficulty in the future.
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 13:30 ` Lee Schermerhorn
@ 2007-06-29 14:20 ` Andi Kleen
2007-06-29 21:40 ` Lee Schermerhorn
0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2007-06-29 14:20 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Christoph Lameter, Paul E. McKenney, linux-mm, akpm, nacc
On Friday 29 June 2007 15:30:36 Lee Schermerhorn wrote:
> Firstly, the "current situation" is deficient for applications that I,
> on behalf of our customers, care about.
So what's the specific use case from these applications? How much do
they lose by not having this?
-Andi
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-28 22:02 ` Andi Kleen
@ 2007-06-29 17:14 ` Lee Schermerhorn
2007-06-29 17:42 ` Andi Kleen
0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 17:14 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Paul E. McKenney, linux-mm, akpm, nacc
On Fri, 2007-06-29 at 00:02 +0200, Andi Kleen wrote:
>
> > - return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> > + page = __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> > + if (pol != &default_policy && pol != current->mempolicy)
> > + __mpol_free(pol);
>
> That destroyed the tail call in the fast path. I would prefer if it
> was preserved at least for the default_policy case. This means handling
> this in a separate if path.
Andi: I could restore the tail call for the common cases of system
default and task policy, but that would require a second call to
__alloc_pages(), I think, for the shared and vma policies. What do you
think about that solution?
>
> Other than that it looks reasonable and we probably want something
> like this for .22.
As Christoph notes, this will have to be extracted from my series. I think
that only get_vma_policy() and alloc_page_vma() need to change for now.
I won't get a chance to test anything until the 2nd week in July and
that might be too late for .22.
Lee
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 14:05 ` Christoph Lameter
@ 2007-06-29 17:41 ` Lee Schermerhorn
2007-06-29 20:15 ` Christoph Lameter
0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 17:41 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, Paul E. McKenney, linux-mm, akpm, nacc
On Fri, 2007-06-29 at 07:05 -0700, Christoph Lameter wrote:
> On Fri, 29 Jun 2007, Andi Kleen wrote:
>
> > > I still do not see the rationale for this patchset. This adds more special
> > > casing.
> >
> > The reference count change at least is a good idea.
>
> Allright lets split that out and look at it?
I think that my patch to get_vma_policy() is still applicable. My
changes to alloc_page_pol() [which only exists in my series] need to be
reworked to restore the tail call in the common cases in
alloc_page_vma(). Should be easy, if 2 separate calls to
__alloc_pages() are acceptable. As I mentioned to Andi, I won't get to
test this for a couple of weeks.
>
> > > This all still falls under the category of messing up a bad situation even
> > > more.
> >
> > I think you're exaggerating.
>
> We are creating more weird interactions between processes. The Unix
> model is to isolate processes from each other. As a result of
> this patch, modifications to allocation policy in the address space of one
> process can suddenly show up in surprising ways in another.
Sometimes we WANT interactions between cooperating processes. That's
why we have shared memory and shared file mappings. Yes, they need to
be managed/controlled. The shared policy shouldn't show up in
surprising ways in the address space of a process that is sharing the
file with the one that installed the policy. They should be cooperating
tasks. I agree that task executables and shared libraries are
different... See below
>
> This in itself is bad enough. It gets worse, since there does not seem to
> be a way to prohibit this.
As I've said before, we can DO that, if you think it's needed. E.g., we
can require write access to a file in order to install a shared policy.
Probably a good idea anyway. Processes that have write access to a
shared, mmap()ed file BETTER be cooperating.
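A sketch of such a gate somewhere in the mbind() path -- the exact hook
point is my guess here:

	/*
	 * Hypothetical check: refuse to install a shared policy unless
	 * the caller mapped the file shared with write permission.
	 */
	if (vma->vm_file && (vma->vm_flags & VM_SHARED) &&
	    !(vma->vm_flags & VM_MAYWRITE))
		return -EACCES;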
If you're worried about Joe User applying miscellaneous shared policies
to his task's executable or, say, libc, we could prevent this in a
couple of ways:
1) only enable shared policies on mapped files in cpusets running
applications that need this feature.
2) by requiring write access. However, it might be useful to have a
privileged task mmap() libc shared/writable and apply a specific policy.
It certainly would on some HP platforms where we have the hardware
interleaved node. [Altho' in some cases, replication might be
preferable.]
>
> The NUMA policy layer is already difficult enough to comprehend for the
> end user. We already have people not using it because it is too difficult
> to understand. This will kick it into even weirder areas.
It's difficult for an end user to comprehend why we ignore policy on
shared file mappings [without even giving an error return, I might
add]. It's these inconsistencies in the APIs from the application
viewpoint that I'm trying to address.
>
> One bad case is that two processes run in different cpusets. Let's say
> process A is running in cpuset X on nodes 1 and 2. Process B is running in
> Y on nodes 3 and 4. Both mmap() FILE into distinct address ranges.
>
> Now process A sets a MPOL_BIND policy for FILE to only come from node 1.
If they're cooperating and there is some good reason why the
applications are running in disjoint cpusets--e.g., the application was
written to be deployed this way--then the application shouldn't be
installing policies this way. Or we can just not enable the feature in
those cpusets. We can still allow well behaved multi-process
applications to have the feature in their own cpuset.
As above, we can control this for shared libs, etc.
>
> If process B now follows that policy then process B will allocate outside
> of the cpuset it is contained in. B will likely OOM since it is not
> allowed to access node 1. This is rather surprising for the person trying
> to run the processes in Y because he is unaware of what happens in X. He
> will likely be unable to debug the situation. I will likely see a flood of
> bug reports if this goes in.
Mapped shared file policy is off by default. Documentation explains the
implications of turning it on for applications that share mapped files
between cpusets. We need to do this anyway, for shmem. How many bug
reports have you seen from this scenario for shmem segments which behave
exactly the same?
>
> This could be fixed by storing the complete context information with the
> file in memory. If the policy contained the cpuset, then process B
> could just be allowed to allocate in cpuset X despite being part of Y.
Might work. Seems like overkill, but if there is a use case for
applications to share mapped files and shmem with non-default policy
between processes in disjoint cpusets/containers, then I guess we need
to do this anyway--for shmem.
>
> All of this points to significant conceptual breakage if we do this.
> There has to be some major semantic change to the memory policy layer in
> order to make shared policies work.
No, we need to reconcile the issue of sharing between disjoint cpusets.
I see this as augmenting the current semantics to specify either current
behavior or some new behavior that we agree on.
> But then I do not have any problem
> reports that require shared policies. The issues that I know about are
> complaints that the vma policies of a process are not applied to page
> cache pages like they are for anonymous pages. They are asking for a fix
> for this and not for shared policies.
Do the bug reports specify whether the mapping is for private or shared
mappings? VMA policies ARE applied to page cache pages of private
mappings if the process COWs the page. For shared mappings, if we used
a vma based policy, then different processes could have different
policies for the same page. Then, the location would depend on which
process touched the page first, thus surprising the application that
touched the page later, expecting it to be placed where ITS policy
specifies. Sounds like another bug report to me. ;-)
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 17:14 ` Lee Schermerhorn
@ 2007-06-29 17:42 ` Andi Kleen
2007-06-30 18:34 ` [PATCH/RFC] Fix Mempolicy Ref Counts - was " Lee Schermerhorn
0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2007-06-29 17:42 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Christoph Lameter, Paul E. McKenney, linux-mm, akpm, nacc
> Andi: I could restore the tail call for the common cases of system
> default and task policy, but that would require a second call to
> __alloc_pages(), I think, for the shared and vma policies. What do you
> think about that solution?
Fine
-Andi
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 17:41 ` Lee Schermerhorn
@ 2007-06-29 20:15 ` Christoph Lameter
0 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-06-29 20:15 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Paul E. McKenney, linux-mm, akpm, nacc
On Fri, 29 Jun 2007, Lee Schermerhorn wrote:
> As I've said before, we can DO that, if you think it's needed. E.g., we
> can require write access to a file in order to install a shared policy.
> Probably a good idea anyway. Processes that have write access to a
> shared, mmap()ed file BETTER be cooperating.
They currently do not do that.
> Mapped shared file policy is off by default. Documentation explains the
> implications of turning it on for applications that share mapped files
> between cpusets. We need to do this anyway, for shmem. How many bug
> reports have you seen from this scenario for shmem segments which behave
> exactly the same?
I think about 7 or so shmem-related reports were escalated to our team? We
put in the policy specification for shmem on the kernel command line for a
reason. You boot the kernel with the policy you want shmem to have and
tell all other people that may attempt to set a policy on shmem to stay
the *** away from it.
> Do the bug reports specify whether the mapping is for private or shared
> mappings? VMA policies ARE applied to page cache pages of private
> mappings if the process COWs the page. For shared mappings, if we used
Yes, but not if the process simply reads the page, and that is the simple
case that could be fixed by passing the policy to page_cache_alloc(),
without weird shared semantics on volatile memory objects.
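To sketch what I mean -- the signature is invented; alloc_page_pol() is
the factored allocator from Lee's series:

	/*
	 * Hypothetical variant: the faulting task passes the policy it
	 * looked up in its own vma, so read faults honor per-process
	 * policy without any shared state on the inode.
	 */
	static inline struct page *
	page_cache_alloc_pol(struct address_space *x, struct mempolicy *pol,
				pgoff_t pgoff)
	{
		return alloc_page_pol(mapping_gfp_mask(x), pol, pgoff);
	}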
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 14:20 ` Andi Kleen
@ 2007-06-29 21:40 ` Lee Schermerhorn
0 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 21:40 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Paul E. McKenney, linux-mm, akpm, nacc
On Fri, 2007-06-29 at 16:20 +0200, Andi Kleen wrote:
> On Friday 29 June 2007 15:30:36 Lee Schermerhorn wrote:
>
> > Firstly, the "current situation" is deficient for applications that I,
> > on behalf of our customers, care about.
>
> So what's the specific use case from these applications? How much do
> they lose by not having this?
Andi:
I had answered [attempted to anyway] the first question in:
http://marc.info/?l=linux-mm&m=118105384427674&w=4
What do they lose? The ability to control explicitly and reliably the
location of pages in shared, mapped files without prefaulting. The only
way an application has today to guarantee the location of a file page is
to: 1) have the file opened exclusively--i.e., be the only task with the
file opened, lest some other task access the file and fault in pages; 2)
set the task policy to bind/prefer/interleave to the appropriate
node[s]; 3) prefault some range of file pages in and lock them down; 4)
change the task policy for the next range and fault that in and lock it
down; etc, until the entire file is placed correctly--assuming the
entire file fits. If pages of the file are already in memory, they
don't even have the option of mass migration via mbind(). They'd have
to individually migrate pages, once that sys call wrapper is available.
I'd like to just mbind() the mmap()ed range, set the policy and then
know that, as pages fault in, they'll end up obeying the policy. If
some pages of the file are already memory resident, they can be migrated
to follow policy. It seems so simple to me. The fundamental support is
all there.
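To make the contrast concrete, a userspace sketch of both approaches --
the chunk size, page size, masks, and helper names are all invented:

	#include <numaif.h>
	#include <stddef.h>
	#include <sys/mman.h>

	#define CHUNK	(16UL << 20)	/* invented placement granularity */
	#define PAGE_SZ	4096UL

	/* Today's dance: per range, switch the task policy [a different
	 * nodemask per range, in general], prefault, and lock the pages
	 * where they landed.  Assumes the exclusive open above, else
	 * another task may fault pages in under the wrong policy. */
	static void place_by_prefault(volatile char *map, size_t len,
				      const unsigned long *masks)
	{
		size_t off, p, i = 0;

		for (off = 0; off < len; off += CHUNK, i++) {
			set_mempolicy(MPOL_BIND, &masks[i],
				      sizeof(masks[i]) * 8);
			for (p = off; p < off + CHUNK; p += PAGE_SZ)
				(void)map[p];		/* fault the page in */
			mlock((const void *)(map + off), CHUNK);
		}
	}

	/* What I'd like instead: one mbind() on the shared mapping; new
	 * faults obey it and MPOL_MF_MOVE migrates resident pages. */
	static void place_by_policy(void *map, size_t len,
				    const unsigned long *mask)
	{
		mbind(map, len, MPOL_BIND, mask, sizeof(*mask) * 8,
		      MPOL_MF_MOVE);
	}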
I agree we need to handle some of Christoph's issues so that his
customers can't get themselves confused, playing with shared policies,
or attempting to set policies on files [and shmem!] that don't work in
all the cpusets from which the files/shmem might be accessed. That's a
hairy problem that containers/cpusets introduce when you try to segment
a system along only 1 or 2 dimensions, leaving the rest of the
dimensions wide open for sharing...
Lee
* Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-27 23:46 ` Paul E. McKenney
2007-06-28 0:14 ` Andi Kleen
@ 2007-06-29 21:47 ` Lee Schermerhorn
1 sibling, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 21:47 UTC (permalink / raw)
To: paulmck; +Cc: Andi Kleen, Christoph Lameter, linux-mm, akpm, nacc
On Wed, 2007-06-27 at 16:46 -0700, Paul E. McKenney wrote:
> On Thu, Jun 28, 2007 at 12:01:16AM +0200, Andi Kleen wrote:
> >
> > > The zonelist from MPOL_BIND is passed to __alloc_pages. As a result the
> > > RCU lock must be held over the call into the page allocator with reclaim
> > > etc etc. Note that the zonelist is part of the policy structure.
> >
> > Yes I realized this at some point too. RCU doesn't work here because
> > __alloc_pages can sleep. Have to use the reference counts even though
> > it adds atomic operations.
>
> Any reason SRCU wouldn't work here? From a quick glance at the patch,
> it seems possible to me.
Does SRCU have a deferred version--i.e., a call_srcu()? I didn't see
one. I originally tried synchronize_rcu() in my patch, but hit a
"scheduling while atomic" bug, so I converted it to deferred reclaim.
For changing the task policy from outside the task--something that I
understand Christoph would like to do--we can use synchronize_srcu(), if
we can call it from outside any atomic context.
Or maybe synchronize_srcu() doesn't attempt to reschedule nor call
"might_sleep()"? [Sorry, haven't had time to look.]
Lee
* [PATCH/RFC] Fix Mempolicy Ref Counts - was Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-29 17:42 ` Andi Kleen
@ 2007-06-30 18:34 ` Lee Schermerhorn
2007-07-03 18:09 ` Christoph Lameter
0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-06-30 18:34 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Paul E. McKenney, linux-mm, akpm, nacc
On Fri, 2007-06-29 at 19:42 +0200, Andi Kleen wrote:
> > Andi: I could restore the tail call for the common cases of system
> > default and task policy, but that would require a second call to
> > __alloc_pages(), I think, for the shared and vma policies. What do you
> > think about that solution?
>
> Fine
Andi:
Here's a possible fix for the mempolicy reference counting. Uncompiled,
untested, probably bogus, against 2.6.22-rc4-mm2 [what I have on my
laptop; no time to refresh right now]. I attempted to extract the
earlier patch from my shared policy series and restored the tail call
for the "common cases". Probably
I think it's a bit late to get into .22, but I wanted to send this to
you in case you wanted to fix this up and try. I'll be mostly off-line
for the next week.
Regards,
Lee
----------
PATCH fix reference counting for memory policy
Against 2.6.22-rc4-mm2; UNTESTED
This patch proposes fixes to the reference counting of memory policy
in the page allocation path [alloc_page_vma()] and in show_numa_map().
Shared policy lookup [shmem] has always added a reference to the
policy, but this was never unrefed after page allocation or after
formatting the numa map data.
Default system policy should not require additional ref counting,
nor should the current task's task policy. However, show_numa_map()
calls get_vma_policy() to examine what may be [likely is] another
task's policy. The latter case needs protection against freeing
of the policy.
This patch adds a reference count to a mempolicy returned by
get_vma_policy() when the policy is a vma policy or another
task's mempolicy. [Again, shared policy is already reference
counted on lookup.] A matching "unref" [__mpol_free()] is performed
in alloc_page_vma() for shared and vma policies, and in
show_numa_map() for shared and another task's mempolicy.
We can call __mpol_free() directly, saving an admittedly
inexpensive inline NULL test, because we know we have a non-NULL
policy.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-30 12:56:51.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-30 13:49:12.000000000 -0400
@@ -1077,16 +1077,20 @@ static struct mempolicy * get_vma_policy
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;
+ int shared_pol = 0;
if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy)
+ if (vma->vm_ops && vma->vm_ops->get_policy) {
pol = vma->vm_ops->get_policy(vma, addr);
- else if (vma->vm_policy &&
+ shared_pol = 1; /* if non-NULL, that is */
+ } else if (vma->vm_policy &&
vma->vm_policy->policy != MPOL_DEFAULT)
pol = vma->vm_policy;
}
if (!pol)
pol = &default_policy;
+ else if (!shared_pol && pol != current->mempolicy)
+ mpol_get(pol); /* vma or other task's policy */
return pol;
}
@@ -1259,6 +1263,7 @@ struct page *
alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ struct zonelist *zl;
cpuset_update_task_memory_state();
@@ -1268,7 +1273,19 @@ alloc_page_vma(gfp_t gfp, struct vm_area
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
return alloc_page_interleave(gfp, 0, nid);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ zl = zonelist_policy(gfp, pol);
+ if (pol != &default_policy && pol != current->mempolicy) {
+ /*
+ * slow path: ref counted policy -- shared or vma
+ */
+ struct page *page = __alloc_pages(gfp, 0, zl);
+ __mpol_free(pol);
+ return page;
+ }
+ /*
+ * fast path: default or task policy
+ */
+ return __alloc_pages(gfp, 0, zl);
}
/**
@@ -1914,6 +1931,7 @@ int show_numa_map(struct seq_file *m, vo
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
+ struct mempolicy *pol;
int n;
char buffer[50];
@@ -1924,8 +1942,13 @@ int show_numa_map(struct seq_file *m, vo
if (!md)
return 0;
- mpol_to_str(buffer, sizeof(buffer),
- get_vma_policy(priv->task, vma, vma->vm_start));
+ pol = get_vma_policy(priv->task, vma, vma->vm_start);
+ mpol_to_str(buffer, sizeof(buffer), pol);
+ /*
+ * unref shared or other task's mempolicy
+ */
+ if (pol != &default_policy && pol != current->mempolicy)
+ __mpol_free(pol);
seq_printf(m, "%08lx %s", vma->vm_start, buffer);
* Re: [PATCH/RFC] Fix Mempolicy Ref Counts - was Re: [PATCH/RFC 0/11] Shared Policy Overview
2007-06-30 18:34 ` [PATCH/RFC] Fix Mempolicy Ref Counts - was " Lee Schermerhorn
@ 2007-07-03 18:09 ` Christoph Lameter
0 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-03 18:09 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Paul E. McKenney, linux-mm, akpm, nacc
On Sat, 30 Jun 2007, Lee Schermerhorn wrote:
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2007-06-30 12:56:51.000000000 -0400
> +++ Linux/mm/mempolicy.c 2007-06-30 13:49:12.000000000 -0400
> @@ -1077,16 +1077,20 @@ static struct mempolicy * get_vma_policy
> struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = task->mempolicy;
> + int shared_pol = 0;
>
> if (vma) {
> - if (vma->vm_ops && vma->vm_ops->get_policy)
> + if (vma->vm_ops && vma->vm_ops->get_policy) {
> pol = vma->vm_ops->get_policy(vma, addr);
> - else if (vma->vm_policy &&
> + shared_pol = 1; /* if non-NULL, that is */
> + } else if (vma->vm_policy &&
> vma->vm_policy->policy != MPOL_DEFAULT)
> pol = vma->vm_policy;
> }
> if (!pol)
> pol = &default_policy;
> + else if (!shared_pol && pol != current->mempolicy)
> + mpol_get(pol); /* vma or other task's policy */
> return pol;
> }
>
> @@ -1259,6 +1263,7 @@ struct page *
> alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = get_vma_policy(current, vma, addr);
> + struct zonelist *zl;
>
> cpuset_update_task_memory_state();
>
> @@ -1268,7 +1273,19 @@ alloc_page_vma(gfp_t gfp, struct vm_area
> nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
> return alloc_page_interleave(gfp, 0, nid);
> }
> - return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> + zl = zonelist_policy(gfp, pol);
> + if (pol != &default_policy && pol != current->mempolicy) {
> + /*
> + * slow path: ref counted policy -- shared or vma
> + */
> + struct page *page = __alloc_pages(gfp, 0, zl);
> + __mpol_free(pol);
> + return page;
> + }
> + /*
> + * fast path: default or task policy
> + */
> + return __alloc_pages(gfp, 0, zl);
> }
Argh. Some hot paths are touched here.
Thread overview: 48+ messages
2007-06-25 19:52 [PATCH/RFC 0/11] Shared Policy Overview Lee Schermerhorn
2007-06-25 19:52 ` [PATCH/RFC 1/11] Shared Policy: move shared policy to inode/mapping Lee Schermerhorn
2007-06-25 19:52 ` [PATCH/RFC 2/11] Shared Policy: allocate shared policies as needed Lee Schermerhorn
2007-06-25 19:52 ` [PATCH/RFC 3/11] Shared Policy: let vma policy ops handle sub-vma policies Lee Schermerhorn
2007-06-25 19:52 ` [PATCH/RFC 4/11] Shared Policy: fix show_numa_maps() Lee Schermerhorn
2007-06-25 19:52 ` [PATCH/RFC 5/11] Shared Policy: Add hugepage shmem policy vm_ops Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 6/11] Shared Policy: Factor alloc_page_pol routine Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 7/11] Shared Policy: use shared policy for page cache allocations Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 8/11] Shared Policy: fix migration of private mappings Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 9/11] Shared Policy: mapped file policy persistence model Lee Schermerhorn
2007-06-25 19:53 ` [PATCH/RFC 10/11] Shared Policy: per cpuset shared file policy control Lee Schermerhorn
2007-06-25 21:10 ` Paul Jackson
2007-06-27 17:33 ` Lee Schermerhorn
2007-06-27 19:52 ` Paul Jackson
2007-06-27 20:22 ` Lee Schermerhorn
2007-06-27 20:36 ` Paul Jackson
2007-06-25 19:53 ` [PATCH/RFC 11/11] Shared Policy: add generic file set/get policy vm ops Lee Schermerhorn
2007-06-26 22:17 ` [PATCH/RFC 0/11] Shared Policy Overview Christoph Lameter
2007-06-27 13:43 ` Lee Schermerhorn
2007-06-26 22:21 ` Christoph Lameter
2007-06-26 22:42 ` Andi Kleen
2007-06-27 3:25 ` Christoph Lameter
2007-06-27 20:14 ` Lee Schermerhorn
2007-06-27 18:14 ` Lee Schermerhorn
2007-06-27 21:37 ` Christoph Lameter
2007-06-27 22:01 ` Andi Kleen
2007-06-27 22:08 ` Christoph Lameter
2007-06-27 23:46 ` Paul E. McKenney
2007-06-28 0:14 ` Andi Kleen
2007-06-29 21:47 ` Lee Schermerhorn
2007-06-28 13:42 ` Lee Schermerhorn
2007-06-28 22:02 ` Andi Kleen
2007-06-29 17:14 ` Lee Schermerhorn
2007-06-29 17:42 ` Andi Kleen
2007-06-30 18:34 ` [PATCH/RFC] Fix Mempolicy Ref Counts - was " Lee Schermerhorn
2007-07-03 18:09 ` Christoph Lameter
2007-06-29 1:39 ` Christoph Lameter
2007-06-29 9:01 ` Andi Kleen
2007-06-29 14:05 ` Christoph Lameter
2007-06-29 17:41 ` Lee Schermerhorn
2007-06-29 20:15 ` Christoph Lameter
2007-06-29 13:22 ` Lee Schermerhorn
2007-06-29 14:18 ` Christoph Lameter
2007-06-27 23:36 ` Lee Schermerhorn
2007-06-29 1:41 ` Christoph Lameter
2007-06-29 13:30 ` Lee Schermerhorn
2007-06-29 14:20 ` Andi Kleen
2007-06-29 21:40 ` Lee Schermerhorn