From: Lee Schermerhorn <lee.schermerhorn@hp.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, nacc@us.ibm.com, ak@suse.de,
Lee Schermerhorn <lee.schermerhorn@hp.com>,
clameter@sgi.com
Subject: [PATCH/RFC 4/11] Shared Policy: fix show_numa_maps()
Date: Mon, 25 Jun 2007 15:52:51 -0400
Message-ID: <20070625195251.21210.18354.sendpatchset@localhost>
In-Reply-To: <20070625195224.21210.89898.sendpatchset@localhost>
Shared Policy Infrastructure 4/11 - fix show_numa_maps()
Against 2.6.22-rc4-mm2
This patch updates the procfs numa_maps display to handle multiple
shared policy ranges on a single vma. numa_maps still uses the
procfs task maps infrastructure, but provides wrappers around the
maps seq_file ops to handle shared policy "submaps", if any.
This fixes a problem with numa_maps for shared mappings:
Before this [mapped file policy] patch series, numa_maps could show
you different results for shared mappings depending on which task you
examined. A task which has installed shared policies on sub-ranges
of the shared region will show the policies on the sub-ranges, as the
vmas for that task were split when the policies were installed.
Another task that shares the region but installed no policies, or
installed policies on different sub-ranges, will show a different
set of policies and ranges, based on that task's VMAs. By displaying
the policies directly from the shared
policy structure, we now see the same info from each task that maps
the segment.
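As a hypothetical illustration [addresses, policies and file name
invented here], a shared segment with one interleaved and one bound
sub-range would now show the same three lines in every mapping task's
numa_maps: the leading default-policy gap, then the two policy ranges:

    2aaaab000000 default file=/dev/shm/my_seg
    2aaaab400000 interleave=0-3 file=/dev/shm/my_seg
    2aaaab800000 bind=2 file=/dev/shm/my_seg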
The patch expands the proc_maps_private struct [#ifdef CONFIG_NUMA]
to track the existence of and progress through a submap for the
"current" vma. For vmas with shared policy submaps, a new
function--get_numa_submap()--in mm/mempolicy.c allocates and
populates an array of the policy ranges in the shared policy.
To facilitate this, the shared policy struct tracks the number
of ranges [sp_nodes] in the tree.
The nm_* numa_map seq_file wrappers pass the range to be displayed
to show_numa_map() via the saddr and eaddr members added to the
proc_maps_private struct. The patch modifies show_numa_map() to
use these members, where appropriate, instead of vm_start, vm_end.
As before, once the internal page-sized buffer is full, seq_read()
suspends the display, drops the mmap_sem and exits the read.
During this time the vma list can change. However, even within a
single seq_read(), the shared_policy "submap" can be changed by
other mappers. We could prevent this by holding the shared policy
spin_lock or otherwise holding off other mappers. That would also
hold off other tasks faulting in pages, attempting to look up the
policy for that offset, unless we convert the lock to reader/writer.
It doesn't seem worth the effort, as the numa_map is only a snapshot
in any case. So, this patch makes a best effort [at least as good as
unpatched task map code, I think] to perform a single scan over the
address space, displaying the policies and page state/location
for policy ranges "snapped" under spin lock into the "submap"
array mentioned above.
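For reference, below is a minimal userspace sketch [not part of the
patch; error handling omitted; the segment name, sizes and node
numbers are invented] of how a task installs the sub-range shared
policies that the new display walks.  Build with -lnuma [and -lrt on
older glibc for shm_open()]:

#include <numaif.h>	/* mbind(), MPOL_* */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	unsigned long sz = 16UL << 20;
	unsigned long nodes01 = 0x3;	/* nodemask: nodes 0-1 */
	unsigned long node2 = 0x4;	/* nodemask: node 2 */
	char *p;
	int fd;

	/* shared, file-backed [tmpfs] segment: the mbind()s below
	 * install policies on sub-ranges of the shared object */
	fd = shm_open("/my_seg", O_CREAT | O_RDWR, 0600);
	ftruncate(fd, sz);
	p = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* two policy ranges => two sp_nodes in the shared policy tree */
	mbind(p, 4UL << 20, MPOL_INTERLEAVE, &nodes01,
	      sizeof(nodes01) * 8, 0);
	mbind(p + (8UL << 20), 4UL << 20, MPOL_BIND, &node2,
	      sizeof(node2) * 8, 0);

	pause();	/* examine /proc/<pid>/numa_maps from another shell */
	return 0;
}

Any other task that mmap()s the same segment now sees the same policy
ranges in its numa_maps, whether or not it ever called mbind() itself.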
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
fs/proc/task_mmu.c | 191 ++++++++++++++++++++++++++++++++++++++++--
include/linux/mempolicy.h | 5 +
include/linux/mm.h | 6 +
include/linux/proc_fs.h | 12 ++
include/linux/shared_policy.h | 3
mm/mempolicy.c | 57 +++++++++++-
6 files changed, 264 insertions(+), 10 deletions(-)
Index: Linux/include/linux/proc_fs.h
===================================================================
--- Linux.orig/include/linux/proc_fs.h 2007-06-22 13:07:48.000000000 -0400
+++ Linux/include/linux/proc_fs.h 2007-06-22 13:10:39.000000000 -0400
@@ -281,12 +281,24 @@ static inline struct proc_dir_entry *PDE
return PROC_I(inode)->pde;
}
+struct mpol_range {
+ unsigned long saddr;
+ unsigned long eaddr;
+};
+
struct proc_maps_private {
struct pid *pid;
struct task_struct *task;
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
#endif
+
+#ifdef CONFIG_NUMA
+ struct vm_area_struct *vma; /* preserved over seq_reads */
+ unsigned long saddr;
+ unsigned long eaddr; /* preserved over seq_reads */
+ struct mpol_range *range, *ranges; /* preserved ... */
+#endif
};
#endif /* _LINUX_PROC_FS_H */
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h 2007-06-22 13:10:35.000000000 -0400
+++ Linux/include/linux/mm.h 2007-06-22 13:10:39.000000000 -0400
@@ -1064,6 +1064,12 @@ static inline pgoff_t vma_addr_to_pgoff(
{
return ((addr - vma->vm_start) >> shift) + vma->vm_pgoff;
}
+
+static inline pgoff_t vma_pgoff_to_addr(struct vm_area_struct *vma,
+ pgoff_t pgoff)
+{
+ return ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
+}
#else
static inline void setup_per_cpu_pageset(void) {}
#endif
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-06-22 13:10:30.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-06-22 13:10:39.000000000 -0400
@@ -139,6 +139,11 @@ static inline void check_highest_zone(en
int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
+struct seq_file;
+extern int show_numa_map(struct seq_file *, void *);
+struct mpol_range;
+extern struct mpol_range *get_numa_submap(struct vm_area_struct *);
+
#else
struct mempolicy {};
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-06-22 13:10:35.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-06-22 13:11:15.000000000 -0400
@@ -1469,6 +1469,7 @@ static void sp_insert(struct shared_poli
}
rb_link_node(&new->nd, parent, p);
rb_insert_color(&new->nd, &sp->root);
+ ++sp->nr_sp_nodes;
PDprintk("inserting %lx-%lx: %d\n", new->start, new->end,
new->policy ? new->policy->policy : 0);
}
@@ -1498,6 +1499,7 @@ static void sp_delete(struct shared_poli
rb_erase(&n->nd, &sp->root);
mpol_free(n->policy);
kmem_cache_free(sn_cache, n);
+ --sp->nr_sp_nodes;
}
struct sp_node *
@@ -1575,6 +1577,7 @@ struct shared_policy *mpol_shared_policy
return ERR_PTR(-ENOMEM);
sp->root = RB_ROOT;
spin_lock_init(&sp->lock);
+ sp->nr_sp_nodes = 0;
if (policy != MPOL_DEFAULT) {
struct mempolicy *newpol;
@@ -1989,9 +1992,9 @@ int show_numa_map(struct seq_file *m, vo
return 0;
mpol_to_str(buffer, sizeof(buffer),
- get_vma_policy(priv->task, vma, vma->vm_start));
+ get_vma_policy(priv->task, vma, priv->saddr));
- seq_printf(m, "%08lx %s", vma->vm_start, buffer);
+ seq_printf(m, "%08lx %s", priv->saddr, buffer);
if (file) {
seq_printf(m, " file=");
@@ -2004,10 +2007,10 @@ int show_numa_map(struct seq_file *m, vo
}
if (is_vm_hugetlb_page(vma)) {
- check_huge_range(vma, vma->vm_start, vma->vm_end, md);
+ check_huge_range(vma, priv->saddr, priv->eaddr, md);
seq_printf(m, " huge");
} else {
- check_pgd_range(vma, vma->vm_start, vma->vm_end,
+ check_pgd_range(vma, priv->saddr, priv->eaddr,
&node_online_map, MPOL_MF_STATS, md);
}
@@ -2046,3 +2049,49 @@ out:
m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
return 0;
}
+
+/*
+ * alloc/populate array of shared policy ranges for show_numa_map()
+ */
+struct mpol_range *get_numa_submap(struct vm_area_struct *vma)
+{
+ struct shared_policy *sp;
+ struct mpol_range *ranges, *range;
+ struct rb_node *rbn;
+ int nranges;
+
+ BUG_ON(!vma->vm_file);
+ sp = mapping_shared_policy(vma->vm_file->f_mapping);
+ if (!sp)
+ return NULL;
+
+ nranges = sp->nr_sp_nodes;
+ if (!nranges)
+ return NULL;
+
+ ranges = kzalloc((nranges + 1) * sizeof(*ranges), GFP_KERNEL);
+ if (!ranges)
+ return NULL; /* pretend there are none */
+
+ range = ranges;
+ spin_lock(&sp->lock);
+ /*
+ * # of ranges could have changed since we checked, but that is
+ * unlikely, so this is close enough [as long as it's safe].
+ */
+ rbn = rb_first(&sp->root);
+ /*
+ * cap the walk at nranges so we always leave one zeroed range
+ * struct, in case a node was added between check and alloc
+ */
+ while (rbn && nranges--) {
+ struct sp_node *spn = rb_entry(rbn, struct sp_node, nd);
+ range->saddr = vma_pgoff_to_addr(vma, spn->start);
+ range->eaddr = vma_pgoff_to_addr(vma, spn->end);
+ ++range;
+ rbn = rb_next(rbn);
+ }
+
+ spin_unlock(&sp->lock);
+ return ranges;
+}
Index: Linux/fs/proc/task_mmu.c
===================================================================
--- Linux.orig/fs/proc/task_mmu.c 2007-06-22 13:07:48.000000000 -0400
+++ Linux/fs/proc/task_mmu.c 2007-06-22 13:10:39.000000000 -0400
@@ -498,7 +498,188 @@ const struct file_operations proc_clear_
#endif
#ifdef CONFIG_NUMA
-extern int show_numa_map(struct seq_file *m, void *v);
+/*
+ * numa_maps uses procfs task maps file operations, with wrappers
+ * to handle mpol submaps--policy ranges within a vma
+ */
+
+/*
+ * start processing a new vma for show_numa_maps
+ */
+static void nm_vma_start(struct proc_maps_private *priv,
+ struct vm_area_struct *vma)
+{
+ if (!vma)
+ return;
+ priv->vma = vma; /* saved across read()s */
+
+ priv->saddr = vma->vm_start;
+ if (!(vma->vm_flags & VM_SHARED) || !vma->vm_file ||
+ !vma->vm_file->f_mapping->spolicy) {
+ /*
+ * usual case: no submap
+ */
+ priv->eaddr = vma->vm_end;
+ return;
+ }
+
+ priv->range = priv->ranges = get_numa_submap(vma);
+ if (!priv->range) {
+ priv->eaddr = vma->vm_end; /* empty shared policy */
+ return;
+ }
+
+ /*
+ * restart suspended submap where we left off
+ */
+ while (priv->range->eaddr && priv->range->eaddr < priv->eaddr)
+ ++priv->range;
+
+ if (!priv->range->eaddr)
+ priv->eaddr = vma->vm_end;
+ else if (priv->saddr < priv->range->saddr)
+ priv->eaddr = priv->range->saddr; /* show gap [default pol] */
+ else
+ priv->eaddr = priv->range->eaddr; /* show range */
+}
+
+/*
+ * done with numa_maps vma: reset so we start a new
+ * vma on next seq_read.
+ */
+static void nm_vma_stop(struct proc_maps_private *priv)
+{
+ if (priv->ranges)
+ kfree(priv->ranges);
+ priv->ranges = priv->range = NULL;
+ priv->vma = NULL;
+}
+
+/*
+ * Advance to next vma in mm or next subrange in vma.
+ * mmap_sem held during a single seq_read(), but shared
+ * policy ranges can be modified at any time by other
+ * mappers. We just continue to display the ranges we
+ * found when we started the vma.
+ */
+static void *nm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma = v;
+
+ if (!priv->range || priv->eaddr >= vma->vm_end) {
+ /*
+ * usual case: no submap or end of vma
+ * re: '>=' -- in case we got here from nm_start()
+ * and vma @ pos truncated to < priv->eaddr
+ */
+ nm_vma_stop(priv);
+ vma = m_next(m, v, pos);
+ nm_vma_start(priv, vma);
+ return vma;
+ }
+
+ /*
+ * Advance to next range in submap
+ */
+ priv->saddr = priv->eaddr;
+ if (priv->eaddr == priv->range->saddr) {
+ /*
+ * just processed a gap in the submap
+ */
+ priv->eaddr = min(priv->range->eaddr, vma->vm_end);
+ return vma; /* show the range */
+ }
+
+ ++priv->range;
+ if (!priv->range->eaddr)
+ priv->eaddr = vma->vm_end; /* past end of ranges */
+ else if (priv->saddr < priv->range->saddr)
+ priv->eaddr = priv->range->saddr; /* gap in submap */
+ else
+ priv->eaddr = min(priv->range->eaddr, vma->vm_end);
+
+ return vma;
+}
+
+/*
+ * [Re]start scan for new seq_read().
+ * N.B., much could have changed in the mm, as we dropped the mmap_sem
+ * between read()s. Need to call m_start() to find vma at pos.
+ */
+static void *nm_start(struct seq_file *m, loff_t *pos)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma;
+
+ if (!priv->range) {
+ /*
+ * usual case: 1st after open, or finished prev vma
+ */
+ vma = m_start(m, pos);
+ nm_vma_start(priv, vma);
+ return vma;
+ }
+
+ /*
+ * Continue with submap of "current" vma. However, vma could have
+ * been unmapped, split, truncated, ... between read()s.
+ * Reset "last_addr" to simulate seek; find vma by 'pos'.
+ */
+ m->version = 0;
+ --(*pos); /* seq_read() incremented it */
+ vma = m_start(m, pos);
+ if (vma != priv->vma)
+ goto new_vma;
+ /*
+ * Same vma address as where we left off, but could have different
+ * ranges or could be entirely different vma.
+ */
+ if (vma->vm_start > priv->eaddr)
+ goto new_vma; /* starts past last range displayed */
+ if (priv->eaddr < vma->vm_end) {
+ /*
+ * vma at pos still covers eaddr--where we left off. Submap
+ * could have changed, but we'll keep reporting ranges we found
+ * earlier up to vm_end.
+ * We hope it is very unlikely that submap changed.
+ */
+ return nm_next(m, vma, pos);
+ }
+
+ /*
+ * Already reported past end of vma; find next vma past eaddr
+ */
+ while (vma && vma->vm_end < priv->eaddr)
+ vma = m_next(m, vma, pos);
+
+new_vma:
+ /*
+ * new vma at pos; continue from ~ last eaddr
+ */
+ nm_vma_stop(priv);
+ nm_vma_start(priv, vma);
+ return vma;
+}
+
+/*
+ * Suspend display of numa_map--e.g., buffer full?
+ */
+static void nm_stop(struct seq_file *m, void *v)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma = v;
+
+ if (!vma || priv->eaddr >= vma->vm_end) {
+ nm_vma_stop(priv);
+ }
+ /*
+ * leave state in priv for nm_start(); but drop the
+ * mmap_sem and unref the mm
+ */
+ m_stop(m, v);
+}
+
static int show_numa_map_checked(struct seq_file *m, void *v)
{
@@ -512,10 +693,10 @@ static int show_numa_map_checked(struct
}
static struct seq_operations proc_pid_numa_maps_op = {
- .start = m_start,
- .next = m_next,
- .stop = m_stop,
- .show = show_numa_map_checked
+ .start = nm_start,
+ .next = nm_next,
+ .stop = nm_stop,
+ .show = show_numa_map_checked
};
static int numa_maps_open(struct inode *inode, struct file *file)
Index: Linux/include/linux/shared_policy.h
===================================================================
--- Linux.orig/include/linux/shared_policy.h 2007-06-22 13:10:35.000000000 -0400
+++ Linux/include/linux/shared_policy.h 2007-06-22 13:10:39.000000000 -0400
@@ -25,7 +25,8 @@ struct sp_node {
struct shared_policy {
struct rb_root root;
- spinlock_t lock; /* protects rb tree */
+ spinlock_t lock; /* protects rb tree */
+ int nr_sp_nodes; /* for numa_maps */
};
extern struct shared_policy *mpol_shared_policy_new(struct address_space *,
--