* [RFC][PATCH 0/1] Node-based reclaim/migration
@ 2006-11-29 3:06 menage
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
` (2 more replies)
0 siblings, 3 replies; 54+ messages in thread
From: menage @ 2006-11-29 3:06 UTC (permalink / raw)
To: linux-mm; +Cc: akpm
--
We're trying to use NUMA node isolation as a form of job resource
control at Google, and the existing page migration APIs are all bound
to individual processes and so are a bit clunky to use when you just
want to affect all the pages on a given node.
How about an API to allow userspace to direct page migration (and page
reclaim) on a per-node basis? This patch provides such an API, based
around sysfs; a system call approach would certainly be possible too.
It sort of overlaps with memory hot-unplug, but is simpler since it's
not so bad if we miss a few pages.
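For illustration, a minimal userspace sketch of how middleware might drive the two proposed files. The sysfs root used here is an assumption (sysdev node attributes normally appear under /sys/devices/system/node), the helper names are hypothetical, and the files only exist on a kernel carrying this patch.

```c
#include <stdio.h>
#include <string.h>

/* Assumed sysfs root for per-node attributes (not confirmed by the patch text). */
#define NODE_SYSFS_ROOT "/sys/devices/system/node"

/* Build the path of a per-node attribute; returns 0, or -1 if it does not fit. */
int node_attr_path(char *buf, size_t len, int nid, const char *attr)
{
	int n = snprintf(buf, len, "%s/node%d/%s", NODE_SYSFS_ROOT, nid, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a string to a per-node attribute; returns 0 on success, -1 on failure. */
int node_attr_write(int nid, const char *attr, const char *value)
{
	char path[128];
	FILE *f;
	int ok;

	if (node_attr_path(path, sizeof(path), nid, attr) < 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	/* The result of the kernel's store() handler only surfaces on flush/close. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Ask for reclaim on node nid, giving up at min_priority (0 = try hardest). */
int node_reclaim(int nid, int min_priority)
{
	char val[16];

	snprintf(val, sizeof(val), "%d", min_priority);
	return node_attr_write(nid, "try_to_free_pages", val);
}

/* Ask for migration off node nid to the nodes in dest_nodes, e.g. "1-2". */
int node_migrate(int nid, const char *dest_nodes)
{
	return node_attr_write(nid, "migrate_node", dest_nodes);
}
```

A caller would use something like node_reclaim(3, 6) followed by node_migrate(3, "1-2"); both simply return -1 on a kernel without the patch.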
Comments? Also, can anyone clarify whether I need any locking when
scanning the pages in a pgdat? As far as I can see, even with memory
hotplug this number can only increase, not decrease.
Paul
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 3:06 [RFC][PATCH 0/1] Node-based reclaim/migration menage
@ 2006-11-29 3:06 ` menage
2006-11-29 6:07 ` Nick Piggin
` (2 more replies)
2006-11-30 0:31 ` [RFC][PATCH 0/1] Node-based reclaim/migration KAMEZAWA Hiroyuki
2006-11-30 4:04 ` Christoph Lameter
2 siblings, 3 replies; 54+ messages in thread
From: menage @ 2006-11-29 3:06 UTC (permalink / raw)
To: linux-mm; +Cc: akpm
[-- Attachment #1: node_reclaim.patch --]
[-- Type: text/plain, Size: 8417 bytes --]
Currently the page migration APIs allow you to migrate pages from
particular processes, but don't provide a clean and efficient way to
migrate and/or reclaim memory from individual nodes.
This patch provides:
- an additional parameter to try_to_free_pages() to specify the
priority at which the reclaim should give up if it doesn't make
progress
- a way to trigger try_to_free_pages() for a given node with a given
minimum priority, by writing an integer to
/sys/devices/system/node/node<id>/try_to_free_pages
- a way to request that any migratable pages on a given node be
migrated to available pages on a specified set of nodes by writing a
destination nodemask (in ASCII form) to
/sys/devices/system/node/node<id>/migrate_node
Signed-off-by: Paul Menage <menage@google.com>
---
drivers/base/node.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++
fs/buffer.c | 2 -
include/linux/mempolicy.h | 2 +
include/linux/swap.h | 2 -
mm/mempolicy.c | 3 -
mm/page_alloc.c | 2 -
mm/vmscan.c | 5 +-
7 files changed, 101 insertions(+), 7 deletions(-)
Index: 2.6.19-node_reclaim/drivers/base/node.c
===================================================================
--- 2.6.19-node_reclaim.orig/drivers/base/node.c
+++ 2.6.19-node_reclaim/drivers/base/node.c
@@ -12,6 +12,8 @@
#include <linux/topology.h>
#include <linux/nodemask.h>
#include <linux/cpu.h>
+#include <linux/swap.h>
+#include <linux/migrate.h>
static struct sysdev_class node_class = {
set_kset_name("node"),
@@ -137,6 +139,92 @@ static ssize_t node_read_distance(struct
static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
+static ssize_t node_store_ttfp(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf,
+ size_t count) {
+ int nid = dev->id;
+ unsigned int priority;
+ struct zonelist *zl;
+ nodemask_t nodes;
+ ssize_t ret = count;
+
+ priority = max(0, min(DEF_PRIORITY, (int)simple_strtoul(buf, NULL, 0)));
+ printk(KERN_INFO "Calling try_to_free_pages(%d, %d)\n",
+ nid, priority);
+
+ nodes_clear(nodes);
+ node_set(nid, nodes);
+ zl = bind_zonelist(&nodes);
+
+ if (!try_to_free_pages(zl->zones, GFP_USER, priority))
+ ret = -ENOMEM;
+
+ kfree(zl);
+
+ return ret;
+}
+
+static SYSDEV_ATTR(try_to_free_pages, 0200, NULL, node_store_ttfp);
+
+static struct page *migrate_from_node_page(struct page *page,
+ unsigned long private,
+ int **result) {
+ struct zonelist *zl = (struct zonelist *) private;
+ return __alloc_pages(GFP_HIGHUSER & ~__GFP_WAIT, 0, zl);
+}
+
+static ssize_t node_store_migrate_node(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf,
+ size_t count) {
+ int nid = dev->id;
+ nodemask_t nodes;
+ ssize_t ret;
+ struct zonelist *zl;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ int i;
+ int pagecount = 0, failcount = 0;
+ LIST_HEAD(pagelist);
+
+ ret = nodelist_parse(buf, nodes);
+ if (ret)
+ return ret;
+
+ zl = bind_zonelist(&nodes);
+
+ migrate_prep();
+
+ for (i = 0; i < pgdat->node_spanned_pages; ++i) {
+ struct page *page = pgdat_page_nr(pgdat, i);
+ if (!isolate_lru_page(page, &pagelist)) {
+ pagecount++;
+ } else {
+ failcount++;
+ }
+ }
+
+ ret = count;
+ printk(KERN_INFO "Migrating %d pages from node %d\n", pagecount, nid);
+ if (!list_empty(&pagelist)) {
+ int migrate_ret = migrate_pages(&pagelist,
+ migrate_from_node_page,
+ (unsigned long)zl);
+
+ printk(KERN_INFO "migrate_pages returned %d\n", migrate_ret);
+ if (migrate_ret < 0) {
+ ret = migrate_ret;
+ }
+ } else {
+ printk(KERN_INFO "No pages to migrate. Failcount = %d!\n",
+ failcount++);
+ }
+
+ kfree(zl);
+ return ret;
+}
+
+static SYSDEV_ATTR(migrate_node, 0200, NULL, node_store_migrate_node);
/*
* register_node - Setup a driverfs device for a node.
* @num - Node number to use when creating the device.
@@ -156,6 +244,8 @@ int register_node(struct node *node, int
sysdev_create_file(&node->sysdev, &attr_meminfo);
sysdev_create_file(&node->sysdev, &attr_numastat);
sysdev_create_file(&node->sysdev, &attr_distance);
+ sysdev_create_file(&node->sysdev, &attr_try_to_free_pages);
+ sysdev_create_file(&node->sysdev, &attr_migrate_node);
}
return error;
}
@@ -173,6 +263,8 @@ void unregister_node(struct node *node)
sysdev_remove_file(&node->sysdev, &attr_meminfo);
sysdev_remove_file(&node->sysdev, &attr_numastat);
sysdev_remove_file(&node->sysdev, &attr_distance);
+ sysdev_remove_file(&node->sysdev, &attr_try_to_free_pages);
+ sysdev_remove_file(&node->sysdev, &attr_migrate_node);
sysdev_unregister(&node->sysdev);
}
Index: 2.6.19-node_reclaim/fs/buffer.c
===================================================================
--- 2.6.19-node_reclaim.orig/fs/buffer.c
+++ 2.6.19-node_reclaim/fs/buffer.c
@@ -374,7 +374,7 @@ static void free_more_memory(void)
for_each_online_pgdat(pgdat) {
zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
if (*zones)
- try_to_free_pages(zones, GFP_NOFS);
+ try_to_free_pages(zones, GFP_NOFS, 0);
}
}
Index: 2.6.19-node_reclaim/include/linux/mempolicy.h
===================================================================
--- 2.6.19-node_reclaim.orig/include/linux/mempolicy.h
+++ 2.6.19-node_reclaim/include/linux/mempolicy.h
@@ -175,6 +175,8 @@ int do_migrate_pages(struct mm_struct *m
extern void *cpuset_being_rebound; /* Trigger mpol_copy vma rebind */
+struct zonelist *bind_zonelist(nodemask_t *nodes);
+
#else
struct mempolicy {};
Index: 2.6.19-node_reclaim/include/linux/swap.h
===================================================================
--- 2.6.19-node_reclaim.orig/include/linux/swap.h
+++ 2.6.19-node_reclaim/include/linux/swap.h
@@ -187,7 +187,7 @@ extern int rotate_reclaimable_page(struc
extern void swap_setup(void);
/* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **, gfp_t);
+extern unsigned long try_to_free_pages(struct zone **, gfp_t, int priority);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: 2.6.19-node_reclaim/mm/mempolicy.c
===================================================================
--- 2.6.19-node_reclaim.orig/mm/mempolicy.c
+++ 2.6.19-node_reclaim/mm/mempolicy.c
@@ -134,7 +134,7 @@ static int mpol_check_policy(int mode, n
}
/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+struct zonelist *bind_zonelist(nodemask_t *nodes)
{
struct zonelist *zl;
int num, max, nd;
@@ -1908,4 +1908,3 @@ out:
m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
return 0;
}
-
Index: 2.6.19-node_reclaim/mm/page_alloc.c
===================================================================
--- 2.6.19-node_reclaim.orig/mm/page_alloc.c
+++ 2.6.19-node_reclaim/mm/page_alloc.c
@@ -1371,7 +1371,7 @@ nofail_alloc:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask);
+ did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask, 0);
p->reclaim_state = NULL;
Index: 2.6.19-node_reclaim/mm/vmscan.c
===================================================================
--- 2.6.19-node_reclaim.orig/mm/vmscan.c
+++ 2.6.19-node_reclaim/mm/vmscan.c
@@ -1014,7 +1014,8 @@ static unsigned long shrink_zones(int pr
* holds filesystem locks which prevent writeout this might not work, and the
* allocation attempt will fail.
*/
-unsigned long try_to_free_pages(struct zone **zones, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zone **zones, gfp_t gfp_mask,
+ int min_priority)
{
int priority;
int ret = 0;
@@ -1057,7 +1058,7 @@ unsigned long try_to_free_pages(struct z
lru_pages += zone->nr_active + zone->nr_inactive;
}
- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (priority = DEF_PRIORITY; priority >= min_priority; priority--) {
sc.nr_scanned = 0;
if (!priority)
disable_swap_token();
--
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
@ 2006-11-29 6:07 ` Nick Piggin
2006-11-29 21:57 ` Paul Menage
2006-11-30 0:18 ` KAMEZAWA Hiroyuki
2006-11-30 4:10 ` Christoph Lameter
2 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-29 6:07 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm
menage@google.com wrote:
> Currently the page migration APIs allow you to migrate pages from
> particular processes, but don't provide a clean and efficient way to
> migrate and/or reclaim memory from individual nodes.
The mechanism for that should probably go in mm/migrate.c, shouldn't
it?
Also, why don't you scan the lru lists of the zones in the node, which
will a) be much more efficient if there are lots of non LRU pages, and
b) allow you to batch the lru lock.
>
> This patch provides:
>
> - an additional parameter to try_to_free_pages() to specify the
> priority at which the reclaim should give up if it doesn't make
> progress
Dang. It would be nice not to export this "priority" stuff outside
vmscan.c too much because it is really an implementation detail and
I would like to get rid of it one day...
>
> - a way to trigger try_to_free_pages() for a given node with a given
> minimum priority, by writing an integer to
> /sys/devices/system/node/node<id>/try_to_free_pages
... especially not to userspace. Why does this have to be exposed to
userspace at all? Can you not wire it up to your resource isolation
implementation in the kernel?
>
> - a way to request that any migratable pages on a given node be
> migrated to available pages on a specified set of nodes by writing a
> destination nodemask (in ASCII form) to
> /sys/devices/system/node/node<id>/migrate_node
... yeah it would obviously be much nicer to do it in kernel space,
behind your higher level APIs. There's probably a good reason why you
aren't, and I haven't been following the lists very much over the
past couple of weeks... Can you describe your problems (or point me
to a post)?
Thanks,
Nick
--
SUSE Labs, Novell Inc.
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 6:07 ` Nick Piggin
@ 2006-11-29 21:57 ` Paul Menage
2006-11-30 4:13 ` Christoph Lameter
2006-11-30 7:38 ` Nick Piggin
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-29 21:57 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm
On 11/28/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> menage@google.com wrote:
> > Currently the page migration APIs allow you to migrate pages from
> > particular processes, but don't provide a clean and efficient way to
> > migrate and/or reclaim memory from individual nodes.
>
> The mechanism for that should probably go in mm/migrate.c, shouldn't
> it?
Quite possibly - I don't have a strong feeling for exactly where the
code should go. There's existing code (sys_migrate_pages) that uses
the migration mechanism that's in mm/mempolicy.c rather than
migrate.c, and this was a pretty simple function to write.
>
> Also, why don't you scan the lru lists of the zones in the node, which
> will a) be much more efficient if there are lots of non LRU pages, and
> b) allow you to batch the lru lock.
I'll take a look at that.
> >
> > - a way to trigger try_to_free_pages() for a given node with a given
> > minimum priority, by writing an integer to
> > /sys/devices/system/node/node<id>/try_to_free_pages
>
> ... especially not to userspace. Why does this have to be exposed to
> userspace at all?
We don't need to expose the raw "priority" value, but it would be
really nice for user space to be able to specify how hard the kernel
should try to free some memory.
Then each job can specify a "reclaim pressure", i.e. how much
back-pressure should be applied to its allocated memory, so you can
get a good idea of how much memory the job is really using for a given
level of performance. High reclaim pressure results in a smaller
working set but possibly more paging in from disk; low reclaim
pressure uses more memory but gets higher performance.
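As a rough sketch of the "reclaim pressure" idea, the following maps a hypothetical 0..100 pressure knob onto the minimum priority that the patch passes to try_to_free_pages(). The mapping and function names are assumptions for illustration only; DEF_PRIORITY is 12 in the kernel, and the per-pass scan size is modelled loosely on the "nr >> priority" scaling in shrink_zone().

```c
#define DEF_PRIORITY 12	/* same value as the kernel's DEF_PRIORITY */

/*
 * Hypothetical mapping: pressure 0 stops after the gentlest pass
 * (priority DEF_PRIORITY); pressure 100 lets reclaim run down to
 * priority 0, the most aggressive level.
 */
int pressure_to_min_priority(int pressure)
{
	if (pressure < 0)
		pressure = 0;
	if (pressure > 100)
		pressure = 100;
	return DEF_PRIORITY - (pressure * DEF_PRIORITY) / 100;
}

/*
 * Approximate number of pages of one LRU list scanned in a single pass
 * at the given priority: a smaller priority scans a larger share.
 */
unsigned long pages_scanned_at(unsigned long lru_pages, int priority)
{
	return (lru_pages >> priority) + 1;
}
```

With this mapping a job at pressure 50 would stop at priority 6, so no single pass would scan more than about 1/64 of each LRU list.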
> Can you not wire it up to your resource isolation
> implementation in the kernel?
This *is* the resource isolation implementation (plus the existing
cpusets and fake-numa code). The intention is to expose just enough
knobs/hooks to userspace that it can be handled there.
>
> ... yeah it would obviously be much nicer to do it in kernel space,
> behind your higher level APIs.
I don't think it would - keeping as much of the code as possible in
userspace makes development and deployment much faster. We don't
really have any higher-level APIs at this point - just userspace
middleware manipulating cpusets.
Paul
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
2006-11-29 6:07 ` Nick Piggin
@ 2006-11-30 0:18 ` KAMEZAWA Hiroyuki
2006-11-30 0:25 ` Paul Menage
2006-11-30 4:10 ` Christoph Lameter
2 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 0:18 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm
On Tue, 28 Nov 2006 19:06:56 -0800
menage@google.com wrote:
>
> + for (i = 0; i < pgdat->node_spanned_pages; ++i) {
> + struct page *page = pgdat_page_nr(pgdat, i);
you need pfn_valid() check before accessing page struct.
> + if (!isolate_lru_page(page, &pagelist)) {
you'll see panic if !PageLRU(page).
It looks like scanning the zone's LRU lists is more suitable for your purpose.
-Kame
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 0:18 ` KAMEZAWA Hiroyuki
@ 2006-11-30 0:25 ` Paul Menage
2006-11-30 0:38 ` KAMEZAWA Hiroyuki
2006-11-30 4:15 ` Christoph Lameter
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 0:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm
On 11/29/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 28 Nov 2006 19:06:56 -0800
> menage@google.com wrote:
>
> >
> > + for (i = 0; i < pgdat->node_spanned_pages; ++i) {
> > + struct page *page = pgdat_page_nr(pgdat, i);
> you need pfn_valid() check before accessing page struct.
OK. (That check can only fail if CONFIG_SPARSEMEM, right?)
>
>
> > + if (!isolate_lru_page(page, &pagelist)) {
> you'll see panic if !PageLRU(page).
In which kernel version? In 2.6.19-rc6 (also -mm1) there's no panic in
isolate_lru_page().
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-29 3:06 [RFC][PATCH 0/1] Node-based reclaim/migration menage
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
@ 2006-11-30 0:31 ` KAMEZAWA Hiroyuki
2006-11-30 0:31 ` Paul Menage
2006-11-30 4:04 ` Christoph Lameter
2 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 0:31 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm
On Tue, 28 Nov 2006 19:06:55 -0800
menage@google.com wrote:
> --
>
> We're trying to use NUMA node isolation as a form of job resource
> control at Google, and the existing page migration APIs are all bound
> to individual processes and so are a bit clunky to use when you just
> want to affect all the pages on a given node.
>
> How about an API to allow userspace to direct page migration (and page
> reclaim) on a per-node basis? This patch provides such an API, based
> around sysfs; a system call approach would certainly be possible too.
>
> It sort of overlaps with memory hot-unplug, but is simpler since it's
> not so bad if we miss a few pages.
>
> Comments? Also, can anyone clarify whether I need any locking when
> scanning the pages in a pgdat? As far as I can see, even with memory
> hotplug this number can only increase, not decrease.
>
Hi, I'm one of the memory hot-unplug people. (But I can't make progress on it for now.)
A few comments:
1. Memory hot-unplug will be implemented based on *sections*, not on *nodes*.
The section <-> node relationship will be displayed.
2. AFAIK, migrating pages without taking the write lock of any mm->sem will
cause problems. The anon_vma can be freed during migration.
3. It may be better to add a hook to stop page allocation from the target node (zone).
You may want to use this feature under heavy load.
Thanks,
-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 0:31 ` [RFC][PATCH 0/1] Node-based reclaim/migration KAMEZAWA Hiroyuki
@ 2006-11-30 0:31 ` Paul Menage
2006-11-30 4:11 ` KAMEZAWA Hiroyuki
2006-11-30 4:17 ` Christoph Lameter
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 0:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm
On 11/29/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> 2. AFAIK, migrating pages without taking write lock of any mm->sem will
> cause problem. anon_vma can be freed while migration.
Hmm, isn't migration just analogous to swapping out and swapping back
in again, but without the actual swapping?
If what you describe is a problem, then wouldn't you have a problem if
you were doing migration on a particular mm structure, but it was
sharing pages with another mm?
Paul
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 0:25 ` Paul Menage
@ 2006-11-30 0:38 ` KAMEZAWA Hiroyuki
2006-11-30 4:15 ` Christoph Lameter
1 sibling, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 0:38 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
On Wed, 29 Nov 2006 16:25:22 -0800
"Paul Menage" <menage@google.com> wrote:
> On 11/29/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 28 Nov 2006 19:06:56 -0800
> > menage@google.com wrote:
> >
> > >
> > > + for (i = 0; i < pgdat->node_spanned_pages; ++i) {
> > > + struct page *page = pgdat_page_nr(pgdat, i);
> > you need pfn_valid() check before accessing page struct.
>
> OK. (That check can only fail if CONFIG_SPARSEMEM, right?)
>
No, ia64's virtual memmap will fail too.
> >
> >
> > > + if (!isolate_lru_page(page, &pagelist)) {
> > you'll see panic if !PageLRU(page).
>
> In which kernel version? In 2.6.19-rc6 (also -mm1) there's no panic in
> isolate_lru_page().
>
Sorry, my mistake. I checked isolate_lru_pages() (><
-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-29 3:06 [RFC][PATCH 0/1] Node-based reclaim/migration menage
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
2006-11-30 0:31 ` [RFC][PATCH 0/1] Node-based reclaim/migration KAMEZAWA Hiroyuki
@ 2006-11-30 4:04 ` Christoph Lameter
2 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:04 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm
On Tue, 28 Nov 2006, menage@google.com wrote:
> Comments? Also, can anyone clarify whether I need any locking when
> scanning the pages in a pgdat? As far as I can see, even with memory
> hotplug this number can only increase, not decrease.
That depends on the way you scan...
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
2006-11-29 6:07 ` Nick Piggin
2006-11-30 0:18 ` KAMEZAWA Hiroyuki
@ 2006-11-30 4:10 ` Christoph Lameter
2 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:10 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm
On Tue, 28 Nov 2006, menage@google.com wrote:
> + for (i = 0; i < pgdat->node_spanned_pages; ++i) {
> + struct page *page = pgdat_page_nr(pgdat, i);
> + if (!isolate_lru_page(page, &pagelist)) {
> + pagecount++;
> + } else {
> + failcount++;
> + }
> + }
Go along the active / inactive LRU lists? isolate_lru_page will not
allow you to isolate other pages.
If you go along the lru lists then you also avoid having to deal with
holes in the memory map. You cannot simply assume that all struct pages in
the area are accessible.
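To make the point about holes concrete, here is a small userspace model (not kernel code) of a pfn-range walk that checks validity before touching a page. struct toy_page and count_isolatable() are hypothetical stand-ins; in the kernel the validity test would be pfn_valid() and the LRU test PageLRU().

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page, holding only what the model needs. */
struct toy_page {
	bool valid;	/* models pfn_valid(): false for a hole in the memory map */
	bool on_lru;	/* models PageLRU() */
};

/*
 * Walk a node's spanned range, skipping holes before looking at the
 * page, and count the pages that could be isolated from the LRU.
 */
size_t count_isolatable(const struct toy_page *map, size_t spanned,
			size_t *skipped_holes)
{
	size_t i, n = 0;

	*skipped_holes = 0;
	for (i = 0; i < spanned; i++) {
		if (!map[i].valid) {	/* a hole: there is no struct page to read */
			(*skipped_holes)++;
			continue;
		}
		if (map[i].on_lru)
			n++;
	}
	return n;
}
```

Walking the LRU lists instead avoids both checks, since every entry on those lists is by construction a valid LRU page.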
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 0:31 ` Paul Menage
@ 2006-11-30 4:11 ` KAMEZAWA Hiroyuki
2006-11-30 4:17 ` Christoph Lameter
1 sibling, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 4:11 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
On Wed, 29 Nov 2006 16:31:22 -0800
"Paul Menage" <menage@google.com> wrote:
> On 11/29/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > 2. AFAIK, migrating pages without taking write lock of any mm->sem will
> > cause problem. anon_vma can be freed while migration.
>
> Hmm, isn't migration just analogous to swapping out and swapping back
> in again, but without the actual swapping?
>
I'm sorry if there is no problem in the *current* kernel.
== See ==
http://lkml.org/lkml/2006/4/17/168
>mmap_sem must be held during page migration due to the way we retrieve the
>anonymous vma.
========
Logic
Consider migrating oldpage to newpage:
1. We unmap the oldpage at migration. Its page->mapcount becomes 0.
2. Copy the contents of the oldpage to the newpage. page->mapcount of both pages is 0.
3. Map the newpage. This uses the copied newpage->mapping. page->mapcount goes up.
And see rmap.c:
==
void page_remove_rmap(struct page *page)
{
	if (atomic_add_negative(-1, &page->_mapcount)) {
#ifdef CONFIG_DEBUG_VM
		if (unlikely(page_mapcount(page) < 0)) {
			printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
			printk (KERN_EMERG " page->flags = %lx\n", page->flags);
			printk (KERN_EMERG " page->count = %x\n", page_count(page));
			printk (KERN_EMERG " page->mapping = %p\n", page->mapping);
		}
#endif
		BUG_ON(page_mapcount(page) < 0);
		/*
		 * It would be tidy to reset the PageAnon mapping here,
		 * but that might overwrite a racing page_add_anon_rmap
		 * which increments mapcount after us but sets mapping
		 * before us: so leave the reset to free_hot_cold_page,
		 * and remember that it's only reliable while mapped.
		 * Leaving it set also helps swapoff to reinstate ptes
		 * faster for those pages still in swapcache.
		 */
		if (page_test_and_clear_dirty(page))
			set_page_dirty(page);
		__dec_zone_page_state(page,
				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
	}
}
====
We cannot trust page->mapping if page->mapcount == 0.
File pages are guarded by the address_space's lock.
If mm->sem is held, the oldpage's anon_vma/vm_area_struct will not change.
Then the relationship between the oldpage and its anon_vma will not change.
So, page migration with mm->sem is safe.
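The three steps above can be traced with a small, non-atomic userspace model. The struct and function names are hypothetical; the only kernel facts relied on are that page->_mapcount starts at -1 (so page_mapcount() is _mapcount + 1) and that page->mapping is only reliable while the page is mapped.

```c
#include <stdbool.h>

struct model_page {
	int mapcount_biased;	/* models page->_mapcount, which starts at -1 */
	const void *mapping;	/* models page->mapping (anon_vma or address_space) */
};

/* Models page_mapcount(): the number of ptes mapping the page. */
int model_mapcount(const struct model_page *p)
{
	return p->mapcount_biased + 1;
}

/* Models adding one reverse mapping. */
void model_add_rmap(struct model_page *p)
{
	p->mapcount_biased++;
}

/*
 * Models page_remove_rmap(): returns true when the last mapping went
 * away, which is the condition atomic_add_negative(-1, ...) tests.
 */
bool model_remove_rmap(struct model_page *p)
{
	p->mapcount_biased--;
	return p->mapcount_biased < 0;
}

/* page->mapping can only be trusted while the page is still mapped. */
bool model_mapping_reliable(const struct model_page *p)
{
	return model_mapcount(p) > 0;
}
```

Between step 1 and step 3 both pages report a mapcount of 0, which is the window in which nothing in the model pins the anon_vma that ->mapping points to.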
-Kame
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 21:57 ` Paul Menage
@ 2006-11-30 4:13 ` Christoph Lameter
2006-11-30 4:18 ` Paul Menage
2006-11-30 7:38 ` Nick Piggin
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:13 UTC (permalink / raw)
To: Paul Menage; +Cc: Nick Piggin, linux-mm, akpm
On Wed, 29 Nov 2006, Paul Menage wrote:
> Quite possibly - I don't have a strong feeling for exactly where the
> code should go. There's existing code (sys_migrate_pages) that uses
> the migration mechanism that's in mm/mempolicy.c rather than
> migrate.c, and this was a pretty simple function to write.
Plus there is another mechanism in mm/migrate.c that also uses the
migration mechanism.
> We don't need to expose the raw "priority" value, but it would be
> really nice for user space to be able to specify how hard the kernel
> should try to free some memory.
Would it not be sufficient to specify that in the number of attempts like
already provided by the page migration scheme?
> Then each job can specify a "reclaim pressure", i.e. how much
> back-pressure should be applied to its allocated memory, so you can
> get a good idea of how much memory the job is really using for a given
> level of performance. High reclaim pressure results in a smaller
> working set but possibly more paging in from disk; low reclaim
> pressure uses more memory but gets higher performance.
Reclaim? I thought you wanted to migrate memory of a node?
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 0:25 ` Paul Menage
2006-11-30 0:38 ` KAMEZAWA Hiroyuki
@ 2006-11-30 4:15 ` Christoph Lameter
1 sibling, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:15 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On Wed, 29 Nov 2006, Paul Menage wrote:
> In which kernel version? In 2.6.19-rc6 (also -mm1) there's no panic in
> isolate_lru_page().
Depends on the hardware and the Linux configuration (sparsemem,
virtual memmap, etc.).
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 0:31 ` Paul Menage
2006-11-30 4:11 ` KAMEZAWA Hiroyuki
@ 2006-11-30 4:17 ` Christoph Lameter
2006-11-30 10:45 ` Paul Menage
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:17 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On Wed, 29 Nov 2006, Paul Menage wrote:
> Hmm, isn't migration just analogous to swapping out and swapping back
> in again, but without the actual swapping?
That used to be the case in the beginning. Not anymore. The page is
directly moved to the target. Migration via swap is no longer supported.
> If what you describe is a problem, then wouldn't you have a problem if
> you were doing migration on a particular mm structure, but it was
> sharing pages with another mm?
You do not have a problem as long as you hold a mmap_sem lock on any of
the vmas in which the page appears. Kame and I discussed several
approaches on how to avoid the issue in the past but so far there was no
need to resolve the issue.
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 4:13 ` Christoph Lameter
@ 2006-11-30 4:18 ` Paul Menage
0 siblings, 0 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 4:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Nick Piggin, linux-mm, akpm
On 11/29/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> Reclaim? I thought you wanted to migrate memory of a node?
>
Both. The idea would be to apply gentle (or not so gentle, depending
on how important the job is ...) reclaim pressure to all the nodes
owned by a job. If you free up enough memory, you can then consider
migrating the allocated pages from one node into other nodes belonging
to the job, and hence reclaim a node for use by some other job.
Paul
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 21:57 ` Paul Menage
2006-11-30 4:13 ` Christoph Lameter
@ 2006-11-30 7:38 ` Nick Piggin
2006-11-30 7:57 ` Paul Menage
1 sibling, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 7:38 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
Paul Menage wrote:
> On 11/28/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> Can you not wire it up to your resource isolation
>> implementation in the kernel?
>
>
> This *is* the resource isolation implementation (plus the existing
> cpusets and fake-numa code). The intention is to expose just enough
> knobs/hooks to userspace that it can be handled there.
Yes, but when you migrate tasks between these containers, or when you
create/destroy them, then why can't you do the migration at that time?
>> ... yeah it would obviously be much nicer to do it in kernel space,
>> behind your higher level APIs.
>
>
> I don't think it would - keeping as much of the code as possible in
> userspace makes development and deployment much faster. We don't
> really have any higher-level APIs at this point - just userspace
> middleware manipulating cpusets.
We can't use that as an argument for the upstream kernel, but I
would believe that it is a good choice for google.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 7:38 ` Nick Piggin
@ 2006-11-30 7:57 ` Paul Menage
2006-11-30 8:26 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 7:57 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm
On 11/29/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Yes, but when you migrate tasks between these containers, or when you
> create/destroy them, then why can't you do the migration at that time?
?
The migration that I'm envisaging is going to occur when either:
- we're trying to move a job to a different real numa node because,
say, a new job has started that needs the whole of a node to itself,
and we need to clear space for it.
- we're trying to compact the memory usage of a job, when it has
plenty of free space in each of its nodes, and we can fit all the
memory into a smaller set of nodes.
Neither of these is tied to create/destroy time or moving processes
in/out of jobs (in fact we'd not be planning to move processes between
jobs - once a process is in a job it would stay there, although I
realise other people would have different requirements).
> > I don't think it would - keeping as much of the code as possible in
> > userspace makes development and deployment much faster. We don't
> > really have any higher-level APIs at this point - just userspace
> > middleware manipulating cpusets.
>
> We can't use that as an argument for the upstream kernel, but I
> would believe that it is a good choice for google.
>
I would have thought that providing userspace just enough hooks to do
what it needs to do, and not mandating higher-level constructs is
exactly the philosophy of the linux kernel. Hence, e.g. providing
efficient building blocks like sendfile and a threaded network stack,
faster threading with NPTL and a very limited static-file webserver
(TUX, even though it's not in the mainline) and leaving the complex
bits of webserving to userspace.
Things like deciding which containers should be using which nodes, and
directing the kernel appropriately, is the job of userspace, not
kernelspace, since there are lots of possible ways of making those
decisions.
Paul
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 7:57 ` Paul Menage
@ 2006-11-30 8:26 ` Nick Piggin
2006-11-30 8:39 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 8:26 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
Paul Menage wrote:
> On 11/29/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>
>> Yes, but when you migrate tasks between these containers, or when you
>> create/destroy them, then why can't you do the migration at that time?
>
>
> ?
>
> The migration that I'm envisaging is going to occur when either:
>
> - we're trying to move a job to a different real numa node because,
> say, a new job has started that needs the whole of a node to itself,
> and we need to clear space for it.
So migrate at this point.
> - we're trying to compact the memory usage of a job, when it has
> plenty of free space in each of its nodes, and we can fit all the
> memory into a smaller set of nodes.
Or reclaim at this point.
>> We can't use that as an argument for the upstream kernel, but I
>> would believe that it is a good choice for google.
>>
>
> I would have thought that providing userspace just enough hooks to do
> what it needs to do, and not mandating higher-level constructs is
> exactly the philosophy of the linux kernel. Hence, e.g. providing
Yes, but without exposing implementation to userspace, where possible.
The ultimate would be to devise an API which is usable by your patch,
as well as the other resource control mechanisms going around. If
userspace has to know that you've implemented memory control with
"fake nodes", then IMO something has gone wrong.
> efficient building blocks like sendfile and a threaded network stack,
> faster threading with NPTL and a very limited static-file webserver
> (TUX, even though it's not in the mainline) and leaving the complex
> bits of webserving to userspace.
I don't see the similarity with sendfile+TUX. I don't think putting an
explicit container / resource controller API in the kernel is even
anything like TUX in the kernel, let alone apache in kernel.
> Things like deciding which containers should be using which nodes, and
> directing the kernel appropriately, is the job of userspace, not
> kernelspace, since there are lots of possible ways of making those
> decisions.
I disagree.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 8:26 ` Nick Piggin
@ 2006-11-30 8:39 ` Paul Menage
2006-11-30 8:55 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 8:39 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm
On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > - we're trying to move a job to a different real numa node because,
> > say, a new job has started that needs the whole of a node to itself,
> > and we need to clear space for it.
>
> So migrate at this point.
That's what I want to do. But currently you can only do this on a
process-by-process basis and it doesn't affect file pages in the
pagecache that aren't mapped by anyone.
Being able to say "try to move all memory from this node to this other
set of nodes" seems like a generically useful thing even for other
uses (e.g. hot unplug, general HPC numa systems, etc).
>
> > - we're trying to compact the memory usage of a job, when it has
> > plenty of free space in each of its nodes, and we can fit all the
> > memory into a smaller set of nodes.
>
> Or reclaim at this point.
>
This would be happening after reclaim has successfully shrunk the
in-use memory in a bunch of nodes, and we want to consolidate to a
smaller set of nodes.
>
> The ultimate would be to devise an API which is usable by your patch,
> as well as the other resource control mechanisms going around. If
> userspace has to know that you've implemented memory control with
> "fake nodes", then IMO something has gone wrong.
I disagree. Memory control via fake numa (or even via real numa if you
have enough real nodes) is sufficiently fundamentally different from
memory control via, say, per-page owner pointers (due to granularity,
etc) that userspace really needs to know about it in order to make
sensible decisions.
It also has the nice property that the kernel already exposes most of
the mechanism required for this via the cpusets code.
Paul
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 8:39 ` Paul Menage
@ 2006-11-30 8:55 ` Nick Piggin
2006-11-30 9:06 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 8:55 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> >
>> > - we're trying to move a job to a different real numa node because,
>> > say, a new job has started that needs the whole of a node to itself,
>> > and we need to clear space for it.
>>
>> So migrate at this point.
>
>
> That's what I want to do. But currently you can only do this on a
> process-by-process basis and it doesn't affect file pages in the
> pagecache that aren't mapped by anyone.
>
> Being able to say "try to move all memory from this node to this other
> set of nodes" seems like a generically useful thing even for other
> uses (e.g. hot unplug, general HPC numa systems, etc).
AFAIK they do that in their higher level APIs (at least HPC numa does).
>> > - we're trying to compact the memory usage of a job, when it has
>> > plenty of free space in each of its nodes, and we can fit all the
>> > memory into a smaller set of nodes.
>>
>> Or reclaim at this point.
>>
>
> This would be happening after reclaim has successfully shrunk the
> in-use memory in a bunch of nodes, and we want to consolidate to a
> smaller set of nodes.
So your API could be some directive to consolidate? You could get
pretty accurate estimates with page statistics, as to whether it
can be done or not.
>> The ultimate would be to devise an API which is usable by your patch,
>> as well as the other resource control mechanisms going around. If
>> userspace has to know that you've implemented memory control with
>> "fake nodes", then IMO something has gone wrong.
>
>
> I disagree. Memory control via fake numa (or even via real numa if you
> have enough real nodes) is sufficiently fundamentally different from
> memory control via, say, per-page owner pointers (due to granularity,
> etc) that userspace really needs to know about it in order to make
> sensible decisions.
>
> It also has the nice property that the kernel already exposes most of
> the mechanism required for this via the cpusets code.
The cpusets code is definitely similar to what memory resource control
needs. I don't think that a resource control API needs to be tied to
such granular, hard limits as the fakenodes code provides though. But
maybe I'm wrong and it really would be acceptable for everyone.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 8:55 ` Nick Piggin
@ 2006-11-30 9:06 ` Paul Menage
2006-11-30 9:21 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 9:06 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm
On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > Being able to say "try to move all memory from this node to this other
> > set of nodes" seems like a generically useful thing even for other
> > uses (e.g. hot unplug, general HPC numa systems, etc).
>
> AFAIK they do that in their higher level APIs (at least HPC numa does).
Could you point me at an example?
> > This would be happening after reclaim has successfully shrunk the
> > in-use memory in a bunch of nodes, and we want to consolidate to a
> > smaller set of nodes.
>
> So your API could be some directive to consolidate? You could get
> pretty accurate estimates with page statistics, as to whether it
> can be done or not.
Yes, and exposing those statistics (already available in
/sys/devices/system/node/node*/meminfo) and the low-level mechanism for
migration are, to me, things that are appropriate for the kernel. I'm
not sure what a specific "consolidation API" would look like, beyond
the API that I'm already proposing (migrate memory from node X to
nodes A,B,C)
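As an illustration of how userspace might use those per-node statistics, here is a minimal sketch in C. It assumes the per-node meminfo text uses lines of the form "Node <n> <key>: <value> kB"; the helper names node_meminfo_kb() and can_consolidate() are hypothetical and are not part of the patch under discussion.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find "Node <n> <key>: <value> kB" in per-node
 * meminfo text and return the value in kB, or -1 if the key is absent. */
long node_meminfo_kb(const char *text, const char *key)
{
    size_t keylen = strlen(key);
    const char *line = text;

    while (line && *line) {
        int node;
        int off = 0;

        /* Skip the "Node <n> " prefix; %n records the characters consumed. */
        if (sscanf(line, "Node %d %n", &node, &off) == 1 && off > 0 &&
            strncmp(line + off, key, keylen) == 0 &&
            line[off + keylen] == ':') {
            long kb;

            if (sscanf(line + off + keylen + 1, "%ld", &kb) == 1)
                return kb;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Hypothetical check: does the memory in use on the source node fit into
 * the combined free memory of the destination nodes? */
int can_consolidate(long src_used_kb, const long *dst_free_kb, int ndst)
{
    long total_free_kb = 0;

    for (int i = 0; i < ndst; i++)
        total_free_kb += dst_free_kb[i];
    return src_used_kb <= total_free_kb;
}
```

A userspace manager could read each node's meminfo file, pass the MemUsed of node X and the MemFree of nodes A, B and C to can_consolidate(), and only then ask the kernel to migrate. The estimate is only approximate, since the counters can change between the read and the migration.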
> The cpusets code is definitely similar to what memory resource control
> needs. I don't think that a resource control API needs to be tied to
> such granular, hard limits as the fakenodes code provides though. But
> maybe I'm wrong and it really would be acceptable for everyone.
Ah. This isn't intended to be specifically a "resource control API".
It's more intended to be an API that could be useful for certain kinds
of resource control, but could also be generically useful.
Paul
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 9:06 ` Paul Menage
@ 2006-11-30 9:21 ` Nick Piggin
2006-11-30 9:45 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 9:21 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> >
>> > Being able to say "try to move all memory from this node to this other
>> > set of nodes" seems like a generically useful thing even for other
>> > uses (e.g. hot unplug, general HPC numa systems, etc).
>>
>> AFAIK they do that in their higher level APIs (at least HPC numa does).
>
>
> Could you point me at an example?
kernel/cpuset.c:cpuset_migrate_mm
>> So your API could be some directive to consolidate? You could get
>> pretty accurate estimates with page statistics, as to whether it
>> can be done or not.
>
>
> Yes, and exposing those statistics (already available in
> /sys/devices/system/node/node*/meminfo) and the low-level mechanism for
> migration are, to me, things that are appropriate for the kernel. I'm
> not sure what a specific "consolidation API" would look like, beyond
> the API that I'm already proposing (migrate memory from node X to
> nodes A,B,C)
How about "try to change the memory reservation charge of this
'container' from xMB to yMB"? Underneath that API, your fakenode
controller would do the node reclaim and consolidation stuff --
but it could be implemented completely differently in the case of
a different type of controller.
>> The cpusets code is definitely similar to what memory resource control
>> needs. I don't think that a resource control API needs to be tied to
>> such granular, hard limits as the fakenodes code provides though. But
>> maybe I'm wrong and it really would be acceptable for everyone.
>
>
> Ah. This isn't intended to be specifically a "resource control API".
> It's more intended to be an API that could be useful for certain kinds
> of resource control, but could also be generically useful.
If it is exporting any kind of implementation details, then it needs
to be justified with a specific user that can't be implemented in a
better way, IMO.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 9:21 ` Nick Piggin
@ 2006-11-30 9:45 ` Paul Menage
2006-11-30 10:15 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 9:45 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm
On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >> AFAIK they do that in their higher level APIs (at least HPC numa does).
> >
> >
> > Could you point me at an example?
>
> kernel/cpuset.c:cpuset_migrate_mm
No, that doesn't really do what we want. It basically just calls
do_migrate_pages, which has the drawbacks of:
- it has no way to try to migrate memory from one source node to
multiple destination nodes.
- it doesn't (as far as I can tell) migrate unmapped file pages in the
page cache.
- it scans every page table entry of every mm in the process. If your
nodes are relatively small compared to your processes, this is likely
to be much more heavyweight than just trying to migrate each page in a
node. (I realise that there are some unsolved implementation issues
with migrating pages whilst not holding an mmap_sem of an mm that's
mapping them; that's something that we would need to solve)
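To make the first drawback concrete, the following C sketch shows one simple way a one-source-to-many-destinations migration could be planned. It is a toy model under stated assumptions: plan_migration() is a hypothetical name, page counts are plain integers, and the fill-in-order policy is just one possible choice, not what the patch does.

```c
#include <stddef.h>

/* Hypothetical planner: spread src_pages over the destination nodes in the
 * order given, taking at most dst_free[i] pages on node i. The per-node
 * amounts are written to dst_take[]; the return value is the number of
 * pages that could not be placed anywhere. */
unsigned long plan_migration(unsigned long src_pages,
                             const unsigned long *dst_free,
                             unsigned long *dst_take, size_t ndst)
{
    for (size_t i = 0; i < ndst; i++) {
        unsigned long take =
            src_pages < dst_free[i] ? src_pages : dst_free[i];

        dst_take[i] = take;
        src_pages -= take;
    }
    return src_pages;
}
```

A nonzero return value means the destination set is too small, which is acceptable here because the goal is best-effort migration rather than a guaranteed hot-unplug.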
>
> How about "try to change the memory reservation charge of this
> 'container' from xMB to yMB"? Underneath that API, your fakenode
> controller would do the node reclaim and consolidation stuff --
> but it could be implemented completely differently in the case of
> a different type of controller.
How would it make decisions such as which node to free up (e.g.
userspace might have a strong preference for keeping a job on one
particular real node, or moving it to a different one.) I think that
policy decisions like this belong in userspace, in the same way that
the existing cpusets API provides a way to say "this cpuset uses these
nodes" rather than "this cpuset should have N nodes".
If the API was expressive enough to say "try to shrink this cpuset by
X MB, with amount Y of effort, trying to evict nodes in the priority
order A,B,C" that might be a good start.
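A request of that shape could be modelled roughly as below. This is only a sketch: pick_nodes_to_evict() and its parameters are hypothetical, node sizes are assumed to be known in MB, and the "effort" argument is left out to keep the example small.

```c
#include <stddef.h>

/* Hypothetical selection step: walk the nodes in the caller's priority
 * order and pick nodes for eviction until their combined size reaches
 * shrink_mb. Returns how many entries from the front of order[] were
 * picked; node_size_mb[] is indexed by node id. */
size_t pick_nodes_to_evict(unsigned long shrink_mb,
                           const int *order, size_t norder,
                           const unsigned long *node_size_mb)
{
    unsigned long freed_mb = 0;
    size_t picked = 0;

    while (picked < norder && freed_mb < shrink_mb) {
        freed_mb += node_size_mb[order[picked]];
        picked++;
    }
    return picked;
}
```

The point of passing the order in is that the policy (which node to give up first) stays with the caller, while the mechanism only follows it.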
>
> >> The cpusets code is definitely similar to what memory resource control
> >> needs. I don't think that a resource control API needs to be tied to
> >> such granular, hard limits as the fakenodes code provides though. But
> >> maybe I'm wrong and it really would be acceptable for everyone.
> >
> >
> > Ah. This isn't intended to be specifically a "resource control API".
> > It's more intended to be an API that could be useful for certain kinds
> > of resource control, but could also be generically useful.
>
> If it is exporting any kind of implementation details, then it needs
> to be justified with a specific user that can't be implemented in a
> better way, IMO.
It's not really exporting any more implementation details than the
existing cpusets API (i.e. explicitly binding a job to a set of nodes
chosen by userspace). The only true exposed implementation detail is
the "priority" value from try_to_free_pages, and that could be
abstracted away as a value in some range 0-N where 0 means "try very
hard" and N means "hardly try at all", and it wouldn't have to be
directly linked to the try_to_free_pages() priority.
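One way to picture that abstraction is a small mapping function, sketched here in C. It assumes the kernel's default reclaim priority is 12 (modelled locally as MODEL_DEF_PRIORITY) and that a lower priority value means harder scanning; effort_to_priority() is a hypothetical name, not an existing kernel function.

```c
#include <assert.h>

/* Assumption: stands in for the kernel's DEF_PRIORITY, taken to be 12. */
#define MODEL_DEF_PRIORITY 12

/* Hypothetical mapping from an abstract effort level in [0, max_effort]
 * (0 = try very hard, max_effort = hardly try at all) onto a reclaim
 * priority in [0, MODEL_DEF_PRIORITY]. Out-of-range values are clamped. */
int effort_to_priority(int effort, int max_effort)
{
    assert(max_effort > 0);

    if (effort < 0)
        effort = 0;
    if (effort > max_effort)
        effort = max_effort;

    /* Scale linearly, rounding to the nearest priority level. */
    return (effort * MODEL_DEF_PRIORITY + max_effort / 2) / max_effort;
}
```

With such a mapping the userspace-visible range can stay fixed even if the internal priority scale changes later.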
Paul
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 9:45 ` Paul Menage
@ 2006-11-30 10:15 ` Nick Piggin
2006-11-30 10:40 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 10:15 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> >> AFAIK they do that in their higher level APIs (at least HPC numa does).
>> >
>> >
>> > Could you point me at an example?
>>
>> kernel/cpuset.c:cpuset_migrate_mm
>
>
> No, that doesn't really do what we want. It basically just calls
> do_migrate_pages, which has the drawbacks of:
I know it doesn't do what you want. It is an example of using page
migration under a higher level API, which I thought is what you
wanted to see.
>> How about "try to change the memory reservation charge of this
>> 'container' from xMB to yMB"? Underneath that API, your fakenode
>> controller would do the node reclaim and consolidation stuff --
>> but it could be implemented completely differently in the case of
>> a different type of controller.
>
>
> How would it make decisions such as which node to free up (e.g.
> userspace might have a strong preference for keeping a job on one
> particular real node, or moving it to a different one.) I think that
> policy decisions like this belong in userspace, in the same way that
> the existing cpusets API provides a way to say "this cpuset uses these
> nodes" rather than "this cpuset should have N nodes".
Now you're talking about physical nodes as well, which is definitely
a problem you get when mixing the two.
But there is no reason why you shouldn't be able to specify physical
nodes, while also altering the reservation. Even if that does mean
hiding the fake nodes from the cpuset interface.
>> If it is exporting any kind of implementation details, then it needs
>> to be justified with a specific user that can't be implemented in a
>> better way, IMO.
>
>
> It's not really exporting any more implementation details than the
> existing cpusets API (i.e. explicitly binding a job to a set of nodes
> chosen by userspace). The only true exposed implementation detail is
> the "priority" value from try_to_free_pages, and that could be
> abstracted away as a value in some range 0-N where 0 means "try very
> hard" and N means "hardly try at all", and it wouldn't have to be
> directly linked to the try_to_free_pages() priority.
Or the fact that memory reservation is implemented with nodes. I'm
still not convinced that idea is the best way to export memory
control to userspace, regardless of whether it is quick and easy to
develop (or even deploy, at google).
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 10:15 ` Nick Piggin
@ 2006-11-30 10:40 ` Paul Menage
2006-11-30 11:04 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 10:40 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm
On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> I know it doesn't do what you want. It is an example of using page
> migration under a higher level API, which I thought is what you
> wanted to see.
I'd been talking about the possibility of doing "try to move all
memory from this node to this other set of nodes"; that wasn't an
example of such an API.
>
> Now you're talking about physical nodes as well, which is definitely
> a problem you get when mixing the two.
>
> But there is no reason why you shouldn't be able to specify physical
> nodes, while also altering the reservation. Even if that does mean
> hiding the fake nodes from the cpuset interface.
I think it should be possible to expose the real numa topology via the
fake topology (e.g. all fake nodes on the same real node appear to be
fairly close together, compared to any fake nodes on a different real
node). So I don't think it's necessary to have a separate abstraction
for fake vs physical nodes.
>
> >> If it is exporting any kind of implementation details, then it needs
> >> to be justified with a specific user that can't be implemented in a
> >> better way, IMO.
> >
> >
> > It's not really exporting any more implementation details than the
> > existing cpusets API (i.e. explicitly binding a job to a set of nodes
> > chosen by userspace). The only true exposed implementation detail is
> > the "priority" value from try_to_free_pages, and that could be
> > abstracted away as a value in some range 0-N where 0 means "try very
> > hard" and N means "hardly try at all", and it wouldn't have to be
> > directly linked to the try_to_free_pages() priority.
>
> Or the fact that memory reservation is implemented with nodes.
Right, but to me that's a pretty fundamental design decision, rather
than an implementation detail.
> I'm
> still not convinced that idea is the best way to export memory
> control to userspace, regardless of whether it is quick and easy to
> develop (or even deploy, at google).
Maybe not the best way for all memory control, but it has certain big
advantages, such as leveraging the existing numa support, and not
requiring additional per-page overhead or LRU complexity.
Paul
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 4:17 ` Christoph Lameter
@ 2006-11-30 10:45 ` Paul Menage
2006-11-30 11:12 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 10:45 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On 11/29/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> You do not have a problem as long as you hold a mmap_sem lock on any of
> the vmas in which the page appears. Kame and I discussed several
> approaches on how to avoid the issue in the past but so far there was no
> need to resolve the issue.
>
It sounds like this would be useful for memory hot-unplug too, though.
A problem worth solving?
Why isn't page_lock_anon_vma() safe to use in this case? Because after
we've established migration ptes, page_mapped() will be false and so
page_lock_anon_vma() will return NULL?
How does kswapd do this safely?
Possible approach (apologies if you've already considered and rejected this):
- add a migration_count field to anon_vma
- use page_lock_anon_vma() to get the anon_vma for a page, assuming
it's mapped; if it's unmapped or if the anon_vma that we get has no
linked vmas then we ignore it - the chances are that the page is in
the process of being freed anyway, and if someone happens to remap it
just before it's freed then we can catch it next time around.
- isolate_lru_page() can bump this for every page that it isolates
- unlink_anon_vma() won't free the anon_vma if its migration_count is >0.
- remove_anon_migration_ptes() can free the anon_vma if
migration_count is 0 and the vma list is empty.
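The lifetime rule in the list above can be modelled in userspace as a small state machine. The C sketch below is a toy model only: struct anon_vma_model and the model_* functions are hypothetical stand-ins for the kernel structures and functions named in the list, and all locking is omitted.

```c
#include <stdbool.h>

/* Toy stand-in for an anon_vma carrying the proposed migration_count. */
struct anon_vma_model {
    int nr_vmas;          /* vmas still linked to this anon_vma */
    int migration_count;  /* isolated pages that still reference it */
    bool freed;           /* set once the object would be freed */
};

/* Free only when no vma is linked and no migration is in flight. */
static void model_maybe_free(struct anon_vma_model *av)
{
    if (av->nr_vmas == 0 && av->migration_count == 0)
        av->freed = true;
}

/* Models the isolate step: skip pages whose anon_vma has no linked vmas,
 * otherwise pin the anon_vma by bumping migration_count. */
bool model_isolate_page(struct anon_vma_model *av)
{
    if (av->freed || av->nr_vmas == 0)
        return false;
    av->migration_count++;
    return true;
}

/* Models unlink_anon_vma(): dropping the last vma does not free the
 * anon_vma while migration_count is still above zero. */
void model_unlink_vma(struct anon_vma_model *av)
{
    if (av->nr_vmas > 0)
        av->nr_vmas--;
    model_maybe_free(av);
}

/* Models remove_anon_migration_ptes(): drop the pin, then free if this
 * was the last reference of either kind. */
void model_remove_migration_ptes(struct anon_vma_model *av)
{
    if (av->migration_count > 0)
        av->migration_count--;
    model_maybe_free(av);
}
```

The case the proposal targets is the middle one: the last vma goes away while a page is isolated, and the anon_vma survives until the migration ptes are removed.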
Paul
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 10:40 ` Paul Menage
@ 2006-11-30 11:04 ` Nick Piggin
2006-11-30 11:23 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 11:04 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>
>> I know it doesn't do what you want. It is an example of using page
>> migration under a higher level API, which I thought is what you
>> wanted to see.
>
>
> I'd been talking about the possibility of doing "try to move all
> memory from this node to this other set of nodes"; that wasn't an
> example of such an API.
Oh, well I was talking about using higher level API rather than
migrate directly!
>> Now you're talking about physical nodes as well, which is definitely
>> a problem you get when mixing the two.
>>
>> But there is no reason why you shouldn't be able to specify physical
>> nodes, while also altering the reservation. Even if that does mean
>> hiding the fake nodes from the cpuset interface.
>
>
> I think it should be possible to expose the real numa topology via the
> fake topology (e.g. all fake nodes on the same real node appear to be
> fairly close together, compared to any fake nodes on a different real
> node). So I don't think it's necessary to have a separate abstraction
> for fake vs physical nodes.
Well if you want to do (real) node affinity then you need some
separation of course.
But I'm not sure that there is a good reason to use the same
abstraction. Maybe there is, but I think it needs more discussion
(unless I missed something in the past couple of weeks where you
managed to get all memory resource controller groups to agree with
your fakenodes approach).
>> >> If it is exporting any kind of implementation details, then it needs
>> >> to be justified with a specific user that can't be implemented in a
>> >> better way, IMO.
>> >
>> >
>> > It's not really exporting any more implementation details than the
>> > existing cpusets API (i.e. explicitly binding a job to a set of nodes
>> > chosen by userspace). The only true exposed implementation detail is
>> > the "priority" value from try_to_free_pages, and that could be
>> > abstracted away as a value in some range 0-N where 0 means "try very
>> > hard" and N means "hardly try at all", and it wouldn't have to be
>> > directly linked to the try_to_free_pages() priority.
>>
>> Or the fact that memory reservation is implemented with nodes.
>
>
> Right, but to me that's a pretty fundamental design decision, rather
> than an implementation detail.
It is a design of the implementation.
The policy is to be able to reserve memory for specific groups of tasks.
And the best API is one where userspace specifies policy. Now there
might be a few tweaks or lower level hints or calls needed to make the
implementation work really optimally. But those should be added later,
and when they are found to be required (and not just maybe useful).
So I see nothing wrong with your exposing these things to userspace if
the goal is to test implementation or get a prototype working quickly.
But if you're talking about the upstream kernel, then I think you need
to start at a much higher level.
>> I'm still not convinced that idea is the best way to export memory
>> control to userspace, regardless of whether it is quick and easy to
>> develop (or even deploy, at google).
>
>
> Maybe not the best way for all memory control, but it has certain big
> advantages, such as leveraging the existing numa support, and not
> requiring additional per-page overhead or LRU complexity.
Oh I agree. And I think it is one of the better implementations I have
seen. But I don't like the API.
--
SUSE Labs, Novell Inc.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 10:45 ` Paul Menage
@ 2006-11-30 11:12 ` KAMEZAWA Hiroyuki
2006-11-30 11:25 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 11:12 UTC (permalink / raw)
To: Paul Menage; +Cc: clameter, linux-mm, akpm
On Thu, 30 Nov 2006 02:45:51 -0800
"Paul Menage" <menage@google.com> wrote:
> On 11/29/06, Christoph Lameter <clameter@sgi.com> wrote:
> >
> > You do not have a problem as long as you hold a mmap_sem lock on any of
> > the vmas in which the page appears. Kame and I discussed several
> > approaches on how to avoid the issue in the past but so far there was no
> > need to resolve the issue.
> >
>
> It sounds like this would be useful for memory hot-unplug too, though.
> A problem worth solving?
>
It remains unsolved only because there has been no user.
If you fix it, I would welcome that.
> Why isn't page_lock_anon_vma() safe to use in this case? Because after
> we've established migration ptes, page_mapped() will be false and so
> page_lock_anon_vma() will return NULL?
page_lock_anon_vma() will return NULL because mapcount is 0.
We have to guarantee that we can trust the anon_vma (from page->mapping) even if
page->mapcount is 0. There may be several ways to do that.
> How does kswapd do this safely?
>
kswapd doesn't touch page->mapping after page_mapcount() goes down to 0.
> Possible approach (apologies if you've already considered and rejected this):
>
As you pointed out, there will be several approaches.
I think one of the biggest concerns will be the performance impact. And since this
will touch the objrmap core, it would be good to start the discussion with a patch.
-Kame
>
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 11:04 ` Nick Piggin
@ 2006-11-30 11:23 ` Paul Menage
2006-11-30 11:35 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 11:23 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm
On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> But I'm not sure that there is a good reason to use the same
> abstraction. Maybe there is, but I think it needs more discussion
> (unless I missed something in the past couple of weeks where you
> managed to get all memory resource controller groups to agree with
> your fakenodes approach).
No, not at all - but we've observed that:
a) people have been proposing interesting memory controller approaches
for a long time, and haven't made a great deal of progress so far, so
there's no indication that something is going to be agreed upon in the
near future
b) the cpusets and fake numa code provide a fairly serviceable
coarse-grained memory controller, modulo a few missing features such
as per-node reclaim/migration and auto-expansion (see my patch
proposal hopefully tomorrow).
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 11:12 ` KAMEZAWA Hiroyuki
@ 2006-11-30 11:25 ` Paul Menage
2006-11-30 12:18 ` KAMEZAWA Hiroyuki
2006-11-30 18:28 ` Christoph Lameter
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 11:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: clameter, linux-mm, akpm
On 11/30/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > How does kswapd do this safely?
> >
> kswapd doesn't touch page->mapping after page_mapcount() goes down to 0.
OK, so we could do the same, and just assume that pages with a
page_mapcount() of 0 are either about to be freed or can be picked up
on a later migration sweep. Is it common for a page to have a 0
page_mapcount() for a long period of time without being freed or
remapped?
>
> I think one of the biggest concerns will be the performance impact. And since this
> will touch the objrmap core, it would be good to start the discussion with a patch.
>
I'll have a go. My initial thought is that the only performance impact
on the rmap core would be that anon_vma_unlink() would need one extra
check when determining whether to free an anon_vma.
Paul
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 11:23 ` Paul Menage
@ 2006-11-30 11:35 ` Nick Piggin
0 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 11:35 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm
Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> But I'm not sure that there is a good reason to use the same
>> abstraction. Maybe there is, but I think it needs more discussion
>> (unless I missed something in the past couple of weeks where you
>> managed to get all memory resource controller groups to agree with
>> your fakenodes approach).
>
>
> No, not at all - but we've observed that:
I agree with your points and I'll add a couple more.
> a) people have been proposing interesting memory controller approaches
> for a long time, and haven't made a great deal of progress so far, so
> there's no indication that something is going to be agreed upon in the
> near future
a2) and it hasn't been because they've been getting their APIs wrong
> b) the cpusets and fake numa code provide a fairly serviceable
> coarse-grained memory controller, modulo a few missing features such
> as per-node reclaim/migration and auto-expansion (see my patch
> proposal hopefully tomorrow).
b2) and it doesn't mean that it can't be used with a decent API. Or
at least, you haven't yet shown that it can't.
--
SUSE Labs, Novell Inc.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 11:25 ` Paul Menage
@ 2006-11-30 12:18 ` KAMEZAWA Hiroyuki
2006-11-30 18:28 ` Christoph Lameter
1 sibling, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 12:18 UTC (permalink / raw)
To: Paul Menage; +Cc: clameter, linux-mm, akpm
On Thu, 30 Nov 2006 03:25:21 -0800
"Paul Menage" <menage@google.com> wrote:
> On 11/30/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > How does kswapd do this safely?
> > >
> > kswapd doesn't touch page->mapping after page_mapcount() goes down to 0.
>
> OK, so we could do the same, and just assume that pages with a
> page_mapcount() of 0 are either about to be freed or can be picked up
> on a later migration sweep. Is it common for a page to have a 0
> page_mapcount() for a long period of time without being freed or
> remapped?
>
see shrink_page_list().
unmap -> (write to swap) -> freed. It depends on how long write-back takes.
-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 11:25 ` Paul Menage
2006-11-30 12:18 ` KAMEZAWA Hiroyuki
@ 2006-11-30 18:28 ` Christoph Lameter
2006-11-30 18:35 ` Paul Menage
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 18:28 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On Thu, 30 Nov 2006, Paul Menage wrote:
> OK, so we could do the same, and just assume that pages with a
> page_mapcount() of 0 are either about to be freed or can be picked up
> on a later migration sweep. Is it common for a page to have a 0
> page_mapcount() for a long period of time without being freed or
> remapped?
page mapcount goes to zero during migration because the references to the
page are removed.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 18:28 ` Christoph Lameter
@ 2006-11-30 18:35 ` Paul Menage
2006-11-30 18:39 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 18:35 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 30 Nov 2006, Paul Menage wrote:
>
> > OK, so we could do the same, and just assume that pages with a
> > page_mapcount() of 0 are either about to be freed or can be picked up
> > on a later migration sweep. Is it common for a page to have a 0
> > page_mapcount() for a long period of time without being freed or
> > remapped?
>
> page mapcount goes to zero during migration because the references to the
> page are removed.
>
Yes, but I meant for reasons other than migration.
It sounds as though if we come across a page with page_mapcount() = 0
while gathering pages for migration, it's probably in the process of
being swapped out and so is best not to muck around with anyway?
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 18:35 ` Paul Menage
@ 2006-11-30 18:39 ` Christoph Lameter
2006-11-30 19:09 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 18:39 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On Thu, 30 Nov 2006, Paul Menage wrote:
> It sounds as though if we come across a page with page_mapcount() = 0
> while gathering pages for migration, it's probably in the process of
> being swapped out and so is best not to muck around with anyway?
F.e. A page cache page may have mapcount == 0. Mapcount 0 only means that
the page is not mapped into any process's memory via a page table. It may
be used for purposes that do not require mapping into a process's memory.
If the reference count is zero (page freed) then page migration will
discard the page and consider the migration a success.
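
To make the distinction concrete, here is a small userspace sketch (not
kernel code). The names toy_page, toy_action and toy_classify are invented
for illustration, and the two counters merely stand in for the page
reference count and page_mapcount(). It only shows that the two counters
answer different questions.

```c
#include <assert.h>

/* Toy stand-in for the two counters discussed above; not struct page. */
struct toy_page {
    int refcount; /* references held on the page; 0 means already freed */
    int mapcount; /* number of page-table mappings of the page */
};

enum toy_action {
    TOY_DISCARD,          /* page was freed: nothing to move, count as success */
    TOY_MIGRATE_UNMAPPED, /* in use but not mapped, e.g. a page cache page */
    TOY_MIGRATE_MAPPED    /* mapped into at least one process */
};

/* Decide what a migration pass would do with the page in this toy model. */
static enum toy_action toy_classify(const struct toy_page *p)
{
    assert(p->refcount >= 0 && p->mapcount >= 0);
    if (p->refcount == 0)
        return TOY_DISCARD;
    if (p->mapcount == 0)
        return TOY_MIGRATE_UNMAPPED;
    return TOY_MIGRATE_MAPPED;
}
```

The point of the middle case is that mapcount == 0 alone says nothing
about whether the page is on its way out; only the reference count does.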
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 18:39 ` Christoph Lameter
@ 2006-11-30 19:09 ` Paul Menage
2006-11-30 19:42 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 19:09 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> F.e. A page cache page may have mapcount == 0.
OK, I was thinking just about anon pages.
For pagecache pages, it's safe to access the mapping as long as we've
locked the page, even if mapcount is 0? So we don't have the same
races?
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 19:09 ` Paul Menage
@ 2006-11-30 19:42 ` Christoph Lameter
2006-11-30 19:53 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 19:42 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On Thu, 30 Nov 2006, Paul Menage wrote:
> On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> >
> > F.e. A page cache page may have mapcount == 0.
>
> OK, I was thinking just about anon pages.
>
> For pagecache pages, it's safe to access the mapping as long as we've
> locked the page, even if mapcount is 0? So we don't have the same
> races?
We have no problem with the page lock (you actually may not need any
locking since there are no references remaining to the page). The trouble
is that the vma may have vanished when we try to reestablish the pte.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 19:42 ` Christoph Lameter
@ 2006-11-30 19:53 ` Paul Menage
2006-11-30 20:00 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 19:53 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> We have no problem with the page lock (you actually may not need any
> locking since there are no references remaining to the page). The trouble
> is that the vma may have vanished when we try to reestablish the pte.
>
Why is that a problem? If the vma has gone away, then there's no need
to reestablish the pte. And remove_file_migration_ptes() appears to be
adequately protected against races with unlink_file_vma() since they
both take i_mmap_lock.
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 19:53 ` Paul Menage
@ 2006-11-30 20:00 ` Christoph Lameter
2006-11-30 20:07 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 20:00 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On Thu, 30 Nov 2006, Paul Menage wrote:
> On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> >
> > We have no problem with the page lock (you actually may not need any
> > locking since there are no references remaining to the page). The trouble
> > is that the vma may have vanished when we try to reestablish the pte.
> >
>
> Why is that a problem? If the vma has gone away, then there's no need
> to reestablish the pte. And remove_file_migration_ptes() appears to be
> adequately protected against races with unlink_file_vma() since they
> both take i_mmap_lock.
We are talking about anonymous pages here. You cannot figure out
that the vma is gone since that was the only connection to the process.
Hmm... Not true, we still have a migration pte in that process's space. But
we cannot find the process without the anon_vma.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 20:00 ` Christoph Lameter
@ 2006-11-30 20:07 ` Paul Menage
2006-11-30 20:15 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 20:07 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm
On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> >
> > Why is that a problem? If the vma has gone away, then there's no need
> > to reestablish the pte. And remove_file_migration_ptes() appears to be
> > adequately protected against races with unlink_file_vma() since they
> > both take i_mmap_lock.
>
> We are talking about anonymous pages here.
No, I was talking about pagecache pages by this point - you'd
mentioned them as the case where page_mapcount() can be 0 for a long
period of time.
> You cannot figure out
> that the vma is gone since that was the only connection to the process.
> Hmm... Not true, we still have a migration pte in that process's space. But
> we cannot find the process without the anon_vma.
What did you think of the approach that I proposed of adding a
migration count to anon_vma? anon_vma_unlink() doesn't free the
anon_vma if migration count is non-zero.
When gathering pages for migration, we use page_lock_anon_vma() to get
the anon_vma; if it returns NULL or has an empty vma list we skip the
page, else we bump migration count (and mapcount?) by 1 and unlock.
That will guarantee that the anon_vma sticks around until the end of
the migration.
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 20:07 ` Paul Menage
@ 2006-11-30 20:15 ` Christoph Lameter
2006-11-30 21:33 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 20:15 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm, Hugh Dickins
On Thu, 30 Nov 2006, Paul Menage wrote:
> > We are talking about anonymous pages here.
>
> No, I was talking about pagecache pages by this point - you'd
> mentioned them as the case where page_mapcount() can be 0 for a long
> period of time.
Right but pagecache pages are mapped differently by a mapping attached to
the inode. The vma does not vanish. We have to distinguish clearly between
anonymous and file based pages.
> > You cannot figure out
> > that the vma is gone since that was the only connection to the process.
> > Hmm... Not true, we still have a migration pte in that process's space. But
> > we cannot find the process without the anon_vma.
>
> What did you think of the approach that I proposed of adding a
> migration count to anon_vma? anon_vma_unlink() doesn't free the
> anon_vma if migration count is non-zero.
Hmmm.. Well talk to Hugh Dickins about that. anon_vmas are very
performance sensitive things.
> When gathering pages for migration, we use page_lock_anon_vma() to get
> the anon_vma; if it returns NULL or has an empty vma list we skip the
> page, else we bump migration count (and mapcount?) by 1 and unlock.
> That will guarantee that the anon_vma sticks around until the end of
> the migration.
You cannot use page_lock_anon_vma since the mapcount of the page is
zero. Something must be done before we reduce the mapcount to zero to
pin the vma.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 20:15 ` Christoph Lameter
@ 2006-11-30 21:33 ` Paul Menage
2006-11-30 23:41 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 21:33 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm, Hugh Dickins
On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> Hmmm.. Well talk to Hugh Dickins about that. anon_vmas are very
> performance sensitive things.
>
> > When gathering pages for migration, we use page_lock_anon_vma() to get
> > the anon_vma; if it returns NULL or has an empty vma list we skip the
> > page, else we bump migration count (and mapcount?) by 1 and unlock.
> > That will guarantee that the anon_vma sticks around until the end of
> > the migration.
>
> You cannot use page_lock_anon_vma since the mapcount of the page is
> zero.
Let me clarify my proposal:
1) When gathering pages we find an anon page
2) We call page_lock_anon_vma(); if it returns NULL we ignore the page
3) If the anon_vma has an empty vma list, we ignore the page
4) We increment page_mapcount(); if this crosses the boundary from
unmapped to mapped, we know that we're racing with someone else;
either ignore the page or start again
5) If page->mapping no longer refers to our anon_vma, we know we're
racing; drop page_mapcount and ignore the page or start again
6) We increment anon_vma->migration_count to pin the anon_vma
At this point we know that the vma isn't going to go away since it's
pinned via the migration count, and any new users of the page will use
the pinned anon_vma since page_mapcount() is positive.
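
The steps above can be sketched as a single-threaded userspace model. This
is only an illustration of the control flow, not kernel code: the toy_*
types and functions are invented here, nr_vmas stands in for the anon_vma's
vma list, and no real locking or atomics are modelled, so the race checks
in steps 4 and 5 cannot actually fire in this model.

```c
#include <assert.h>
#include <stddef.h>

struct toy_anon_vma {
    int nr_vmas;         /* stands in for the length of the vma list */
    int migration_count; /* hypothetical pin counter from the proposal */
};

struct toy_page {
    struct toy_anon_vma *mapping; /* stands in for page->mapping */
    int mapcount;                 /* stands in for page_mapcount() */
};

/* Like step 2 assumes: only a mapped page yields its anon_vma. */
static struct toy_anon_vma *toy_lock_anon_vma(struct toy_page *page)
{
    return page->mapcount > 0 ? page->mapping : NULL;
}

/* Returns 1 if the page was pinned for migration, 0 if it is skipped. */
static int toy_pin_for_migration(struct toy_page *page)
{
    struct toy_anon_vma *av = toy_lock_anon_vma(page); /* step 2 */

    if (av == NULL)
        return 0;
    if (av->nr_vmas == 0)                              /* step 3 */
        return 0;
    if (++page->mapcount == 1) {                       /* step 4: 0 -> 1 means a race */
        page->mapcount--;
        return 0;
    }
    if (page->mapping != av) {                         /* step 5: mapping changed */
        page->mapcount--;
        return 0;
    }
    av->migration_count++;                             /* step 6: pin the anon_vma */
    return 1;
}
```

A successful pin leaves both the page's mapcount and the anon_vma's
migration_count raised by one; whoever finishes the migration would have to
drop both again.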
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 21:33 ` Paul Menage
@ 2006-11-30 23:41 ` Christoph Lameter
2006-11-30 23:48 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 23:41 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm
I think your initial suggestion of adding a counter to the anon_vma may
work. Here is a patch that may allow us to keep the anon_vma around
without holding mmap_sem. Seems to be simple.
Hugh?
Index: linux-2.6.19-rc6-mm2/include/linux/rmap.h
===================================================================
--- linux-2.6.19-rc6-mm2.orig/include/linux/rmap.h 2006-11-15 22:03:40.000000000 -0600
+++ linux-2.6.19-rc6-mm2/include/linux/rmap.h 2006-11-30 17:39:17.643728656 -0600
@@ -26,6 +26,7 @@
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
+ int migration_count; /* # processes migrating pages */
};
#ifdef CONFIG_MMU
Index: linux-2.6.19-rc6-mm2/mm/migrate.c
===================================================================
--- linux-2.6.19-rc6-mm2.orig/mm/migrate.c 2006-11-29 18:37:17.797934398 -0600
+++ linux-2.6.19-rc6-mm2/mm/migrate.c 2006-11-30 17:39:48.429639786 -0600
@@ -218,6 +218,7 @@ static void remove_anon_migration_ptes(s
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
unsigned long mapping;
+ int empty;
mapping = (unsigned long)new->mapping;
@@ -229,11 +230,15 @@ static void remove_anon_migration_ptes(s
*/
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
spin_lock(&anon_vma->lock);
+ anon_vma->migration_count--;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
+ empty = list_empty(&anon_vma->head);
spin_unlock(&anon_vma->lock);
+ if (empty)
+ anon_vma_free(anon_vma);
}
/*
Index: linux-2.6.19-rc6-mm2/mm/rmap.c
===================================================================
--- linux-2.6.19-rc6-mm2.orig/mm/rmap.c 2006-11-15 22:03:40.000000000 -0600
+++ linux-2.6.19-rc6-mm2/mm/rmap.c 2006-11-30 17:39:17.795109159 -0600
@@ -151,7 +151,7 @@ void anon_vma_unlink(struct vm_area_stru
list_del(&vma->anon_vma_node);
/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
+ empty = list_empty(&anon_vma->head) && !anon_vma->migration_count;
spin_unlock(&anon_vma->lock);
if (empty)
@@ -787,6 +787,9 @@ static int try_to_unmap_anon(struct page
if (!anon_vma)
return ret;
+ if (migration)
+ anon_vma->migration_count++;
+
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
ret = try_to_unmap_one(page, vma, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 23:41 ` Christoph Lameter
@ 2006-11-30 23:48 ` Paul Menage
2006-12-01 2:23 ` Christoph Lameter
2006-12-01 2:44 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 23:48 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm
On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> I think your initial suggestion of adding a counter to the anon_vma may
> work. Here is a patch that may allow us to keep the anon_vma around
> without holding mmap_sem. Seems to be simple.
Don't we need to bump the mapcount? If we don't, then the page gets
unmapped by the migration prep, and if we race with anyone trying to
map it they may allocate a new anon_vma and replace it.
> --- linux-2.6.19-rc6-mm2.orig/mm/migrate.c 2006-11-29 18:37:17.797934398 -0600
> +++ linux-2.6.19-rc6-mm2/mm/migrate.c 2006-11-30 17:39:48.429639786 -0600
> @@ -218,6 +218,7 @@ static void remove_anon_migration_ptes(s
> struct anon_vma *anon_vma;
> struct vm_area_struct *vma;
> unsigned long mapping;
> + int empty;
>
> mapping = (unsigned long)new->mapping;
>
> @@ -229,11 +230,15 @@ static void remove_anon_migration_ptes(s
> */
> anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
> spin_lock(&anon_vma->lock);
> + anon_vma->migration_count--;
>
> list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
> remove_migration_pte(vma, old, new);
>
> + empty = list_empty(&anon_vma->head);
I think we need to check for migration_count being non-zero here, just
in case two processes try to migrate the same page at once. Or maybe
just say that if migration_count is non-zero, the second migrator just
ignores the page?
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 23:48 ` Paul Menage
@ 2006-12-01 2:23 ` Christoph Lameter
2006-12-01 19:32 ` Paul Menage
2006-12-01 2:44 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 2:23 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm
On Thu, 30 Nov 2006, Paul Menage wrote:
> Don't we need to bump the mapcount? If we don't, then the page gets
> unmapped by the migration prep, and if we race with anyone trying to
> map it they may allocate a new anon_vma and replace it.
Allocate a new vma for an existing anon page? That never happens. We may
do COW in which case the page is copied.
> > + empty = list_empty(&anon_vma->head);
>
> I think we need to check for migration_count being non-zero here, just
> in case two processes try to migrate the same page at once. Or maybe
> just say that if migration_count is non-zero, the second migrator just
> ignores the page?
Right we need to check for the migration_count being zero. The one that
zeros it must free the anon_vma.
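
As a rough userspace sketch of that rule (not kernel code; the toy_* names
are invented and nr_vmas stands in for the vma list), both the unlink path
and the migration-completion path would apply the same combined check, so
whichever one drops the last user is the one told to free:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_anon_vma {
    int nr_vmas;         /* stands in for the length of the vma list */
    int migration_count; /* number of migrations currently pinning it */
};

/* The anon_vma may go away only when nothing at all refers to it. */
static bool toy_should_free(const struct toy_anon_vma *av)
{
    return av->nr_vmas == 0 && av->migration_count == 0;
}

/* Unlink path: drop one vma; returns true if the caller must free. */
static bool toy_unlink_vma(struct toy_anon_vma *av)
{
    assert(av->nr_vmas > 0);
    av->nr_vmas--;
    return toy_should_free(av);
}

/* Migration path: drop one pin; returns true if the caller must free. */
static bool toy_migration_done(struct toy_anon_vma *av)
{
    assert(av->migration_count > 0);
    av->migration_count--;
    return toy_should_free(av);
}
```

In the real code both decrements and the check would have to happen under
anon_vma->lock; the sketch leaves locking out.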
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:44 ` KAMEZAWA Hiroyuki
@ 2006-12-01 2:43 ` Christoph Lameter
2006-12-01 2:59 ` KAMEZAWA Hiroyuki
2006-12-01 2:44 ` Christoph Lameter
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 2:43 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Paul Menage, hugh, linux-mm, akpm
On Fri, 1 Dec 2006, KAMEZAWA Hiroyuki wrote:
> Here is a patch. It is not tested at all, just at the idea level.
> (The period for which rcu_read_lock() is held seems a bit long.)
This is what we have been trying to avoid. Using rcu means that the
anon_vma cacheline gets cold and this will badly influence benchmarks.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 23:48 ` Paul Menage
2006-12-01 2:23 ` Christoph Lameter
@ 2006-12-01 2:44 ` KAMEZAWA Hiroyuki
2006-12-01 2:43 ` Christoph Lameter
2006-12-01 2:44 ` Christoph Lameter
1 sibling, 2 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-12-01 2:44 UTC (permalink / raw)
To: Paul Menage; +Cc: clameter, hugh, linux-mm, akpm
On Thu, 30 Nov 2006 15:48:28 -0800
"Paul Menage" <menage@google.com> wrote:
> On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> > I think your initial suggestion of adding a counter to the anon_vma may
> > work. Here is a patch that may allow us to keep the anon_vma around
> > without holding mmap_sem. Seems to be simple.
>
> Don't we need to bump the mapcount? If we don't, then the page gets
> unmapped by the migration prep, and if we race with anyone trying to
> map it they may allocate a new anon_vma and replace it.
I don't think adding a *dummy* mapcount to a page is good.
One way I can think of now is to use an RCU routine for anon_vma_free() and
take the RCU read lock across the unmap -> map of an anon page. This can prevent
a freed anon_vma struct from being reused by someone immediately.
Christoph-san's patch uses just 4 bytes (an int) for delayed freeing.
This one adds 2 pointers to each anon_vma struct, but doesn't use anything special.
Here is a patch. It is not tested at all, just at the idea level.
(The period for which rcu_read_lock() is held seems a bit long.)
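
The shape of the deferred free can be sketched in userspace as below. This
is an illustration only: toy_rcu_head, toy_call_rcu and
toy_grace_period_end are invented stand-ins, the "grace period" is just an
explicit function call, and a flag is set instead of really freeing memory.
What it does show is how a callback head embedded in the structure lets the
callback find its way back to the enclosing object.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for an embedded callback head (two pointers). */
struct toy_rcu_head {
    struct toy_rcu_head *next;
    void (*func)(struct toy_rcu_head *head);
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_anon_vma {
    int lock;                /* placeholder for the spinlock */
    struct toy_rcu_head rcu; /* embedded head for delayed freeing */
    int freed;               /* set by the callback so the effect is visible */
};

/* Callback: map the head back to its anon_vma and "free" it. */
static void toy_delayed_free(struct toy_rcu_head *head)
{
    struct toy_anon_vma *av = toy_container_of(head, struct toy_anon_vma, rcu);
    av->freed = 1;
}

/* Queue the callback on a pending list instead of freeing immediately. */
static void toy_call_rcu(struct toy_rcu_head **pending, struct toy_rcu_head *head,
                         void (*func)(struct toy_rcu_head *head))
{
    head->func = func;
    head->next = *pending;
    *pending = head;
}

/* Run every queued callback; stands in for the end of a grace period. */
static void toy_grace_period_end(struct toy_rcu_head **pending)
{
    while (*pending != NULL) {
        struct toy_rcu_head *head = *pending;
        *pending = head->next; /* unlink before the callback may release it */
        head->func(head);
    }
}
```

Until toy_grace_period_end() runs, the object stays valid for anyone who
still holds a stale pointer to it, which is the property the migration path
needs.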
-Kame
==
To move page migration to the next step, we have to fix the
anon_vma problem.
The migration code temporarily drops page->mapcount to 0. This means
page->mapping cannot be trusted. AFAIK, the anon_vma can be freed during
migration if mm->mmap_sem is not held.
To make use of migration without mm->mmap_sem, we need to delay freeing
of the anon_vma. This patch uses RCU for delayed freeing of the anon_vma.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/rmap.h | 10 +++++++++-
mm/migrate.c | 2 ++
mm/rmap.c | 6 ++++++
3 files changed, 17 insertions(+), 1 deletion(-)
Index: linux-2.6.19/include/linux/rmap.h
===================================================================
--- linux-2.6.19.orig/include/linux/rmap.h
+++ linux-2.6.19/include/linux/rmap.h
@@ -26,6 +26,7 @@
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
+ struct rcu_head rcu; /* for delayed RCU freeing */
};
#ifdef CONFIG_MMU
@@ -37,11 +38,18 @@ static inline struct anon_vma *anon_vma_
return kmem_cache_alloc(anon_vma_cachep, SLAB_KERNEL);
}
+/*
+ * Because page->mapping(which points to anon-vma) is not cleared
+ * even if page is removed from anon_vma, we use delayed freeing
+ * of anon_vma. This makes migration safer.
+ */
+extern void delayed_anon_vma_free(struct rcu_head *head);
static inline void anon_vma_free(struct anon_vma *anon_vma)
{
- kmem_cache_free(anon_vma_cachep, anon_vma);
+ call_rcu(&anon_vma->rcu, delayed_anon_vma_free);
}
+
static inline void anon_vma_lock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
Index: linux-2.6.19/mm/migrate.c
===================================================================
--- linux-2.6.19.orig/mm/migrate.c
+++ linux-2.6.19/mm/migrate.c
@@ -618,12 +618,14 @@ static int unmap_and_move(new_page_t get
/*
* Establish migration ptes or remove ptes
*/
+ rcu_read_lock();
try_to_unmap(page, 1);
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
if (rc)
remove_migration_ptes(page, page);
+ rcu_read_unlock();
unlock:
unlock_page(page);
Index: linux-2.6.19/mm/rmap.c
===================================================================
--- linux-2.6.19.orig/mm/rmap.c
+++ linux-2.6.19/mm/rmap.c
@@ -70,6 +70,12 @@ static inline void validate_anon_vma(str
#endif
}
+void delayed_anon_vma_free(struct rcu_head *head)
+{
+ struct anon_vma *anon_vma = container_of(head, struct anon_vma, rcu);
+ kmem_cache_free(anon_vma_cachep, anon_vma);
+}
+
/* This must be called under the mmap_sem. */
int anon_vma_prepare(struct vm_area_struct *vma)
{
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:44 ` KAMEZAWA Hiroyuki
2006-12-01 2:43 ` Christoph Lameter
@ 2006-12-01 2:44 ` Christoph Lameter
2006-12-01 3:10 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 2:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Paul Menage, hugh, linux-mm, akpm
Fixed up patch with more comments and a check that the migration_count is
zero before freeing.
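
A minimal userspace sketch of the counting scheme, assuming a simplified model:
the counted_* names are hypothetical, the vma list is reduced to a plain
counter, and the spinlock that the real code holds around these updates is
omitted. It shows only the rule that the structure is released when both the
vma list is empty and no migration is in flight.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; the real struct anon_vma also carries a spinlock. */
struct counted_anon_vma {
	int nr_vmas;		/* stands in for the list of vmas */
	int migration_count;	/* # of migrations in flight */
	bool freed;		/* set where the real code would free */
};

static bool counted_unused(const struct counted_anon_vma *av)
{
	return av->nr_vmas == 0 && av->migration_count == 0;
}

/* Models anon_vma_unlink(): drop one vma, free only if fully unused. */
static void counted_unlink(struct counted_anon_vma *av)
{
	av->nr_vmas--;
	if (counted_unused(av))
		av->freed = true;
}

/* Models try_to_unmap_anon() when called for migration. */
static void counted_migration_begin(struct counted_anon_vma *av)
{
	av->migration_count++;
}

/* Models remove_anon_migration_ptes(): drop the count, then recheck. */
static void counted_migration_end(struct counted_anon_vma *av)
{
	av->migration_count--;
	if (counted_unused(av))
		av->freed = true;
}

/* Returns true if the free happened only after the migration finished. */
static bool counted_demo(void)
{
	struct counted_anon_vma av = { .nr_vmas = 1 };
	bool freed_early;

	counted_migration_begin(&av);
	counted_unlink(&av);		/* last vma goes away mid-migration */
	freed_early = av.freed;		/* the count keeps it alive here */
	counted_migration_end(&av);
	return !freed_early && av.freed;
}
```
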
Index: linux-2.6.19-rc6-mm2/include/linux/rmap.h
===================================================================
--- linux-2.6.19-rc6-mm2.orig/include/linux/rmap.h 2006-11-15 22:03:40.000000000 -0600
+++ linux-2.6.19-rc6-mm2/include/linux/rmap.h 2006-11-30 17:39:17.643728656 -0600
@@ -26,6 +26,7 @@
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
+ int migration_count; /* # processes migrating pages */
};
#ifdef CONFIG_MMU
Index: linux-2.6.19-rc6-mm2/mm/migrate.c
===================================================================
--- linux-2.6.19-rc6-mm2.orig/mm/migrate.c 2006-11-29 18:37:17.797934398 -0600
+++ linux-2.6.19-rc6-mm2/mm/migrate.c 2006-11-30 20:41:13.810836561 -0600
@@ -209,15 +209,12 @@ static void remove_file_migration_ptes(s
spin_unlock(&mapping->i_mmap_lock);
}
-/*
- * Must hold mmap_sem lock on at least one of the vmas containing
- * the page so that the anon_vma cannot vanish.
- */
static void remove_anon_migration_ptes(struct page *old, struct page *new)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
unsigned long mapping;
+ int empty;
mapping = (unsigned long)new->mapping;
@@ -225,15 +222,20 @@ static void remove_anon_migration_ptes(s
return;
/*
- * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+ * We have increased migration_count So no need to call
+ * page_lock_anon_vma.
*/
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
spin_lock(&anon_vma->lock);
+ anon_vma->migration_count--;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
+ empty = list_empty(&anon_vma->head) && !anon_vma->migration_count;
spin_unlock(&anon_vma->lock);
+ if (empty)
+ anon_vma_free(anon_vma);
}
/*
Index: linux-2.6.19-rc6-mm2/mm/rmap.c
===================================================================
--- linux-2.6.19-rc6-mm2.orig/mm/rmap.c 2006-11-15 22:03:40.000000000 -0600
+++ linux-2.6.19-rc6-mm2/mm/rmap.c 2006-11-30 20:39:52.266554217 -0600
@@ -150,8 +150,8 @@ void anon_vma_unlink(struct vm_area_stru
validate_anon_vma(vma);
list_del(&vma->anon_vma_node);
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
+ /* We must garbage collect the anon_vma if it's unused */
+ empty = list_empty(&anon_vma->head) && !anon_vma->migration_count;
spin_unlock(&anon_vma->lock);
if (empty)
@@ -787,6 +787,10 @@ static int try_to_unmap_anon(struct page
if (!anon_vma)
return ret;
+ if (migration)
+ /* Prevent freeing while migrating pages */
+ anon_vma->migration_count++;
+
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
ret = try_to_unmap_one(page, vma, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:43 ` Christoph Lameter
@ 2006-12-01 2:59 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-12-01 2:59 UTC (permalink / raw)
To: Christoph Lameter; +Cc: menage, hugh, linux-mm, akpm
On Thu, 30 Nov 2006 18:43:01 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 1 Dec 2006, KAMEZAWA Hiroyuki wrote:
>
> > This is a patch. not tested at all, just idea level.
> > (seems a period of taking rcu_read_lock() is a bit long..)
>
> This is what we have been trying to avoid. Using rcu means that the
> anon_vma cacheline gets cold and this will badly influence benchmarks.
>
Ah, okay. RCU's batch freeing makes the cacheline cold. Sorry.
-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:44 ` Christoph Lameter
@ 2006-12-01 3:10 ` KAMEZAWA Hiroyuki
2006-12-01 5:28 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-12-01 3:10 UTC (permalink / raw)
To: Christoph Lameter; +Cc: menage, hugh, linux-mm, akpm
On Thu, 30 Nov 2006 18:44:30 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> Fixed up patch with more comments and a check that the migration_count is
> zero before freeing.
>
Looks good, thanks. We need users and tests :)
-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 3:10 ` KAMEZAWA Hiroyuki
@ 2006-12-01 5:28 ` Christoph Lameter
0 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 5:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: menage, hugh, linux-mm, akpm
On Fri, 1 Dec 2006, KAMEZAWA Hiroyuki wrote:
> On Thu, 30 Nov 2006 18:44:30 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Fixed up patch with more comments and a check that the migration_count is
> > zero before freeing.
> >
>
> Looks good, thanks. we need users and tests :)
Yeah, we would need something that is not process based. Paul Menage may
have something.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:23 ` Christoph Lameter
@ 2006-12-01 19:32 ` Paul Menage
2006-12-01 19:56 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-12-01 19:32 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm
On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 30 Nov 2006, Paul Menage wrote:
>
> > Don't we need to bump the mapcount? If we don't, then the page gets
> > unmapped by the migration prep, and if we race with anyone trying to
> > map it they may allocate a new anon_vma and replace it.
>
> Allocate a new vma for an existing anon page? That never happens. We may
> do COW in which case the page is copied.
I was thinking of a new anon_vma, rather than a new vma - but I guess
that even if we do race with someone who's faulting on the page and
pulling it from the swap cache, they'll just set the page mapping to
the same value as it is already, rather than setting it to a new
value. So you're right, not a problem.
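
A minimal userspace sketch of the "same value" argument, assuming a toy model
of the tagged page->mapping pointer: the tag_* names and TOY_MAPPING_ANON are
hypothetical stand-ins, and no locking or real fault path is modelled. It only
shows that storing the mapping again for the same anon_vma leaves the field
unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAPPING_ANON ((uintptr_t)1)	/* low-bit tag, in the spirit of PAGE_MAPPING_ANON */

struct tag_anon_vma { int dummy; };
struct tag_page { uintptr_t mapping; };

/* Store the anon_vma pointer with the "anonymous" tag bit set. */
static void tag_set_anon_mapping(struct tag_page *page, struct tag_anon_vma *av)
{
	page->mapping = (uintptr_t)av + TOY_MAPPING_ANON;
}

/* Decode the field; returns NULL if the tag bit is not set. */
static struct tag_anon_vma *tag_page_anon_vma(const struct tag_page *page)
{
	if (!(page->mapping & TOY_MAPPING_ANON))
		return NULL;
	return (struct tag_anon_vma *)(page->mapping - TOY_MAPPING_ANON);
}

/* Returns 1 if a second store for the same anon_vma changes nothing. */
static int tag_demo(void)
{
	struct tag_anon_vma av = { 0 };
	struct tag_page page = { 0 };
	uintptr_t before;

	tag_set_anon_mapping(&page, &av);	/* original mapping */
	before = page.mapping;
	tag_set_anon_mapping(&page, &av);	/* racing fault, same anon_vma */
	return page.mapping == before && tag_page_anon_vma(&page) == &av;
}
```
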
Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 19:32 ` Paul Menage
@ 2006-12-01 19:56 ` Christoph Lameter
0 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 19:56 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm
On Fri, 1 Dec 2006, Paul Menage wrote:
>
> I was thinking of a new anon_vma, rather than a new vma - but I guess
> that even if we do race with someone who's faulting on the page and
> pulling it from the swap cache, they'll just set the page mapping to
> the same value as it is already, rather than setting it to a new
> value. So you're right, not a problem.
The page is locked during migration to prevent such occurrences.