* [RFC][PATCH 0/1] Node-based reclaim/migration
@ 2006-11-29 3:06 menage
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
` (2 more replies)
0 siblings, 3 replies; 54+ messages in thread
From: menage @ 2006-11-29 3:06 UTC (permalink / raw)
To: linux-mm; +Cc: akpm
--
We're trying to use NUMA node isolation as a form of job resource
control at Google, and the existing page migration APIs are all bound
to individual processes and so are a bit clunky to use when you just
want to affect all the pages on a given node.
How about an API to allow userspace to direct page migration (and page
reclaim) on a per-node basis? This patch provides such an API, based
around sysfs; a system call approach would certainly be possible too.
It sort of overlaps with memory hot-unplug, but is simpler since it's
not so bad if we miss a few pages.
Comments? Also, can anyone clarify whether I need any locking when
scanning the pages in a pgdat? As far as I can see, even with memory
hotplug this number can only increase, not decrease.
Paul
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
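
For readers who want to picture the userspace side of the proposed interface, here is a minimal sketch in C. It assumes the two write-only attributes named in the patch (try_to_free_pages and migrate_node) live under a per-node sysfs directory; the helper names node_attr_path() and node_attr_write() are hypothetical and exist only for this illustration, and the base directory is passed in rather than hard-coded.

```c
#include <stdio.h>

/*
 * Build "<base>/node<nid>/<attr>" into buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 * (Hypothetical helper, not part of the patch.)
 */
static int node_attr_path(char *buf, size_t len, const char *base,
			  int nid, const char *attr)
{
	int n = snprintf(buf, len, "%s/node%d/%s", base, nid, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write one string to a per-node attribute file.
 * Returns 0 on success, -1 on any error (path too long, open or
 * write failure, or an error reported when the file is closed).
 */
static int node_attr_write(const char *base, int nid,
			   const char *attr, const char *value)
{
	char path[256];
	FILE *f;
	int ok;

	if (node_attr_path(path, sizeof(path), base, nid, attr) < 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(value, f) >= 0;
	/* sysfs store errors often only surface at flush/close time */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

With the patch applied, a call such as node_attr_write("/sys/devices/system/node", 3, "migrate_node", "0-2") would ask the kernel to move migratable pages from node 3 onto nodes 0 to 2, and writing a small integer to try_to_free_pages would trigger reclaim on that node.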

* [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 3:06 [RFC][PATCH 0/1] Node-based reclaim/migration menage
@ 2006-11-29 3:06 ` menage
2006-11-29 6:07 ` Nick Piggin
` (2 more replies)
2006-11-30 0:31 ` [RFC][PATCH 0/1] Node-based reclaim/migration KAMEZAWA Hiroyuki
2006-11-30 4:04 ` Christoph Lameter
2 siblings, 3 replies; 54+ messages in thread
From: menage @ 2006-11-29 3:06 UTC (permalink / raw)
To: linux-mm; +Cc: akpm

[-- Attachment #1: node_reclaim.patch --]
[-- Type: text/plain, Size: 8417 bytes --]

Currently the page migration APIs allow you to migrate pages from
particular processes, but don't provide a clean and efficient way to
migrate and/or reclaim memory from individual nodes.

This patch provides:

- an additional parameter to try_to_free_pages() to specify the
  priority at which the reclaim should give up if it doesn't make
  progress

- a way to trigger try_to_free_pages() for a given node with a given
  minimum priority, by writing an integer to
  /sys/devices/system/node/node<id>/try_to_free_pages

- a way to request that any migratable pages on a given node be
  migrated to available pages on a specified set of nodes by writing a
  destination nodemask (in ASCII form) to
  /sys/devices/system/node/node<id>/migrate_node

Signed-off-by: Paul Menage <menage@google.com>

---
 drivers/base/node.c       |   92 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/buffer.c               |    2 -
 include/linux/mempolicy.h |    2 +
 include/linux/swap.h      |    2 -
 mm/mempolicy.c            |    3 -
 mm/page_alloc.c           |    2 -
 mm/vmscan.c               |    5 +-
 7 files changed, 101 insertions(+), 7 deletions(-)

Index: 2.6.19-node_reclaim/drivers/base/node.c
===================================================================
--- 2.6.19-node_reclaim.orig/drivers/base/node.c
+++ 2.6.19-node_reclaim/drivers/base/node.c
@@ -12,6 +12,8 @@
 #include <linux/topology.h>
 #include <linux/nodemask.h>
 #include <linux/cpu.h>
+#include <linux/swap.h>
+#include <linux/migrate.h>
 
 static struct sysdev_class node_class = {
 	set_kset_name("node"),
@@ -137,6 +139,92 @@ static ssize_t node_read_distance(struct
 
 static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+static ssize_t node_store_ttfp(struct sys_device *dev,
+			       struct sysdev_attribute *attr,
+			       const char *buf,
+			       size_t count) {
+	int nid = dev->id;
+	unsigned int priority;
+	struct zonelist *zl;
+	nodemask_t nodes;
+	ssize_t ret = count;
+
+	priority = max(0, min(DEF_PRIORITY, (int)simple_strtoul(buf, NULL, 0)));
+	printk(KERN_INFO "Calling try_to_free_pages(%d, %d)\n",
+	       nid, priority);
+
+	nodes_clear(nodes);
+	node_set(nid, nodes);
+	zl = bind_zonelist(&nodes);
+
+	if (!try_to_free_pages(zl->zones, GFP_USER, priority))
+		ret = -ENOMEM;
+
+	kfree(zl);
+
+	return ret;
+}
+
+static SYSDEV_ATTR(try_to_free_pages, 0200, NULL, node_store_ttfp);
+
+static struct page *migrate_from_node_page(struct page *page,
+					   unsigned long private,
+					   int **result) {
+	struct zonelist *zl = (struct zonelist *) private;
+	return __alloc_pages(GFP_HIGHUSER & ~__GFP_WAIT, 0, zl);
+}
+
+static ssize_t node_store_migrate_node(struct sys_device *dev,
+				       struct sysdev_attribute *attr,
+				       const char *buf,
+				       size_t count) {
+	int nid = dev->id;
+	nodemask_t nodes;
+	ssize_t ret;
+	struct zonelist *zl;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	int i;
+	int pagecount = 0, failcount = 0;
+	LIST_HEAD(pagelist);
+
+	ret = nodelist_parse(buf, nodes);
+	if (ret)
+		return ret;
+
+	zl = bind_zonelist(&nodes);
+
+	migrate_prep();
+
+	for (i = 0; i < pgdat->node_spanned_pages; ++i) {
+		struct page *page = pgdat_page_nr(pgdat, i);
+		if (!isolate_lru_page(page, &pagelist)) {
+			pagecount++;
+		} else {
+			failcount++;
+		}
+	}
+
+	ret = count;
+	printk(KERN_INFO "Migrating %d pages from node %d\n", pagecount, nid);
+	if (!list_empty(&pagelist)) {
+		int migrate_ret = migrate_pages(&pagelist,
+						migrate_from_node_page,
+						(unsigned long)zl);
+
+		printk(KERN_INFO "migrate_pages returned %d\n", migrate_ret);
+		if (migrate_ret < 0) {
+			ret = migrate_ret;
+		}
+	} else {
+		printk(KERN_INFO "No pages to migrate. Failcount = %d!\n",
+		       failcount++);
+	}
+
+	kfree(zl);
+	return ret;
+}
+
+static SYSDEV_ATTR(migrate_node, 0200, NULL, node_store_migrate_node);
 /*
  * register_node - Setup a driverfs device for a node.
  * @num - Node number to use when creating the device.
@@ -156,6 +244,8 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
 		sysdev_create_file(&node->sysdev, &attr_numastat);
 		sysdev_create_file(&node->sysdev, &attr_distance);
+		sysdev_create_file(&node->sysdev, &attr_try_to_free_pages);
+		sysdev_create_file(&node->sysdev, &attr_migrate_node);
 	}
 	return error;
 }
@@ -173,6 +263,8 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_meminfo);
 	sysdev_remove_file(&node->sysdev, &attr_numastat);
 	sysdev_remove_file(&node->sysdev, &attr_distance);
+	sysdev_remove_file(&node->sysdev, &attr_try_to_free_pages);
+	sysdev_remove_file(&node->sysdev, &attr_migrate_node);
 
 	sysdev_unregister(&node->sysdev);
 }
Index: 2.6.19-node_reclaim/fs/buffer.c
===================================================================
--- 2.6.19-node_reclaim.orig/fs/buffer.c
+++ 2.6.19-node_reclaim/fs/buffer.c
@@ -374,7 +374,7 @@ static void free_more_memory(void)
 	for_each_online_pgdat(pgdat) {
 		zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
 		if (*zones)
-			try_to_free_pages(zones, GFP_NOFS);
+			try_to_free_pages(zones, GFP_NOFS, 0);
 	}
 }
 
Index: 2.6.19-node_reclaim/include/linux/mempolicy.h
===================================================================
--- 2.6.19-node_reclaim.orig/include/linux/mempolicy.h
+++ 2.6.19-node_reclaim/include/linux/mempolicy.h
@@ -175,6 +175,8 @@ int do_migrate_pages(struct mm_struct *m
 
 extern void *cpuset_being_rebound;	/* Trigger mpol_copy vma rebind */
 
+struct zonelist *bind_zonelist(nodemask_t *nodes);
+
 #else
 
 struct mempolicy {};
Index: 2.6.19-node_reclaim/include/linux/swap.h
===================================================================
--- 2.6.19-node_reclaim.orig/include/linux/swap.h
+++ 2.6.19-node_reclaim/include/linux/swap.h
@@ -187,7 +187,7 @@ extern int rotate_reclaimable_page(struc
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **, gfp_t);
+extern unsigned long try_to_free_pages(struct zone **, gfp_t, int priority);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: 2.6.19-node_reclaim/mm/mempolicy.c
===================================================================
--- 2.6.19-node_reclaim.orig/mm/mempolicy.c
+++ 2.6.19-node_reclaim/mm/mempolicy.c
@@ -134,7 +134,7 @@ static int mpol_check_policy(int mode, n
 }
 
 /* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+struct zonelist *bind_zonelist(nodemask_t *nodes)
 {
 	struct zonelist *zl;
 	int num, max, nd;
@@ -1908,4 +1908,3 @@ out:
 	m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
 	return 0;
 }
-
Index: 2.6.19-node_reclaim/mm/page_alloc.c
===================================================================
--- 2.6.19-node_reclaim.orig/mm/page_alloc.c
+++ 2.6.19-node_reclaim/mm/page_alloc.c
@@ -1371,7 +1371,7 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask);
+	did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask, 0);
 
 	p->reclaim_state = NULL;
 
Index: 2.6.19-node_reclaim/mm/vmscan.c
===================================================================
--- 2.6.19-node_reclaim.orig/mm/vmscan.c
+++ 2.6.19-node_reclaim/mm/vmscan.c
@@ -1014,7 +1014,8 @@ static unsigned long shrink_zones(int pr
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
  */
-unsigned long try_to_free_pages(struct zone **zones, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zone **zones, gfp_t gfp_mask,
+				int min_priority)
 {
 	int priority;
 	int ret = 0;
@@ -1057,7 +1058,7 @@ unsigned long try_to_free_pages(struct z
 		lru_pages += zone->nr_active + zone->nr_inactive;
 	}
 
-	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+	for (priority = DEF_PRIORITY; priority >= min_priority; priority--) {
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();

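
The effect of the new min_priority argument can be illustrated outside the kernel. The sketch below is a standalone C model, not kernel code: clamp_priority() mirrors the max(0, min(DEF_PRIORITY, ...)) expression in node_store_ttfp(), and max_reclaim_passes() counts how many iterations the modified loop in try_to_free_pages() can run at most. Both function names are made up for this example, and DEF_PRIORITY is assumed to be 12, which is the value I believe mainline used at the time.

```c
/* Assumed value of the kernel's DEF_PRIORITY; check your tree. */
#define DEF_PRIORITY 12

/*
 * Clamp a user-supplied number into [0, DEF_PRIORITY], the same range
 * node_store_ttfp() enforces before calling try_to_free_pages().
 */
static int clamp_priority(long requested)
{
	if (requested < 0)
		return 0;
	if (requested > DEF_PRIORITY)
		return DEF_PRIORITY;
	return (int)requested;
}

/*
 * Upper bound on scan passes: the patched loop walks priority from
 * DEF_PRIORITY down to min_priority inclusive, and may stop earlier
 * once enough pages have been reclaimed.
 */
static int max_reclaim_passes(int min_priority)
{
	int priority, passes = 0;

	for (priority = DEF_PRIORITY; priority >= min_priority; priority--)
		passes++;

	return passes;
}
```

Under these assumptions, writing 12 allows a single light pass, while writing 0 allows up to 13 passes and reaches the most aggressive scanning, which is also what the existing in-kernel callers get by passing 0.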
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
@ 2006-11-29 6:07 ` Nick Piggin
2006-11-29 21:57 ` Paul Menage
2006-11-30 0:18 ` KAMEZAWA Hiroyuki
2006-11-30 4:10 ` Christoph Lameter
2 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-29 6:07 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm

menage@google.com wrote:
> Currently the page migration APIs allow you to migrate pages from
> particular processes, but don't provide a clean and efficient way to
> migrate and/or reclaim memory from individual nodes.

The mechanism for that should probably go in mm/migrate.c, shouldn't
it?

Also, why don't you scan the lru lists of the zones in the node, which
will a) be much more efficient if there are lots of non LRU pages, and
b) allow you to batch the lru lock.

>
> This patch provides:
>
> - an additional parameter to try_to_free_pages() to specify the
> priority at which the reclaim should give up if it doesn't make
> progress

Dang. It would be nice not to export this "priority" stuff outside
vmscan.c too much because it is really an implementation detail and I
would like to get rid of it one day...

>
> - a way to trigger try_to_free_pages() for a given node with a given
> minimum priority, by writing an integer to
> /sys/devices/system/node/node<id>/try_to_free_pages

... especially not to userspace. Why does this have to be exposed to
userspace at all? Can you not wire it up to your resource isolation
implementation in the kernel?

>
> - a way to request that any migratable pages on a given node be
> migrated to available pages on a specified set of nodes by writing a
> destination nodemask (in ASCII form) to
> /sys/devices/system/node/node<id>/migrate_node

... yeah it would obviously be much nicer to do it in kernel space,
behind your higher level APIs. There's probably a good reason why you
aren't, and I haven't been following the lists very much over the past
couple of weeks... Can you describe your problems (or point me to a
post)?

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 6:07 ` Nick Piggin
@ 2006-11-29 21:57 ` Paul Menage
2006-11-30 4:13 ` Christoph Lameter
2006-11-30 7:38 ` Nick Piggin
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-29 21:57 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm

On 11/28/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> menage@google.com wrote:
> > Currently the page migration APIs allow you to migrate pages from
> > particular processes, but don't provide a clean and efficient way to
> > migrate and/or reclaim memory from individual nodes.
>
> The mechanism for that should probably go in mm/migrate.c, shouldn't
> it?

Quite possibly - I don't have a strong feeling for exactly where the
code should go. There's existing code (sys_migrate_pages) that uses
the migration mechanism that's in mm/mempolicy.c rather than
migrate.c, and this was a pretty simple function to write.

>
> Also, why don't you scan the lru lists of the zones in the node, which
> will a) be much more efficient if there are lots of non LRU pages, and
> b) allow you to batch the lru lock.

I'll take a look at that.

> >
> > - a way to trigger try_to_free_pages() for a given node with a given
> > minimum priority, by writing an integer to
> > /sys/devices/system/node/node<id>/try_to_free_pages
>
> ... especially not to userspace. Why does this have to be exposed to
> userspace at all?

We don't need to expose the raw "priority" value, but it would be
really nice for user space to be able to specify how hard the kernel
should try to free some memory.

Then each job can specify a "reclaim pressure", i.e. how much
back-pressure should be applied to its allocated memory, so you can
get a good idea of how much memory the job is really using for a given
level of performance. High reclaim pressure results in a smaller
working set but possibly more paging in from disk; low reclaim
pressure uses more memory but gets higher performance.

> Can you not wire it up to your resource isolation
> implementation in the kernel?

This *is* the resource isolation implementation (plus the existing
cpusets and fake-numa code). The intention is to expose just enough
knobs/hooks to userspace that it can be handled there.

>
> ... yeah it would obviously be much nicer to do it in kernel space,
> behind your higher level APIs.

I don't think it would - keeping as much of the code as possible in
userspace makes development and deployment much faster. We don't
really have any higher-level APIs at this point - just userspace
middleware manipulating cpusets.

Paul

* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 21:57 ` Paul Menage
@ 2006-11-30 4:13 ` Christoph Lameter
2006-11-30 4:18 ` Paul Menage
2006-11-30 7:38 ` Nick Piggin
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:13 UTC (permalink / raw)
To: Paul Menage; +Cc: Nick Piggin, linux-mm, akpm

On Wed, 29 Nov 2006, Paul Menage wrote:

> Quite possibly - I don't have a strong feeling for exactly where the
> code should go. There's existing code (sys_migrate_pages) that uses
> the migration mechanism that's in mm/mempolicy.c rather than
> migrate.c, and this was a pretty simple function to write.

Plus there is another mechanism in mm/migrate.c that also uses the
migration mechanism.

> We don't need to expose the raw "priority" value, but it would be
> really nice for user space to be able to specify how hard the kernel
> should try to free some memory.

Would it not be sufficient to specify that in the number of attempts
like already provided by the page migration scheme?

> Then each job can specify a "reclaim pressure", i.e. how much
> back-pressure should be applied to its allocated memory, so you can
> get a good idea of how much memory the job is really using for a given
> level of performance. High reclaim pressure results in a smaller
> working set but possibly more paging in from disk; low reclaim
> pressure uses more memory but gets higher performance.

Reclaim? I thought you wanted to migrate memory of a node?

* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 4:13 ` Christoph Lameter
@ 2006-11-30 4:18 ` Paul Menage
0 siblings, 0 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 4:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Nick Piggin, linux-mm, akpm

On 11/29/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> Reclaim? I thought you wanted to migrate memory of a node?
>

Both. The idea would be to apply gentle (or not so gentle, depending
on how important the job is ...) reclaim pressure to all the nodes
owned by a job. If you free up enough memory, you can then consider
migrating the allocated pages from one node into other nodes belonging
to the job, and hence reclaim a node for use by some other job.

Paul

* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 21:57 ` Paul Menage
2006-11-30 4:13 ` Christoph Lameter
@ 2006-11-30 7:38 ` Nick Piggin
2006-11-30 7:57 ` Paul Menage
1 sibling, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 7:38 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

Paul Menage wrote:
> On 11/28/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>> Can you not wire it up to your resource isolation
>> implementation in the kernel?
>
>
> This *is* the resource isolation implementation (plus the existing
> cpusets and fake-numa code). The intention is to expose just enough
> knobs/hooks to userspace that it can be handled there.

Yes, but when you migrate tasks between these containers, or when you
create/destroy them, then why can't you do the migration at that time?

>> ... yeah it would obviously be much nicer to do it in kernel space,
>> behind your higher level APIs.
>
>
> I don't think it would - keeping as much of the code as possible in
> userspace makes development and deployment much faster. We don't
> really have any higher-level APIs at this point - just userspace
> middleware manipulating cpusets.

We can't use that as an argument for the upstream kernel, but I
would believe that it is a good choice for google.

--
SUSE Labs, Novell Inc.

* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 7:38 ` Nick Piggin
@ 2006-11-30 7:57 ` Paul Menage
2006-11-30 8:26 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 7:57 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm

On 11/29/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Yes, but when you migrate tasks between these containers, or when you
> create/destroy them, then why can't you do the migration at that time?

?

The migration that I'm envisaging is going to occur when either:

- we're trying to move a job to a different real numa node because,
say, a new job has started that needs the whole of a node to itself,
and we need to clear space for it.

- we're trying to compact the memory usage of a job, when it has
plenty of free space in each of its nodes, and we can fit all the
memory into a smaller set of nodes.

Neither of these are tied to create/destroy time or moving processes
in/out of jobs (in fact we'd not be planning to move processes between
jobs - once a process is in a job it would stay there, although I
realise other people would have different requirements).

> > I don't think it would - keeping as much of the code as possible in
> > userspace makes development and deployment much faster. We don't
> > really have any higher-level APIs at this point - just userspace
> > middleware manipulating cpusets.
>
> We can't use that as an argument for the upstream kernel, but I
> would believe that it is a good choice for google.
>

I would have thought that providing userspace just enough hooks to do
what it needs to do, and not mandating higher-level constructs is
exactly the philosophy of the linux kernel. Hence, e.g. providing
efficient building blocks like sendfile and a threaded network stack,
faster threading with NPTL and a very limited static-file webserver
(TUX, even though it's not in the mainline) and leaving the complex
bits of webserving to userspace.

Things like deciding which containers should be using which nodes, and
directing the kernel appropriately, is the job of userspace, not
kernelspace, since there are lots of possible ways of making those
decisions.

Paul

* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 7:57 ` Paul Menage
@ 2006-11-30 8:26 ` Nick Piggin
2006-11-30 8:39 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 8:26 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

Paul Menage wrote:
> On 11/29/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>
>> Yes, but when you migrate tasks between these containers, or when you
>> create/destroy them, then why can't you do the migration at that time?
>
>
> ?
>
> The migration that I'm envisaging is going to occur when either:
>
> - we're trying to move a job to a different real numa node because,
> say, a new job has started that needs the whole of a node to itself,
> and we need to clear space for it.

So migrate at this point.

> - we're trying to compact the memory usage of a job, when it has
> plenty of free space in each of its nodes, and we can fit all the
> memory into a smaller set of nodes.

Or reclaim at this point.

>> We can't use that as an argument for the upstream kernel, but I
>> would believe that it is a good choice for google.
>>
>
> I would have thought that providing userspace just enough hooks to do
> what it needs to do, and not mandating higher-level constructs is
> exactly the philosophy of the linux kernel. Hence, e.g. providing

Yes, but without exposing implementation to userspace, where possible.

The ultimate would be to devise an API which is usable by your patch,
as well as the other resource control mechanisms going around. If
userspace has to know that you've implemented memory control with
"fake nodes", then IMO something has gone wrong.

> efficient building blocks like sendfile and a threaded network stack,
> faster threading with NPTL and a very limited static-file webserver
> (TUX, even though it's not in the mainline) and leaving the complex
> bits of webserving to userspace.

I don't see the similarity with sendfile+TUX. I don't think putting an
explicit container / resource controller API in the kernel is even
anything like TUX in the kernel, let alone apache in kernel.

> Things like deciding which containers should be using which nodes, and
> directing the kernel appropriately, is the job of userspace, not
> kernelspace, since there are lots of possible ways of making those
> decisions.

I disagree.

--
SUSE Labs, Novell Inc.

* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 8:26 ` Nick Piggin
@ 2006-11-30 8:39 ` Paul Menage
2006-11-30 8:55 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 8:39 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm

On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > - we're trying to move a job to a different real numa node because,
> > say, a new job has started that needs the whole of a node to itself,
> > and we need to clear space for it.
>
> So migrate at this point.

That's what I want to do. But currently you can only do this on a
process-by-process basis and it doesn't affect file pages in the
pagecache that aren't mapped by anyone.

Being able to say "try to move all memory from this node to this other
set of nodes" seems like a generically useful thing even for other
uses (e.g. hot unplug, general HPC numa systems, etc).

>
> > - we're trying to compact the memory usage of a job, when it has
> > plenty of free space in each of its nodes, and we can fit all the
> > memory into a smaller set of nodes.
>
> Or reclaim at this point.
>

This would be happening after reclaim has successfully shrunk the
in-use memory in a bunch of nodes, and we want to consolidate to a
smaller set of nodes.

>
> The ultimate would be to devise an API which is usable by your patch,
> as well as the other resource control mechanisms going around. If
> userspace has to know that you've implemented memory control with
> "fake nodes", then IMO something has gone wrong.

I disagree. Memory control via fake numa (or even via real numa if you
have enough real nodes) is sufficiently fundamentally different from
memory control via, say, per-page owner pointers (due to granularity,
etc) that userspace really needs to know about it in order to make
sensible decisions.

It also has the nice property that the kernel already exposes most of
the mechanism required for this via the cpusets code.

Paul

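
The "move all memory from this node to this other set of nodes" request is expressed in the patch as an ASCII nodelist, which nodelist_parse() accepts as comma-separated node numbers and ranges such as 0-2,5. As a rough sketch of how userspace middleware might build that string, the hypothetical helper below (format_nodelist() is not part of the patch) turns a bitmask of destination nodes into list form. It assumes the node set fits in one unsigned long.

```c
#include <stdio.h>

/*
 * Format the set bits of mask (bit i stands for node i) as a nodelist
 * string such as "0-2,5". Returns 0 on success, -1 if buf is too small.
 * An empty mask yields an empty string.
 */
static int format_nodelist(unsigned long mask, char *buf, size_t len)
{
	const int nbits = (int)(sizeof(mask) * 8);
	size_t used = 0;
	int node = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	while (node < nbits) {
		int start, end, n;

		if (!(mask & (1UL << node))) {
			node++;
			continue;
		}

		/* Extend the run while the next node is also set. */
		start = node;
		while (node + 1 < nbits && (mask & (1UL << (node + 1))))
			node++;
		end = node++;

		if (start == end)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, end);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}

	return 0;
}
```

The resulting string is what would be written to the source node's migrate_node file to name the destination nodes.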
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 8:39 ` Paul Menage
@ 2006-11-30 8:55 ` Nick Piggin
2006-11-30 9:06 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 8:55 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> >
>> > - we're trying to move a job to a different real numa node because,
>> > say, a new job has started that needs the whole of a node to itself,
>> > and we need to clear space for it.
>>
>> So migrate at this point.
>
>
> That's what I want to do. But currently you can only do this on a
> process-by-process basis and it doesn't affect file pages in the
> pagecache that aren't mapped by anyone.
>
> Being able to say "try to move all memory from this node to this other
> set of nodes" seems like a generically useful thing even for other
> uses (e.g. hot unplug, general HPC numa systems, etc).

AFAIK they do that in their higher level APIs (at least HPC numa does).

>> > - we're trying to compact the memory usage of a job, when it has
>> > plenty of free space in each of its nodes, and we can fit all the
>> > memory into a smaller set of nodes.
>>
>> Or reclaim at this point.
>>
>
> This would be happening after reclaim has successfully shrunk the
> in-use memory in a bunch of nodes, and we want to consolidate to a
> smaller set of nodes.

So your API could be some directive to consolidate? You could get
pretty accurate estimates with page statistics, as to whether it
can be done or not.

>> The ultimate would be to devise an API which is usable by your patch,
>> as well as the other resource control mechanisms going around. If
>> userspace has to know that you've implemented memory control with
>> "fake nodes", then IMO something has gone wrong.
>
>
> I disagree. Memory control via fake numa (or even via real numa if you
> have enough real nodes) is sufficiently fundamentally different from
> memory control via, say, per-page owner pointers (due to granularity,
> etc) that userspace really needs to know about it in order to make
> sensible decisions.
>
> It also has the nice property that the kernel already exposes most of
> the mechanism required for this via the cpusets code.

The cpusets code is definitely similar to what memory resource control
needs. I don't think that a resource control API needs to be tied to
such granular, hard limits as the fakenodes code provides though. But
maybe I'm wrong and it really would be acceptable for everyone.

--
SUSE Labs, Novell Inc.

* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 8:55 ` Nick Piggin
@ 2006-11-30 9:06 ` Paul Menage
2006-11-30 9:21 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 9:06 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm

On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > Being able to say "try to move all memory from this node to this other
> > set of nodes" seems like a generically useful thing even for other
> > uses (e.g. hot unplug, general HPC numa systems, etc).
>
> AFAIK they do that in their higher level APIs (at least HPC numa does).

Could you point me at an example?

> > This would be happening after reclaim has successfully shrunk the
> > in-use memory in a bunch of nodes, and we want to consolidate to a
> > smaller set of nodes.
>
> So your API could be some directive to consolidate? You could get
> pretty accurate estimates with page statistics, as to whether it
> can be done or not.

Yes, and exposing those statistics (already available in
/sys/devices/system/node/node*/meminfo) and the low-level mechanism for
migration are, to me, things that are appropriate for the kernel. I'm
not sure what a specific "consolidation API" would look like, beyond
the API that I'm already proposing (migrate memory from node X to
nodes A,B,C)

> The cpusets code is definitely similar to what memory resource control
> needs. I don't think that a resource control API needs to be tied to
> such granular, hard limits as the fakenodes code provides though. But
> maybe I'm wrong and it really would be acceptable for everyone.

Ah. This isn't intended to be specifically a "resource control API".
It's more intended to be an API that could be useful for certain kinds
of resource control, but could also be generically useful.

Paul
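The estimate from page statistics discussed in the message above can be
illustrated in userspace. This is only a minimal sketch: it assumes the
per-node meminfo file prints lines of the form "Node 0 MemUsed: 2048 kB",
and the helper names parse_node_kb() and can_consolidate() are hypothetical,
not an existing kernel or library interface. It compares totals only and
ignores unmovable pages and watermarks, so a "yes" is optimistic.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of the assumed form "Node <id> <field>: <value> kB".
 * Returns 1 and stores the value (in kB) when the line carries `field`,
 * 0 otherwise. Hypothetical helper for illustration only.
 */
int parse_node_kb(const char *line, const char *field, unsigned long *kb)
{
    int node;
    char name[64];
    unsigned long value;

    assert(line != NULL && field != NULL && kb != NULL);
    if (sscanf(line, "Node %d %63[^:]: %lu kB", &node, name, &value) != 3)
        return 0;
    if (strcmp(name, field) != 0)
        return 0;
    *kb = value;
    return 1;
}

/*
 * Rough feasibility check: could everything in use on the source node
 * fit into the free memory of the destination nodes? Only totals are
 * compared, so this is an optimistic estimate.
 */
int can_consolidate(unsigned long src_used_kb,
                    const unsigned long *dst_free_kb, size_t ndst)
{
    unsigned long free_total = 0;
    size_t i;

    assert(dst_free_kb != NULL || ndst == 0);
    for (i = 0; i < ndst; i++)
        free_total += dst_free_kb[i];
    return src_used_kb <= free_total;
}
```

A caller would read each node's meminfo file line by line, feed the lines
to parse_node_kb(), and pass the collected MemUsed and MemFree values to
can_consolidate() before asking for a migration.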
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 9:06 ` Paul Menage
@ 2006-11-30 9:21 ` Nick Piggin
2006-11-30 9:45 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 9:21 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> >
>> > Being able to say "try to move all memory from this node to this other
>> > set of nodes" seems like a generically useful thing even for other
>> > uses (e.g. hot unplug, general HPC numa systems, etc).
>>
>> AFAIK they do that in their higher level APIs (at least HPC numa does).
>
> Could you point me at an example?

kernel/cpuset.c:cpuset_migrate_mm

>> So your API could be some directive to consolidate? You could get
>> pretty accurate estimates with page statistics, as to whether it
>> can be done or not.
>
> Yes, and exposing those statistics (already available in
> /sys/devices/system/node/node*/meminfo) and the low-level mechanism for
> migration are, to me, things that are appropriate for the kernel. I'm
> not sure what a specific "consolidation API" would look like, beyond
> the API that I'm already proposing (migrate memory from node X to
> nodes A,B,C)

How about "try to change the memory reservation charge of this
'container' from xMB to yMB"? Underneath that API, your fakenode
controller would do the node reclaim and consolidation stuff --
but it could be implemented completely differently in the case of
a different type of controller.

>> The cpusets code is definitely similar to what memory resource control
>> needs. I don't think that a resource control API needs to be tied to
>> such granular, hard limits as the fakenodes code provides though. But
>> maybe I'm wrong and it really would be acceptable for everyone.
>
> Ah. This isn't intended to be specifically a "resource control API".
> It's more intended to be an API that could be useful for certain kinds
> of resource control, but could also be generically useful.

If it is exporting any kind of implementation details, then it needs
to be justified with a specific user that can't be implemented in a
better way, IMO.

--
SUSE Labs, Novell Inc.
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 9:21 ` Nick Piggin
@ 2006-11-30 9:45 ` Paul Menage
2006-11-30 10:15 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 9:45 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm

On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >> AFAIK they do that in their higher level APIs (at least HPC numa does).
> >
> > Could you point me at an example?
>
> kernel/cpuset.c:cpuset_migrate_mm

No, that doesn't really do what we want. It basically just calls
do_migrate_pages, which has the drawbacks of:

- it has no way to try to migrate memory from one source node to
  multiple destination nodes.

- it doesn't (as far as I can tell) migrate unmapped file pages in the
  page cache.

- it scans every page table entry of every mm in the process. If your
  nodes are relatively small compared to your processes, this is likely
  to be much more heavyweight than just trying to migrate each page in a
  node.

(I realise that there are some unsolved implementation issues with
migrating pages whilst not holding an mmap_sem of an mm that's mapping
them; that's something that we would need to solve)

> How about "try to change the memory reservation charge of this
> 'container' from xMB to yMB"? Underneath that API, your fakenode
> controller would do the node reclaim and consolidation stuff --
> but it could be implemented completely differently in the case of
> a different type of controller.

How would it make decisions such as which node to free up (e.g.
userspace might have a strong preference for keeping a job on one
particular real node, or moving it to a different one.) I think that
policy decisions like this belong in userspace, in the same way that
the existing cpusets API provides a way to say "this cpuset uses these
nodes" rather than "this cpuset should have N nodes".

If the API was expressive enough to say "try to shrink this cpuset by
X MB, with amount Y of effort, trying to evict nodes in the priority
order A,B,C" that might be a good start.

> >> The cpusets code is definitely similar to what memory resource control
> >> needs. I don't think that a resource control API needs to be tied to
> >> such granular, hard limits as the fakenodes code provides though. But
> >> maybe I'm wrong and it really would be acceptable for everyone.
> >
> > Ah. This isn't intended to be specifically a "resource control API".
> > It's more intended to be an API that could be useful for certain kinds
> > of resource control, but could also be generically useful.
>
> If it is exporting any kind of implementation details, then it needs
> to be justified with a specific user that can't be implemented in a
> better way, IMO.

It's not really exporting any more implementation details than the
existing cpusets API (i.e. explicitly binding a job to a set of nodes
chosen by userspace). The only true exposed implementation detail is
the "priority" value from try_to_free_pages, and that could be
abstracted away as a value in some range 0-N where 0 means "try very
hard" and N means "hardly try at all", and it wouldn't have to be
directly linked to the try_to_free_pages() priority.

Paul
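One way the 0-N abstraction described above might look, purely as an
illustration: a linear mapping from an abstract effort level onto a reclaim
priority floor. The function name effort_to_min_priority() and the linear
scaling are assumptions, not part of the posted patch. DEF_PRIORITY is
redefined here with the value the kernel uses (12) so that the sketch builds
in userspace.

```c
#include <assert.h>

/* Same value as the kernel's DEF_PRIORITY; repeated for a userspace build. */
#define DEF_PRIORITY 12

/*
 * Map an abstract effort level in [0, max_level] onto a reclaim priority
 * floor in [0, DEF_PRIORITY]. Level 0 ("try very hard") lets reclaim run
 * all the way down to priority 0; max_level ("hardly try at all") stops
 * after the first, lightest pass. Hypothetical helper.
 */
int effort_to_min_priority(unsigned int level, unsigned int max_level)
{
    assert(max_level > 0);
    if (level > max_level)
        level = max_level;      /* clamp out-of-range requests */
    /* Round to nearest so the levels spread evenly over the range. */
    return (int)((level * DEF_PRIORITY + max_level / 2) / max_level);
}
```

With a mapping of this kind the sysfs file could accept any range that is
convenient for userspace, and the kernel would remain free to change how
the value translates into scanning effort.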
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 9:45 ` Paul Menage
@ 2006-11-30 10:15 ` Nick Piggin
2006-11-30 10:40 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 10:15 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> >> AFAIK they do that in their higher level APIs (at least HPC numa
>> >> does).
>> >
>> > Could you point me at an example?
>>
>> kernel/cpuset.c:cpuset_migrate_mm
>
> No, that doesn't really do what we want. It basically just calls
> do_migrate_pages, which has the drawbacks of:

I know it doesn't do what you want. It is an example of using page
migration under a higher level API, which I thought is what you
wanted to see.

>> How about "try to change the memory reservation charge of this
>> 'container' from xMB to yMB"? Underneath that API, your fakenode
>> controller would do the node reclaim and consolidation stuff --
>> but it could be implemented completely differently in the case of
>> a different type of controller.
>
> How would it make decisions such as which node to free up (e.g.
> userspace might have a strong preference for keeping a job on one
> particular real node, or moving it to a different one.) I think that
> policy decisions like this belong in userspace, in the same way that
> the existing cpusets API provides a way to say "this cpuset uses these
> nodes" rather than "this cpuset should have N nodes".

Now you're talking about physical nodes as well, which is definitely
a problem you get when mixing the two.

But there is no reason why you shouldn't be able to specify physical
nodes, while also altering the reservation. Even if that does mean
hiding the fake nodes from the cpuset interface.

>> If it is exporting any kind of implementation details, then it needs
>> to be justified with a specific user that can't be implemented in a
>> better way, IMO.
>
> It's not really exporting any more implementation details than the
> existing cpusets API (i.e. explicitly binding a job to a set of nodes
> chosen by userspace). The only true exposed implementation detail is
> the "priority" value from try_to_free_pages, and that could be
> abstracted away as a value in some range 0-N where 0 means "try very
> hard" and N means "hardly try at all", and it wouldn't have to be
> directly linked to the try_to_free_pages() priority.

Or the fact that memory reservation is implemented with nodes. I'm
still not convinced that idea is the best way to export memory
control to userspace, regardless of whether it is quick and easy to
develop (or even deploy, at google).

--
SUSE Labs, Novell Inc.
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 10:15 ` Nick Piggin
@ 2006-11-30 10:40 ` Paul Menage
2006-11-30 11:04 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 10:40 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm

On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> I know it doesn't do what you want. It is an example of using page
> migration under a higher level API, which I thought is what you
> wanted to see.

I'd been talking about the possibility of doing "try to move all
memory from this node to this other set of nodes"; that wasn't an
example of such an API.

> Now you're talking about physical nodes as well, which is definitely
> a problem you get when mixing the two.
>
> But there is no reason why you shouldn't be able to specify physical
> nodes, while also altering the reservation. Even if that does mean
> hiding the fake nodes from the cpuset interface.

I think it should be possible to expose the real numa topology via the
fake topology (e.g. all fake nodes on the same real node appear to be
fairly close together, compared to any fake nodes on a different real
node). So I don't think it's necessary to have a separate abstraction
for fake vs physical nodes.

> >> If it is exporting any kind of implementation details, then it needs
> >> to be justified with a specific user that can't be implemented in a
> >> better way, IMO.
> >
> > It's not really exporting any more implementation details than the
> > existing cpusets API (i.e. explicitly binding a job to a set of nodes
> > chosen by userspace). The only true exposed implementation detail is
> > the "priority" value from try_to_free_pages, and that could be
> > abstracted away as a value in some range 0-N where 0 means "try very
> > hard" and N means "hardly try at all", and it wouldn't have to be
> > directly linked to the try_to_free_pages() priority.
>
> Or the fact that memory reservation is implemented with nodes.

Right, but to me that's a pretty fundamental design decision, rather
than an implementation detail.

> I'm
> still not convinced that idea is the best way to export memory
> control to userspace, regardless of whether it is quick and easy to
> develop (or even deploy, at google).

Maybe not the best way for all memory control, but it has certain big
advantages, such as leveraging the existing numa support, and not
requiring additional per-page overhead or LRU complexity.

Paul
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 10:40 ` Paul Menage
@ 2006-11-30 11:04 ` Nick Piggin
2006-11-30 11:23 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 11:04 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> I know it doesn't do what you want. It is an example of using page
>> migration under a higher level API, which I thought is what you
>> wanted to see.
>
> I'd been talking about the possibility of doing "try to move all
> memory from this node to this other set of nodes"; that wasn't an
> example of such an API.

Oh, well I was talking about using a higher level API rather than
migrating directly!

>> Now you're talking about physical nodes as well, which is definitely
>> a problem you get when mixing the two.
>>
>> But there is no reason why you shouldn't be able to specify physical
>> nodes, while also altering the reservation. Even if that does mean
>> hiding the fake nodes from the cpuset interface.
>
> I think it should be possible to expose the real numa topology via the
> fake topology (e.g. all fake nodes on the same real node appear to be
> fairly close together, compared to any fake nodes on a different real
> node). So I don't think it's necessary to have a separate abstraction
> for fake vs physical nodes.

Well if you want to do (real) node affinity then you need some
separation of course.

But I'm not sure that there is a good reason to use the same
abstraction. Maybe there is, but I think it needs more discussion
(unless I missed something in the past couple of weeks where you
managed to get all memory resource controller groups to agree with
your fakenodes approach).

>> >> If it is exporting any kind of implementation details, then it needs
>> >> to be justified with a specific user that can't be implemented in a
>> >> better way, IMO.
>> >
>> > It's not really exporting any more implementation details than the
>> > existing cpusets API (i.e. explicitly binding a job to a set of nodes
>> > chosen by userspace). The only true exposed implementation detail is
>> > the "priority" value from try_to_free_pages, and that could be
>> > abstracted away as a value in some range 0-N where 0 means "try very
>> > hard" and N means "hardly try at all", and it wouldn't have to be
>> > directly linked to the try_to_free_pages() priority.
>>
>> Or the fact that memory reservation is implemented with nodes.
>
> Right, but to me that's a pretty fundamental design decision, rather
> than an implementation detail.

It is a design of the implementation. The policy is to be able to
reserve memory for specific groups of tasks. And the best API is one
where userspace specifies policy.

Now there might be a few tweaks or lower level hints or calls needed
to make the implementation work really optimally. But those should be
added later, and when they are found to be required (and not just
maybe useful).

So I see nothing wrong with your exposing these things to userspace
if the goal is to test implementation or get a prototype working
quickly. But if you're talking about the upstream kernel, then I think
you need to start at a much higher level.

>> I'm
>> still not convinced that idea is the best way to export memory
>> control to userspace, regardless of whether it is quick and easy to
>> develop (or even deploy, at google).
>
> Maybe not the best way for all memory control, but it has certain big
> advantages, such as leveraging the existing numa support, and not
> requiring additional per-page overhead or LRU complexity.

Oh I agree. And I think it is one of the better implementations I
have seen. But I don't like the API.

--
SUSE Labs, Novell Inc.
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 11:04 ` Nick Piggin
@ 2006-11-30 11:23 ` Paul Menage
2006-11-30 11:35 ` Nick Piggin
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 11:23 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, akpm

On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> But I'm not sure that there is a good reason to use the same
> abstraction. Maybe there is, but I think it needs more discussion
> (unless I missed something in the past couple of weeks where you
> managed to get all memory resource controller groups to agree with
> your fakenodes approach).

No, not at all - but we've observed that:

a) people have been proposing interesting memory controller approaches
for a long time, and haven't made a great deal of progress so far, so
there's no indication that something is going to be agreed upon in the
near future

b) the cpusets and fake numa code provide a fairly serviceable
coarse-grained memory controller, modulo a few missing features such
as per-node reclaim/migration and auto-expansion (see my patch
proposal hopefully tomorrow).

Paul
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 11:23 ` Paul Menage
@ 2006-11-30 11:35 ` Nick Piggin
0 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2006-11-30 11:35 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

Paul Menage wrote:
> On 11/30/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> But I'm not sure that there is a good reason to use the same
>> abstraction. Maybe there is, but I think it needs more discussion
>> (unless I missed something in the past couple of weeks where you
>> managed to get all memory resource controller groups to agree with
>> your fakenodes approach).
>
> No, not at all - but we've observed that:

I agree with your points and I'll add a couple more.

> a) people have been proposing interesting memory controller approaches
> for a long time, and haven't made a great deal of progress so far, so
> there's no indication that something is going to be agreed upon in the
> near future

a2) and it hasn't been because they've been getting their APIs wrong

> b) the cpusets and fake numa code provide a fairly serviceable
> coarse-grained memory controller, modulo a few missing features such
> as per-node reclaim/migration and auto-expansion (see my patch
> proposal hopefully tomorrow).

b2) and it doesn't mean that it can't be used with a decent API. Or
at least, you haven't yet shown that it can't.

--
SUSE Labs, Novell Inc.
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
2006-11-29 6:07 ` Nick Piggin
@ 2006-11-30 0:18 ` KAMEZAWA Hiroyuki
2006-11-30 0:25 ` Paul Menage
2006-11-30 4:10 ` Christoph Lameter
2 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 0:18 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm

On Tue, 28 Nov 2006 19:06:56 -0800
menage@google.com wrote:

> +	for (i = 0; i < pgdat->node_spanned_pages; ++i) {
> +		struct page *page = pgdat_page_nr(pgdat, i);

you need a pfn_valid() check before accessing the page struct.

> +		if (!isolate_lru_page(page, &pagelist)) {

you'll see a panic if !PageLRU(page).

It looks like scanning the zone's LRU list is more suitable for your
purpose.

-Kame
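The point about holes in the spanned range can be illustrated with a small
userspace model. This is not kernel code: the present[] array stands in for
pfn_valid(), on_lru[] stands in for a successful isolate_lru_page(), and the
names struct node_model and count_isolatable() are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-in for a node's spanned PFN range. present[i] plays
 * the role of pfn_valid(): false marks a hole with no struct page.
 */
struct node_model {
    unsigned long spanned_pages;
    const bool *present;
    const bool *on_lru;
};

/*
 * Count the pages that could be isolated, skipping holes before any
 * per-page state is read, which is what the review comment asks for.
 */
unsigned long count_isolatable(const struct node_model *node,
                               unsigned long *skipped_holes)
{
    unsigned long i, count = 0, holes = 0;

    assert(node != NULL && node->present != NULL && node->on_lru != NULL);
    for (i = 0; i < node->spanned_pages; i++) {
        if (!node->present[i]) {    /* models !pfn_valid(pfn) */
            holes++;
            continue;
        }
        if (node->on_lru[i])        /* models isolate_lru_page() succeeding */
            count++;
    }
    if (skipped_holes != NULL)
        *skipped_holes = holes;
    return count;
}
```

Walking the zone LRU lists instead, as suggested in the message above,
avoids the validity check altogether because only pages that actually exist
are ever linked into those lists.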
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 0:18 ` KAMEZAWA Hiroyuki
@ 2006-11-30 0:25 ` Paul Menage
2006-11-30 0:38 ` KAMEZAWA Hiroyuki
2006-11-30 4:15 ` Christoph Lameter
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 0:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm

On 11/29/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 28 Nov 2006 19:06:56 -0800
> menage@google.com wrote:
>
> > +	for (i = 0; i < pgdat->node_spanned_pages; ++i) {
> > +		struct page *page = pgdat_page_nr(pgdat, i);
> you need a pfn_valid() check before accessing the page struct.

OK. (That check can only fail if CONFIG_SPARSEMEM, right?)

> > +		if (!isolate_lru_page(page, &pagelist)) {
> you'll see a panic if !PageLRU(page).

In which kernel version? In 2.6.19-rc6 (also -mm1) there's no panic in
isolate_lru_page().

Paul
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 0:25 ` Paul Menage
@ 2006-11-30 0:38 ` KAMEZAWA Hiroyuki
2006-11-30 4:15 ` Christoph Lameter
1 sibling, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 0:38 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

On Wed, 29 Nov 2006 16:25:22 -0800
"Paul Menage" <menage@google.com> wrote:

> On 11/29/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 28 Nov 2006 19:06:56 -0800
> > menage@google.com wrote:
> >
> > > +	for (i = 0; i < pgdat->node_spanned_pages; ++i) {
> > > +		struct page *page = pgdat_page_nr(pgdat, i);
> > you need a pfn_valid() check before accessing the page struct.
>
> OK. (That check can only fail if CONFIG_SPARSEMEM, right?)
>
No, ia64's virtual memmap will fail too.

> > > +		if (!isolate_lru_page(page, &pagelist)) {
> > you'll see a panic if !PageLRU(page).
>
> In which kernel version? In 2.6.19-rc6 (also -mm1) there's no panic in
> isolate_lru_page().
>
Sorry, my mistake. I checked isolate_lru_pages() (><

-Kame
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-30 0:25 ` Paul Menage
2006-11-30 0:38 ` KAMEZAWA Hiroyuki
@ 2006-11-30 4:15 ` Christoph Lameter
1 sibling, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:15 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On Wed, 29 Nov 2006, Paul Menage wrote:

> In which kernel version? In 2.6.19-rc6 (also -mm1) there's no panic in
> isolate_lru_page().

Depends on the hardware and the Linux configuration: sparsemem,
virtual_memmap etc.
* Re: [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
2006-11-29 6:07 ` Nick Piggin
2006-11-30 0:18 ` KAMEZAWA Hiroyuki
@ 2006-11-30 4:10 ` Christoph Lameter
2 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:10 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm

On Tue, 28 Nov 2006, menage@google.com wrote:

> +	for (i = 0; i < pgdat->node_spanned_pages; ++i) {
> +		struct page *page = pgdat_page_nr(pgdat, i);
> +		if (!isolate_lru_page(page, &pagelist)) {
> +			pagecount++;
> +		} else {
> +			failcount++;
> +		}
> +	}

Go along the active / inactive LRU lists? isolate_lru_page will not
allow you to isolate other pages. If you go along the LRU lists then you
also avoid having to deal with holes in the memory map. You cannot
simply assume that all struct pages in the area are accessible.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-29 3:06 [RFC][PATCH 0/1] Node-based reclaim/migration menage
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
@ 2006-11-30 0:31 ` KAMEZAWA Hiroyuki
2006-11-30 0:31 ` Paul Menage
2006-11-30 4:04 ` Christoph Lameter
2 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 0:31 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm

On Tue, 28 Nov 2006 19:06:55 -0800
menage@google.com wrote:

> We're trying to use NUMA node isolation as a form of job resource
> control at Google, and the existing page migration APIs are all bound
> to individual processes and so are a bit clunky to use when you just
> want to affect all the pages on a given node.
>
> How about an API to allow userspace to direct page migration (and page
> reclaim) on a per-node basis? This patch provides such an API, based
> around sysfs; a system call approach would certainly be possible too.
>
> It sort of overlaps with memory hot-unplug, but is simpler since it's
> not so bad if we miss a few pages.
>
> Comments? Also, can anyone clarify whether I need any locking when
> scanning the pages in a pgdat? As far as I can see, even with memory
> hotplug this number can only increase, not decrease.
>

Hi, I'm one of the memory hot-unplug developers. (But I can't go ahead
for now.)

A few comments.

1. memory hot-unplug will be implemented based on *section*, not on
   *node*. The section <-> node relationship will be displayed.
2. AFAIK, migrating pages without taking the write lock of any mm->sem
   will cause problems. The anon_vma can be freed during migration.
3. It's maybe better to add a hook to stop page allocation from the
   target node (zone). You may want to use this feature under heavy
   load.

Thanks,
-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 0:31 ` [RFC][PATCH 0/1] Node-based reclaim/migration KAMEZAWA Hiroyuki
@ 2006-11-30 0:31 ` Paul Menage
2006-11-30 4:11 ` KAMEZAWA Hiroyuki
2006-11-30 4:17 ` Christoph Lameter
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 0:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm

On 11/29/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> 2. AFAIK, migrating pages without taking the write lock of any mm->sem
>    will cause problems. The anon_vma can be freed during migration.

Hmm, isn't migration just analogous to swapping out and swapping back
in again, but without the actual swapping?

If what you describe is a problem, then wouldn't you have a problem if
you were doing migration on a particular mm structure, but it was
sharing pages with another mm?

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 0:31 ` Paul Menage
@ 2006-11-30 4:11 ` KAMEZAWA Hiroyuki
2006-11-30 4:17 ` Christoph Lameter
1 sibling, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 4:11 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, akpm

On Wed, 29 Nov 2006 16:31:22 -0800
"Paul Menage" <menage@google.com> wrote:

> On 11/29/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > 2. AFAIK, migrating pages without taking the write lock of any mm->sem
> >    will cause problems. The anon_vma can be freed during migration.
>
> Hmm, isn't migration just analogous to swapping out and swapping back
> in again, but without the actual swapping?
>

I'm sorry if there is no problem in the *current* kernel.

== See ==
http://lkml.org/lkml/2006/4/17/168
>mmap_sem must be held during page migration due to the way we retrieve the
>anonymous vma.
========

Logic

Consider migrating oldpage to newpage:

1. We unmap the oldpage at migration. page->mapcount becomes 0.
2. Copy the contents of the oldpage to a newpage.
   page->mapcount of both pages is 0.
3. Map the newpage. This uses the copied newpage->mapping.
   page->mapcount goes up.

And see rmap.c
==
void page_remove_rmap(struct page *page)
{
	if (atomic_add_negative(-1, &page->_mapcount)) {
#ifdef CONFIG_DEBUG_VM
		if (unlikely(page_mapcount(page) < 0)) {
			printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
			printk (KERN_EMERG " page->flags = %lx\n", page->flags);
			printk (KERN_EMERG " page->count = %x\n", page_count(page));
			printk (KERN_EMERG " page->mapping = %p\n", page->mapping);
		}
#endif
		BUG_ON(page_mapcount(page) < 0);
		/*
		 * It would be tidy to reset the PageAnon mapping here,
		 * but that might overwrite a racing page_add_anon_rmap
		 * which increments mapcount after us but sets mapping
		 * before us: so leave the reset to free_hot_cold_page,
		 * and remember that it's only reliable while mapped.
		 * Leaving it set also helps swapoff to reinstate ptes
		 * faster for those pages still in swapcache.
		 */
		if (page_test_and_clear_dirty(page))
			set_page_dirty(page);
		__dec_zone_page_state(page,
				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
	}
}
====

We cannot trust page->mapping if page->mapcount == 0. File pages are
guarded by the address_space's lock.

If mm->sem is held, the oldpage's anon_vma/vm_area_struct will not
change. Then, the relationship between oldpage and anon_vma will not
change. So, page migration with mm->sem held is safe.

-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 0:31 ` Paul Menage
2006-11-30 4:11 ` KAMEZAWA Hiroyuki
@ 2006-11-30 4:17 ` Christoph Lameter
2006-11-30 10:45 ` Paul Menage
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:17 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On Wed, 29 Nov 2006, Paul Menage wrote:

> Hmm, isn't migration just analogous to swapping out and swapping back
> in again, but without the actual swapping?

That used to be the case in the beginning. Not anymore. The page is
directly moved to the target. Migration via swap is no longer supported.

> If what you describe is a problem, then wouldn't you have a problem if
> you were doing migration on a particular mm structure, but it was
> sharing pages with another mm?

You do not have a problem as long as you hold a mmap_sem lock on any of
the vmas in which the page appears. Kame and I discussed several
approaches on how to avoid the issue in the past but so far there was no
need to resolve the issue.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 4:17 ` Christoph Lameter
@ 2006-11-30 10:45 ` Paul Menage
2006-11-30 11:12 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 10:45 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On 11/29/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> You do not have a problem as long as you hold a mmap_sem lock on any of
> the vmas in which the page appears. Kame and I discussed several
> approaches on how to avoid the issue in the past but so far there was no
> need to resolve the issue.
>

It sounds like this would be useful for memory hot-unplug too, though.
A problem worth solving?

Why isn't page_lock_anon_vma() safe to use in this case? Because after
we've established migration ptes, page_mapped() will be false and so
page_lock_anon_vma() will return NULL?

How does kswapd do this safely?

Possible approach (apologies if you've already considered and rejected this):

- add a migration_count field to anon_vma

- use page_lock_anon_vma() to get the anon_vma for a page, assuming
  it's mapped; if it's unmapped or if the anon_vma that we get has no
  linked vmas then we ignore it - the chances are that the page is in
  the process of being freed anyway, and if someone happens to remap it
  just before it's freed then we can catch it next time around.

- isolate_lru_page() can bump migration_count for every page that it
  isolates

- anon_vma_unlink() won't free the anon_vma if its migration_count is >0.

- remove_anon_migration_ptes() can free the anon_vma if
  migration_count is 0 and the vma list is empty.

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 10:45 ` Paul Menage
@ 2006-11-30 11:12 ` KAMEZAWA Hiroyuki
2006-11-30 11:25 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 11:12 UTC (permalink / raw)
To: Paul Menage; +Cc: clameter, linux-mm, akpm

On Thu, 30 Nov 2006 02:45:51 -0800
"Paul Menage" <menage@google.com> wrote:

> On 11/29/06, Christoph Lameter <clameter@sgi.com> wrote:
> >
> > You do not have a problem as long as you hold a mmap_sem lock on any of
> > the vmas in which the page appears. Kame and I discussed several
> > approaches on how to avoid the issue in the past but so far there was no
> > need to resolve the issue.
> >
>
> It sounds like this would be useful for memory hot-unplug too, though.
> A problem worth solving?
>
It is unsolved only because 'there is no user'.
If you fix it, I'd welcome it.

> Why isn't page_lock_anon_vma() safe to use in this case? Because after
> we've established migration ptes, page_mapped() will be false and so
> page_lock_anon_vma() will return NULL?

page_lock_anon_vma() will return NULL because mapcount is 0.
We have to guarantee that we can trust the anon_vma (from page->mapping)
even if page->mapcount is 0. There may be several ways to do that.

> How does kswapd do this safely?
>
kswapd doesn't touch page->mapping after page_mapcount() goes down to 0.

> Possible approach (apologies if you've already considered and rejected this):
>
As you pointed out, there will be several approaches.

I think one of the biggest concerns will be the performance impact. And since
this will touch the objrmap core, it would be good to start the discussion
with a patch.

-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 11:12 ` KAMEZAWA Hiroyuki
@ 2006-11-30 11:25 ` Paul Menage
2006-11-30 12:18 ` KAMEZAWA Hiroyuki
2006-11-30 18:28 ` Christoph Lameter
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 11:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: clameter, linux-mm, akpm

On 11/30/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > How does kswapd do this safely?
> >
> kswapd doesn't touch page->mapping after page_mapcount() goes down to 0.

OK, so we could do the same, and just assume that pages with a
page_mapcount() of 0 are either about to be freed or can be picked up
on a later migration sweep. Is it common for a page to have a 0
page_mapcount() for a long period of time without being freed or
remapped?

>
> I think one of the biggest concerns will be the performance impact. And since
> this will touch the objrmap core, it would be good to start the discussion
> with a patch.
>

I'll have a go. My initial thought is that the only performance impact
on the rmap core would be that anon_vma_unlink() would need one extra
check when determining whether to free an anon_vma.

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 11:25 ` Paul Menage
@ 2006-11-30 12:18 ` KAMEZAWA Hiroyuki
2006-11-30 18:28 ` Christoph Lameter
1 sibling, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-11-30 12:18 UTC (permalink / raw)
To: Paul Menage; +Cc: clameter, linux-mm, akpm

On Thu, 30 Nov 2006 03:25:21 -0800
"Paul Menage" <menage@google.com> wrote:

> On 11/30/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > How does kswapd do this safely?
> > >
> > kswapd doesn't touch page->mapping after page_mapcount() goes down to 0.
>
> OK, so we could do the same, and just assume that pages with a
> page_mapcount() of 0 are either about to be freed or can be picked up
> on a later migration sweep. Is it common for a page to have a 0
> page_mapcount() for a long period of time without being freed or
> remapped?
>
See shrink_page_list(): unmap -> (write to swap) -> freed.
How long the page stays unmapped depends on how long write-back takes.

-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 11:25 ` Paul Menage
2006-11-30 12:18 ` KAMEZAWA Hiroyuki
@ 2006-11-30 18:28 ` Christoph Lameter
2006-11-30 18:35 ` Paul Menage
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 18:28 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On Thu, 30 Nov 2006, Paul Menage wrote:

> OK, so we could do the same, and just assume that pages with a
> page_mapcount() of 0 are either about to be freed or can be picked up
> on a later migration sweep. Is it common for a page to have a 0
> page_mapcount() for a long period of time without being freed or
> remapped?

The page mapcount goes to zero during migration because the references to
the page are removed.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 18:28 ` Christoph Lameter
@ 2006-11-30 18:35 ` Paul Menage
2006-11-30 18:39 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 18:35 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 30 Nov 2006, Paul Menage wrote:
>
> > OK, so we could do the same, and just assume that pages with a
> > page_mapcount() of 0 are either about to be freed or can be picked up
> > on a later migration sweep. Is it common for a page to have a 0
> > page_mapcount() for a long period of time without being freed or
> > remapped?
>
> The page mapcount goes to zero during migration because the references to
> the page are removed.
>

Yes, but I meant for reasons other than migration.

It sounds as though if we come across a page with page_mapcount() = 0
while gathering pages for migration, it's probably in the process of
being swapped out and so is best not to muck around with anyway?

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 18:35 ` Paul Menage
@ 2006-11-30 18:39 ` Christoph Lameter
2006-11-30 19:09 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 18:39 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On Thu, 30 Nov 2006, Paul Menage wrote:

> It sounds as though if we come across a page with page_mapcount() = 0
> while gathering pages for migration, it's probably in the process of
> being swapped out and so is best not to muck around with anyway?

For example, a page cache page may have mapcount == 0. Mapcount 0 only
means that the page is not mapped into any process's memory via a page
table. It may be used for purposes that do not require mapping into a
process's memory.

If the reference count is zero (page freed) then page migration will
discard the page and consider the migration a success.
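The distinction drawn above between the mapcount and the reference count can be written down as a small decision table. This is only a userspace sketch: `classify_page()` and the `PAGE_MODEL_*` names are hypothetical and exist for illustration, and the two integers stand in for what `page_count()` and `page_mapcount()` would report in the kernel.

```c
#include <assert.h>

/* Hypothetical states for illustration; these are not kernel constants. */
enum page_state_model {
	PAGE_MODEL_FREED,	/* no references left: migration can discard it */
	PAGE_MODEL_UNMAPPED,	/* still referenced, but in no page table */
	PAGE_MODEL_MAPPED	/* mapped into at least one page table */
};

/*
 * Classify a page from its two counters, following the message above:
 * the reference count decides whether the page is still in use at all,
 * and the mapcount only says whether any page table points at it.
 */
static enum page_state_model classify_page(int refcount, int mapcount)
{
	if (refcount == 0)
		return PAGE_MODEL_FREED;
	if (mapcount == 0)
		return PAGE_MODEL_UNMAPPED;
	return PAGE_MODEL_MAPPED;
}
```

An unmapped page cache page falls into the middle case: it has references (the page cache holds one) but no mapping, so a mapcount of 0 on its own does not imply that the page is about to go away.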
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 18:39 ` Christoph Lameter
@ 2006-11-30 19:09 ` Paul Menage
2006-11-30 19:42 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 19:09 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> For example, a page cache page may have mapcount == 0.

OK, I was thinking just about anon pages.

For pagecache pages, it's safe to access the mapping as long as we've
locked the page, even if mapcount is 0? So we don't have the same
races?

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 19:09 ` Paul Menage
@ 2006-11-30 19:42 ` Christoph Lameter
2006-11-30 19:53 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 19:42 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On Thu, 30 Nov 2006, Paul Menage wrote:

> On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> >
> > For example, a page cache page may have mapcount == 0.
>
> OK, I was thinking just about anon pages.
>
> For pagecache pages, it's safe to access the mapping as long as we've
> locked the page, even if mapcount is 0? So we don't have the same
> races?

We have no problem with the page lock (you actually may not need any
locking since there are no references remaining to the page). The trouble
is that the vma may have vanished when we try to reestablish the pte.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 19:42 ` Christoph Lameter
@ 2006-11-30 19:53 ` Paul Menage
2006-11-30 20:00 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 19:53 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> We have no problem with the page lock (you actually may not need any
> locking since there are no references remaining to the page). The trouble
> is that the vma may have vanished when we try to reestablish the pte.
>

Why is that a problem? If the vma has gone away, then there's no need
to reestablish the pte. And remove_file_migration_ptes() appears to be
adequately protected against races with unlink_file_vma() since they
both take i_mmap_lock.

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 19:53 ` Paul Menage
@ 2006-11-30 20:00 ` Christoph Lameter
2006-11-30 20:07 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 20:00 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On Thu, 30 Nov 2006, Paul Menage wrote:

> On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> >
> > We have no problem with the page lock (you actually may not need any
> > locking since there are no references remaining to the page). The trouble
> > is that the vma may have vanished when we try to reestablish the pte.
> >
>
> Why is that a problem? If the vma has gone away, then there's no need
> to reestablish the pte. And remove_file_migration_ptes() appears to be
> adequately protected against races with unlink_file_vma() since they
> both take i_mmap_lock.

We are talking about anonymous pages here. You cannot figure out
that the vma is gone since that was the only connection to the process.
Hmm... Not true: we still have a migration pte in that process's address
space. But we cannot find the process without the anon_vma.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 20:00 ` Christoph Lameter
@ 2006-11-30 20:07 ` Paul Menage
2006-11-30 20:15 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 20:07 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm

On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> >
> > Why is that a problem? If the vma has gone away, then there's no need
> > to reestablish the pte. And remove_file_migration_ptes() appears to be
> > adequately protected against races with unlink_file_vma() since they
> > both take i_mmap_lock.
>
> We are talking about anonymous pages here.

No, I was talking about pagecache pages by this point - you'd
mentioned them as the case where page_mapcount() can be 0 for a long
period of time.

> You cannot figure out
> that the vma is gone since that was the only connection to the process.
> Hmm... Not true: we still have a migration pte in that process's address
> space. But we cannot find the process without the anon_vma.

What did you think of the approach that I proposed of adding a
migration count to anon_vma? anon_vma_unlink() doesn't free the
anon_vma if migration count is non-zero.

When gathering pages for migration, we use page_lock_anon_vma() to get
the anon_vma; if it returns NULL or has an empty vma list we skip the
page, else we bump migration count (and mapcount?) by 1 and unlock.
That will guarantee that the anon_vma sticks around until the end of
the migration.

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 20:07 ` Paul Menage
@ 2006-11-30 20:15 ` Christoph Lameter
2006-11-30 21:33 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 20:15 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm, Hugh Dickins

On Thu, 30 Nov 2006, Paul Menage wrote:

> > We are talking about anonymous pages here.
>
> No, I was talking about pagecache pages by this point - you'd
> mentioned them as the case where page_mapcount() can be 0 for a long
> period of time.

Right, but pagecache pages are mapped differently, by a mapping attached to
the inode. The vma does not vanish. We have to distinguish clearly
between anonymous and file based pages.

> > You cannot figure out
> > that the vma is gone since that was the only connection to the process.
> > Hmm... Not true: we still have a migration pte in that process's address
> > space. But we cannot find the process without the anon_vma.
>
> What did you think of the approach that I proposed of adding a
> migration count to anon_vma? anon_vma_unlink() doesn't free the
> anon_vma if migration count is non-zero.

Hmmm.. Well, talk to Hugh Dickins about that. anon_vmas are very
performance sensitive things.

> When gathering pages for migration, we use page_lock_anon_vma() to get
> the anon_vma; if it returns NULL or has an empty vma list we skip the
> page, else we bump migration count (and mapcount?) by 1 and unlock.
> That will guarantee that the anon_vma sticks around until the end of
> the migration.

You cannot use page_lock_anon_vma since the mapcount of the page is
zero. Something must be done before we reduce the mapcount to zero
to pin the vma.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 20:15 ` Christoph Lameter
@ 2006-11-30 21:33 ` Paul Menage
2006-11-30 23:41 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-11-30 21:33 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, akpm, Hugh Dickins

On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> Hmmm.. Well, talk to Hugh Dickins about that. anon_vmas are very
> performance sensitive things.
>
> > When gathering pages for migration, we use page_lock_anon_vma() to get
> > the anon_vma; if it returns NULL or has an empty vma list we skip the
> > page, else we bump migration count (and mapcount?) by 1 and unlock.
> > That will guarantee that the anon_vma sticks around until the end of
> > the migration.
>
> You cannot use page_lock_anon_vma since the mapcount of the page is
> zero.

Let me clarify my proposal:

1) When gathering pages we find an anon page

2) We call page_lock_anon_vma(); if it returns NULL we ignore the page

3) If the anon_vma has an empty vma list, we ignore the page

4) We increment page_mapcount(); if this crosses the boundary from
   unmapped to mapped, we know that we're racing with someone else;
   either ignore the page or start again

5) If page->mapping no longer refers to our anon_vma, we know we're
   racing; drop page_mapcount and ignore the page or start again

6) We increment anon_vma->migration_count to pin the anon_vma

At this point we know that the vma isn't going to go away since it's
pinned via the migration count, and any new users of the page will use
the pinned anon_vma since page_mapcount() is positive.

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 21:33 ` Paul Menage
@ 2006-11-30 23:41 ` Christoph Lameter
2006-11-30 23:48 ` Paul Menage
0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 23:41 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm

I think your initial suggestion of adding a counter to the anon_vma may
work. Here is a patch that may allow us to keep the anon_vma around
without holding mmap_sem. Seems to be simple.

Hugh?

Index: linux-2.6.19-rc6-mm2/include/linux/rmap.h
===================================================================
--- linux-2.6.19-rc6-mm2.orig/include/linux/rmap.h	2006-11-15 22:03:40.000000000 -0600
+++ linux-2.6.19-rc6-mm2/include/linux/rmap.h	2006-11-30 17:39:17.643728656 -0600
@@ -26,6 +26,7 @@
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+	int migration_count;	/* # processes migrating pages */
 };
 
 #ifdef CONFIG_MMU
Index: linux-2.6.19-rc6-mm2/mm/migrate.c
===================================================================
--- linux-2.6.19-rc6-mm2.orig/mm/migrate.c	2006-11-29 18:37:17.797934398 -0600
+++ linux-2.6.19-rc6-mm2/mm/migrate.c	2006-11-30 17:39:48.429639786 -0600
@@ -218,6 +218,7 @@ static void remove_anon_migration_ptes(s
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
 	unsigned long mapping;
+	int empty;
 
 	mapping = (unsigned long)new->mapping;
 
@@ -229,11 +230,15 @@ static void remove_anon_migration_ptes(s
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
+	anon_vma->migration_count--;
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
+	empty = list_empty(&anon_vma->head);
 	spin_unlock(&anon_vma->lock);
+	if (empty)
+		anon_vma_free(anon_vma);
 }
 
 /*
Index: linux-2.6.19-rc6-mm2/mm/rmap.c
===================================================================
--- linux-2.6.19-rc6-mm2.orig/mm/rmap.c	2006-11-15 22:03:40.000000000 -0600
+++ linux-2.6.19-rc6-mm2/mm/rmap.c	2006-11-30 17:39:17.795109159 -0600
@@ -151,7 +151,7 @@ void anon_vma_unlink(struct vm_area_stru
 	list_del(&vma->anon_vma_node);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head);
+	empty = list_empty(&anon_vma->head) && !anon_vma->migration_count;
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -787,6 +787,9 @@ static int try_to_unmap_anon(struct page
 	if (!anon_vma)
 		return ret;
 
+	if (migration)
+		anon_vma->migration_count++;
+
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
 		ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 23:41 ` Christoph Lameter
@ 2006-11-30 23:48 ` Paul Menage
2006-12-01 2:23 ` Christoph Lameter
2006-12-01 2:44 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 54+ messages in thread
From: Paul Menage @ 2006-11-30 23:48 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm

On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> I think your initial suggestion of adding a counter to the anon_vma may
> work. Here is a patch that may allow us to keep the anon_vma around
> without holding mmap_sem. Seems to be simple.

Don't we need to bump the mapcount? If we don't, then the page gets
unmapped by the migration prep, and if we race with anyone trying to
map it they may allocate a new anon_vma and replace it.

> --- linux-2.6.19-rc6-mm2.orig/mm/migrate.c	2006-11-29 18:37:17.797934398 -0600
> +++ linux-2.6.19-rc6-mm2/mm/migrate.c	2006-11-30 17:39:48.429639786 -0600
> @@ -218,6 +218,7 @@ static void remove_anon_migration_ptes(s
>  	struct anon_vma *anon_vma;
>  	struct vm_area_struct *vma;
>  	unsigned long mapping;
> +	int empty;
>
>  	mapping = (unsigned long)new->mapping;
>
> @@ -229,11 +230,15 @@ static void remove_anon_migration_ptes(s
>  	 */
>  	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
>  	spin_lock(&anon_vma->lock);
> +	anon_vma->migration_count--;
>
>  	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
>  		remove_migration_pte(vma, old, new);
>
> +	empty = list_empty(&anon_vma->head);

I think we need to check for migration_count being non-zero here, just
in case two processes try to migrate the same page at once. Or maybe
just say that if migration_count is non-zero, the second migrator just
ignores the page?

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-30 23:48 ` Paul Menage
@ 2006-12-01 2:23 ` Christoph Lameter
2006-12-01 19:32 ` Paul Menage
2006-12-01 2:44 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 2:23 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm

On Thu, 30 Nov 2006, Paul Menage wrote:

> Don't we need to bump the mapcount? If we don't, then the page gets
> unmapped by the migration prep, and if we race with anyone trying to
> map it they may allocate a new anon_vma and replace it.

Allocate a new vma for an existing anon page? That never happens. We may
do COW, in which case the page is copied.

> > + empty = list_empty(&anon_vma->head);
>
> I think we need to check for migration_count being non-zero here, just
> in case two processes try to migrate the same page at once. Or maybe
> just say that if migration_count is non-zero, the second migrator just
> ignores the page?

Right, we need to check for the migration_count being zero. The one that
zeros it must free the anon_vma.
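The rule agreed on above (free the anon_vma only when the vma list is empty and the migration count has dropped to zero, whichever side gets there last) can be sketched as a small userspace model. Everything here is an assumption made for illustration: `struct anon_vma_model` and the `model_*` helpers are hypothetical names, a plain counter stands in for the vma list, a flag stands in for `anon_vma_free()`, and the locking that the kernel would do under `anon_vma->lock` is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for struct anon_vma; not the kernel structure. */
struct anon_vma_model {
	int nr_vmas;		/* stands in for the list of linked vmas */
	int migration_count;	/* migrations currently pinning the object */
	bool freed;		/* set where the kernel would free the object */
};

/* Free only when no vma is linked and no migration holds a pin. */
static bool model_try_free(struct anon_vma_model *av)
{
	if (av->nr_vmas == 0 && av->migration_count == 0) {
		av->freed = true;
		return true;
	}
	return false;
}

/* Models try_to_unmap_anon() called for migration: take a pin. */
static void model_migration_begin(struct anon_vma_model *av)
{
	av->migration_count++;
}

/* Models anon_vma_unlink(): drop one vma, free if nothing else holds on. */
static bool model_unlink_vma(struct anon_vma_model *av)
{
	av->nr_vmas--;
	return model_try_free(av);
}

/* Models remove_anon_migration_ptes(): drop the pin; last one out frees. */
static bool model_migration_end(struct anon_vma_model *av)
{
	av->migration_count--;
	return model_try_free(av);
}
```

With two concurrent migrations and the last vma going away in between, only the second `model_migration_end()` call frees the object. Checking the vma list alone at the end of a migration, as in the posted patch, would free it while the other migration still held its pin.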
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:23 ` Christoph Lameter
@ 2006-12-01 19:32 ` Paul Menage
2006-12-01 19:56 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: Paul Menage @ 2006-12-01 19:32 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm

On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 30 Nov 2006, Paul Menage wrote:
>
> > Don't we need to bump the mapcount? If we don't, then the page gets
> > unmapped by the migration prep, and if we race with anyone trying to
> > map it they may allocate a new anon_vma and replace it.
>
> Allocate a new vma for an existing anon page? That never happens. We may
> do COW, in which case the page is copied.

I was thinking of a new anon_vma, rather than a new vma - but I guess
that even if we do race with someone who's faulting on the page and
pulling it from the swap cache, they'll just set the page mapping to
the same value as it is already, rather than setting it to a new
value. So you're right, not a problem.

Paul
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 19:32 ` Paul Menage
@ 2006-12-01 19:56 ` Christoph Lameter
0 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 19:56 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, akpm

On Fri, 1 Dec 2006, Paul Menage wrote:

>
> I was thinking of a new anon_vma, rather than a new vma - but I guess
> that even if we do race with someone who's faulting on the page and
> pulling it from the swap cache, they'll just set the page mapping to
> the same value as it is already, rather than setting it to a new
> value. So you're right, not a problem.

The page is locked during migration to prevent such occurrences.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration 2006-11-30 23:48 ` Paul Menage 2006-12-01 2:23 ` Christoph Lameter @ 2006-12-01 2:44 ` KAMEZAWA Hiroyuki 2006-12-01 2:43 ` Christoph Lameter 2006-12-01 2:44 ` Christoph Lameter 1 sibling, 2 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2006-12-01 2:44 UTC (permalink / raw) To: Paul Menage; +Cc: clameter, hugh, linux-mm, akpm On Thu, 30 Nov 2006 15:48:28 -0800 "Paul Menage" <menage@google.com> wrote: > On 11/30/06, Christoph Lameter <clameter@sgi.com> wrote: > > I think you initial suggestion of adding a counter to the anon_vma may > > work. Here is a patch that may allow us to keep the anon_vma around > > without holding mmap_sem. Seems to be simple. > > Don't we need to bump the mapcount? If we don't, then the page gets > unmapped by the migration prep, and if we race with anyone trying to > map it they may allocate a new anon_vma and replace it. I don't think add *dummy* mapccount to a page is good. One way I can think of now is to make use of RCU routine for anon_vma_free() and take RCU readlock while unmap->map an anon page. This can prevent a freed anon_vma struct from being used by someone immediately. But Christoph-san's patch just uses 4bytes(int) for delayed freeing. This adds 2 pointers to each anon_vma struct, but doesn't uses any special things. This is a patch. not tested at all, just idea level. (seems a period of taking rcu_read_lock() is a bit long..) -Kame == For moving page-migration to the next step, we have to fix anon_vma problem. migration code temporally makes page->mapcount to 0. This means page->mapping is not trustful. AFAIK, anon_vma can be freed while migration if mm->sem is not taken. To make use of migration without mm->sem, we need to delay freeing of anon_vma. This patch uses RCU for delayed freeing of anon_vma. 
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/rmap.h | 10 +++++++++-
 mm/migrate.c         |  2 ++
 mm/rmap.c            |  6 ++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

Index: linux-2.6.19/include/linux/rmap.h
===================================================================
--- linux-2.6.19.orig/include/linux/rmap.h
+++ linux-2.6.19/include/linux/rmap.h
@@ -26,6 +26,7 @@
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+	struct rcu_head rcu;	/* for delayed RCU freeing */
 };

 #ifdef CONFIG_MMU
@@ -37,11 +38,18 @@ static inline struct anon_vma *anon_vma_
 	return kmem_cache_alloc(anon_vma_cachep, SLAB_KERNEL);
 }

+/*
+ * Because page->mapping(which points to anon-vma) is not cleared
+ * even if page is removed from anon_vma, we use delayed freeing
+ * of anon_vma. This makes migration safer.
+ */
+extern void delayed_anon_vma_free(struct rcu_head *head);
 static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
-	kmem_cache_free(anon_vma_cachep, anon_vma);
+	call_rcu(&anon_vma->rcu, delayed_anon_vma_free);
 }

+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
Index: linux-2.6.19/mm/migrate.c
===================================================================
--- linux-2.6.19.orig/mm/migrate.c
+++ linux-2.6.19/mm/migrate.c
@@ -618,12 +618,14 @@ static int unmap_and_move(new_page_t get
 	/*
 	 * Establish migration ptes or remove ptes
 	 */
+	rcu_read_lock();
 	try_to_unmap(page, 1);
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);

 	if (rc)
 		remove_migration_ptes(page, page);
+	rcu_read_unlock();

 unlock:
 	unlock_page(page);
Index: linux-2.6.19/mm/rmap.c
===================================================================
--- linux-2.6.19.orig/mm/rmap.c
+++ linux-2.6.19/mm/rmap.c
@@ -70,6 +70,12 @@ static inline void validate_anon_vma(str
 #endif
 }

+void delayed_anon_vma_free(struct rcu_head *head)
+{
+	struct anon_vma *anon_vma = container_of(head, struct anon_vma, rcu);
+	kmem_cache_free(anon_vma_cachep, anon_vma);
+}
+
 /* This must be called under the mmap_sem. */
 int anon_vma_prepare(struct vm_area_struct *vma)
 {
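
The delayed-free pattern used in the patch above (embed a callback head in the object, queue it, and recover the enclosing object with container_of() when the callback finally runs) can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct cb_head, defer_free(), run_pending() and struct anon_vma_demo are hypothetical stand-ins invented here, not kernel APIs, and the "grace period" is simulated by an explicit call rather than by real RCU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Same idea as the kernel's container_of(): member pointer -> enclosing object. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical object with an embedded callback head, as in the patch. */
struct anon_vma_demo {
	int id;
	struct cb_head rcu;
};

static struct cb_head *pending;	/* callbacks waiting for the simulated grace period */
static int freed_count;		/* number of objects actually freed so far */

/* Loose analogue of call_rcu(): queue the callback instead of freeing now. */
static void defer_free(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Analogue of delayed_anon_vma_free(): map the head back to its object. */
static void delayed_demo_free(struct cb_head *head)
{
	struct anon_vma_demo *av = container_of(head, struct anon_vma_demo, rcu);

	free(av);
	freed_count++;
}

/* Pretend the grace period has ended: run every queued callback. */
static void run_pending(void)
{
	while (pending) {
		struct cb_head *head = pending;

		pending = head->next;	/* read next before the callback frees head */
		head->func(head);
	}
}

/* Returns 0 if the object stayed readable until run_pending() freed it. */
static int demo_deferred_free(void)
{
	struct anon_vma_demo *av = malloc(sizeof(*av));
	int before, id_seen;

	if (!av)
		return -1;
	av->id = 1;
	defer_free(&av->rcu, delayed_demo_free);

	before = freed_count;
	id_seen = av->id;	/* still valid: the free has only been queued */
	run_pending();

	return (freed_count - before == 1 && id_seen == 1) ? 0 : 1;
}
```

The batching visible in run_pending() is also where the cost discussed later in the thread comes from: objects are released some time after their last use rather than immediately.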
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:44 ` KAMEZAWA Hiroyuki
@ 2006-12-01 2:43 ` Christoph Lameter
2006-12-01 2:59 ` KAMEZAWA Hiroyuki
2006-12-01 2:44 ` Christoph Lameter
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 2:43 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Paul Menage, hugh, linux-mm, akpm

On Fri, 1 Dec 2006, KAMEZAWA Hiroyuki wrote:

> This is a patch. not tested at all, just idea level.
> (seems a period of taking rcu_read_lock() is a bit long..)

This is what we have been trying to avoid. Using rcu means that the
anon_vma cacheline gets cold and this will badly influence benchmarks.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:43 ` Christoph Lameter
@ 2006-12-01 2:59 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-12-01 2:59 UTC (permalink / raw)
To: Christoph Lameter; +Cc: menage, hugh, linux-mm, akpm

On Thu, 30 Nov 2006 18:43:01 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 1 Dec 2006, KAMEZAWA Hiroyuki wrote:
>
> > This is a patch. not tested at all, just idea level.
> > (seems a period of taking rcu_read_lock() is a bit long..)
>
> This is what we have been trying to avoid. Using rcu means that the
> anon_vma cacheline gets cold and this will badly influence benchmarks.
>
Ah, okay. rcu's batch-freeing makes cacheline cold. sorry.

-Kame
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:44 ` KAMEZAWA Hiroyuki
2006-12-01 2:43 ` Christoph Lameter
@ 2006-12-01 2:44 ` Christoph Lameter
2006-12-01 3:10 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 2:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Paul Menage, hugh, linux-mm, akpm

Fixed up patch with more comments and a check that the migration_count is
zero before freeing.

Index: linux-2.6.19-rc6-mm2/include/linux/rmap.h
===================================================================
--- linux-2.6.19-rc6-mm2.orig/include/linux/rmap.h	2006-11-15 22:03:40.000000000 -0600
+++ linux-2.6.19-rc6-mm2/include/linux/rmap.h	2006-11-30 17:39:17.643728656 -0600
@@ -26,6 +26,7 @@
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+	int migration_count;	/* # processes migrating pages */
 };

 #ifdef CONFIG_MMU
Index: linux-2.6.19-rc6-mm2/mm/migrate.c
===================================================================
--- linux-2.6.19-rc6-mm2.orig/mm/migrate.c	2006-11-29 18:37:17.797934398 -0600
+++ linux-2.6.19-rc6-mm2/mm/migrate.c	2006-11-30 20:41:13.810836561 -0600
@@ -209,15 +209,12 @@ static void remove_file_migration_ptes(s
 	spin_unlock(&mapping->i_mmap_lock);
 }

-/*
- * Must hold mmap_sem lock on at least one of the vmas containing
- * the page so that the anon_vma cannot vanish.
- */
 static void remove_anon_migration_ptes(struct page *old, struct page *new)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
 	unsigned long mapping;
+	int empty;

 	mapping = (unsigned long)new->mapping;

@@ -225,15 +222,20 @@ static void remove_anon_migration_ptes(s
 		return;

 	/*
-	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+	 * We have increased migration_count So no need to call
+	 * page_lock_anon_vma.
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
+	anon_vma->migration_count--;

 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);

+	empty = list_empty(&anon_vma->head) && !anon_vma->migration_count;
 	spin_unlock(&anon_vma->lock);
+	if (empty)
+		anon_vma_free(anon_vma);
 }

 /*
Index: linux-2.6.19-rc6-mm2/mm/rmap.c
===================================================================
--- linux-2.6.19-rc6-mm2.orig/mm/rmap.c	2006-11-15 22:03:40.000000000 -0600
+++ linux-2.6.19-rc6-mm2/mm/rmap.c	2006-11-30 20:39:52.266554217 -0600
@@ -150,8 +150,8 @@ void anon_vma_unlink(struct vm_area_stru
 	validate_anon_vma(vma);
 	list_del(&vma->anon_vma_node);

-	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head);
+	/* We must garbage collect the anon_vma if it's unused */
+	empty = list_empty(&anon_vma->head) && !anon_vma->migration_count;
 	spin_unlock(&anon_vma->lock);

 	if (empty)
@@ -787,6 +787,10 @@ static int try_to_unmap_anon(struct page
 	if (!anon_vma)
 		return ret;

+	if (migration)
+		/* Prevent freeing while migrating pages */
+		anon_vma->migration_count++;
+
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
 		ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
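
The rule the patch above introduces (free the anon_vma only when its vma list is empty and no migration is in flight, with whichever path drops the last reference doing the free) can be modelled in a few lines of userspace C. This is a rough single-threaded sketch: struct pinned_obj and the helper names are hypothetical, the vma list is reduced to a counter, and the spinlock that serializes these updates in the kernel is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of an anon_vma carrying the patch's extra counter. */
struct pinned_obj {
	int nr_vmas;		/* stands in for the length of anon_vma->head */
	int migration_count;	/* migrations currently in flight */
};

static int objs_freed;		/* number of objects released so far */

/* The patch's condition: no vmas left and nobody migrating. */
static bool obj_unused(const struct pinned_obj *o)
{
	return o->nr_vmas == 0 && o->migration_count == 0;
}

static void obj_free(struct pinned_obj *o)
{
	free(o);
	objs_freed++;
}

/* Models try_to_unmap_anon() with migration set: pin the object. */
static void migration_begin(struct pinned_obj *o)
{
	o->migration_count++;
}

/* Models anon_vma_unlink(): returns true if this call freed the object. */
static bool vma_unlink(struct pinned_obj *o)
{
	o->nr_vmas--;
	if (!obj_unused(o))
		return false;
	obj_free(o);
	return true;
}

/* Models remove_anon_migration_ptes(): unpin, free if last user. */
static bool migration_end(struct pinned_obj *o)
{
	o->migration_count--;
	if (!obj_unused(o))
		return false;
	obj_free(o);
	return true;
}

/* Returns 0 if the pin kept the object alive until the migration ended. */
static int demo_pin(void)
{
	struct pinned_obj *o = calloc(1, sizeof(*o));

	if (!o)
		return -1;
	o->nr_vmas = 1;

	migration_begin(o);
	if (vma_unlink(o))	/* last vma goes away mid-migration */
		return 1;	/* would be a premature free */
	if (!migration_end(o))	/* the migration path is now the last user */
		return 2;
	return 0;
}
```

Compared with the RCU variant, the object here is freed immediately by its last user, which is the property that keeps the cacheline warm.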
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 2:44 ` Christoph Lameter
@ 2006-12-01 3:10 ` KAMEZAWA Hiroyuki
2006-12-01 5:28 ` Christoph Lameter
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-12-01 3:10 UTC (permalink / raw)
To: Christoph Lameter; +Cc: menage, hugh, linux-mm, akpm

On Thu, 30 Nov 2006 18:44:30 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> Fixed up patch with more comments and a check that the migration_count is
> zero before freeing.
>
Looks good, thanks. we need users and tests :)

-Kame

> Index: linux-2.6.19-rc6-mm2/include/linux/rmap.h
> ===================================================================
> --- linux-2.6.19-rc6-mm2.orig/include/linux/rmap.h	2006-11-15 22:03:40.000000000 -0600
> +++ linux-2.6.19-rc6-mm2/include/linux/rmap.h	2006-11-30 17:39:17.643728656 -0600
> @@ -26,6 +26,7 @@
>  struct anon_vma {
>  	spinlock_t lock;	/* Serialize access to vma list */
>  	struct list_head head;	/* List of private "related" vmas */
> +	int migration_count;	/* # processes migrating pages */
>  };
>
>  #ifdef CONFIG_MMU
> Index: linux-2.6.19-rc6-mm2/mm/migrate.c
> ===================================================================
> --- linux-2.6.19-rc6-mm2.orig/mm/migrate.c	2006-11-29 18:37:17.797934398 -0600
> +++ linux-2.6.19-rc6-mm2/mm/migrate.c	2006-11-30 20:41:13.810836561 -0600
> @@ -209,15 +209,12 @@ static void remove_file_migration_ptes(s
>  	spin_unlock(&mapping->i_mmap_lock);
>  }
>
> -/*
> - * Must hold mmap_sem lock on at least one of the vmas containing
> - * the page so that the anon_vma cannot vanish.
> - */
>  static void remove_anon_migration_ptes(struct page *old, struct page *new)
>  {
>  	struct anon_vma *anon_vma;
>  	struct vm_area_struct *vma;
>  	unsigned long mapping;
> +	int empty;
>
>  	mapping = (unsigned long)new->mapping;
>
> @@ -225,15 +222,20 @@ static void remove_anon_migration_ptes(s
>  		return;
>
>  	/*
> -	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
> +	 * We have increased migration_count So no need to call
> +	 * page_lock_anon_vma.
>  	 */
>  	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
>  	spin_lock(&anon_vma->lock);
> +	anon_vma->migration_count--;
>
>  	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
>  		remove_migration_pte(vma, old, new);
>
> +	empty = list_empty(&anon_vma->head) && !anon_vma->migration_count;
>  	spin_unlock(&anon_vma->lock);
> +	if (empty)
> +		anon_vma_free(anon_vma);
>  }
>
>  /*
> Index: linux-2.6.19-rc6-mm2/mm/rmap.c
> ===================================================================
> --- linux-2.6.19-rc6-mm2.orig/mm/rmap.c	2006-11-15 22:03:40.000000000 -0600
> +++ linux-2.6.19-rc6-mm2/mm/rmap.c	2006-11-30 20:39:52.266554217 -0600
> @@ -150,8 +150,8 @@ void anon_vma_unlink(struct vm_area_stru
>  	validate_anon_vma(vma);
>  	list_del(&vma->anon_vma_node);
>
> -	/* We must garbage collect the anon_vma if it's empty */
> -	empty = list_empty(&anon_vma->head);
> +	/* We must garbage collect the anon_vma if it's unused */
> +	empty = list_empty(&anon_vma->head) && !anon_vma->migration_count;
>  	spin_unlock(&anon_vma->lock);
>
>  	if (empty)
> @@ -787,6 +787,10 @@ static int try_to_unmap_anon(struct page
>  	if (!anon_vma)
>  		return ret;
>
> +	if (migration)
> +		/* Prevent freeing while migrating pages */
> +		anon_vma->migration_count++;
> +
>  	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
>  		ret = try_to_unmap_one(page, vma, migration);
>  		if (ret == SWAP_FAIL || !page_mapped(page))
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-12-01 3:10 ` KAMEZAWA Hiroyuki
@ 2006-12-01 5:28 ` Christoph Lameter
0 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-12-01 5:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: menage, hugh, linux-mm, akpm

On Fri, 1 Dec 2006, KAMEZAWA Hiroyuki wrote:

> On Thu, 30 Nov 2006 18:44:30 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Fixed up patch with more comments and a check that the migration_count is
> > zero before freeing.
> >
>
> Looks good, thanks. we need users and tests :)

Yeah we would need something that is not process based. Paul Menage may
have something.
* Re: [RFC][PATCH 0/1] Node-based reclaim/migration
2006-11-29 3:06 [RFC][PATCH 0/1] Node-based reclaim/migration menage
2006-11-29 3:06 ` [RFC][PATCH 1/1] Expose per-node reclaim and migration to userspace menage
2006-11-30 0:31 ` [RFC][PATCH 0/1] Node-based reclaim/migration KAMEZAWA Hiroyuki
@ 2006-11-30 4:04 ` Christoph Lameter
2 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-11-30 4:04 UTC (permalink / raw)
To: menage; +Cc: linux-mm, akpm

On Tue, 28 Nov 2006, menage@google.com wrote:

> Comments? Also, can anyone clarify whether I need any locking when
> scanning the pages in a pgdat? As far as I can see, even with memory
> hotplug this number can only increase, not decrease.

That depends on the way you scan...