On 6/30/20 12:23 AM, Huang, Ying wrote:
> Hi, Dave,
>
> Dave Hansen writes:
>
>> From: Dave Hansen
>>
>> Some method is obviously needed to enable reclaim-based migration.
>>
>> Just like traditional autonuma, there will be some workloads that
>> will benefit, like workloads with more "static" configurations where
>> hot pages stay hot and cold pages stay cold.  If pages come and go
>> from the hot and cold sets, the benefits of this approach will be
>> more limited.
>>
>> The benefits are truly workload-based and *not* hardware-based.
>> We do not believe that there is a viable threshold where certain
>> hardware configurations should have this mechanism enabled while
>> others do not.
>>
>> To be conservative, earlier work defaulted to disabling reclaim-
>> based migration and did not include a mechanism to enable it.
>> This proposes extending the existing "zone_reclaim_mode" (now
>> really node_reclaim_mode) as a method to enable it.
>>
>> We are open to any alternative that allows end users to enable
>> this mechanism or disable it if workload harm is detected (just
>> like traditional autonuma).
>>
>> The implementation here is pretty simple and entirely unoptimized.
>> On any memory hotplug events, assume that a node was added or
>> removed and recalculate all migration targets.  This ensures that
>> the node_demotion[] array is always ready to be used in case the
>> new reclaim mode is enabled.  This recalculation is far from
>> optimal, most glaringly in that it does not even attempt to figure
>> out whether nodes are actually coming or going.
>>
>> Signed-off-by: Dave Hansen
>> Cc: Yang Shi
>> Cc: David Rientjes
>> Cc: Huang Ying
>> Cc: Dan Williams
>> ---
>>
>>  b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>>  b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>>  b/mm/vmscan.c                             |    7 +--
>>  3 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
>> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
>> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
>> @@ -941,6 +941,7 @@ This is value OR'ed together of
>>  1	(bit currently ignored)
>>  2	Zone reclaim writes dirty pages out
>>  4	Zone reclaim swaps pages
>> +8	Zone reclaim migrates pages
>>  =	===================================
>>
>>  zone_reclaim_mode is disabled by default.  For file servers or workloads
>> @@ -965,3 +966,11 @@ of other processes running on other node
>>  Allowing regular swap effectively restricts allocations to the local
>>  node unless explicitly overridden by memory policies or cpuset
>>  configurations.
>> +
>> +Page migration during reclaim is intended for systems with tiered memory
>> +configurations.  These systems have multiple types of memory with varied
>> +performance characteristics instead of plain NUMA systems where the same
>> +kind of memory is found at varied distances.  Allowing page migration
>> +during reclaim enables these systems to migrate pages from fast tiers to
>> +slow tiers when the fast tier is under pressure.  This migration is
>> +performed before swap.
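
A quick usage note on the documentation hunk above: since
zone_reclaim_mode is a bitmask, the new behavior composes with the
existing bits.  Enabling just reclaim-based migration should be a
matter of setting bit 3, presumably something like:

	sysctl -w vm.zone_reclaim_mode=8

(or writing the value to /proc/sys/vm/zone_reclaim_mode, OR'ed
together with whichever of the existing bits are also wanted).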
>> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
>> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
>> @@ -49,6 +49,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include
>>
>> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>>  	 * Avoid any oddities like cycles that could occur
>>  	 * from changes in the topology.  This will leave
>>  	 * a momentary gap when migration is disabled.
>> +	 *
>> +	 * This is superfluous for memory offlining since
>> +	 * MEM_GOING_OFFLINE does it independently, but it
>> +	 * does not hurt to do it a second time.
>>  	 */
>>  	disable_all_migrate_targets();
>>
>> @@ -3211,6 +3216,60 @@ again:
>>  	/* Is another pass necessary? */
>>  	if (!nodes_empty(next_pass))
>>  		goto again;
>> +}
>>
>> -	put_online_mems();
>> +/*
>> + * React to hotplug events that might online or offline
>> + * NUMA nodes.
>> + *
>> + * This leaves migrate-on-reclaim transiently disabled
>> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
>> + * This runs whether RECLAIM_MIGRATE is enabled or not.
>> + * That ensures that the user can turn RECLAIM_MIGRATE
>> + * on or off without needing to recalculate migration
>> + * targets.
>> + */
>> +#if defined(CONFIG_MEMORY_HOTPLUG)
>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>> +						 unsigned long action, void *arg)
>> +{
>> +	switch (action) {
>> +	case MEM_GOING_OFFLINE:
>> +		/*
>> +		 * Make sure there are not transient states where
>> +		 * an offline node is a migration target.  This
>> +		 * will leave migration disabled until the offline
>> +		 * completes and the MEM_OFFLINE case below runs.
>> +		 */
>> +		disable_all_migrate_targets();
>> +		break;
>> +	case MEM_OFFLINE:
>> +	case MEM_ONLINE:
>> +		/*
>> +		 * Recalculate the target nodes once the node
>> +		 * reaches its final state (online or offline).
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_CANCEL_OFFLINE:
>> +		/*
>> +		 * MEM_GOING_OFFLINE disabled all the migration
>> +		 * targets.  Reenable them.
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_GOING_ONLINE:
>> +	case MEM_CANCEL_ONLINE:
>> +		break;
>> +	}
>> +
>> +	return notifier_from_errno(0);
>>  }
>> +
>> +static int __init migrate_on_reclaim_init(void)
>> +{
>> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
>> +	return 0;
>> +}
>> +late_initcall(migrate_on_reclaim_init);
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>> +
>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>   * ABI.  New bits are OK, but existing bits can never change.
>>   */
>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_RSVD    (1<<0)	/* (currently ignored/unused) */
>> +#define RECLAIM_WRITE   (1<<1)	/* Writeout pages during reclaim */
>> +#define RECLAIM_UNMAP   (1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_MIGRATE (1<<3)	/* Migrate pages during reclaim */
>>
>>  /*
>>   * Priority for NODE_RECLAIM.  This determines the fraction of pages

> I found that RECLAIM_MIGRATE is defined but never referenced in the
> patch.
>
> If my understanding of the code is correct, shrink_do_demote_mapping()
> is called by shrink_page_list(), which is used by kswapd and direct
> reclaim.  So as long as the persistent memory node is onlined,
> reclaim-based migration will be enabled regardless of node reclaim
> mode.

It looks that way according to the code.  But the intention of the new
node reclaim mode is to do migration on reclaim *only when*
RECLAIM_MIGRATE is enabled by the user.  As it stands, the patch just
clears the migration target node masks when memory is offlined.  So I
suppose you need to check whether node reclaim mode is enabled before
doing migration in shrink_page_list(), and also make node reclaim adopt
the new mode.
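
For example, something like the following (just a sketch against this
series; is_demote_ok() is a hypothetical name here, and
next_demotion_node() is the target-lookup helper added earlier in the
series):

static bool is_demote_ok(int nid)
{
	/* The user must opt in via the RECLAIM_MIGRATE sysctl bit. */
	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
		return false;

	/* There must be a node set up to receive this node's pages. */
	if (next_demotion_node(nid) == NUMA_NO_NODE)
		return false;

	return true;
}

shrink_page_list() would then demote a page only when
is_demote_ok(page_to_nid(page)) returns true, instead of whenever a
migration target happens to be configured.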
Yang Shi's earlier series did this kind of gating.  Please refer to
https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/

I copied the related chunks here:

+			if (is_demote_ok(page_to_nid(page))) {	<--- check if node reclaim mode is enabled
+				list_add(&page->lru, &demote_pages);
+				unlock_page(page);
+				continue;
+			}

and

@@ -4084,8 +4179,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_writepage = !!((node_reclaim_mode & RECLAIM_WRITE) ||
+				    (node_reclaim_mode & RECLAIM_MIGRATE)),
+		.may_unmap = !!((node_reclaim_mode & RECLAIM_UNMAP) ||
+				(node_reclaim_mode & RECLAIM_MIGRATE)),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
@@ -4105,7 +4202,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;

-	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
+	    (node_reclaim_mode & RECLAIM_MIGRATE)) {
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4138,9 +4236,12 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * thrown out if the node is overallocated.  So we do not reclaim
 	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
+	 *
+	 * Migrate mode doesn't care about the above restrictions.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages &&
+	    !(node_reclaim_mode & RECLAIM_MIGRATE))
 		return NODE_RECLAIM_FULL;

>
> Best Regards,
> Huang, Ying