On 19 Jun 2021, at 4:18, Huang, Ying wrote:

> Zi Yan writes:
>
>> On 18 Jun 2021, at 2:15, Huang Ying wrote:
>>
>>> From: Dave Hansen
>>>
>>> When memory fills up on a node, memory contents can be
>>> automatically migrated to another node.  The biggest problems are
>>> knowing when to migrate and to where the migration should be
>>> targeted.
>>>
>>> The most straightforward way to generate the "to where" list would
>>> be to follow the page allocator fallback lists.  Those lists
>>> already tell us if memory is full where to look next.  It would
>>> also be logical to move memory in that order.
>>>
>>> But, the allocator fallback lists have a fatal flaw: most nodes
>>> appear in all the lists.  This would potentially lead to migration
>>> cycles (A->B, B->A, A->B, ...).
>>>
>>> Instead of using the allocator fallback lists directly, keep a
>>> separate node migration ordering.  But, reuse the same data used
>>> to generate page allocator fallback in the first place:
>>> find_next_best_node().
>>>
>>> This means that the firmware data used to populate node distances
>>> essentially dictates the ordering for now.  It should also be
>>> architecture-neutral since all NUMA architectures have a working
>>> find_next_best_node().
>>>
>>> The protocol for node_demotion[] access and writing is not
>>> standard.  It has no specific locking and is intended to be read
>>> locklessly.  Readers must take care to avoid observing changes
>>> that appear incoherent.  This was done so that node_demotion[]
>>> locking has no chance of becoming a bottleneck on large systems
>>> with lots of CPUs in direct reclaim.
>>>
>>> This code is unused for now.  It will be called later in the
>>> series.
>>>
>>> Signed-off-by: Dave Hansen
>>> Signed-off-by: "Huang, Ying"
>>> Reviewed-by: Yang Shi
>>> Cc: Michal Hocko
>>> Cc: Wei Xu
>>> Cc: David Rientjes
>>> Cc: Dan Williams
>>> Cc: David Hildenbrand
>>> Cc: osalvador
>>>
>>> --
>>>
>>> Changes from 20200122:
>>>  * Add big node_demotion[] comment
>>> Changes from 20210302:
>>>  * Fix typo in node_demotion[] comment
>>> ---
>>>  mm/internal.h   |   5 ++
>>>  mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
>>>  mm/page_alloc.c |   2 +-
>>>  3 files changed, 180 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 2f1182948aa6..0344cd78e170 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>>>
>>>  #ifdef CONFIG_NUMA
>>>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
>>> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>>>  #else
>>>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>>>  				unsigned int order)
>>>  {
>>>  	return NODE_RECLAIM_NOSCAN;
>>>  }
>>> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
>>> +{
>>> +	return NUMA_NO_NODE;
>>> +}
>>>  #endif
>>>
>>>  extern int hwpoison_filter(struct page *p);
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 6cab668132f9..111f8565f75d 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>>>  	return rc;
>>>  }
>>>
>>> +
>>> +/*
>>> + * node_demotion[] example:
>>> + *
>>> + * Consider a system with two sockets.  Each socket has
>>> + * three classes of memory attached: fast, medium and slow.
>>> + * Each memory class is placed in its own NUMA node.  The
>>> + * CPUs are placed in the node with the "fast" memory.  The
>>> + * 6 NUMA nodes (0-5) might be split among the sockets like
>>> + * this:
>>> + *
>>> + *	Socket A: 0, 1, 2
>>> + *	Socket B: 3, 4, 5
>>> + *
>>> + * When Node 0 fills up, its memory should be migrated to
>>> + * Node 1.  When Node 1 fills up, it should be migrated to
>>> + * Node 2.  The migration path starts on the nodes with the
>>> + * processors (since allocations default to this node) and
>>> + * fast memory, progresses through medium and ends with the
>>> + * slow memory:
>>> + *
>>> + *	0 -> 1 -> 2 -> stop
>>> + *	3 -> 4 -> 5 -> stop
>>> + *
>>> + * This is represented in the node_demotion[] like this:
>>> + *
>>> + *	{  1, // Node 0 migrates to 1
>>> + *	   2, // Node 1 migrates to 2
>>> + *	  -1, // Node 2 does not migrate
>>> + *	   4, // Node 3 migrates to 4
>>> + *	   5, // Node 4 migrates to 5
>>> + *	  -1} // Node 5 does not migrate
>>> + */
>>> +
>>> +/*
>>> + * Writes to this array occur without locking.  READ_ONCE()
>>> + * is recommended for readers to ensure consistent reads.
>>> + */
>>>  static int node_demotion[MAX_NUMNODES] __read_mostly =
>>>  	{[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
>>>
>>> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_mostly =
>>>   */
>>>  int next_demotion_node(int node)
>>>  {
>>> -	return node_demotion[node];
>>> +	/*
>>> +	 * node_demotion[] is updated without excluding
>>> +	 * this function from running.  READ_ONCE() avoids
>>> +	 * reading multiple, inconsistent 'node' values
>>> +	 * during an update.
>>> +	 */
>>> +	return READ_ONCE(node_demotion[node]);
>>>  }
>>
>> Is it necessary to have two separate patches to add node_demotion and
>> next_demotion_node() and then modify it immediately? Maybe merge Patch 1
>> into Patch 2?
>>
>> Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
>> I guess it might be much simpler to just introduce node_demotion with RCU
>> in this patch and have Patch 3 only take care of hotplug events.
>
> Hi, Dave,
>
> What do you think about this?
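To make my RCU suggestion a bit more concrete, below is roughly the shape
I have in mind. This is only a sketch: the struct and function names are
mine, allocation of the new table is omitted, and Patch 3 may well
organize this differently. The idea is to publish the whole table through
a single RCU-protected pointer, so readers always see one coherent
snapshot:

struct demotion_nodes {
	int target[MAX_NUMNODES];
};

/* Readers dereference this under rcu_read_lock(). */
static struct demotion_nodes __rcu *node_demotion_tbl;

int next_demotion_node(int node)
{
	struct demotion_nodes *nd;
	int target = NUMA_NO_NODE;

	rcu_read_lock();
	nd = rcu_dereference(node_demotion_tbl);
	if (nd)
		target = nd->target[node];
	rcu_read_unlock();

	return target;
}

/*
 * Writer side, serialized by the same hotplug single-threading the
 * patch already relies on.  'new_nd' is a fully built table.
 */
static void publish_demotion_table(struct demotion_nodes *new_nd)
{
	struct demotion_nodes *old_nd;

	old_nd = rcu_dereference_protected(node_demotion_tbl, 1);
	rcu_assign_pointer(node_demotion_tbl, new_nd);
	synchronize_rcu();
	kfree(old_nd);
}

If I am not mistaken, this would also remove the need for the smp_wmb()
and the momentary "everything disabled" window further down, at the cost
of one allocation per update.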
>>>
>>>  /*
>>> @@ -3144,3 +3188,132 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>>>  }
>>>  EXPORT_SYMBOL(migrate_vma_finalize);
>>>  #endif /* CONFIG_DEVICE_PRIVATE */
>>> +
>>> +/* Disable reclaim-based migration. */
>>> +static void disable_all_migrate_targets(void)
>>> +{
>>> +	int node;
>>> +
>>> +	for_each_online_node(node)
>>> +		node_demotion[node] = NUMA_NO_NODE;
>>> +}
>>> +
>>> +/*
>>> + * Find an automatic demotion target for 'node'.
>>> + * Failing here is OK.  It might just indicate
>>> + * being at the end of a chain.
>>> + */
>>> +static int establish_migrate_target(int node, nodemask_t *used)
>>> +{
>>> +	int migration_target;
>>> +
>>> +	/*
>>> +	 * Can not set a migration target on a
>>> +	 * node with it already set.
>>> +	 *
>>> +	 * No need for READ_ONCE() here since this
>>> +	 * is in the write path for node_demotion[].
>>> +	 * This should be the only thread writing.
>>> +	 */
>>> +	if (node_demotion[node] != NUMA_NO_NODE)
>>> +		return NUMA_NO_NODE;
>>> +
>>> +	migration_target = find_next_best_node(node, used);
>>> +	if (migration_target == NUMA_NO_NODE)
>>> +		return NUMA_NO_NODE;
>>> +
>>> +	node_demotion[node] = migration_target;
>>> +
>>> +	return migration_target;
>>> +}
>>> +
>>> +/*
>>> + * When memory fills up on a node, memory contents can be
>>> + * automatically migrated to another node instead of
>>> + * discarded at reclaim.
>>> + *
>>> + * Establish a "migration path" which will start at nodes
>>> + * with CPUs and will follow the priorities used to build the
>>> + * page allocator zonelists.
>>> + *
>>> + * The difference here is that cycles must be avoided.  If
>>> + * node0 migrates to node1, then neither node1, nor anything
>>> + * node1 migrates to, can migrate to node0.
>>> + *
>>> + * This function can run simultaneously with readers of
>>> + * node_demotion[].  However, it can not run simultaneously
>>> + * with itself.  Exclusion is provided by memory hotplug events
>>> + * being single-threaded.
>>> + */
>>> +static void __set_migration_target_nodes(void)
>>> +{
>>> +	nodemask_t next_pass	= NODE_MASK_NONE;
>>> +	nodemask_t this_pass	= NODE_MASK_NONE;
>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>> +	int node;
>>> +
>>> +	/*
>>> +	 * Avoid any oddities like cycles that could occur
>>> +	 * from changes in the topology.  This will leave
>>> +	 * a momentary gap when migration is disabled.
>>> +	 */
>>> +	disable_all_migrate_targets();
>>> +
>>> +	/*
>>> +	 * Ensure that the "disable" is visible across the system.
>>> +	 * Readers will see either a combination of before+disable
>>> +	 * state or disable+after.  They will never see before and
>>> +	 * after state together.
>>> +	 *
>>> +	 * The before+after state together might have cycles and
>>> +	 * could cause readers to do things like loop until this
>>> +	 * function finishes.  This ensures they can only see a
>>> +	 * single "bad" read and would, for instance, only loop
>>> +	 * once.
>>> +	 */
>>> +	smp_wmb();
>>> +
>>> +	/*
>>> +	 * Allocations go close to CPUs, first.  Assume that
>>> +	 * the migration path starts at the nodes with CPUs.
>>> +	 */
>>> +	next_pass = node_states[N_CPU];
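Before coming back to my question below: the rest of this function is
snipped above, so to check my reading of the construction I modeled it in
plain userspace C. My understanding is that each pass marks the previous
pass's nodes as used (so they can never become targets again, which is
what breaks cycles) and then asks find_next_best_node() for a target for
each of them; the targets found form the next pass. Everything here is an
assumption for illustration: the topology is the 6-node example from the
comment, and the toy find_next_best_node() simply prefers the next unused
node on the same socket as a stand-in for the firmware distance data.

#include <stdio.h>
#include <stdbool.h>

#define MAX_NUMNODES	6
#define NUMA_NO_NODE	(-1)

static int node_demotion[MAX_NUMNODES] = {
	[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE	/* GCC range initializer */
};

/*
 * Toy stand-in for the kernel's distance-based find_next_best_node():
 * return the closest not-yet-used node, modeled here as the next
 * unused node on the same 3-node socket.
 */
static int find_next_best_node(int node, bool *used)
{
	int base = (node / 3) * 3;

	for (int i = base; i < base + 3; i++)
		if (i != node && !used[i])
			return i;
	return NUMA_NO_NODE;
}

int main(void)
{
	bool this_pass[MAX_NUMNODES] = { false };
	bool next_pass[MAX_NUMNODES] = { false };
	bool used[MAX_NUMNODES] = { false };

	/* Migration paths start at the nodes with CPUs (0 and 3 here). */
	next_pass[0] = next_pass[3] = true;

	for (bool progress = true; progress; ) {
		progress = false;

		for (int i = 0; i < MAX_NUMNODES; i++) {
			this_pass[i] = next_pass[i];
			next_pass[i] = false;
			/* Sources of this pass may never become targets:
			 * this is what prevents A->B, B->A cycles. */
			if (this_pass[i])
				used[i] = true;
		}

		for (int node = 0; node < MAX_NUMNODES; node++) {
			int target;

			if (!this_pass[node])
				continue;
			target = find_next_best_node(node, used);
			if (target == NUMA_NO_NODE)
				continue;
			node_demotion[node] = target;
			/* Claim the target so chains stay disjoint
			 * (an assumption of this toy model). */
			used[target] = true;
			next_pass[target] = true;
			progress = true;
		}
	}

	for (int node = 0; node < MAX_NUMNODES; node++)
		printf("node %d -> %d\n", node, node_demotion[node]);
	return 0;
}

This prints the 0 -> 1 -> 2 and 3 -> 4 -> 5 chains from the comment (with
-1 for nodes 2 and 5), so I believe the pass logic is doing what the big
comment says.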
>>
>> Is there a plan to allow the user to change where the migration
>> path starts? Or, one step further, to provide an interface that
>> lets the user specify the demotion path, something like
>> /sys/devices/system/node/node*/node_demotion?
>
> I don't think that's necessary, at least for now.  Do you know any real
> world use case for this?

In our P9+Volta system, GPU memory is exposed as a NUMA node. For GPU
workloads with a data size greater than the GPU memory size, it would be
very helpful to allow pages in GPU memory to be migrated/demoted to CPU
memory. With your current assumption, GPU memory -> CPU memory demotion
seems not to be possible, right? The same applies to any system with
device memory exposed as a NUMA node, where workloads run on the device
and use CPU memory as a lower memory tier than the device memory.

--
Best Regards,
Yan, Zi
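P.S. For anyone skimming the thread, the demotion chain described in the
node_demotion[] example comment can be reproduced with a few lines of
standalone C. The table literal below is copied from that comment, and
the plain array read stands in for next_demotion_node():

#include <stdio.h>

#define NUMA_NO_NODE	(-1)

int main(void)
{
	/* The table from the node_demotion[] example comment. */
	int node_demotion[] = { 1, 2, -1, 4, 5, -1 };
	int node = 0;			/* start at a node with CPUs */

	while (node != NUMA_NO_NODE) {
		printf("%d -> ", node);
		node = node_demotion[node];	/* next_demotion_node() */
	}
	printf("stop\n");		/* prints: 0 -> 1 -> 2 -> stop */
	return 0;
}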