* [PATCH 1/3] mm/mempolicy: Use the already fetched local variable
From: Donet Tom @ 2024-02-17 7:31 UTC
To: Andrew Morton, linux-mm, linux-kernel
Cc: Aneesh Kumar, Huang Ying, Dave Hansen, Mel Gorman, Ben Widawsky,
Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra,
Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox,
Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins,
Kefeng Wang, Suren Baghdasaryan
Avoid doing a per-CPU read and use the already fetched local variable
thisnid instead. IMHO this also makes the code more readable.
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
mm/mempolicy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..8478574c000c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		if (node_isset(curnid, pol->nodes))
 			goto out;
 		z = first_zones_zonelist(
-				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				node_zonelist(thisnid, GFP_HIGHUSER),
 				gfp_zone(GFP_HIGHUSER),
 				&pol->nodes);
 		polnid = zone_to_nid(z->zone);
--
2.39.3
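
For context (not part of the patch): thisnid is already computed near the
top of mpol_misplaced(), so the hunk above only reuses an existing local
instead of re-reading the per-CPU NUMA node id. The relevant locals look
roughly like this:

	int curnid = folio_nid(folio);
	int thiscpu = raw_smp_processor_id();
	int thisnid = cpu_to_node(thiscpu);	/* local reused by the hunk above */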

* [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Donet Tom @ 2024-02-17 7:31 UTC
To: Andrew Morton, linux-mm, linux-kernel

commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
nodes") added support for migrate on protnone reference with MPOL_BIND
memory policy. This allowed numa fault migration when the executing node
is part of the policy mask for MPOL_BIND. This patch extends migration
support to MPOL_PREFERRED_MANY policy.

Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
MPOL_F_NUMA_BALANCING. This causes issues when we want to use
NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
the kernel should not allocate pages from the slower memory tier via
allocation control zonelist fallback. Instead, we should move cold pages
from the faster memory node via memory demotion. For a page allocation,
kswapd is only woken up after we try to allocate pages from all nodes in
the allocation zone list. This implies that, without using memory
policies, we will end up allocating hot pages in the slower memory tier.

MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
allocation control when we have memory tiers in the system. With
MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
of faster memory nodes. When we fail to allocate pages from the faster
memory node, kswapd would be woken up, allowing demotion of cold pages
to slower memory nodes.

With the current kernel, such usage of memory policies implies we can't
do page promotion from a slower memory tier to a faster memory tier
using numa fault. This patch fixes this issue.

For MPOL_PREFERRED_MANY, if the executing node is in the policy node
mask, we allow numa migration to the executing nodes. If the executing
node is not in the policy node mask but the folio is already allocated
based on policy preference (the folio node is in the policy node mask),
we don't allow numa migration. If both the executing node and folio node
are outside the policy node mask, we allow numa migration to the
executing nodes.
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 73d698e21dae..8c4c92b10371 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
 	if (*flags & MPOL_F_NUMA_BALANCING) {
-		if (*mode != MPOL_BIND)
+		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
+			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
+		else
 			return -EINVAL;
-		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
 	}
 	return 0;
 }
@@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
+						      struct mempolicy *pol)
+{
+	/* if the executing node is in the policy node mask, migrate */
+	if (node_isset(exec_node, pol->nodes))
+		return true;
+
+	/* If the folio node is in policy node mask, don't migrate */
+	if (node_isset(folio_node, pol->nodes))
+		return false;
+	/*
+	 * both the folio node and executing node are outside the policy nodemask,
+	 * migrate as normal numa fault migration.
+	 */
+	return true;
+}
+
 /**
  * mpol_misplaced - check whether current folio node is valid in policy
  *
@@ -2526,6 +2544,12 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		break;
 
 	case MPOL_PREFERRED_MANY:
+		if (pol->flags & MPOL_F_MORON) {
+			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
+				goto out;
+			break;
+		}
+
 		/*
 		 * use current page if in policy nodemask,
 		 * else select nearest allowed node, if any.
-- 
2.39.3
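
Not part of the patch, but for reference, a minimal user-space sketch of
how the new combination would be requested. The numeric fallbacks mirror
the uapi <linux/mempolicy.h> values and are only there in case an older
libnuma header lacks the symbols; node numbers are an assumption.

	#include <numaif.h>	/* set_mempolicy(), MPOL_* definitions */
	#include <stdio.h>
	#include <stdlib.h>

	#ifndef MPOL_PREFERRED_MANY
	#define MPOL_PREFERRED_MANY	5
	#endif
	#ifndef MPOL_F_NUMA_BALANCING
	#define MPOL_F_NUMA_BALANCING	(1 << 13)
	#endif

	int main(void)
	{
		/* prefer the fast-memory nodes 0 and 1 (adjust for your topology) */
		unsigned long nodemask = (1UL << 0) | (1UL << 1);

		if (set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
				  &nodemask, sizeof(nodemask) * 8)) {
			perror("set_mempolicy");	/* EINVAL on kernels without this patch */
			return EXIT_FAILURE;
		}
		printf("policy set; NUMA balancing may now migrate/promote pages\n");
		return 0;
	}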

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Michal Hocko @ 2024-02-19 12:07 UTC
To: Donet Tom

On Sat 17-02-24 01:31:35, Donet Tom wrote:
> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
> nodes") added support for migrate on protnone reference with MPOL_BIND
> memory policy. This allowed numa fault migration when the executing node
> is part of the policy mask for MPOL_BIND. This patch extends migration
> support to MPOL_PREFERRED_MANY policy.
>
> [...]
>
> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
> mask, we allow numa migration to the executing nodes. If the executing
> node is not in the policy node mask but the folio is already allocated
> based on policy preference (the folio node is in the policy node mask),
> we don't allow numa migration. If both the executing node and folio node
> are outside the policy node mask, we allow numa migration to the
> executing nodes.

The feature makes sense to me. How has this been tested? Do you have any
numbers to present?

> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>  mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)

I haven't spotted anything obviously wrong in the patch itself but I
admit this is not an area I am actively familiar with so I might be
missing something.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Donet Tom @ 2024-02-19 13:44 UTC
To: Michal Hocko

On 2/19/24 17:37, Michal Hocko wrote:
> On Sat 17-02-24 01:31:35, Donet Tom wrote:
>> [...]
>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>> mask, we allow numa migration to the executing nodes. If the executing
>> node is not in the policy node mask but the folio is already allocated
>> based on policy preference (the folio node is in the policy node mask),
>> we don't allow numa migration. If both the executing node and folio node
>> are outside the policy node mask, we allow numa migration to the
>> executing nodes.
> The feature makes sense to me. How has this been tested? Do you have any
> numbers to present?

Hi Michal

I have a test program which allocates memory on a specified node and
triggers the promotion or migration (it keeps accessing the pages).

Without this patch, if we set MPOL_PREFERRED_MANY, promotion or migration
was not happening; with this patch I could see pages getting migrated or
promoted.

My system has 2 CPU+DRAM nodes (Tier 1) and 1 PMEM node (Tier 2). Below
are my test results.

In the table below, N0 and N1 are Tier 1 nodes and N6 is the Tier 2 node.
Exec_Node is the execution node, Policy is the set of nodes in the
nodemask, and "Curr Location Pages" is the node where the pages were
present before migration or promotion started.

Tests Results
------------------
Scenario 1: the executing node is in the policy node mask
================================================================================
Exec_Node    Policy      Curr Location Pages    Observations
================================================================================
N0           N0 N1 N6    N1                     Pages Migrated from N1 to N0
N0           N0 N1 N6    N6                     Pages Promoted from N6 to N0
N0           N0 N1       N1                     Pages Migrated from N1 to N0
N0           N0 N1       N6                     Pages Promoted from N6 to N0

Scenario 2: the folio node is in the policy node mask and the exec node is not
================================================================================
Exec_Node    Policy      Curr Location Pages    Observations
================================================================================
N0           N1 N6       N1                     Pages are not migrating to N0
N0           N1 N6       N6                     Pages are not migrating to N0
N0           N1          N1                     Pages are not migrating to N0

Scenario 3: both the folio node and the executing node are outside the policy nodemask
==============================================================================
Exec_Node    Policy      Curr Location Pages    Observations
==============================================================================
N0           N1          N6                     Pages Promoted from N6 to N0
N0           N6          N1                     Pages Migrated from N1 to N0

Thanks
Donet Tom

> I haven't spotted anything obviously wrong in the patch itself but I
> admit this is not an area I am actively familiar with so I might be
> missing something.
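
For reference, below is a minimal sketch of the kind of test described
above. It is a hypothetical reconstruction, not the author's actual test
program: it binds an allocation to a chosen source node, switches the
range to MPOL_PREFERRED_MANY with MPOL_F_NUMA_BALANCING, keeps touching
the pages so NUMA hinting faults can fire (NUMA balancing must be
enabled), and finally queries the page locations with move_pages(). The
node numbers, iteration counts, and fallback defines are assumptions.

	#include <numaif.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define SRC_NODE	6	/* slow (PMEM) node holding the pages initially */
	#define FAST_NODE	0	/* fast node the task executes on */
	#define NPAGES		1024

	#ifndef MPOL_PREFERRED_MANY
	#define MPOL_PREFERRED_MANY	5
	#endif
	#ifndef MPOL_F_NUMA_BALANCING
	#define MPOL_F_NUMA_BALANCING	(1 << 13)
	#endif

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		size_t len = NPAGES * psz;
		unsigned long src_mask = 1UL << SRC_NODE;
		unsigned long fast_mask = 1UL << FAST_NODE;
		void *pages[NPAGES];
		int status[NPAGES], i, iter;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* 1. place the pages on the source (slow) node */
		mbind(buf, len, MPOL_BIND, &src_mask, sizeof(src_mask) * 8, 0);
		memset(buf, 1, len);		/* fault them in on SRC_NODE */

		/* 2. switch the range to MPOL_PREFERRED_MANY + numa balancing */
		if (mbind(buf, len, MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
			  &fast_mask, sizeof(fast_mask) * 8, 0)) {
			perror("mbind");	/* EINVAL without this patch */
			return 1;
		}

		/* 3. keep touching the pages so hinting faults can migrate them */
		for (iter = 0; iter < 300; iter++) {
			for (i = 0; i < NPAGES; i++)
				buf[i * psz]++;
			usleep(100 * 1000);
		}

		/* 4. report where the pages ended up (nodes == NULL just queries) */
		for (i = 0; i < NPAGES; i++)
			pages[i] = buf + i * psz;
		move_pages(0, NPAGES, pages, NULL, status, 0);
		printf("page 0 is now on node %d\n", status[0]);
		return 0;
	}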

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Huang, Ying @ 2024-02-20 6:36 UTC
To: Donet Tom

Donet Tom <donettom@linux.ibm.com> writes:

> On 2/19/24 17:37, Michal Hocko wrote:
>> [...]
>> The feature makes sense to me. How has this been tested? Do you have any
>> numbers to present?
>
> Hi Michal
>
> I have a test program which allocates memory on a specified node and
> triggers the promotion or migration (it keeps accessing the pages).
>
> Without this patch, if we set MPOL_PREFERRED_MANY, promotion or migration
> was not happening; with this patch I could see pages getting migrated or
> promoted.
>
> My system has 2 CPU+DRAM nodes (Tier 1) and 1 PMEM node (Tier 2). Below
> are my test results.
>
> [...]

Please use some benchmarks (e.g., redis + memtier) and show the
proc-vmstat stats and benchmark score.

Not part of the kernel series, but don't forget to submit patches to the
man pages project and numactl tool to let users use it.

--
Best Regards,
Huang, Ying

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Aneesh Kumar K.V @ 2024-02-20 6:44 UTC
To: Huang, Ying, Donet Tom

On 2/20/24 12:06 PM, Huang, Ying wrote:
> Donet Tom <donettom@linux.ibm.com> writes:
>> [...]
>
> Please use some benchmarks (e.g., redis + memtier) and show the
> proc-vmstat stats and benchmark score.

Without this change numa fault migration is not supported with the
MPOL_PREFERRED_MANY policy, so there is no performance comparison with
and without the patch. W.r.t. the effectiveness of numa fault migration,
that is a different topic from this patch.

-aneesh

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Huang, Ying @ 2024-02-20 7:23 UTC
To: Aneesh Kumar K.V

"Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes:

> On 2/20/24 12:06 PM, Huang, Ying wrote:
>> [...]
>> Please use some benchmarks (e.g., redis + memtier) and show the
>> proc-vmstat stats and benchmark score.
>
> Without this change numa fault migration is not supported with the
> MPOL_PREFERRED_MANY policy, so there is no performance comparison with
> and without the patch. W.r.t. the effectiveness of numa fault migration,
> that is a different topic from this patch.

IIUC, the goal of the patch is to optimize performance, right? If so,
the benchmark score will help justify the change.

--
Best Regards,
Huang, Ying

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Aneesh Kumar K.V @ 2024-02-20 7:46 UTC
To: Huang, Ying

"Huang, Ying" <ying.huang@intel.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes:
>> [...]
>> Without this change numa fault migration is not supported with the
>> MPOL_PREFERRED_MANY policy, so there is no performance comparison with
>> and without the patch. W.r.t. the effectiveness of numa fault migration,
>> that is a different topic from this patch.
>
> IIUC, the goal of the patch is to optimize performance, right? If so,
> the benchmark score will help justify the change.

The objective is to enable the use of the MPOL_PREFERRED_MANY policy,
which is essential for the correct functioning of memory demotion in
conjunction with memory promotion. Once we can use memory promotion, we
should be able to observe the same benefits as those provided by numa
fault memory promotion. The actual benefit of numa fault migration is
dependent on various factors such as the speed of the slower memory
device, the access pattern of the application, etc. We are discussing
its effectiveness and how to improve numa fault overhead in other
forums. However, we believe that this discussion should not hinder the
merging of this patch.

This change is similar to commit bda420b98505 ("numa balancing: migrate
on fault among multiple bound nodes").

-aneesh

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Huang, Ying @ 2024-02-20 8:01 UTC
To: Aneesh Kumar K.V

Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes:

> [...]
> We are discussing
> its effectiveness and how to improve numa fault overhead in other
> forums. However, we believe that this discussion should not hinder the
> merging of this patch.
>
> This change is similar to commit bda420b98505 ("numa balancing: migrate
> on fault among multiple bound nodes").

We provided the performance data in the description of that commit :-)

--
Best Regards,
Huang, Ying

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Michal Hocko @ 2024-02-19 14:20 UTC
To: Donet Tom

On Sat 17-02-24 01:31:35, Donet Tom wrote:
[...]
> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
> +						      struct mempolicy *pol)
> +{
> +	/* if the executing node is in the policy node mask, migrate */
> +	if (node_isset(exec_node, pol->nodes))
> +		return true;
> +
> +	/* If the folio node is in policy node mask, don't migrate */
> +	if (node_isset(folio_node, pol->nodes))
> +		return false;
> +	/*
> +	 * both the folio node and executing node are outside the policy nodemask,
> +	 * migrate as normal numa fault migration.
> +	 */
> +	return true;
> +}

I have looked at this again and only now noticed that this doesn't
really work as one would expect.

	case MPOL_PREFERRED_MANY:
		/*
		 * use current page if in policy nodemask,
		 * else select nearest allowed node, if any.
		 * If no allowed nodes, use current [!misplaced].
		 */
		if (node_isset(curnid, pol->nodes))
			goto out;
		z = first_zones_zonelist(
				node_zonelist(numa_node_id(), GFP_HIGHUSER),
				gfp_zone(GFP_HIGHUSER),
				&pol->nodes);
		polnid = zone_to_nid(z->zone);
		break;

This will collapse the whole MPOL_PREFERRED_MANY nodemask into the first
node in that mask. Is that really what we want here? Shouldn't we use
the full nodemask as the migration target?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Donet Tom @ 2024-02-19 15:07 UTC
To: Michal Hocko

On 2/19/24 19:50, Michal Hocko wrote:
> [...]
> This will collapse the whole MPOL_PREFERRED_MANY nodemask into the first
> node in that mask. Is that really what we want here? Shouldn't we use
> the full nodemask as the migration target?

With this patch it will take the full nodemask and find the correct
migration target; it will not collapse into the first node.

For example, if we have 5 NUMA nodes in our system, N1 to N5, all five
are in the nodemask, and the execution node is N3, then with this fix
mpol_preferred_should_numa_migrate() will return true because the
execution node is in the nodemask. So mpol_misplaced() will select N3 as
the migration target, since MPOL_F_MORON is set, and migrate the pages
to N3.

	/* Migrate the folio towards the node whose CPU is referencing it */
	if (pol->flags & MPOL_F_MORON) {
		polnid = thisnid;

So with this patch pages will get migrated to the correct migration
target.

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Michal Hocko @ 2024-02-19 19:12 UTC
To: Donet Tom

On Mon 19-02-24 20:37:17, Donet Tom wrote:
> [...]
> With this patch it will take the full nodemask and find the correct
> migration target; it will not collapse into the first node.

Correct me if I am wrong, but mpol_misplaced will return the first node
of the preferred node mask and then migrate_misplaced_folio would use
it as a target node for alloc_misplaced_dst_folio which performs
__GFP_THISNODE allocation so it won't fall back to a different node.
-- 
Michal Hocko
SUSE Labs
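
For context, a simplified paraphrase of the allocation path referred to
above (the real mm/migrate.c code also special-cases higher-order/THP
folios and adds more GFP flags); the key point is __GFP_THISNODE pinning
the destination to the single node mpol_misplaced() returned:

	static struct folio *alloc_misplaced_dst_folio(struct folio *src,
						       unsigned long data)
	{
		int nid = (int) data;	/* target node chosen by mpol_misplaced() */

		/* __GFP_THISNODE: no fallback to any other node */
		return __folio_alloc_node(GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
					  folio_order(src), nid);
	}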

* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Aneesh Kumar K.V @ 2024-02-20 3:57 UTC
To: Michal Hocko, Donet Tom

On 2/20/24 12:42 AM, Michal Hocko wrote:
> On Mon 19-02-24 20:37:17, Donet Tom wrote:
>> [...]
>> With this patch it will take the full nodemask and find the correct
>> migration target; it will not collapse into the first node.
>
> Correct me if I am wrong, but mpol_misplaced will return the first node
> of the preferred node mask and then migrate_misplaced_folio would use
> it as a target node for alloc_misplaced_dst_folio which performs
> __GFP_THISNODE allocation so it won't fall back to a different node.

I think the confusion is between MPOL_F_MOF (migrate on fault) vs
MPOL_F_MORON (protnone fault/numa fault).

With MPOL_F_MOF alone, what we wanted to achieve was to have mbind()
lazy-migrate the pages based on the policy node mask. The change was
introduced in commit b24f53a0bea3 ("mm: mempolicy: Add MPOL_MF_LAZY")
and later dropped by commit 2cafb582173f ("mempolicy: remove confusing
MPOL_MF_LAZY dead code"). We still have mpol_misplaced changes to handle
the node selection for the MPOL_F_MOF flag (this is dead code IIUC).

MPOL_F_MORON was added in commit 5606e3877ad8 ("mm: numa: Migrate on
reference policy") and currently upstream only MPOL_BIND supports that
flag. With that flag specified and with the changes in the patch,
mpol_misplaced becomes:

	case MPOL_PREFERRED_MANY:
		if (pol->flags & MPOL_F_MORON) {
			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
				goto out;
			break;
		}

		/*
		 * use current page if in policy nodemask,
		 * else select nearest allowed node, if any.
		 * If no allowed nodes, use current [!misplaced].
		 */
		if (node_isset(curnid, pol->nodes))
			goto out;
		z = first_zones_zonelist(
				node_zonelist(thisnid, GFP_HIGHUSER),
				gfp_zone(GFP_HIGHUSER),
				&pol->nodes);
		polnid = zone_to_nid(z->zone);
		break;
	....
	..
	}

	/* Migrate the folio towards the node whose CPU is referencing it */
	if (pol->flags & MPOL_F_MORON) {
		polnid = thisnid;

		if (!should_numa_migrate_memory(current, folio, curnid,
						thiscpu))
			goto out;
	}

	if (curnid != polnid)
		ret = polnid;
out:
	mpol_cond_put(pol);

	return ret;
}

i.e., if we can do numa migration, we select the currently executing
node as the target node; otherwise we end up returning from the
function with ret = NUMA_NO_NODE.

-aneesh
* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy 2024-02-20 3:57 ` Aneesh Kumar K.V @ 2024-02-20 8:48 ` Michal Hocko 2024-02-26 13:09 ` Donet Tom 0 siblings, 1 reply; 33+ messages in thread From: Michal Hocko @ 2024-02-20 8:48 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Donet Tom, Andrew Morton, linux-mm, linux-kernel, Huang Ying, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan On Tue 20-02-24 09:27:25, Aneesh Kumar K.V wrote: [...] > case MPOL_PREFERRED_MANY: > if (pol->flags & MPOL_F_MORON) { > if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol)) > goto out; > break; > } > > /* > * use current page if in policy nodemask, > * else select nearest allowed node, if any. > * If no allowed nodes, use current [!misplaced]. > */ > if (node_isset(curnid, pol->nodes)) > goto out; > z = first_zones_zonelist( > node_zonelist(thisnid, GFP_HIGHUSER), > gfp_zone(GFP_HIGHUSER), > &pol->nodes); > polnid = zone_to_nid(z->zone); > break; > .... > .. > } > > /* Migrate the folio towards the node whose CPU is referencing it */ > if (pol->flags & MPOL_F_MORON) { > polnid = thisnid; > > if (!should_numa_migrate_memory(current, folio, curnid, > thiscpu)) > goto out; > } > > if (curnid != polnid) > ret = polnid; > out: > mpol_cond_put(pol); > > return ret; > } Ohh, right this code is confusing as hell. Thanks for the clarification. With this in mind. There should be a comment warning about MPOL_F_MOF always being unset as the userspace cannot really set it up. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy 2024-02-20 8:48 ` Michal Hocko @ 2024-02-26 13:09 ` Donet Tom 0 siblings, 0 replies; 33+ messages in thread From: Donet Tom @ 2024-02-26 13:09 UTC (permalink / raw) To: Michal Hocko, Aneesh Kumar K.V Cc: Andrew Morton, linux-mm, linux-kernel, Huang Ying, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan On 2/20/24 14:18, Michal Hocko wrote: > On Tue 20-02-24 09:27:25, Aneesh Kumar K.V wrote: > [...] >> case MPOL_PREFERRED_MANY: >> if (pol->flags & MPOL_F_MORON) { >> if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol)) >> goto out; >> break; >> } >> >> /* >> * use current page if in policy nodemask, >> * else select nearest allowed node, if any. >> * If no allowed nodes, use current [!misplaced]. >> */ >> if (node_isset(curnid, pol->nodes)) >> goto out; >> z = first_zones_zonelist( >> node_zonelist(thisnid, GFP_HIGHUSER), >> gfp_zone(GFP_HIGHUSER), >> &pol->nodes); >> polnid = zone_to_nid(z->zone); >> break; >> .... >> .. >> } >> >> /* Migrate the folio towards the node whose CPU is referencing it */ >> if (pol->flags & MPOL_F_MORON) { >> polnid = thisnid; >> >> if (!should_numa_migrate_memory(current, folio, curnid, >> thiscpu)) >> goto out; >> } >> >> if (curnid != polnid) >> ret = polnid; >> out: >> mpol_cond_put(pol); >> >> return ret; >> } > Ohh, right this code is confusing as hell. Thanks for the clarification. > With this in mind. There should be a comment warning about MPOL_F_MOF > always being unset as the userspace cannot really set it up. > > Thanks! > Hi Michal, sorry for the late reply. If we set MPOL_F_NUMA_BALANCING from userspace, then the MPOL_F_MOF and MPOL_F_MORON flags will get set in the kernel. /* Basic parameter sanity check used by both mbind() and set_mempolicy() */ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) { *flags = *mode & MPOL_MODE_FLAGS; *mode &= ~MPOL_MODE_FLAGS; if ((unsigned int)(*mode) >= MPOL_MAX) return -EINVAL; if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) return -EINVAL; if (*flags & MPOL_F_NUMA_BALANCING) { if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) *flags |= (MPOL_F_MOF | MPOL_F_MORON); else return -EINVAL; } In the current kernel it is supported only for MPOL_BIND, and we added support for MPOL_PREFERRED_MANY as well. Why is the MPOL_F_MOF flag required? --------------------------------- For NUMA migration, the process memory is unmapped periodically by "task_numa_work". If the unmapped memory is accessed again, a NUMA hinting page fault occurs, and the page fault handler migrates the pages. If MPOL_F_MOF is not set, "task_numa_work" will not unmap the process pages, and NUMA hinting page faults and migration will not occur. This behaviour was introduced by commit fc3147245d193b ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs"). How the new implementation works ---------------------------- MPOL_PREFERRED_MANY is able to set MPOL_F_MOF and MPOL_F_MORON through MPOL_F_NUMA_BALANCING, so NUMA hinting page faults will occur. In mpol_misplaced, if we can do numa migration, we select the currently executing node as the target node; otherwise we end up returning from the function with ret = NUMA_NO_NODE.
So, since we are able to set MPOL_F_MOF from userspace through MPOL_F_NUMA_BALANCING, there is no need to add this comment, right? Thanks Donet Tom ^ permalink raw reply [flat|nested] 33+ messages in thread
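To make the userspace side of the above concrete, here is a minimal sketch (not part of the patch series) of how a task would request this behaviour, assuming a numaif.h new enough to define MPOL_PREFERRED_MANY and MPOL_F_NUMA_BALANCING and linking with -lnuma; the node numbers are purely illustrative:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Hypothetical fast DRAM nodes 0 and 1 as the preferred set. */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	/* MPOL_F_NUMA_BALANCING is OR'ed into the mode argument. */
	if (set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
			  &nodemask, 8 * sizeof(nodemask) + 1) != 0) {
		perror("set_mempolicy");
		return EXIT_FAILURE;
	}

	printf("MPOL_PREFERRED_MANY with NUMA balancing enabled\n");
	return EXIT_SUCCESS;
}

On a kernel without this series, the combination is rejected with EINVAL by the sanitize_mpol_flags() check quoted above, which is exactly the check patch 3 relaxes.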
* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy 2024-02-17 7:31 ` [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy Donet Tom 2024-02-19 12:07 ` Michal Hocko 2024-02-19 14:20 ` Michal Hocko @ 2024-02-20 7:18 ` Huang, Ying 2024-02-20 7:53 ` Aneesh Kumar K.V 2 siblings, 1 reply; 33+ messages in thread From: Huang, Ying @ 2024-02-20 7:18 UTC (permalink / raw) To: Donet Tom Cc: Andrew Morton, linux-mm, linux-kernel, Aneesh Kumar, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan Donet Tom <donettom@linux.ibm.com> writes: > commit bda420b98505 ("numa balancing: migrate on fault among multiple bound > nodes") added support for migrate on protnone reference with MPOL_BIND > memory policy. This allowed numa fault migration when the executing node > is part of the policy mask for MPOL_BIND. This patch extends migration > support to MPOL_PREFERRED_MANY policy. > > Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag > MPOL_F_NUMA_BALANCING. This causes issues when we want to use > NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, > the kernel should not allocate pages from the slower memory tier via > allocation control zonelist fallback. Instead, we should move cold pages > from the faster memory node via memory demotion. For a page allocation, > kswapd is only woken up after we try to allocate pages from all nodes in > the allocation zone list. This implies that, without using memory > policies, we will end up allocating hot pages in the slower memory tier. > > MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add > MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better > allocation control when we have memory tiers in the system. With > MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only > of faster memory nodes. When we fail to allocate pages from the faster > memory node, kswapd would be woken up, allowing demotion of cold pages > to slower memory nodes. > > With the current kernel, such usage of memory policies implies we can't > do page promotion from a slower memory tier to a faster memory tier > using numa fault. This patch fixes this issue. > > For MPOL_PREFERRED_MANY, if the executing node is in the policy node > mask, we allow numa migration to the executing nodes. If the executing > node is not in the policy node mask but the folio is already allocated > based on policy preference (the folio node is in the policy node mask), > we don't allow numa migration. If both the executing node and folio node > are outside the policy node mask, we allow numa migration to the > executing nodes. 
> > Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> > Signed-off-by: Donet Tom <donettom@linux.ibm.com> > --- > mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- > 1 file changed, 26 insertions(+), 2 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 73d698e21dae..8c4c92b10371 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) > if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) > return -EINVAL; > if (*flags & MPOL_F_NUMA_BALANCING) { > - if (*mode != MPOL_BIND) > + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) > + *flags |= (MPOL_F_MOF | MPOL_F_MORON); > + else > return -EINVAL; > - *flags |= (MPOL_F_MOF | MPOL_F_MORON); > } > return 0; > } > @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) > kmem_cache_free(sn_cache, n); > } > > +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, > + struct mempolicy *pol) > +{ > + /* if the executing node is in the policy node mask, migrate */ > + if (node_isset(exec_node, pol->nodes)) > + return true; > + > + /* If the folio node is in policy node mask, don't migrate */ > + if (node_isset(folio_node, pol->nodes)) > + return false; > + /* > + * both the folio node and executing node are outside the policy nodemask, > + * migrate as normal numa fault migration. > + */ > + return true; Why? This may cause some unexpected result. For example, pages may be distributed among multiple sockets unexpectedly. So, I prefer the more conservative policy, that is, only migrate if this node is in pol->nodes. -- Best Regards, Huang, Ying > +} > + > /** > * mpol_misplaced - check whether current folio node is valid in policy > * > @@ -2526,6 +2544,12 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, > break; > > case MPOL_PREFERRED_MANY: > + if (pol->flags & MPOL_F_MORON) { > + if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol)) > + goto out; > + break; > + } > + > /* > * use current page if in policy nodemask, > * else select nearest allowed node, if any. ^ permalink raw reply [flat|nested] 33+ messages in thread
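To make the disputed case easier to see, here is a standalone sketch (plain C, not kernel code) of the proposed decision table, with an ordinary bitmask standing in for the kernel nodemask_t and a made-up policy that prefers only node 0:

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the logic of the proposed mpol_preferred_should_numa_migrate(). */
static bool should_migrate(int exec_node, int folio_node, unsigned long policy_nodes)
{
	if (policy_nodes & (1UL << exec_node))	/* executing node is preferred */
		return true;
	if (policy_nodes & (1UL << folio_node))	/* folio is already well placed */
		return false;
	return true;				/* both outside the preferred mask */
}

int main(void)
{
	unsigned long preferred = 1UL << 0;	/* hypothetical: prefer node 0 only */

	printf("exec=0 folio=2 -> %d\n", should_migrate(0, 2, preferred)); /* promote to node 0 */
	printf("exec=1 folio=0 -> %d\n", should_migrate(1, 0, preferred)); /* leave folio alone */
	printf("exec=1 folio=2 -> %d\n", should_migrate(1, 2, preferred)); /* migrates to node 1 */
	return 0;
}

The last line is the behaviour being questioned: the folio is pulled to node 1 even though node 1 is outside the preferred mask, whereas the more conservative option is to migrate only when the executing node is in pol->nodes.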
* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy 2024-02-20 7:18 ` Huang, Ying @ 2024-02-20 7:53 ` Aneesh Kumar K.V 2024-02-20 7:58 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: Aneesh Kumar K.V @ 2024-02-20 7:53 UTC (permalink / raw) To: Huang, Ying, Donet Tom Cc: Andrew Morton, linux-mm, linux-kernel, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan "Huang, Ying" <ying.huang@intel.com> writes: > Donet Tom <donettom@linux.ibm.com> writes: > >> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >> nodes") added support for migrate on protnone reference with MPOL_BIND >> memory policy. This allowed numa fault migration when the executing node >> is part of the policy mask for MPOL_BIND. This patch extends migration >> support to MPOL_PREFERRED_MANY policy. >> >> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >> the kernel should not allocate pages from the slower memory tier via >> allocation control zonelist fallback. Instead, we should move cold pages >> from the faster memory node via memory demotion. For a page allocation, >> kswapd is only woken up after we try to allocate pages from all nodes in >> the allocation zone list. This implies that, without using memory >> policies, we will end up allocating hot pages in the slower memory tier. >> >> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >> allocation control when we have memory tiers in the system. With >> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >> of faster memory nodes. When we fail to allocate pages from the faster >> memory node, kswapd would be woken up, allowing demotion of cold pages >> to slower memory nodes. >> >> With the current kernel, such usage of memory policies implies we can't >> do page promotion from a slower memory tier to a faster memory tier >> using numa fault. This patch fixes this issue. >> >> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >> mask, we allow numa migration to the executing nodes. If the executing >> node is not in the policy node mask but the folio is already allocated >> based on policy preference (the folio node is in the policy node mask), >> we don't allow numa migration. If both the executing node and folio node >> are outside the policy node mask, we allow numa migration to the >> executing nodes. 
>> >> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> >> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >> --- >> mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- >> 1 file changed, 26 insertions(+), 2 deletions(-) >> >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index 73d698e21dae..8c4c92b10371 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) >> if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) >> return -EINVAL; >> if (*flags & MPOL_F_NUMA_BALANCING) { >> - if (*mode != MPOL_BIND) >> + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) >> + *flags |= (MPOL_F_MOF | MPOL_F_MORON); >> + else >> return -EINVAL; >> - *flags |= (MPOL_F_MOF | MPOL_F_MORON); >> } >> return 0; >> } >> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) >> kmem_cache_free(sn_cache, n); >> } >> >> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, >> + struct mempolicy *pol) >> +{ >> + /* if the executing node is in the policy node mask, migrate */ >> + if (node_isset(exec_node, pol->nodes)) >> + return true; >> + >> + /* If the folio node is in policy node mask, don't migrate */ >> + if (node_isset(folio_node, pol->nodes)) >> + return false; >> + /* >> + * both the folio node and executing node are outside the policy nodemask, >> + * migrate as normal numa fault migration. >> + */ >> + return true; > > Why? This may cause some unexpected result. For example, pages may be > distributed among multiple sockets unexpectedly. So, I prefer the more > conservative policy, that is, only migrate if this node is in > pol->nodes. > This will only have an impact if the user specifies MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting for frequently accessed memory pages to be migrated. Memory policy MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of policy->nodes. For the specific use case that I am interested in, it should be okay to restrict it to policy->nodes. However, I am wondering if this is too restrictive given the definition of MPOL_PREFERRED_MANY. -aneesh ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy 2024-02-20 7:53 ` Aneesh Kumar K.V @ 2024-02-20 7:58 ` Huang, Ying 2024-03-03 6:16 ` Aneesh Kumar K.V 0 siblings, 1 reply; 33+ messages in thread From: Huang, Ying @ 2024-02-20 7:58 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Donet Tom, Andrew Morton, linux-mm, linux-kernel, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: > "Huang, Ying" <ying.huang@intel.com> writes: > >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>> nodes") added support for migrate on protnone reference with MPOL_BIND >>> memory policy. This allowed numa fault migration when the executing node >>> is part of the policy mask for MPOL_BIND. This patch extends migration >>> support to MPOL_PREFERRED_MANY policy. >>> >>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>> the kernel should not allocate pages from the slower memory tier via >>> allocation control zonelist fallback. Instead, we should move cold pages >>> from the faster memory node via memory demotion. For a page allocation, >>> kswapd is only woken up after we try to allocate pages from all nodes in >>> the allocation zone list. This implies that, without using memory >>> policies, we will end up allocating hot pages in the slower memory tier. >>> >>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>> allocation control when we have memory tiers in the system. With >>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>> of faster memory nodes. When we fail to allocate pages from the faster >>> memory node, kswapd would be woken up, allowing demotion of cold pages >>> to slower memory nodes. >>> >>> With the current kernel, such usage of memory policies implies we can't >>> do page promotion from a slower memory tier to a faster memory tier >>> using numa fault. This patch fixes this issue. >>> >>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>> mask, we allow numa migration to the executing nodes. If the executing >>> node is not in the policy node mask but the folio is already allocated >>> based on policy preference (the folio node is in the policy node mask), >>> we don't allow numa migration. If both the executing node and folio node >>> are outside the policy node mask, we allow numa migration to the >>> executing nodes. 
>>> >>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> >>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>> --- >>> mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- >>> 1 file changed, 26 insertions(+), 2 deletions(-) >>> >>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >>> index 73d698e21dae..8c4c92b10371 100644 >>> --- a/mm/mempolicy.c >>> +++ b/mm/mempolicy.c >>> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) >>> if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) >>> return -EINVAL; >>> if (*flags & MPOL_F_NUMA_BALANCING) { >>> - if (*mode != MPOL_BIND) >>> + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) >>> + *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>> + else >>> return -EINVAL; >>> - *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>> } >>> return 0; >>> } >>> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) >>> kmem_cache_free(sn_cache, n); >>> } >>> >>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, >>> + struct mempolicy *pol) >>> +{ >>> + /* if the executing node is in the policy node mask, migrate */ >>> + if (node_isset(exec_node, pol->nodes)) >>> + return true; >>> + >>> + /* If the folio node is in policy node mask, don't migrate */ >>> + if (node_isset(folio_node, pol->nodes)) >>> + return false; >>> + /* >>> + * both the folio node and executing node are outside the policy nodemask, >>> + * migrate as normal numa fault migration. >>> + */ >>> + return true; >> >> Why? This may cause some unexpected result. For example, pages may be >> distributed among multiple sockets unexpectedly. So, I prefer the more >> conservative policy, that is, only migrate if this node is in >> pol->nodes. >> > > This will only have an impact if the user specifies > MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting > for frequently accessed memory pages to be migrated. Memory policy > MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of > policy->nodes. For the specific use case that I am interested in, it > should be okay to restrict it to policy->nodes. However, I am wondering > if this is too restrictive given the definition of MPOL_PREFERRED_MANY. IMHO, we can start with some consecutive way and expand it if it's proved necessary. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy 2024-02-20 7:58 ` Huang, Ying @ 2024-03-03 6:16 ` Aneesh Kumar K.V 2024-03-04 1:59 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: Aneesh Kumar K.V @ 2024-03-03 6:16 UTC (permalink / raw) To: Huang, Ying Cc: Donet Tom, Andrew Morton, linux-mm, linux-kernel, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan "Huang, Ying" <ying.huang@intel.com> writes: > Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: > >> "Huang, Ying" <ying.huang@intel.com> writes: >> >>> Donet Tom <donettom@linux.ibm.com> writes: >>> >>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>>> nodes") added support for migrate on protnone reference with MPOL_BIND >>>> memory policy. This allowed numa fault migration when the executing node >>>> is part of the policy mask for MPOL_BIND. This patch extends migration >>>> support to MPOL_PREFERRED_MANY policy. >>>> >>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>>> the kernel should not allocate pages from the slower memory tier via >>>> allocation control zonelist fallback. Instead, we should move cold pages >>>> from the faster memory node via memory demotion. For a page allocation, >>>> kswapd is only woken up after we try to allocate pages from all nodes in >>>> the allocation zone list. This implies that, without using memory >>>> policies, we will end up allocating hot pages in the slower memory tier. >>>> >>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>>> allocation control when we have memory tiers in the system. With >>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>>> of faster memory nodes. When we fail to allocate pages from the faster >>>> memory node, kswapd would be woken up, allowing demotion of cold pages >>>> to slower memory nodes. >>>> >>>> With the current kernel, such usage of memory policies implies we can't >>>> do page promotion from a slower memory tier to a faster memory tier >>>> using numa fault. This patch fixes this issue. >>>> >>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>>> mask, we allow numa migration to the executing nodes. If the executing >>>> node is not in the policy node mask but the folio is already allocated >>>> based on policy preference (the folio node is in the policy node mask), >>>> we don't allow numa migration. If both the executing node and folio node >>>> are outside the policy node mask, we allow numa migration to the >>>> executing nodes. 
>>>> >>>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> >>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>> --- >>>> mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- >>>> 1 file changed, 26 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >>>> index 73d698e21dae..8c4c92b10371 100644 >>>> --- a/mm/mempolicy.c >>>> +++ b/mm/mempolicy.c >>>> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) >>>> if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) >>>> return -EINVAL; >>>> if (*flags & MPOL_F_NUMA_BALANCING) { >>>> - if (*mode != MPOL_BIND) >>>> + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) >>>> + *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>>> + else >>>> return -EINVAL; >>>> - *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>>> } >>>> return 0; >>>> } >>>> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) >>>> kmem_cache_free(sn_cache, n); >>>> } >>>> >>>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, >>>> + struct mempolicy *pol) >>>> +{ >>>> + /* if the executing node is in the policy node mask, migrate */ >>>> + if (node_isset(exec_node, pol->nodes)) >>>> + return true; >>>> + >>>> + /* If the folio node is in policy node mask, don't migrate */ >>>> + if (node_isset(folio_node, pol->nodes)) >>>> + return false; >>>> + /* >>>> + * both the folio node and executing node are outside the policy nodemask, >>>> + * migrate as normal numa fault migration. >>>> + */ >>>> + return true; >>> >>> Why? This may cause some unexpected result. For example, pages may be >>> distributed among multiple sockets unexpectedly. So, I prefer the more >>> conservative policy, that is, only migrate if this node is in >>> pol->nodes. >>> >> >> This will only have an impact if the user specifies >> MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting >> for frequently accessed memory pages to be migrated. Memory policy >> MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of >> policy->nodes. For the specific use case that I am interested in, it >> should be okay to restrict it to policy->nodes. However, I am wondering >> if this is too restrictive given the definition of MPOL_PREFERRED_MANY. > > IMHO, we can start with some consecutive way and expand it if it's > proved necessary. > Is this good? 1 file changed, 14 insertions(+), 34 deletions(-) mm/mempolicy.c | 48 ++++++++++++++---------------------------------- modified mm/mempolicy.c @@ -2464,23 +2464,6 @@ static void sp_free(struct sp_node *n) kmem_cache_free(sn_cache, n); } -static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, - struct mempolicy *pol) -{ - /* if the executing node is in the policy node mask, migrate */ - if (node_isset(exec_node, pol->nodes)) - return true; - - /* If the folio node is in policy node mask, don't migrate */ - if (node_isset(folio_node, pol->nodes)) - return false; - /* - * both the folio node and executing node are outside the policy nodemask, - * migrate as normal numa fault migration. 
- */ - return true; -} - /** * mpol_misplaced - check whether current folio node is valid in policy * @@ -2533,29 +2516,26 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf, break; case MPOL_BIND: - /* Optimize placement among multiple nodes via NUMA balancing */ + case MPOL_PREFERRED_MANY: + /* + * Even though MPOL_PREFERRED_MANY can allocate pages outside + * policy nodemask we don't allow numa migration to nodes + * outside policy nodemask for now. This is done so that if we + * want demotion to slow memory to happen, before allocating + * from some DRAM node say 'x', we will end up using a + * MPOL_PREFERRED_MANY mask excluding node 'x'. In such scenario + * we should not promote to node 'x' from slow memory node. + */ if (pol->flags & MPOL_F_MORON) { + /* + * Optimize placement among multiple nodes + * via NUMA balancing + */ if (node_isset(thisnid, pol->nodes)) break; goto out; } - if (node_isset(curnid, pol->nodes)) - goto out; - z = first_zones_zonelist( - node_zonelist(thisnid, GFP_HIGHUSER), - gfp_zone(GFP_HIGHUSER), - &pol->nodes); - polnid = zone_to_nid(z->zone); - break; - - case MPOL_PREFERRED_MANY: - if (pol->flags & MPOL_F_MORON) { - if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol)) - goto out; - break; - } - /* * use current page if in policy nodemask, * else select nearest allowed node, if any. [back] . ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy 2024-03-03 6:16 ` Aneesh Kumar K.V @ 2024-03-04 1:59 ` Huang, Ying 0 siblings, 0 replies; 33+ messages in thread From: Huang, Ying @ 2024-03-04 1:59 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Donet Tom, Andrew Morton, linux-mm, linux-kernel, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: > "Huang, Ying" <ying.huang@intel.com> writes: > >> Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: >> >>> "Huang, Ying" <ying.huang@intel.com> writes: >>> >>>> Donet Tom <donettom@linux.ibm.com> writes: >>>> >>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>>>> nodes") added support for migrate on protnone reference with MPOL_BIND >>>>> memory policy. This allowed numa fault migration when the executing node >>>>> is part of the policy mask for MPOL_BIND. This patch extends migration >>>>> support to MPOL_PREFERRED_MANY policy. >>>>> >>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>>>> the kernel should not allocate pages from the slower memory tier via >>>>> allocation control zonelist fallback. Instead, we should move cold pages >>>>> from the faster memory node via memory demotion. For a page allocation, >>>>> kswapd is only woken up after we try to allocate pages from all nodes in >>>>> the allocation zone list. This implies that, without using memory >>>>> policies, we will end up allocating hot pages in the slower memory tier. >>>>> >>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>>>> allocation control when we have memory tiers in the system. With >>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>>>> of faster memory nodes. When we fail to allocate pages from the faster >>>>> memory node, kswapd would be woken up, allowing demotion of cold pages >>>>> to slower memory nodes. >>>>> >>>>> With the current kernel, such usage of memory policies implies we can't >>>>> do page promotion from a slower memory tier to a faster memory tier >>>>> using numa fault. This patch fixes this issue. >>>>> >>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>>>> mask, we allow numa migration to the executing nodes. If the executing >>>>> node is not in the policy node mask but the folio is already allocated >>>>> based on policy preference (the folio node is in the policy node mask), >>>>> we don't allow numa migration. If both the executing node and folio node >>>>> are outside the policy node mask, we allow numa migration to the >>>>> executing nodes. 
>>>>> >>>>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> >>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>>> --- >>>>> mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- >>>>> 1 file changed, 26 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >>>>> index 73d698e21dae..8c4c92b10371 100644 >>>>> --- a/mm/mempolicy.c >>>>> +++ b/mm/mempolicy.c >>>>> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) >>>>> if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) >>>>> return -EINVAL; >>>>> if (*flags & MPOL_F_NUMA_BALANCING) { >>>>> - if (*mode != MPOL_BIND) >>>>> + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) >>>>> + *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>>>> + else >>>>> return -EINVAL; >>>>> - *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>>>> } >>>>> return 0; >>>>> } >>>>> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) >>>>> kmem_cache_free(sn_cache, n); >>>>> } >>>>> >>>>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, >>>>> + struct mempolicy *pol) >>>>> +{ >>>>> + /* if the executing node is in the policy node mask, migrate */ >>>>> + if (node_isset(exec_node, pol->nodes)) >>>>> + return true; >>>>> + >>>>> + /* If the folio node is in policy node mask, don't migrate */ >>>>> + if (node_isset(folio_node, pol->nodes)) >>>>> + return false; >>>>> + /* >>>>> + * both the folio node and executing node are outside the policy nodemask, >>>>> + * migrate as normal numa fault migration. >>>>> + */ >>>>> + return true; >>>> >>>> Why? This may cause some unexpected result. For example, pages may be >>>> distributed among multiple sockets unexpectedly. So, I prefer the more >>>> conservative policy, that is, only migrate if this node is in >>>> pol->nodes. >>>> >>> >>> This will only have an impact if the user specifies >>> MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting >>> for frequently accessed memory pages to be migrated. Memory policy >>> MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of >>> policy->nodes. For the specific use case that I am interested in, it >>> should be okay to restrict it to policy->nodes. However, I am wondering >>> if this is too restrictive given the definition of MPOL_PREFERRED_MANY. >> >> IMHO, we can start with some consecutive way and expand it if it's >> proved necessary. >> > > Is this good? > > 1 file changed, 14 insertions(+), 34 deletions(-) > mm/mempolicy.c | 48 ++++++++++++++---------------------------------- > > modified mm/mempolicy.c > @@ -2464,23 +2464,6 @@ static void sp_free(struct sp_node *n) > kmem_cache_free(sn_cache, n); > } > > -static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, > - struct mempolicy *pol) > -{ > - /* if the executing node is in the policy node mask, migrate */ > - if (node_isset(exec_node, pol->nodes)) > - return true; > - > - /* If the folio node is in policy node mask, don't migrate */ > - if (node_isset(folio_node, pol->nodes)) > - return false; > - /* > - * both the folio node and executing node are outside the policy nodemask, > - * migrate as normal numa fault migration. 
> - */ > - return true; > -} > - > /** > * mpol_misplaced - check whether current folio node is valid in policy > * > @@ -2533,29 +2516,26 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf, > break; > > case MPOL_BIND: > - /* Optimize placement among multiple nodes via NUMA balancing */ > + case MPOL_PREFERRED_MANY: > + /* > + * Even though MPOL_PREFERRED_MANY can allocate pages outside > + * policy nodemask we don't allow numa migration to nodes > + * outside policy nodemask for now. This is done so that if we > + * want demotion to slow memory to happen, before allocating > + * from some DRAM node say 'x', we will end up using a > + * MPOL_PREFERRED_MANY mask excluding node 'x'. In such scenario > + * we should not promote to node 'x' from slow memory node. > + */ > if (pol->flags & MPOL_F_MORON) { > + /* > + * Optimize placement among multiple nodes > + * via NUMA balancing > + */ > if (node_isset(thisnid, pol->nodes)) > break; > goto out; > } > > - if (node_isset(curnid, pol->nodes)) > - goto out; > - z = first_zones_zonelist( > - node_zonelist(thisnid, GFP_HIGHUSER), > - gfp_zone(GFP_HIGHUSER), > - &pol->nodes); > - polnid = zone_to_nid(z->zone); > - break; IMO, the above deletion should be put in another patch? -- Best Regards, Huang, Ying > - > - case MPOL_PREFERRED_MANY: > - if (pol->flags & MPOL_F_MORON) { > - if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol)) > - goto out; > - break; > - } > - > /* > * use current page if in policy nodemask, > * else select nearest allowed node, if any. > > [back] > . ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable 2024-02-17 7:31 [PATCH 1/3] mm/mempolicy: Use the already fetched local variable Donet Tom 2024-02-17 7:31 ` [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy Donet Tom @ 2024-02-18 21:38 ` Andrew Morton 2024-02-19 8:34 ` Donet Tom [not found] ` <bf7e6779f842fb65cf7bb9b2c617feb2af271cb7.1708097962.git.donettom@linux.ibm.com> 2 siblings, 1 reply; 33+ messages in thread From: Andrew Morton @ 2024-02-18 21:38 UTC (permalink / raw) To: Donet Tom Cc: linux-mm, linux-kernel, Aneesh Kumar, Huang Ying, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan On Sat, 17 Feb 2024 01:31:33 -0600 Donet Tom <donettom@linux.ibm.com> wrote: > Avoid doing a per cpu read and use the local variable thisnid. IMHO > this also makes the code more readable. > > ... > > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, > if (node_isset(curnid, pol->nodes)) > goto out; > z = first_zones_zonelist( > - node_zonelist(numa_node_id(), GFP_HIGHUSER), > + node_zonelist(thisnid, GFP_HIGHUSER), > gfp_zone(GFP_HIGHUSER), > &pol->nodes); > polnid = zone_to_nid(z->zone); int thisnid = cpu_to_node(thiscpu); Is there any dofference between numa_node_id() and cpu_to_node(raw_smp_processor_id())? And it it explicable that we're using one here and not the other? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable 2024-02-18 21:38 ` [PATCH 1/3] mm/mempolicy: Use the already fetched local variable Andrew Morton @ 2024-02-19 8:34 ` Donet Tom 2024-02-20 1:21 ` Andrew Morton 0 siblings, 1 reply; 33+ messages in thread From: Donet Tom @ 2024-02-19 8:34 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, Aneesh Kumar, Huang Ying, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan On 2/19/24 03:08, Andrew Morton wrote: > On Sat, 17 Feb 2024 01:31:33 -0600 Donet Tom <donettom@linux.ibm.com> wrote: > >> Avoid doing a per cpu read and use the local variable thisnid. IMHO >> this also makes the code more readable. >> >> ... >> >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, >> if (node_isset(curnid, pol->nodes)) >> goto out; >> z = first_zones_zonelist( >> - node_zonelist(numa_node_id(), GFP_HIGHUSER), >> + node_zonelist(thisnid, GFP_HIGHUSER), >> gfp_zone(GFP_HIGHUSER), >> &pol->nodes); >> polnid = zone_to_nid(z->zone); > int thisnid = cpu_to_node(thiscpu); > > Is there any dofference between numa_node_id() and > cpu_to_node(raw_smp_processor_id())? And it it explicable that we're > using one here and not the other? Hi Andrew Both numa_node_id() and cpu_to_node(raw_smp_processor_id()) return the current execution node id, Since the current execution node is already fetched at the beginning (thisnid) we can reuse it instead of getting it again. Thanks Donet Tom > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable 2024-02-19 8:34 ` Donet Tom @ 2024-02-20 1:21 ` Andrew Morton 2024-02-20 4:10 ` Aneesh Kumar K.V 0 siblings, 1 reply; 33+ messages in thread From: Andrew Morton @ 2024-02-20 1:21 UTC (permalink / raw) To: Donet Tom Cc: linux-mm, linux-kernel, Aneesh Kumar, Huang Ying, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan On Mon, 19 Feb 2024 14:04:23 +0530 Donet Tom <donettom@linux.ibm.com> wrote: > >> --- a/mm/mempolicy.c > >> +++ b/mm/mempolicy.c > >> @@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, > >> if (node_isset(curnid, pol->nodes)) > >> goto out; > >> z = first_zones_zonelist( > >> - node_zonelist(numa_node_id(), GFP_HIGHUSER), > >> + node_zonelist(thisnid, GFP_HIGHUSER), > >> gfp_zone(GFP_HIGHUSER), > >> &pol->nodes); > >> polnid = zone_to_nid(z->zone); > > int thisnid = cpu_to_node(thiscpu); > > > > Is there any dofference between numa_node_id() and > > cpu_to_node(raw_smp_processor_id())? And it it explicable that we're > > using one here and not the other? > > Hi Andrew > > Both numa_node_id() and cpu_to_node(raw_smp_processor_id()) return the current execution node id, > Since the current execution node is already fetched at the beginning (thisnid) we can reuse it instead of getting it again. Sure, but mine was a broader thought: why do we have both? Is one preferable and if so why? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable 2024-02-20 1:21 ` Andrew Morton @ 2024-02-20 4:10 ` Aneesh Kumar K.V 2024-02-20 6:25 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: Aneesh Kumar K.V @ 2024-02-20 4:10 UTC (permalink / raw) To: Andrew Morton, Donet Tom Cc: linux-mm, linux-kernel, Huang Ying, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan On 2/20/24 6:51 AM, Andrew Morton wrote: > On Mon, 19 Feb 2024 14:04:23 +0530 Donet Tom <donettom@linux.ibm.com> wrote: > >>>> --- a/mm/mempolicy.c >>>> +++ b/mm/mempolicy.c >>>> @@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, >>>> if (node_isset(curnid, pol->nodes)) >>>> goto out; >>>> z = first_zones_zonelist( >>>> - node_zonelist(numa_node_id(), GFP_HIGHUSER), >>>> + node_zonelist(thisnid, GFP_HIGHUSER), >>>> gfp_zone(GFP_HIGHUSER), >>>> &pol->nodes); >>>> polnid = zone_to_nid(z->zone); >>> int thisnid = cpu_to_node(thiscpu); >>> >>> Is there any dofference between numa_node_id() and >>> cpu_to_node(raw_smp_processor_id())? And it it explicable that we're >>> using one here and not the other? >> >> Hi Andrew >> >> Both numa_node_id() and cpu_to_node(raw_smp_processor_id()) return the current execution node id, >> Since the current execution node is already fetched at the beginning (thisnid) we can reuse it instead of getting it again. > > Sure, but mine was a broader thought: why do we have both? Is one > preferable and if so why? IIUC these are two helpers to fetch current numa node id. and either of them can be used based on need. The default implementation shows the details. (One small difference is numa_node_id() can use optimized per cpu reader because it is fetching the per cpu variable of the currently running cpu.) #ifndef numa_node_id /* Returns the number of the current Node. */ static inline int numa_node_id(void) { return raw_cpu_read(numa_node); } #endif #ifndef cpu_to_node static inline int cpu_to_node(int cpu) { return per_cpu(numa_node, cpu); } #endif In mpol_misplaced function, we need the cpu details because we are using that in other place (should_numa_migreate_memory()). So it makes it easy to use cpu_to_node(thiscpu) instead of numa_node_id(). -aneesh ^ permalink raw reply [flat|nested] 33+ messages in thread
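A rough userspace analogy for the distinction above (illustrative only, not from this thread): getcpu() answers "which CPU and node am I running on right now", while libnuma's numa_node_of_cpu() maps an arbitrary CPU number to its node, loosely mirroring numa_node_id() versus cpu_to_node(cpu). This assumes a glibc with the getcpu() wrapper (2.29+) and linking with -lnuma:

#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
	unsigned int cpu, node;

	/* Node of the CPU this task is executing on right now. */
	if (getcpu(&cpu, &node) == 0)
		printf("running on cpu %u, node %u\n", cpu, node);

	/* Node of an arbitrary CPU, here CPU 0. */
	printf("cpu 0 is on node %d\n", numa_node_of_cpu(0));
	return 0;
}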
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable 2024-02-20 4:10 ` Aneesh Kumar K.V @ 2024-02-20 6:25 ` Huang, Ying 2024-02-20 6:32 ` Aneesh Kumar K.V 0 siblings, 1 reply; 33+ messages in thread From: Huang, Ying @ 2024-02-20 6:25 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Andrew Morton, Donet Tom, linux-mm, linux-kernel, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes: > On 2/20/24 6:51 AM, Andrew Morton wrote: >> On Mon, 19 Feb 2024 14:04:23 +0530 Donet Tom <donettom@linux.ibm.com> wrote: >> >>>>> --- a/mm/mempolicy.c >>>>> +++ b/mm/mempolicy.c >>>>> @@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, >>>>> if (node_isset(curnid, pol->nodes)) >>>>> goto out; >>>>> z = first_zones_zonelist( >>>>> - node_zonelist(numa_node_id(), GFP_HIGHUSER), >>>>> + node_zonelist(thisnid, GFP_HIGHUSER), >>>>> gfp_zone(GFP_HIGHUSER), >>>>> &pol->nodes); >>>>> polnid = zone_to_nid(z->zone); >>>> int thisnid = cpu_to_node(thiscpu); >>>> >>>> Is there any dofference between numa_node_id() and >>>> cpu_to_node(raw_smp_processor_id())? And it it explicable that we're >>>> using one here and not the other? >>> >>> Hi Andrew >>> >>> Both numa_node_id() and cpu_to_node(raw_smp_processor_id()) return the current execution node id, >>> Since the current execution node is already fetched at the beginning (thisnid) we can reuse it instead of getting it again. >> >> Sure, but mine was a broader thought: why do we have both? Is one >> preferable and if so why? > > IIUC these are two helpers to fetch current numa node id. and either of them can be used based on need. The default implementation shows the details. > (One small difference is numa_node_id() can use optimized per cpu reader because it is fetching the per cpu variable of the currently running cpu.) > > #ifndef numa_node_id > /* Returns the number of the current Node. */ > static inline int numa_node_id(void) > { > return raw_cpu_read(numa_node); > } > #endif > > #ifndef cpu_to_node > static inline int cpu_to_node(int cpu) > { > return per_cpu(numa_node, cpu); > } > #endif > > In mpol_misplaced function, we need the cpu details because we are using that in other place (should_numa_migreate_memory()). So it makes it easy > to use cpu_to_node(thiscpu) instead of numa_node_id(). IIUC, numa_node_id() is faster than cpu_to_node(thiscpu), even if we have thiscpu already. cpu_to_node() is mainly used to get the node of NOT current CPU. So, IMHO, we should only use numa_node_id() in this function. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable 2024-02-20 6:25 ` Huang, Ying @ 2024-02-20 6:32 ` Aneesh Kumar K.V 2024-02-20 7:03 ` Aneesh Kumar K.V 0 siblings, 1 reply; 33+ messages in thread From: Aneesh Kumar K.V @ 2024-02-20 6:32 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Donet Tom, linux-mm, linux-kernel, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan On 2/20/24 11:55 AM, Huang, Ying wrote: > "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes: > >> On 2/20/24 6:51 AM, Andrew Morton wrote: >>> On Mon, 19 Feb 2024 14:04:23 +0530 Donet Tom <donettom@linux.ibm.com> wrote: >>> >>>>>> --- a/mm/mempolicy.c >>>>>> +++ b/mm/mempolicy.c >>>>>> @@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, >>>>>> if (node_isset(curnid, pol->nodes)) >>>>>> goto out; >>>>>> z = first_zones_zonelist( >>>>>> - node_zonelist(numa_node_id(), GFP_HIGHUSER), >>>>>> + node_zonelist(thisnid, GFP_HIGHUSER), >>>>>> gfp_zone(GFP_HIGHUSER), >>>>>> &pol->nodes); >>>>>> polnid = zone_to_nid(z->zone); >>>>> int thisnid = cpu_to_node(thiscpu); >>>>> >>>>> Is there any dofference between numa_node_id() and >>>>> cpu_to_node(raw_smp_processor_id())? And it it explicable that we're >>>>> using one here and not the other? >>>> >>>> Hi Andrew >>>> >>>> Both numa_node_id() and cpu_to_node(raw_smp_processor_id()) return the current execution node id, >>>> Since the current execution node is already fetched at the beginning (thisnid) we can reuse it instead of getting it again. >>> >>> Sure, but mine was a broader thought: why do we have both? Is one >>> preferable and if so why? >> >> IIUC these are two helpers to fetch current numa node id. and either of them can be used based on need. The default implementation shows the details. >> (One small difference is numa_node_id() can use optimized per cpu reader because it is fetching the per cpu variable of the currently running cpu.) >> >> #ifndef numa_node_id >> /* Returns the number of the current Node. */ >> static inline int numa_node_id(void) >> { >> return raw_cpu_read(numa_node); >> } >> #endif >> >> #ifndef cpu_to_node >> static inline int cpu_to_node(int cpu) >> { >> return per_cpu(numa_node, cpu); >> } >> #endif >> >> In mpol_misplaced function, we need the cpu details because we are using that in other place (should_numa_migreate_memory()). So it makes it easy >> to use cpu_to_node(thiscpu) instead of numa_node_id(). > > IIUC, numa_node_id() is faster than cpu_to_node(thiscpu), even if we > have thiscpu already. cpu_to_node() is mainly used to get the node of > NOT current CPU. So, IMHO, we should only use numa_node_id() in this > function. > This change? modified mm/mempolicy.c @@ -2502,8 +2502,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, pgoff_t ilx; struct zoneref *z; int curnid = folio_nid(folio); - int thiscpu = raw_smp_processor_id(); - int thisnid = cpu_to_node(thiscpu); + int thisnid = numa_node_id(); int polnid = NUMA_NO_NODE; int ret = NUMA_NO_NODE; @@ -2573,7 +2572,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, polnid = thisnid; if (!should_numa_migrate_memory(current, folio, curnid, - thiscpu)) + raw_smp_processor_id())) goto out; } -aneesh ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable 2024-02-20 6:32 ` Aneesh Kumar K.V @ 2024-02-20 7:03 ` Aneesh Kumar K.V 2024-02-20 7:22 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: Aneesh Kumar K.V @ 2024-02-20 7:03 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Donet Tom, linux-mm, linux-kernel, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan On 2/20/24 12:02 PM, Aneesh Kumar K.V wrote: > On 2/20/24 11:55 AM, Huang, Ying wrote: >> "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes: >> >>> On 2/20/24 6:51 AM, Andrew Morton wrote: >>>> On Mon, 19 Feb 2024 14:04:23 +0530 Donet Tom <donettom@linux.ibm.com> wrote: >>>> >>>>>>> --- a/mm/mempolicy.c >>>>>>> +++ b/mm/mempolicy.c >>>>>>> @@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, >>>>>>> if (node_isset(curnid, pol->nodes)) >>>>>>> goto out; >>>>>>> z = first_zones_zonelist( >>>>>>> - node_zonelist(numa_node_id(), GFP_HIGHUSER), >>>>>>> + node_zonelist(thisnid, GFP_HIGHUSER), >>>>>>> gfp_zone(GFP_HIGHUSER), >>>>>>> &pol->nodes); >>>>>>> polnid = zone_to_nid(z->zone); >>>>>> int thisnid = cpu_to_node(thiscpu); >>>>>> >>>>>> Is there any dofference between numa_node_id() and >>>>>> cpu_to_node(raw_smp_processor_id())? And it it explicable that we're >>>>>> using one here and not the other? >>>>> >>>>> Hi Andrew >>>>> >>>>> Both numa_node_id() and cpu_to_node(raw_smp_processor_id()) return the current execution node id, >>>>> Since the current execution node is already fetched at the beginning (thisnid) we can reuse it instead of getting it again. >>>> >>>> Sure, but mine was a broader thought: why do we have both? Is one >>>> preferable and if so why? >>> >>> IIUC these are two helpers to fetch current numa node id. and either of them can be used based on need. The default implementation shows the details. >>> (One small difference is numa_node_id() can use optimized per cpu reader because it is fetching the per cpu variable of the currently running cpu.) >>> >>> #ifndef numa_node_id >>> /* Returns the number of the current Node. */ >>> static inline int numa_node_id(void) >>> { >>> return raw_cpu_read(numa_node); >>> } >>> #endif >>> >>> #ifndef cpu_to_node >>> static inline int cpu_to_node(int cpu) >>> { >>> return per_cpu(numa_node, cpu); >>> } >>> #endif >>> >>> In mpol_misplaced function, we need the cpu details because we are using that in other place (should_numa_migreate_memory()). So it makes it easy >>> to use cpu_to_node(thiscpu) instead of numa_node_id(). >> >> IIUC, numa_node_id() is faster than cpu_to_node(thiscpu), even if we >> have thiscpu already. cpu_to_node() is mainly used to get the node of >> NOT current CPU. So, IMHO, we should only use numa_node_id() in this >> function. >> > > This change? 
> > modified mm/mempolicy.c > @@ -2502,8 +2502,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, > pgoff_t ilx; > struct zoneref *z; > int curnid = folio_nid(folio); > - int thiscpu = raw_smp_processor_id(); > - int thisnid = cpu_to_node(thiscpu); > + int thisnid = numa_node_id(); > int polnid = NUMA_NO_NODE; > int ret = NUMA_NO_NODE; > > @@ -2573,7 +2572,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, > polnid = thisnid; > > if (!should_numa_migrate_memory(current, folio, curnid, > - thiscpu)) > + raw_smp_processor_id())) > goto out; > } > > One of the problems with the above change is the need to make sure the smp processor id remains stable, which I am not sure we want in this function. With that change we can end up with a processor id that is not related to the numa node id we are using. -aneesh ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable
  2024-02-20 7:03             ` Aneesh Kumar K.V
@ 2024-02-20 7:22               ` Huang, Ying
  2024-02-20 9:03                 ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-02-20 7:22 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Andrew Morton, Donet Tom, linux-mm, linux-kernel, Dave Hansen,
    Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko,
    Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel,
    Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka,
    Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan

"Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes:

> On 2/20/24 12:02 PM, Aneesh Kumar K.V wrote:
>> On 2/20/24 11:55 AM, Huang, Ying wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes:
>>>
>>>> On 2/20/24 6:51 AM, Andrew Morton wrote:
>>>>> On Mon, 19 Feb 2024 14:04:23 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
>>>>>
>>>>>>>> --- a/mm/mempolicy.c
>>>>>>>> +++ b/mm/mempolicy.c
>>>>>>>> @@ -2526,7 +2526,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>>  		if (node_isset(curnid, pol->nodes))
>>>>>>>>  			goto out;
>>>>>>>>  		z = first_zones_zonelist(
>>>>>>>> -				node_zonelist(numa_node_id(), GFP_HIGHUSER),
>>>>>>>> +				node_zonelist(thisnid, GFP_HIGHUSER),
>>>>>>>>  				gfp_zone(GFP_HIGHUSER),
>>>>>>>>  				&pol->nodes);
>>>>>>>>  		polnid = zone_to_nid(z->zone);
>>>>>>> int thisnid = cpu_to_node(thiscpu);
>>>>>>>
>>>>>>> Is there any difference between numa_node_id() and
>>>>>>> cpu_to_node(raw_smp_processor_id())?  And is it explicable that we're
>>>>>>> using one here and not the other?
>>>>>>
>>>>>> Hi Andrew
>>>>>>
>>>>>> Both numa_node_id() and cpu_to_node(raw_smp_processor_id()) return the current execution node id.
>>>>>> Since the current execution node is already fetched at the beginning (thisnid) we can reuse it instead of getting it again.
>>>>>
>>>>> Sure, but mine was a broader thought: why do we have both?  Is one
>>>>> preferable and if so why?
>>>>
>>>> IIUC these are two helpers to fetch the current numa node id, and either of them can be used based on need. The default implementation shows the details.
>>>> (One small difference is numa_node_id() can use the optimized per cpu reader because it is fetching the per cpu variable of the currently running cpu.)
>>>>
>>>> #ifndef numa_node_id
>>>> /* Returns the number of the current Node. */
>>>> static inline int numa_node_id(void)
>>>> {
>>>> 	return raw_cpu_read(numa_node);
>>>> }
>>>> #endif
>>>>
>>>> #ifndef cpu_to_node
>>>> static inline int cpu_to_node(int cpu)
>>>> {
>>>> 	return per_cpu(numa_node, cpu);
>>>> }
>>>> #endif
>>>>
>>>> In the mpol_misplaced function, we need the cpu details because we are using that in another place (should_numa_migrate_memory()). So it makes it easy
>>>> to use cpu_to_node(thiscpu) instead of numa_node_id().
>>>
>>> IIUC, numa_node_id() is faster than cpu_to_node(thiscpu), even if we
>>> have thiscpu already.  cpu_to_node() is mainly used to get the node of
>>> NOT the current CPU.  So, IMHO, we should only use numa_node_id() in this
>>> function.
>>>
>>
>> This change?
>>
>> modified   mm/mempolicy.c
>> @@ -2502,8 +2502,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>>  	pgoff_t ilx;
>>  	struct zoneref *z;
>>  	int curnid = folio_nid(folio);
>> -	int thiscpu = raw_smp_processor_id();
>> -	int thisnid = cpu_to_node(thiscpu);
>> +	int thisnid = numa_node_id();
>>  	int polnid = NUMA_NO_NODE;
>>  	int ret = NUMA_NO_NODE;
>>
>> @@ -2573,7 +2572,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>>  		polnid = thisnid;
>>
>>  		if (!should_numa_migrate_memory(current, folio, curnid,
>> -						thiscpu))
>> +						raw_smp_processor_id()))
>>  			goto out;
>>  	}
>>
>
> One of the problems with the above change is the need to make sure the smp
> processor id remains stable, which I am not sure we want in this function.
> With that we can end up with a processor id not related to the numa node id
> we are using.

This isn't an issue now, because mpol_misplaced() are always called with
PTL held.  And, we can still keep thiscpu local variable.

--
Best Regards,
Huang, Ying

^ permalink raw reply [flat|nested] 33+ messages in thread
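(A companion sketch to the userspace analogy earlier in the thread, again
userspace-only and assuming CPU 0 exists and is permitted by the affinity
mask: once migration is prevented, here by pinning the thread, in the kernel
by holding the page table lock so the task cannot be moved off the CPU,
consecutive "current CPU" reads must agree. That is why, with the PTL held,
thiscpu and thisnid refer to the same CPU.)

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(0, &set);		/* pin this thread to CPU 0 */
		if (sched_setaffinity(0, sizeof(set), &set))
			return 1;

		int cpu = sched_getcpu();
		int again = sched_getcpu();	/* cannot differ while pinned */
		printf("cpu=%d again=%d\n", cpu, again);
		return 0;
	}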
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable
  2024-02-20 7:22               ` Huang, Ying
@ 2024-02-20 9:03                 ` Michal Hocko
  2024-03-03 6:17                   ` Aneesh Kumar K.V
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2024-02-20 9:03 UTC (permalink / raw)
To: Huang, Ying
Cc: Aneesh Kumar K.V, Andrew Morton, Donet Tom, linux-mm,
    linux-kernel, Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang,
    Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel,
    Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka,
    Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan

On Tue 20-02-24 15:22:07, Huang, Ying wrote:
[...]
> This isn't an issue now, because mpol_misplaced() are always called with
> PTL held.  And, we can still keep thiscpu local variable.

yes, this is the case but it would be better if we made that assumption
official by lockdep_assert_held
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable
  2024-02-20 9:03                 ` Michal Hocko
@ 2024-03-03 6:17                   ` Aneesh Kumar K.V
  2024-03-04 1:49                     ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Aneesh Kumar K.V @ 2024-03-03 6:17 UTC (permalink / raw)
To: Michal Hocko, Huang, Ying
Cc: Andrew Morton, Donet Tom, linux-mm, linux-kernel, Dave Hansen,
    Mel Gorman, Ben Widawsky, Feng Tang, Andrea Arcangeli,
    Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner,
    Matthew Wilcox, Mike Kravetz, Vlastimil Babka, Dan Williams,
    Hugh Dickins, Kefeng Wang, Suren Baghdasaryan

Michal Hocko <mhocko@suse.com> writes:

> On Tue 20-02-24 15:22:07, Huang, Ying wrote:
> [...]
>> This isn't an issue now, because mpol_misplaced() are always called with
>> PTL held.  And, we can still keep thiscpu local variable.
>
> yes, this is the case but it would be better if we made that assumption
> official by lockdep_assert_held
>

How about this folded into this patch?

 2 files changed, 12 insertions(+), 4 deletions(-)
 mm/memory.c    |  6 ++++--
 mm/mempolicy.c | 10 ++++++++--

modified   mm/memory.c
@@ -4879,9 +4879,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	return ret;
 }
 
-int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
+int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
 		      unsigned long addr, int page_nid, int *flags)
 {
+	struct vm_area_struct *vma = vmf->vma;
+
 	folio_get(folio);
 
 	/* Record the current PID acceesing VMA */
@@ -4893,7 +4895,7 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
 		*flags |= TNF_FAULT_LOCAL;
 	}
 
-	return mpol_misplaced(folio, vma, addr);
+	return mpol_misplaced(folio, vmf, addr);
 }
 
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
modified   mm/mempolicy.c
@@ -2495,18 +2495,24 @@ static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_n
  * Return: NUMA_NO_NODE if the page is in a node that is valid for this
  * policy, or a suitable node ID to allocate a replacement folio from.
  */
-int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
+int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 		   unsigned long addr)
 {
 	struct mempolicy *pol;
 	pgoff_t ilx;
 	struct zoneref *z;
 	int curnid = folio_nid(folio);
+	struct vm_area_struct *vma = vmf->vma;
 	int thiscpu = raw_smp_processor_id();
-	int thisnid = cpu_to_node(thiscpu);
+	int thisnid = numa_node_id();
 	int polnid = NUMA_NO_NODE;
 	int ret = NUMA_NO_NODE;
 
+	/*
+	 * Make sure ptl is held so that we don't preempt and we
+	 * have a stable smp processor id
+	 */
+	lockdep_assert_held(vmf->ptl);
 	pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out;

[back]

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/mempolicy: Use the already fetched local variable
  2024-03-03 6:17                   ` Aneesh Kumar K.V
@ 2024-03-04 1:49                     ` Huang, Ying
  0 siblings, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2024-03-04 1:49 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Michal Hocko, Andrew Morton, Donet Tom, linux-mm, linux-kernel,
    Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang,
    Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel,
    Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka,
    Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan

Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes:

> Michal Hocko <mhocko@suse.com> writes:
>
>> On Tue 20-02-24 15:22:07, Huang, Ying wrote:
>> [...]
>>> This isn't an issue now, because mpol_misplaced() are always called with
>>> PTL held.  And, we can still keep thiscpu local variable.
>>
>> yes, this is the case but it would be better if we made that assumption
>> official by lockdep_assert_held
>>
>
> How about this folded into this patch?
>
>  2 files changed, 12 insertions(+), 4 deletions(-)
>  mm/memory.c    |  6 ++++--
>  mm/mempolicy.c | 10 ++++++++--
>
> modified   mm/memory.c
> @@ -4879,9 +4879,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>  	return ret;
>  }
>
> -int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
> +int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>  		      unsigned long addr, int page_nid, int *flags)
>  {
> +	struct vm_area_struct *vma = vmf->vma;
> +
>  	folio_get(folio);
>
>  	/* Record the current PID acceesing VMA */
> @@ -4893,7 +4895,7 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
>  		*flags |= TNF_FAULT_LOCAL;
>  	}
>
> -	return mpol_misplaced(folio, vma, addr);
> +	return mpol_misplaced(folio, vmf, addr);
>  }
>
>  static vm_fault_t do_numa_page(struct vm_fault *vmf)
> modified   mm/mempolicy.c
> @@ -2495,18 +2495,24 @@ static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_n
>   * Return: NUMA_NO_NODE if the page is in a node that is valid for this
>   * policy, or a suitable node ID to allocate a replacement folio from.
>   */
> -int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
> +int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
>  		   unsigned long addr)
> {
>  	struct mempolicy *pol;
>  	pgoff_t ilx;
>  	struct zoneref *z;
>  	int curnid = folio_nid(folio);
> +	struct vm_area_struct *vma = vmf->vma;
>  	int thiscpu = raw_smp_processor_id();
> -	int thisnid = cpu_to_node(thiscpu);
> +	int thisnid = numa_node_id();
>  	int polnid = NUMA_NO_NODE;
>  	int ret = NUMA_NO_NODE;
>
> +	/*
> +	 * Make sure ptl is held so that we don't preempt and we
> +	 * have a stable smp processor id
> +	 */
> +	lockdep_assert_held(vmf->ptl);
>  	pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
>  	if (!(pol->flags & MPOL_F_MOF))
>  		goto out;
>
> [back]
>

LGTM, Thanks!

--
Best Regards,
Huang, Ying

^ permalink raw reply [flat|nested] 33+ messages in thread
[parent not found: <bf7e6779f842fb65cf7bb9b2c617feb2af271cb7.1708097962.git.donettom@linux.ibm.com>]
* Re: [PATCH 2/3] mm/mempolicy: Avoid the fallthrough with MPOL_BIND in mpol_misplaced.
       [not found] ` <bf7e6779f842fb65cf7bb9b2c617feb2af271cb7.1708097962.git.donettom@linux.ibm.com>
@ 2024-02-19 12:02   ` Michal Hocko
  2024-02-19 15:18     ` Donet Tom
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2024-02-19 12:02 UTC (permalink / raw)
To: Donet Tom
Cc: Andrew Morton, linux-mm, linux-kernel, Aneesh Kumar, Huang Ying,
    Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang,
    Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel,
    Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka,
    Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan

On Sat 17-02-24 01:31:34, Donet Tom wrote:
> We will update MPOL_PREFERRED_MANY in the follow up patch. This change
> is required for that.

Why is it a separate patch then? Does it make review of the next patch
easier? If so make it explicit in the changelog.

> 
> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>  mm/mempolicy.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8478574c000c..73d698e21dae 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2515,7 +2515,15 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>  			break;
>  		goto out;
>  	}
> -	fallthrough;
> +
> +	if (node_isset(curnid, pol->nodes))
> +		goto out;
> +	z = first_zones_zonelist(
> +			node_zonelist(thisnid, GFP_HIGHUSER),
> +			gfp_zone(GFP_HIGHUSER),
> +			&pol->nodes);
> +	polnid = zone_to_nid(z->zone);
> +	break;
> 
>  	case MPOL_PREFERRED_MANY:
>  		/*
> -- 
> 2.39.3
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3] mm/mempolicy: Avoid the fallthrough with MPOL_BIND in mpol_misplaced.
  2024-02-19 12:02   ` [PATCH 2/3] mm/mempolicy: Avoid the fallthrough with MPOL_BIND in mpol_misplaced Michal Hocko
@ 2024-02-19 15:18     ` Donet Tom
  0 siblings, 0 replies; 33+ messages in thread
From: Donet Tom @ 2024-02-19 15:18 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, linux-kernel, Aneesh Kumar, Huang Ying,
    Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang,
    Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel,
    Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka,
    Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan

On 2/19/24 17:32, Michal Hocko wrote:
> On Sat 17-02-24 01:31:34, Donet Tom wrote:
>> We will update MPOL_PREFERRED_MANY in the follow up patch. This change
>> is required for that.
> Why is it a separate patch then? Does it make review of the next patch
> easier? If so make it explicit in the changelog.

Hi Michal

There are no functional changes in this patch; it is just a re-arrangement
of the code. Patch 3 is the actual fix, and it would not look nice if we
mixed these patches. As you said, it also makes reviewing easier.

That's why we kept it as a separate patch.

Thanks
Donet Tom

>
>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>  mm/mempolicy.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 8478574c000c..73d698e21dae 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2515,7 +2515,15 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>>  			break;
>>  		goto out;
>>  	}
>> -	fallthrough;
>> +
>> +	if (node_isset(curnid, pol->nodes))
>> +		goto out;
>> +	z = first_zones_zonelist(
>> +			node_zonelist(thisnid, GFP_HIGHUSER),
>> +			gfp_zone(GFP_HIGHUSER),
>> +			&pol->nodes);
>> +	polnid = zone_to_nid(z->zone);
>> +	break;
>>
>>  	case MPOL_PREFERRED_MANY:
>>  		/*
>> -- 
>> 2.39.3

^ permalink raw reply [flat|nested] 33+ messages in thread
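(For readers skimming the thread, a simplified outline of the mpol_misplaced()
switch after this re-arrangement; it is a sketch assembled from the diffs
quoted above, with the MPOL_F_MORON handling and the other policy modes
elided, not the exact kernel source. The point is that MPOL_BIND now keeps its
own copy of the zonelist lookup and ends with break, so the
MPOL_PREFERRED_MANY case is free to grow its own migration rules in patch 3
without affecting MPOL_BIND.)

	switch (pol->mode) {
	case MPOL_BIND:
		/* MPOL_F_MORON handling elided */
		if (node_isset(curnid, pol->nodes))
			goto out;
		z = first_zones_zonelist(
				node_zonelist(thisnid, GFP_HIGHUSER),
				gfp_zone(GFP_HIGHUSER),
				&pol->nodes);
		polnid = zone_to_nid(z->zone);
		break;			/* no longer falls through */

	case MPOL_PREFERRED_MANY:
		/* can now diverge from MPOL_BIND, see PATCH 3/3 */
		break;

	/* other policy modes elided */
	}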
end of thread
Thread overview: 33+ messages
2024-02-17 7:31 [PATCH 1/3] mm/mempolicy: Use the already fetched local variable Donet Tom
2024-02-17 7:31 ` [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy Donet Tom
2024-02-19 12:07 ` Michal Hocko
2024-02-19 13:44 ` Donet Tom
2024-02-20 6:36 ` Huang, Ying
2024-02-20 6:44 ` Aneesh Kumar K.V
2024-02-20 7:23 ` Huang, Ying
2024-02-20 7:46 ` Aneesh Kumar K.V
2024-02-20 8:01 ` Huang, Ying
2024-02-19 14:20 ` Michal Hocko
2024-02-19 15:07 ` Donet Tom
2024-02-19 19:12 ` Michal Hocko
2024-02-20 3:57 ` Aneesh Kumar K.V
2024-02-20 8:48 ` Michal Hocko
2024-02-26 13:09 ` Donet Tom
2024-02-20 7:18 ` Huang, Ying
2024-02-20 7:53 ` Aneesh Kumar K.V
2024-02-20 7:58 ` Huang, Ying
2024-03-03 6:16 ` Aneesh Kumar K.V
2024-03-04 1:59 ` Huang, Ying
2024-02-18 21:38 ` [PATCH 1/3] mm/mempolicy: Use the already fetched local variable Andrew Morton
2024-02-19 8:34 ` Donet Tom
2024-02-20 1:21 ` Andrew Morton
2024-02-20 4:10 ` Aneesh Kumar K.V
2024-02-20 6:25 ` Huang, Ying
2024-02-20 6:32 ` Aneesh Kumar K.V
2024-02-20 7:03 ` Aneesh Kumar K.V
2024-02-20 7:22 ` Huang, Ying
2024-02-20 9:03 ` Michal Hocko
2024-03-03 6:17 ` Aneesh Kumar K.V
2024-03-04 1:49 ` Huang, Ying
[not found] ` <bf7e6779f842fb65cf7bb9b2c617feb2af271cb7.1708097962.git.donettom@linux.ibm.com>
2024-02-19 12:02 ` [PATCH 2/3] mm/mempolicy: Avoid the fallthrough with MPOL_BIND in mpol_misplaced Michal Hocko
2024-02-19 15:18 ` Donet Tom