linux-mm.kvack.org archive mirror
* [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection
@ 2026-01-07  7:28 Bing Jiao
  2026-01-07  7:28 ` [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio() Bing Jiao
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Bing Jiao @ 2026-01-07  7:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Bing Jiao, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

In tiered memory systems, demotion aims to move cold folios to the
far-tier nodes. To maintain system performance, the demotion target
should ideally be the node with the shortest NUMA distance from the
source node.

However, the current implementation has two suboptimal behaviors:

1. Unbalanced Fallback: When the primary preferred demotion node is full,
   the allocator falls back to other nodes in a way that often skews
   toward zones closer to the primary preferred node rather than
   distributing the load evenly across fallback nodes.

2. Suboptimal Target Selection: demote_folio_list() randomly selects
   a preferred node from the allowed mask, potentially selecting
   a very distant node.

This series optimizes the selection logic while ensuring balanced
allocation across fallback nodes.

Patch 1/2 introduces a randomized fallback mechanism in
alloc_demote_folio() to prevent allocation hotspots when the preferred
node is under memory pressure.

Patch 2/2 updates demote_folio_list() to traverse the demotion targets
hierarchically, ensuring the perferred target is always the closest
available node.

Dependencies:

1. Patch 1/2 has no dependencies.

2. Patch 2/2 depends on commit d7138516fb5c ("mm/vmscan: fix demotion
   targets checks in reclaim/demotion").

Bing Jiao (2):
  mm/vmscan: balance demotion allocation in alloc_demote_folio()
  mm/vmscan: select the optimal perferred node in demote_folio_list()

 mm/vmscan.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

--
2.52.0.358.g0dd7633a29-goog




* [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio()
  2026-01-07  7:28 [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection Bing Jiao
@ 2026-01-07  7:28 ` Bing Jiao
  2026-01-08 12:44   ` Donet Tom
  2026-01-07  7:28 ` [PATCH v1 2/2] mm/vmscan: select the closest perferred node in demote_folio_list() Bing Jiao
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Bing Jiao @ 2026-01-07  7:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Bing Jiao, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

When the preferred demotion node does not have enough free space,
alloc_demote_folio() attempts to allocate from fallback nodes.
Currently, it lacks a mechanism to distribute these fallback allocations,
which can lead to unbalanced memory pressure across fallback nodes.

Balance the allocation by randomly selecting a new preferred node from
the fallback nodes if the initial allocation from the old preferred
node fails.

Signed-off-by: Bing Jiao <bingjiao@google.com>
---
 mm/vmscan.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 81828fa625ed..db2413c4bd26 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1009,6 +1009,11 @@ static struct folio *alloc_demote_folio(struct folio *src,
 	if (dst)
 		return dst;

+	/* Randomly select a node from fallback nodes for balanced allocation */
+	if (allowed_mask) {
+		mtc->nid = node_random(allowed_mask);
+		node_clear(mtc->nid, *allowed_mask);
+	}
 	mtc->gfp_mask &= ~__GFP_THISNODE;
 	mtc->nmask = allowed_mask;
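As a simplified userspace sketch of the fallback step above (nodemask_t, node_isset(), node_clear(), and node_random() here are stand-ins for the kernel's nodemask API, and the PRNG is a deterministic LCG rather than the kernel's entropy source):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NUMNODES 8
#define NUMA_NO_NODE (-1)

/* Simplified stand-in for the kernel's nodemask_t: one bit per node. */
typedef struct { uint32_t bits; } nodemask_t;

int node_isset(int nid, nodemask_t m) { return (m.bits >> nid) & 1; }
void node_clear(int nid, nodemask_t *m) { m->bits &= ~(1u << nid); }
int nodes_weight(nodemask_t m) { return __builtin_popcount(m.bits); }

/* Deterministic LCG so the sketch is reproducible. */
static uint32_t prng_state = 1;
uint32_t prng_next(void)
{
	prng_state = prng_state * 1664525u + 1013904223u;
	return prng_state >> 16;
}

/* Pick a uniformly random set bit, like node_random(). */
int node_random(const nodemask_t *m)
{
	int w = __builtin_popcount(m->bits);
	int k, nid;

	if (w == 0)
		return NUMA_NO_NODE;
	k = prng_next() % w;
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (node_isset(nid, *m) && k-- == 0)
			return nid;
	return NUMA_NO_NODE;
}

/*
 * The patch's fallback step: pick a random fallback node and clear it
 * from the mask, so a further retry cannot pick the same node again.
 */
int pick_fallback(nodemask_t *allowed)
{
	int nid = node_random(allowed);

	if (nid >= 0)
		node_clear(nid, allowed);
	return nid;
}
```

Because the chosen node is cleared from the mask, repeated calls walk through the fallback set without revisiting an exhausted node.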

--
2.52.0.358.g0dd7633a29-goog




* [PATCH v1 2/2] mm/vmscan: select the closest perferred node in demote_folio_list()
  2026-01-07  7:28 [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection Bing Jiao
  2026-01-07  7:28 ` [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio() Bing Jiao
@ 2026-01-07  7:28 ` Bing Jiao
  2026-01-07 17:39 ` [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection Andrew Morton
  2026-01-07 17:46 ` Joshua Hahn
  3 siblings, 0 replies; 7+ messages in thread
From: Bing Jiao @ 2026-01-07  7:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Bing Jiao, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

The preferred demotion node should be the one closest to the source
node to minimize migration latency. However, if the preferred node
is not set in mems_allowed, demote_folio_list() currently randomly
selects one from the allowed nodes as the new preferred node. This can
result in selecting a very distant node.

Update demote_folio_list() to traverse the demotion targets
hierarchically until the perferred node is set in mems_allowed,
ensuring the perferred target is always the closest available node.

Signed-off-by: Bing Jiao <bingjiao@google.com>
---
 mm/vmscan.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index db2413c4bd26..d452974c946e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1052,8 +1052,18 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	if (nodes_empty(allowed_mask))
 		return false;

-	if (!node_isset(target_nid, allowed_mask))
-		target_nid = node_random(&allowed_mask);
+	while (target_nid != NUMA_NO_NODE &&
+	       !node_isset(target_nid, allowed_mask)) {
+		/* Get the preferred demotion target from the next tier. */
+		target_nid = next_demotion_node(target_nid);
+	}
+
+	/*
+	 * The perferred node query is subject to race conditions such as
+	 * nodes in the next tier are hot-unplugged.
+	 */
+	if (target_nid == NUMA_NO_NODE)
+		return 0;
 	mtc.nid = target_nid;

 	/* Demotion ignores all cpuset and mempolicy settings */
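As a simplified userspace sketch of the traversal above, with next_demotion_node() stood in by a small per-node table (the topology below, 0 -> 2 -> 3, is hypothetical):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)
#define MAX_NUMNODES 4

/* Hypothetical demotion chain: 0 -> 2, 1 -> 2, 2 -> 3, 3 -> none. */
const int demotion_next[MAX_NUMNODES] = { 2, 2, 3, NUMA_NO_NODE };

int next_demotion_node(int nid)
{
	return demotion_next[nid];
}

/*
 * Walk the demotion hierarchy until a node set in the allowed mask is
 * found, mirroring the while loop in the patch. Returns NUMA_NO_NODE
 * when no tier along the chain is allowed (e.g. after hot-unplug).
 */
int pick_target(int target_nid, unsigned int allowed_mask)
{
	while (target_nid != NUMA_NO_NODE &&
	       !((allowed_mask >> target_nid) & 1))
		target_nid = next_demotion_node(target_nid);
	return target_nid;
}
```

Unlike a random pick from the whole allowed mask, this walk always yields the nearest allowed tier, falling further away only as closer tiers are excluded.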
--
2.52.0.358.g0dd7633a29-goog




* Re: [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection
  2026-01-07  7:28 [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection Bing Jiao
  2026-01-07  7:28 ` [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio() Bing Jiao
  2026-01-07  7:28 ` [PATCH v1 2/2] mm/vmscan: select the closest perferred node in demote_folio_list() Bing Jiao
@ 2026-01-07 17:39 ` Andrew Morton
  2026-01-07 17:46 ` Joshua Hahn
  3 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2026-01-07 17:39 UTC (permalink / raw)
  To: Bing Jiao
  Cc: linux-mm, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

On Wed,  7 Jan 2026 07:28:12 +0000 Bing Jiao <bingjiao@google.com> wrote:

> In tiered memory systems, demotion aims to move cold folios to the
> far-tier nodes. To maintain system performance, the demotion target
> should ideally be the node with the shortest NUMA distance from the
> source node.
> 
> However, the current implementation has two suboptimal behaviors:
> 
> 1. Unbalanced Fallback: When the primary preferred demotion node is full,
>    the allocator falls back to other nodes in a way that often skews
>    toward zones closer to the primary preferred node rather than
>    distributing the load evenly across fallback nodes.
> 
> 2. Suboptimal Target Selection: demote_folio_list() randomly selects
>    a preferred node from the allowed mask, potentially selecting
>    a very distant node.
> 
> This series optimizes the selection logic while ensuring balanced
> allocation across fallback nodes.
> 
> Patch 1/2 introduces a randomized fallback mechanism in
> alloc_demote_folio() to prevent allocation hotspots when the preferred
> node is under memory pressure.
> 
> Patch 2/2 updates demote_folio_list() to traverse the demotion targets
> hierarchically, ensuring the perferred target is always the closest
> available node.

Yes, those things sound rather suboptimal.  Do you have any data which
will help us understand the performance benefit of these changes?

(2/2 has a typo in the subject and in the comment.  perferred->preferred)



* Re: [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection
  2026-01-07  7:28 [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection Bing Jiao
                   ` (2 preceding siblings ...)
  2026-01-07 17:39 ` [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection Andrew Morton
@ 2026-01-07 17:46 ` Joshua Hahn
  2026-01-08  6:03   ` Bing Jiao
  3 siblings, 1 reply; 7+ messages in thread
From: Joshua Hahn @ 2026-01-07 17:46 UTC (permalink / raw)
  To: Bing Jiao
  Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Wed,  7 Jan 2026 07:28:12 +0000 Bing Jiao <bingjiao@google.com> wrote:

Hello Bing, thank you for your patch!

I have a few questions about the motivation behind this patch.

> In tiered memory systems, demotion aims to move cold folios to the
> far-tier nodes. To maintain system performance, the demotion target
> should ideally be the node with the shortest NUMA distance from the
> source node.
>
> However, the current implementation has two suboptimal behaviors:
> 
> 1. Unbalanced Fallback: When the primary preferred demotion node is full,
>    the allocator falls back to other nodes in a way that often skews
>    toward zones closer to the primary preferred node rather than
>    distributing the load evenly across fallback nodes.

I definitely think this is a problem that can exist for some workloads /
machines, and I agree that there should be some mechanism to manage this
in the demotion code as well. In the context of tiered memory, it might be
the case that some far-nodes have more restricted memory bandwidth, so better
distribution of memory across those nodes definitely sounds like something
that should at least be considered (even if it might not be the sole factor).

With that said, I think adding some numbers here to motivate this change could
definitely make the argument more convincing. In particular, I don't think
I am fully convinced that doing a full random selection from the demotion
targets makes the most sense. Maybe there are a few more things to consider,
like the node's capacity, how full it is, bandwidth, etc. For instance,
weighted interleave auto-tuning makes a weighted selection based on each
node's bandwidth.
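As a rough sketch of what such a weighted selection could look like (illustrative only; weighted_pick() and its weights are not existing kernel interfaces):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a bandwidth-weighted node pick, in the spirit of weighted
 * interleave auto-tuning: node i is chosen with probability
 * weight[i] / sum(weight). `r` is an externally supplied random value.
 */
int weighted_pick(const int *weight, int nr_nodes, uint32_t r)
{
	int total = 0;
	int i;

	for (i = 0; i < nr_nodes; i++)
		total += weight[i];
	r %= (uint32_t)total;
	for (i = 0; i < nr_nodes; i++) {
		if (r < (uint32_t)weight[i])
			return i;
		r -= (uint32_t)weight[i];
	}
	return nr_nodes - 1; /* unreachable when all weights are positive */
}
```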

At least right now, it seems like we're consistent with how the demotion node
gets selected when the preferred node is full.

Do your changes lead to a "better" distribution of memory? And does this
distribution lead to increased performance? I think some numbers here could
help my understanding and convince others as well : -)

> 2. Suboptimal Target Selection: demote_folio_list() randomly selects
>    a preferred node from the allowed mask, potentially selecting
>    a very distant node.

Following up, I think it could be helpful to have a unified story about how
demotion nodes should be selected. In particular, I'm not entirely confident
if it makes sense to have a "try on the preferred demotion target, and then
select randomly among all other nodes" story, since these have conflicting
stories of "prefer close nodes" vs "distribute demotions". To put it explicitly,
what makes the first demotion target special? Should we just select randomly
for *all* demotion targets, not just if the preferred node is full?

Sorry if it seems like I am asking too many questions; I just wanted to get
a better understanding of the motivation behind the patch.

Thank you, and I hope you have a great day!
Joshua



* Re: [PATCH v1 0/2] mm/vmscan: optimize preferred target demotion node selection
  2026-01-07 17:46 ` Joshua Hahn
@ 2026-01-08  6:03   ` Bing Jiao
  0 siblings, 0 replies; 7+ messages in thread
From: Bing Jiao @ 2026-01-08  6:03 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Wed, Jan 07, 2026 at 09:46:52AM -0800, Joshua Hahn wrote:

Hi Joshua,

Thanks for your insights and valuable suggestions!

> On Wed,  7 Jan 2026 07:28:12 +0000 Bing Jiao <bingjiao@google.com> wrote:
>
> Hello Bing, thank you for your patch!
>
> I have a few questions about the motivation behind this patch.
>
> > In tiered memory systems, demotion aims to move cold folios to the
> > far-tier nodes. To maintain system performance, the demotion target
> > should ideally be the node with the shortest NUMA distance from the
> > source node.
> >
> > However, the current implementation has two suboptimal behaviors:
> >
> > 1. Unbalanced Fallback: When the primary preferred demotion node is full,
> >    the allocator falls back to other nodes in a way that often skews
> >    toward zones closer to the primary preferred node rather than
> >    distributing the load evenly across fallback nodes.
>
> I definitely think this is a problem that can exist for some workloads /
> machines, and I agree that there should be some mechanism to manage this
> in the demotion code as well. In the context of tiered memory, it might be
> the case that some far-nodes have more restricted memory bandwidth, so better
> distribution of memory across those nodes definitely sounds like something
> that should at least be considered (even if it might not be the sole factor).
>
> With that said, I think adding some numbers here to motivate this change could
> definitely make the argument more convincing. In particular, I don't think
> I am fully convinced that doing a full random selection from the demotion
> targets makes the most sense. Maybe there are a few more things to consider,
> like the node's capacity, how full it is, bandwidth, etc. For instance,
> weighted interleave auto-tuning makes a weighted selection based on each
> node's bandwidth.

I agree that a detailed evaluation is necessary. When I initially wrote
this patch, I hadn't fully considered a weighted selection. Using
bandwidth as a weight for demotion target selection makes sense,
and node capacity could serve as another useful heuristic.
However, designing and evaluating a proposal that integrates all
these metrics properly will require more time and study.

> At least right now, it seems like we're consistent with how the demotion node
> gets selected when the preferred node is full.
>
> Do your changes lead to a "better" distribution of memory? And does this
> distribution lead to increased performance? I think some numbers here could
> help my understanding and convince others as well : -)

I haven't performed a formal A/B performance test yet. My primary
observation was a significant imbalance in memory pressure: some far
nodes were completely exhausted while others in the same tier remained
half-empty. With this patch, that skewed distribution is mitigated when
nodes reside in the same tier. I agree that providing numbers would
strengthen the proposal. I will work on gathering those numbers later.

> > 2. Suboptimal Target Selection: demote_folio_list() randomly selects
> >    a preferred node from the allowed mask, potentially selecting
> >    a very distant node.
>
> Following up, I think it could be helpful to have a unified story about how
> demotion nodes should be selected. In particular, I'm not entirely confident
> if it makes sense to have a "try on the preferred demotion target, and then
> select randomly among all other nodes" story, since these have conflicting
> stories of "prefer close nodes" vs "distribute demotions". To put it explicitly,
> what makes the first demotion target special? Should we just select randomly
> for *all* demotion targets, not just if the preferred node is full?

The "first" target is not particularly special. It is randomly
selected from the tier closest to the source node by
next_demotion_node().

Regarding the strategy, my thinking is this: if far nodes are mostly empty,
preferring the nearest one is optimal. However, as those nodes reach
capacity, consistently targeting the nearest one can create contention
hotspots.

Choosing between "proximity" and "distribution" likely depends on the
current state of the targets. I agree that we need a more comprehensive
study to establish a unified selection policy.

> Sorry if it seems like I am asking too many questions; I just wanted to get
> a better understanding of the motivation behind the patch.
>
> Thank you, and I hope you have a great day!
> Joshua

Thanks for the feedback and suggestions. I realized that my previous
patch ("mm/vmscan: fix demotion targets checks in reclaim/demotion")
is what introduced the "non-preferred node" issue in demote_folio_list().
I am not sure whether it belongs in the previous patch series,
but I have just posted a refreshed version of Patch 2/2 in that series.

Thanks,
Bing




* Re: [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio()
  2026-01-07  7:28 ` [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio() Bing Jiao
@ 2026-01-08 12:44   ` Donet Tom
  0 siblings, 0 replies; 7+ messages in thread
From: Donet Tom @ 2026-01-08 12:44 UTC (permalink / raw)
  To: Bing Jiao, linux-mm
  Cc: Andrew Morton, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel


On 1/7/26 12:58 PM, Bing Jiao wrote:
> When the preferred demotion node does not have enough free space,
> alloc_demote_folio() attempts to allocate from fallback nodes.
> Currently, it lacks a mechanism to distribute these fallback allocations,
> which can lead to unbalanced memory pressure across fallback nodes.
>
> Balance the allocation by randomly selecting a new preferred node from
> the fallback nodes if the initial allocation from the old preferred
> node fails.
>
> Signed-off-by: Bing Jiao <bingjiao@google.com>
> ---
>   mm/vmscan.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 81828fa625ed..db2413c4bd26 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1009,6 +1009,11 @@ static struct folio *alloc_demote_folio(struct folio *src,
>   	if (dst)
>   		return dst;
>
> +	/* Randomly select a node from fallback nodes for balanced allocation */
> +	if (allowed_mask) {
> +		mtc->nid = node_random(allowed_mask);


This random selection can cause allocations to fall back to distant 
memory even when the nearer demotion target has sufficient free memory, 
correct? Could this also lead to increased promotion latency?


> +		node_clear(mtc->nid, *allowed_mask);
> +	}
>   	mtc->gfp_mask &= ~__GFP_THISNODE;
>   	mtc->nmask = allowed_mask;
>
> --
> 2.52.0.358.g0dd7633a29-goog
>
>


