* [PATCH] mm/vmscan: respect mems_effective in demote_folio_list()
@ 2025-12-20 6:10 Bing Jiao
2025-12-20 19:20 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-20 6:10 UTC (permalink / raw)
To: linux-mm
Cc: gourry, Bing Jiao, Waiman Long, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Tejun Heo,
Michal Koutný,
Qi Zheng, Axel Rasmussen, Yuanchu Xie, Wei Xu, cgroups,
linux-kernel
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However, it does not apply this check in
demote_folio_list(), which leads to situations where pages are demoted
to nodes that are explicitly excluded from the task's cpuset.mems.
To address the issue that demotion targets do not respect
cpuset.mem_effective in demote_folio_list(), implement a new function
get_demotion_targets(), which returns a preferred demotion target
and all allowed (fallback) nodes against mems_effective,
and update demote_folio_list() and can_demote() accordingly to
use get_demotion_targets().
Furthermore, update some supporting functions:
- Add a parameter for next_demotion_node() to return a copy of
node_demotion[]->preferred, allowing get_demotion_targets()
to select the next-best node for demotion.
- Change the parameters for cpuset_node_allowed() and
mem_cgroup_node_allowed() from nid to nodemask * to allow
for direct logic-and operations with mems_effective.
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
include/linux/cpuset.h | 5 +--
include/linux/memcontrol.h | 6 +--
include/linux/memory-tiers.h | 6 +--
kernel/cgroup/cpuset.c | 14 +++----
mm/memcontrol.c | 5 ++-
mm/memory-tiers.c | 8 +++-
mm/vmscan.c | 77 +++++++++++++++++++++++++++++-------
7 files changed, 87 insertions(+), 34 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..27a0b6e9fb9d 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_node_allowed(struct cgroup *cgroup, nodemask_t *nodes);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,8 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline void cpuset_node_allowed(struct cgroup *cgroup, nodemask_t *nodes)
{
- return true;
}
#endif /* !CONFIG_CPUSETS */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..a87f008b6600 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1740,7 +1740,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
rcu_read_unlock();
}
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_node_allowed(struct mem_cgroup *memcg, nodemask_t *nodes);
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1811,9 +1811,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
return 0;
}
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline void mem_cgroup_node_allowed(struct mem_cgroup *memcg,
+ nodemask_t *nodes)
{
- return true;
}
static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 7a805796fcfd..2706ebfa94b5 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -53,11 +53,11 @@ struct memory_dev_type *mt_find_alloc_memory_type(int adist,
struct list_head *memory_types);
void mt_put_memory_types(struct list_head *memory_types);
#ifdef CONFIG_MIGRATION
-int next_demotion_node(int node);
+int next_demotion_node(int node, nodemask_t *mask);
void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
bool node_is_toptier(int node);
#else
-static inline int next_demotion_node(int node)
+static inline int next_demotion_node(int node, nodemask_t *mask)
{
return NUMA_NO_NODE;
}
@@ -101,7 +101,7 @@ static inline void clear_node_memory_type(int node, struct memory_dev_type *memt
}
-static inline int next_demotion_node(int node)
+static inline int next_demotion_node(int node, nodemask_t *mask)
{
return NUMA_NO_NODE;
}
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6e6eb09b8db6..2d78cfde5911 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4416,11 +4416,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
return allowed;
}
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+void cpuset_node_allowed(struct cgroup *cgroup, nodemask_t *nodes)
{
struct cgroup_subsys_state *css;
struct cpuset *cs;
- bool allowed;
/*
* In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
@@ -4428,16 +4427,16 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* so return true to avoid taking a global lock on the empty check.
*/
if (!cpuset_v2())
- return true;
+ return;
css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
if (!css)
- return true;
+ return;
/*
* Normally, accessing effective_mems would require the cpuset_mutex
- * or callback_lock - but node_isset is atomic and the reference
- * taken via cgroup_get_e_css is sufficient to protect css.
+ * or callback_lock - but the reference taken via cgroup_get_e_css
+ * is sufficient to protect css.
*
* Since this interface is intended for use by migration paths, we
* relax locking here to avoid taking global locks - while accepting
@@ -4447,9 +4446,8 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* cannot make strong isolation guarantees, so this is acceptable.
*/
cs = container_of(css, struct cpuset, css);
- allowed = node_isset(nid, cs->effective_mems);
+ nodes_and(*nodes, *nodes, cs->effective_mems);
css_put(css);
- return allowed;
}
/**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..a62c75b136ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5597,9 +5597,10 @@ subsys_initcall(mem_cgroup_swap_init);
#endif /* CONFIG_SWAP */
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+void mem_cgroup_node_allowed(struct mem_cgroup *memcg, nodemask_t *nodes)
{
- return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+ if (memcg)
+ cpuset_node_allowed(memcg->css.cgroup, nodes);
}
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 20aab9c19c5e..ed0ee9c3ae70 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -320,13 +320,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
/**
* next_demotion_node() - Get the next node in the demotion path
* @node: The starting node to lookup the next node
+ * @mask: The preferred nodemask copy to be returned
*
* Return: node id for next memory node in the demotion path hierarchy
* from @node; NUMA_NO_NODE if @node is terminal. This does not keep
* @node online or guarantee that it *continues* to be the next demotion
* target.
*/
-int next_demotion_node(int node)
+int next_demotion_node(int node, nodemask_t *mask)
{
struct demotion_nodes *nd;
int target;
@@ -355,7 +356,12 @@ int next_demotion_node(int node)
* last target node. Or introducing per-cpu data to avoid
* caching issue, which seems more complicated. So selecting
* target node randomly seems better until now.
+ *
+ * Copy preferred nodes as the fallback if the returned one
+ * does not satisify some constraints like cpuset.
*/
+ if (mask)
+ nodes_copy(*mask, nd->preferred);
target = node_random(&nd->preferred);
rcu_read_unlock();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8bdb1629b6eb..2ddbf5584af8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -341,22 +341,71 @@ static void flush_reclaim_state(struct scan_control *sc)
}
}
+/*
+ * Returns a preferred demotion node and all allowed demotion @targets.
+ * Returns NUMA_NO_NODE and @targets is meaningless if no allowed nodes.
+ */
+static int get_demotion_targets(nodemask_t *targets, struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
+{
+ nodemask_t allowed_mask;
+ nodemask_t preferred_mask;
+ int preferred_node;
+
+ if (!pgdat)
+ return NUMA_NO_NODE;
+
+ preferred_node = next_demotion_node(pgdat->node_id, &preferred_mask);
+ if (preferred_node == NUMA_NO_NODE)
+ return NUMA_NO_NODE;
+
+ node_get_allowed_targets(pgdat, &allowed_mask);
+ mem_cgroup_node_allowed(memcg, &allowed_mask);
+ if (nodes_empty(allowed_mask))
+ return NUMA_NO_NODE;
+
+ if (targets)
+ nodes_copy(*targets, allowed_mask);
+
+ do {
+ if (node_isset(preferred_node, allowed_mask))
+ return preferred_node;
+
+ nodes_and(preferred_mask, preferred_mask, allowed_mask);
+ if (!nodes_empty(preferred_mask))
+ return node_random(&preferred_mask);
+
+ /*
+ * Hop to the next tier of preferred nodes. Even if
+ * preferred_node is not set in allowed_mask, still can use it
+ * to query the nest-best demotion nodes.
+ */
+ preferred_node = next_demotion_node(preferred_node,
+ &preferred_mask);
+ } while (preferred_node != NUMA_NO_NODE);
+
+ /*
+ * Should not reach here, as a non-empty allowed_mask ensures
+ * there must have a target node for demotion.
+ * Otherwise, it suggests something wrong in node_demotion[]->preferred,
+ * where the same-tier nodes have different preferred targets.
+ * E.g., if node 0 identifies both nodes 2 and 3 as preferred targets,
+ * but nodes 2 and 3 themselves have different preferred nodes.
+ */
+ WARN_ON_ONCE(1);
+ return node_random(&allowed_mask);
+}
+
static bool can_demote(int nid, struct scan_control *sc,
struct mem_cgroup *memcg)
{
- int demotion_nid;
-
if (!numa_demotion_enabled)
return false;
if (sc && sc->no_demotion)
return false;
- demotion_nid = next_demotion_node(nid);
- if (demotion_nid == NUMA_NO_NODE)
- return false;
-
- /* If demotion node isn't in the cgroup's mems_allowed, fall back */
- return mem_cgroup_node_allowed(memcg, demotion_nid);
+ return get_demotion_targets(NULL, NODE_DATA(nid), memcg) !=
+ NUMA_NO_NODE;
}
static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
@@ -1019,9 +1068,10 @@ static struct folio *alloc_demote_folio(struct folio *src,
* Folios which are not demoted are left on @demote_folios.
*/
static unsigned int demote_folio_list(struct list_head *demote_folios,
- struct pglist_data *pgdat)
+ struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
{
- int target_nid = next_demotion_node(pgdat->node_id);
+ int target_nid;
unsigned int nr_succeeded;
nodemask_t allowed_mask;
@@ -1033,7 +1083,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
*/
.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
__GFP_NOMEMALLOC | GFP_NOWAIT,
- .nid = target_nid,
.nmask = &allowed_mask,
.reason = MR_DEMOTION,
};
@@ -1041,10 +1090,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
if (list_empty(demote_folios))
return 0;
+ target_nid = get_demotion_targets(&allowed_mask, pgdat, memcg);
if (target_nid == NUMA_NO_NODE)
return 0;
-
- node_get_allowed_targets(pgdat, &allowed_mask);
+ mtc.nid = target_nid;
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1566,7 +1615,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
/* 'folio_list' is always empty here */
/* Migrate folios selected for demotion */
- nr_demoted = demote_folio_list(&demote_folios, pgdat);
+ nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
nr_reclaimed += nr_demoted;
stat->nr_demoted += nr_demoted;
/* Folios that could not be demoted are still in @demote_folios */
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH] mm/vmscan: respect mems_effective in demote_folio_list()
2025-12-20 6:10 [PATCH] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
@ 2025-12-20 19:20 ` Andrew Morton
2025-12-22 6:16 ` Bing Jiao
2025-12-21 12:07 ` Gregory Price
2025-12-21 23:36 ` [PATCH v2 0/2] fix demotion targets checks in reclaim/demotion Bing Jiao
2 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2025-12-20 19:20 UTC (permalink / raw)
To: Bing Jiao
Cc: linux-mm, gourry, Waiman Long, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Tejun Heo, Michal Koutný ,
Qi Zheng, Axel Rasmussen, Yuanchu Xie, Wei Xu, cgroups,
linux-kernel
On Sat, 20 Dec 2025 06:10:21 +0000 Bing Jiao <bingjiao@google.com> wrote:
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote().
So we'll want
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
in the changelog.
> However, it does not apply this check in
> demote_folio_list(), which leads to situations where pages are demoted
> to nodes that are explicitly excluded from the task's cpuset.mems.
>
> To address the issue that demotion targets do not respect
> cpuset.mem_effective in demote_folio_list(), implement a new function
> get_demotion_targets(), which returns a preferred demotion target
> and all allowed (fallback) nodes against mems_effective,
> and update demote_folio_list() and can_demote() accordingly to
> use get_demotion_targets().
7d709f49babc fist appeared in 6.16, so we must decide whether to
backport this fix into -stable kernels, via a Cc:
<stable@vger.kernel.org>.
To make this decision it's best to have a clear understanding of the
userspace visible impact of the bug. Putting pages into improper nodes
is undesirable, but how much does it affect real-world workloads?
Please include in the changelog some words about this to help others
understand why we should backport the fix.
> Furthermore, update some supporting functions:
> - Add a parameter for next_demotion_node() to return a copy of
> node_demotion[]->preferred, allowing get_demotion_targets()
> to select the next-best node for demotion.
> - Change the parameters for cpuset_node_allowed() and
> mem_cgroup_node_allowed() from nid to nodemask * to allow
> for direct logic-and operations with mems_effective.
If we do decide to backport the fix into earlier kernels then it's best
to keep the patch as small and as simple as possible. So non-bugfix
changes such as these are best made via a second followup patch which
can be merged via the normal -rc staging process.
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH] mm/vmscan: respect mems_effective in demote_folio_list()
2025-12-20 19:20 ` Andrew Morton
@ 2025-12-22 6:16 ` Bing Jiao
0 siblings, 0 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-22 6:16 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, gourry, Waiman Long, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Tejun Heo, Michal Koutný,
Qi Zheng, Axel Rasmussen, Yuanchu Xie, Wei Xu, cgroups,
linux-kernel
On Sat, Dec 20, 2025 at 11:20:44AM -0800, Andrew Morton wrote:
> On Sat, 20 Dec 2025 06:10:21 +0000 Bing Jiao <bingjiao@google.com> wrote:
>
> > Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> > introduces the cpuset.mems_effective check and applies it to
> > can_demote().
>
> So we'll want
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
>
> in the changelog.
>
> > However, it does not apply this check in
> > demote_folio_list(), which leads to situations where pages are demoted
> > to nodes that are explicitly excluded from the task's cpuset.mems.
> >
> > To address the issue that demotion targets do not respect
> > cpuset.mem_effective in demote_folio_list(), implement a new function
> > get_demotion_targets(), which returns a preferred demotion target
> > and all allowed (fallback) nodes against mems_effective,
> > and update demote_folio_list() and can_demote() accordingly to
> > use get_demotion_targets().
>
> 7d709f49babc fist appeared in 6.16, so we must decide whether to
> backport this fix into -stable kernels, via a Cc:
> <stable@vger.kernel.org>.
>
> To make this decision it's best to have a clear understanding of the
> userspace visible impact of the bug. Putting pages into improper nodes
> is undesirable, but how much does it affect real-world workloads?
> Please include in the changelog some words about this to help others
> understand why we should backport the fix.
>
> > Furthermore, update some supporting functions:
> > - Add a parameter for next_demotion_node() to return a copy of
> > node_demotion[]->preferred, allowing get_demotion_targets()
> > to select the next-best node for demotion.
> > - Change the parameters for cpuset_node_allowed() and
> > mem_cgroup_node_allowed() from nid to nodemask * to allow
> > for direct logic-and operations with mems_effective.
>
> If we do decide to backport the fix into earlier kernels then it's best
> to keep the patch as small and as simple as possible. So non-bugfix
> changes such as these are best made via a second followup patch which
> can be merged via the normal -rc staging process.
>
Hi Andrew, thank you for the review and suggestions.
I hvae sent a patch v2 for the backport. However, I forgot to add
the CC:stable line. I will fix it in v3.
Best,
Bing
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/vmscan: respect mems_effective in demote_folio_list()
2025-12-20 6:10 [PATCH] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
2025-12-20 19:20 ` Andrew Morton
@ 2025-12-21 12:07 ` Gregory Price
2025-12-22 6:28 ` Bing Jiao
2025-12-21 23:36 ` [PATCH v2 0/2] fix demotion targets checks in reclaim/demotion Bing Jiao
2 siblings, 1 reply; 22+ messages in thread
From: Gregory Price @ 2025-12-21 12:07 UTC (permalink / raw)
To: Bing Jiao
Cc: linux-mm, Waiman Long, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Tejun Heo,
Michal Koutný,
Qi Zheng, Axel Rasmussen, Yuanchu Xie, Wei Xu, cgroups,
linux-kernel
I think this patch can be done without as many changes as proposed here.
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
> +void mem_cgroup_node_allowed(struct mem_cgroup *memcg, nodemask_t *nodes);
> -static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +static inline void mem_cgroup_node_allowed(struct mem_cgroup *memcg,
> -int next_demotion_node(int node);
> +int next_demotion_node(int node, nodemask_t *mask);
> -bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +void cpuset_node_allowed(struct cgroup *cgroup, nodemask_t *nodes)
These are some fairly major contract changes, and the names don't make
much sense as a result.
Would be better to just make something like
/* Filter the given nmask based on cpuset.mems.allowed */
mem_cgroup_filter_mems_allowed(memg, nmask);
(or some other, better name)
separate of the existing interfaces, and operate on one scratch-mask if
possible.
> +static int get_demotion_targets(nodemask_t *targets, struct pglist_data *pgdat,
> + struct mem_cgroup *memcg)
> +{
> + nodemask_t allowed_mask;
> + nodemask_t preferred_mask;
> + int preferred_node;
> +
> + if (!pgdat)
> + return NUMA_NO_NODE;
> +
> + preferred_node = next_demotion_node(pgdat->node_id, &preferred_mask);
> + if (preferred_node == NUMA_NO_NODE)
> + return NUMA_NO_NODE;
> +
> + node_get_allowed_targets(pgdat, &allowed_mask);
> + mem_cgroup_node_allowed(memcg, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> + return NUMA_NO_NODE;
> +
> + if (targets)
> + nodes_copy(*targets, allowed_mask);
> +
> + do {
> + if (node_isset(preferred_node, allowed_mask))
> + return preferred_node;
> +
> + nodes_and(preferred_mask, preferred_mask, allowed_mask);
> + if (!nodes_empty(preferred_mask))
> + return node_random(&preferred_mask);
> +
> + /*
> + * Hop to the next tier of preferred nodes. Even if
> + * preferred_node is not set in allowed_mask, still can use it
> + * to query the nest-best demotion nodes.
> + */
> + preferred_node = next_demotion_node(preferred_node,
> + &preferred_mask);
> + } while (preferred_node != NUMA_NO_NODE);
> +
What you're implementing here is effectively a new feature - allowing
demotion to jump nodes rather than just target the next demotion node.
This is nice, but it should be a separate patch proposal (I think Andrew
said something as much already) - not as part of a fix.
> + /*
> + * Should not reach here, as a non-empty allowed_mask ensures
> + * there must have a target node for demotion.
Does it? What if preferred_node is online when calling
next_demotion_node(), but then is offline when
node_get_allowed_targets() is called?
> + * Otherwise, it suggests something wrong in node_demotion[]->preferred,
> + * where the same-tier nodes have different preferred targets.
> + * E.g., if node 0 identifies both nodes 2 and 3 as preferred targets,
> + * but nodes 2 and 3 themselves have different preferred nodes.
> + */
> + WARN_ON_ONCE(1);
> + return node_random(&allowed_mask);
Just returning a random allowed node seems like an objectively poor
result and we should just not demote if we reach this condition. It
likesly means hotplug was happening and node states changed.
> @@ -1041,10 +1090,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> if (list_empty(demote_folios))
> return 0;
>
> + target_nid = get_demotion_targets(&allowed_mask, pgdat, memcg);
> if (target_nid == NUMA_NO_NODE)
> return 0;
> -
> - node_get_allowed_targets(pgdat, &allowed_mask);
in the immediate fixup patch, it seems more expedient to just add the
function i described above
/* Filter the given nmask based on cpuset.mems.allowed */
mem_cgroup_filter_mems_allowed(memg, nmask);
and then add that immediate after the node_get_allowed_targets() call.
Then come back around afterwards to add the tier/node-skip functionality
from above in a separate feature patch.
~Gregory
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..1971a8d9475b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1046,6 +1046,11 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
node_get_allowed_targets(pgdat, &allowed_mask);
+ /* Filter based on mems_allowed, fail if the result is empty */
+ mem_cgroup_filter_nodemask(memcg, &allowed_mask);
+ if (nodes_empty(allowed_mask))
+ return 0;
+
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
(unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH] mm/vmscan: respect mems_effective in demote_folio_list()
2025-12-21 12:07 ` Gregory Price
@ 2025-12-22 6:28 ` Bing Jiao
0 siblings, 0 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-22 6:28 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, Waiman Long, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Tejun Heo,
Michal Koutný,
Qi Zheng, Axel Rasmussen, Yuanchu Xie, Wei Xu, cgroups,
linux-kernel
On Sun, Dec 21, 2025 at 07:07:18AM -0500, Gregory Price wrote:
>
> I think this patch can be done without as many changes as proposed here.
>
> > -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
> > +void mem_cgroup_node_allowed(struct mem_cgroup *memcg, nodemask_t *nodes);
>
> > -static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> > +static inline void mem_cgroup_node_allowed(struct mem_cgroup *memcg,
>
> > -int next_demotion_node(int node);
> > +int next_demotion_node(int node, nodemask_t *mask);
>
> > -bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> > +void cpuset_node_allowed(struct cgroup *cgroup, nodemask_t *nodes)
>
> These are some fairly major contract changes, and the names don't make
> much sense as a result.
>
> Would be better to just make something like
>
> /* Filter the given nmask based on cpuset.mems.allowed */
> mem_cgroup_filter_mems_allowed(memg, nmask);
>
> (or some other, better name)
>
> separate of the existing interfaces, and operate on one scratch-mask if
> possible.
>
Hi Gregory, thank you for the review and suggestions.
I have divided these changes into 2 patches based on your suggestions.
Since mem_cgroup_node_allowed() and cpuset_node_allowed() are dangling,
they are removed in v2 2/2.
> > +static int get_demotion_targets(nodemask_t *targets, struct pglist_data *pgdat,
> > + struct mem_cgroup *memcg)
> > +{
> > + nodemask_t allowed_mask;
> > + nodemask_t preferred_mask;
> > + int preferred_node;
> > +
> > + if (!pgdat)
> > + return NUMA_NO_NODE;
> > +
> > + preferred_node = next_demotion_node(pgdat->node_id, &preferred_mask);
> > + if (preferred_node == NUMA_NO_NODE)
> > + return NUMA_NO_NODE;
> > +
> > + node_get_allowed_targets(pgdat, &allowed_mask);
> > + mem_cgroup_node_allowed(memcg, &allowed_mask);
> > + if (nodes_empty(allowed_mask))
> > + return NUMA_NO_NODE;
> > +
> > + if (targets)
> > + nodes_copy(*targets, allowed_mask);
> > +
> > + do {
> > + if (node_isset(preferred_node, allowed_mask))
> > + return preferred_node;
> > +
> > + nodes_and(preferred_mask, preferred_mask, allowed_mask);
> > + if (!nodes_empty(preferred_mask))
> > + return node_random(&preferred_mask);
> > +
> > + /*
> > + * Hop to the next tier of preferred nodes. Even if
> > + * preferred_node is not set in allowed_mask, still can use it
> > + * to query the nest-best demotion nodes.
> > + */
> > + preferred_node = next_demotion_node(preferred_node,
> > + &preferred_mask);
> > + } while (preferred_node != NUMA_NO_NODE);
> > +
>
> What you're implementing here is effectively a new feature - allowing
> demotion to jump nodes rather than just target the next demotion node.
>
> This is nice, but it should be a separate patch proposal (I think Andrew
> said something as much already) - not as part of a fix.
>
Thanks for the suggestion.
I sent a v2 patch series for fixes and backport. This function (jump
node) will be sent in another thread for distinguishing between fixes
and features.
> > + /*
> > + * Should not reach here, as a non-empty allowed_mask ensures
> > + * there must have a target node for demotion.
>
> Does it? What if preferred_node is online when calling
> next_demotion_node(), but then is offline when
> node_get_allowed_targets() is called?
>
>
> > + * Otherwise, it suggests something wrong in node_demotion[]->preferred,
> > + * where the same-tier nodes have different preferred targets.
> > + * E.g., if node 0 identifies both nodes 2 and 3 as preferred targets,
> > + * but nodes 2 and 3 themselves have different preferred nodes.
> > + */
> > + WARN_ON_ONCE(1);
> > + return node_random(&allowed_mask);
>
> Just returning a random allowed node seems like an objectively poor
> result and we should just not demote if we reach this condition. It
> likesly means hotplug was happening and node states changed.
>
> > @@ -1041,10 +1090,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> > if (list_empty(demote_folios))
> > return 0;
> >
> > + target_nid = get_demotion_targets(&allowed_mask, pgdat, memcg);
> > if (target_nid == NUMA_NO_NODE)
> > return 0;
> > -
> > - node_get_allowed_targets(pgdat, &allowed_mask);
>
> in the immediate fixup patch, it seems more expedient to just add the
> function i described above
>
> /* Filter the given nmask based on cpuset.mems.allowed */
> mem_cgroup_filter_mems_allowed(memg, nmask);
>
> and then add that immediate after the node_get_allowed_targets() call.
>
> Then come back around afterwards to add the tier/node-skip functionality
> from above in a separate feature patch.
>
> ~Gregory
>
Thanks for the hit. I had never considered hot-swapping before.
> ---
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 670fe9fae5ba..1971a8d9475b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1046,6 +1046,11 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>
> node_get_allowed_targets(pgdat, &allowed_mask);
>
> + /* Filter based on mems_allowed, fail if the result is empty */
> + mem_cgroup_filter_nodemask(memcg, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> + return 0;
> +
> /* Demotion ignores all cpuset and mempolicy settings */
> migrate_pages(demote_folios, alloc_demote_folio, NULL,
> (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>
>
Thanks for the code. My v2 1/2 is based on your suggestion with
some changes.
Best,
Bing
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 0/2] fix demotion targets checks in reclaim/demotion
2025-12-20 6:10 [PATCH] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
2025-12-20 19:20 ` Andrew Morton
2025-12-21 12:07 ` Gregory Price
@ 2025-12-21 23:36 ` Bing Jiao
2025-12-21 23:36 ` [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
` (2 more replies)
2 siblings, 3 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-21 23:36 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, stable, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu, weixugc,
cgroups, Bing Jiao
This patch series addresses two issues in demote_folio_list()
and can_demote() in reclaim/demotion.
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However:
1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.
2. It checks only the nodes in the immediate next demotion hierarchy
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.
To address these bugs, implement a new function
mem_cgroup_filter_mems_allowed() to filter out nodes that are not
set in mems_effective, and update demote_folio_list() and can_demote()
accordingly.
Reproduct Bug 1:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.
Test script to reproduct:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
Reproduct Bug 2:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
Bing Jiao (2):
mm/vmscan: respect mems_effective in demote_folio_list()
mm/vmscan: check all allowed targets in can_demote()
include/linux/cpuset.h | 6 +++---
include/linux/memcontrol.h | 6 +++---
kernel/cgroup/cpuset.c | 12 +++++-------
mm/memcontrol.c | 5 +++--
mm/vmscan.c | 27 ++++++++++++++++++---------
5 files changed, 32 insertions(+), 24 deletions(-)
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 22+ messages in thread* [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list()
2025-12-21 23:36 ` [PATCH v2 0/2] fix demotion targets checks in reclaim/demotion Bing Jiao
@ 2025-12-21 23:36 ` Bing Jiao
2025-12-22 2:38 ` Chen Ridong
` (2 more replies)
2025-12-21 23:36 ` [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote() Bing Jiao
2025-12-23 21:19 ` [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion Bing Jiao
2 siblings, 3 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-21 23:36 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, stable, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu, weixugc,
cgroups, Bing Jiao
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However, it does not apply this check in
demote_folio_list().
This omission leads to situations where pages are demoted to nodes
that are explicitly excluded from the task's cpuset.mems.
The impact is two-fold:
1. Resource Isolation: This bug breaks resource isolation provided
by cpuset.mems. It allows pages to be demoted to nodes that are
dedicated to other tasks or are intended for hot-unplugging.
2. Performance Issue: In multi-tier systems, users use cpuset.mems
to bind tasks to different performed-far tiers (e.g., avoiding
the slowest tiers for latency-sensitive data). This bug can
cause unexpected latency spikes if pages are demoted to the
farthest nodes.
To address the bug, implement a new function
mem_cgroup_filter_mems_allowed() to filter out nodes that are not
set in mems_effective, and update demote_folio_list() to utilize
this filtering logic. This ensures that demotions target respect
task's memory placement constraints.
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
include/linux/cpuset.h | 6 ++++++
include/linux/memcontrol.h | 7 +++++++
kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
mm/memcontrol.c | 6 ++++++
mm/vmscan.c | 13 ++++++++++---
5 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..0e94548e2d24 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -175,6 +175,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
}
extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
@@ -305,6 +306,11 @@ static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
{
return true;
}
+
+static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
+ nodemask_t *mask)
+{
+}
#endif /* !CONFIG_CPUSETS */
#endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..7cfd71c57caa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1742,6 +1742,8 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
+
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
static inline bool memcg_is_dying(struct mem_cgroup *memcg)
@@ -1816,6 +1818,11 @@ static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
return true;
}
+static inline bool mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
+ nodemask_t *mask)
+{
+}
+
static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6e6eb09b8db6..2925bd6bca91 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4452,6 +4452,24 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
return allowed;
}
+void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
+{
+ struct cgroup_subsys_state *css;
+ struct cpuset *cs;
+
+ if (!cpuset_v2())
+ return;
+
+ css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
+ if (!css)
+ return;
+
+ /* Follows the same assumption in cpuset_node_allowed() */
+ cs = container_of(css, struct cpuset, css);
+ nodes_and(*mask, *mask, cs->effective_mems);
+ css_put(css);
+}
+
/**
* cpuset_spread_node() - On which node to begin search for a page
* @rotor: round robin rotor
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..f414653867de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5602,6 +5602,12 @@ bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
}
+void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
+{
+ if (memcg)
+ cpuset_node_filter_allowed(memcg->css.cgroup, mask);
+}
+
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
{
if (mem_cgroup_disabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 453d654727c1..4d23c491e914 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1018,7 +1018,8 @@ static struct folio *alloc_demote_folio(struct folio *src,
* Folios which are not demoted are left on @demote_folios.
*/
static unsigned int demote_folio_list(struct list_head *demote_folios,
- struct pglist_data *pgdat)
+ struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
{
int target_nid = next_demotion_node(pgdat->node_id);
unsigned int nr_succeeded;
@@ -1032,7 +1033,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
*/
.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
__GFP_NOMEMALLOC | GFP_NOWAIT,
- .nid = target_nid,
.nmask = &allowed_mask,
.reason = MR_DEMOTION,
};
@@ -1044,6 +1044,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
return 0;
node_get_allowed_targets(pgdat, &allowed_mask);
+ /* Filter the given nmask based on cpuset.mems.allowed */
+ mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
+ if (nodes_empty(allowed_mask))
+ return 0;
+ if (!node_isset(target_nid, allowed_mask))
+ target_nid = node_random(&allowed_mask);
+ mtc.nid = target_nid;
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1565,7 +1572,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
/* 'folio_list' is always empty here */
/* Migrate folios selected for demotion */
- nr_demoted = demote_folio_list(&demote_folios, pgdat);
+ nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
nr_reclaimed += nr_demoted;
stat->nr_demoted += nr_demoted;
/* Folios that could not be demoted are still in @demote_folios */
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list()
2025-12-21 23:36 ` [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
@ 2025-12-22 2:38 ` Chen Ridong
2025-12-22 21:56 ` kernel test robot
2025-12-22 22:18 ` kernel test robot
2 siblings, 0 replies; 22+ messages in thread
From: Chen Ridong @ 2025-12-22 2:38 UTC (permalink / raw)
To: Bing Jiao, linux-mm
Cc: linux-kernel, stable, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu, weixugc,
cgroups
On 2025/12/22 7:36, Bing Jiao wrote:
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote(). However, it does not apply this check in
> demote_folio_list().
>
> This omission leads to situations where pages are demoted to nodes
> that are explicitly excluded from the task's cpuset.mems.
> The impact is two-fold:
>
> 1. Resource Isolation: This bug breaks resource isolation provided
> by cpuset.mems. It allows pages to be demoted to nodes that are
> dedicated to other tasks or are intended for hot-unplugging.
>
> 2. Performance Issue: In multi-tier systems, users use cpuset.mems
> to bind tasks to different performed-far tiers (e.g., avoiding
> the slowest tiers for latency-sensitive data). This bug can
> cause unexpected latency spikes if pages are demoted to the
> farthest nodes.
>
> To address the bug, implement a new function
> mem_cgroup_filter_mems_allowed() to filter out nodes that are not
> set in mems_effective, and update demote_folio_list() to utilize
> this filtering logic. This ensures that demotions target respect
> task's memory placement constraints.
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> Signed-off-by: Bing Jiao <bingjiao@google.com>
> ---
> include/linux/cpuset.h | 6 ++++++
> include/linux/memcontrol.h | 7 +++++++
> kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
> mm/memcontrol.c | 6 ++++++
> mm/vmscan.c | 13 ++++++++++---
> 5 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a98d3330385c..0e94548e2d24 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -175,6 +175,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> }
>
> extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -305,6 +306,11 @@ static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> {
> return true;
> }
> +
> +static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
> + nodemask_t *mask)
> +{
> +}
> #endif /* !CONFIG_CPUSETS */
>
> #endif /* _LINUX_CPUSET_H */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index fd400082313a..7cfd71c57caa 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1742,6 +1742,8 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
>
> bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
>
> +void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
> +
> void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
>
> static inline bool memcg_is_dying(struct mem_cgroup *memcg)
> @@ -1816,6 +1818,11 @@ static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> return true;
> }
>
> +static inline bool mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
> + nodemask_t *mask)
> +{
> +}
> +
> static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
> {
> }
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 6e6eb09b8db6..2925bd6bca91 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4452,6 +4452,24 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> return allowed;
> }
>
> +void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
> +{
> + struct cgroup_subsys_state *css;
> + struct cpuset *cs;
> +
> + if (!cpuset_v2())
> + return;
> +
> + css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
> + if (!css)
> + return;
> +
> + /* Follows the same assumption in cpuset_node_allowed() */
> + cs = container_of(css, struct cpuset, css);
> + nodes_and(*mask, *mask, cs->effective_mems);
> + css_put(css);
> +}
> +
The functions cpuset_node_filter_allowed and cpuset_node_allowed are similar. We should create a
helper function to obtain cs->effective_mems, which can then be used by both
cpuset_node_filter_allowed and cpuset_node_allowed.
For example:
nodemask_t *mask cpuset_get_mem_allowed(struct cgroup *cgroup)
{
}
bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
{
e_mask = cpuset_node_allowed(cgroup);
return allowed = node_isset(nid, mask);
}
void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t mask)
{
e_mask = cpuset_node_allowed(cgroup);
nodes_and(mask, *mask, e_mask);
}
Previously, I did not think we should distinguish between cgroup v1 and v2 here. This should be a
common function; at least based on its name, it should not be solely for v2.
> /**
> * cpuset_spread_node() - On which node to begin search for a page
> * @rotor: round robin rotor
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75fc22a33b28..f414653867de 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5602,6 +5602,12 @@ bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
> }
>
> +void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
> +{
> + if (memcg)
> + cpuset_node_filter_allowed(memcg->css.cgroup, mask);
> +}
> +
> void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
> {
> if (mem_cgroup_disabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 453d654727c1..4d23c491e914 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1018,7 +1018,8 @@ static struct folio *alloc_demote_folio(struct folio *src,
> * Folios which are not demoted are left on @demote_folios.
> */
> static unsigned int demote_folio_list(struct list_head *demote_folios,
> - struct pglist_data *pgdat)
> + struct pglist_data *pgdat,
> + struct mem_cgroup *memcg)
> {
> int target_nid = next_demotion_node(pgdat->node_id);
> unsigned int nr_succeeded;
> @@ -1032,7 +1033,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> */
> .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> __GFP_NOMEMALLOC | GFP_NOWAIT,
> - .nid = target_nid,
> .nmask = &allowed_mask,
> .reason = MR_DEMOTION,
> };
> @@ -1044,6 +1044,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> return 0;
>
> node_get_allowed_targets(pgdat, &allowed_mask);
> + /* Filter the given nmask based on cpuset.mems.allowed */
> + mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> + return 0;
> + if (!node_isset(target_nid, allowed_mask))
> + target_nid = node_random(&allowed_mask);
> + mtc.nid = target_nid;
>
> /* Demotion ignores all cpuset and mempolicy settings */
> migrate_pages(demote_folios, alloc_demote_folio, NULL,
> @@ -1565,7 +1572,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> /* 'folio_list' is always empty here */
>
> /* Migrate folios selected for demotion */
> - nr_demoted = demote_folio_list(&demote_folios, pgdat);
> + nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
> nr_reclaimed += nr_demoted;
> stat->nr_demoted += nr_demoted;
> /* Folios that could not be demoted are still in @demote_folios */
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list()
2025-12-21 23:36 ` [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
2025-12-22 2:38 ` Chen Ridong
@ 2025-12-22 21:56 ` kernel test robot
2025-12-22 22:18 ` kernel test robot
2 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-12-22 21:56 UTC (permalink / raw)
To: Bing Jiao, linux-mm
Cc: llvm, oe-kbuild-all, linux-kernel, stable, akpm, gourry, longman,
hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, tj,
mkoutny, david, zhengqi.arch, lorenzo.stoakes, axelrasmussen,
yuanchu, weixugc, cgroups, Bing Jiao
Hi Bing,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on tj-cgroup/for-next linus/master v6.19-rc2 next-20251219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Bing-Jiao/mm-vmscan-respect-mems_effective-in-demote_folio_list/20251222-074143
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251221233635.3761887-2-bingjiao%40google.com
patch subject: [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list()
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20251223/202512230553.LuiUveL3-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 185f5fd5ce4c65116ca8cf6df467a682ef090499)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251223/202512230553.LuiUveL3-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512230553.LuiUveL3-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from kernel/sched/rq-offsets.c:5:
In file included from kernel/sched/sched.h:61:
In file included from include/linux/syscalls_api.h:1:
In file included from include/linux/syscalls.h:96:
In file included from include/trace/syscall.h:7:
In file included from include/linux/trace_events.h:10:
In file included from include/linux/perf_event.h:53:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:32:
>> include/linux/memcontrol.h:1824:1: warning: non-void function does not return a value [-Wreturn-type]
1824 | }
| ^
1 warning generated.
--
In file included from arch/arm/kernel/signal.c:12:
In file included from include/linux/resume_user_mode.h:8:
>> include/linux/memcontrol.h:1824:1: warning: non-void function does not return a value [-Wreturn-type]
1824 | }
| ^
arch/arm/kernel/signal.c:143:15: warning: variable 'aux' set but not used [-Wunused-but-set-variable]
143 | char __user *aux;
| ^
2 warnings generated.
--
In file included from kernel/sched/rq-offsets.c:5:
In file included from kernel/sched/sched.h:61:
In file included from include/linux/syscalls_api.h:1:
In file included from include/linux/syscalls.h:96:
In file included from include/trace/syscall.h:7:
In file included from include/linux/trace_events.h:10:
In file included from include/linux/perf_event.h:53:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:32:
>> include/linux/memcontrol.h:1824:1: warning: non-void function does not return a value [-Wreturn-type]
1824 | }
| ^
1 warning generated.
vim +1824 include/linux/memcontrol.h
1820
1821 static inline bool mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
1822 nodemask_t *mask)
1823 {
> 1824 }
1825
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list()
2025-12-21 23:36 ` [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
2025-12-22 2:38 ` Chen Ridong
2025-12-22 21:56 ` kernel test robot
@ 2025-12-22 22:18 ` kernel test robot
2 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-12-22 22:18 UTC (permalink / raw)
To: Bing Jiao, linux-mm
Cc: oe-kbuild-all, linux-kernel, stable, akpm, gourry, longman,
hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, tj,
mkoutny, david, zhengqi.arch, lorenzo.stoakes, axelrasmussen,
yuanchu, weixugc, cgroups, Bing Jiao
Hi Bing,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on tj-cgroup/for-next linus/master v6.19-rc2 next-20251219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Bing-Jiao/mm-vmscan-respect-mems_effective-in-demote_folio_list/20251222-074143
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251221233635.3761887-2-bingjiao%40google.com
patch subject: [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list()
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20251223/202512230655.QvO6dmjt-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251223/202512230655.QvO6dmjt-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512230655.QvO6dmjt-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from include/linux/bpf.h:32,
from include/linux/security.h:35,
from include/linux/perf_event.h:53,
from include/linux/trace_events.h:10,
from include/trace/syscall.h:7,
from include/linux/syscalls.h:96,
from include/linux/syscalls_api.h:1,
from kernel/sched/sched.h:61,
from kernel/sched/rq-offsets.c:5:
include/linux/memcontrol.h: In function 'mem_cgroup_filter_mems_allowed':
>> include/linux/memcontrol.h:1824:1: warning: no return statement in function returning non-void [-Wreturn-type]
1824 | }
| ^
--
In file included from include/linux/bpf.h:32,
from include/linux/security.h:35,
from include/linux/perf_event.h:53,
from include/linux/trace_events.h:10,
from include/trace/syscall.h:7,
from include/linux/syscalls.h:96,
from include/linux/syscalls_api.h:1,
from kernel/sched/sched.h:61,
from kernel/sched/rq-offsets.c:5:
include/linux/memcontrol.h: In function 'mem_cgroup_filter_mems_allowed':
>> include/linux/memcontrol.h:1824:1: warning: no return statement in function returning non-void [-Wreturn-type]
1824 | }
| ^
vim +1824 include/linux/memcontrol.h
1820
1821 static inline bool mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
1822 nodemask_t *mask)
1823 {
> 1824 }
1825
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote()
2025-12-21 23:36 ` [PATCH v2 0/2] fix demotion targets checks in reclaim/demotion Bing Jiao
2025-12-21 23:36 ` [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
@ 2025-12-21 23:36 ` Bing Jiao
2025-12-22 2:51 ` Chen Ridong
2025-12-23 21:19 ` [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion Bing Jiao
2 siblings, 1 reply; 22+ messages in thread
From: Bing Jiao @ 2025-12-21 23:36 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, stable, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu, weixugc,
cgroups, Bing Jiao
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However, it checks only the nodes in the immediate next
demotion hierarchy and does not check all allowed demotion targets.
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.
To address the bug, use mem_cgroup_filter_mems_allowed() to filter
out allowed targets obtained from node_get_allowed_targets(). Also
remove some unused functions.
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
include/linux/cpuset.h | 6 ------
include/linux/memcontrol.h | 7 -------
kernel/cgroup/cpuset.c | 28 ++++------------------------
mm/memcontrol.c | 5 -----
mm/vmscan.c | 14 ++++++++------
5 files changed, 12 insertions(+), 48 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 0e94548e2d24..ed7c27276e71 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,6 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
#else /* !CONFIG_CPUSETS */
@@ -302,11 +301,6 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
-{
- return true;
-}
-
static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
nodemask_t *mask)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7cfd71c57caa..41aab33499b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1740,8 +1740,6 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
rcu_read_unlock();
}
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
-
void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1813,11 +1811,6 @@ static inline ino_t page_cgroup_ino(struct page *page)
return 0;
}
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
-{
- return true;
-}
-
static inline bool mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
nodemask_t *mask)
{
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2925bd6bca91..339779571508 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4416,11 +4416,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
return allowed;
}
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
struct cgroup_subsys_state *css;
struct cpuset *cs;
- bool allowed;
/*
* In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
@@ -4428,15 +4427,15 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* so return true to avoid taking a global lock on the empty check.
*/
if (!cpuset_v2())
- return true;
+ return;
css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
if (!css)
- return true;
+ return;
/*
* Normally, accessing effective_mems would require the cpuset_mutex
- * or callback_lock - but node_isset is atomic and the reference
+ * or callback_lock - but it is acceptable and the reference
* taken via cgroup_get_e_css is sufficient to protect css.
*
* Since this interface is intended for use by migration paths, we
@@ -4447,25 +4446,6 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* cannot make strong isolation guarantees, so this is acceptable.
*/
cs = container_of(css, struct cpuset, css);
- allowed = node_isset(nid, cs->effective_mems);
- css_put(css);
- return allowed;
-}
-
-void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
-{
- struct cgroup_subsys_state *css;
- struct cpuset *cs;
-
- if (!cpuset_v2())
- return;
-
- css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
- if (!css)
- return;
-
- /* Follows the same assumption in cpuset_node_allowed() */
- cs = container_of(css, struct cpuset, css);
nodes_and(*mask, *mask, cs->effective_mems);
css_put(css);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f414653867de..ebf5df3c8ca1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5597,11 +5597,6 @@ subsys_initcall(mem_cgroup_swap_init);
#endif /* CONFIG_SWAP */
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
-{
- return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
-}
-
void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
{
if (memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4d23c491e914..fa4d51af7f44 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
static bool can_demote(int nid, struct scan_control *sc,
struct mem_cgroup *memcg)
{
- int demotion_nid;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ nodemask_t allowed_mask;
- if (!numa_demotion_enabled)
+ if (!pgdat || !numa_demotion_enabled)
return false;
if (sc && sc->no_demotion)
return false;
- demotion_nid = next_demotion_node(nid);
- if (demotion_nid == NUMA_NO_NODE)
+ node_get_allowed_targets(pgdat, &allowed_mask);
+ if (nodes_empty(allowed_mask))
return false;
- /* If demotion node isn't in the cgroup's mems_allowed, fall back */
- return mem_cgroup_node_allowed(memcg, demotion_nid);
+ /* Filter the given nmask based on cpuset.mems.allowed */
+ mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
+ return !nodes_empty(allowed_mask);
}
static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote()
2025-12-21 23:36 ` [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote() Bing Jiao
@ 2025-12-22 2:51 ` Chen Ridong
2025-12-22 6:09 ` Bing Jiao
0 siblings, 1 reply; 22+ messages in thread
From: Chen Ridong @ 2025-12-22 2:51 UTC (permalink / raw)
To: Bing Jiao, linux-mm
Cc: linux-kernel, stable, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu, weixugc,
cgroups
On 2025/12/22 7:36, Bing Jiao wrote:
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote(). However, it checks only the nodes in the immediate next
> demotion hierarchy and does not check all allowed demotion targets.
> This can cause pages to never be demoted if the nodes in the next
> demotion hierarchy are not set in mems_effective.
>
> To address the bug, use mem_cgroup_filter_mems_allowed() to filter
> out allowed targets obtained from node_get_allowed_targets(). Also
> remove some unused functions.
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> Signed-off-by: Bing Jiao <bingjiao@google.com>
> ---
> include/linux/cpuset.h | 6 ------
> include/linux/memcontrol.h | 7 -------
> kernel/cgroup/cpuset.c | 28 ++++------------------------
> mm/memcontrol.c | 5 -----
> mm/vmscan.c | 14 ++++++++------
> 5 files changed, 12 insertions(+), 48 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 0e94548e2d24..ed7c27276e71 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,7 +174,6 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> -extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
> #else /* !CONFIG_CPUSETS */
>
> @@ -302,11 +301,6 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> -static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> -{
> - return true;
> -}
> -
> static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
> nodemask_t *mask)
> {
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7cfd71c57caa..41aab33499b5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1740,8 +1740,6 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
> rcu_read_unlock();
> }
>
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
> -
> void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
>
> void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
> @@ -1813,11 +1811,6 @@ static inline ino_t page_cgroup_ino(struct page *page)
> return 0;
> }
>
> -static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> -{
> - return true;
> -}
> -
> static inline bool mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
> nodemask_t *mask)
> {
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 2925bd6bca91..339779571508 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4416,11 +4416,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
> return allowed;
> }
>
> -bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
> {
> struct cgroup_subsys_state *css;
> struct cpuset *cs;
> - bool allowed;
>
> /*
> * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
> @@ -4428,15 +4427,15 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> * so return true to avoid taking a global lock on the empty check.
> */
> if (!cpuset_v2())
> - return true;
> + return;
>
> css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
> if (!css)
> - return true;
> + return;
>
> /*
> * Normally, accessing effective_mems would require the cpuset_mutex
> - * or callback_lock - but node_isset is atomic and the reference
> + * or callback_lock - but it is acceptable and the reference
> * taken via cgroup_get_e_css is sufficient to protect css.
> *
> * Since this interface is intended for use by migration paths, we
> @@ -4447,25 +4446,6 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> * cannot make strong isolation guarantees, so this is acceptable.
> */
> cs = container_of(css, struct cpuset, css);
> - allowed = node_isset(nid, cs->effective_mems);
> - css_put(css);
> - return allowed;
> -}
> -
> -void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
> -{
> - struct cgroup_subsys_state *css;
> - struct cpuset *cs;
> -
> - if (!cpuset_v2())
> - return;
> -
> - css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
> - if (!css)
> - return;
> -
> - /* Follows the same assumption in cpuset_node_allowed() */
> - cs = container_of(css, struct cpuset, css);
> nodes_and(*mask, *mask, cs->effective_mems);
> css_put(css);
> }
Oh, I see you merged these two functions here.
However, I think cpuset_get_mem_allowed would be more versatile in general use.
You can then check whether the returned nodemask intersects with your target mask. In the future,
there may be scenarios where users simply want to retrieve the effective masks directly.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f414653867de..ebf5df3c8ca1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5597,11 +5597,6 @@ subsys_initcall(mem_cgroup_swap_init);
>
> #endif /* CONFIG_SWAP */
>
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> -{
> - return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
> -}
> -
> void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
> {
> if (memcg)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4d23c491e914..fa4d51af7f44 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
> static bool can_demote(int nid, struct scan_control *sc,
> struct mem_cgroup *memcg)
> {
> - int demotion_nid;
> + struct pglist_data *pgdat = NODE_DATA(nid);
> + nodemask_t allowed_mask;
>
> - if (!numa_demotion_enabled)
> + if (!pgdat || !numa_demotion_enabled)
> return false;
> if (sc && sc->no_demotion)
> return false;
>
> - demotion_nid = next_demotion_node(nid);
> - if (demotion_nid == NUMA_NO_NODE)
> + node_get_allowed_targets(pgdat, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> return false;
>
> - /* If demotion node isn't in the cgroup's mems_allowed, fall back */
> - return mem_cgroup_node_allowed(memcg, demotion_nid);
> + /* Filter the given nmask based on cpuset.mems.allowed */
> + mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
> + return !nodes_empty(allowed_mask);
> }
>
> static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote()
2025-12-22 2:51 ` Chen Ridong
@ 2025-12-22 6:09 ` Bing Jiao
2025-12-22 8:28 ` Chen Ridong
0 siblings, 1 reply; 22+ messages in thread
From: Bing Jiao @ 2025-12-22 6:09 UTC (permalink / raw)
To: Chen Ridong
Cc: linux-mm, linux-kernel, stable, akpm, gourry, longman, hannes,
mhocko, roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny,
david, zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu,
weixugc, cgroups
On Mon, Dec 22, 2025 at 10:51:49AM +0800, Chen Ridong wrote:
>
>
> On 2025/12/22 7:36, Bing Jiao wrote:
> > -void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
> > -{
> > - struct cgroup_subsys_state *css;
> > - struct cpuset *cs;
> > -
> > - if (!cpuset_v2())
> > - return;
> > -
> > - css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
> > - if (!css)
> > - return;
> > -
> > - /* Follows the same assumption in cpuset_node_allowed() */
> > - cs = container_of(css, struct cpuset, css);
> > nodes_and(*mask, *mask, cs->effective_mems);
> > css_put(css);
> > }
>
> Oh, I see you merged these two functions here.
>
> However, I think cpuset_get_mem_allowed would be more versatile in general use.
>
> You can then check whether the returned nodemask intersects with your target mask. In the future,
> there may be scenarios where users simply want to retrieve the effective masks directly.
>
Hi Ridong, thank you for the suggestions.
I agree that returning a nodemask would provide greater versatility.
I think cpuset_get_mem_allowed_relax() would be a better name,
since we do not need the locking and online mem guarantees
compared to an similar function cpuset_mems_allowed().
Best,
Bing
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote()
2025-12-22 6:09 ` Bing Jiao
@ 2025-12-22 8:28 ` Chen Ridong
0 siblings, 0 replies; 22+ messages in thread
From: Chen Ridong @ 2025-12-22 8:28 UTC (permalink / raw)
To: Bing Jiao
Cc: linux-mm, linux-kernel, stable, akpm, gourry, longman, hannes,
mhocko, roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny,
david, zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu,
weixugc, cgroups
On 2025/12/22 14:09, Bing Jiao wrote:
> On Mon, Dec 22, 2025 at 10:51:49AM +0800, Chen Ridong wrote:
>>
>>
>> On 2025/12/22 7:36, Bing Jiao wrote:
>>> -void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
>>> -{
>>> - struct cgroup_subsys_state *css;
>>> - struct cpuset *cs;
>>> -
>>> - if (!cpuset_v2())
>>> - return;
>>> -
>>> - css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
>>> - if (!css)
>>> - return;
>>> -
>>> - /* Follows the same assumption in cpuset_node_allowed() */
>>> - cs = container_of(css, struct cpuset, css);
>>> nodes_and(*mask, *mask, cs->effective_mems);
>>> css_put(css);
>>> }
>>
>> Oh, I see you merged these two functions here.
>>
>> However, I think cpuset_get_mem_allowed would be more versatile in general use.
>>
>> You can then check whether the returned nodemask intersects with your target mask. In the future,
>> there may be scenarios where users simply want to retrieve the effective masks directly.
>>
>
> Hi Ridong, thank you for the suggestions.
>
> I agree that returning a nodemask would provide greater versatility.
>
> I think cpuset_get_mem_allowed_relax() would be a better name,
> since we do not need the locking and online mem guarantees
> compared to an similar function cpuset_mems_allowed().
>
I think the key difference between cpuset_mems_allowed and the helper you intend to implement lies
not in locking or online memory guarantees, but in the input parameter: you want to retrieve
cpuset->effective_mems for a cgroup from another subsystem.
The cs->effective_mems should typically only include online nodes, except during brief transitional
periods such as hotplug operations. Similarly, node migration logic also requires online nodes.
Therefore, cpuset_get_mem_allowed seems acceptable to me.
Additionally, you may consider calling guarantee_online_mems inside your new helper to ensure
consistency.
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
2025-12-21 23:36 ` [PATCH v2 0/2] fix demotion targets checks in reclaim/demotion Bing Jiao
2025-12-21 23:36 ` [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
2025-12-21 23:36 ` [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote() Bing Jiao
@ 2025-12-23 21:19 ` Bing Jiao
2025-12-23 21:38 ` Bing Jiao
` (4 more replies)
2 siblings, 5 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-23 21:19 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, chenridong,
yuanchu, weixugc, cgroups
Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks in reclaim/demotion.
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However:
1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.
2. It checks only the nodes in the immediate next demotion hierarchy
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.
These bugs break resource isolation provided by cpuset.mems.
This is visible from userspace because pages can either fail to be
demoted entirely or are demoted to nodes that are not allowed
in multi-tier memory systems.
To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing directly
logic-and operation against demotion targets. Also update can_demote()
and demote_folio_list() accordingly.
Reproduct Bug 1:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
Reproduct Bug 2:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Cc: <stable@vger.kernel.org>
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
include/linux/cpuset.h | 6 +++---
include/linux/memcontrol.h | 6 +++---
kernel/cgroup/cpuset.c | 16 ++++++++--------
mm/memcontrol.c | 6 ++++--
mm/vmscan.c | 35 +++++++++++++++++++++++------------
5 files changed, 41 insertions(+), 28 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..eb358c3aa9c0 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
{
- return true;
+ return node_possible_map;
}
#endif /* !CONFIG_CPUSETS */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..f9463d853bba 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1740,7 +1740,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
rcu_read_unlock();
}
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg);
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1811,9 +1811,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
return 0;
}
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
{
- return true;
+ return node_possible_map;
}
static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6e6eb09b8db6..abb9afb64205 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4416,23 +4416,23 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
return allowed;
}
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
{
+ nodemask_t nodes = node_possible_map;
struct cgroup_subsys_state *css;
struct cpuset *cs;
- bool allowed;
/*
* In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
* and mems_allowed is likely to be empty even if we could get to it,
- * so return true to avoid taking a global lock on the empty check.
+ * so return directly to avoid taking a global lock on the empty check.
*/
- if (!cpuset_v2())
- return true;
+ if (!cgroup || !cpuset_v2())
+ return nodes;
css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
if (!css)
- return true;
+ return nodes;
/*
* Normally, accessing effective_mems would require the cpuset_mutex
@@ -4447,9 +4447,9 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* cannot make strong isolation guarantees, so this is acceptable.
*/
cs = container_of(css, struct cpuset, css);
- allowed = node_isset(nid, cs->effective_mems);
+ nodes_copy(nodes, cs->effective_mems);
css_put(css);
- return allowed;
+ return nodes;
}
/**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..c2f4ac50d5c2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5597,9 +5597,11 @@ subsys_initcall(mem_cgroup_swap_init);
#endif /* CONFIG_SWAP */
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
{
- return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+ if (memcg)
+ return cpuset_node_get_allowed(memcg->css.cgroup);
+ return node_possible_map;
}
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a4b308a2f9ad..711a04baf258 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -345,18 +345,24 @@ static bool can_demote(int nid, struct scan_control *sc,
struct mem_cgroup *memcg)
{
int demotion_nid;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ nodemask_t allowed_mask, allowed_mems;
- if (!numa_demotion_enabled)
+ if (!pgdat || !numa_demotion_enabled)
return false;
if (sc && sc->no_demotion)
return false;
- demotion_nid = next_demotion_node(nid);
- if (demotion_nid == NUMA_NO_NODE)
+ node_get_allowed_targets(pgdat, &allowed_mask);
+ if (nodes_empty(allowed_mask))
+ return false;
+
+ allowed_mems = mem_cgroup_node_get_allowed(memcg);
+ nodes_and(allowed_mask, allowed_mask, allowed_mems);
+ if (nodes_empty(allowed_mask))
return false;
- /* If demotion node isn't in the cgroup's mems_allowed, fall back */
- if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
+ for_each_node_mask(demotion_nid, allowed_mask) {
int z;
struct zone *zone;
struct pglist_data *pgdat = NODE_DATA(demotion_nid);
@@ -1029,11 +1035,12 @@ static struct folio *alloc_demote_folio(struct folio *src,
* Folios which are not demoted are left on @demote_folios.
*/
static unsigned int demote_folio_list(struct list_head *demote_folios,
- struct pglist_data *pgdat)
+ struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
{
int target_nid = next_demotion_node(pgdat->node_id);
unsigned int nr_succeeded;
- nodemask_t allowed_mask;
+ nodemask_t allowed_mask, allowed_mems;
struct migration_target_control mtc = {
/*
@@ -1043,7 +1050,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
*/
.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
__GFP_NOMEMALLOC | GFP_NOWAIT,
- .nid = target_nid,
.nmask = &allowed_mask,
.reason = MR_DEMOTION,
};
@@ -1051,10 +1057,15 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
if (list_empty(demote_folios))
return 0;
- if (target_nid == NUMA_NO_NODE)
- return 0;
-
node_get_allowed_targets(pgdat, &allowed_mask);
+ allowed_mems = mem_cgroup_node_get_allowed(memcg);
+ nodes_and(allowed_mask, allowed_mask, allowed_mems);
+ if (nodes_empty(allowed_mask))
+ return false;
+
+ if (target_nid == NUMA_NO_NODE || !node_isset(target_nid, allowed_mask))
+ target_nid = node_random(&allowed_mask);
+ mtc.nid = target_nid;
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1576,7 +1587,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
/* 'folio_list' is always empty here */
/* Migrate folios selected for demotion */
- nr_demoted = demote_folio_list(&demote_folios, pgdat);
+ nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
nr_reclaimed += nr_demoted;
stat->nr_demoted += nr_demoted;
/* Folios that could not be demoted are still in @demote_folios */
--
2.52.0.358.g0dd7633a29-goog
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
2025-12-23 21:19 ` [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion Bing Jiao
@ 2025-12-23 21:38 ` Bing Jiao
2025-12-24 1:19 ` Gregory Price
` (3 subsequent siblings)
4 siblings, 0 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-23 21:38 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, chenridong,
yuanchu, weixugc, cgroups
On Tue, Dec 23, 2025 at 09:19:59PM +0000, Bing Jiao wrote:
> Fix two bugs in demote_folio_list() and can_demote() due to incorrect
> demotion target checks in reclaim/demotion.
>
Considering these bugs are introduced from one commit, I think
it is better to fix it in one patch rather than split it.
v3: Rename cpuset_node_allowed() as cpuset_node_get_allowed() to
return effective_mems directly, providing better versatility
(thanks Ridong for the suggestion).
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
2025-12-23 21:19 ` [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion Bing Jiao
2025-12-23 21:38 ` Bing Jiao
@ 2025-12-24 1:19 ` Gregory Price
2025-12-26 18:48 ` Bing Jiao
2025-12-24 1:49 ` Chen Ridong
` (2 subsequent siblings)
4 siblings, 1 reply; 22+ messages in thread
From: Gregory Price @ 2025-12-24 1:19 UTC (permalink / raw)
To: Bing Jiao
Cc: linux-mm, linux-kernel, akpm, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, chenridong,
yuanchu, weixugc, cgroups
On Tue, Dec 23, 2025 at 09:19:59PM +0000, Bing Jiao wrote:
> -static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +static inline nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
> {
> - return true;
> + return node_possible_map;
...
> -static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +static inline nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
> {
> - return true;
> + return node_possible_map;
> }
...
> -bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
> {
> + nodemask_t nodes = node_possible_map;
...
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
> {
> - return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
> + if (memcg)
> + return cpuset_node_get_allowed(memcg->css.cgroup);
> + return node_possible_map;
> }
node_possible_map or node_states[N_MEMORY]?
The latter seems more appropriate to me since node_possible_map will
include offline nodes.
> - allowed = node_isset(nid, cs->effective_mems);
> + nodes_copy(nodes, cs->effective_mems);
> css_put(css);
> - return allowed;
> + return nodes;
> }
I saw in vmscan you check for returning an empty nodemask, may want to
at least add a comment to the function definition that says this needs
to be checked.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a4b308a2f9ad..711a04baf258 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -345,18 +345,24 @@ static bool can_demote(int nid, struct scan_control *sc,
> struct mem_cgroup *memcg)
> {
> int demotion_nid;
> + struct pglist_data *pgdat = NODE_DATA(nid);
> + nodemask_t allowed_mask, allowed_mems;
Only other concern i have is the number of nodemasks being added to the
stack. Should be <512 bytes, but I've run into situations where builds
have screamed at me for adding one nodemask to the stack, let alone 3+.
Have you run this through klp?
~Gregory
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
2025-12-24 1:19 ` Gregory Price
@ 2025-12-26 18:48 ` Bing Jiao
0 siblings, 0 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-26 18:48 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, akpm, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, chenridong,
yuanchu, weixugc, cgroups
On Tue, Dec 23, 2025 at 08:19:32PM -0500, Gregory Price wrote:
> On Tue, Dec 23, 2025 at 09:19:59PM +0000, Bing Jiao wrote:
> > -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> > +nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
> > {
> > - return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
> > + if (memcg)
> > + return cpuset_node_get_allowed(memcg->css.cgroup);
> > + return node_possible_map;
> > }
>
>
> node_possible_map or node_states[N_MEMORY]?
>
> The latter seems more appropriate to me since node_possible_map will
> include offline nodes.
Yes, I agree that node_states[N_MEMORY] would be better.
> > - allowed = node_isset(nid, cs->effective_mems);
> > + nodes_copy(nodes, cs->effective_mems);
> > css_put(css);
> > - return allowed;
> > + return nodes;
> > }
>
> I saw in vmscan you check for returning an empty nodemask, may want to
> at least add a comment to the function definition that says this needs
> to be checked.
Will add a comment to say that it may return an empty nodemask.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a4b308a2f9ad..711a04baf258 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -345,18 +345,24 @@ static bool can_demote(int nid, struct scan_control *sc,
> > struct mem_cgroup *memcg)
> > {
> > int demotion_nid;
> > + struct pglist_data *pgdat = NODE_DATA(nid);
> > + nodemask_t allowed_mask, allowed_mems;
>
> Only other concern i have is the number of nodemasks being added to the
> stack. Should be <512 bytes, but I've run into situations where builds
> have screamed at me for adding one nodemask to the stack, let alone 3+.
While having 3+ nodemasks presents a risk, utilizing two nodemasks
should be acceptable. Given that the maximum number of nodes is
1024 (2^10), two nodemasks would require 256 bytes, which should be okay.
Otherwise, we can keep to use mem_node_filter_allowed().
Only update it to return a nodemask when subsequent features need.
> Have you run this through klp?
I have not run it through klp. Will do it then.
Thanks,
Bing
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
2025-12-23 21:19 ` [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion Bing Jiao
2025-12-23 21:38 ` Bing Jiao
2025-12-24 1:19 ` Gregory Price
@ 2025-12-24 1:49 ` Chen Ridong
2025-12-26 18:58 ` Bing Jiao
2025-12-26 19:32 ` Waiman Long
2025-12-26 20:24 ` Waiman Long
4 siblings, 1 reply; 22+ messages in thread
From: Chen Ridong @ 2025-12-24 1:49 UTC (permalink / raw)
To: Bing Jiao, linux-mm
Cc: linux-kernel, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu, weixugc,
cgroups
On 2025/12/24 5:19, Bing Jiao wrote:
> Fix two bugs in demote_folio_list() and can_demote() due to incorrect
> demotion target checks in reclaim/demotion.
>
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote(). However:
>
> 1. It does not apply this check in demote_folio_list(), which leads
> to situations where pages are demoted to nodes that are
> explicitly excluded from the task's cpuset.mems.
>
> 2. It checks only the nodes in the immediate next demotion hierarchy
> and does not check all allowed demotion targets in can_demote().
> This can cause pages to never be demoted if the nodes in the next
> demotion hierarchy are not set in mems_effective.
>
> These bugs break resource isolation provided by cpuset.mems.
> This is visible from userspace because pages can either fail to be
> demoted entirely or are demoted to nodes that are not allowed
> in multi-tier memory systems.
>
> To address these bugs, update cpuset_node_allowed() and
> mem_cgroup_node_allowed() to return effective_mems, allowing directly
> logic-and operation against demotion targets. Also update can_demote()
> and demote_folio_list() accordingly.
>
> Reproduct Bug 1:
> Assume a system with 4 nodes, where nodes 0-1 are top-tier and
> nodes 2-3 are far-tier memory. All nodes have equal capacity.
>
> Test script:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> mkdir /sys/fs/cgroup/test
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> swapoff -a
> # Expectation: Should respect node 0-2 limit.
> # Observation: Node 3 shows significant allocation (MemFree drops)
> stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
>
> Reproduct Bug 2:
> Assume a system with 6 nodes, where nodes 0-2 are top-tier,
> node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
> All nodes have equal capacity.
>
> Test script:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> mkdir /sys/fs/cgroup/test
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> swapoff -a
> # Expectation: Pages are demoted to Nodes 4-5
> # Observation: No pages are demoted before oom.
> stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Bing Jiao <bingjiao@google.com>
> ---
> include/linux/cpuset.h | 6 +++---
> include/linux/memcontrol.h | 6 +++---
> kernel/cgroup/cpuset.c | 16 ++++++++--------
> mm/memcontrol.c | 6 ++++--
> mm/vmscan.c | 35 +++++++++++++++++++++++------------
> 5 files changed, 41 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a98d3330385c..eb358c3aa9c0 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> -extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +extern nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> -static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +static inline nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
> {
> - return true;
> + return node_possible_map;
> }
> #endif /* !CONFIG_CPUSETS */
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index fd400082313a..f9463d853bba 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1740,7 +1740,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
> rcu_read_unlock();
> }
>
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
> +nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg);
>
> void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
>
> @@ -1811,9 +1811,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
> return 0;
> }
>
> -static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +static inline nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
> {
> - return true;
> + return node_possible_map;
> }
>
> static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 6e6eb09b8db6..abb9afb64205 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4416,23 +4416,23 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
> return allowed;
> }
>
> -bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
> {
Could we define it as:
void cpuset_node_get_allowed(struct cgroup *cgroup, nodemask_t *node)
to align with the naming style of node_get_allowed_targets?
> + nodemask_t nodes = node_possible_map;
> struct cgroup_subsys_state *css;
> struct cpuset *cs;
> - bool allowed;
>
> /*
> * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
> * and mems_allowed is likely to be empty even if we could get to it,
> - * so return true to avoid taking a global lock on the empty check.
> + * so return directly to avoid taking a global lock on the empty check.
> */
> - if (!cpuset_v2())
> - return true;
> + if (!cgroup || !cpuset_v2())
> + return nodes;
>
> css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
> if (!css)
> - return true;
> + return nodes;
>
> /*
> * Normally, accessing effective_mems would require the cpuset_mutex
> @@ -4447,9 +4447,9 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> * cannot make strong isolation guarantees, so this is acceptable.
> */
> cs = container_of(css, struct cpuset, css);
> - allowed = node_isset(nid, cs->effective_mems);
> + nodes_copy(nodes, cs->effective_mems);
> css_put(css);
> - return allowed;
> + return nodes;
> }
>
> /**
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75fc22a33b28..c2f4ac50d5c2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5597,9 +5597,11 @@ subsys_initcall(mem_cgroup_swap_init);
>
> #endif /* CONFIG_SWAP */
>
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
void mem_cgroup_node_get_allowed(struct mem_cgroup *memcg, nodemask_t *node)
> {
> - return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
> + if (memcg)
> + return cpuset_node_get_allowed(memcg->css.cgroup);
> + return node_possible_map;
> }
>
> void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a4b308a2f9ad..711a04baf258 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -345,18 +345,24 @@ static bool can_demote(int nid, struct scan_control *sc,
> struct mem_cgroup *memcg)
> {
> int demotion_nid;
> + struct pglist_data *pgdat = NODE_DATA(nid);
> + nodemask_t allowed_mask, allowed_mems;
>
> - if (!numa_demotion_enabled)
> + if (!pgdat || !numa_demotion_enabled)
> return false;
> if (sc && sc->no_demotion)
> return false;
>
> - demotion_nid = next_demotion_node(nid);
> - if (demotion_nid == NUMA_NO_NODE)
> + node_get_allowed_targets(pgdat, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> + return false;
> +
> + allowed_mems = mem_cgroup_node_get_allowed(memcg);
> + nodes_and(allowed_mask, allowed_mask, allowed_mems);
> + if (nodes_empty(allowed_mask))
> return false;
>
node_get_allowed_targets(pgdat, &allowed_mask);
mem_cgroup_node_get_allowed(memcg, allowed_mems);
if (!nodes_intersects(allowed_mask, allowed_mems))
return false;
Would it look better?
> - /* If demotion node isn't in the cgroup's mems_allowed, fall back */
> - if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
> + for_each_node_mask(demotion_nid, allowed_mask) {
> int z;
> struct zone *zone;
> struct pglist_data *pgdat = NODE_DATA(demotion_nid);
> @@ -1029,11 +1035,12 @@ static struct folio *alloc_demote_folio(struct folio *src,
> * Folios which are not demoted are left on @demote_folios.
> */
> static unsigned int demote_folio_list(struct list_head *demote_folios,
> - struct pglist_data *pgdat)
> + struct pglist_data *pgdat,
> + struct mem_cgroup *memcg)
> {
> int target_nid = next_demotion_node(pgdat->node_id);
> unsigned int nr_succeeded;
> - nodemask_t allowed_mask;
> + nodemask_t allowed_mask, allowed_mems;
>
> struct migration_target_control mtc = {
> /*
> @@ -1043,7 +1050,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> */
> .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> __GFP_NOMEMALLOC | GFP_NOWAIT,
> - .nid = target_nid,
> .nmask = &allowed_mask,
> .reason = MR_DEMOTION,
> };
> @@ -1051,10 +1057,15 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> if (list_empty(demote_folios))
> return 0;
>
> - if (target_nid == NUMA_NO_NODE)
> - return 0;
> -
> node_get_allowed_targets(pgdat, &allowed_mask);
> + allowed_mems = mem_cgroup_node_get_allowed(memcg);
> + nodes_and(allowed_mask, allowed_mask, allowed_mems);
> + if (nodes_empty(allowed_mask))
> + return false;
> +
> + if (target_nid == NUMA_NO_NODE || !node_isset(target_nid, allowed_mask))
> + target_nid = node_random(&allowed_mask);
> + mtc.nid = target_nid;
>
> /* Demotion ignores all cpuset and mempolicy settings */
> migrate_pages(demote_folios, alloc_demote_folio, NULL,
> @@ -1576,7 +1587,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> /* 'folio_list' is always empty here */
>
> /* Migrate folios selected for demotion */
> - nr_demoted = demote_folio_list(&demote_folios, pgdat);
> + nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
> nr_reclaimed += nr_demoted;
> stat->nr_demoted += nr_demoted;
> /* Folios that could not be demoted are still in @demote_folios */
> --
> 2.52.0.358.g0dd7633a29-goog
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
2025-12-24 1:49 ` Chen Ridong
@ 2025-12-26 18:58 ` Bing Jiao
0 siblings, 0 replies; 22+ messages in thread
From: Bing Jiao @ 2025-12-26 18:58 UTC (permalink / raw)
To: Chen Ridong
Cc: linux-mm, linux-kernel, akpm, gourry, longman, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tj, mkoutny, david,
zhengqi.arch, lorenzo.stoakes, axelrasmussen, yuanchu, weixugc,
cgroups
On Wed, Dec 24, 2025 at 09:49:38AM +0800, Chen Ridong wrote:
> > +nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
> > {
>
> Could we define it as:
>
> void cpuset_node_get_allowed(struct cgroup *cgroup, nodemask_t *node)
>
> to align with the naming style of node_get_allowed_targets?
>
> > -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> > +nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
>
> void mem_cgroup_node_get_allowed(struct mem_cgroup *memcg, nodemask_t *node)
Thanks for the suggestion. Pass a pointer is better.
Also, Gregory mentioned that the stack size may be an issue if
systems have many nodes.
Do you think it is better to use mem_cgroup_node_filter_allowed()
to keep the stack size smaller?
> > - demotion_nid = next_demotion_node(nid);
> > - if (demotion_nid == NUMA_NO_NODE)
> > + node_get_allowed_targets(pgdat, &allowed_mask);
> > + if (nodes_empty(allowed_mask))
> > + return false;
This is a fast-fail path. When the queried node is the farthest node,
allowed_mask will be empty. Thus, I would like to keep this check
before mem_cgroup_node_get_allowed().
> > +
> > + allowed_mems = mem_cgroup_node_get_allowed(memcg);
> > + nodes_and(allowed_mask, allowed_mask, allowed_mems);
> > + if (nodes_empty(allowed_mask))
> > return false;
> >
> node_get_allowed_targets(pgdat, &allowed_mask);
> mem_cgroup_node_get_allowed(memcg, allowed_mems);
> if (!nodes_intersects(allowed_mask, allowed_mems))
> return false;
>
> Would it look better?
Yes, nodes_intersects() is better than logic-and.
Will update in v3.
Best,
Bing
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
2025-12-23 21:19 ` [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion Bing Jiao
` (2 preceding siblings ...)
2025-12-24 1:49 ` Chen Ridong
@ 2025-12-26 19:32 ` Waiman Long
2025-12-26 20:24 ` Waiman Long
4 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2025-12-26 19:32 UTC (permalink / raw)
To: Bing Jiao, linux-mm
Cc: linux-kernel, akpm, gourry, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tj, mkoutny, david, zhengqi.arch,
lorenzo.stoakes, axelrasmussen, chenridong, yuanchu, weixugc,
cgroups
On 12/23/25 4:19 PM, Bing Jiao wrote:
> Fix two bugs in demote_folio_list() and can_demote() due to incorrect
> demotion target checks in reclaim/demotion.
>
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote(). However:
>
> 1. It does not apply this check in demote_folio_list(), which leads
> to situations where pages are demoted to nodes that are
> explicitly excluded from the task's cpuset.mems.
>
> 2. It checks only the nodes in the immediate next demotion hierarchy
> and does not check all allowed demotion targets in can_demote().
> This can cause pages to never be demoted if the nodes in the next
> demotion hierarchy are not set in mems_effective.
>
> These bugs break resource isolation provided by cpuset.mems.
> This is visible from userspace because pages can either fail to be
> demoted entirely or are demoted to nodes that are not allowed
> in multi-tier memory systems.
>
> To address these bugs, update cpuset_node_allowed() and
> mem_cgroup_node_allowed() to return effective_mems, allowing directly
> logic-and operation against demotion targets. Also update can_demote()
> and demote_folio_list() accordingly.
>
> Reproduct Bug 1:
"Reproduct" is not an English word. Use either "Reproduce" or
"Reproduction".
Cheers,
Longman
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
2025-12-23 21:19 ` [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion Bing Jiao
` (3 preceding siblings ...)
2025-12-26 19:32 ` Waiman Long
@ 2025-12-26 20:24 ` Waiman Long
4 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2025-12-26 20:24 UTC (permalink / raw)
To: Bing Jiao, linux-mm
Cc: linux-kernel, akpm, gourry, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tj, mkoutny, david, zhengqi.arch,
lorenzo.stoakes, axelrasmussen, chenridong, yuanchu, weixugc,
cgroups
On 12/23/25 4:19 PM, Bing Jiao wrote:
> Fix two bugs in demote_folio_list() and can_demote() due to incorrect
> demotion target checks in reclaim/demotion.
>
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote(). However:
>
> 1. It does not apply this check in demote_folio_list(), which leads
> to situations where pages are demoted to nodes that are
> explicitly excluded from the task's cpuset.mems.
>
> 2. It checks only the nodes in the immediate next demotion hierarchy
> and does not check all allowed demotion targets in can_demote().
> This can cause pages to never be demoted if the nodes in the next
> demotion hierarchy are not set in mems_effective.
>
> These bugs break resource isolation provided by cpuset.mems.
> This is visible from userspace because pages can either fail to be
> demoted entirely or are demoted to nodes that are not allowed
> in multi-tier memory systems.
>
> To address these bugs, update cpuset_node_allowed() and
> mem_cgroup_node_allowed() to return effective_mems, allowing directly
> logic-and operation against demotion targets. Also update can_demote()
> and demote_folio_list() accordingly.
>
> Reproduct Bug 1:
> Assume a system with 4 nodes, where nodes 0-1 are top-tier and
> nodes 2-3 are far-tier memory. All nodes have equal capacity.
>
> Test script:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> mkdir /sys/fs/cgroup/test
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> swapoff -a
> # Expectation: Should respect node 0-2 limit.
> # Observation: Node 3 shows significant allocation (MemFree drops)
> stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
>
> Reproduct Bug 2:
> Assume a system with 6 nodes, where nodes 0-2 are top-tier,
> node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
> All nodes have equal capacity.
>
> Test script:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> mkdir /sys/fs/cgroup/test
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> swapoff -a
> # Expectation: Pages are demoted to Nodes 4-5
> # Observation: No pages are demoted before oom.
> stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Bing Jiao <bingjiao@google.com>
> ---
> include/linux/cpuset.h | 6 +++---
> include/linux/memcontrol.h | 6 +++---
> kernel/cgroup/cpuset.c | 16 ++++++++--------
> mm/memcontrol.c | 6 ++++--
> mm/vmscan.c | 35 +++++++++++++++++++++++------------
> 5 files changed, 41 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a98d3330385c..eb358c3aa9c0 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> -extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +extern nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> -static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +static inline nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
> {
> - return true;
> + return node_possible_map;
> }
The nodemask_t type can be large depending on the setting of
CONFIG_NODES_SHIFT. Passing a large data structure on stack may not be a
good idea. You can return a pointer to nodemask_t instead. In that case,
you will have a add a "const" qualifier to the return type to make sure
that the node mask won't get accidentally modified. Alternatively, you
can pass a nodemask_t pointer as an output parameter and copy out the
nodemask_t data.
The name "cpuset_node_get_allowed" doesn't fit the cpuset naming
convention. There is a "cpuset_mems_allowed(struct task_struct *)" to
return "mems_allowed" of a task. This new helper is for returning the
mems_allowed defined in the cpuset. Perhaps we could just use
"cpuset_nodes_allowed(struct cgroup *)".
Cheers,
Longman
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-12-26 20:24 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-20 6:10 [PATCH] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
2025-12-20 19:20 ` Andrew Morton
2025-12-22 6:16 ` Bing Jiao
2025-12-21 12:07 ` Gregory Price
2025-12-22 6:28 ` Bing Jiao
2025-12-21 23:36 ` [PATCH v2 0/2] fix demotion targets checks in reclaim/demotion Bing Jiao
2025-12-21 23:36 ` [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list() Bing Jiao
2025-12-22 2:38 ` Chen Ridong
2025-12-22 21:56 ` kernel test robot
2025-12-22 22:18 ` kernel test robot
2025-12-21 23:36 ` [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote() Bing Jiao
2025-12-22 2:51 ` Chen Ridong
2025-12-22 6:09 ` Bing Jiao
2025-12-22 8:28 ` Chen Ridong
2025-12-23 21:19 ` [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion Bing Jiao
2025-12-23 21:38 ` Bing Jiao
2025-12-24 1:19 ` Gregory Price
2025-12-26 18:48 ` Bing Jiao
2025-12-24 1:49 ` Chen Ridong
2025-12-26 18:58 ` Bing Jiao
2025-12-26 19:32 ` Waiman Long
2025-12-26 20:24 ` Waiman Long
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox