From: Chen Ridong <chenridong@huaweicloud.com>
To: Bing Jiao <bingjiao@google.com>, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
gourry@gourry.net, longman@redhat.com, hannes@cmpxchg.org,
mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev, tj@kernel.org,
mkoutny@suse.com, david@kernel.org, zhengqi.arch@bytedance.com,
lorenzo.stoakes@oracle.com, axelrasmussen@google.com,
yuanchu@google.com, weixugc@google.com, cgroups@vger.kernel.org
Subject: Re: [PATCH v4] mm/vmscan: fix demotion targets checks in reclaim/demotion
Date: Mon, 5 Jan 2026 10:48:09 +0800
Message-ID: <b55451ff-b861-4e4f-bc79-6de0c802d64e@huaweicloud.com>
In-Reply-To: <20260104085439.4076810-1-bingjiao@google.com>

On 2026/1/4 16:54, Bing Jiao wrote:
> Fix two bugs in demote_folio_list() and can_demote() due to incorrect
> demotion target checks in reclaim/demotion.
>
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote(). However:
>
> 1. It does not apply this check in demote_folio_list(), which leads
> to situations where pages are demoted to nodes that are
> explicitly excluded from the task's cpuset.mems.
>
> 2. It checks only the nodes in the immediate next demotion hierarchy
> and does not check all allowed demotion targets in can_demote().
> This can cause pages to never be demoted if the nodes in the next
> demotion hierarchy are not set in mems_effective.
>
> These bugs break resource isolation provided by cpuset.mems.
> This is visible from userspace because pages can either fail to be
> demoted entirely or are demoted to nodes that are not allowed
> in multi-tier memory systems.
>
> To address these bugs, update cpuset_node_allowed() and
> mem_cgroup_node_allowed() to return effective_mems, allowing a direct
> logical AND against the demotion targets. Also update can_demote()
> and demote_folio_list() accordingly.
>
> Bug 1 reproduction:
> Assume a system with 4 nodes, where nodes 0-1 are top-tier and
> nodes 2-3 are far-tier memory. All nodes have equal capacity.
>
> Test script:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> mkdir /sys/fs/cgroup/test
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> swapoff -a
> # Expectation: Should respect node 0-2 limit.
> # Observation: Node 3 shows significant allocation (MemFree drops)
> stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
>
> Bug 2 reproduction:
> Assume a system with 6 nodes, where nodes 0-2 are top-tier,
> node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
> All nodes have equal capacity.
>
> Test script:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> mkdir /sys/fs/cgroup/test
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> swapoff -a
> # Expectation: Pages are demoted to Nodes 4-5
> # Observation: No pages are demoted before oom.
> stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Bing Jiao <bingjiao@google.com>
>
> ---
> v3 -> v4:
> Update functions to filter out nodes that are not in cgroup's
> mems_allowed rather than returning mems_allowed directly.
> It minimizes stack usage and remains versatile enough to return
> mems_allowed when all possible nodes are set in the passed
> nodemask_t pointer.
> ---
>
> include/linux/cpuset.h | 7 ++++---
> include/linux/memcontrol.h | 6 +++---
> kernel/cgroup/cpuset.c | 24 +++++++++++++++---------
> mm/memcontrol.c | 5 +++--
> mm/vmscan.c | 32 +++++++++++++++++++++-----------
> 5 files changed, 46 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a98d3330385c..c937537f318a 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,7 +174,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> -extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +extern void cpuset_nodes_filter_allowed(struct cgroup *cgroup,
> + nodemask_t *mask);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -301,9 +302,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> -static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +static inline void cpuset_nodes_filter_allowed(struct cgroup *cgroup,
> + nodemask_t *mask)
> {
> - return true;
> }
> #endif /* !CONFIG_CPUSETS */
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index fd400082313a..911e0c71453e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1740,7 +1740,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
> rcu_read_unlock();
> }
>
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
> +void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
>
> void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
>
> @@ -1811,9 +1811,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
> return 0;
> }
>
> -static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +static inline void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg,
> + nodemask_t *mask)
> {
> - return true;
> }
>
> static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 6e6eb09b8db6..cc0e1d42611c 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4416,27 +4416,34 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
> return allowed;
> }
>
> -bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +/**
> + * cpuset_nodes_filter_allowed - filter out nodes not in cgroup's mems_allowed.
> + * @cgroup: pointer to struct cgroup.
> + * @mask: pointer to struct nodemask_t to be filtered.
> + *
> + * Description: Filters out nodes that are not in mems_allowed for @cgroup.
> + * Nodes returned in @mask are not guaranteed to be online.
> + **/
> +void cpuset_nodes_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
> {
> struct cgroup_subsys_state *css;
> struct cpuset *cs;
> - bool allowed;
>
> /*
> * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
> * and mems_allowed is likely to be empty even if we could get to it,
> - * so return true to avoid taking a global lock on the empty check.
> + * so return directly to avoid taking a global lock on the empty check.
> */
> - if (!cpuset_v2())
> - return true;
> + if (!cgroup || !cpuset_v2())
> + return;
>
> css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
> if (!css)
> - return true;
> + return;
>
> /*
> * Normally, accessing effective_mems would require the cpuset_mutex
> - * or callback_lock - but node_isset is atomic and the reference
> + * or callback_lock - but not doing so is acceptable and the reference
> * taken via cgroup_get_e_css is sufficient to protect css.
> *
> * Since this interface is intended for use by migration paths, we
> @@ -4447,9 +4454,8 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> * cannot make strong isolation guarantees, so this is acceptable.
> */
> cs = container_of(css, struct cpuset, css);
> - allowed = node_isset(nid, cs->effective_mems);
> + nodes_and(*mask, *mask, cs->effective_mems);

Why do we need the AND operation here? Can't we just copy
cs->effective_mems into mask directly?

Per Longman's suggestion, name this cpuset_nodes_allowed() and do the
filtering in mem_cgroup_node_filter_allowed(). Please keep the logic that
retrieves the allowed nodes common.
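
To illustrate the split I have in mind, here is a rough userspace sketch,
not kernel code. The types nodemask_model_t and struct cpuset_model, and the
NODE_BIT/ALL_NODES macros, are hypothetical stand-ins for nodemask_t and
struct cpuset; locking and css reference handling are left out entirely.
The retrieval helper only reports the allowed nodes, and the memcg-side
helper does the intersection:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is present. */
typedef uint64_t nodemask_model_t;

#define NODE_BIT(n) ((nodemask_model_t)1 << (n))
#define ALL_NODES   (~(nodemask_model_t)0)

/* Hypothetical stand-in for struct cpuset; only the field used here. */
struct cpuset_model {
	nodemask_model_t effective_mems;
};

/*
 * Retrieval only: copy the allowed nodes into @mask.
 * A NULL @cs models the "no cpuset restriction" case.
 */
static void cpuset_nodes_allowed(const struct cpuset_model *cs,
				 nodemask_model_t *mask)
{
	assert(mask != NULL);
	*mask = cs ? cs->effective_mems : ALL_NODES;
}

/* Filtering: intersect the caller's candidate nodes with the allowed nodes. */
static void mem_cgroup_node_filter_allowed(const struct cpuset_model *cs,
					   nodemask_model_t *mask)
{
	nodemask_model_t allowed;

	assert(mask != NULL);
	cpuset_nodes_allowed(cs, &allowed);
	*mask &= allowed;
}
```

With the layout from the "Bug 2 reproduction" above (demotion targets 3-5,
cpuset.mems 0-2,4-5), the filtered mask in this model ends up as nodes 4-5.
Note that this shape needs a temporary mask in the filtering helper, which
is the stack usage the v4 changelog mentions wanting to minimize.
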
> css_put(css);
> - return allowed;
> }
>
> /**
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75fc22a33b28..4c850805b7a9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5597,9 +5597,10 @@ subsys_initcall(mem_cgroup_swap_init);
>
> #endif /* CONFIG_SWAP */
>
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
> {
> - return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
> + if (memcg)
> + cpuset_nodes_filter_allowed(memcg->css.cgroup, mask);
> }
>
> void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ac16b6b984ab..919e116ddaf3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -345,18 +345,24 @@ static bool can_demote(int nid, struct scan_control *sc,
> struct mem_cgroup *memcg)
> {
> int demotion_nid;
> + struct pglist_data *pgdat = NODE_DATA(nid);
> + nodemask_t allowed_mask;
>
> - if (!numa_demotion_enabled)
> + if (!pgdat || !numa_demotion_enabled)
> return false;
> if (sc && sc->no_demotion)
> return false;
>
> - demotion_nid = next_demotion_node(nid);
> - if (demotion_nid == NUMA_NO_NODE)
> + node_get_allowed_targets(pgdat, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> + return false;
> +
> + /* Filter out nodes that are not in cgroup's mems_allowed. */
> + mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> return false;
>
> - /* If demotion node isn't in the cgroup's mems_allowed, fall back */
> - if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
> + for_each_node_mask(demotion_nid, allowed_mask) {
> int z;
> struct zone *zone;
> struct pglist_data *pgdat = NODE_DATA(demotion_nid);
> @@ -1029,7 +1035,8 @@ static struct folio *alloc_demote_folio(struct folio *src,
> * Folios which are not demoted are left on @demote_folios.
> */
> static unsigned int demote_folio_list(struct list_head *demote_folios,
> - struct pglist_data *pgdat)
> + struct pglist_data *pgdat,
> + struct mem_cgroup *memcg)
> {
> int target_nid = next_demotion_node(pgdat->node_id);
> unsigned int nr_succeeded;
> @@ -1043,7 +1050,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> */
> .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> __GFP_NOMEMALLOC | GFP_NOWAIT,
> - .nid = target_nid,
> .nmask = &allowed_mask,
> .reason = MR_DEMOTION,
> };
> @@ -1051,10 +1057,14 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> if (list_empty(demote_folios))
> return 0;
>
> - if (target_nid == NUMA_NO_NODE)
> - return 0;
> -
> node_get_allowed_targets(pgdat, &allowed_mask);
> + mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> + return false;
> +
> + if (!node_isset(target_nid, allowed_mask))
> + target_nid = node_random(&allowed_mask);
> + mtc.nid = target_nid;
>
> /* Demotion ignores all cpuset and mempolicy settings */
> migrate_pages(demote_folios, alloc_demote_folio, NULL,
> @@ -1576,7 +1586,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> /* 'folio_list' is always empty here */
>
> /* Migrate folios selected for demotion */
> - nr_demoted = demote_folio_list(&demote_folios, pgdat);
> + nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
> nr_reclaimed += nr_demoted;
> stat->nr_demoted += nr_demoted;
> /* Folios that could not be demoted are still in @demote_folios */
> --
> 2.52.0.358.g0dd7633a29-goog
--
Best regards,
Ridong