linux-mm.kvack.org archive mirror
From: Waiman Long <llong@redhat.com>
To: Bing Jiao <bingjiao@google.com>, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	gourry@gourry.net, hannes@cmpxchg.org, mhocko@kernel.org,
	roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, tj@kernel.org, mkoutny@suse.com,
	david@kernel.org, zhengqi.arch@bytedance.com,
	lorenzo.stoakes@oracle.com, axelrasmussen@google.com,
	chenridong@huaweicloud.com, yuanchu@google.com,
	weixugc@google.com, cgroups@vger.kernel.org
Subject: Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion
Date: Fri, 26 Dec 2025 15:24:29 -0500
Message-ID: <84ed9b5d-41d5-44a1-a1ad-2b3de8b50a50@redhat.com>
In-Reply-To: <20251223212032.665731-1-bingjiao@google.com>

On 12/23/25 4:19 PM, Bing Jiao wrote:
> Fix two bugs in demote_folio_list() and can_demote() due to incorrect
> demotion target checks in reclaim/demotion.
>
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote(). However:
>
>    1. It does not apply this check in demote_folio_list(), which leads
>       to situations where pages are demoted to nodes that are
>       explicitly excluded from the task's cpuset.mems.
>
>    2. It checks only the nodes in the immediate next demotion hierarchy
>       and does not check all allowed demotion targets in can_demote().
>       This can cause pages to never be demoted if the nodes in the next
>       demotion hierarchy are not set in mems_effective.
>
> These bugs break resource isolation provided by cpuset.mems.
> This is visible from userspace because pages can either fail to be
> demoted entirely or are demoted to nodes that are not allowed
> in multi-tier memory systems.
>
> To address these bugs, update cpuset_node_allowed() and
> mem_cgroup_node_allowed() to return effective_mems, so that it can be
> ANDed directly against the demotion targets. Also update can_demote()
> and demote_folio_list() accordingly.
>
> Reproduce Bug 1:
>    Assume a system with 4 nodes, where nodes 0-1 are top-tier and
>    nodes 2-3 are far-tier memory. All nodes have equal capacity.
>
>    Test script:
>      echo 1 > /sys/kernel/mm/numa/demotion_enabled
>      mkdir /sys/fs/cgroup/test
>      echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>      echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
>      echo $$ > /sys/fs/cgroup/test/cgroup.procs
>      swapoff -a
>      # Expectation: Should respect node 0-2 limit.
>      # Observation: Node 3 shows significant allocation (MemFree drops)
>      stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
>
> Reproduce Bug 2:
>    Assume a system with 6 nodes, where nodes 0-2 are top-tier,
>    node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
>    All nodes have equal capacity.
>
>    Test script:
>      echo 1 > /sys/kernel/mm/numa/demotion_enabled
>      mkdir /sys/fs/cgroup/test
>      echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>      echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
>      echo $$ > /sys/fs/cgroup/test/cgroup.procs
>      swapoff -a
>      # Expectation: Pages are demoted to Nodes 4-5
>      # Observation: No pages are demoted before oom.
>      stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Bing Jiao <bingjiao@google.com>
> ---
>   include/linux/cpuset.h     |  6 +++---
>   include/linux/memcontrol.h |  6 +++---
>   kernel/cgroup/cpuset.c     | 16 ++++++++--------
>   mm/memcontrol.c            |  6 ++++--
>   mm/vmscan.c                | 35 +++++++++++++++++++++++------------
>   5 files changed, 41 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a98d3330385c..eb358c3aa9c0 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>   	task_unlock(current);
>   }
>
> -extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +extern nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup);
>   #else /* !CONFIG_CPUSETS */
>
>   static inline bool cpusets_enabled(void) { return false; }
> @@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>   	return false;
>   }
>
> -static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +static inline nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
>   {
> -	return true;
> +	return node_possible_map;
>   }

The nodemask_t type can be large depending on the setting of
CONFIG_NODES_SHIFT. Passing a large data structure by value on the stack
may not be a good idea. You can return a pointer to nodemask_t instead.
In that case, you will have to add a "const" qualifier to the return type
to make sure that the node mask won't get accidentally modified.
Alternatively, you can pass a nodemask_t pointer as an output parameter
and copy out the nodemask_t data.
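
To make the two alternatives concrete, here is a rough userspace sketch. It is
not kernel code: nodemask_t is re-declared as a plain bitmap, NODES_SHIFT is
assumed to be 10, and every toy_* name is hypothetical and exists only for
illustration. It also ignores the locking and lifetime questions that a real
pointer-returning helper in cpuset.c would have to answer.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value; in the kernel this comes from CONFIG_NODES_SHIFT. */
#define NODES_SHIFT   10
#define MAX_NUMNODES  (1 << NODES_SHIFT)
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define NODE_WORDS    ((MAX_NUMNODES + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Userspace stand-in for the kernel's nodemask_t (1024 bits = 128 bytes here). */
typedef struct { unsigned long bits[NODE_WORDS]; } nodemask_t;

/* Hypothetical stand-in for a cpuset holding its effective memory nodes. */
struct toy_cpuset { nodemask_t effective_mems; };

static void toy_node_set(int nid, nodemask_t *mask)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	mask->bits[nid / BITS_PER_LONG] |= 1UL << (nid % BITS_PER_LONG);
}

static bool toy_node_isset(int nid, const nodemask_t *mask)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	return (mask->bits[nid / BITS_PER_LONG] >> (nid % BITS_PER_LONG)) & 1UL;
}

/* Alternative 1: return a const pointer, so nothing large is copied
 * and the caller cannot modify the mask through it. */
static const nodemask_t *toy_nodes_allowed_ptr(const struct toy_cpuset *cs)
{
	return &cs->effective_mems;
}

/* Alternative 2: the caller supplies storage and the mask is copied out. */
static void toy_nodes_allowed_copy(const struct toy_cpuset *cs, nodemask_t *out)
{
	*out = cs->effective_mems;
}

/* AND two masks, e.g. demotion targets against the allowed nodes. */
static void toy_nodes_and(nodemask_t *dst, const nodemask_t *a,
			  const nodemask_t *b)
{
	for (size_t i = 0; i < NODE_WORDS; i++)
		dst->bits[i] = a->bits[i] & b->bits[i];
}
```

With either variant, a caller can AND the allowed mask against the demotion
targets without a by-value nodemask_t return. In the Bug 1 setup (allowed
nodes 0-2, far-tier nodes 2-3), the result would contain only node 2.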

The name "cpuset_node_get_allowed" doesn't fit the cpuset naming 
convention. There is a "cpuset_mems_allowed(struct task_struct *)" to 
return "mems_allowed" of a task. This new helper is for returning the 
mems_allowed defined in the cpuset. Perhaps we could just use 
"cpuset_nodes_allowed(struct cgroup *)".

Cheers,
Longman


