From: "Jiayuan Chen" <jiayuan.chen@linux.dev>
To: "Shakeel Butt" <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org, "Jiayuan Chen" <jiayuan.chen@shopee.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"David Hildenbrand" <david@kernel.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Qi Zheng" <zhengqi.arch@bytedance.com>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Axel Rasmussen" <axelrasmussen@google.com>,
"Yuanchu Xie" <yuanchu@google.com>, "Wei Xu" <weixugc@google.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Date: Tue, 23 Dec 2025 08:22:43 +0000
Message-ID: <e93c75cb1a46a60ec415215c555312c82b9145ac@linux.dev>
In-Reply-To: <u2llnnpmpsgarwrt74ffgo3cuwe4apdbeh5hkclzbh5gykwltb@whb7uuj7ub5i>
December 23, 2025 at 14:11, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:
>
> On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
>
> >
> > December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:
> >
> [...]
>
> >
> > > I don't think kswapd is an issue here. The system is out of memory and
> > > most of the memory is unreclaimable. Either change the workload to use
> > > less memory or enable swap (or zswap) to have more reclaimable memory.
> >
> > Hi,
> > Thanks for looking into this.
> >
> > Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:
> >
> > This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:
> >
> > Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
> > Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)
> >
> Thanks, and now the situation is much clearer. IIUC you are running
> multiple workloads (pods) on the system. How are the memcg limits
> configured for these workloads? You mentioned memory.high, what about
Thanks for the questions. We have some pods configured with memory.high and
others with memory.max.
Actually, memory.max itself causes heavy I/O issues for us, because it keeps
aggressively trying to reclaim hot pages within the cgroup without killing
the process.
So we configured some pods with memory.high instead, since it performs reclaim
in resume_user_mode_work, which somewhat throttles the memory allocation of
user processes.
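To make the two configurations concrete, here is a minimal sketch of how
such limits can be set via the cgroup v2 control files (the pod_a/pod_b
paths and the 8G values are illustrative, not our production settings):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a value into a cgroup v2 control file. */
static int write_cg(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	if (write(fd, val, strlen(val)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* memory.high: over the limit, the task reclaims/throttles on
	 * return to user mode instead of being killed. */
	write_cg("/sys/fs/cgroup/pod_a/memory.high", "8G");
	/* memory.max: hard limit; the charge path keeps direct-reclaiming
	 * (even hot pages) to stay under it. */
	write_cg("/sys/fs/cgroup/pod_b/memory.max", "8G");
	return 0;
}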
> memory.max? Also, are you using cpusets to limit the pods to individual
> nodes (cpu & memory), or can they run on any node?
Yes, we have cpusets (only cpuset.cpus, not cpuset.mems) configured for our
cgroups, binding them to specific NUMA nodes. But I don't think this is
directly related to the issue - the problem can occur with or without cpusets.
Even without cpuset.cpus, the kernel prefers to allocate memory from the node
where the process is running, so if a process happens to run on a CPU
belonging to Node 0, the behavior would be similar.
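This local preference is easy to observe; a minimal sketch (assuming a
two-node box where CPU 0 belongs to Node 0; build with -lnuma for the
get_mempolicy() wrapper):

#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	cpu_set_t set;
	size_t len = 16 << 20;
	char *buf;
	int node = -1;

	/* Pin ourselves to CPU 0, much like cpuset.cpus would. */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);

	/* Fault in some memory, then ask which node backs it. */
	buf = malloc(len);
	memset(buf, 1, len);
	if (get_mempolicy(&node, NULL, 0, buf,
			  MPOL_F_NODE | MPOL_F_ADDR) == 0)
		printf("pages landed on node %d\n", node);

	free(buf);
	return 0;
}

With the default local policy this should print node 0 (unless Node 0 is
already exhausted), even though cpuset.mems is not set - which is why
Node 0 takes all the pressure.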
>
> Overall I still think it is unbalanced numa nodes in terms of memory and
> maybe cpu as well. Anyway, let's talk about kswapd.
> >
> > Node 0's kswapd runs continuously but cannot reclaim anything
> > Direct reclaim succeeds by reclaiming from Node 1
> > Direct reclaim resets kswapd_failures,
> >
> So successful reclaim on one node does not reset kswapd_failures on
> another node. The kernel reclaims each node one by one, so only if
> Node 0 direct reclaim was successful does the kernel allow Node 0's
> kswapd_failures to be reset.
Let me dig deeper into this.
When either memory.max or memory.high is reached, direct reclaim is
triggered. Which node's memory gets reclaimed depends on the CPU where
the process is running.
When the problem occurred, we had workloads continuously hitting
memory.max and workloads continuously hitting memory.high:
try_charge_memcg
  -> reclaim_high
    -> try_to_free_mem_cgroup_pages
      -> do_try_to_free_pages   (zones of the current node)
        -> shrink_zones()
          -> shrink_node()
            -> kswapd_failures = 0
Although the pages are hot, if we scan aggressively enough, some of them
will eventually be reclaimed, and then kswapd_failures gets reset -
reclaiming even a single page is enough to reset it to 0.
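To make the feedback loop concrete, here is a toy user-space model of
the counter behaviour described above (not kernel code; the threshold
mirrors the kernel's MAX_RECLAIM_RETRIES back-off):

#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16

static int kswapd_failures;

int main(void)
{
	int i;

	for (i = 0; i < 100; i++) {
		/* kswapd pass on a node with nothing reclaimable. */
		kswapd_failures++;

		/* Every so often, direct reclaim squeezes out a single
		 * hot page on this node - and that alone resets the
		 * counter, as shrink_node() does. */
		if (i % 10 == 0)
			kswapd_failures = 0;

		if (kswapd_failures >= MAX_RECLAIM_RETRIES) {
			puts("kswapd backs off");	/* never reached */
			return 0;
		}
	}
	puts("kswapd never backed off: the counter kept being reset");
	return 0;
}

As long as direct reclaim makes any progress more often than once every
16 kswapd passes, the back-off never triggers and kswapd keeps scanning.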
The end result is that most workloads, which didn't even hit their high
or max limits, experience continuous refaults, causing heavy I/O.
Thanks.
> >
> > preventing Node 0's kswapd from stopping
> > The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O
> >
> Have you tried numa balancing? Though I think it would be better to
> schedule upfront in a way that one node is not overcommitted, numa
> balancing provides a dynamic way to adjust the load on each node.
Yes, we have tried it. Actually, I submitted a patch about a month ago to improve
its observability:
https://lore.kernel.org/all/20251124153331.465306a2@gandalf.local.home/
(though only Steven replied, a bit awkward :( ).
We found that the default settings didn't work well for our workloads. When we tried
to increase scan_size to make it more aggressive, we noticed the system load started
to increase. So we haven't fully adopted it yet.
> Can you dig deeper on who and why Node 0's kswapd_failures is getting
> reset?
>
Thread overview: 6+ messages
[not found] <20251222122022.254268-1-jiayuan.chen@linux.dev>
2025-12-22 18:29 ` Andrew Morton
2025-12-23 1:51 ` Jiayuan Chen
2025-12-22 21:15 ` Shakeel Butt
2025-12-23 1:42 ` Jiayuan Chen
2025-12-23 6:11 ` Shakeel Butt
2025-12-23 8:22 ` Jiayuan Chen [this message]