From: Shakeel Butt <shakeel.butt@linux.dev>
To: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: linux-mm@kvack.org, Jiayuan Chen <jiayuan.chen@shopee.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Brendan Jackman <jackmanb@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org
Subject: Re: [PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Date: Sat, 17 Jan 2026 21:04:44 -0800
Message-ID: <55m7lphwqp3nywapjutkdzajehvhhulig5v4zsu2lafsxqmwkf@qmwqjgk6z3ke>
In-Reply-To: <20260114074049.229935-2-jiayuan.chen@linux.dev>

On Wed, Jan 14, 2026 at 03:40:35PM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
>
> We observed an issue in production on a multi-node NUMA system where a
> process allocated large amounts of anonymous memory on a single NUMA
> node, pushing that node's free memory below the high watermark and
> evicting most file pages:
>
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
>
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
>
> However, containers on this machine have memory.high set in their
> cgroup. Business processes continuously trigger the high limit, causing
> frequent direct reclaim that keeps resetting kswapd_failures to 0. This
> prevents kswapd from ever stopping.
>
> The key insight is that direct reclaim triggered by cgroup memory.high
> performs aggressive scanning to throttle the allocating process. With
> sufficiently aggressive scanning, even hot pages will eventually be
> reclaimed, making direct reclaim "successful" at freeing some memory.
> However, this success does not mean the node has reached a balanced
> state - the freed memory may still be insufficient to bring free pages
> above the high watermark. Unconditionally resetting kswapd_failures in
> this case keeps kswapd alive indefinitely.
>
> The result is that kswapd runs endlessly. Unlike direct reclaim, which
> only reclaims from the allocating cgroup, kswapd scans the entire node's
> memory. This causes hot file pages from all workloads on the node to be
> evicted, not just those from the cgroup triggering memory.high. These
> pages constantly refault, generating sustained, heavy read I/O
> across the entire system.
>
> Fix this by only resetting kswapd_failures when the node is actually
> balanced. This allows both kswapd and direct reclaim to clear
> kswapd_failures upon successful reclaim, but only when the reclaim
> actually resolves the memory pressure (i.e., the node becomes balanced).
>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

After incorporating suggestions from Johannes, you can add:

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>