* [PATCH v3 0/2] mm/vmscan: mitigate spurious kswapd_failures reset and add tracepoints
@ 2026-01-14  7:40 Jiayuan Chen
  2026-01-14  7:40 ` [PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim Jiayuan Chen
  2026-01-14  7:40 ` [PATCH v3 2/2] mm/vmscan: add tracepoint and reason for kswapd_failures reset Jiayuan Chen
From: Jiayuan Chen @ 2026-01-14  7:40 UTC
  To: linux-mm, shakeel.butt
  Cc: Jiayuan Chen, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Brendan Jackman, Johannes Weiner, Zi Yan, Qi Zheng, Jiayuan Chen,
	linux-kernel, linux-trace-kernel

== Problem ==

We observed an issue in production on a multi-NUMA system where kswapd
runs endlessly, causing sustained heavy IO READ pressure across the
entire system.

The root cause is that direct reclaim triggered by cgroup memory.high
keeps resetting kswapd_failures to 0, even when the node cannot be
balanced. This prevents kswapd from ever stopping after reaching
MAX_RECLAIM_RETRIES. The following bpftrace script was used to confirm
the behavior:

```bash
bpftrace -e '
 #include <linux/mmzone.h>
 #include <linux/shrinker.h>
// Fires on entry to balance_pgdat(): report the failure count carried
// over from previous kswapd rounds.
kprobe:balance_pgdat {
	$pgdat = (struct pglist_data *)arg0;
	if ($pgdat->kswapd_failures > 0) {
		printf("[node %d] [%lu] kswapd end, kswapd_failures %d\n",
		       $pgdat->node_id, jiffies, $pgdat->kswapd_failures);
	}
}
// Every direct reclaim completion is a point where kswapd_failures may
// be reset; log it together with the amount reclaimed.
tracepoint:vmscan:mm_vmscan_direct_reclaim_end {
	printf("[cpu %d] [%lu] direct reclaim end (resets kswapd_failures), nr_reclaimed %lu\n",
	       cpu, jiffies, args.nr_reclaimed);
}
'
```

The trace results showed that once kswapd_failures reached 15, continuous
direct reclaim kept resetting it back to 0. This was accompanied by a
flood of kswapd_failures log entries and, shortly afterwards, massive
refaults.
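
For reference, the refault activity can be confirmed from the workingset
counters in /proc/vmstat; a quick way to watch them (the split anon/file
counter names assume a reasonably recent kernel):

```bash
# Sample the workingset refault counters once per second; a rapidly
# growing workingset_refault_file matches the sustained IO READ pressure
# described above.
while true; do
	grep -E '^workingset_refault' /proc/vmstat
	sleep 1
done
```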

== Solution ==

Patch 1 fixes the issue by resetting kswapd_failures only when the node
is actually balanced: it introduces pgdat_try_reset_kswapd_failures(), a
wrapper that checks pgdat_balanced() before clearing the counter.

Patch 2 extends the wrapper to record why kswapd_failures was reset and
adds tracepoints for better observability (see the usage sketch below):
  - mm_vmscan_reset_kswapd_failures: traces each reset together with its reason
  - mm_vmscan_kswapd_reclaim_fail: traces each kswapd reclaim failure
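
As a rough usage sketch, the new events should be controllable from
tracefs like the existing vmscan tracepoints; the paths below assume the
events land in the vmscan group and that tracefs is mounted at
/sys/kernel/tracing:

```bash
# Enable the two tracepoints added by patch 2 (names taken from this
# cover letter; the reported fields depend on the final patch).
cd /sys/kernel/tracing
echo 1 > events/vmscan/mm_vmscan_reset_kswapd_failures/enable
echo 1 > events/vmscan/mm_vmscan_kswapd_reclaim_fail/enable

# Stream the events; each kswapd_failures reset is now reported with the
# reason it happened.
cat trace_pipe
```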

---
v2 -> v3: https://lore.kernel.org/all/20251226080042.291657-1-jiayuan.chen@linux.dev/
  - Add tracepoints for kswapd_failures reset and reclaim failure
  - Expand commit message with test results

v1 -> v2: https://lore.kernel.org/all/20251222122022.254268-1-jiayuan.chen@linux.dev/

Jiayuan Chen (2):
  mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  mm/vmscan: add tracepoint and reason for kswapd_failures reset

 include/linux/mmzone.h        |  9 +++++++
 include/trace/events/vmscan.h | 51 +++++++++++++++++++++++++++++++++++
 mm/memory-tiers.c             |  2 +-
 mm/page_alloc.c               |  2 +-
 mm/vmscan.c                   | 33 ++++++++++++++++++++---
 5 files changed, 91 insertions(+), 6 deletions(-)

-- 
2.43.0


