Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim

linux-mm.kvack.org archive mirror

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
       [not found] <20251222122022.254268-1-jiayuan.chen@linux.dev>
@ 2025-12-22 18:29 ` Andrew Morton
  2025-12-23  1:51   ` Jiayuan Chen
  2025-12-22 21:15 ` Shakeel Butt
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2025-12-22 18:29 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Mon, 22 Dec 2025 20:20:21 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
> 
> We observed an issue in production on a multi-NUMA system where a
> process allocated large amounts of anonymous pages on a single NUMA
> node, causing its watermark to drop below high and evicting most file
> pages:
> 
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
> 
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
> 
> However, pods on this machine have memory.high set in their cgroup.

What's a "pod"?

> Business processes continuously trigger the high limit, causing frequent
> direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> kswapd from ever stopping.
> 
> The result is that kswapd runs endlessly, repeatedly evicting the few
> remaining file pages which are actually hot. These pages constantly
> refault, generating sustained heavy IO READ pressure.

Yes, not good.

> Fix this by only resetting kswapd_failures from direct reclaim when the
> node is actually balanced. This prevents direct reclaim from keeping
> kswapd alive when the node cannot be balanced through reclaim alone.
>
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2648,6 +2648,15 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>  			  lruvec_memcg(lruvec));
>  }
>  
> +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);

Forward declaration could be avoided by relocating pgdat_balanced(),
although the patch will get a lot larger.

> +static inline void reset_kswapd_failures(struct pglist_data *pgdat,
> +					 struct scan_control *sc)

It would be nice to have a nice comment explaining why this is here. 
Why are we checking for balanced?

> +{
> +	if (!current_is_kswapd() &&

kswapd can no longer clear ->kswapd_failures.  What's the thinking here?

> +	    pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
> +		atomic_set(&pgdat->kswapd_failures, 0);
> +}
> +
>  #ifdef CONFIG_LRU_GEN
>  
>  #ifdef CONFIG_LRU_GEN_ENABLED
> @@ -5065,7 +5074,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
>  	blk_finish_plug(&plug);
>  done:
>  	if (sc->nr_reclaimed > reclaimed)
> -		atomic_set(&pgdat->kswapd_failures, 0);
> +		reset_kswapd_failures(pgdat, sc);
>  }
>  
>  /******************************************************************************
> @@ -6139,7 +6148,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	 * successful direct reclaim run will revive a dormant kswapd.
>  	 */
>  	if (reclaimable)
> -		atomic_set(&pgdat->kswapd_failures, 0);
> +		reset_kswapd_failures(pgdat, sc);
>  	else if (sc->cache_trim_mode)
>  		sc->cache_trim_mode_failed = 1;
>  }




* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
       [not found] <20251222122022.254268-1-jiayuan.chen@linux.dev>
  2025-12-22 18:29 ` [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim Andrew Morton
@ 2025-12-22 21:15 ` Shakeel Butt
  2025-12-23  1:42   ` Jiayuan Chen
  1 sibling, 1 reply; 6+ messages in thread
From: Shakeel Butt @ 2025-12-22 21:15 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Mon, Dec 22, 2025 at 08:20:21PM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
> 
> We observed an issue in production on a multi-NUMA system where a
> process allocated large amounts of anonymous pages on a single NUMA
> node, causing its watermark to drop below high and evicting most file
> pages:
> 
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
> 
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
> 
> However, pods on this machine have memory.high set in their cgroup.
> Business processes continuously trigger the high limit, causing frequent
> direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> kswapd from ever stopping.
> 
> The result is that kswapd runs endlessly, repeatedly evicting the few
> remaining file pages which are actually hot. These pages constantly
> refault, generating sustained heavy IO READ pressure.

I don't think kswapd is an issue here. The system is out of memory and
most of the memory is unreclaimable. Either change the workload to use
less memory or enable swap (or zswap) to have more reclaimable memory.

Other than that, we can discuss whether memcg reclaim resetting the
kswapd failure count should be changed, but that is a separate
discussion.




* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-22 21:15 ` Shakeel Butt
@ 2025-12-23  1:42   ` Jiayuan Chen
  2025-12-23  6:11     ` Shakeel Butt
  0 siblings, 1 reply; 6+ messages in thread
From: Jiayuan Chen @ 2025-12-23  1:42 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:


> 
> On Mon, Dec 22, 2025 at 08:20:21PM +0800, Jiayuan Chen wrote:
> 
> > 
> > From: Jiayuan Chen <jiayuan.chen@shopee.com>
> >  
> >  When kswapd fails to reclaim memory, kswapd_failures is incremented.
> >  Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> >  futile reclaim attempts. However, any successful direct reclaim
> >  unconditionally resets kswapd_failures to 0, which can cause problems.
> >  
> >  We observed an issue in production on a multi-NUMA system where a
> >  process allocated large amounts of anonymous pages on a single NUMA
> >  node, causing its watermark to drop below high and evicting most file
> >  pages:
> >  
> >  $ numastat -m
> >  Per-node system memory usage (in MBs):
> >                            Node 0          Node 1           Total
> >                   --------------- --------------- ---------------
> >  MemTotal               128222.19       127983.91       256206.11
> >  MemFree                  1414.48         1432.80         2847.29
> >  MemUsed                126807.71       126551.11       252358.82
> >  SwapCached                  0.00            0.00            0.00
> >  Active                  29017.91        25554.57        54572.48
> >  Inactive                92749.06        95377.00       188126.06
> >  Active(anon)            28998.96        23356.47        52355.43
> >  Inactive(anon)          92685.27        87466.11       180151.39
> >  Active(file)               18.95         2198.10         2217.05
> >  Inactive(file)             63.79         7910.89         7974.68
> >  
> >  With swap disabled, only file pages can be reclaimed. When kswapd is
> >  woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> >  raise free memory above the high watermark since reclaimable file pages
> >  are insufficient. Normally, kswapd would eventually stop after
> >  kswapd_failures reaches MAX_RECLAIM_RETRIES.
> >  
> >  However, pods on this machine have memory.high set in their cgroup.
> >  Business processes continuously trigger the high limit, causing frequent
> >  direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> >  kswapd from ever stopping.
> >  
> >  The result is that kswapd runs endlessly, repeatedly evicting the few
> >  remaining file pages which are actually hot. These pages constantly
> >  refault, generating sustained heavy IO READ pressure.
> > 
> I don't think kswapd is an issue here. The system is out of memory and
> most of the memory is unreclaimable. Either change the workload to use
> less memory or enable swap (or zswap) to have more reclaimable memory.


Hi,
Thanks for looking into this.

Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:

This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:

Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)

Node 0's kswapd runs continuously but cannot reclaim anything
Direct reclaim succeeds by reclaiming from Node 1
Direct reclaim resets kswapd_failures, preventing Node 0's kswapd from stopping
The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O

From a per-node perspective, Node 0 is truly out of reclaimable memory and its kswapd
should stop. But the global direct reclaim success (from Node 1) incorrectly keeps
Node 0's kswapd alive.


Thanks.

> Other than that, we can discuss whether memcg reclaim resetting the
> kswapd failure count should be changed, but that is a separate
> discussion.
>



* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-22 18:29 ` [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim Andrew Morton
@ 2025-12-23  1:51   ` Jiayuan Chen
  0 siblings, 0 replies; 6+ messages in thread
From: Jiayuan Chen @ 2025-12-23  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Jiayuan Chen, Johannes  Weiner, David Hildenbrand,
	Michal  Hocko, Qi Zheng, Shakeel  Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

December 23, 2025 at 02:29, "Andrew Morton" <akpm@linux-foundation.org> wrote:

Hi Andrew,
Thanks for the review.
> 
> On Mon, 22 Dec 2025 20:20:21 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> 
> > 
> > From: Jiayuan Chen <jiayuan.chen@shopee.com>
> >  
> >  When kswapd fails to reclaim memory, kswapd_failures is incremented.
> >  Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> >  futile reclaim attempts. However, any successful direct reclaim
> >  unconditionally resets kswapd_failures to 0, which can cause problems.
> >  
> >  We observed an issue in production on a multi-NUMA system where a
> >  process allocated large amounts of anonymous pages on a single NUMA
> >  node, causing its watermark to drop below high and evicting most file
> >  pages:
> >  
> >  $ numastat -m
> >  Per-node system memory usage (in MBs):
> >                            Node 0          Node 1           Total
> >                   --------------- --------------- ---------------
> >  MemTotal               128222.19       127983.91       256206.11
> >  MemFree                  1414.48         1432.80         2847.29
> >  MemUsed                126807.71       126551.11       252358.82
> >  SwapCached                  0.00            0.00            0.00
> >  Active                  29017.91        25554.57        54572.48
> >  Inactive                92749.06        95377.00       188126.06
> >  Active(anon)            28998.96        23356.47        52355.43
> >  Inactive(anon)          92685.27        87466.11       180151.39
> >  Active(file)               18.95         2198.10         2217.05
> >  Inactive(file)             63.79         7910.89         7974.68
> >  
> >  With swap disabled, only file pages can be reclaimed. When kswapd is
> >  woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> >  raise free memory above the high watermark since reclaimable file pages
> >  are insufficient. Normally, kswapd would eventually stop after
> >  kswapd_failures reaches MAX_RECLAIM_RETRIES.
> >  
> >  However, pods on this machine have memory.high set in their cgroup.
> > 
> What's a "pod"?

A pod is the Kubernetes scheduling unit: a group of one or more containers. Sorry for the unclear terminology.


> > 
> > Business processes continuously trigger the high limit, causing frequent
> >  direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> >  kswapd from ever stopping.
> >  
> >  The result is that kswapd runs endlessly, repeatedly evicting the few
> >  remaining file pages which are actually hot. These pages constantly
> >  refault, generating sustained heavy IO READ pressure.
> > 
> Yes, not good.
> 
> > 
> > Fix this by only resetting kswapd_failures from direct reclaim when the
> >  node is actually balanced. This prevents direct reclaim from keeping
> >  kswapd alive when the node cannot be balanced through reclaim alone.
> > 
> >  ...
> > 
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2648,6 +2648,15 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
> >  			  lruvec_memcg(lruvec));
> >  }
> >  
> > +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
> > 
> Forward declaration could be avoided by relocating pgdat_balanced(),
> although the patch will get a lot larger.

Thanks for pointing this out.

> > 
> > +static inline void reset_kswapd_failures(struct pglist_data *pgdat,
> > +					 struct scan_control *sc)
> > 
> It would be nice to have a nice comment explaining why this is here. 
> Why are we checking for balanced?

You're right, a comment explaining the rationale would be helpful.


> > 
> > +{
> > +	if (!current_is_kswapd() &&
> > 
> kswapd can no longer clear ->kswapd_failures. What's the thinking here?


Good catch. My original thinking was that kswapd already checks pgdat_balanced()
in its own path after successful reclaim, so I wanted to avoid redundant checks.
But looking at the code again, this is indeed a bug - kswapd's reclaim path does
need to clear kswapd_failures on successful reclaim.
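
To make the intended behaviour concrete, here is a small userspace model of
the reset rule as described above: kswapd clears the counter whenever it
makes progress, while direct reclaim clears it only when the node is
balanced. This is only a sketch under assumptions. struct node_model,
node_balanced() and reset_failures_after_progress() are hypothetical names,
the balance check is reduced to a single free-pages vs. high-watermark
comparison, and none of it is taken from a posted revision of the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-node state; not a kernel type. */
struct node_model {
	int kswapd_failures;
	long free_pages;
	long high_wmark_pages;
};

/* Simplified stand-in for pgdat_balanced(): free pages meet the high watermark. */
static bool node_balanced(const struct node_model *n)
{
	return n->free_pages >= n->high_wmark_pages;
}

/*
 * Called after a reclaim pass that freed at least one page.
 * kswapd always clears the counter; a direct reclaimer clears it
 * only when the node is balanced.
 */
void reset_failures_after_progress(struct node_model *n, bool is_kswapd)
{
	if (is_kswapd || node_balanced(n))
		n->kswapd_failures = 0;
}
```

With an unbalanced node, a call with is_kswapd == false leaves the counter
untouched, and a call with is_kswapd == true clears it.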



* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-23  1:42   ` Jiayuan Chen
@ 2025-12-23  6:11     ` Shakeel Butt
  2025-12-23  8:22       ` Jiayuan Chen
  0 siblings, 1 reply; 6+ messages in thread
From: Shakeel Butt @ 2025-12-23  6:11 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:
> 
[...]
> 
> > > 
> > I don't think kswapd is an issue here. The system is out of memory and
> > most of the memory is unreclaimable. Either change the workload to use
> > less memory or enable swap (or zswap) to have more reclaimable memory.
> 
> 
> Hi,
> Thanks for looking into this.
> 
> Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:
> 
> This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:
> 
> Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
> Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)

Thanks, the situation is much clearer now. IIUC you are running
multiple workloads (pods) on the system. How are the memcg limits
configured for these workloads? You mentioned memory.high, what about
memory.max? Also, are you using cpusets to limit the pods to individual
nodes (cpu & memory), or can they run on any node?

Overall I still think the NUMA nodes are unbalanced in terms of memory,
and maybe CPU as well. Anyway, let's talk about kswapd.

> 
> Node 0's kswapd runs continuously but cannot reclaim anything
> Direct reclaim succeeds by reclaiming from Node 1
> Direct reclaim resets kswapd_failures,

So successful reclaim on one node does not reset kswapd_failures on
another node. The kernel reclaims each node one by one, so Node 0's
kswapd_failures is reset only if direct reclaim on Node 0 itself was
successful.

> preventing Node 0's kswapd from stopping
> The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O
> 

Have you tried numa balancing? Though I think it would be better to
schedule upfront in a way that one node is not overcommitted but numa
balancing provides a dynamic way to adjust the load on each node.

Can you dig deeper into who is resetting Node 0's kswapd_failures, and
why?
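
The per-node behaviour described above can be sketched in userspace as
follows. This is a toy model under assumptions, not kernel code: struct
numa_node and direct_reclaim() are hypothetical names, and reclaim is
reduced to taking pages from each node in order until the target is met.
The point it illustrates is that a node's counter is cleared only when
that node itself gave up pages.

```c
/* Hypothetical per-node state for the model; not a kernel type. */
struct numa_node {
	int kswapd_failures;
	long reclaimable_pages;	/* pages reclaim could free on this node */
};

/*
 * Toy direct reclaim: visit nodes in order and take up to `want` pages.
 * Only a node that contributed pages has its kswapd_failures cleared.
 * Returns the total number of pages reclaimed.
 */
long direct_reclaim(struct numa_node nodes[], int nr_nodes, long want)
{
	long total = 0;

	for (int i = 0; i < nr_nodes && total < want; i++) {
		long got = nodes[i].reclaimable_pages;

		if (got > want - total)
			got = want - total;

		nodes[i].reclaimable_pages -= got;
		total += got;

		/* Progress on this node, and only this node, clears its counter. */
		if (got > 0)
			nodes[i].kswapd_failures = 0;
	}
	return total;
}
```

In this model, if Node 0 has nothing reclaimable, a pass that succeeds
entirely on Node 1 leaves Node 0's counter alone. A single reclaimable
page on Node 0 is enough to clear it.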


* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-23  6:11     ` Shakeel Butt
@ 2025-12-23  8:22       ` Jiayuan Chen
  0 siblings, 0 replies; 6+ messages in thread
From: Jiayuan Chen @ 2025-12-23  8:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

December 23, 2025 at 14:11, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:

> 
> On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> 
> > 
> > December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:
> > 
> [...]
> 
> > 
> > > 
> >  I don't think kswapd is an issue here. The system is out of memory and
> >  most of the memory is unreclaimable. Either change the workload to use
> >  less memory or enable swap (or zswap) to have more reclaimable memory.
> >  
> >  
> >  Hi,
> >  Thanks for looking into this.
> >  
> >  Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:
> >  
> >  This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:
> >  
> >  Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
> >  Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)
> > 
> Thanks, the situation is much clearer now. IIUC you are running
> multiple workloads (pods) on the system. How are the memcg limits
> configured for these workloads? You mentioned memory.high, what about

Thanks for the questions. We have pods configured with memory.high and pods configured with memory.max.

Actually, memory.max itself causes heavy I/O issues for us, because it keeps trying to reclaim hot
pages within the cgroup aggressively without killing the process. 

So we configured some pods with memory.high instead, since it performs reclaim in resume_user_mode_work,
which somewhat throttles the memory allocation of user processes.

> memory.max? Also, are you using cpusets to limit the pods to individual
> nodes (cpu & memory), or can they run on any node?

Yes, we have cpusets configured for our cgroups (only cpuset.cpus, not cpuset.mems), binding
them to specific NUMA nodes. But I don't think this is directly related to the issue: the
problem can occur with or without cpusets. Even without cpuset.cpus, the kernel prefers
to allocate memory from the node where the process is running, so if a process happens to
run on a CPU belonging to Node 0, the behavior would be similar.

> 
> Overall I still think the NUMA nodes are unbalanced in terms of memory,
> and maybe CPU as well. Anyway, let's talk about kswapd.
> > 
> > Node 0's kswapd runs continuously but cannot reclaim anything
> >  Direct reclaim succeeds by reclaiming from Node 1
> >  Direct reclaim resets kswapd_failures,
> > 
> So successful reclaim on one node does not reset kswapd_failures on
> another node. The kernel reclaims each node one by one, so Node 0's
> kswapd_failures is reset only if direct reclaim on Node 0 itself was
> successful.

Let me dig deeper into this.

When either memory.max or memory.high is reached, direct reclaim is
triggered. The memory being reclaimed depends on the CPU where the
process is running.

When the problem occurred, we had workloads continuously hitting 
memory.max and workloads continuously hitting memory.high:

reclaim_high ------+
                   +--> try_to_free_mem_cgroup_pages()
try_charge_memcg --+      do_try_to_free_pages(zone of current node)
                            shrink_zones()
                              shrink_node()
                                kswapd_failures = 0

Although the pages are hot, if we scan aggressively enough, they will eventually
be reclaimed, and then kswapd_failures gets reset to 0 - because even reclaiming
a single page resets kswapd_failures to 0.

The end result is that most workloads, which didn't even hit their high
or max limits, experience continuous refaults, causing heavy I/O.
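
To illustrate why an occasional single-page success is enough to keep
kswapd running, here is a toy userspace model of the counter. It is a
sketch under assumptions, not kernel code: rounds_until_dormant() is a
hypothetical name, kswapd is assumed to fail every round, and a direct
reclaimer is assumed to free at least one page every direct_period rounds.
MAX_RECLAIM_RETRIES is set to 16 to match the kernel's definition.

```c
#define MAX_RECLAIM_RETRIES 16	/* matches the kernel's definition */

/*
 * Toy model: kswapd fails to balance the node in every round, and a
 * direct reclaimer frees at least one page every `direct_period` rounds
 * (0 means direct reclaim never succeeds).
 *
 * Returns the round in which kswapd would give up, or -1 if it is still
 * running after `rounds` rounds.
 */
int rounds_until_dormant(int rounds, int direct_period)
{
	int failures = 0;

	for (int round = 1; round <= rounds; round++) {
		/* kswapd ran and made no progress */
		failures++;
		if (failures >= MAX_RECLAIM_RETRIES)
			return round;

		/* unconditional reset on any direct reclaim progress */
		if (direct_period > 0 && round % direct_period == 0)
			failures = 0;
	}
	return -1;
}
```

Without direct reclaim the model stops after 16 rounds. With one
successful direct reclaim every 15 rounds or fewer, the counter never
reaches the threshold and kswapd never stops.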

Thanks.

> > 
> > preventing Node 0's kswapd from stopping
> >  The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O
> > 
> Have you tried numa balancing? Though I think it would be better to
> schedule upfront in a way that one node is not overcommitted but numa
> balancing provides a dynamic way to adjust the load on each node.

Yes, we have tried it. Actually, I submitted a patch about a month ago to improve
its observability:
https://lore.kernel.org/all/20251124153331.465306a2@gandalf.local.home/
(though only Steven replied, a bit awkward :( ).

We found that the default settings didn't work well for our workloads. When we tried
to increase scan_size to make it more aggressive, we noticed the system load started
to increase. So we haven't fully adopted it yet.

> Can you dig deeper into who is resetting Node 0's kswapd_failures, and
> why?
>

