linux-mm.kvack.org archive mirror
From: Matt Fleming <matt@readmodwrite.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jens Axboe <axboe@kernel.dk>, Minchan Kim <minchan@kernel.org>,
	Sergey Senozhatsky <senozhatsky@chromium.org>,
	Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
	Barry Song <baohua@kernel.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Brendan Jackman <jackmanb@google.com>, Zi Yan <ziy@nvidia.com>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kernel-team@cloudflare.com,
	Matt Fleming <mfleming@cloudflare.com>
Subject: Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Date: Wed, 4 Mar 2026 15:35:52 +0000
Message-ID: <wqv7kbfipi5rk5nt2r4bjcqslvluz5mklyz7u33vhytgie7djx@vbjz5ndkaf53>
In-Reply-To: <aac38JHJCRB0PbjJ@cmpxchg.org>

On Tue, Mar 03, 2026 at 02:35:12PM -0500, Johannes Weiner wrote:
> 
> What about when anon pages *are* reclaimable through compression,
> though? Then we'd declare OOM prematurely.
 
I agree this RFC is a rather blunt approach, which is why I tried to
limit it to zram/brd specifically.

> You could make the case that what is reclaimable should have been
> reclaimed already by the time we get here. But then you could make the
> same case for file pages, and then there is nothing left.
> 
> The check is meant to be an optimization. The primary OOM cutoff is
> that we aren't able to reclaim anything. This reclaimable check is a
> shortcut that says, even if we are reclaiming some, there is not
> enough juice in that box to keep squeezing.
> 
> Have you looked at what exactly keeps resetting no_progress_loops when
> the system is in this state?
 
I pulled data from some of the worst offenders, but I couldn't catch
any of them during one of these 20-30 minute brownouts. Still, I think
the data below illustrates the problem.

Across three machines, every reclaim_retry_zone event showed
no_progress_loops = 0 and wmark_check = pass. On the busiest node (141
retry events over 5 minutes), the reclaimable estimate ranged from 4.8M
to 5.3M pages (19-21 GiB). no_progress_loops never incremented.

The reclaimable watermark check also always passes. The traced
reclaimable values (19-21 GiB per zone) trivially exceed the min
watermark (~68 MiB), so should_reclaim_retry() never falls through on
that path either.

Sample output from a bpftrace script [1] on the reclaim_retry_zone
tracepoint (LOOPS = no_progress_loops, WMARK = wmark_check):

  COMM             PID    NODE ORDER    RECLAIMABLE      AVAILABLE      MIN_WMARK LOOPS WMARK
  app1          2133536     4     0        4960156        5013010          17522     0     1
  app2          2337869     5     0        4845655        4901543          17521     0     1
  app3           339457     6     0        4823519        4838900          17522     0     1
  app4          2179800     6     0        4819201        4835085          17522     0     1
  app5          2299092     0     0        3566433        3595953          15821     0     1
  app6          2194373     7     0        5612347        5626651          17521     0     1
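
(The RECLAIMABLE/AVAILABLE/MIN_WMARK columns are in 4 KiB pages; a
quick conversion shows the scale of the mismatch:)

```python
# Convert the trace columns (4 KiB pages) into human-readable units.
PAGE = 4096
print(round(4960156 * PAGE / 2**30, 1))  # app1 RECLAIMABLE -> ~18.9 GiB
print(round(17522 * PAGE / 2**20, 1))    # app1 MIN_WMARK   -> ~68.4 MiB
```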

Here are the numbers from a 5-minute bpftrace session on a node under
memory pressure:

  should_reclaim_retry:
    141 calls, no_progress_loops = 0 every time, wmark_check = pass every time
    reclaimable estimate: 4.8M - 5.3M pages (19-21 GiB)

  shrink_folio_list (mm_vmscan_lru_shrink_inactive) [2]:
    anon:  52M pages reclaimed / 244M scanned  (21% hit rate)
           53% of scan events reclaimed zero pages
    file:  33M pages reclaimed / 42M scanned   (78% hit rate)
           21% of scan events reclaimed zero pages

    priority distribution peaked at 2-3 (most aggressive levels)

[1] https://gist.github.com/mfleming/167b00bef7e1f4e686a6d32833c42079
[2] https://gist.github.com/mfleming/e31c86d3ab0a883e9053e19010150a13

A second node showed the same pattern: 18% anon scan efficiency vs 90%
file, no_progress_loops = 0, wmark always passes.

> I could see an argument that the two checks are not properly aligned
> right now. We could be making nominal forward progress on a small,
> heavily thrashing cache position only; but we'll keep looping because,
> well, look at all this anon memory! (Which isn't being reclaimed.)
>
> If that's the case, a better solution might be to split
> did_some_progress into anon and file progress, and only consider the
> LRU pages for which reclaim is actually making headway. And ignore
> those where we fail to succeed - for whatever reason, really, not just
> this particular zram situation.

Right. The mm_vmscan_lru_shrink_inactive tracepoint shows the anon LRU
being scanned aggressively at priority 1-3, but only 21% of scanned
pages are reclaimed. Meanwhile file reclaim runs at 78-90% efficiency
but there aren't enough file pages to satisfy the allocation.
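
(For clarity, the "efficiency" I keep quoting is just
reclaimed/scanned over the tracepoint totals; the rounded totals give
~79% for file, the 78% above comes from the exact counts:)

```python
# Scan efficiency = pages reclaimed / pages scanned, from the
# mm_vmscan_lru_shrink_inactive totals in the 5-minute session.
def efficiency(reclaimed, scanned):
    return 100.0 * reclaimed / scanned

print(round(efficiency(52_000_000, 244_000_000)))  # anon: ~21%
print(round(efficiency(33_000_000, 42_000_000)))   # file: ~79%
```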

> And if that isn't enough, maybe pass did_some_progress as the actual
> page counts instead of a bool, and only consider an LRU type
> reclaimable if the last scan cycle reclaimed at least N% of it.

Nice idea. I'll work on a patch.
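
Something along these lines, maybe (the function name and the 5%
threshold below are placeholders for illustration, not actual kernel
code):

```python
# Sketch of the per-LRU progress idea: track (scanned, reclaimed) per
# LRU type and only count an LRU as still-reclaimable when the last
# scan cycle recovered at least N% of what it scanned.

MIN_EFFICIENCY_PCT = 5  # placeholder for N%

def lru_still_reclaimable(scanned, reclaimed, min_pct=MIN_EFFICIENCY_PCT):
    if scanned == 0:
        return False  # nothing scanned, no evidence of progress
    return 100 * reclaimed >= min_pct * scanned

# With the numbers above: file reclaim at 78% clearly counts, while an
# anon LRU stuck round-tripping through zram at ~0% would be ignored
# instead of resetting no_progress_loops.
print(lru_still_reclaimable(42_000_000, 33_000_000))  # True
print(lru_still_reclaimable(244_000_000, 1_000))      # False
```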

Thanks,
Matt


