linux-mm.kvack.org archive mirror
From: Matt Fleming <matt@readmodwrite.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>, Minchan Kim <minchan@kernel.org>,
	Sergey Senozhatsky <senozhatsky@chromium.org>,
	Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
	Barry Song <baohua@kernel.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Brendan Jackman <jackmanb@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kernel-team@cloudflare.com,
	Matt Fleming <mfleming@cloudflare.com>
Subject: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Date: Tue,  3 Mar 2026 11:53:57 +0000	[thread overview]
Message-ID: <20260303115358.1323188-1-matt@readmodwrite.com> (raw)

From: Matt Fleming <mfleming@cloudflare.com>

Hi,

Systems with zram-only swap can spin in direct reclaim for 20-30
minutes without ever invoking the OOM killer. We've hit this repeatedly
in production on machines with 377 GiB RAM and a 377 GiB zram device.

The problem
-----------

should_reclaim_retry() calls zone_reclaimable_pages() to estimate how
much memory is still reclaimable. That estimate includes anonymous
pages, on the assumption that swapping them out frees physical pages.

With disk-backed swap, that's true -- writing a page to disk frees a
page of RAM, and SwapFree accurately reflects how many more pages can
be written. With zram, the free slot count is misleading. A 377 GiB
zram device with 10% of its slots used still reports ~340 GiB of free
swap, but filling those slots requires physical RAM that the system
doesn't have -- that's why it's in direct reclaim in the first place.

The reclaimable estimate is off by orders of magnitude.

The fix
-------

This patch introduces two new flags: BLK_FEAT_RAM_BACKED at the block
layer (set by zram and brd) and SWP_RAM_BACKED at the swap layer. When
all active swap devices are RAM-backed, should_reclaim_retry() excludes
anonymous pages from the reclaimable estimate and counts only
file-backed pages. Once file pages are exhausted the watermark check
fails and the kernel falls through to OOM.

Opting to OOM kill something rather than spinning in direct reclaim
optimises for Mean Time To Recovery (MTTR) and prevents "brownout"
situations where performance is degraded for prolonged periods (we've
seen 20-30 minutes of degraded system performance).

Design choices and known limitations
------------------------------------

Why not fix zone_reclaimable_pages() globally?

  Other callers (e.g. balance_pgdat() in kswapd) use the anon-inclusive
  count for different purposes. Changing it globally risks breaking
  kswapd's reclaim decisions in ways that are hard to test. Limiting
  the change to should_reclaim_retry() keeps the blast radius small and
  squarely in the direct reclaim path.

What about mixed swap configurations (zram + disk)?

  When at least one disk-backed swap device is active,
  swap_all_ram_backed is false and the current behaviour is preserved.
  Per-device reclaimable accounting is possible but it's a much larger
  change, and mixed zram+disk configurations are uncommon in practice
  AFAIK.

Can we make zram free space accounting more accurate?

  This is possible but probably the most complicated solution. Swap
  device drivers could provide a callback which RAM-backed drivers
  would use to estimate how much physical memory they could store,
  given some average compression ratio (either historic, or projected
  from a list of anon pages to swap) and the amount of free physical
  memory. The estimate also wouldn't be constant: it would change on
  every invocation of the callback, in line with the current
  compression ratio and the amount of free memory.

Build-testing
-------------

Built with defconfig, allnoconfig, allmodconfig, and multiple
randconfig iterations on x86_64 / 7.0-rc2.

Matt Fleming (1):
  mm: Reduce direct reclaim stalls with RAM-backed swap

 drivers/block/brd.c           |  3 ++-
 drivers/block/zram/zram_drv.c |  3 ++-
 include/linux/blkdev.h        |  8 ++++++
 include/linux/swap.h          |  9 +++++++
 mm/page_alloc.c               | 23 ++++++++++++++++-
 mm/swapfile.c                 | 47 ++++++++++++++++++++++++++++++++++-
 6 files changed, 89 insertions(+), 4 deletions(-)

-- 
2.43.0


Thread overview: 4+ messages
2026-03-03 11:53 Matt Fleming [this message]
2026-03-03 11:53 ` [RFC PATCH 1/1] " Matt Fleming
2026-03-03 14:10   ` Christoph Hellwig
2026-03-03 14:59 ` [RFC PATCH 0/1] " Shakeel Butt
