[RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
@ 2026-03-03 11:53 Matt Fleming
  2026-03-03 11:53 ` [RFC PATCH 1/1] " Matt Fleming
  2026-03-03 14:59 ` [RFC PATCH 0/1] " Shakeel Butt
  0 siblings, 2 replies; 5+ messages in thread
From: Matt Fleming @ 2026-03-03 11:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-block,
	linux-kernel, linux-mm, kernel-team, Matt Fleming

From: Matt Fleming <mfleming@cloudflare.com>

Hi,

Systems with zram-only swap can spin in direct reclaim for 20-30
minutes without ever invoking the OOM killer. We've hit this repeatedly
in production on machines with 377 GiB RAM and a 377 GiB zram device.

The problem
-----------

should_reclaim_retry() calls zone_reclaimable_pages() to estimate how
much memory is still reclaimable. That estimate includes anonymous
pages, on the assumption that swapping them out frees physical pages.

With disk-backed swap, that's true -- writing a page to disk frees a
page of RAM, and SwapFree accurately reflects how many more pages can
be written. With zram, the free slot count is inaccurate. A 377 GiB
zram device with 10% used reports ~340 GiB of free swap slots, but
filling those slots requires physical RAM that the system doesn't have
-- that's why it's in direct reclaim in the first place.

The reclaimable estimate is off by orders of magnitude.

The fix
-------

This patch introduces two new flags: BLK_FEAT_RAM_BACKED at the block
layer (set by zram and brd) and SWP_RAM_BACKED at the swap layer. When
all active swap devices are RAM-backed, should_reclaim_retry() excludes
anonymous pages from the reclaimable estimate and counts only
file-backed pages. Once file pages are exhausted the watermark check
fails and the kernel falls through to OOM.

Opting to OOM kill something over spinning in direct reclaim optimises
for Mean Time To Recovery (MTTR) and prevents "brownout" situations
where performance is degraded for prolonged periods (we've seen 20-30
minutes degraded system performance).

Design choices and known limitations
-------------------------------------

Why not fix zone_reclaimable_pages() globally?

  Other callers (e.g. balance_pgdat() in kswapd) use the anon-inclusive
  count for different purposes. Changing it globally risks breaking
  kswapd's reclaim decisions in ways that are hard to test. Limiting
  the change to should_reclaim_retry() keeps the blast radius small and
  squarely in the direct reclaim path.

What about mixed swap configurations (zram + disk)?

  When at least one disk-backed swap device is active,
  swap_all_ram_backed is false and the current behaviour is preserved.
  Per-device reclaimable accounting is possible but it's a much larger
  change, and mixed zram+disk configurations are uncommon in practice
  AFAIK.

Can we make zram free space accounting more accurate?

  This is possible but probably the most complicated solution. Swap
  device drivers could provide a callback which RAM-backed drivers
  would use to estimate how much physical memory they could store given
  some average compression ratio (either historic or projected given a
  list of anon pages to swap) and the amount of free physical memory.
  Plus, this wouldn't be constant and would change on every invocation
  of the callback inline with the current compression ratio and the
  amount of free memory.

Build-testing
-------------

Built with defconfig, allnoconfig, allmodconfig, and multiple
randconfig iterations on x86_64 / 7.0-rc2.

Matt Fleming (1):
  mm: Reduce direct reclaim stalls with RAM-backed swap

 drivers/block/brd.c           |  3 ++-
 drivers/block/zram/zram_drv.c |  3 ++-
 include/linux/blkdev.h        |  8 ++++++
 include/linux/swap.h          |  9 +++++++
 mm/page_alloc.c               | 23 ++++++++++++++++-
 mm/swapfile.c                 | 47 ++++++++++++++++++++++++++++++++++-
 6 files changed, 89 insertions(+), 4 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH 1/1] mm: Reduce direct reclaim stalls with RAM-backed swap
  2026-03-03 11:53 [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap Matt Fleming
@ 2026-03-03 11:53 ` Matt Fleming
  2026-03-03 14:10   ` Christoph Hellwig
  2026-03-03 14:59 ` [RFC PATCH 0/1] " Shakeel Butt
  1 sibling, 1 reply; 5+ messages in thread
From: Matt Fleming @ 2026-03-03 11:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-block,
	linux-kernel, linux-mm, kernel-team, Matt Fleming

From: Matt Fleming <mfleming@cloudflare.com>

The current should_reclaim_retry() code does not account for the fact
the number of logical swap pages available for RAM-backed swap (zram,
brd) is dependent on having enough free physical pages, and simply
always assumes that enough pages are reclaimable to satisfy the
allocation.

For instance, given a system with a 200GiB zram device (10% used) and
100MB of free physical pages, should_reclaim_retry() incorrectly
concludes that it can swap 180GiB worth of anon pages to swap.

Because it appears to be always possible to write to swap, the OOM
killer is delayed and the system retries in direct reclaim for
prolonged periods (20-30 minutes observed in production).

Fix this by excluding anon pages from the reclaimable estimate when all
active swap devices are RAM-backed. Once file-backed pages are exhausted
the watermark check fails and the kernel falls through to OOM as
expected.

To identify RAM-backed swap devices at swapon time, introduce
BLK_FEAT_RAM_BACKED (set by zram and brd) and SWP_RAM_BACKED
(swapfile.c). A cached bool swap_all_ram_backed is maintained under
swap_lock by swap_update_all_ram_backed() during swapon/swapoff, which
is locklessly accessed in should_reclaim_retry().

Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
---
 drivers/block/brd.c           |  3 ++-
 drivers/block/zram/zram_drv.c |  3 ++-
 include/linux/blkdev.h        |  8 ++++++
 include/linux/swap.h          |  9 +++++++
 mm/page_alloc.c               | 23 ++++++++++++++++-
 mm/swapfile.c                 | 47 ++++++++++++++++++++++++++++++++++-
 6 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 00cc8122068f..c021dd51ff0a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -310,7 +310,8 @@ static int brd_alloc(int i)
 		.max_discard_segments	= 1,
 		.discard_granularity	= PAGE_SIZE,
 		.features		= BLK_FEAT_SYNCHRONOUS |
-					  BLK_FEAT_NOWAIT,
+					  BLK_FEAT_NOWAIT |
+					  BLK_FEAT_RAM_BACKED,
 	};
 
 	brd = brd_find_or_alloc_device(i);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bca33403fc8b..8075bab39e62 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -3074,7 +3074,8 @@ static int zram_add(void)
 		.max_write_zeroes_sectors	= UINT_MAX,
 #endif
 		.features			= BLK_FEAT_STABLE_WRITES |
-						  BLK_FEAT_SYNCHRONOUS,
+						  BLK_FEAT_SYNCHRONOUS |
+						  BLK_FEAT_RAM_BACKED,
 	};
 	struct zram *zram;
 	int ret, device_id;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d463b9b5a0a5..3666837e8774 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -334,6 +334,9 @@ typedef unsigned int __bitwise blk_features_t;
 /* is a zoned device */
 #define BLK_FEAT_ZONED			((__force blk_features_t)(1u << 10))
 
+/* storage is backed by system RAM (e.g. zram, brd) */
+#define BLK_FEAT_RAM_BACKED		((__force blk_features_t)(1u << 11))
+
 /* supports PCI(e) p2p requests */
 #define BLK_FEAT_PCI_P2PDMA		((__force blk_features_t)(1u << 12))
 
@@ -1477,6 +1480,11 @@ static inline bool bdev_synchronous(struct block_device *bdev)
 	return bdev->bd_disk->queue->limits.features & BLK_FEAT_SYNCHRONOUS;
 }
 
+static inline bool bdev_ram_backed(struct block_device *bdev)
+{
+	return bdev->bd_disk->queue->limits.features & BLK_FEAT_RAM_BACKED;
+}
+
 static inline bool bdev_stable_writes(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..844727fe929c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -216,6 +216,7 @@ enum {
 	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
+	SWP_RAM_BACKED	= (1 << 13),	/* swap device uses main memory (e.g. zram) */
 					/* add others here before... */
 };
 
@@ -451,6 +452,11 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
+extern bool swap_all_ram_backed;
+static inline bool swap_is_all_ram_backed(void)
+{
+	return READ_ONCE(swap_all_ram_backed);
+}
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
@@ -508,6 +514,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
 
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
+
+static inline bool swap_is_all_ram_backed(void) { return false; }
+
 #define free_folio_and_swap_cache(folio) \
 	folio_put(folio)
 #define free_pages_and_swap_cache(pages, nr) \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d4b6f1a554e..c1a8f4620baa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -37,6 +37,7 @@
 #include <linux/vmstat.h>
 #include <linux/fault-inject.h>
 #include <linux/compaction.h>
+#include <linux/swap.h>
 #include <trace/events/kmem.h>
 #include <trace/events/oom.h>
 #include <linux/prefetch.h>
@@ -4604,6 +4605,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	struct zone *zone;
 	struct zoneref *z;
 	bool ret = false;
+	bool ram_backed_swap = swap_is_all_ram_backed();
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -4637,7 +4639,26 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			!__cpuset_zone_allowed(zone, gfp_mask))
 				continue;
 
-		available = reclaimable = zone_reclaimable_pages(zone);
+		if (ram_backed_swap) {
+			/*
+			 * Exclude anon pages when all swap is RAM-backed.
+			 * The reclaimable estimate assumes anon can be
+			 * reclaimed using free swap slots, but those slots
+			 * are only logical accounting for zram: storing the
+			 * swapped data still consumes physical pages. Free
+			 * RAM is the real limit, so counting anon inflates
+			 * 'available', keeps the watermark check passing,
+			 * and delays falling through to OOM.
+			 */
+			reclaimable =
+				zone_page_state_snapshot(zone,
+							 NR_ZONE_INACTIVE_FILE) +
+				zone_page_state_snapshot(zone,
+							 NR_ZONE_ACTIVE_FILE);
+		} else {
+			reclaimable = zone_reclaimable_pages(zone);
+		}
+		available = reclaimable;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
 		/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 94af29d1de88..18713618f35c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -64,6 +64,7 @@ static bool folio_swapcache_freeable(struct folio *folio);
 static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
 			 enum swap_cluster_flags new_flags);
+static void swap_update_all_ram_backed(void);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -74,8 +75,15 @@ atomic_long_t nr_swap_pages;
  * check to see if any swap space is available.
  */
 EXPORT_SYMBOL_GPL(nr_swap_pages);
-/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
+
+/*
+ * Updates to these globals are serialized by swap_lock.
+ * Read locklessly in vm_swap_full() (total_swap_pages) and
+ * should_reclaim_retry() (swap_all_ram_backed).
+ */
 long total_swap_pages;
+bool swap_all_ram_backed;
+
 #define DEF_SWAP_PRIO  -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
@@ -2670,6 +2678,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 
 	plist_add(&si->list, &swap_active_head);
 
+	swap_update_all_ram_backed();
+
 	/* Add back to available list */
 	add_to_avail_list(si, true);
 }
@@ -2813,6 +2823,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&p->lock);
 	del_from_avail_list(p, true);
 	plist_del(&p->list, &swap_active_head);
+	swap_update_all_ram_backed();
 	atomic_long_sub(p->pages, &nr_swap_pages);
 	total_swap_pages -= p->pages;
 	spin_unlock(&p->lock);
@@ -3460,6 +3471,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (si->bdev && bdev_synchronous(si->bdev))
 		si->flags |= SWP_SYNCHRONOUS_IO;
 
+	if (si->bdev && bdev_ram_backed(si->bdev))
+		si->flags |= SWP_RAM_BACKED;
+
 	if (si->bdev && bdev_nonrot(si->bdev)) {
 		si->flags |= SWP_SOLIDSTATE;
 	} else {
@@ -3587,6 +3601,37 @@ void si_swapinfo(struct sysinfo *val)
 	spin_unlock(&swap_lock);
 }
 
+/*
+ * Recompute swap_all_ram_backed. Must be called with swap_lock held
+ * whenever a swap device is added to or removed from swap_active_head.
+ *
+ * swap_all_ram_backed is true when every active swap device is backed
+ * by main memory (e.g. zram, brd). False if there are no swap devices
+ * configured or at least one of them is backed by disk.
+ *
+ * With RAM-backed swap, swapping out an anonymous page does not yield
+ * net free pages because the driver must allocate physical RAM to
+ * store the compressed data.
+ *
+ * See should_reclaim_retry().
+ */
+static void swap_update_all_ram_backed(void)
+{
+	struct swap_info_struct *si;
+	bool all_ram = !plist_head_empty(&swap_active_head);
+
+	assert_spin_locked(&swap_lock);
+
+	plist_for_each_entry(si, &swap_active_head, list) {
+		if (!(si->flags & SWP_RAM_BACKED)) {
+			all_ram = false;
+			break;
+		}
+	}
+
+	WRITE_ONCE(swap_all_ram_backed, all_ram);
+}
+
 /*
  * Verify that nr swap entries are valid and increment their swap map counts.
  *
-- 
2.43.0



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/1] mm: Reduce direct reclaim stalls with RAM-backed swap
  2026-03-03 11:53 ` [RFC PATCH 1/1] " Matt Fleming
@ 2026-03-03 14:10   ` Christoph Hellwig
  2026-03-03 16:59     ` Johannes Weiner
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2026-03-03 14:10 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Andrew Morton, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-block,
	linux-kernel, linux-mm, kernel-team, Matt Fleming

No way.  Stop adding hacks to the block layer just because you're
abusing a block driver for compressed swap.  Please everyone direct
their enegery to pluggable zwap backends and backing store less zswap
now instead of making the zram mess even worse.



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/1] mm: Reduce direct reclaim stalls with RAM-backed swap
  2026-03-03 14:10   ` Christoph Hellwig
@ 2026-03-03 16:59     ` Johannes Weiner
  0 siblings, 0 replies; 5+ messages in thread
From: Johannes Weiner @ 2026-03-03 16:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matt Fleming, Andrew Morton, Jens Axboe, Minchan Kim,
	Sergey Senozhatsky, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Zi Yan, linux-block, linux-kernel,
	linux-mm, kernel-team, Matt Fleming

On Tue, Mar 03, 2026 at 06:10:07AM -0800, Christoph Hellwig wrote:
> No way.  Stop adding hacks to the block layer just because you're
> abusing a block driver for compressed swap.  Please everyone direct
> their enegery to pluggable zwap backends and backing store less zswap
> now instead of making the zram mess even worse.

+1

The virtual block device idea seems simple and attractive at first
because, hey it looks just like any other swap device, and the block
surface is so much smaller than the MM surface.

But compression is a *memory* consumer. A big one. And with swap it
sits in the reclaim path. So now you have to solve intricate MM
problems with the block layer in between.

The effectiveness of compression as a reclaim strategy is also highly
variable and dependent on page contents, so the static properties of
the block device abstraction are a poor fit for the problem space.

I'd propose the following:

What keeps users in the zram camp is disk-less setups.

What keeps users in the zswap camp is reclaim-writeback, cgroup
accounting & control, and the prospect of fully dynamic sizing.

There are ongoing efforts to solve zswap's disk requirement and with
it its dependency on static address spacing.

The gap on the zram side looks wider, and more awkward to solve given
the block layer intermediary. And it's already 2.6x the SLOC of zswap.

So I fully agree. We should try to make zswap the single compressed
swap implementation. It would simplify things dramatically for kernel
developers working on MM and the swap subsystem. It would make things
better for users too.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
  2026-03-03 11:53 [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap Matt Fleming
  2026-03-03 11:53 ` [RFC PATCH 1/1] " Matt Fleming
@ 2026-03-03 14:59 ` Shakeel Butt
  1 sibling, 0 replies; 5+ messages in thread
From: Shakeel Butt @ 2026-03-03 14:59 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Andrew Morton, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-block,
	linux-kernel, linux-mm, kernel-team, Matt Fleming,
	roman.gushchin

Hi Matt,

Thanks for the report and one request I have is to avoid cover letter for a
single patch to avoid partitioning the discussion.

On Tue, Mar 03, 2026 at 11:53:57AM +0000, Matt Fleming wrote:
> From: Matt Fleming <mfleming@cloudflare.com>
> 
> Hi,
> 
> Systems with zram-only swap can spin in direct reclaim for 20-30
> minutes without ever invoking the OOM killer. We've hit this repeatedly
> in production on machines with 377 GiB RAM and a 377 GiB zram device.
> 

Have you tried zswap and if you see similar issues with zswap?

> The problem
> -----------
> 
> should_reclaim_retry() calls zone_reclaimable_pages() to estimate how
> much memory is still reclaimable. That estimate includes anonymous
> pages, on the assumption that swapping them out frees physical pages.
> 
> With disk-backed swap, that's true -- writing a page to disk frees a
> page of RAM, and SwapFree accurately reflects how many more pages can
> be written. With zram, the free slot count is inaccurate. A 377 GiB
> zram device with 10% used reports ~340 GiB of free swap slots, but
> filling those slots requires physical RAM that the system doesn't have
> -- that's why it's in direct reclaim in the first place.
> 
> The reclaimable estimate is off by orders of magnitude.
> 

Over the time we (kernel MM community) have implicitly decided to keep the
kernel oom-killer very conservative as adding more heuristics in the reclaim/oom
path makes the kernel more unreliable and punt the aggressiveness of oom-killing
to the userspace as a policy. All major Linux deployments have started using
userspace oom-killers like systemd-oomd, Android's LMKD, fb-oomd or some
internal alternatives. That provides more flexibility to define the
aggressiveness of oom-killing based on your business needs.

Though userspace oom-killers are prone to reliability issues (oom-killer getting
stuck in reclaim or not getting enough CPU), so we (Roman) are working on adding
support for BPF based oom-killer where wen think we can do oom policies more
reliably.

Anyways, I am wondering if you have tried systemd-oomd or some userspace
alternative. If you are interested in BPF oom-killer, we can help with that as
well.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-03-03 16:59 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-03 11:53 [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap Matt Fleming
2026-03-03 11:53 ` [RFC PATCH 1/1] " Matt Fleming
2026-03-03 14:10   ` Christoph Hellwig
2026-03-03 16:59     ` Johannes Weiner
2026-03-03 14:59 ` [RFC PATCH 0/1] " Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox