* [LSF/MM/BPF TOPIC] Flash Friendly Swap
@ 2026-02-18 12:46 YoungJun Park
  2026-02-20 16:22 ` Christoph Hellwig
From: YoungJun Park @ 2026-02-18 12:46 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, youngjun.park, chrisl

Hello,

I would like to propose a session on NAND flash friendly swap layout.
Similar to how F2FS is designed as a flash friendly file system, the
goal is to make the swap subsystem write data to NAND flash devices
in a way that causes less wear.

We have been working on this problem in production embedded systems
and have built an out-of-tree solution with RAM buffering, sequential
writeback, and deduplication. I would like to discuss the upstream
path for these capabilities.

Background & Motivation:

We ship embedded products built on eMMC-based NAND flash and have been
dealing with memory pressure that demands aggressive swapping for years.
The limited P/E cycle endurance of NAND flash makes naive swap usage a
reliability risk -- swap I/O is random, small, and frequent, which is
the worst-case pattern for write amplification. Even with an FTL in the
eMMC controller, random writes from the swap layer still cause
additional WAF through internal garbage collection. Buffering and
reordering writes into sequential streams can complement the FTL and
reduce WAF.

Our team has published prior work on this problem[1], covering
techniques such as compression, RAM buffering with sequential writeback,
and flash-aware block management. Based on that work, we built an
internal solution and are now looking at how to bring these capabilities
upstream.

Current Implementation:

The current implementation is a standalone block device driver that
sits between the swap layer and flash storage (a simplified sketch of
the write path follows the list):

 1. RAM swap buffer: A kernel thread accumulates swap-out pages and
    flushes them to flash as sequential I/O at controlled intervals.

 2. Management layer: Mapping between swap slots and physical flash
    locations, with wear-aware allocation and writeback scheduling.

 3. Deduplication: Content-hash-based dedup before writing to flash --
    swap workloads often contain many zero-filled or duplicate pages.
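
A heavily simplified userspace model of items 1 and 2 above (the real
code is a kernel block driver; the batch size, table sizes, and names
here are made up for illustration, and error handling is mostly
omitted):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE    4096
#define BATCH_PAGES  512                        /* flush unit, e.g. ~2 MiB */
#define MAX_SLOTS    (1u << 20)

static uint64_t slot_to_flash[MAX_SLOTS];       /* swap slot -> flash byte offset */
static char     staging[BATCH_PAGES * PAGE_SIZE];
static unsigned staged_slot[BATCH_PAGES];
static unsigned nr_staged;
static off_t    write_head;                     /* next sequential flash offset */

/* Flush the whole staging buffer as one large sequential write. */
static void flush_staging(int flash_fd)
{
        if (!nr_staged)
                return;
        if (pwrite(flash_fd, staging, (size_t)nr_staged * PAGE_SIZE, write_head) < 0)
                return;                         /* error handling omitted */
        for (unsigned i = 0; i < nr_staged; i++)
                slot_to_flash[staged_slot[i]] = write_head + (uint64_t)i * PAGE_SIZE;
        write_head += (off_t)nr_staged * PAGE_SIZE;
        nr_staged = 0;
}

/* Swap-out entry point: stage the page; flush once a full batch is
 * collected (the real driver also flushes from a kernel thread at
 * controlled intervals). */
static void swap_out_page(int flash_fd, unsigned slot, const void *page)
{
        memcpy(staging + (size_t)nr_staged * PAGE_SIZE, page, PAGE_SIZE);
        staged_slot[nr_staged++] = slot;
        if (nr_staged == BATCH_PAGES)
                flush_staging(flash_fd);
}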

This works, but as a standalone block device it sits outside mainline
infrastructure. I am seeking feedback on how to upstream this.

Discussion:

I would like to discuss the following topics:

- Flash friendly swap I/O:

  For flash-backed swap, writing sequentially and respecting erase
  block boundaries can reduce WAF. What could the swap subsystem do
  to better support flash devices?
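
  As one trivial illustration of "respecting erase block boundaries",
  an allocator could start each new write stream on an erase block
  boundary (sketch only; the erase block size below is an assumption,
  since devices rarely report it):

#define ERASE_BLOCK_BYTES (4UL << 20)   /* assumed 4 MiB erase block */

/* Round a flash offset up to the next erase block boundary before
 * starting a new write stream, so a stream never straddles two
 * erase blocks. */
static unsigned long align_to_erase_block(unsigned long offset)
{
        return (offset + ERASE_BLOCK_BYTES - 1) & ~(ERASE_BLOCK_BYTES - 1);
}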

- Deduplication in the swap layer:

  Swap workloads often contain many zero-filled or duplicate pages.
  Should dedup be a swap-layer feature rather than reimplemented
  per-backend?
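
  To illustrate the kind of dedup we mean, here is a toy userspace
  model built around a content hash table (FNV-1a is used purely for
  illustration; a real implementation would verify full page contents
  on a hash match and reference-count entries for frees):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE   4096
#define DEDUP_SLOTS (1u << 16)

struct dedup_entry {
        uint64_t hash;
        unsigned owner_slot;            /* slot that already holds this content */
        int      used;
};

static struct dedup_entry dedup_table[DEDUP_SLOTS];

/* FNV-1a over the page contents. */
static uint64_t page_hash(const unsigned char *p, size_t len)
{
        uint64_t h = 0xcbf29ce484222325ULL;

        while (len--)
                h = (h ^ *p++) * 0x100000001b3ULL;
        return h;
}

/* Returns the slot that already holds identical content, or -1 if the
 * page is new and must really be written out. */
static int dedup_lookup_or_insert(unsigned slot, const void *page)
{
        uint64_t h = page_hash(page, PAGE_SIZE);
        struct dedup_entry *e = &dedup_table[h % DEDUP_SLOTS];

        if (e->used && e->hash == h)
                return (int)e->owner_slot;      /* duplicate: map slot to existing copy */

        e->hash = h;
        e->owner_slot = slot;
        e->used = 1;
        return -1;
}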

- Extending zram/zswap writeback with flash awareness:

  zram supports a backing device (CONFIG_ZRAM_WRITEBACK) for writing
  idle/incompressible pages to persistent storage, and zswap sits in
  front of swap devices with its own writeback path. Could these be
  extended with sequential writeback batching, deduplication, and
  flash-aware allocation? Our implementation buffers swap-out pages
  in RAM before flushing to flash -- this is conceptually similar to
  zswap + writeback, but we found the current writeback path
  insufficient for our needs because it still issues per-page random
  writes without awareness of flash erase block boundaries. I would
  like to discuss what gaps remain and whether extending zswap/zram
  writeback is the right upstream path.

- Swap abstraction layer:

  Recent discussions on reworking the swap subsystem[2][3] aim to
  decouple the swap core from its tight binding to swap offsets and
  block devices. If such a layer materializes, it could provide
  extension points for pluggable swap backends with device-specific
  write strategies. I would like to hear the community's view on
  whether this direction could also serve flash friendly swap needs.
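
  Purely as a strawman of what such an extension point might look
  like (no such interface exists upstream today; the names below are
  invented):

/* Strawman only, shown as kernel-style C to sketch where a flash
 * friendly backend could plug in. */
struct swap_backend;                    /* backend-private state */

struct swap_backend_ops {
        int  (*store)(struct swap_backend *be, pgoff_t slot,
                      struct folio *folio);
        int  (*load)(struct swap_backend *be, pgoff_t slot,
                     struct folio *folio);
        void (*invalidate)(struct swap_backend *be, pgoff_t slot);
        /* hint so the core could batch writeback into device-friendly units */
        unsigned int preferred_batch_pages;
};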

Comments or suggestions are welcome.

[1] https://ieeexplore.ieee.org/document/8662047
[2] https://lwn.net/Articles/932077/
[3] https://lwn.net/Articles/974587/



* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
  2026-02-18 12:46 [LSF/MM/BPF TOPIC] Flash Friendly Swap YoungJun Park
@ 2026-02-20 16:22 ` Christoph Hellwig
  2026-02-20 23:47   ` Chris Li
From: Christoph Hellwig @ 2026-02-20 16:22 UTC (permalink / raw)
  To: YoungJun Park; +Cc: lsf-pc, linux-mm, chrisl

Honestly, I think always writing sequentially when swapping and
reclaiming in lumps (I'd call them "zones" :)) is probably the best
idea.  Even for the now-unlikely case of swapping to HDD it
would do the right thing.  So please no conditional versions that need
opt-in or stacked block drivers; let's just fix swapping to not be
stupid.



* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
  2026-02-20 16:22 ` Christoph Hellwig
@ 2026-02-20 23:47   ` Chris Li
From: Chris Li @ 2026-02-20 23:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: YoungJun Park, lsf-pc, linux-mm

Hi Christoph,

On Fri, Feb 20, 2026 at 8:22 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Honestly, I think always writing sequentially when swapping and
> reclaiming in lumps (I'd call them "zones" :)) is probably the best
> idea.  Even for the now-unlikely case of swapping to HDD it

For a flash device with an FTL, the write location is mostly logical
anyway.  Flash devices tend to group newly written data into the same
erase block internally even when it is discontiguous from the block
device's point of view.  It is easy to write out sequentially while
the swap device is mostly empty; that is what the cluster allocator
currently does anyway.  However, the tricky part is what happens when
some random 4K blocks get swapped in: that creates holes both on the
swap device and in the internally written-out data.  Very quickly the
free clusters on the swap device all get used up and you can no longer
write out sequentially.  The FTL internally wants to GC those holes to
create large empty erase blocks.  I do see that where we pick the next
write location can have a huge impact on the flash's internal GC
behavior and write amplification factor.
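
A toy illustration of the fragmentation I mean (userspace only,
nothing to do with the real allocator): fill every slot, free ~30% at
random, then look at the longest contiguous free run that remains:

#include <stdio.h>
#include <stdlib.h>

#define NR_SLOTS 4096

int main(void)
{
        static char used[NR_SLOTS];
        int i, run = 0, longest = 0;

        for (i = 0; i < NR_SLOTS; i++)
                used[i] = 1;                    /* filled by sequential swap-out */
        for (i = 0; i < NR_SLOTS * 3 / 10; i++)
                used[rand() % NR_SLOTS] = 0;    /* random swap-ins punch holes */

        for (i = 0; i < NR_SLOTS; i++) {
                run = used[i] ? 0 : run + 1;
                if (run > longest)
                        longest = run;
        }
        printf("longest contiguous free run: %d of %d slots\n",
               longest, NR_SLOTS);
        return 0;
}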

> would do the right thing.  So please no conditional versions that need

I agree.  There is another LSF/MM topic related to this: pluggable
swap ops backends.  I hope that flash friendly swap can be implemented
as a backend of that framework, without conditional versions or
stacked block drivers.

https://lore.kernel.org/linux-mm/aZiFvzlBJiYBUDre@MiWiFi-R3L-srv/

BTW, I am very interested in this topic and want to participate in the
discussion.

> opt-in or stacked block drivers; let's just fix swapping to not be
> stupid.

With the cluster allocator and the recent swap table changes, swap is
a lot better now than it was before.  Anyway, feedback on the core
swap stack is always welcome.

Chris


