* [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: YoungJun Park @ 2026-02-18 12:46 UTC
To: lsf-pc; +Cc: linux-mm, youngjun.park, chrisl
Hello,
I would like to propose a session on a NAND-flash-friendly swap
layout. Similar to how F2FS is designed as a flash-friendly file
system, the goal is to make the swap subsystem write data to NAND
flash devices in a way that causes less wear.
We have been working on this problem in production embedded systems
and have built an out-of-tree solution with RAM buffering, sequential
writeback, and deduplication. I would like to discuss the upstream
path for these capabilities.
Background & Motivation:
We ship embedded products built on eMMC-based NAND flash and have been
dealing with memory pressure that demands aggressive swapping for years.
The limited P/E cycle endurance of NAND flash makes naive swap usage a
reliability risk -- swap I/O is random, small, and frequent, which is
the worst-case pattern for write amplification. Even with an FTL in the
eMMC controller, random writes from the swap layer still inflate the
write amplification factor (WAF) through internal garbage collection.
Buffering and reordering writes into sequential streams can complement
the FTL and reduce WAF.
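To make the wear argument concrete, here is a deliberately crude model (mine, not from our implementation; the erase-block size and the worst-case read-modify-write assumption are purely illustrative) comparing flash pages physically written per host page for scattered 4K overwrites versus batched sequential writes:

```python
import math

ERASE_BLOCK_PAGES = 64  # pages per erase block (illustrative value)

def waf_random(n_pages):
    """Worst case: every 4K overwrite forces a read-modify-write of a
    whole erase block, so the device writes ERASE_BLOCK_PAGES flash
    pages per page of host data."""
    return (n_pages * ERASE_BLOCK_PAGES) / n_pages

def waf_sequential(n_pages):
    """Batched sequential writes fill whole erase blocks, so flash
    writes roughly equal host writes (WAF ~ 1), modulo a partially
    filled final block."""
    blocks = math.ceil(n_pages / ERASE_BLOCK_PAGES)
    return (blocks * ERASE_BLOCK_PAGES) / n_pages
```

Real FTLs page-map writes and amortize GC, so the true gap is smaller
than this worst case, but the direction is the same.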
Our team has published prior work on this problem[1], covering
techniques such as compression, RAM buffering with sequential writeback,
and flash-aware block management. Based on that work, we built an
internal solution and are now looking at how to bring these capabilities
upstream.
Current Implementation:
The current implementation is a standalone block device driver that
sits between the swap layer and flash storage:
1. RAM swap buffer: A kernel thread accumulates swap-out pages and
flushes them to flash as sequential I/O at controlled intervals.
2. Management layer: Mapping between swap slots and physical flash
locations, with wear-aware allocation and writeback scheduling.
3. Deduplication: Content-hash-based dedup before writing to flash --
swap workloads often contain many zero-filled or duplicate pages.
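As a sketch of point 3 (the names and structure here are mine for illustration, not the driver's actual data structures), a content-hash dedup table in front of the flash write path might look like:

```python
import hashlib

class DedupStore:
    """Content-hash dedup for swap-out pages (illustrative sketch).

    Maps swap slot -> content hash, and content hash -> (refcount,
    page data).  A real implementation must byte-compare pages on a
    hash match to guard against collisions; this sketch assumes
    SHA-256 collisions do not occur.
    """
    def __init__(self):
        self.slot_to_hash = {}
        self.hash_to_entry = {}   # hash -> [refcount, page bytes]
        self.flash_writes = 0     # only unique content reaches flash

    def swap_out(self, slot, page: bytes):
        h = hashlib.sha256(page).digest()
        entry = self.hash_to_entry.get(h)
        if entry is None:
            self.hash_to_entry[h] = [1, page]
            self.flash_writes += 1  # new content: one flash write
        else:
            entry[0] += 1           # duplicate: just take a reference
        self.slot_to_hash[slot] = h

    def swap_in(self, slot) -> bytes:
        return self.hash_to_entry[self.slot_to_hash[slot]][1]
```

With this structure, swapping out ten zero-filled pages costs one
flash write instead of ten.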
This works, but as a standalone block device it sits outside mainline
infrastructure. I am seeking feedback on how to upstream this.
Discussion:
I would like to discuss the following topics:
- Flash friendly swap I/O:
For flash-backed swap, writing sequentially and respecting erase
block boundaries can reduce WAF. What could the swap subsystem do
to better support flash devices?
- Deduplication in the swap layer:
Swap workloads often contain many zero-filled or duplicate pages.
Should dedup be a swap-layer feature rather than reimplemented
per-backend?
- Extending zram/zswap writeback with flash awareness:
zram supports a backing device (CONFIG_ZRAM_WRITEBACK) for writing
idle/incompressible pages to persistent storage, and zswap sits in
front of swap devices with its own writeback path. Could these be
extended with sequential writeback batching, deduplication, and
flash-aware allocation? Our implementation buffers swap-out pages
in RAM before flushing to flash -- this is conceptually similar to
zswap + writeback, but we found the current writeback path
insufficient for our needs because it still issues per-page random
writes without awareness of flash erase block boundaries. I would
like to discuss what gaps remain and whether extending zswap/zram
writeback is the right upstream path.
- Swap abstraction layer:
Recent discussions on reworking the swap subsystem[2][3] aim to
decouple the swap core from its tight binding to swap offsets and
block devices. If such a layer materializes, it could provide
extension points for pluggable swap backends with device-specific
write strategies. I would like to hear the community's view on
whether this direction could also serve flash friendly swap needs.
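The RAM-buffering idea running through these points can be sketched as follows (a toy model, not the actual driver or the zswap/zram writeback code; the flush granularity and all names are assumptions):

```python
class SwapWriteBuffer:
    """Accumulate swap-out pages in RAM and flush them as one
    sequential, erase-block-sized write.  flush() stands in for the
    real block-layer submission."""
    def __init__(self, pages_per_erase_block=64):
        self.capacity = pages_per_erase_block
        self.pending = []   # (slot, page) pairs not yet on flash
        self.ios = []       # each entry: slots written in one I/O
        self.next_lba = 0   # next sequential write position

    def swap_out(self, slot, page):
        self.pending.append((slot, page))
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One contiguous write starting at next_lba, instead of one
        # random 4K write per page.
        self.ios.append([slot for slot, _ in self.pending])
        self.next_lba += len(self.pending)
        self.pending.clear()
```

The current zswap/zram writeback paths, by contrast, issue one I/O
per page at whatever offset the slot allocator picked.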
Comments or suggestions are welcome.
[1] https://ieeexplore.ieee.org/document/8662047
[2] https://lwn.net/Articles/932077/
[3] https://lwn.net/Articles/974587/
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: Christoph Hellwig @ 2026-02-20 16:22 UTC
To: YoungJun Park; +Cc: lsf-pc, linux-mm, chrisl

Honestly, I think always writing sequentially when swapping and
reclaiming in lumps (I'd call them "zones" :)) is probably the best
idea. Even for the (these days unlikely) case of swapping to HDD it
would do the right thing. So please no conditional version that needs
opt-in or stacked block drivers; let's just fix swapping to not be
stupid.
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: Chris Li @ 2026-02-20 23:47 UTC
To: Christoph Hellwig; +Cc: YoungJun Park, lsf-pc, linux-mm

Hi Christoph,

On Fri, Feb 20, 2026 at 8:22 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Honestly, I think always writing sequentially when swapping and
> reclaiming in lumps (I'd call them "zones" :)) is probably the best
> idea. Even for the these days unlikely case of swapping to HDD it

For a flash device with an FTL, the location of the written data is
mostly logical anyway: the device tends to group newly written data
into the same erase block internally even when the writes are
discontiguous from the block device's point of view.

It is easy to write out sequentially while the swap device is mostly
empty -- that is what the cluster allocator currently does anyway.
The tricky part is what happens when some random 4K blocks get
swapped in: that creates holes both on the swap device and in the
internally written-out data. Very quickly the free clusters on the
swap device get used up and you can no longer write out sequentially.
The FTL then internally wants to GC those holes to create a large
empty erase block. I do see that where we pick the next write
location can have a huge impact on the flash's internal GC behavior
and write amplification factor.

> would do the right thing. So please no conditional version that need

I agree. There is another LSF/MM topic related to this: pluggable
swap ops backends. I hope that the flash friendly swap layout can
implement the swap ops framework without conditional versioning or
stacked block drivers.

https://lore.kernel.org/linux-mm/aZiFvzlBJiYBUDre@MiWiFi-R3L-srv/

BTW, I am very interested in this topic and want to participate in
the discussion.

> opt-in or stacked block drivers, let's just fix swapping to not be
> stupid.

With the cluster allocator and the recent swap table change, swap is
a lot better than it was before. Anyway, feedback on the core swap
stack is always welcome.

Chris
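[Editorial sketch] The hole/compaction dynamic described above can be illustrated with a toy append-only allocator (purely illustrative, not kernel code; `moved` counts the rewrites that compaction, or equivalently FTL GC, must pay):

```python
class LogSwapDevice:
    """Append-only swap-slot allocator sketching the hole problem.

    Slots are allocated strictly sequentially; freeing a slot leaves
    a hole.  When the write head reaches the end of the device, the
    only way to keep writing sequentially is to compact live slots
    into a fresh region -- the moved pages are the extra writes a
    log-structured swap layer (or the FTL) has to absorb.
    """
    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.live = [False] * n_slots
        self.head = 0
        self.moved = 0   # pages rewritten by compaction

    def alloc(self):
        if self.head == self.n_slots:
            self.compact()
        slot = self.head
        self.live[slot] = True
        self.head += 1
        return slot

    def free(self, slot):
        self.live[slot] = False   # leaves a hole in the log

    def compact(self):
        # Rewrite live slots to the front, reclaiming the holes.
        survivors = sum(self.live)
        self.moved += survivors
        self.live = [True] * survivors + \
                    [False] * (self.n_slots - survivors)
        self.head = survivors
```

Freeing 3 of 8 slots and allocating once more forces the 5 survivors
to be rewritten, which is exactly the GC cost being discussed.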
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: Christoph Hellwig @ 2026-02-23 13:23 UTC
To: Chris Li; +Cc: YoungJun Park, lsf-pc, linux-mm

On Fri, Feb 20, 2026 at 03:47:18PM -0800, Chris Li wrote:
> For the flash device with FTL, the location of the data written is
> most likely logical anyway. The flash devices tend to group the new
> data internally to the same erase block together even when they are
> discontinuous from the block device point of view.

Yes, but that's not the point..

> It is easy to write out sequentially when the swap device is mostly
> empty. That is how the cluster allocator does currently any way.
> However, the tricky part is what when some random 4K blocks get
> swapped in, that will create holes on both the swap device and
> internal write out data. [...] I do see where to pick up the next
> write location can have a huge impact on the flash internal GC
> behavior and write amplification factor.

And that is the point. The FTL will always do a bad job with these
workloads. You should not do overwrites, and can do much better
optimizations in the MM based on that. I'm pretty sure YoungJun can
explain all what they did.

> I agree. There is another LSF/MM topic related to this, the plug-able
> swap ops backends. I hope that the flash friendly swap layout can
> implement the swap ops framework without conditional versioning nor
> stack block drivers.
>
> https://lore.kernel.org/linux-mm/aZiFvzlBJiYBUDre@MiWiFi-R3L-srv/

And it should just be the default block device (and maybe file
backed) swap.
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: Chris Li @ 2026-02-23 18:15 UTC
To: Christoph Hellwig; +Cc: YoungJun Park, lsf-pc, linux-mm

On Mon, Feb 23, 2026 at 5:23 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> And that is the point. The FTL will always do a bad job with these
> work loads. You should not do overwrites, and can do much better
> optimizations in the MM based on that.

I am not sure I understand "You should not do overwrites". Can you
help clarify it for me? Say we always prefer to write to new clusters
while some swap entries have been freed. What happens when we run out
of new clusters to write to? Wouldn't we be forced to overwrite the
previously freed swap locations? It seems to me the "overwrite" is
unavoidable if you keep swapping in and out. That is the part I am
missing.

I think if we gain insight into the FTL GC behavior, we can design
swap to be more friendly to flash. That is why it would be great to
have the flash storage vendors provide some input. Some of the
optimizations might be vendor specific as well.

> And it should just be the default block device (and maybe file backed)
> swap.

Let me repeat what you are saying just to make sure I get it right:
you want the flash-friendly swap to be the default block device swap.
How about zswap and zram swap? The FTL shouldn't play a major role in
selecting the swap location for them, so I am not sure it would be
the right default there.

Chris
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: Pedro Falcato @ 2026-02-23 18:53 UTC
To: Chris Li; +Cc: Christoph Hellwig, YoungJun Park, lsf-pc, linux-mm

On Mon, Feb 23, 2026 at 10:15:14AM -0800, Chris Li wrote:
> I am not sure I understand "You should not do overwrites". Can you
> help clarify it for me? [...] It seems to me the "overwrite" is
> un-avoidable if you keep swapping in and out. That is the part I am
> missing.

See log-structured filesystems. I suspect that's close to what we
want for flash storage swap.

Also, FWIW: the cloud vendors have fake SSDs that, while they have
negligible seek latency, have extremely low IOPS limits (e.g. AWS
gp2 can do 100 IOPS at its base setting and scales up to 16K IOPS;
gp3 starts at 3000 and goes up to 80K at the maximum size). I
suspect swapping on these is a huge slog, and we would also like to
write out as much sequentially as we can here (though I hope no one
is *actually* swapping on these things). Also mechanical drives --
log-structured filesystems were originally invented for those too :)

-- 
Pedro
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: YoungJun Park @ 2026-02-24 2:24 UTC
To: Pedro Falcato
Cc: Chris Li, Christoph Hellwig, lsf-pc, linux-mm, nphamcs, bhe, taejoon.song, youngjun.park

On Mon, Feb 23, 2026 at 06:53:12PM +0000, Pedro Falcato wrote:
> See log-structured fileystems. I suspect that's close to what we
> want for flash storage swap. [...] Also mechanical drives.
> Log-structured filesystems were originally invented for these too :)

+CC Nhat Pham, He Baoquan, Taejoon

Hi Pedro,

The motivation is indeed similar to that of log-structured
filesystems, and our solution employs a similar management mechanism.
That is why I thought a filesystem-like management style might be
necessary at the swap layer as well (the swap abstraction layer
mentioned in the proposal document).

Previously, the direction for upstreaming our solution was somewhat
ambiguous, so we have been maintaining it privately for several
years. Now, however, I would like to discuss how to proceed with
upstreaming in the context of Baoquan's "swap_ops and pluggable swap
backend"
(https://lore.kernel.org/linux-mm/aZiFvzlBJiYBUDre@MiWiFi-R3L-srv/)
and Nhat's "Virtual Swap Space"
(https://lore.kernel.org/linux-mm/20260208215839.87595-1-nphamcs@gmail.com/).

Best regards,
Youngjun Park
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: YoungJun Park @ 2026-02-24 4:02 UTC
To: Pedro Falcato
Cc: Chris Li, Christoph Hellwig, lsf-pc, linux-mm, nphamcs, bhe, taejoon.song, ryncsn

+CC Kairui

Oops, I missed adding the discussion involving Kairui (CC'd). This
direction is also currently being discussed:

https://lore.kernel.org/linux-mm/CAMgjq7D6n0H2=di0SrMQbJ48cVeKhGeQMH_mY0y-au4OJbE2GQ@mail.gmail.com/T/#m2feb4489b29075136169ff3efd28dc365062f66a

I hope our proposal can be considered alongside, or aligned with,
these ongoing discussions.
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: YoungJun Park @ 2026-02-24 2:15 UTC
To: Chris Li; +Cc: Christoph Hellwig, lsf-pc, linux-mm, taejoon.song

On Mon, Feb 23, 2026 at 10:15:14AM -0800, Chris Li wrote:
> I am not sure I understand "You should not do overwrites". [...]
> It seems to me the "overwrite" is un-avoidable if you keep swapping
> in and out. That is the part I am missing.
>
> I think if we gain insight into the FTL GC behavior, we can design
> swap to be more friendly to flash. That is why it is great to have
> the flash storage vendor provide some inputs. Some of the
> optimization might be vendor specific as well.

+CC taejoon.song@lge.com

That is correct. Eventually, holes appear as clusters are freed, and
if we only write sequentially, the free clusters will be exhausted.
Therefore, compaction is inevitable. There may be various factors to
consider, such as FTL GC behavior, in optimizing this process.

> Let me repeat what you are saying just to make sure I get it right.
> You want the flash-friendly swap to be the default block device swap.
> How about zswap and zram swap?

I would also like to hear more about this.

Best regards,
Youngjun Park
* Re: [LSF/MM/BPF TOPIC] Flash Friendly Swap
From: YoungJun Park @ 2026-02-24 2:08 UTC
To: Christoph Hellwig; +Cc: Chris Li, lsf-pc, linux-mm, taejoon.song

On Mon, Feb 23, 2026 at 05:23:00AM -0800, Christoph Hellwig wrote:
> And that is the point. The FTL will always do a bad job with these
> work loads. You should not do overwrites, and can do much better
> optimizations in the MM based on that. I'm pretty sure YoungJun can
> explain all what they did.

+CC taejoon.song@lge.com

Yes, relying solely on random I/O handled by the FTL does not
optimize for the device's lifespan; sequential writes at the OS
layer are what enable that optimization. However, if lifespan is
truly the critical factor, deduplication is also needed to reduce
the write volume itself. This is another key aspect of the proposal.

Regarding Chris's point that overwrites are inevitable, I will
address that in a separate reply to Chris's email.

Best regards,
Youngjun Park
Thread overview (10 messages, newest: 2026-02-24 4:02 UTC):
2026-02-18 12:46 [LSF/MM/BPF TOPIC] Flash Friendly Swap, YoungJun Park
2026-02-20 16:22 ` Christoph Hellwig
2026-02-20 23:47   ` Chris Li
2026-02-23 13:23     ` Christoph Hellwig
2026-02-23 18:15       ` Chris Li
2026-02-23 18:53         ` Pedro Falcato
2026-02-24  2:24           ` YoungJun Park
2026-02-24  4:02             ` YoungJun Park
2026-02-24  2:15         ` YoungJun Park
2026-02-24  2:08       ` YoungJun Park