* [LSF/MM/BPF TOPIC] Large folio (z)swapin
@ 2025-01-09 20:06 Usama Arif
2025-01-09 21:34 ` Yosry Ahmed
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Usama Arif @ 2025-01-09 20:06 UTC (permalink / raw)
To: lsf-pc, Linux Memory Management List
Cc: Johannes Weiner, Barry Song, Yosry Ahmed, Shakeel Butt
I would like to propose a session to discuss the work going on
around large folio swapin, whether it's traditional swap,
zswap or zram.
Large folios have obvious advantages that have been discussed before,
like fewer page faults, batched PTE and rmap manipulation, a shorter
LRU list and TLB coalescing (on arm64 and AMD).
However, swapping in large folios has its own drawbacks, like higher
swap thrashing.
I had initially sent an RFC for zswapin of large folios in [1],
but it causes a kernel build time regression due to swap
thrashing, which I am confident is happening with zram large
folio swapin as well (which is merged in the kernel).
Some of the points we could discuss in the session:
- What is the right (preferably open source) benchmark to test
swapin of large folios? Kernel build time in a memory-limited
cgroup shows a regression, while microbenchmarks show a massive
improvement. Maybe there are benchmarks where TLB misses are
a big factor and which show an improvement.
- We could have something like
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
to enable/disable swapin, but it's going to be difficult to tune, might
have different optimum values for different workloads, and is likely to be
left at its default value. Is there some dynamic way to decide when
to swap in large folios and when to fall back to smaller folios?
The swapin_readahead swapcache path, which only supports 4K folios atm, has a
readahead window based on hits. However, readahead is a folio flag and
not a page flag, so this method can't be used: once a large folio
is swapped in, we won't get a fault, and subsequent hits on other
pages of the large folio won't be recorded.
- For zswap and zram, it might be that doing larger block compression/
decompression offsets the regression from swap thrashing, but it
brings its own issues. For example, once a large folio is swapped
out, it could fail to swap in as a large folio and fall back
to 4K, resulting in redundant decompressions.
Would this also mean that swapin of large folios from traditional swap
isn't something we should proceed with?
- Should we even support large folio swapin? You often have high swap
activity when the system/cgroup is close to running out of memory. At this
point, maybe the best way forward is to just swap in 4K pages and let
khugepaged [2], [3] collapse them if the surrounding pages are swapped in
as well.
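
To make the per-size knob idea concrete, here is a minimal sketch that
lists the existing per-size mTHP "enabled" settings. It only reads the
hugepages-*kB/enabled files that exist today; the swapin_enabled file
named above is a proposal and is not read here. The function name
show_mthp_enabled and its optional base-directory argument are
illustrative assumptions, not part of any kernel tooling.

```shell
#!/bin/bash
# Print "<size> <enabled-setting>" for each per-size mTHP directory.
# BASE defaults to the real sysfs location; another directory can be
# passed in for testing. (show_mthp_enabled is a hypothetical helper.)
show_mthp_enabled() {
    local base=${1:-/sys/kernel/mm/transparent_hugepage} dir size
    for dir in "$base"/hugepages-*kB; do
        # Skip sizes without a readable knob (and an unmatched glob).
        [ -r "$dir/enabled" ] || continue
        size=${dir##*/hugepages-}
        echo "$size $(cat "$dir/enabled")"
    done
}
```

Running show_mthp_enabled with no argument prints one line per
supported size, with the active setting shown in brackets by the kernel.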
[1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
[2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
[3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
Thanks,
Usama
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
From: Yosry Ahmed @ 2025-01-09 21:34 UTC
To: Usama Arif
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Barry Song, Shakeel Butt

On Thu, Jan 9, 2025 at 12:06 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
> I would like to propose a session to discuss the work going on
> around large folio swapin, whether it's traditional swap,
> zswap or zram.
>
> [...]
> I had initially sent an RFC for zswapin of large folios in [1],
> but it causes a kernel build time regression due to swap
> thrashing, which I am confident is happening with zram large
> folio swapin as well (which is merged in the kernel).

I am obviously interested in this discussion, but unfortunately I won't
be able to make it this year. I will try to attend remotely though if
possible!

> [...]
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
From: Nhat Pham @ 2025-01-10 4:29 UTC
To: Usama Arif
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Barry Song, Yosry Ahmed, Shakeel Butt

On Fri, Jan 10, 2025 at 3:08 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> I would like to propose a session to discuss the work going on
> around large folio swapin, whether it's traditional swap,
> zswap or zram.

I'm interested! Count me in the discussion :)

> [...]
> - We could have something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> to enable/disable swapin, but it's going to be difficult to tune, might
> have different optimum values for different workloads, and is likely to be

Might even be different across memory regions.

> left at its default value. Is there some dynamic way to decide when
> to swap in large folios and when to fall back to smaller folios?
> The swapin_readahead swapcache path, which only supports 4K folios atm, has a
> readahead window based on hits. However, readahead is a folio flag and
> not a page flag, so this method can't be used: once a large folio
> is swapped in, we won't get a fault, and subsequent hits on other
> pages of the large folio won't be recorded.

Is this beneficial/useful enough to make it into a page flag?

Can we push this to the swap layer, i.e. record the hit information on
a per-swap-entry basis instead? The space is a bit tight, but we're
already in talks about the new swap abstraction layer. If we go the
dynamic route, we can squeeze this kind of information into the
dynamically allocated per-swap-entry metadata structure (swap
descriptor?).

However, the swap entry can go away after a swapin (see
should_try_to_free_swap()), so that might be busted :)

> - For zswap and zram, it might be that doing larger block compression/
> decompression offsets the regression from swap thrashing, but it
> brings its own issues. For example, once a large folio is swapped
> out, it could fail to swap in as a large folio and fall back
> to 4K, resulting in redundant decompressions.
> Would this also mean that swapin of large folios from traditional swap
> isn't something we should proceed with?

Yeah, the cost/benefit analysis differs between backends. I wonder if a
one-size-fits-all, backend-agnostic policy could ever work. Maybe we
need some backend-driven algorithm, or some sort of hinting mechanism?

This would make the logic uglier though. We've been here before with
HDD and SSD swap, except we don't really care about the former, so we
can prioritize optimizing for SSD swap (in fact, it looks like we're
removing the HDD portion of the swap allocator). In this case however,
zswap, zram, and SSD swap are all valid options, with different
characteristics that can make the optimal decision differ :)

If we're going the block (de)compression route, there is also this
pesky block size question. For instance, do we want to store the
entire 2MB in a single block? That would mean we need to decompress
the entire 2MB block at load time. It might be more straightforward in
the mTHP world, but we do need to consider 2MB THP users too.

Finally, the calculus might change once large folio allocation becomes
more reliable. Perhaps we can wait until Johannes and Yu make this
work?

> - Should we even support large folio swapin? You often have high swap
> activity when the system/cgroup is close to running out of memory. At this
> point, maybe the best way forward is to just swap in 4K pages and let
> khugepaged [2], [3] collapse them if the surrounding pages are swapped in
> as well.

Perhaps this is the easiest thing to do :)
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
From: Barry Song @ 2025-01-10 10:28 UTC
To: Nhat Pham
Cc: Usama Arif, lsf-pc, Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt

On Fri, Jan 10, 2025 at 5:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
> [...]
> If we're going the block (de)compression route, there is also this
> pesky block size question. For instance, do we want to store the
> entire 2MB in a single block? That would mean we need to decompress
> the entire 2MB block at load time. It might be more straightforward in
> the mTHP world, but we do need to consider 2MB THP users too.

I don't think we need to save the entire 2MB in a single block. After
64KB, we don't see much improvement in compression ratio or speed. The
most significant improvement was observed between 4KB and 16KB.

For example, for zstd:

File size: 182502912 bytes

Block size   Compression time (s)   Decompression time (s)   Compressed size (bytes)   Compression ratio
4KB          0.967303               0.200064                 66089193                  36.21%
16KB         0.567167               0.152807                 59159073                  32.42%
32KB         0.543887               0.136602                 57958701                  31.76%
64KB         0.536979               0.127069                 56700795                  31.07%
128KB        0.540505               0.120685                 55765775                  30.56%
256KB        0.575515               0.125049                 54203461                  29.70%
512KB        0.571370               0.119609                 53914422                  29.54%
1024KB       0.556631               0.119475                 53239893                  29.17%
2048KB       0.539796               0.119751                 52923234                  29.00%

To simplify things (that is, to reduce the chance of decompressing a
large block for a small swap-in), a 2MB THP is actually saved as
2MB/16KB (that is, 128) blocks in zsmalloc, as shown in the RFC:

https://lore.kernel.org/linux-mm/20241121222521.83458-1-21cnbao@gmail.com/

> [...]

Thanks
Barry
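
A block-size comparison like the one above can be approximated with
standard tools. The sketch below is not the program used for the
numbers in this thread: it uses gzip as a stand-in for zstd, compresses
each block independently, and reports only sizes (no timing). The
function names compressed_total and ratio_pct are illustrative
assumptions.

```shell
#!/bin/bash
# compressed_total FILE BLOCK_BYTES
# Print the summed compressed size when FILE is cut into BLOCK_BYTES-sized
# pieces and each piece is compressed on its own (gzip stands in for zstd).
compressed_total() {
    local file=$1 block=$2 tmp total=0 size chunk
    tmp=$(mktemp -d) || return 1
    # -a 6 leaves room for many chunks when the block size is small.
    split -a 6 -b "$block" "$file" "$tmp/chunk."
    for chunk in "$tmp"/chunk.*; do
        [ -f "$chunk" ] || continue
        size=$(gzip -c "$chunk" | wc -c)
        total=$((total + size))
    done
    rm -rf "$tmp"
    echo "$total"
}

# ratio_pct COMPRESSED ORIGINAL
# Print the compressed size as a percentage of the original size.
ratio_pct() {
    awk -v c="$1" -v o="$2" 'BEGIN { printf "%.2f\n", 100 * c / o }'
}
```

For example, ratio_pct "$(compressed_total vmlinux 16384)" "$(wc -c < vmlinux)"
gives the ratio for 16KB blocks of a local file named vmlinux. The
"Compression ratio" column in the table follows the same definition:
66089193 / 182502912 is 36.21%.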
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
From: Zhu Yanjun @ 2025-01-11 10:52 UTC
To: Nhat Pham, Usama Arif
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Barry Song, Yosry Ahmed, Shakeel Butt

On 2025/1/10 5:29, Nhat Pham wrote:
> On Fri, Jan 10, 2025 at 3:08 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> I would like to propose a session to discuss the work going on
>> around large folio swapin, whether it's traditional swap,
>> zswap or zram.
>
> I'm interested! Count me in the discussion :)

I am also interested in this topic. Hope to join the meeting and discuss.

Zhu Yanjun

> [...]
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
From: Barry Song @ 2025-01-10 10:09 UTC
To: Usama Arif
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt

Hi Usama,

Please include me in the discussion. I'll try to attend, at least remotely.

On Fri, Jan 10, 2025 at 9:06 AM Usama Arif <usamaarif642@gmail.com> wrote:
> [...]
> Some of the points we could discuss in the session:
>
> - What is the right (preferably open source) benchmark to test
> swapin of large folios? Kernel build time in a memory-limited
> cgroup shows a regression, while microbenchmarks show a massive
> improvement. Maybe there are benchmarks where TLB misses are
> a big factor and which show an improvement.

My understanding is that it largely depends on the workload. In interactive
scenarios, such as on a phone, swap thrashing is not an issue because
there is minimal to no thrashing for the app occupying the screen
(foreground). In such cases, swap bandwidth becomes the most critical factor
in improving app switching speed, especially when multiple applications
are switching between background and foreground states.

> [...]
> - For zswap and zram, it might be that doing larger block compression/
> decompression offsets the regression from swap thrashing, but it
> brings its own issues. For example, once a large folio is swapped
> out, it could fail to swap in as a large folio and fall back
> to 4K, resulting in redundant decompressions.

That's correct. My current workaround involves swapping four small folios,
and zsmalloc will compress and decompress in chunks of four pages,
regardless of the actual size of the mTHP. The improvement in compression
ratio and speed becomes less significant after exceeding four pages, even
though there is still some increase.

Our recent experiments on phones also show that enabling direct reclaim
for do_swap_page() to allocate order-2 mTHP results in a 0% allocation
failure rate, which probably removes the need to fall back to 4 small
folios. (Note that our experiments include Yu's TAO, which Android GKI has
already merged. However, since 2 is less than PAGE_ALLOC_COSTLY_ORDER, we
might achieve similar results even without Yu's TAO, although I have not
confirmed this.)

> Would this also mean that swapin of large folios from traditional swap
> isn't something we should proceed with?
>
> - Should we even support large folio swapin? You often have high swap
> activity when the system/cgroup is close to running out of memory. At this
> point, maybe the best way forward is to just swap in 4K pages and let
> khugepaged [2], [3] collapse them if the surrounding pages are swapped in
> as well.

This approach might be suitable for non-interactive scenarios, such as
building a kernel within a memory control group (memcg) or running other
server applications. However, performing collapse in interactive and
power-sensitive scenarios would be unnecessary and could lead to wasted
power due to memory migration and unmap/map operations.

However, it is quite challenging to automatically determine the type of
workloads the system is running. I feel we still need a global control to
decide whether to enable mTHP swap-in: not necessarily per size, but at
least at a global level. That said, there is evident resistance to
introducing additional controls to enable or disable mTHP features.

By the way, Usama, have you ever tried switching between mglru and the
traditional active/inactive LRU? My experience shows a significant
difference in swap thrashing: active/inactive LRU exhibits much less swap
thrashing in my local kernel build tests.

the latest mm-unstable

*********** default mglru: ***********

root@barry-desktop:/home/barry/develop/linux# ./build.sh

*** Executing round 1 ***

real 6m44.561s
user 46m53.274s
sys 3m48.585s
pswpin: 1286081
pswpout: 3147936
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 714580
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 286881
pgpgin: 17199072
pgpgout: 21493892
swpout_zero: 229163
swpin_zero: 84353

******** disable mglru ********

root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
root@barry-desktop:/home/barry/develop/linux# ./build.sh

*** Executing round 1 ***

real 6m27.944s
user 46m41.832s
sys 3m30.635s
pswpin: 474036
pswpout: 1434853
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 331755
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 106333
pgpgin: 11763720
pgpgout: 14551524
swpout_zero: 145050
swpin_zero: 87981

my build script:

#!/bin/bash

# Enable only 16kB mTHP; disable the other sizes.
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"

# Print the current swap and paging counters on a single line.
read_values() {
    pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
    pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
    pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
    pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
    swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
    swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
    swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
    swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
    swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
    swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
    swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
    swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
    echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
}

for ((i=1; i<=1; i++))
do
    echo
    echo "*** Executing round $i ***"
    make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
    echo 3 > /proc/sys/vm/drop_caches

    #kernel build
    initial_values=($(read_values))
    time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
    final_values=($(read_values))

    # Report each counter as the difference across the build.
    echo "pswpin: $((final_values[0] - initial_values[0]))"
    echo "pswpout: $((final_values[1] - initial_values[1]))"
    echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
    echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
    echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
    echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
    echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
    echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
    echo "pgpgin: $((final_values[8] - initial_values[8]))"
    echo "pgpgout: $((final_values[9] - initial_values[9]))"
    echo "swpout_zero: $((final_values[10] - initial_values[10]))"
    echo "swpin_zero: $((final_values[11] - initial_values[11]))"
    sync
    sleep 10
done

> [...]

Thanks
Barry
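
To put the two runs side by side, the relative change of each counter
can be computed directly from the numbers quoted above. This is only a
small arithmetic sketch; the helper name pct_change is an illustrative
assumption, and the inputs are copied from the mglru run (first
argument) and the run with mglru disabled (second argument).

```shell
#!/bin/bash
# pct_change OLD NEW
# Print the relative change from OLD to NEW in percent, one decimal place.
pct_change() {
    awk -v o="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * (n - o) / o }'
}

# Counters from the two builds: mglru enabled, then mglru disabled.
echo "pswpin:      $(pct_change 1286081 474036)%"
echo "pswpout:     $(pct_change 3147936 1434853)%"
echo "16kB-swpin:  $(pct_change 286881 106333)%"
echo "16kB-swpout: $(pct_change 714580 331755)%"
```

For this single round, pswpin is roughly 63% lower and pswpout roughly
54% lower with mglru disabled.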
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
2025-01-10 10:09 ` Barry Song
@ 2025-01-10 10:26 ` Usama Arif
2025-01-10 10:30 ` Barry Song
2025-01-12 10:49 ` Barry Song
1 sibling, 1 reply; 13+ messages in thread
From: Usama Arif @ 2025-01-10 10:26 UTC (permalink / raw)
To: Barry Song
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt, Yu Zhao

On 10/01/2025 10:09, Barry Song wrote:
> [...]
> Our recent experiments on phone also show that enabling direct reclamation
> for do_swap_page() to allocate 2-order mTHP results in a 0% allocation
> failure rate - this probably removes the need for fallbacking to 4 small
> folios. (Note that our experiments include Yu's TAO - Android GKI has
> already merged it. However, since 2 is less than
> PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even
> without Yu's TAO, although I have not confirmed this.)
>

Hi Barry,

Thanks for the comments!

I haven't seen any activity on TAO on the mailing list recently. Do you know
if there are any plans for it to be sent for upstream review?
Have cc-ed Yu Zhao as well.

> [...]
> By the way, Usama, have you ever tried switching between mglru and the
> traditional active/inactive LRU? My experience shows a significant
> difference in swap thrashing - active/inactive LRU exhibits much less
> swap thrashing in my local kernel build tests.
>

I never tried with MGLRU enabled, so I am probably seeing the lowest amount of
swap-thrashing.

Thanks,
Usama
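A note on the sizes mentioned above: an order-n folio spans 2^n base pages, so with 4kB base pages the "2-order mTHP" corresponds to the hugepages-16kB knob in the build script. A small sketch, assuming 4kB base pages and PAGE_ALLOC_COSTLY_ORDER = 3 (its value in mainline as far as I know, worth checking against the tree in use); the function names are illustrative:

```shell
#!/bin/bash
# Assumptions: 4kB base pages; PAGE_ALLOC_COSTLY_ORDER is 3.
BASE_PAGE_KB=4
COSTLY_ORDER=3

# order_to_kb N: size in kB of an order-N folio (2^N base pages).
order_to_kb() {
    echo $(( BASE_PAGE_KB << $1 ))
}

# is_costly N: succeed if order N is above the costly threshold.
is_costly() {
    [ "$1" -gt "$COSTLY_ORDER" ]
}

for order in 2 3 4; do
    if is_costly "$order"; then kind="costly"; else kind="not costly"; fi
    echo "order $order = $(order_to_kb "$order")kB ($kind)"
done
```

Under these assumptions order 2 (16kB) sits below the threshold, while order 4 (64kB) is above it.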
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
2025-01-10 10:26 ` Usama Arif
@ 2025-01-10 10:30 ` Barry Song
2025-01-10 10:40 ` Usama Arif
0 siblings, 1 reply; 13+ messages in thread
From: Barry Song @ 2025-01-10 10:30 UTC (permalink / raw)
To: Usama Arif
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt, Yu Zhao

On Fri, Jan 10, 2025 at 11:26 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
> On 10/01/2025 10:09, Barry Song wrote:
> > [...]
> > By the way, Usama, have you ever tried switching between mglru and the
> > traditional active/inactive LRU? My experience shows a significant
> > difference in swap thrashing - active/inactive LRU exhibits much less
> > swap thrashing in my local kernel build tests.
> >
>
> I never tried with MGLRU enabled, so I am probably seeing the lowest amount of
> swap-thrashing.

Are you sure, Usama, since mglru is enabled by default? I have to echo 0 to
manually disable it.

Thanks
Barry
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
2025-01-10 10:30 ` Barry Song
@ 2025-01-10 10:40 ` Usama Arif
2025-01-10 10:47 ` Barry Song
0 siblings, 1 reply; 13+ messages in thread
From: Usama Arif @ 2025-01-10 10:40 UTC (permalink / raw)
To: Barry Song
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt, Yu Zhao

On 10/01/2025 10:30, Barry Song wrote:
> On Fri, Jan 10, 2025 at 11:26 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> [...]
>> I never tried with MGLRU enabled, so I am probably seeing the lowest amount of
>> swap-thrashing.
>
> Are you sure, Usama, since mglru is enabled by default? I have to echo 0 to
> manually disable it.
>

Yes, I don't have CONFIG_LRU_GEN set in my defconfig. I don't think it is set
by default either? At least on x86.

$ make defconfig
$ grep LRU_GEN .config
# CONFIG_LRU_GEN is not set

Thanks,
Usama
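Whether MGLRU is active can be checked from the runtime side as well as from .config. A minimal sketch, assuming the kernel exposes /sys/kernel/mm/lru_gen/enabled only when CONFIG_LRU_GEN is built in, and that bit 0 of the value read back is the main switch, as the multi-gen LRU admin guide describes; the function name and its optional path argument are illustrative:

```shell
#!/bin/bash
# mglru_state [PATH]: report whether MGLRU looks compiled in and switched on.
# PATH defaults to the real sysfs file; it is a parameter only so the
# function can be exercised against an ordinary file.
mglru_state() {
    local path="${1:-/sys/kernel/mm/lru_gen/enabled}"
    local value
    if [ ! -r "$path" ]; then
        # No sysfs file: most likely CONFIG_LRU_GEN is not set.
        echo "not built in"
        return
    fi
    value=$(cat "$path")                  # e.g. 0x0007 or 0x0000
    if [ $(( $value & 1 )) -ne 0 ]; then  # assumption: bit 0 is the main switch
        echo "enabled ($value)"
    else
        echo "disabled ($value)"
    fi
}

mglru_state
```

On a kernel built from the x86 defconfig shown above this would be expected to print "not built in", whereas a kernel with MGLRU enabled by default would report it as enabled until 0 is written to the file.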
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
2025-01-10 10:40 ` Usama Arif
@ 2025-01-10 10:47 ` Barry Song
0 siblings, 0 replies; 13+ messages in thread
From: Barry Song @ 2025-01-10 10:47 UTC (permalink / raw)
To: Usama Arif
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt, Yu Zhao

On Fri, Jan 10, 2025 at 11:40 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 10/01/2025 10:30, Barry Song wrote:
> > On Fri, Jan 10, 2025 at 11:26 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 10/01/2025 10:09, Barry Song wrote:
> >>> Hi Usama,
> >>>
> >>> Please include me in the discussion. I'll try to attend, at least remotely.
> >>>
> >>> On Fri, Jan 10, 2025 at 9:06 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>
> >>>> I would like to propose a session to discuss the work going on
> >>>> around large folio swapin, whether its traditional swap or
> >>>> zswap or zram.
> >>>>
> >>>> Large folios have obvious advantages that have been discussed before
> >>>> like fewer page faults, batched PTE and rmap manipulation, reduced
> >>>> lru list, TLB coalescing (for arm64 and amd).
> >>>> However, swapping in large folios has its own drawbacks like higher
> >>>> swap thrashing.
> >>>> I had initially sent a RFC of zswapin of large folios in [1]
> >>>> but it causes a regression due to swap thrashing in kernel
> >>>> build time, which I am confident is happening with zram large
> >>>> folio swapin as well (which is merged in kernel).
> >>>>
> >>>> Some of the points we could discuss in the session:
> >>>>
> >>>> - What is the right (preferably open source) benchmark to test for
> >>>> swapin of large folios? kernel build time in limited
> >>>> memory cgroup shows a regression, microbenchmarks show a massive
> >>>> improvement, maybe there are benchmarks where TLB misses is
> >>>> a big factor and show an improvement.
> >>>
> >>> My understanding is that it largely depends on the workload. In interactive
> >>> scenarios, such as on a phone, swap thrashing is not an issue because
> >>> there is minimal to no thrashing for the app occupying the screen
> >>> (foreground). In such cases, swap bandwidth becomes the most critical factor
> >>> in improving app switching speed, especially when multiple applications
> >>> are switching between background and foreground states.
> >>>
> >>>>
> >>>> - We could have something like
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> >>>> to enable/disable swapin but its going to be difficult to tune, might
> >>>> have different optimum values based on workloads and are likely to be
> >>>> left at their default values. Is there some dynamic way to decide when
> >>>> to swapin large folios and when to fallback to smaller folios?
> >>>> swapin_readahead swapcache path which only supports 4K folios atm has a
> >>>> read ahead window based on hits, however readahead is a folio flag and
> >>>> not a page flag, so this method can't be used as once a large folio
> >>>> is swapped in, we won't get a fault and subsequent hits on other
> >>>> pages of the large folio won't be recorded.
> >>>>
> >>>> - For zswap and zram, it might be that doing larger block compression/
> >>>> decompression might offset the regression from swap thrashing, but it
> >>>> brings about its own issues. For e.g. once a large folio is swapped
> >>>> out, it could fail to swapin as a large folio and fallback
> >>>> to 4K, resulting in redundant decompressions.
> >>>
> >>> That's correct. My current workaround involves swapping four small folios,
> >>> and zsmalloc will compress and decompress in chunks of four pages,
> >>> regardless of the actual size of the mTHP - The improvement in compression
> >>> ratio and speed becomes less significant after exceeding four pages, even
> >>> though there is still some increase.
> >>>
> >>> Our recent experiments on phone also show that enabling direct reclamation
> >>> for do_swap_page() to allocate 2-order mTHP results in a 0% allocation
> >>> failure rate - this probably removes the need for fallbacking to 4 small
> >>> folios. (Note that our experiments include Yu's TAO—Android GKI has
> >>> already merged it. However, since 2 is less than
> >>> PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even
> >>> without Yu's TAO, although I have not confirmed this.)
> >>>
> >>
> >> Hi Barry,
> >>
> >> Thanks for the comments!
> >>
> >> I haven't seen any activity on TAO on the mailing list recently. Do you know
> >> if there are any plans for it to be sent for upstream review?
> >> Have cc-ed Yu Zhao as well.
> >>
> >>
> >>>> This will also mean swapin of large folios from traditional swap
> >>>> isn't something we should proceed with?
> >>>>
> >>>> - Should we even support large folio swapin? You often have high swap
> >>>> activity when the system/cgroup is close to running out of memory, at this
> >>>> point, maybe the best way forward is to just swapin 4K pages and let
> >>>> khugepaged [2], [3] collapse them if the surrounding pages are swapped in
> >>>> as well.
> >>>
> >>> This approach might be suitable for non-interactive scenarios, such as building
> >>> a kernel within a memory control group (memcg) or running other server
> >>> applications. However, performing collapse in interactive and power-sensitive
> >>> scenarios would be unnecessary and could lead to wasted power due to
> >>> memory migration and unmap/map operations.
> >>>
> >>> However, it is quite challenging to automatically determine the type
> >>> of workloads
> >>> the system is running. I feel we still need a global control to decide whether
> >>> to enable mTHP swap-in—not necessarily per size, but at least at a global level.
> >>> That said, there is evident resistance to introducing additional
> >>> controls to enable
> >>> or disable mTHP features.
> >>>
> >>> By the way, Usama, have you ever tried switching between mglru and the
> >>> traditional
> >>> active/inactive LRU? My experience shows a significant difference in
> >>> swap thrashing
> >>> —active/inactive LRU exhibits much less swap thrashing in my local kernel build
> >>> tests.
> >>>
> >>
> >> I never tried with MGLRU enabled, so I am probably seeing the lowest amount of
> >> swap-thrashing.
> >
> > Are you sure, Usama, since mglru is enabled by default? I have to echo
> > 0 to manually
> > disable it.
> >
>
> Yes, I dont have CONFIG_LRU_GEN set in my defconfig. I dont think it is set
> by default as well? Atleast on x86.
>
> $ make defconfig
> $ grep LRU_GEN .config
> # CONFIG_LRU_GEN is not set

Okay, it’s likely because I’m using the Ubuntu distribution for x86 and
Android GKI for arm64, where mglru is enabled by default in both cases.

But regardless, I’d appreciate it if you could enable it and check if you
observe the same phenomena as I did :-)

>
> Thanks,
> Usama
>
> >>
> >> Thanks,
> >> Usama
> >>
> >>> the latest mm-unstable
> >>>
> >>> *********** default mglru: ***********
> >>>
> >>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> >>> *** Executing round 1 ***
> >>> real 6m44.561s
> >>> user 46m53.274s
> >>> sys 3m48.585s
> >>> pswpin: 1286081
> >>> pswpout: 3147936
> >>> 64kB-swpout: 0
> >>> 32kB-swpout: 0
> >>> 16kB-swpout: 714580
> >>> 64kB-swpin: 0
> >>> 32kB-swpin: 0
> >>> 16kB-swpin: 286881
> >>> pgpgin: 17199072
> >>> pgpgout: 21493892
> >>> swpout_zero: 229163
> >>> swpin_zero: 84353
> >>>
> >>> ******** disable mglru ********
> >>>
> >>> root@barry-desktop:/home/barry/develop/linux# echo 0 >
> >>> /sys/kernel/mm/lru_gen/enabled
> >>>
> >>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> >>> *** Executing round 1 ***
> >>> real 6m27.944s
> >>> user 46m41.832s
> >>> sys 3m30.635s
> >>> pswpin: 474036
> >>> pswpout: 1434853
> >>> 64kB-swpout: 0
> >>> 32kB-swpout: 0
> >>> 16kB-swpout: 331755
> >>> 64kB-swpin: 0
> >>> 32kB-swpin: 0
> >>> 16kB-swpin: 106333
> >>> pgpgin: 11763720
> >>> pgpgout: 14551524
> >>> swpout_zero: 145050
> >>> swpin_zero: 87981
> >>>
> >>> my build script:
> >>>
> >>> #!/bin/bash
> >>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>
> >>> vmstat_path="/proc/vmstat"
> >>> thp_base_path="/sys/kernel/mm/transparent_hugepage"
> >>>
> >>> read_values() {
> >>> pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
> >>> pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
> >>> pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
> >>> pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
> >>> swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
> >>> swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
> >>> swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout
> >>> 2>/dev/null || echo 0)
> >>> swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout
> >>> 2>/dev/null || echo 0)
> >>> swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout
> >>> 2>/dev/null || echo 0)
> >>> swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin
> >>> 2>/dev/null || echo 0)
> >>> swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin
> >>> 2>/dev/null || echo 0)
> >>> swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin
> >>> 2>/dev/null || echo 0)
> >>> echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k
> >>> $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero
> >>> $swpin_zero"
> >>> }
> >>>
> >>> for ((i=1; i<=1; i++))
> >>> do
> >>> echo
> >>> echo "*** Executing round $i ***"
> >>> make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >>> echo 3 > /proc/sys/vm/drop_caches
> >>>
> >>> #kernel build
> >>> initial_values=($(read_values))
> >>> time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>> CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
> >>> final_values=($(read_values))
> >>>
> >>> echo "pswpin: $((final_values[0] - initial_values[0]))"
> >>> echo "pswpout: $((final_values[1] - initial_values[1]))"
> >>> echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
> >>> echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
> >>> echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
> >>> echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
> >>> echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
> >>> echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
> >>> echo "pgpgin: $((final_values[8] - initial_values[8]))"
> >>> echo "pgpgout: $((final_values[9] - initial_values[9]))"
> >>> echo "swpout_zero: $((final_values[10] - initial_values[10]))"
> >>> echo "swpin_zero: $((final_values[11] - initial_values[11]))"
> >>> sync
> >>> sleep 10
> >>> done
> >>>
> >>>>
> >>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> >>>> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
> >>>> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >>>>
> >>>> Thanks,
> >>>> Usama
> >>>
> >

Thanks
Barry

^ permalink raw reply [flat|nested] 13+ messages in thread
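The exchange above turns on whether MGLRU is active on the test machine: one side checked CONFIG_LRU_GEN at build time, the other toggled /sys/kernel/mm/lru_gen/enabled at run time. A minimal sketch of a run-time check follows. It assumes the file holds a hexadecimal bitmask such as 0x0007 whose bit 0 is the main switch, as the kernel's multi-gen LRU admin guide describes; the function names are hypothetical and not from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Interpret the text read from /sys/kernel/mm/lru_gen/enabled.
 * Assumption: the value is a bitmask (e.g. "0x0007\n") and bit 0 is
 * the MGLRU main switch.
 */
bool lru_gen_main_switch(const char *text)
{
	assert(text != NULL);
	/* Base 0 lets strtoul accept the "0x" prefix. */
	unsigned long caps = strtoul(text, NULL, 0);
	return (caps & 0x1UL) != 0;
}

/*
 * Returns 1 if MGLRU is on, 0 if it is off, and -1 if the file cannot
 * be read (for example when the kernel lacks CONFIG_LRU_GEN).
 */
int mglru_enabled(const char *path)
{
	char buf[32];
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return lru_gen_main_switch(buf) ? 1 : 0;
}
```

Calling mglru_enabled("/sys/kernel/mm/lru_gen/enabled") before each benchmark round and logging the result next to the swap counters would make runs from different distributions easier to compare.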
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
2025-01-10 10:09 ` Barry Song
2025-01-10 10:26 ` Usama Arif
@ 2025-01-12 10:49 ` Barry Song
1 sibling, 0 replies; 13+ messages in thread
From: Barry Song @ 2025-01-12 10:49 UTC (permalink / raw)
To: 21cnbao, usamaarif642
Cc: hannes, linux-mm, lsf-pc, shakeel.butt, yosryahmed, Barry Song

On Fri, Jan 10, 2025 at 11:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Usama,
>
> Please include me in the discussion. I'll try to attend, at least remotely.
>
> On Fri, Jan 10, 2025 at 9:06 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >
> > I would like to propose a session to discuss the work going on
> > around large folio swapin, whether its traditional swap or
> > zswap or zram.
> >
> > Large folios have obvious advantages that have been discussed before
> > like fewer page faults, batched PTE and rmap manipulation, reduced
> > lru list, TLB coalescing (for arm64 and amd).
> > However, swapping in large folios has its own drawbacks like higher
> > swap thrashing.
> > I had initially sent a RFC of zswapin of large folios in [1]
> > but it causes a regression due to swap thrashing in kernel
> > build time, which I am confident is happening with zram large
> > folio swapin as well (which is merged in kernel).
> >
> > Some of the points we could discuss in the session:
> >
> > - What is the right (preferably open source) benchmark to test for
> > swapin of large folios? kernel build time in limited
> > memory cgroup shows a regression, microbenchmarks show a massive
> > improvement, maybe there are benchmarks where TLB misses is
> > a big factor and show an improvement.
>
> My understanding is that it largely depends on the workload. In interactive
> scenarios, such as on a phone, swap thrashing is not an issue because
> there is minimal to no thrashing for the app occupying the screen
> (foreground). In such cases, swap bandwidth becomes the most critical factor
> in improving app switching speed, especially when multiple applications
> are switching between background and foreground states.
>
> >
> > - We could have something like
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> > to enable/disable swapin but its going to be difficult to tune, might
> > have different optimum values based on workloads and are likely to be
> > left at their default values. Is there some dynamic way to decide when
> > to swapin large folios and when to fallback to smaller folios?
> > swapin_readahead swapcache path which only supports 4K folios atm has a
> > read ahead window based on hits, however readahead is a folio flag and
> > not a page flag, so this method can't be used as once a large folio
> > is swapped in, we won't get a fault and subsequent hits on other
> > pages of the large folio won't be recorded.
> >
> > - For zswap and zram, it might be that doing larger block compression/
> > decompression might offset the regression from swap thrashing, but it
> > brings about its own issues. For e.g. once a large folio is swapped
> > out, it could fail to swapin as a large folio and fallback
> > to 4K, resulting in redundant decompressions.
>
> That's correct. My current workaround involves swapping four small folios,
> and zsmalloc will compress and decompress in chunks of four pages,
> regardless of the actual size of the mTHP - The improvement in compression
> ratio and speed becomes less significant after exceeding four pages, even
> though there is still some increase.
>
> Our recent experiments on phone also show that enabling direct reclamation
> for do_swap_page() to allocate 2-order mTHP results in a 0% allocation
> failure rate - this probably removes the need for fallbacking to 4 small
> folios. (Note that our experiments include Yu's TAO—Android GKI has
> already merged it. However, since 2 is less than
> PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even
> without Yu's TAO, although I have not confirmed this.)
>
> > This will also mean swapin of large folios from traditional swap
> > isn't something we should proceed with?
> >
> > - Should we even support large folio swapin? You often have high swap
> > activity when the system/cgroup is close to running out of memory, at this
> > point, maybe the best way forward is to just swapin 4K pages and let
> > khugepaged [2], [3] collapse them if the surrounding pages are swapped in
> > as well.
>
> This approach might be suitable for non-interactive scenarios, such as building
> a kernel within a memory control group (memcg) or running other server
> applications. However, performing collapse in interactive and power-sensitive
> scenarios would be unnecessary and could lead to wasted power due to
> memory migration and unmap/map operations.
>
> However, it is quite challenging to automatically determine the type
> of workloads
> the system is running. I feel we still need a global control to decide whether
> to enable mTHP swap-in—not necessarily per size, but at least at a global level.
> That said, there is evident resistance to introducing additional
> controls to enable
> or disable mTHP features.

I drafted an approach that eliminates the need for this control. Based on
my testing, it results in even less swap thrashing compared to disabling
mTHP swap-in for the non-mglru case.

Here are the results:

real 6m27.227s
user 49m46.751s
sys 3m34.512s
pswpin: 294050
pswpout: 1265556
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 288163
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 22899
pgpgin: 11816316
pgpgout: 13891256
swpout_zero: 136907
swpin_zero: 77215

The draft is as below,

[PATCH RFC] mm: throttle large folios swap-in based on thrashing

We have two types of workloads. The first is interactive systems, where
the foreground desktop apps typically do not swap out. In this case, we
are more concerned with swap bandwidth for switching background and
foreground apps, which is primarily driven by large folio swap-ins.

The second type involves scenarios like building a kernel in a 1GB
memory cgroup, where extensive swapping occurs. Large folio swap-ins can
exacerbate swap thrashing in such cases.

While conceptually, we could use a sysfs control to toggle the mTHP
swap-in feature, there is resistance to adding new controls. Instead, we
employ a simple automatic mechanism to roughly detect swap thrashing: if
refaults are observed in a recent batch of swap-ins, we fall back to
small folio swap-ins.

Even during a kernel build in a 1GiB memory cgroup, we continue to
observe many large folio swap-ins, benefiting from increased swap-in
bandwidth, while increased swap thrashing has been eliminated compared
to disabling mTHP swap-in.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/mmzone.h |  9 +++++++++
 mm/memcontrol.c        | 19 +++++++++++++++++--
 mm/workingset.c        | 37 +++++++++++++++++++++++++++++++++++--
 3 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9540b41894da..c6deece243d1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -329,6 +329,15 @@ enum lruvec_flags {
 	LRUVEC_NODE_CONGESTED,
 };
 
+/*
+ * Has the lruvec experienced an anon large folio refault recently?
+ * Once a refault occurs, we set it to 31; it only degrades to 0 if
+ * there are more than 31 consecutive non-refault swap-ins.
+ */
+#define LRUVEC_REFAULT_WIDTH 5
+#define LRUVEC_REFAULT_OFFS (LRUVEC_NODE_CONGESTED + 1)
+#define LRUVEC_REFAULT_MASK ((BIT(LRUVEC_REFAULT_WIDTH) - 1) << LRUVEC_REFAULT_OFFS)
+
 #endif /* !__GENERATING_BOUNDS_H */
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..4155c4126a80 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4556,12 +4556,21 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				  gfp_t gfp, swp_entry_t entry)
 {
+	struct pglist_data *pgdat = folio_pgdat(folio);
+	struct lruvec *lruvec;
 	struct mem_cgroup *memcg;
 	unsigned short id;
 	int ret;
 
-	if (mem_cgroup_disabled())
-		return 0;
+	if (mem_cgroup_disabled()) {
+		/*
+		 * lruvec is congested or has recent THP refaults,
+		 * avoid future swap thrashing
+		 */
+		lruvec = &pgdat->__lruvec;
+		return (folio_test_large(folio) && lruvec->flags) ?
+			-ENOMEM : 0;
+	}
 
 	id = lookup_swap_cgroup_id(entry);
 	rcu_read_lock();
@@ -4570,8 +4579,14 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 		memcg = get_mem_cgroup_from_mm(mm);
 	rcu_read_unlock();
 
+	lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));
+	if (folio_test_large(folio) && lruvec->flags) {
+		ret = -ENOMEM;
+		goto out;
+	}
 	ret = charge_memcg(folio, memcg, gfp);
 
+out:
 	css_put(&memcg->css);
 	return ret;
 }
diff --git a/mm/workingset.c b/mm/workingset.c
index 4841ae8af411..095f8668dc22 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -280,6 +280,28 @@ static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
 	return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) < MAX_NR_GENS;
 }
 
+static void lruvec_set_max_refaults(struct lruvec *lruvec)
+{
+	set_mask_bits(&lruvec->flags, LRUVEC_REFAULT_MASK, LRUVEC_REFAULT_MASK);
+}
+
+static int lruvec_dec_refaults(struct lruvec *lruvec)
+{
+	unsigned long new_flags, old_flags = READ_ONCE(lruvec->flags);
+	unsigned long new_ref, old_ref;
+
+	do {
+		old_ref = (old_flags & LRUVEC_REFAULT_MASK) >> LRUVEC_REFAULT_OFFS;
+		if (old_ref == 0)
+			return 0;
+		new_ref = old_ref - 1;
+		new_flags = old_flags & ~LRUVEC_REFAULT_MASK;
+		new_flags |= new_ref << LRUVEC_REFAULT_OFFS;
+	} while (!try_cmpxchg(&lruvec->flags, &old_flags, new_flags));
+
+	return old_ref;
+}
+
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
 	bool recent;
@@ -299,8 +321,14 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
 
-	if (!recent)
+	if (!recent) {
+		if (!type)
+			lruvec_dec_refaults(lruvec);
 		goto unlock;
+	}
+
+	if (!type && folio_test_large(folio))
+		lruvec_set_max_refaults(lruvec);
 
 	lrugen = &lruvec->lrugen;
 
@@ -563,11 +591,16 @@ void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	if (!workingset_test_recent(shadow, file, &workingset, true))
+	if (!workingset_test_recent(shadow, file, &workingset, true)) {
+		if (!file)
+			lruvec_dec_refaults(lruvec);
 		return;
+	}
 
 	folio_set_active(folio);
 	workingset_age_nonresident(lruvec, nr);
+	if (!file && folio_test_large(folio))
+		lruvec_set_max_refaults(lruvec);
 	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
 
 	/* Folio was active prior to eviction */
-- 
2.34.1

>
> By the way, Usama, have you ever tried switching between mglru and the
> traditional
> active/inactive LRU? My experience shows a significant difference in
> swap thrashing
> —active/inactive LRU exhibits much less swap thrashing in my local kernel build
> tests.
>
> the latest mm-unstable
>
> *********** default mglru: ***********
>
> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> *** Executing round 1 ***
> real 6m44.561s
> user 46m53.274s
> sys 3m48.585s
> pswpin: 1286081
> pswpout: 3147936
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 714580
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 286881
> pgpgin: 17199072
> pgpgout: 21493892
> swpout_zero: 229163
> swpin_zero: 84353
>
> ******** disable mglru ********
>
> root@barry-desktop:/home/barry/develop/linux# echo 0 >
> /sys/kernel/mm/lru_gen/enabled
>
> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> *** Executing round 1 ***
> real 6m27.944s
> user 46m41.832s
> sys 3m30.635s
> pswpin: 474036
> pswpout: 1434853
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 331755
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 106333
> pgpgin: 11763720
> pgpgout: 14551524
> swpout_zero: 145050
> swpin_zero: 87981
>
> my build script:
>
> #!/bin/bash
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>
> vmstat_path="/proc/vmstat"
> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>
> read_values() {
> pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
> pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
> pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
> pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
> swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
> swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
> swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout
> 2>/dev/null || echo 0)
> swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout
> 2>/dev/null || echo 0)
> swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout
> 2>/dev/null || echo 0)
> swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin
> 2>/dev/null || echo 0)
> swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin
> 2>/dev/null || echo 0)
> swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin
> 2>/dev/null || echo 0)
> echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k
> $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero
> $swpin_zero"
> }
>
> for ((i=1; i<=1; i++))
> do
> echo
> echo "*** Executing round $i ***"
> make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> echo 3 > /proc/sys/vm/drop_caches
>
> #kernel build
> initial_values=($(read_values))
> time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
> final_values=($(read_values))
>
> echo "pswpin: $((final_values[0] - initial_values[0]))"
> echo "pswpout: $((final_values[1] - initial_values[1]))"
> echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
> echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
> echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
> echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
> echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
> echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
> echo "pgpgin: $((final_values[8] - initial_values[8]))"
> echo "pgpgout: $((final_values[9] - initial_values[9]))"
> echo "swpout_zero: $((final_values[10] - initial_values[10]))"
> echo "swpin_zero: $((final_values[11] - initial_values[11]))"
> sync
> sleep 10
> done
>
> >
> > [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> > [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
> > [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >
> > Thanks,
> > Usama
>
> Thanks
> Barry

^ permalink raw reply [flat|nested] 13+ messages in thread
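The throttle in the RFC above keeps a 5-bit counter in the lruvec flags word. A recent anonymous large-folio refault saturates it at 31, each anonymous refault judged not recent decrements it, and large swap-ins are refused while the flags word is non-zero. A minimal user-space sketch of that state machine follows, assuming C11 atomics stand in for the kernel's set_mask_bits() and try_cmpxchg(). The function names and the REFAULT_OFFS value are illustrative assumptions, since the patch derives the offset from LRUVEC_NODE_CONGESTED + 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REFAULT_WIDTH 5
#define REFAULT_OFFS  2 /* assumed; the RFC uses LRUVEC_NODE_CONGESTED + 1 */
#define REFAULT_MASK  (((1UL << REFAULT_WIDTH) - 1) << REFAULT_OFFS)

static_assert(REFAULT_OFFS + REFAULT_WIDTH <= 8 * sizeof(unsigned long),
	      "refault counter must fit in the flags word");

/* A recent anon large-folio refault saturates the counter at 31. */
void set_max_refaults(atomic_ulong *flags)
{
	atomic_fetch_or(flags, REFAULT_MASK);
}

/*
 * Decrement the counter unless it is already zero, leaving the other
 * flag bits untouched. Returns the counter value before the decrement.
 */
unsigned long dec_refaults(atomic_ulong *flags)
{
	unsigned long old_flags = atomic_load(flags);
	unsigned long old_ref, new_flags;

	do {
		old_ref = (old_flags & REFAULT_MASK) >> REFAULT_OFFS;
		if (old_ref == 0)
			return 0;
		new_flags = old_flags & ~REFAULT_MASK;
		new_flags |= (old_ref - 1) << REFAULT_OFFS;
		/* On failure old_flags is reloaded and the loop retries. */
	} while (!atomic_compare_exchange_weak(flags, &old_flags, new_flags));

	return old_ref;
}

/*
 * Large swap-ins are throttled while any bit is set, so the congestion
 * bits below REFAULT_OFFS count as well as the refault counter.
 */
bool large_swapin_throttled(atomic_ulong *flags)
{
	return atomic_load(flags) != 0;
}
```

In this model the counter reaches zero after 31 decrements following a saturating refault, so a burst of large-folio refaults suppresses large swap-ins until 31 non-recent anonymous refaults have been seen on that lruvec.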
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
2025-01-09 20:06 [LSF/MM/BPF TOPIC] Large folio (z)swapin Usama Arif
` (2 preceding siblings ...)
2025-01-10 10:09 ` Barry Song
@ 2025-01-13 3:16 ` Chuanhua Han
2025-01-28 8:17 ` Sergey Senozhatsky
4 siblings, 0 replies; 13+ messages in thread
From: Chuanhua Han @ 2025-01-13 3:16 UTC (permalink / raw)
To: Usama Arif
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Barry Song, Yosry Ahmed, Shakeel Butt

I am also interested in this topic. Please include me in the discussion too.
I'll try to attend, at least remotely :)

On Fri, 10 Jan 2025 at 04:06, Usama Arif <usamaarif642@gmail.com> wrote:
>
> I would like to propose a session to discuss the work going on
> around large folio swapin, whether its traditional swap or
> zswap or zram.
>
> Large folios have obvious advantages that have been discussed before
> like fewer page faults, batched PTE and rmap manipulation, reduced
> lru list, TLB coalescing (for arm64 and amd).
> However, swapping in large folios has its own drawbacks like higher
> swap thrashing.
> I had initially sent a RFC of zswapin of large folios in [1]
> but it causes a regression due to swap thrashing in kernel
> build time, which I am confident is happening with zram large
> folio swapin as well (which is merged in kernel).
>
> Some of the points we could discuss in the session:
>
> - What is the right (preferably open source) benchmark to test for
> swapin of large folios? kernel build time in limited
> memory cgroup shows a regression, microbenchmarks show a massive
> improvement, maybe there are benchmarks where TLB misses is
> a big factor and show an improvement.
>
> - We could have something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> to enable/disable swapin but its going to be difficult to tune, might
> have different optimum values based on workloads and are likely to be
> left at their default values. Is there some dynamic way to decide when
> to swapin large folios and when to fallback to smaller folios?
> swapin_readahead swapcache path which only supports 4K folios atm has a
> read ahead window based on hits, however readahead is a folio flag and
> not a page flag, so this method can't be used as once a large folio
> is swapped in, we won't get a fault and subsequent hits on other
> pages of the large folio won't be recorded.
>
> - For zswap and zram, it might be that doing larger block compression/
> decompression might offset the regression from swap thrashing, but it
> brings about its own issues. For e.g. once a large folio is swapped
> out, it could fail to swapin as a large folio and fallback
> to 4K, resulting in redundant decompressions.
> This will also mean swapin of large folios from traditional swap
> isn't something we should proceed with?
>
> - Should we even support large folio swapin? You often have high swap
> activity when the system/cgroup is close to running out of memory, at this
> point, maybe the best way forward is to just swapin 4K pages and let
> khugepaged [2], [3] collapse them if the surrounding pages are swapped in
> as well.
>
> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>
> Thanks,
> Usama
>

-- 
Thanks,
Chuanhua

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
2025-01-09 20:06 [LSF/MM/BPF TOPIC] Large folio (z)swapin Usama Arif
` (3 preceding siblings ...)
2025-01-13 3:16 ` Chuanhua Han
@ 2025-01-28 8:17 ` Sergey Senozhatsky
4 siblings, 0 replies; 13+ messages in thread
From: Sergey Senozhatsky @ 2025-01-28 8:17 UTC (permalink / raw)
To: Usama Arif
Cc: lsf-pc, Linux Memory Management List, Johannes Weiner, Barry Song, Yosry Ahmed, Shakeel Butt

On (25/01/09 20:06), Usama Arif wrote:
> - For zswap and zram, it might be that doing larger block compression/
> decompression might offset the regression from swap thrashing

So zram/zsmalloc is certainly of some interest to me. Relatively low
chances of me traveling, but would definitely like to at least know
details of the discussion (should it take place) or maybe can join
remotely.

^ permalink raw reply [flat|nested] 13+ messages in thread
Thread overview: 13+ messages
2025-01-09 20:06 [LSF/MM/BPF TOPIC] Large folio (z)swapin Usama Arif
2025-01-09 21:34 ` Yosry Ahmed
2025-01-10 4:29 ` Nhat Pham
2025-01-10 10:28 ` Barry Song
2025-01-11 10:52 ` Zhu Yanjun
2025-01-10 10:09 ` Barry Song
2025-01-10 10:26 ` Usama Arif
2025-01-10 10:30 ` Barry Song
2025-01-10 10:40 ` Usama Arif
2025-01-10 10:47 ` Barry Song
2025-01-12 10:49 ` Barry Song
2025-01-13 3:16 ` Chuanhua Han
2025-01-28 8:17 ` Sergey Senozhatsky