From: Zhu Yanjun <yanjun.zhu@linux.dev>
To: Nhat Pham <nphamcs@gmail.com>, Usama Arif <usamaarif642@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org,
Linux Memory Management List <linux-mm@kvack.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Barry Song <21cnbao@gmail.com>,
Yosry Ahmed <yosryahmed@google.com>,
Shakeel Butt <shakeel.butt@linux.dev>
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
Date: Sat, 11 Jan 2025 11:52:12 +0100
Message-ID: <081a1173-2d71-427b-ad26-16c8d1d99628@linux.dev>
In-Reply-To: <CAKEwX=PezunYEAjDVi6jumbGCHJEGc9UJaDnfh2nKaX8+UhxFQ@mail.gmail.com>
On 2025/1/10 5:29, Nhat Pham wrote:
> On Fri, Jan 10, 2025 at 3:08 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> I would like to propose a session to discuss the work going on
>> around large folio swapin, whether it's traditional swap, zswap,
>> or zram.
>
> I'm interested! Count me in the discussion :)
I am also interested in this topic and hope to join the meeting to discuss it.
Zhu Yanjun
>
>>
>> Large folios have obvious advantages that have been discussed before,
>> such as fewer page faults, batched PTE and rmap manipulation, a
>> shorter LRU list, and TLB coalescing (on arm64 and AMD).
>> However, swapping in large folios has its own drawbacks, such as
>> higher swap thrashing.
>> I had initially sent an RFC for zswapin of large folios in [1],
>> but it caused a regression in kernel build time due to swap
>> thrashing, which I am confident is happening with zram large folio
>> swapin as well (which is already merged in the kernel).
>>
>> Some of the points we could discuss in the session:
>>
>> - What is the right (preferably open source) benchmark to test
>> swapin of large folios? Kernel build time in a limited memory
>> cgroup shows a regression, while microbenchmarks show a massive
>> improvement; maybe there are benchmarks where TLB misses are a
>> big factor and which would show an improvement.
>>
>> - We could have something like
>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
>> to enable/disable swapin, but such knobs are going to be difficult
>> to tune, might have different optimum values depending on the
>> workload, and are likely to be
>
> Might even be different across memory regions.
>
>> left at their default values. Is there some dynamic way to decide
>> when to swap in large folios and when to fall back to smaller folios?
>> The swapin_readahead swapcache path, which only supports 4K folios at
>> the moment, has a readahead window based on hits; however, readahead
>> is a folio flag and not a page flag, so this method can't be used:
>> once a large folio is swapped in, we won't get a fault, and subsequent
>> hits on other pages of the large folio won't be recorded.
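>>
>> As a straw man, the decision point could look something like the
>> sketch below. Everything in it is hypothetical: swap_thrash_level()
>> and SWAPIN_THRASH_THRESHOLD don't exist, they just stand in for
>> whatever dynamic feedback signal we might come up with.
>>
>> /* Hypothetical sketch only: swap_thrash_level() and the threshold
>>  * are made-up placeholders for a dynamic feedback signal. */
>> static int swapin_pick_order(struct vm_area_struct *vma, int max_order)
>> {
>> 	unsigned int thrash = swap_thrash_level(vma);
>>
>> 	if (thrash > SWAPIN_THRASH_THRESHOLD)
>> 		return 0;		/* fall back to 4K pages */
>> 	return max_order;		/* try the large folio order */
>> }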
>
> Is this beneficial/useful enough to make it into a page flag?
>
> Can we push this to the swap layer, i.e. record the hit information on
> a per-swap-entry basis instead? The space is a bit tight, but we're
> already discussing the new swap abstraction layer. If we go the
> dynamic route, we can squeeze this kind of information into the
> dynamically allocated per-swap-entry metadata structure (the swap
> descriptor?).
>
> However, the swap entry can go away after a swapin (see
> should_try_to_free_swap()), so that might be busted :)
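>
> For concreteness, the hit counter might live in that per-entry
> metadata along these lines. This is purely hypothetical: the swap
> descriptor layout is still being designed, and the field is made up.
>
> /* Hypothetical sketch; the "swap descriptor" is still under
>  * discussion, so the layout and field name are illustrative. */
> struct swap_desc {
> 	/* ... backing slot, ownership, etc. ... */
> 	u8 nr_hits;	/* swapin hits near this entry, saturating */
> };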
>
>>
>> - For zswap and zram, it might be that doing larger block compression/
>> decompression offsets the regression from swap thrashing, but it
>> brings its own issues. For example, once a large folio is swapped
>> out, the swapin could fail to allocate a large folio and fall back
>> to 4K, resulting in redundant decompressions.
>> Would this also mean that swapin of large folios from traditional
>> swap isn't something we should proceed with?
>
> Yeah, the cost/benefit analysis differs between backends. I wonder if
> a one-size-fits-all, backend-agnostic policy could ever work - maybe
> we need some backend-driven algorithm, or some sort of hinting
> mechanism?
>
> This would make the logic uglier, though. We've been here before with
> HDD and SSD swap, except we don't really care about the former, so we
> can prioritize optimizing for SSD swap (in fact, it looks like we're
> removing the HDD portion of the swap allocator). In this case,
> however, zswap, zram, and SSD swap are all valid options, with
> different characteristics that can make the optimal decision differ :)
>
> If we're going the block (de)compression route, there is also this
> pesky block size question. For instance, do we want to store the
> entire 2MB in a single block? That would mean we need to decompress
> the entire 2MB block at load time. It might be more straightforward in
> the mTHP world, but we do need to consider 2MB THP users too.
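>
> To put rough numbers on the worst case (pure arithmetic, no real
> zswap/zram API; assumes every 4K fault decompresses the whole block):
>
> #include <stdio.h>
>
> int main(void)
> {
> 	const unsigned long page_kb = 4;
> 	unsigned long block_kb[] = { 16, 64, 2048 };
>
> 	for (int i = 0; i < 3; i++) {
> 		unsigned long pages = block_kb[i] / page_kb;
> 		/* Worst case: one fault per 4K page, each fault
> 		 * decompressing the entire block. */
> 		printf("%4luK block: %4lu pages, up to %7luK decompressed\n",
> 		       block_kb[i], pages, pages * block_kb[i]);
> 	}
> 	return 0;
> }
>
> A 2MB block read back one 4K page at a time can end up decompressing
> up to 1GB of data, which is why the block size choice matters.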
>
> Finally, the calculus might change once large folio allocation becomes
> more reliable. Perhaps we can wait until Johannes and Yu make this
> work?
>
>>
>> - Should we even support large folio swapin? You often have high swap
>> activity when the system/cgroup is close to running out of memory; at
>> that point, maybe the best way forward is to just swap in 4K pages
>> and let khugepaged [2], [3] collapse them if the surrounding pages
>> are swapped in as well.
>
> Perhaps this is the easiest thing to do :)
>