linux-mm.kvack.org archive mirror
From: Nhat Pham <nphamcs@gmail.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: YoungJun Park <youngjun.park@lge.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	 hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev,
	 mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev,  muchun.song@linux.dev,
	len.brown@intel.com, chengming.zhou@linux.dev,
	 chrisl@kernel.org, huang.ying.caritas@gmail.com,
	ryan.roberts@arm.com,  viro@zeniv.linux.org.uk,
	baohua@kernel.org, osalvador@suse.de,
	 lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu,
	pavel@kernel.org,  kernel-team@meta.com,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	 linux-pm@vger.kernel.org, peterx@redhat.com, gunho.lee@lge.com,
	 taejoon.song@lge.com, iamjoonsoo.kim@lge.com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
Date: Mon, 2 Jun 2025 11:29:53 -0700	[thread overview]
Message-ID: <CAKEwX=P4Q6jNQAi+H3sMQ73z-F-rG5jz8jj1NeGgUi=Pem_ZTQ@mail.gmail.com> (raw)
In-Reply-To: <CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com>

On Sun, Jun 1, 2025 at 9:15 AM Kairui Song <ryncsn@gmail.com> wrote:
>
>
> Hi All,

Thanks for sharing your setup, Kairui! I've always been curious about
multi-tier compression swapping.

>
> I'd like to share some info from my side. Currently we have an
> internal solution for multi-tier swap, implemented based on ZRAM and
> writeback: 4 compression levels and multiple block-layer levels. The
> ZRAM table serves a similar role to the swap table in the "swap table
> series" or the virtual layer here.
>
> We hacked the BIO layer to let ZRAM be Cgroup aware, so it even

Hmmm, this part seems a bit hacky to me too :-?

> supports per-cgroup priority, and per-cgroup writeback control, and it
> worked perfectly fine in production.
>
> The interface looks something like this:
> /sys/fs/cgroup/cg1/zram.prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]

How do you do aging with multiple tiers like this? Or do you just rely
on time thresholds, and have userspace invoke writeback in a cron-job
style?
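For concreteness, here is a sketch of how an agent might drive the knobs
quoted above. The file names come from Kairui's description; the interface
is out-of-tree, so CGROOT below is a throwaway stand-in directory rather
than a real cgroupfs mount, and the writeback_size unit is a guess from
the quoted [0 - 4K] range:

```shell
#!/bin/sh
# Hypothetical sketch of the per-cgroup zram knobs described in this
# thread. CGROOT stands in for /sys/fs/cgroup/cg1; on a real system
# these writes would only work with the out-of-tree patch applied.
CGROOT="$(mktemp -d)/cg1"
mkdir -p "$CGROOT"

# Compress new pages with the fastest stream (prio 1), recompress cold
# pages with the densest stream (prio 4), and cap per-cgroup writeback
# (4096 here; the unit is an assumption).
echo 1    > "$CGROOT/zram.prio"
echo 4    > "$CGROOT/zram.writeback_prio"
echo 4096 > "$CGROOT/zram.writeback_size"

echo "prio=$(cat "$CGROOT/zram.prio") wb_prio=$(cat "$CGROOT/zram.writeback_prio")"
```

Since the settings can be adjusted dynamically, an agent could rewrite
these files at any time without touching already swapped-out pages, as
Kairui notes below.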

Tbh, I'm surprised that we see a performance win with recompression. I
understand that different workloads might benefit the most from
different points in the Pareto frontier of latency-memory saving:
latency-sensitive workloads might like a fast compression algorithm,
whereas other workloads might prefer a compression algorithm that
saves more memory. So a per-cgroup compressor selection can make
sense.

However, would the overhead of moving a page from one tier to the
other not eat up all the benefit from the (usually small) extra memory
savings?

>
> It's really nothing fancy or complex: the four priorities are simply
> the four ZRAM compression streams that are already upstream, and you
> can simply hardcode four *bdev in "struct zram" and reuse the bits,
> then chain the write bio with a new underlying bio... Getting the
> priority info of a cgroup is even simpler once ZRAM is cgroup aware.
>
> All interfaces can be adjusted dynamically at any time (e.g. by an
> agent), and already swapped out pages won't be touched. The block
> devices are specified in ZRAM's sys files during swapon.
>
> It's easy to implement, but not a good idea for upstream at all:
> redundant layers, and performance is bad (if not optimized):
> - it breaks SWP_SYNCHRONOUS_IO, causing a huge slowdown, so we removed
> SWP_SYNCHRONOUS_IO completely, which actually improved the performance
> in every aspect (I've been trying to upstream this for a while);
> - ZRAM's block device allocator is just not good (just a bitmap) so we
> want to use the SWAP allocator directly (which I'm also trying to
> upstream with the swap table series);
> - And many other bits and pieces like bio batching are kind of broken,

Interesting, is zram doing writeback batching?

> busy loop due to the ZRAM_WB bit, etc...

Hmmm, this sounds like something the swap cache can help with. It's the
approach zswap writeback takes: concurrent accessors can get the
page in the swap cache, and OTOH zswap writeback backs off if it
detects swap cache contention (since the page is probably being
swapped in, freed, or written back by another thread).

But I'm not sure how zram writeback works...

> - Lacking support for things like effective migration/compaction,
> doable but looks horrible.
>
> So I definitely don't like this band-aid solution, but hey, it works.
> I'm looking forward to replacing it with native upstream support.
> That's one of the motivations behind the swap table series, which
> I think would resolve the problems in an elegant and clean way
> upstream. The initial tests do show it has a much lower overhead
> and cleans up SWAP.
>
> But maybe this is kind of similar to the "less optimized form" you
> are talking about? As I mentioned, I'm already trying to upstream
> some nice parts of it, and hopefully we can replace it all with an
> upstream solution eventually.
>
> I can try to upstream other parts of it if people are really
> interested, but I strongly recommend that we focus on the
> right approach instead and not waste time on that or spam the
> mailing list.

I suppose a lot of this is specific to zram, but bits and pieces of it
sound upstreamable to me :)

We can wait for YoungJun's patches/RFC for further discussion, but perhaps:

1. A new cgroup interface to select swap backends for a cgroup.

2. Writeback/fallback order either designated by the above interface,
or by the priority of the swap backends.
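The two ideas above might be sketched as follows. Everything here is
hypothetical: "memory.swap.tiers", the backend names, and the
availability flags are made-up placeholders to show the intended shape,
with a temp directory standing in for cgroupfs:

```shell
#!/bin/sh
# Hypothetical sketch of a per-cgroup swap backend interface.
# None of these files exist upstream; CG stands in for a cgroup dir.
CG="$(mktemp -d)"

# 1. A per-cgroup file listing swap backends in preference order
#    (names are placeholders).
echo "zswap zram ssd" > "$CG/memory.swap.tiers"

# 2. Writeback/fallback order follows that list: walk it and pick the
#    first backend that still has room (availability faked by flag
#    files here; zswap is deliberately left "full").
touch "$CG/zram.available" "$CG/ssd.available"
pick=""
for be in $(cat "$CG/memory.swap.tiers"); do
    if [ -e "$CG/$be.available" ]; then
        pick="$be"
        break
    fi
done
echo "would swap to: $pick"
```

If the order instead comes from existing swap device priorities, the
loop body stays the same and only the source of the list changes.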


>
> I have no special preference on how the final upstream interface
> should look. But currently SWAP devices already have priorities,
> so maybe we should just make use of that.


Thread overview: 30+ messages
2025-04-29 23:38 Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 01/18] swap: rearrange the swap header file Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 02/18] swapfile: rearrange functions Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 03/18] swapfile: rearrange freeing steps Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 04/18] mm: swap: add an abstract API for locking out swapoff Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 05/18] mm: swap: add a separate type for physical swap slots Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 06/18] mm: create scaffolds for the new virtual swap implementation Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 07/18] mm: swap: zswap: swap cache and zswap support for virtualized swap Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 08/18] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 09/18] swap: implement the swap_cgroup API using virtual swap Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 10/18] swap: manage swap entry lifetime at the virtual swap layer Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 11/18] mm: swap: temporarily disable THP swapin and batched freeing swap Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 12/18] mm: swap: decouple virtual swap slot from backing store Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 13/18] zswap: do not start zswap shrinker if there is no physical swap slots Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 14/18] memcg: swap: only charge " Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 15/18] vswap: support THP swapin and batch free_swap_and_cache Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 16/18] swap: simplify swapoff using virtual swap Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 17/18] swapfile: move zeromap setup out of enable_swap_info Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 18/18] swapfile: remove zeromap in virtual swap implementation Nhat Pham
2025-04-29 23:51 ` [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
2025-05-30  6:47 ` YoungJun Park
2025-05-30 16:52   ` Nhat Pham
2025-05-30 16:54     ` Nhat Pham
2025-06-01 12:56     ` YoungJun Park
2025-06-01 16:14       ` Kairui Song
2025-06-02 15:17         ` YoungJun Park
2025-06-02 18:29         ` Nhat Pham [this message]
2025-06-03  9:50           ` Kairui Song
2025-06-01 21:08       ` Nhat Pham
2025-06-02 15:03         ` YoungJun Park
