From: Nhat Pham <nphamcs@gmail.com>
Date: Mon, 2 Jun 2025 11:29:53 -0700
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
To: Kairui Song
Cc: YoungJun Park, linux-mm@kvack.org, akpm@linux-foundation.org,
    hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev,
    mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
    muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev,
    chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
    viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de,
    lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu,
    pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
    cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com,
    gunho.lee@lge.com, taejoon.song@lge.com, iamjoonsoo.kim@lge.com
References: <20250429233848.3093350-1-nphamcs@gmail.com>
On Sun, Jun 1, 2025 at 9:15 AM Kairui Song wrote:
>
> Hi All,

Thanks for sharing your setup, Kairui! I've always been curious about
multi-tier compression swapping.

>
> I'd like to share some info from my side. Currently we have an
> internal solution for multi-tier swap, implemented based on ZRAM and
> writeback: 4 compression levels and multiple block-layer levels. The
> ZRAM table serves a similar role to the swap table in the "swap table
> series" or the virtual layer here.
>
> We hacked the BIO layer to let ZRAM be cgroup aware, so it even

Hmmm, this part seems a bit hacky to me too :-?

> supports per-cgroup priority and per-cgroup writeback control, and it
> worked perfectly fine in production.
>
> The interface looks something like this:
> /sys/fs/cgroup/cg1/zram.prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]

How do you do aging with multiple tiers like this? Or do you just rely
on time thresholds, and have userspace invoke writeback in a cron-job
style?
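Just to make sure I'm picturing the right thing: by "cron-job style" I
mean roughly the toy agent below. It's a minimal sketch that only uses
the upstream per-device knobs (/sys/block/zramX/idle and
/sys/block/zramX/writeback, i.e. CONFIG_ZRAM_TRACK_ENTRY_ACTTIME plus
CONFIG_ZRAM_WRITEBACK); the device name, threshold and interval are
made up, and your per-cgroup knobs would presumably replace the global
ones:

#!/usr/bin/env python3
"""Toy cron-style zram aging agent (sketch, not production code)."""
import sys
import time

DEVICE = "/sys/block/zram0"  # assumed device name
IDLE_SECONDS = 3600          # "cold" threshold: untouched for an hour
INTERVAL = 600               # how often the agent wakes up

def write_knob(knob: str, value: str) -> None:
    # sysfs attributes want a fresh open/write for every command
    with open(f"{DEVICE}/{knob}", "w") as f:
        f.write(value)

def main() -> None:
    while True:
        # Mark entries untouched for IDLE_SECONDS as idle (the numeric
        # form needs CONFIG_ZRAM_TRACK_ENTRY_ACTTIME; otherwise only
        # "all" is accepted)...
        write_knob("idle", str(IDLE_SECONDS))
        # ...then push idle entries out to the backing device
        # ("huge"/"huge_idle"/"incompressible" are the other modes).
        write_knob("writeback", "idle")
        time.sleep(INTERVAL)

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        sys.exit(0)

The per-cgroup variant would presumably be the same loop, just pointed
at your zram.writeback_* files.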
Tbh, I'm surprised that we see a performance win with recompression. I
understand that different workloads might benefit the most from
different points on the Pareto frontier of latency vs. memory savings:
latency-sensitive workloads might like a fast compression algorithm,
whereas other workloads might prefer a compression algorithm that saves
more memory. So a per-cgroup compressor selection can make sense.
However, would the overhead of moving a page from one tier to the other
not eat up all the benefit from the (usually small) extra memory
savings?

> It's really nothing fancy and complex: the four priorities are simply
> the four ZRAM compression streams that are already upstream, and you
> can simply hardcode four *bdev in "struct zram" and reuse the bits,
> then chain the write bio with a new bio to the underlying device...
> Getting the priority info of a cgroup is even simpler once ZRAM is
> cgroup aware.
>
> All interfaces can be adjusted dynamically at any time (e.g. by an
> agent), and already swapped-out pages won't be touched. The block
> devices are specified in ZRAM's sys files during swapon.
>
> It's easy to implement, but not a good idea for upstream at all:
> redundant layers, and performance is bad (if not optimized):
> - it breaks SYNCHRONOUS_IO, causing a huge slowdown, so we removed
>   SYNCHRONOUS_IO completely, which actually improved performance in
>   every aspect (I've been trying to upstream this for a while);
> - ZRAM's block device allocator is just not good (just a bitmap), so
>   we want to use the SWAP allocator directly (which I'm also trying
>   to upstream with the swap table series);
> - And many other bits and pieces like bio batching are kind of broken,

Interesting, is zram doing writeback batching?

> busy loop due to the ZRAM_WB bit, etc...

Hmmm, this sounds like something the swap cache can help with. That's
the approach zswap writeback takes: concurrent accessors can get the
page from the swap cache, and OTOH zswap writeback backs off if it
detects swap cache contention (since the page is probably being swapped
in, freed, or written back by another thread). But I'm not sure how
zram writeback works...

> - Lacking support for things like effective migration/compaction,
>   doable but it looks horrible.
>
> So I definitely don't like this band-aid solution, but hey, it works.
> I'm looking forward to replacing it with native upstream support.
> That's one of the motivations behind the swap table series, which I
> think would resolve the problems in an elegant and clean way
> upstream. The initial tests do show it has a much lower overhead and
> cleans up SWAP.
>
> But maybe this is kind of similar to the "less optimized form" you
> are talking about? As I mentioned, I'm already trying to upstream
> some nice parts of it, and hopefully we can replace it with an
> upstream solution eventually.
>
> I can try to upstream other parts of it if people are really
> interested, but I strongly recommend that we focus on the right
> approach instead, and not waste time on that and spam the mailing
> list.

I suppose a lot of this is specific to zram, but bits and pieces of it
sound upstreamable to me :)

We can wait for YoungJun's patches/RFC for further discussion, but
perhaps:

1. A new cgroup interface to select swap backends for a cgroup.

2. Writeback/fallback order either designated by the above interface,
   or by the priority of the swap backends (see the toy sketch at the
   end of this mail).

> I have no special preference on how the final upstream interface
> should look. But currently SWAP devices already have priorities, so
> maybe we should just make use of that.
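To make (2) a bit more concrete, here is a toy userspace model of what
I mean by deriving a cgroup's fallback order from the existing swap
device priorities, with the cgroup interface only choosing which
backends are allowed. This is just an illustration: the backend names,
priorities and the "allowed" knob are all made up, and none of this is
kernel code.

#!/usr/bin/env python3
"""Toy model: per-cgroup backend selection + priority-ordered fallback."""
from dataclasses import dataclass

@dataclass
class SwapBackend:
    name: str
    prio: int  # same semantics as today's swap priority: higher goes first

# System-wide swap backends, e.g. as set up with swapon -p.
BACKENDS = [
    SwapBackend("zram-fast",    prio=100),  # cheap, fast compression
    SwapBackend("zram-dense",   prio=50),   # slower, better ratio
    SwapBackend("ssd-swapfile", prio=10),   # real block device
]

def fallback_order(allowed: set[str]) -> list[str]:
    """A cgroup's writeback/fallback order: its allowed backends, walked
    in descending priority, so no extra per-cgroup ordering knob is
    needed."""
    usable = [b for b in BACKENDS if b.name in allowed]
    return [b.name for b in sorted(usable, key=lambda b: b.prio,
                                   reverse=True)]

# Latency-sensitive cgroup: compressed tiers only, never hits the SSD.
print(fallback_order({"zram-fast", "zram-dense"}))
# -> ['zram-fast', 'zram-dense']

# Cold batch cgroup: skip the fast tier, go dense zram -> SSD.
print(fallback_order({"zram-dense", "ssd-swapfile"}))
# -> ['zram-dense', 'ssd-swapfile']

A per-cgroup priority override (like your zram.prio) would then only be
needed when a cgroup wants an order that differs from the global one.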