From: Kairui Song
Date: Tue, 3 Jun 2025 17:50:01 +0800
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
To: Nhat Pham
Cc: YoungJun Park, linux-mm@kvack.org, akpm@linux-foundation.org,
 hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev,
 mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
 muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev,
 chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
 viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de,
 lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org,
 kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
 linux-pm@vger.kernel.org, peterx@redhat.com, gunho.lee@lge.com,
 taejoon.song@lge.com, iamjoonsoo.kim@lge.com
References: <20250429233848.3093350-1-nphamcs@gmail.com>

On Tue, Jun 3, 2025 at 2:30 AM Nhat Pham wrote:
>
> On Sun, Jun 1, 2025 at 9:15 AM Kairui Song wrote:
> >
> >
> > Hi All,
>
> Thanks for sharing your setup, Kairui! I've always been curious about
> multi-tier compression swapping.
>
> >
> > I'd like to share some info from my side. Currently we have an
> > internal solution for multi-tier swap, implemented on top of ZRAM and
> > writeback: four compression levels and multiple block-layer levels. The
> > ZRAM table serves a similar role to the swap table in the "swap table
> > series" or the virtual layer here.
> >
> > We hacked the BIO layer to let ZRAM be cgroup aware, so it even
>
> Hmmm this part seems a bit hacky to me too :-?

Yeah, terribly hackish :P One of the reasons why I'm trying to retire it.

>
> > supports per-cgroup priority and per-cgroup writeback control, and it
> > has worked perfectly fine in production.
> >
> > The interface looks something like this:
> > /sys/fs/cgroup/cg1/zram.prio: [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]
>
> How do you do aging with multiple tiers like this? Or do you just rely
> on time thresholds, and have userspace invoke writeback in a cron-job
> style?

ZRAM already has a time threshold, and I added another LRU for swapped-out
entries. Aging is supposed to be done by userspace agents; I didn't go into
it here since it is becoming less relevant to the upstream implementation.
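For reference, the upstream equivalent (minus the cgroup awareness) is
roughly the documented idle-marking + writeback knobs, so an aging agent can
be as small as the sketch below (assuming a zram0 device with a backing
device configured and CONFIG_ZRAM_TRACK_ENTRY_ACTIME enabled; see
Documentation/admin-guide/blockdev/zram.rst):

/* Minimal userspace aging agent: mark entries untouched for the last
 * hour as idle, then ask zram to write idle pages to the backing device. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int write_sysfs(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	write_sysfs("/sys/block/zram0/idle", "3600");
	write_sysfs("/sys/block/zram0/writeback", "idle");
	return 0;
}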
> Tbh, I'm surprised that we see a performance win with recompression. I
> understand that different workloads might benefit the most from
> different points in the Pareto frontier of latency vs. memory saving:
> latency-sensitive workloads might like a fast compression algorithm,
> whereas other workloads might prefer a compression algorithm that
> saves more memory. So a per-cgroup compressor selection can make
> sense.
>
> However, would the overhead of moving a page from one tier to the
> other not eat up all the benefit from the (usually small) extra memory
> savings?

So far we are not re-compressing things, but the per-cgroup compression /
writeback level is indeed useful. Compressed memory gets written back to
the block device, and that is a large gain.

> > It's really nothing fancy or complex: the four priorities are simply the
> > four ZRAM compression streams that are already in upstream, and you can
> > simply hardcode four *bdev in "struct zram" and reuse the bits, then
> > chain the write bio with a new underlying bio... Getting the priority
> > info of a cgroup is even simpler once ZRAM is cgroup aware.
> >
> > All interfaces can be adjusted dynamically at any time (e.g. by an
> > agent), and already swapped-out pages won't be touched. The block
> > devices are specified in ZRAM's sysfs files during swapon.
> >
> > It's easy to implement, but not a good idea for upstream at all:
> > redundant layers, and performance is bad (if not optimized):
> > - it breaks SYNCHRONOUS_IO, causing a huge slowdown, so we removed
> > SYNCHRONOUS_IO completely, which actually improved the performance in
> > every aspect (I've been trying to upstream this for a while);
> > - ZRAM's block device allocator is just not good (just a bitmap), so we
> > want to use the SWAP allocator directly (which I'm also trying to
> > upstream with the swap table series);
> > - And many other bits and pieces like bio batching are kind of broken,
>
> Interesting, is zram doing writeback batching?

Nope, it even has a comment saying "XXX: A single page IO would be
inefficient for write". We managed to chain bios on the initial page
writeback, but it's still not an ideal design.

> > busy loop due to the ZRAM_WB bit, etc...
>
> Hmmm, this sounds like something the swap cache can help with. It's the
> approach zswap writeback is taking - concurrent accessors can get the
> page in the swap cache, and OTOH zswap writeback backs off if it
> detects swap cache contention (since the page is probably being
> swapped in, freed, or written back by another thread).
>
> But I'm not sure how zram writeback works...

Yeah, any bit-lock design suffers from a similar problem (like
SWAP_HAS_CACHE). I think we should just use the folio lock or the folio
writeback flag in the long term; it works extremely well as a generic
infrastructure (which I'm trying to push upstream), and we don't need any
extra locking, minimizing memory / design overhead.
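To make that concrete, the pattern I have in mind is roughly the sketch
below (just an illustration of the folio lock / writeback idea, not actual
zram or swap code; bio construction and submission are elided):

#include <linux/page-flags.h>
#include <linux/pagemap.h>

/* Writer side: take the folio lock while setting the writeback flag,
 * then submit the I/O; the completion path calls folio_end_writeback(). */
static void start_writeback_one(struct folio *folio)
{
	folio_lock(folio);
	folio_start_writeback(folio);
	folio_unlock(folio);

	/* ... build and submit the bio for this folio here ... */
}

/* Reader side (e.g. a concurrent swap-in): instead of busy looping on a
 * private ZRAM_WB-style bit, just sleep until the writeback completes. */
static void wait_writeback_one(struct folio *folio)
{
	folio_wait_writeback(folio);
}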
> > - Lacking support for things like effective migration/compaction;
> > doable, but it looks horrible.
> >
> > So I definitely don't like this band-aid solution, but hey, it works.
> > I'm looking forward to replacing it with native upstream support.
> > That's one of the motivations behind the swap table series, which I
> > think would resolve these problems in an elegant and clean way
> > upstream. The initial tests do show it has a much lower overhead
> > and cleans up SWAP.
> >
> > But maybe this is kind of similar to the "less optimized form" you
> > are talking about? As I mentioned, I'm already trying to upstream
> > some nice parts of it, and hopefully replace it with an upstream
> > solution eventually.
> >
> > I can try to upstream other parts of it if people are really
> > interested, but I strongly recommend that we focus on the right
> > approach instead and not waste time on that and spam the mailing
> > list.
>
> I suppose a lot of this is specific to zram, but bits and pieces of it
> sound upstreamable to me :)
>
> We can wait for YoungJun's patches/RFC for further discussion, but perhaps:
>
> 1. A new cgroup interface to select swap backends for a cgroup.
>
> 2. Writeback/fallback order either designated by the above interface,
> or by the priority of the swap backends.

Fully agree, the final interface and features definitely need more
discussion and collaboration upstream...
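Just to make point 2 a bit more concrete, the fallback order I picture is
roughly the sketch below (purely illustrative; none of these names exist
upstream, and the real design obviously needs the discussion above):

#include <linux/types.h>

#define MAX_SWAP_TIERS	4

/* Hypothetical per-cgroup policy: preferred backends in order,
 * terminated by -1 (e.g. compressed tier first, then SSD, then HDD). */
struct swap_tier_policy {
	int order[MAX_SWAP_TIERS];
};

static int pick_swap_tier(const struct swap_tier_policy *pol,
			  bool (*tier_has_space)(int tier))
{
	int i;

	for (i = 0; i < MAX_SWAP_TIERS && pol->order[i] >= 0; i++) {
		/* fall back to the next tier when this one is full */
		if (tier_has_space(pol->order[i]))
			return pol->order[i];
	}
	return -1;	/* no usable tier; caller keeps the page in memory */
}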