From: Kairui Song <ryncsn@gmail.com>
Date: Mon, 2 Jun 2025 00:14:53 +0800
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
To: YoungJun Park
Cc: Nhat Pham <nphamcs@gmail.com>, linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com, gunho.lee@lge.com, taejoon.song@lge.com, iamjoonsoo.kim@lge.com
References: <20250429233848.3093350-1-nphamcs@gmail.com>
On Sun, Jun 1, 2025 at 8:56 PM YoungJun Park wrote:
>
> On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > On Thu, May 29, 2025 at 11:47 PM YoungJun Park wrote:
> > >
> > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > Changelog:
> > > > * v2:
> > > >   * Use a single atomic type (swap_refs) for reference counting
> > > >     purposes. This brings the size of the swap descriptor from 64
> > > >     bytes down to 48 bytes (25% reduction). Suggested by Yosry Ahmed.
> > > >   * Zeromap bitmap is removed in the virtual swap implementation.
> > > >     This saves one bit per physical swapfile slot.
> > > >   * Rearrange the patches and the code changes to make things more
> > > >     reviewable. Suggested by Johannes Weiner.
> > > >   * Update the cover letter a bit.
> > >
> > > Hi Nhat,
> > >
> > > Thank you for sharing this patch series.
> > > I've read through it with great interest.
> > >
> > > I'm part of a kernel team working on features related to multi-tier
> > > swapping, and this patch set appears quite relevant to our ongoing
> > > discussions and early-stage implementation.
> >
> > May I ask - what's the use case you're thinking of here? Remote swapping?
>
> Yes, that's correct.
> Our usage scenario includes remote swap, and we're experimenting with
> assigning swap tiers per cgroup to improve performance in specific
> scenarios on our target devices.
>
> We've explored several approaches and PoCs around this, and in the
> process of evaluating whether our direction could eventually be
> aligned with the upstream kernel, I came across your patchset and
> wanted to ask whether similar efforts have been discussed or
> attempted before.
>
> > > I had a couple of questions regarding the future direction.
> > >
> > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > >   transferring (promotion/demotion) of pages across tiers (see [8]
> > > >   and [9]). Similar to swapoff, with the old design we would need
> > > >   to perform the expensive page table walk.
> > >
> > > Based on the discussion in [5], it seems there was some exploration
> > > around enabling per-cgroup selection of multiple tiers.
> > > Do you envision the current design evolving in a similar direction
> > > to those past discussions, or is there a different direction you're
> > > aiming for?
> >
> > IIRC, that past design focused on the interface aspect of the
> > problem, but never actually touched the mechanism to implement a
> > multi-tier swapping solution.
> >
> > The simple reason is it's impossible, or at least highly inefficient,
> > to do it in the current design, i.e. without virtualizing swap. Storing
>
> As you pointed out, there are certainly inefficiencies in supporting
> this use case with the current design, but if there is a valid use
> case, I believe there's room for it to be supported in the current
> model, possibly in a less optimized form, until a virtual swap device
> becomes available and provides a more efficient solution.
> What do you think?

Hi All,

I'd like to share some info from my side.

We currently have an internal solution for multi-tier swap, implemented
on top of ZRAM and its writeback: four compression levels plus multiple
block-layer levels. The ZRAM table serves a similar role to the swap
table in the "swap table series", or to the virtual layer here.

We hacked the BIO layer to make ZRAM cgroup aware, so it even supports
per-cgroup priority and per-cgroup writeback control, and it has worked
perfectly fine in production. The interface looks something like this:

/sys/fs/cgroup/cg1/zram.prio: [1-4]
/sys/fs/cgroup/cg1/zram.writeback_prio: [1-4]
/sys/fs/cgroup/cg1/zram.writeback_size: [0 - 4K]

It's really nothing fancy or complex: the four priorities are simply
the four ZRAM compression streams already in upstream, and you can
simply hardcode four *bdev in "struct zram", reuse the existing bits,
and chain the write bio to a new lower-level bio (see the sketch at the
end of this mail). Getting the priority info of a cgroup is even
simpler once ZRAM is cgroup aware. All interfaces can be adjusted
dynamically at any time (e.g. by an agent), and pages that were already
swapped out won't be touched. The backing block devices are specified
in ZRAM's sysfs files during swapon.

It's easy to implement, but not a good idea for upstream at all:
redundant layers, and the performance is bad (if not optimized):

- It breaks SYNCHRONOUS_IO, causing a huge slowdown, so we removed the
  SYNCHRONOUS_IO path completely, which actually improved performance
  in every aspect (I've been trying to upstream this for a while).
- ZRAM's block device allocator is just not good (just a bitmap), so we
  want to use the SWAP allocator directly (which I'm also trying to
  upstream with the swap table series).
- Many other bits and pieces are kind of broken: bio batching, busy
  looping due to the ZRAM_WB bit, etc.
- It lacks support for things like effective migration/compaction;
  doable, but it looks horrible.

So I definitely don't like this band-aid solution, but hey, it works.
I'm looking forward to replacing it with native upstream support.
That's one of the motivations behind the swap table series, which I
think would resolve these problems in an elegant and clean way
upstream. Initial tests do show it has much lower overhead and cleans
up SWAP.

But maybe this is kind of similar to the "less optimized form" you are
talking about? As I mentioned, I'm already trying to upstream some of
the nicer parts of it, and hopefully will replace it with an upstream
solution in the end. I can try to upstream other parts too if people
are really interested, but I strongly recommend we focus on the right
approach instead, rather than waste time on that and spam the mailing
list.
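Just for illustration, here is roughly what the hack looks like. This
is a simplified sketch, not our actual internal code: ZRAM_NR_TIERS,
zram_memcg_writeback_prio() and zram_get_backing_sector() are made-up
names standing in for the real pieces, and all error handling is
omitted:

#include <linux/bio.h>
#include <linux/blkdev.h>

#define ZRAM_NR_TIERS	4

struct zram {
	/* ... existing upstream fields ... */
	/* one backing device per tier, set up via sysfs before swapon */
	struct block_device *backing_bdev[ZRAM_NR_TIERS];
};

/*
 * Write one page out to the backing device selected by the owning
 * cgroup's writeback priority.
 */
static void zram_writeback_tiered(struct zram *zram, struct bio *parent,
				  struct page *page, unsigned long index)
{
	/* hypothetical helper: the per-cgroup value written to
	 * /sys/fs/cgroup/<cg>/zram.writeback_prio, in [1-4] */
	int tier = zram_memcg_writeback_prio(page) - 1;
	struct bio *bio;

	bio = bio_alloc(zram->backing_bdev[tier], 1, REQ_OP_WRITE,
			GFP_NOIO);
	/* hypothetical helper: slot allocated on the backing device */
	bio->bi_iter.bi_sector = zram_get_backing_sector(zram, index);
	__bio_add_page(bio, page, PAGE_SIZE, 0);

	/* chain under the incoming swap bio, so the original bio only
	 * completes once the lower-level write has finished */
	bio_chain(bio, parent);
	submit_bio(bio);
}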
I have no special preference on how the final upstream interface should
look. But SWAP devices already have priorities today, so maybe we
should just make use of that.
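(For completeness: those per-device priorities are already settable
from userspace at swapon time. A minimal example, with made-up device
paths, using the glibc wrapper for swapon(2); needs CAP_SYS_ADMIN:)

#include <sys/swap.h>

int main(void)
{
	/* fast tier: prefer this device, priority 100 */
	swapon("/dev/zram0",
	       SWAP_FLAG_PREFER | (100 << SWAP_FLAG_PRIO_SHIFT));
	/* slow tier: disk/remote device, lower priority 10 */
	swapon("/dev/sdb1",
	       SWAP_FLAG_PREFER | (10 << SWAP_FLAG_PRIO_SHIFT));
	return 0;
}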