From: Nhat Pham
Date: Fri, 28 Mar 2025 10:41:23 -0400
Subject: Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
To: lsf-pc@lists.linux-foundation.org, akpm@linux-foundation.org,
    hannes@cmpxchg.org
Cc: ryncsn@gmail.com, chengming.zhou@linux.dev, yosryahmed@google.com,
    chrisl@kernel.org, linux-mm@kvack.org, kernel-team@meta.com,
    linux-kernel@vger.kernel.org, shakeel.butt@linux.dev, hch@infradead.org,
    hughd@google.com, 21cnbao@gmail.com, usamaarif642@gmail.com
References: <20250116092254.204549-1-nphamcs@gmail.com>
In-Reply-To: <20250116092254.204549-1-nphamcs@gmail.com>

On Thu, Jan 16, 2025 at 4:22 AM Nhat Pham wrote:
>
> My apologies if I missed any interested party in the cc list -
> hopefully the mailing list cc's suffice :)
>
> I'd like to (re-)propose the topic of a swap abstraction layer for the
> conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> (see [1], [2], [3]).
>
> (AFAICT, the same idea has been floated by Rik van Riel since at
> least 2011 - see [8]).
>
> I have a working(-ish) prototype, which hopefully will be
> submission-ready soon. For now, I'd like to give the motivation/context
> for the topic, as well as some high level design:
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is arguably the central shortcoming of
> zswap:
> * In deployments where no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burden
>   for developers, who have to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and
>   limits the memory-saving potential of these optimizations to the
>   static size of the swapfile, especially in high-memory systems that
>   can have up to terabytes of memory. It also creates significant
>   challenges for users who rely on swap utilization as an early OOM
>   signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that swapoff requires
> a full page table walk to update all the page table entries that refer
> to this swap entry, and must also update all the associated swap data
> structures (swap cache, etc.).
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> "virtualize" the swap space: swap clients will work with a virtual swap
> slot (that is dynamically allocated on demand), storing it in page
> table entries, and using it to index into various swap-related data
> structures.
>
> The backing storage is decoupled from this slot, and the newly
> introduced layer will "resolve" the ID to the actual storage, and
> cooperate with the swap cache to handle all the required
> synchronization.
> This layer also manages other metadata of the swap
> entry, such as its lifetime information (swap count), via a dynamically
> allocated per-entry swap descriptor:
>
> struct swp_desc {
>         swp_entry_t vswap;              /* virtual swap slot */
>         union {                         /* current backing storage */
>                 swp_slot_t slot;        /* slot on the swapfile */
>                 struct folio *folio;    /* page in memory */
>                 struct zswap_entry *zswap_entry;
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;
>         enum swap_type type;
>
> #ifdef CONFIG_MEMCG
>         atomic_t memcgid;
> #endif
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;            /* lifetime information */
> };
>
> This design allows us to:
> * Decouple zswap (and zero-mapped swap entries) from the backing
>   swapfile: simply associate the swap ID with one of the supported
>   backends: a zswap entry, a zero-filled swap page, a slot on the
>   swapfile, or a page in memory.
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the swap ID point to the page instead of the on-disk swap slot.
>   No need to perform any page table walking :)
>
> III. Future Use Cases
>
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). As with swapoff, the old design would require an expensive
>   page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): once you have pinned down the
>   backing store of a THP, you can dispatch each range of subpages
>   to the appropriate pagein handler.
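
For anyone who finds code easier to read than prose, here is a toy
userspace model of the indirection described in section II. It is only a
sketch and not code from the prototype: the fixed-size table, the backend
enum, and every function name (vswap_alloc, vswap_resolve, vswap_swapoff)
are made up for illustration, and locking, RCU and refcounting are left
out entirely. The point it tries to show is that a page table entry would
hold only the virtual ID, so moving data between backends means updating
one descriptor.

```c
#include <assert.h>

/* Hypothetical model; none of these names are kernel APIs. */
enum backend_type {
	BACKEND_NONE = 0,	/* descriptor is free */
	BACKEND_ZERO,		/* zero-filled page, no storage needed */
	BACKEND_ZSWAP,		/* compressed copy in the zswap pool */
	BACKEND_SWAPFILE,	/* physical slot on the swapfile */
	BACKEND_FOLIO,		/* page resident in memory */
};

struct vswap_desc {
	enum backend_type type;	/* which backend holds the data now */
	unsigned long backing;	/* backend-specific handle (slot, index, ...) */
	int swap_count;		/* how many "PTEs" store this virtual ID */
};

#define NR_VSWAP 8
static struct vswap_desc vswap_table[NR_VSWAP];

/* Allocate a virtual slot on demand; returns its ID, or -1 if full. */
int vswap_alloc(enum backend_type type, unsigned long backing)
{
	assert(type != BACKEND_NONE);
	for (int id = 0; id < NR_VSWAP; id++) {
		if (vswap_table[id].type == BACKEND_NONE) {
			vswap_table[id].type = type;
			vswap_table[id].backing = backing;
			vswap_table[id].swap_count = 1;
			return id;
		}
	}
	return -1;
}

/* Resolve a virtual ID to the descriptor naming its current backend. */
const struct vswap_desc *vswap_resolve(int id)
{
	assert(id >= 0 && id < NR_VSWAP);
	return &vswap_table[id];
}

/*
 * Model of swapoff: every descriptor backed by the swapfile is pointed
 * at an in-memory page instead. Holders of the virtual ID are never
 * visited. Returns the number of descriptors that were retargeted.
 */
int vswap_swapoff(void)
{
	int moved = 0;

	for (int id = 0; id < NR_VSWAP; id++) {
		if (vswap_table[id].type == BACKEND_SWAPFILE) {
			vswap_table[id].type = BACKEND_FOLIO;
			vswap_table[id].backing = 0; /* stand-in for a folio pointer */
			moved++;
		}
	}
	return moved;
}
```

In this model, an ID returned by vswap_alloc(BACKEND_SWAPFILE, 42) stays
the same across vswap_swapoff(); only what vswap_resolve() reports for it
changes, which is the property the page-table-walk-free swapoff relies on.
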
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/

Link to my slides:

https://drive.google.com/file/d/1mn2kSczvEzwq7j55iKhVB3SP67Qy4KU2/view?usp=sharing

Thank you for your interest!