From: Yosry Ahmed
Date: Thu, 16 Jan 2025 10:47:39 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
To: Nhat Pham
Cc: lsf-pc@lists.linux-foundation.org, akpm@linux-foundation.org,
 hannes@cmpxchg.org, ryncsn@gmail.com, chengming.zhou@linux.dev,
 chrisl@kernel.org, linux-mm@kvack.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, shakeel.butt@linux.dev, hch@infradead.org,
 hughd@google.com, 21cnbao@gmail.com, usamaarif642@gmail.com
In-Reply-To: <20250116092254.204549-1-nphamcs@gmail.com>
References: <20250116092254.204549-1-nphamcs@gmail.com>

On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham wrote:
>
> My apologies if I missed any interested party in the cc list -
> hopefully the mailing lists cc's suffice :)
>
> I'd like to (re-)propose the topic of swap abstraction layer for the
> conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> (see [1], [2], [3]).
>
> (AFAICT, the same idea has been floated by Rik van Riel since at
> least 2011 - see [8]).
>
> I have a working(-ish) prototype, which hopefully will be
> submission-ready soon. For now, I'd like to give the motivation/context
> for the topic, as well as some high level design:

I would obviously be interested in attending this, albeit virtually if
possible. Just sharing some random thoughts below from my cold cache.

>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is arguably the central shortcoming of
> zswap:
> * In deployments where no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and
>   limits the memory-saving potential of these optimizations by the
>   static size of the swapfile, especially in high memory systems that
>   can have up to terabytes worth of memory. It also creates significant
>   challenges for users who rely on swap utilization as an early OOM
>   signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that it requires a
> whole page table walk to update all the page table entries that refer to
> this swap entry, as well as updating all the associated swap data
> structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a virtual swap
> slot (that is dynamically allocated on-demand), storing it in page
> table entries, and using it to index into various swap-related data
> structures.
>
> The backing storage is decoupled from this slot, and the newly
> introduced layer will “resolve” the ID to the actual storage, as well
> as cooperating with the swap cache to handle all the required
> synchronization. This layer also manages other metadata of the swap
> entry, such as its lifetime information (swap count), via a dynamically
> allocated per-entry swap descriptor:

Do you plan to allocate one per-folio or per-page? I suppose it's
per-page based on the design, but I am wondering if you explored having
it per-folio. To make it work we'd need to support splitting a
swp_desc, and figuring out which slot or zswap_entry corresponds to a
certain page in a folio.
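
To make the indirection in the quoted overview a bit more concrete, here
is a rough userspace model in C. It is only a sketch and not code from
the prototype: vswap_desc, vswap_table and all function names are
hypothetical, and a flat array stands in for whatever lookup structure
the real layer would use. What it tries to show is that the page table
entry only ever holds the virtual ID, so moving the data from the
swapfile back into memory (as swapoff would) rewrites the descriptor and
leaves the "PTE" untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend kinds, loosely following the backends named in the
 * proposal: swapfile slot, zswap entry, zero-filled page, in-memory page. */
enum vswap_type {
	VSWAP_FREE = 0,
	VSWAP_SWAPFILE,
	VSWAP_ZSWAP,
	VSWAP_ZERO,
	VSWAP_FOLIO,
};

/* Simplified stand-in for a per-entry descriptor (assumption: no locking,
 * no refcounting, no memcg data). */
struct vswap_desc {
	enum vswap_type type;
	union {
		uint64_t slot;      /* physical slot on the swapfile */
		void *zswap_entry;  /* compressed copy */
		void *folio;        /* page resident in memory */
	};
};

#define VSWAP_NR 16
static struct vswap_desc vswap_table[VSWAP_NR];

/* Allocate a virtual slot backed by a physical swapfile slot.
 * Returns the virtual ID (what a PTE would store), or -1 if full. */
static int vswap_alloc_on_swapfile(uint64_t slot)
{
	for (int id = 0; id < VSWAP_NR; id++) {
		if (vswap_table[id].type == VSWAP_FREE) {
			vswap_table[id].type = VSWAP_SWAPFILE;
			vswap_table[id].slot = slot;
			return id;
		}
	}
	return -1;
}

/* "Resolve" a virtual ID to its descriptor, or NULL if it is not in use. */
static struct vswap_desc *vswap_resolve(int id)
{
	if (id < 0 || id >= VSWAP_NR || vswap_table[id].type == VSWAP_FREE)
		return NULL;
	return &vswap_table[id];
}

/* Model of swapoff for a single entry: the data has been read into `folio`,
 * so only the descriptor is retargeted. The virtual ID does not change. */
static int vswap_retarget_to_folio(int id, void *folio)
{
	struct vswap_desc *desc = vswap_resolve(id);

	if (!desc || desc->type != VSWAP_SWAPFILE)
		return -1;
	desc->type = VSWAP_FOLIO;
	desc->folio = folio;
	return 0;
}
```

A caller would keep the value returned by vswap_alloc_on_swapfile() as
its "PTE" and call vswap_resolve() on it both before and after
vswap_retarget_to_folio(); the ID stays the same while the backend
changes underneath it.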
>
> struct swp_desc {
>     swp_entry_t vswap;
>     union {
>         swp_slot_t slot;
>         struct folio *folio;
>         struct zswap_entry *zswap_entry;
>     };
>     struct rcu_head rcu;
>
>     rwlock_t lock;
>     enum swap_type type;
>
> #ifdef CONFIG_MEMCG
>     atomic_t memcgid;
> #endif
>
>     atomic_t in_swapcache;
>     struct kref refcnt;
>     atomic_t swap_count;
> };

That seems a bit large. I am assuming this is for the purpose of the
prototype and we can reduce its size eventually, right?

Particularly, I remember looking into merging the swap_count and
refcnt, and I am not sure what in_swapcache is (is this a bit? Why
can't we use a bit from swap_count?).

I also think we can shove the swap_type in the low bits of the
pointers (with some finesse for swp_slot_t), and the locking could be
made less granular (I remember exploring going completely lockless,
but I don't remember how that turned out).

>
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
>   simply associate the swap ID with one of the supported backends: a
>   zswap entry, a zero-filled swap page, a slot on the swapfile, or a
>   page in memory.
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the swap ID point to the page instead of the on-disk swap slot.
>   No need to perform any page table walking :)

It also allows us to delete the complex swap count continuation code.

>
> III. Future Use Cases
>
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
>   backing store of THPs, then you can dispatch each range of subpages
>   to the appropriate pagein handler.
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
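
Coming back to the size of swp_desc: below is one way the two packing
ideas from my comments on the struct might look, as a userspace C
sketch. Everything here is an assumption made for illustration, not the
prototype's layout: the backing_*() and count_*() helpers are made-up
names, the 2-bit tag width relies on the pointed-to objects being at
least 4-byte aligned, the "finesse" for swp_slot_t is modeled as simply
shifting the slot value left by the tag width, and the counter is a
plain integer where real code would need atomic operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical backend tags kept in the low bits of one word. */
enum swap_type {
	SWAP_SLOT  = 0,  /* word holds a shifted swapfile slot, not a pointer */
	SWAP_FOLIO = 1,
	SWAP_ZSWAP = 2,
	SWAP_ZERO  = 3,
};

#define TYPE_BITS 2
#define TYPE_MASK ((uintptr_t)((1u << TYPE_BITS) - 1))

/* Pack a pointer and its type tag into a single word. */
static uintptr_t backing_pack_ptr(void *ptr, enum swap_type type)
{
	/* The low bits must be free, i.e. the object is suitably aligned. */
	assert(((uintptr_t)ptr & TYPE_MASK) == 0);
	return (uintptr_t)ptr | (uintptr_t)type;
}

/* Slots are not pointers, so shift them to make room for the tag. */
static uintptr_t backing_pack_slot(uintptr_t slot)
{
	return (slot << TYPE_BITS) | (uintptr_t)SWAP_SLOT;
}

static enum swap_type backing_type(uintptr_t word)
{
	return (enum swap_type)(word & TYPE_MASK);
}

static void *backing_ptr(uintptr_t word)
{
	return (void *)(word & ~TYPE_MASK);
}

static uintptr_t backing_slot(uintptr_t word)
{
	return word >> TYPE_BITS;
}

/* Fold the in_swapcache flag into the top bit of a 32-bit swap count.
 * Non-atomic on purpose; this only shows the bit layout. */
#define COUNT_IN_SWAPCACHE (UINT32_C(1) << 31)
#define COUNT_VALUE_MASK   (~COUNT_IN_SWAPCACHE)

static uint32_t count_inc(uint32_t c)
{
	/* Refuse to overflow into the flag bit. */
	assert((c & COUNT_VALUE_MASK) != COUNT_VALUE_MASK);
	return c + 1;
}

static uint32_t count_set_cached(uint32_t c)   { return c | COUNT_IN_SWAPCACHE; }
static uint32_t count_clear_cached(uint32_t c) { return c & COUNT_VALUE_MASK; }
static uint32_t count_value(uint32_t c)        { return c & COUNT_VALUE_MASK; }
static int count_in_swapcache(uint32_t c)      { return (c & COUNT_IN_SWAPCACHE) != 0; }
```

With this layout the union, the enum and the separate in_swapcache
field collapse into one pointer-sized word plus one 32-bit counter. How
this interacts with merging swap_count and refcnt, or with the locking,
is not addressed by the sketch.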