From: Kairui Song
Date: Wed, 9 Apr 2025 00:22:58 +0800
Subject: Re: [RFC PATCH 00/14] Virtual Swap Space
To: Nhat Pham
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
    hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org,
    roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev,
    len.brown@intel.com, chengming.zhou@linux.dev, chrisl@kernel.org,
    huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
    viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de,
    lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org,
    kernel-team@meta.com, linux-kernel@vger.kernel.org,
    cgroups@vger.kernel.org, linux-pm@vger.kernel.org
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>

On Tue, Apr 8, 2025 at 7:47 AM Nhat Pham wrote:
>
> This RFC implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner.
> The same idea (with different implementation details) has been
> floated by Rik van Riel since at least 2011 (see [8]).
>
> The code attached to this RFC is purely a prototype. It is not 100%
> merge-ready (see section VI for future work). I do, however, want to show
> people this prototype/RFC, including all the bells and whistles and a
> couple of actual use cases, so that folks can see what the end results
> will look like, and give me early feedback :)
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is the arguably central shortcoming of
> zswap:
> * In deployments when no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and
>   limits the memory saving potentials of these optimizations by the
>   static size of the swapfile, especially in high memory systems that
>   can have up to terabytes worth of memory. It also creates significant
>   challenges for users who rely on swap utilization as an early OOM
>   signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that it requires a
> whole page table walk to update all the page table entries that refer to
> this swap entry, as well as updating all the associated swap data
> structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage.
> This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated
> per-swap-entry descriptor:
>
> struct swp_desc {
>         swp_entry_t vswap;
>         union {
>                 swp_slot_t slot;
>                 struct folio *folio;
>                 struct zswap_entry *zswap_entry;
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;
>         enum swap_type type;
>
>         atomic_t memcgid;
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;
> };

Thanks for sharing the code. My initial idea after the discussion at
LSFMM is that there is a simple way to combine this with my "swap
table" [1] design to solve the performance issue of this series: just
store the pointer to this struct in the swap table. It's a brute-force,
glue-like solution, but the contention issue will be gone.

Of course, it's not a good approach; ideally the data structure can be
simplified to an entry type in the swap table. The swap table series
handles locking and synchronization using either the cluster lock
(reusing the swap allocator and existing swap logic) or the folio lock
(kind of like the page cache). So many parts can be much simplified. I
think it will be at most ~32 bytes per page with a virtual device
(including the intermediate pointers). It will require quite some work,
though.

The good side of that approach is that we will have much lower memory
overhead and even better performance. And the virtual space part will
be optional: for a non-virtual setup the memory consumption will be only
8 bytes per page, also dynamically allocated, as discussed at LSFMM.

Sorry that I still have a few parts undone. I'm looking forward to
posting in about one week, e.g. after this weekend, if it goes well.
I'll also try to check your series first to see how the two can work
together better.

A draft version is available here, though, in case anyone is really
anxious to see the code. I wouldn't recommend spending much effort
checking it, as it may change rapidly:
https://github.com/ryncsn/linux/tree/kasong/devel/swap-unification

The good news is that the total LOC should be reduced, or at least
won't increase much, as it will unify a lot of swap infrastructure. So
things might be easier to implement after that.

[1] https://lore.kernel.org/linux-mm/CAMgjq7DHFYWhm+Z0C5tR2U2a-N_mtmgB4+idD2S+-1438u-wWw@mail.gmail.com/T/
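
For readers who want a concrete picture of the "just store the pointer
to this struct in the swap table" idea, here is a minimal userspace
sketch. It is not code from either series. Every name in it (struct
desc, struct cluster, table_entry_t, ENTRY_DESC_TAG, CLUSTER_SLOTS,
cluster_store, cluster_lookup) is hypothetical, the tagged-pointer
encoding is an assumption made for illustration, and the locking the
reply attributes to the cluster lock or folio lock is left out
entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct swp_desc; only two fields are modelled. */
struct desc {
    unsigned long slot;   /* physical slot currently backing the entry */
    int swap_count;       /* number of references to the entry */
};

/*
 * One machine word per page. Bit 0 set means "the rest of the word is a
 * pointer to a descriptor"; an all-zero word means "empty". This encoding
 * is an assumption of the sketch, not the swap table's real format.
 */
typedef uintptr_t table_entry_t;
#define ENTRY_DESC_TAG ((uintptr_t)1)

#define CLUSTER_SLOTS 512   /* arbitrary size chosen for the sketch */

struct cluster {
    table_entry_t table[CLUSTER_SLOTS];
};

static int entry_is_desc(table_entry_t e)
{
    return (e & ENTRY_DESC_TAG) != 0;
}

/* Publish a descriptor pointer in the table word for slot idx. */
static void cluster_store(struct cluster *c, size_t idx, struct desc *d)
{
    assert(idx < CLUSTER_SLOTS);
    /* struct desc is word-aligned, so bit 0 of its address is free. */
    assert(((uintptr_t)d & ENTRY_DESC_TAG) == 0);
    c->table[idx] = (uintptr_t)d | ENTRY_DESC_TAG;
}

/* Return the descriptor stored at idx, or NULL if the word holds none. */
static struct desc *cluster_lookup(const struct cluster *c, size_t idx)
{
    table_entry_t e;

    assert(idx < CLUSTER_SLOTS);
    e = c->table[idx];
    return entry_is_desc(e) ? (struct desc *)(e & ~ENTRY_DESC_TAG) : NULL;
}
```

In this model the table itself costs one word per page (8 bytes on a
64-bit machine), which matches the non-virtual figure quoted above. A
descriptor is only allocated, and only adds to that cost, for entries
that actually need one.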