From: Nhat Pham
Date: Sun, 8 Feb 2026 14:51:59 -0800
Subject: Re: [PATCH v3 00/20] Virtual Swap Space
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
 yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
 chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
 huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
 shikemeng@huaweicloud.com, viro@zeniv.linux.org.uk, baohua@kernel.org,
 bhe@redhat.com, osalvador@suse.de, lorenzo.stoakes@oracle.com,
 christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
 linux-pm@vger.kernel.org, peterx@redhat.com, riel@surriel.com,
 joshua.hahnjy@gmail.com, npache@redhat.com, gourry@gourry.net,
 axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
 rafael@kernel.org, jannh@google.com, pfalcato@suse.de,
 zhengqi.arch@bytedance.com
References: <20260208215839.87595-1-nphamcs@gmail.com>
In-Reply-To: <20260208215839.87595-1-nphamcs@gmail.com>

On Sun, Feb 8, 2026 at 1:58 PM Nhat Pham wrote:
>
> Changelog:
> * RFC v2 -> v3:
>     * Implement a cluster-based allocation algorithm for virtual swap
>       slots, inspired by Kairui Song and Chris Li's implementation, as
>       well as Johannes Weiner's suggestions. This eliminates the lock
>       contention issues on the virtual swap layer.
>     * Re-use swap table for the reverse mapping.
>     * Remove CONFIG_VIRTUAL_SWAP.
>     * Reduce the size of the swap descriptor from 48 bytes to 24
>       bytes, i.e. another 50% reduction in memory overhead from v2.
>     * Remove swap cache and zswap tree and use the swap descriptor
>       for this.
>     * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
>       (one for allocated slots, and one for bad slots).
>     * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
>     * Update cover letter to include new benchmark results and discussion
>       on overhead in various cases.
> * RFC v1 -> RFC v2:
>     * Use a single atomic type (swap_refs) for reference counting
>       purposes. This brings the size of the swap descriptor from 64 B
>       down to 48 B (25% reduction). Suggested by Yosry Ahmed.
>     * Zeromap bitmap is removed in the virtual swap implementation.
>       This saves one bit per physical swapfile slot.
>     * Rearrange the patches and the code change to make things more
>       reviewable. Suggested by Johannes Weiner.
>     * Update the cover letter a bit.
>
> This patch series implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show
> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on par with
> baseline.
> Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.
>
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is arguably the central shortcoming of
> zswap:
> * In deployments where no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
>   we have swapfiles on the order of tens to hundreds of GBs, which are
>   mostly unused and only exist to enable zswap usage and zero-filled
>   pages swap optimizations.
> * Tying zswap (and more generally, other in-memory swap backends) to
>   the current physical swapfile infrastructure makes zswap implicitly
>   statically sized.
>   This does not make sense, as unlike disk swap, in
>   which we consume a limited resource (disk space or swapfile space) to
>   save another resource (memory), zswap consumes the same resource it is
>   saving (memory). The more we zswap, the more memory we have available,
>   not less. We are not rationing a limited resource when we limit
>   the size of the zswap pool, but rather we are capping the resource
>   (memory) saving potential of zswap. Under memory pressure, using
>   more zswap is almost always better than the alternative (disk IOs, or
>   even worse, OOMs), and dynamically sizing the zswap pool on demand
>   allows the system to flexibly respond to these precarious scenarios.
> * Operationally, statically provisioning the swapfile for zswap poses
>   significant challenges, because the sysadmin has to prescribe how
>   much swap is needed a priori, for each combination of
>   (memory size x disk space x workload usage). It is even more
>   complicated when we take into account the variance of memory
>   compression, which changes the reclaim dynamics (and as a result,
>   swap space size requirement). The problem is further exacerbated for
>   users who rely on swap utilization (and exhaustion) as an OOM signal.
>
>   All of these factors make it very difficult to configure the swapfile
>   for zswap: too small of a swapfile and we risk preventable OOMs and
>   limit the memory saving potential of zswap; too big of a swapfile
>   and we waste disk space and memory due to swap metadata overhead.
>   This dilemma becomes more drastic in high memory systems, which can
>   have up to TBs worth of memory.
>
> Past attempts to decouple disk and compressed swap backends, namely the
> ghost swapfile approach (see [13]), as well as the alternative
> compressed swap backend zram, have mainly focused on eliminating the
> disk space usage of compressed backends.
> We want a solution that not
> only tackles that same problem, but also achieves the dynamicization of
> swap space to maximize the memory saving potential while reducing
> operational and static memory overhead.
>
> Finally, any swap redesign should support efficient backend transfer,
> i.e. without having to perform the expensive page table walk to
> update all the PTEs that refer to the swap entry:
> * The main motivation for this requirement is zswap writeback. To quote
>   Johannes (from [14]): "Combining compression with disk swap is
>   extremely powerful, because it dramatically reduces the worst aspects
>   of both: it reduces the memory footprint of compression by shedding
>   the coldest data to disk; it reduces the IO latencies and flash wear
>   of disk swap through the writeback cache. In practice, this reduces
>   *average event rates of the entire reclaim/paging/IO stack*."
> * Another motivation is to simplify swapoff, which is both complicated
>   and expensive in the current design, precisely because we are storing
>   an encoding of the backend positional information in the page table,
>   and thus require a full page table walk to remove these references.
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage.
> This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated,
> per-swap-entry descriptor:
>
> struct swp_desc {
>         union {
>                 swp_slot_t         slot;                 /*     0     8 */
>                 struct zswap_entry * zswap_entry;        /*     0     8 */
>         };                                               /*     0     8 */
>         union {
>                 struct folio *     swap_cache;           /*     8     8 */
>                 void *             shadow;               /*     8     8 */
>         };                                               /*     8     8 */
>         unsigned int               swap_count;           /*    16     4 */
>         unsigned short             memcgid:16;           /*    20: 0  2 */
>         bool                       in_swapcache:1;       /*    22: 0  1 */
>
>         /* Bitfield combined with previous fields */
>
>         enum swap_type             type:2;               /*    20:17  4 */
>
>         /* size: 24, cachelines: 1, members: 6 */
>         /* bit_padding: 13 bits */
>         /* last cacheline: 24 bytes */
> };
>
> (output from pahole).
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
>   simply associate the virtual swap slot with one of the supported
>   backends: a zswap entry, a zero-filled swap page, a slot on the
>   swapfile, or an in-memory page.
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the virtual swap slot point to the page instead of the on-disk
>   physical swap slot. No need to perform any page table walking.
>
> The size of the virtual swap descriptor is 24 bytes. Note that this is
> not all "new" overhead, as the swap descriptor will replace:
> * the swap_cgroup arrays (one per swap type) in the old design, which
>   is a massive source of static memory overhead. With the new design,
>   it is only allocated for used clusters.
> * the swap tables, which hold the swap cache and workingset shadows.
> * the zeromap bitmap, which is a bitmap of physical swap slots to
>   indicate whether the swapped out page is zero-filled or not.
> * a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
>   one for allocated slots, and one for bad slots, representing 3 possible
>   states of a slot on the swapfile: allocated, free, and bad.
> * the zswap tree.
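
To make the indirection concrete, here is a user-space toy model of the idea
described above: a page table entry only ever holds a virtual slot index, and
moving the data between backends only touches the descriptor. This is a
sketch under loud assumptions, not the code in the patches: `vslot_t`,
`struct vswap_desc`, `enum vswap_backend`, the fixed-size `vswap_table` and
all function names are invented for illustration, and the descriptor is
simplified compared to the real `struct swp_desc` (no swap count, memcg id
or swap cache pointer).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags for the four backends listed in the cover letter. */
enum vswap_backend {
	VSWAP_SWAPFILE, /* slot on a physical swapfile */
	VSWAP_ZERO,     /* zero-filled page, needs no storage */
	VSWAP_ZSWAP,    /* compressed copy in the zswap pool */
	VSWAP_FOLIO,    /* page resident in memory */
};

/* What a page table entry would hold: only the virtual slot index. */
typedef uint32_t vslot_t;

/* Simplified stand-in for the per-entry descriptor. */
struct vswap_desc {
	union {
		uint64_t phys_slot; /* valid when type == VSWAP_SWAPFILE */
		void *zswap_entry;  /* valid when type == VSWAP_ZSWAP */
		void *folio;        /* valid when type == VSWAP_FOLIO */
	} backing;
	enum vswap_backend type;
};

/* Toy fixed-size table; the series allocates descriptors dynamically. */
#define NR_VSLOTS 16
static struct vswap_desc vswap_table[NR_VSLOTS];

/* Resolve a virtual slot to its descriptor, or NULL if out of range. */
struct vswap_desc *vswap_resolve(vslot_t v)
{
	return v < NR_VSLOTS ? &vswap_table[v] : NULL;
}

/* Record that the page behind virtual slot v now lives in zswap. */
int vswap_store_zswap(vslot_t v, void *entry)
{
	struct vswap_desc *desc = vswap_resolve(v);

	if (!desc)
		return -1;
	desc->backing.zswap_entry = entry;
	desc->type = VSWAP_ZSWAP;
	return 0;
}

/* zswap writeback: switch the backing store from zswap to a swapfile slot. */
int vswap_writeback(vslot_t v, uint64_t phys_slot)
{
	struct vswap_desc *desc = vswap_resolve(v);

	if (!desc || desc->type != VSWAP_ZSWAP)
		return -1;
	desc->backing.phys_slot = phys_slot;
	desc->type = VSWAP_SWAPFILE;
	return 0;
}

/* swapoff of one entry: the data was read back in, so point at the page. */
int vswap_swapoff_one(vslot_t v, void *folio)
{
	struct vswap_desc *desc = vswap_resolve(v);

	assert(folio != NULL);
	if (!desc || desc->type != VSWAP_SWAPFILE)
		return -1;
	desc->backing.folio = folio;
	desc->type = VSWAP_FOLIO;
	return 0;
}
```

In this model, a PTE holding virtual slot 3 stays valid across
`vswap_store_zswap(3, entry)`, `vswap_writeback(3, 42)` and
`vswap_swapoff_one(3, page)`: only the descriptor changes, which is why no
page table walk is needed for either operation.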
>
> So, in terms of additional memory overhead:
> * For zswap entries, the added memory overhead is rather minimal. The
>   new indirection pointer neatly replaces the existing zswap tree.
>   We really only incur less than one word of overhead for swap count
>   blow up (since we no longer use swap continuation) and the swap type.
> * For physical swap entries, the new design will impose fewer than 3 words
>   of memory overhead. However, as noted above this overhead is only for
>   actively used swap entries, whereas in the current design the overhead is
>   static (including the swap cgroup array for example).
>
> The primary victim of this overhead will be zram users. However, as
> zswap no longer takes up disk space, zram users can consider
> switching to zswap (which, as a bonus, has a lot of useful features
> out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> LRU-ordering writeback, etc.).
>
> For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> 8,388,608 swap entries), and we use zswap.
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 0.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 48.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 96.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 121.00 MB
> * Vswap total overhead: 144.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 153.00 MB
> * Vswap total overhead: 193.00 MB
>
> So even in the worst case scenario for virtual swap, i.e. when we
> somehow have an oracle to correctly size the swapfile for zswap
> pool to 32 GB, the added overhead is only 40 MB, which is a mere
> 0.12% of the total swapfile :)
>
> In practice, the overhead will be closer to the 50-75% usage case, as
> systems tend to leave swap headroom for pathological events or sudden
> spikes in memory requirements. The added overhead in these cases is
> practically negligible. And in deployments where swapfiles for zswap
> were previously sparsely used, switching over to virtual swap will
> actually reduce memory overhead.
>
> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170 MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.
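
The numbers in the two tables above fit a simple linear model: a fixed cost
per swapfile slot plus a cost per in-use entry. Here is a small sketch that
reproduces them. To be clear, the per-slot and per-entry constants are
back-derived from the tables themselves (e.g. 25 MB over 8,388,608 slots is
3.125 bytes per slot), not read out of the patches, and `struct cost_model`,
`overhead_bytes()` and `overhead_mb()` are names made up for this
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 32 GB swapfile with 4 KB pages: 8,388,608 slots, as in the example. */
#define NR_SLOTS 8388608ULL

/*
 * Linear overhead model. Costs are kept in bits so the fractional
 * per-slot costs stay exact in integer arithmetic. All constants are
 * inferred from the tables, not taken from the patches.
 */
struct cost_model {
	uint64_t static_bits_per_slot;   /* paid for every slot in the swapfile */
	uint64_t dynamic_bits_per_entry; /* paid only for entries in use */
};

/* zswap backend */
const struct cost_model old_zswap   = { 25, 128 }; /* 3.125 B static, 16 B per entry */
const struct cost_model vswap_zswap = {  0, 193 }; /* 24.125 B per entry */
/* disk backend */
const struct cost_model old_disk    = { 25,  64 }; /* 3.125 B static, 8 B per entry */
const struct cost_model vswap_disk  = {  2, 257 }; /* 0.25 B static, 32.125 B per entry */

/* Total overhead in bytes for a given number of in-use entries. */
uint64_t overhead_bytes(const struct cost_model *m, uint64_t used)
{
	assert(used <= NR_SLOTS);
	return (m->static_bits_per_slot * NR_SLOTS +
		m->dynamic_bits_per_entry * used) / 8;
}

/* Same, in MB (2^20 bytes), matching the units used in the tables. */
double overhead_mb(const struct cost_model *m, uint64_t used)
{
	return (double)overhead_bytes(m, used) / (1024.0 * 1024.0);
}
```

For example, `overhead_mb(&vswap_zswap, NR_SLOTS / 4)` gives 48.25 and
`overhead_mb(&old_zswap, NR_SLOTS / 4)` gives 57.0, matching the 25% row for
zswap. Under this model, vswap's overhead for zswap stays below the old
design's until usage reaches roughly 38% of the swapfile.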
>
> Please see the attached patches for more implementation details.
>
>
> III. Usage and Benchmarking
>
> This patch series introduces no new syscalls or userspace API. Existing
> userspace setups will work as-is, except we no longer have to create a
> swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> longer tied to physical swap. The zswap pool will be automatically and
> dynamically sized based on memory usage and reclaim dynamics.
>
> To measure the performance of the new implementation, I have run the
> following benchmarks:
>
> 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
>
> Using zswap as the backend:
>
> Baseline:
> real: mean: 185.2s, stdev: 0.93s
> sys: mean: 683.7s, stdev: 33.77s
>
> Vswap:
> real: mean: 184.88s, stdev: 0.57s
> sys: mean: 675.14s, stdev: 32.8s
>
> We actually see a slight improvement in systime (by roughly 1.3%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and the virtual swap allocator is simpler than that of physical
> swap.
>
> Using SSD swap as the backend:
>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck and neck.
>
>
> IV. Future Use Cases
>
> While the patch series focuses on two applications (decoupling swap
> backends and swapoff optimization/simplification), this new,
> future-proof design also allows us to implement new swap features more
> easily and efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
>   backing store of THPs, then you can dispatch each range of subpages
>   to the appropriate backend swapin handler.
> * Swapping a folio out with discontiguous physical swap slots
>   (see [10]).
> * Zswap writeback optimization: The current architecture pre-reserves
>   physical swap space for pages when they enter the zswap pool, giving
>   the kernel no flexibility at writeback time. With the virtual swap
>   implementation, the backends are decoupled, and physical swap space
>   is allocated on-demand at writeback time, at which point we can make
>   much smarter decisions: we can batch multiple zswap writeback
>   operations into a single IO request, allocating contiguous physical
>   swap slots for that request. We can even perform compressed writeback
>   (i.e. writing these pages without decompressing them) (see [12]).
>
>
> V. References
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
>
> Nhat Pham (20):
>   mm/swap: decouple swap cache from physical swap infrastructure
>   swap: rearrange the swap header file
>   mm: swap: add an abstract API for locking out swapoff
>   zswap: add new helpers for zswap entry operations
>   mm/swap: add a new function to check if a swap entry is in swap
>     cached.
>   mm: swap: add a separate type for physical swap slots
>   mm: create scaffolds for the new virtual swap implementation
>   zswap: prepare zswap for swap virtualization
>   mm: swap: allocate a virtual swap slot for each swapped out page
>   swap: move swap cache to virtual swap descriptor
>   zswap: move zswap entry management to the virtual swap descriptor
>   swap: implement the swap_cgroup API using virtual swap
>   swap: manage swap entry lifecycle at the virtual swap layer
>   mm: swap: decouple virtual swap slot from backing store
>   zswap: do not start zswap shrinker if there is no physical swap slots
>   swap: do not unnecesarily pin readahead swap entries
>   swapfile: remove zeromap bitmap
>   memcg: swap: only charge physical swap slots
>   swap: simplify swapoff using virtual swap
>   swapfile: replace the swap map with bitmaps
>
>  Documentation/mm/swap-table.rst |   69 --
>  MAINTAINERS                     |    2 +
>  include/linux/cpuhotplug.h      |    1 +
>  include/linux/mm_types.h        |   16 +
>  include/linux/shmem_fs.h        |    7 +-
>  include/linux/swap.h            |  135 ++-
>  include/linux/swap_cgroup.h     |   13 -
>  include/linux/swapops.h         |   25 +
>  include/linux/zswap.h           |   17 +-
>  kernel/power/swap.c             |    6 +-
>  mm/Makefile                     |    5 +-
>  mm/huge_memory.c                |   11 +-
>  mm/internal.h                   |   12 +-
>  mm/memcontrol-v1.c              |    6 +
>  mm/memcontrol.c                 |  142 ++-
>  mm/memory.c                     |  101 +-
>  mm/migrate.c                    |   13 +-
>  mm/mincore.c                    |   15 +-
>  mm/page_io.c                    |   83 +-
>  mm/shmem.c                      |  215 +---
>  mm/swap.h                       |  157 +--
>  mm/swap_cgroup.c                |  172 ---
>  mm/swap_state.c                 |  306 +----
>  mm/swap_table.h                 |   78 +-
>  mm/swapfile.c                   | 1518 ++++-------------------
>  mm/userfaultfd.c                |   18 +-
>  mm/vmscan.c                     |   28 +-
>  mm/vswap.c                      | 2025 +++++++++++++++++++++++++++++++
>  mm/zswap.c                      |  142 +--
>  29 files changed, 2853 insertions(+), 2485 deletions(-)
>  delete mode 100644 Documentation/mm/swap-table.rst
>  delete mode 100644 mm/swap_cgroup.c
>  create mode 100644 mm/vswap.c
>
>
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> --
> 2.47.3

Weirdly, it seems like the cover letter (and only the cover letter) is
not being delivered... I'm trying to figure out what's going on :(

My apologies for the inconvenience...