From: Nhat Pham
Date: Tue, 22 Apr 2025 12:29:08 -0700
Subject: Re: [RFC PATCH 00/14] Virtual Swap Space
To: Yosry Ahmed
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
References: <20250407234223.1059191-1-nphamcs@gmail.com> <6807afd0.a70a0220.2ae8b9.e07cSMTPIN_ADDED_BROKEN@mx.google.com>

On Tue, Apr 22, 2025 at 10:15 AM Nhat Pham wrote:
>
> On Tue, Apr 22, 2025 at 8:03 AM Yosry Ahmed wrote:
> >
> > On Mon, Apr 07, 2025 at 04:42:01PM -0700, Nhat Pham wrote:
> > It's exciting to see this proposal materializing :)
> >
> > I didn't get a chance to look too closely at the code, but I have a few
> > high-level comments.
> >
> > Do we need separate refcnt and swap_count? I am aware that there are
> > cases where we need to hold a reference to prevent the descriptor from
> > going away, without an extra page table entry referencing the swap
> > descriptor -- but I am wondering if we can get away by just incrementing
> > the swap count in these cases too? Would this mess things up?
>
> Actually, you're right - we might not even need a separate refcnt
> field at all :) Here's my original thought process:
>
> 1. We need something that keeps the virtual swap slot and its metadata
> data structure (the swap descriptor) valid while we work with it.
>
> 2. In the old design, this is all stored at the swap device, so we
> need to obtain a reference to the swap device itself.
>
> 3. In the new design, this is no longer even possible. The backend
> might change under us even! So the refcnting needs to be done at the
> virtual swap level.
>
> 4. The refcnting needs to be separate from the swap count field,
> because certain operations/optimizations do check for the actual swap
> count, and incrementing the swap count willy-nilly like that might
> accidentally throw these off. Think readahead-induced swap reads, for
> example. So I need a separate refcnt field that takes into account 3
> sources: PTE references (swap count), swap cache, and "ephemeral" (i.e.
> temporary) references, which replace the role of the swap device
> reference in the old design.
>
> However, I have thought more about it. I don't think I need to obtain
> any ephemeral reference. I do need a refcnting mechanism, but one
> atomic field (that stores both the swap count and the swap cache pin)
> should suffice.
>
> Refcnt + RCU should already guarantee the existence of the swap
> descriptor while I work with it. So there won't be any UAF issue, as
> long as I am disciplined and check if the swap descriptor still exists
> etc. in the virtual swap implementation, which I am already doing
> anyway.
>
> This should be safe enough, even in the face of swapoff, because
> swapoff also relies on the same reference counting mechanism to free
> the virtual swap slot and its descriptor. It tries to swap_free() the
> virtual swap slot as it unmaps the virtual swap slot from the page
> table entry, which will decrement the swap count. So we're all good on
> this front.
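
To make that concrete, the lookup path I have in mind looks roughly like
the sketch below. This is purely illustrative - the names (vswap_descs,
struct swp_desc, swap_refs, the vswap_get_desc()/vswap_put_desc()
helpers) are made up for this email and are not the actual patch code:

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/swapops.h>
#include <linux/xarray.h>

struct swp_desc {
        swp_entry_t vswap;      /* the virtual swap slot */
        atomic_t swap_refs;     /* swap count + swap cache pin; 0 == dying */
        struct rcu_head rcu;
};

static DEFINE_XARRAY(vswap_descs);      /* virtual slot -> descriptor */

static struct swp_desc *vswap_get_desc(pgoff_t slot)
{
        struct swp_desc *desc;

        rcu_read_lock();
        desc = xa_load(&vswap_descs, slot);
        /*
         * The descriptor may be freed concurrently (e.g. swap_free()
         * dropping the last reference during swapoff). Only hand it
         * out if we manage to take a reference before it hits zero.
         */
        if (desc && !atomic_inc_not_zero(&desc->swap_refs))
                desc = NULL;
        rcu_read_unlock();
        return desc;
}

static void vswap_put_desc(struct swp_desc *desc)
{
        if (atomic_dec_and_test(&desc->swap_refs)) {
                xa_erase(&vswap_descs, swp_offset(desc->vswap));
                kfree_rcu(desc, rcu);
        }
}

Because the descriptor is only freed after an RCU grace period, a reader
that loses the atomic_inc_not_zero() race simply gets NULL back instead
of touching freed memory.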
>
> We DO need to obtain a reference to the swap device in certain places
> though, if we want to use it down the line for some sort of
> optimization (for example, to look at its swap device flags to check
> if it is a SWP_SYNCHRONOUS_IO device - see do_swap_page()). But this
> is a separate matter.
>
> The end result is that I will reduce these 4 fields:
>
> 1. swp_entry_t vswap
> 2. atomic_t in_swapcache
> 3. atomic_t swap_count
> 4. struct kref kref;
>
> into a single swap_refs field.
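
Concretely, the swap_refs field from the sketch above could encode both
pieces along these lines - again just an illustration; the macro and
helper names (VSWAP_CACHE_PIN, VSWAP_COUNT_SHIFT, vswap_dup(),
vswap_cache_pin()) are made up here, not taken from the patches:

/* One possible (illustrative) encoding of the combined field. */
#define VSWAP_CACHE_PIN         1U      /* bit 0: entry is in the swap cache */
#define VSWAP_COUNT_SHIFT       1       /* bits 1+: the swap count */

static inline unsigned int vswap_swap_count(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_refs) >> VSWAP_COUNT_SHIFT;
}

static inline void vswap_dup(struct swp_desc *desc)
{
        /* another PTE now references this virtual slot */
        atomic_add(1 << VSWAP_COUNT_SHIFT, &desc->swap_refs);
}

static inline bool vswap_cache_pin(struct swp_desc *desc)
{
        /* returns false if someone else already holds the cache pin */
        return !(atomic_fetch_or(VSWAP_CACHE_PIN, &desc->swap_refs) &
                 VSWAP_CACHE_PIN);
}

The cache pin can stay a single bit because at most one folio owns the
swap cache slot at a time; everything above that bit is the plain swap
count.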
> >
> > > This design allows us to:
> > > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> > >   simply associate the virtual swap slot with one of the supported
> > >   backends: a zswap entry, a zero-filled swap page, a slot on the
> > >   swapfile, or an in-memory page.
> > > * Simplify and optimize swapoff: we only have to fault the page in and
> > >   have the virtual swap slot point to the page instead of the on-disk
> > >   physical swap slot. No need to perform any page table walking.
> > >
> > > Please see the attached patches for implementation details.
> > >
> > > Note that I do not remove the old implementation for now. Users can
> > > select between the old and the new implementation via the
> > > CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
> > > new design, and iteratively optimize upon it (without having to include
> > > everything in an even more massive patch series).
> >
> > I know this is easier, but honestly I'd prefer if we do an incremental
> > replacement (if possible) rather than introducing a new implementation
> > and slowly deprecating the old one, which historically doesn't seem to
> > go well :P
>
> I know, I know :P
>
> > Once the series is organized as Johannes suggested, and we have better
> > insights into how this will be integrated with Kairui's work, it should
> > be clearer whether it's possible to incrementally update the current
> > implementation rather than add a parallel implementation.
>
> Will take a look at Kairui's work when it's available :)
>
> > > III. Future Use Cases
> > >
> > > Other than decoupling swap backends and optimizing swapoff, this new
> > > design allows us to implement the following more easily and
> > > efficiently:
> > >
> > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > >   [9]). Similar to swapoff, with the old design we would need to
> > >   perform the expensive page table walk.
> > > * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> > >   Huang in [6]).
> > > * Mixed backing THP swapin (see [7]): once you have pinned down the
> > >   backing store of the THP, you can dispatch each range of subpages
> > >   to the appropriate swapin handler.
> > > * Swapping a folio out with discontiguous physical swap slots (see [10])
> > >
> > > IV. Potential Issues
> > >
> > > Here are a couple of issues I can think of, along with some potential
> > > solutions:
> > >
> > > 1. Space overhead: we need one swap descriptor per swap entry.
> > > * Note that this overhead is dynamic, i.e. only incurred when we actually
> > >   need to swap a page out.
> > > * It can be further offset by the reduction of the swap map and the
> > >   elimination of the zeromapped bitmap.
> > >
> > > 2. Lock contention: since the virtual swap space is dynamic/unbounded,
> > > we cannot naively range partition it anymore. This can increase lock
> > > contention on swap-related data structures (swap cache, zswap's xarray,
> > > etc.).
> > > * The problem is slightly alleviated by the lockless nature of the new
> > >   reference counting scheme, as well as the per-entry locking for
> > >   backing store information.
> > > * Johannes suggested that I can implement a dynamic partition scheme, in
> > >   which new partitions (along with associated data structures) are
> > >   allocated on demand. It is one extra layer of indirection, but global
> > >   locking will be done only on partition allocation, rather than on
> > >   each access. All other accesses only take local (per-partition)
> > >   locks, or are completely lockless (such as partition lookup).
> > >
> > > V. Benchmarking
> > >
> > > As a proof of concept, I ran the prototype through some simple
> > > benchmarks:
> > >
> > > 1. usemem: 16 threads, 2G each, memory.max = 16G
> > >
> > > I benchmarked the following usemem command:
> > >
> > > time usemem --init-time -w -O -s 10 -n 16 2g
> > >
> > > Baseline:
> > > real: 33.96s
> > > user: 25.31s
> > > sys: 341.09s
> > > average throughput: 111295.45 KB/s
> > > average free time: 2079258.68 usecs
> > >
> > > New Design:
> > > real: 35.87s
> > > user: 25.15s
> > > sys: 373.01s
> > > average throughput: 106965.46 KB/s
> > > average free time: 3192465.62 usecs
> > >
> > > To root cause this regression, I ran perf on the usemem program, as
> > > well as on the following stress-ng program:
> > >
> > > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng --pageswap $(nproc) --pageswap-ops 100000
> > >
> > > and observed the (predicted) increase in lock contention on swap cache
> > > accesses. This regression is alleviated if I put together the
> > > following hack: limit the virtual swap space to a sufficient size for
> > > the benchmark, range partition the swap-related data structures (swap
> > > cache, zswap tree, etc.) based on the limit, and distribute the
> > > allocation of virtual swap slots among these partitions (on a per-CPU
> > > basis):
> > >
> > > real: 34.94s
> > > user: 25.28s
> > > sys: 360.25s
> > > average throughput: 108181.15 KB/s
> > > average free time: 2680890.24 usecs
> > >
> > > As mentioned above, I will implement proper dynamic swap range
> > > partitioning in a follow-up work.
> >
> > I thought there would be some improvements with the new design once the
> > lock contention is gone, due to the colocation of all swap metadata. Do
> > we know why this isn't the case?
>
> The lock contention is reduced on access, but increased in the allocation
> and free steps (because we now have to go through a global lock, due to
> the loss of swap space partitioning).
>
> Virtual swap allocation optimization will be the next step, or it can
> be done concurrently, if we can figure out a way to make Kairui's work
> compatible with this.

To clarify a bit - what Kairui's proposal gives us (IIUC) is a dynamic,
clustered approach to swap slot allocation. It's already done at the
physical level. This is precisely what this RFC is missing.

So if there is a way to combine the work, I think it will go a long
way in reducing the regression.

That said, I haven't looked closely at his code yet, so I don't know
how easy/hard it is to combine the efforts :)
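
For completeness, the rough shape of the dynamic partitioning mentioned
above is sketched below. This is purely illustrative - none of the
names or sizes (struct vswap_partition, VSWAP_PARTITION_SHIFT,
vswap_partition_lookup(), vswap_partition_alloc()) come from these
patches or from Kairui's series. The point is only that partition
creation takes the global lock, slot allocation takes that partition's
lock, and partition lookup is lockless:

#include <linux/spinlock.h>
#include <linux/xarray.h>

#define VSWAP_PARTITION_SHIFT   20      /* slots per partition: arbitrary here */
#define VSWAP_PARTITION_SIZE    (1UL << VSWAP_PARTITION_SHIFT)

struct vswap_partition {
        spinlock_t lock;        /* local: slot allocation/freeing */
        unsigned long base;     /* first slot; starts at 1 so 0 can mean "full" */
        unsigned long nr_used;  /* slots handed out so far */
};

static DEFINE_SPINLOCK(vswap_global_lock);      /* partition creation only */
static DEFINE_XARRAY(vswap_partitions);         /* partition id -> partition */

static struct vswap_partition *vswap_partition_lookup(unsigned long slot)
{
        /* lockless: xarray lookups are RCU-protected internally */
        return xa_load(&vswap_partitions, slot >> VSWAP_PARTITION_SHIFT);
}

static unsigned long vswap_partition_alloc(struct vswap_partition *part)
{
        unsigned long slot = 0; /* 0 == partition full, create a new one */

        spin_lock(&part->lock);
        if (part->nr_used < VSWAP_PARTITION_SIZE)
                slot = part->base + part->nr_used++;
        spin_unlock(&part->lock);
        return slot;
}

With something like this, the global lock disappears from the hot path
entirely; whether the per-partition allocator should then reuse
Kairui's cluster allocator internally is exactly the open question
above.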