From: Nhat Pham
Date: Tue, 22 Apr 2025 12:29:08 -0700
Subject: Re: [RFC PATCH 00/14] Virtual Swap Space
To: Yosry Ahmed
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
References: <20250407234223.1059191-1-nphamcs@gmail.com> <6807afd0.a70a0220.2ae8b9.e07cSMTPIN_ADDED_BROKEN@mx.google.com>

On Tue, Apr 22, 2025 at 10:15 AM Nhat Pham wrote:
>
> On Tue, Apr 22, 2025 at 8:03 AM Yosry Ahmed wrote:
> >
> > On Mon, Apr 07, 2025 at 04:42:01PM -0700, Nhat Pham wrote:
> > It's exciting to see this proposal materializing :)
> >
> > I didn't get a chance to look too closely at the code, but I have a few
> > high-level comments.
> >
> > Do we need separate refcnt and swap_count? I am aware that there are
> > cases where we need to hold a reference to prevent the descriptor from
> > going away, without an extra page table entry referencing the swap
> > descriptor -- but I am wondering if we can get away by just incrementing
> > the swap count in these cases too? Would this mess things up?
>
> Actually, you're right - we might not even need a separate refcnt
> field at all :) Here's my original thought process:
>
> 1. We need something that keeps the virtual swap slot and its metadata
> data structure (the swap descriptor) valid while we work with it.
>
> 2. In the old design, this is all stored at the swap device, so we
> need to obtain a reference to the swap device itself.
>
> 3. In the new design, this is no longer even possible. The backend
> might change under us even! So the refcnting needs to be done at the
> virtual swap level.
>
> 4. The refcnting needs to be separate from the swap count field,
> because certain operations/optimizations do check for the actual swap
> count, and incrementing the swap count willy-nilly like that might
> accidentally throw these off. Think readahead-induced swap reads, for
> example. So I need a separate refcnt field that takes into account 3
> sources: PTE references (swap count), swap cache, and "ephemeral" (i.e.
> temporary) references, which replace the role of the swap device
> reference in the old design.
>
> However, I have thought more about it. I don't think I need to obtain
> any ephemeral reference. I do need a refcnting mechanism, but one
> atomic field (that stores both the swap count and the swap cache pin)
> should suffice.
>
> Refcnt + RCU should already guarantee the existence of the swap
> descriptor while I work with it. So there won't be any UAF issue, as
> long as I am disciplined and check if the swap descriptor still exists
> etc. in the virtual swap implementation, which I am already doing
> anyway.
>
> This should be safe enough, even in the face of swapoff, because
> swapoff also relies on the same reference counting mechanism to free
> the virtual swap slot and its descriptor. It tries to swap_free() the
> virtual swap slot as it unmaps the virtual swap slot from the page
> table entry, which will decrement the swap count. So we're all good on
> this front.
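
To make that concrete, the lookup path I have in mind looks roughly like
the sketch below. This is purely illustrative - the names (vswap_descs,
struct swp_desc, swap_refs, the vswap_get_desc()/vswap_put_desc()
helpers) are made up for this email and are not the actual patch code:

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/swapops.h>
#include <linux/xarray.h>

struct swp_desc {
        swp_entry_t vswap;      /* the virtual swap slot */
        atomic_t swap_refs;     /* swap count + swap cache pin; 0 == dying */
        struct rcu_head rcu;
};

static DEFINE_XARRAY(vswap_descs);      /* virtual slot -> descriptor */

static struct swp_desc *vswap_get_desc(pgoff_t slot)
{
        struct swp_desc *desc;

        rcu_read_lock();
        desc = xa_load(&vswap_descs, slot);
        /*
         * The descriptor may be freed concurrently (e.g. swap_free()
         * dropping the last reference during swapoff). Only hand it
         * out if we manage to take a reference before it hits zero.
         */
        if (desc && !atomic_inc_not_zero(&desc->swap_refs))
                desc = NULL;
        rcu_read_unlock();
        return desc;
}

static void vswap_put_desc(struct swp_desc *desc)
{
        if (atomic_dec_and_test(&desc->swap_refs)) {
                xa_erase(&vswap_descs, swp_offset(desc->vswap));
                kfree_rcu(desc, rcu);
        }
}

Because the descriptor is only freed after an RCU grace period, a reader
that loses the atomic_inc_not_zero() race simply gets NULL back instead
of touching freed memory.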
>
> We DO need to obtain a reference to the swap device in certain places
> though, if we want to use it down the line for some sort of
> optimization (for example, to look at its swap device flags to check
> if it is a SWP_SYNCHRONOUS_IO device - see do_swap_page()). But this
> is a separate matter.
>
> The end result is that I will reduce these 4 fields:
>
> 1. swp_entry_t vswap
> 2. atomic_t in_swapcache
> 3. atomic_t swap_count
> 4. struct kref kref;
>
> into a single swap_refs field.
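
Concretely, the swap_refs field from the sketch above could encode both
pieces along these lines - again just an illustration; the macro and
helper names (VSWAP_CACHE_PIN, VSWAP_COUNT_SHIFT, vswap_dup(),
vswap_cache_pin()) are made up here, not taken from the patches:

/* One possible (illustrative) encoding of the combined field. */
#define VSWAP_CACHE_PIN         1U      /* bit 0: entry is in the swap cache */
#define VSWAP_COUNT_SHIFT       1       /* bits 1+: the swap count */

static inline unsigned int vswap_swap_count(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_refs) >> VSWAP_COUNT_SHIFT;
}

static inline void vswap_dup(struct swp_desc *desc)
{
        /* another PTE now references this virtual slot */
        atomic_add(1 << VSWAP_COUNT_SHIFT, &desc->swap_refs);
}

static inline bool vswap_cache_pin(struct swp_desc *desc)
{
        /* returns false if someone else already holds the cache pin */
        return !(atomic_fetch_or(VSWAP_CACHE_PIN, &desc->swap_refs) &
                 VSWAP_CACHE_PIN);
}

The cache pin can stay a single bit because at most one folio owns the
swap cache slot at a time; everything above that bit is the plain swap
count.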
> >
> > > This design allows us to:
> > > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> > >   simply associate the virtual swap slot with one of the supported
> > >   backends: a zswap entry, a zero-filled swap page, a slot on the
> > >   swapfile, or an in-memory page.
> > > * Simplify and optimize swapoff: we only have to fault the page in and
> > >   have the virtual swap slot point to the page instead of the on-disk
> > >   physical swap slot. No need to perform any page table walking.
> > >
> > > Please see the attached patches for implementation details.
> > >
> > > Note that I do not remove the old implementation for now. Users can
> > > select between the old and the new implementation via the
> > > CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
> > > new design, and iteratively optimize upon it (without having to include
> > > everything in an even more massive patch series).
> >
> > I know this is easier, but honestly I'd prefer if we do an incremental
> > replacement (if possible) rather than introducing a new implementation
> > and slowly deprecating the old one, which historically doesn't seem to
> > go well :P
>
> I know, I know :P
>
> > Once the series is organized as Johannes suggested, and we have better
> > insights into how this will be integrated with Kairui's work, it should
> > be clearer whether it's possible to incrementally update the current
> > implementation rather than add a parallel implementation.
>
> Will take a look at Kairui's work when it's available :)
>
> > > III. Future Use Cases
> > >
> > > Other than decoupling swap backends and optimizing swapoff, this new
> > > design allows us to implement the following more easily and
> > > efficiently:
> > >
> > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > >   [9]). Similar to swapoff, with the old design we would need to
> > >   perform the expensive page table walk.
> > > * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> > >   Huang in [6]).
> > > * Mixed backing THP swapin (see [7]): once you have pinned down the
> > >   backing store of the THP, you can dispatch each range of subpages
> > >   to the appropriate swapin handler.
> > > * Swapping a folio out with discontiguous physical swap slots (see [10])
> > >
> > > IV. Potential Issues
> > >
> > > Here are a couple of issues I can think of, along with some potential
> > > solutions:
> > >
> > > 1. Space overhead: we need one swap descriptor per swap entry.
> > > * Note that this overhead is dynamic, i.e. only incurred when we actually
> > >   need to swap a page out.
> > > * It can be further offset by the reduction of the swap map and the
> > >   elimination of the zeromapped bitmap.
> > >
> > > 2. Lock contention: since the virtual swap space is dynamic/unbounded,
> > > we cannot naively range partition it anymore. This can increase lock
> > > contention on swap-related data structures (swap cache, zswap's xarray,
> > > etc.).
> > > * The problem is slightly alleviated by the lockless nature of the new
> > >   reference counting scheme, as well as the per-entry locking for
> > >   backing store information.
> > > * Johannes suggested that I can implement a dynamic partition scheme, in
> > >   which new partitions (along with associated data structures) are
> > >   allocated on demand. It is one extra layer of indirection, but global
> > >   locking will be done only on partition allocation, rather than on
> > >   each access. All other accesses only take local (per-partition)
> > >   locks, or are completely lockless (such as partition lookup).
> > >
> > > V. Benchmarking
> > >
> > > As a proof of concept, I ran the prototype through some simple
> > > benchmarks:
> > >
> > > 1. usemem: 16 threads, 2G each, memory.max = 16G
> > >
> > > I benchmarked the following usemem command:
> > >
> > > time usemem --init-time -w -O -s 10 -n 16 2g
> > >
> > > Baseline:
> > > real: 33.96s
> > > user: 25.31s
> > > sys: 341.09s
> > > average throughput: 111295.45 KB/s
> > > average free time: 2079258.68 usecs
> > >
> > > New Design:
> > > real: 35.87s
> > > user: 25.15s
> > > sys: 373.01s
> > > average throughput: 106965.46 KB/s
> > > average free time: 3192465.62 usecs
> > >
> > > To root cause this regression, I ran perf on the usemem program, as
> > > well as on the following stress-ng program:
> > >
> > > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng --pageswap $(nproc) --pageswap-ops 100000
> > >
> > > and observed the (predicted) increase in lock contention on swap cache
> > > accesses. This regression is alleviated if I put together the
> > > following hack: limit the virtual swap space to a sufficient size for
> > > the benchmark, range partition the swap-related data structures (swap
> > > cache, zswap tree, etc.) based on the limit, and distribute the
> > > allocation of virtual swap slots among these partitions (on a per-CPU
> > > basis):
> > >
> > > real: 34.94s
> > > user: 25.28s
> > > sys: 360.25s
> > > average throughput: 108181.15 KB/s
> > > average free time: 2680890.24 usecs
> > >
> > > As mentioned above, I will implement proper dynamic swap range
> > > partitioning in a follow-up work.
> >
> > I thought there would be some improvements with the new design once the
> > lock contention is gone, due to the colocation of all swap metadata. Do
> > we know why this isn't the case?
>
> The lock contention is reduced on access, but increased in the allocation
> and free steps (because we now have to go through a global lock, due to
> the loss of swap space partitioning).
>
> Virtual swap allocation optimization will be the next step, or it can
> be done concurrently, if we can figure out a way to make Kairui's work
> compatible with this.

To clarify a bit - what Kairui's proposal gives us (IIUC) is a dynamic,
clustered approach to swap slot allocation. It's already done at the
physical level. This is precisely what this RFC is missing.

So if there is a way to combine the work, I think it will go a long
way in reducing the regression.

That said, I haven't looked closely at his code yet, so I don't know
how easy/hard it is to combine the efforts :)
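
For completeness, the rough shape of the dynamic partitioning mentioned
above is sketched below. This is purely illustrative - none of the
names or sizes (struct vswap_partition, VSWAP_PARTITION_SHIFT,
vswap_partition_lookup(), vswap_partition_alloc()) come from these
patches or from Kairui's series. The point is only that partition
creation takes the global lock, slot allocation takes that partition's
lock, and partition lookup is lockless:

#include <linux/spinlock.h>
#include <linux/xarray.h>

#define VSWAP_PARTITION_SHIFT   20      /* slots per partition: arbitrary here */
#define VSWAP_PARTITION_SIZE    (1UL << VSWAP_PARTITION_SHIFT)

struct vswap_partition {
        spinlock_t lock;        /* local: slot allocation/freeing */
        unsigned long base;     /* first slot; starts at 1 so 0 can mean "full" */
        unsigned long nr_used;  /* slots handed out so far */
};

static DEFINE_SPINLOCK(vswap_global_lock);      /* partition creation only */
static DEFINE_XARRAY(vswap_partitions);         /* partition id -> partition */

static struct vswap_partition *vswap_partition_lookup(unsigned long slot)
{
        /* lockless: xarray lookups are RCU-protected internally */
        return xa_load(&vswap_partitions, slot >> VSWAP_PARTITION_SHIFT);
}

static unsigned long vswap_partition_alloc(struct vswap_partition *part)
{
        unsigned long slot = 0; /* 0 == partition full, create a new one */

        spin_lock(&part->lock);
        if (part->nr_used < VSWAP_PARTITION_SIZE)
                slot = part->base + part->nr_used++;
        spin_unlock(&part->lock);
        return slot;
}

With something like this, the global lock disappears from the hot path
entirely; whether the per-partition allocator should then reuse
Kairui's cluster allocator internally is exactly the open question
above.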