From: Nhat Pham <nphamcs@gmail.com>
To: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
hannes@cmpxchg.org, hughd@google.com, mhocko@kernel.org,
roman.gushchin@linux.dev, shakeel.butt@linux.dev,
muchun.song@linux.dev, len.brown@intel.com,
chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de,
lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu,
pavel@kernel.org, kernel-team@meta.com,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-pm@vger.kernel.org
Subject: Re: [RFC PATCH 00/14] Virtual Swap Space
Date: Tue, 22 Apr 2025 12:29:08 -0700
Message-ID: <CAKEwX=OBC3n-+hPXGnpoZChCqjtQUt-nbBrjj0kRqsCdTcqghA@mail.gmail.com>
In-Reply-To: <CAKEwX=NQyDqNBoS2kPePZO1iTkt88MgrtEKexxu7uLhaeA6rsQ@mail.gmail.com>
On Tue, Apr 22, 2025 at 10:15 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Apr 22, 2025 at 8:03 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> >
> > On Mon, Apr 07, 2025 at 04:42:01PM -0700, Nhat Pham wrote:
> > It's exciting to see this proposal materializing :)
> >
> > I didn't get a chance to look too closely at the code, but I have a few
> > high-level comments.
> >
> > Do we need separate refcnt and swap_count? I am aware that there are
> > cases where we need to hold a reference to prevent the descriptor from
> > going away, without an extra page table entry referencing the swap
> > descriptor -- but I am wondering if we can get away by just incrementing
> > the swap count in these cases too? Would this mess things up?
>
> Actually, you're right - we might not even need a separate refcnt
> field at all :) Here's my original thought process:
>
> 1. We need something that keeps the virtual swap slot and its metadata
> data structure (the swap descriptor) valid while we work with it.
>
> 2. In the old design, this is all stored at the swap device, so we
> need to obtain a reference to the swap device itself.
>
> 3. In the new design, this is no longer even possible. The backend
> might change under us even! So the refcnting needs to be done at the
> virtual swap level.
>
> 4. The refcnting needs to be separate from the swap count field,
> because certain operations/optimizations do check for the actual swap
> count, and incrementing the swap count willy-nilly like that might
> accidentally throw these off. Think readahead-induced swap reads, for
> example. So I need a separate refcnt field that takes into account 3
> sources: PTE references (swap count), swap cache, and "ephemeral" (i.e.
> temporary) references, that replace the role of the swap device
> reference in the old design.
>
> However, I have thought more about it. I don't think I need to obtain
> any ephemeral reference. I do need a refcnting mechanism, but one
> atomic field (that stores both the swap count and swap cache pin)
> should suffice.
>
> Refcnt + RCU should already guarantee the existence of the swap
> descriptor while I work with it. So there won't be any UAF issue, as
> long as I am disciplined and check if the swap descriptor still exists
> etc. in the virtual swap implementation, which I already am doing
> anyway.
>
> This should be safe enough, even in the face of swapoff, because
> swapoff also relies on the same reference counting mechanism to free
> the virtual swap slot and its descriptor. It tries to swap_free() the
> virtual swap slot, as it unmaps the virtual swap slot from the page
> table entry, which will decrement the swap count. So we're all good on
> this front.
>
> We DO need to obtain a reference to the swap device in certain places
> though, if we want to use it down the line for some sort of
> optimizations (for example, to look at its swap device flags to check
> if it is a SWP_SYNCHRONOUS_IO device - see do_swap_page()). But this
> is a separate matter.
>
> The end result is I will reduce 4 fields:
>
> 1. swp_entry_t vswap
> 2. atomic_t in_swapcache
> 3. atomic_t swap_count
> 4. struct kref kref;
>
> Into a single swap_refs field.
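
To make the single-field idea concrete, here is a minimal userspace
sketch using C11 atomics. The bit layout (bit 0 as the swap cache pin,
the remaining bits as the swap count), the struct name and every helper
name below are assumptions made up for illustration; the actual
swap_refs encoding has not been decided in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 0 = swap cache pin, bits 1..31 = swap count. */
#define SWAP_CACHE_PIN  1u
#define SWAP_COUNT_UNIT 2u

struct swp_desc_sketch {
	_Atomic uint32_t swap_refs;
};

/* Take one PTE reference, unless the descriptor is already dead (refs == 0). */
static bool swap_refs_get_count(struct swp_desc_sketch *d)
{
	uint32_t old = atomic_load(&d->swap_refs);

	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(&d->swap_refs, &old,
					       old + SWAP_COUNT_UNIT));
	return true;
}

/* Drop one PTE reference. Returns true if nothing references the slot anymore. */
static bool swap_refs_put_count(struct swp_desc_sketch *d)
{
	uint32_t old = atomic_fetch_sub(&d->swap_refs, SWAP_COUNT_UNIT);

	assert(old >= SWAP_COUNT_UNIT);
	return old == SWAP_COUNT_UNIT;
}

/* Set the swap cache pin. Fails if the slot is dead or already pinned. */
static bool swap_refs_pin_cache(struct swp_desc_sketch *d)
{
	uint32_t old = atomic_load(&d->swap_refs);

	do {
		if (old == 0 || (old & SWAP_CACHE_PIN))
			return false;
	} while (!atomic_compare_exchange_weak(&d->swap_refs, &old,
					       old | SWAP_CACHE_PIN));
	return true;
}

/* Clear the swap cache pin. Returns true if that was the last reference. */
static bool swap_refs_unpin_cache(struct swp_desc_sketch *d)
{
	uint32_t old = atomic_fetch_and(&d->swap_refs, ~SWAP_CACHE_PIN);

	assert(old & SWAP_CACHE_PIN);
	return old == SWAP_CACHE_PIN;
}

/* The "actual" swap count, unaffected by the swap cache pin. */
static uint32_t swap_refs_count(struct swp_desc_sketch *d)
{
	return atomic_load(&d->swap_refs) >> 1;
}
```

Keeping the pin in its own bit means code that checks the real swap
count (readahead, for example) still sees only PTE references, while
the descriptor is freed exactly when the whole word drops to zero. A
kernel version would presumably pair this with RCU for the lookup side,
which is not modeled here.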
>
>
> >
> > >
> > > This design allows us to:
> > > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> > > simply associate the virtual swap slot with one of the supported
> > > backends: a zswap entry, a zero-filled swap page, a slot on the
> > > swapfile, or an in-memory page.
> > > * Simplify and optimize swapoff: we only have to fault the page in and
> > > have the virtual swap slot point to the page instead of the on-disk
> > > physical swap slot. No need to perform any page table walking.
> > >
> > > Please see the attached patches for implementation details.
> > >
> > > Note that I do not remove the old implementation for now. Users can
> > > select between the old and the new implementation via the
> > > CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
> > > new design, and iteratively optimize upon it (without having to include
> > > everything in an even more massive patch series).
> >
> > I know this is easier, but honestly I'd prefer if we do an incremental
> > replacement (if possible) rather than introducing a new implementation
> > and slowly deprecating the old one, which historically doesn't seem to
> > go well :P
>
> I know, I know :P
>
> >
> > Once the series is organized as Johannes suggested, and we have better
> > insights into how this will be integrated with Kairui's work, it should
> > be clearer whether it's possible to incrementally update the current
> > implementation rather than add a parallel implementation.
>
> Will take a look at Kairui's work when it's available :)
>
> >
> > >
> > > III. Future Use Cases
> > >
> > > Other than decoupling swap backends and optimizing swapoff, this new
> > > design allows us to implement the following more easily and
> > > efficiently:
> > >
> > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > [9]). Similar to swapoff, with the old design we would need to
> > > perform the expensive page table walk.
> > > * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> > > Huang in [6]).
> > > * Mixed backing THP swapin (see [7]): Once you have pinned down the
> > > backing store of THPs, then you can dispatch each range of subpages
> > > to the appropriate swapin handler.
> > > * Swapping a folio out with discontiguous physical swap slots (see [10])
> > >
> > >
> > > IV. Potential Issues
> > >
> > > Here are a couple of issues I can think of, along with some potential
> > > solutions:
> > >
> > > 1. Space overhead: we need one swap descriptor per swap entry.
> > > * Note that this overhead is dynamic, i.e. only incurred when we actually
> > > need to swap a page out.
> > > * It can be further offset by the reduction of swap map and the
> > > elimination of zeromapped bitmap.
> > >
> > > 2. Lock contention: since the virtual swap space is dynamic/unbounded,
> > > we cannot naively range partition it anymore. This can increase lock
> > > contention on swap-related data structures (swap cache, zswap’s xarray,
> > > etc.).
> > > * The problem is slightly alleviated by the lockless nature of the new
> > > reference counting scheme, as well as the per-entry locking for
> > > backing store information.
> > > * Johannes suggested that I can implement a dynamic partition scheme, in
> > > which new partitions (along with associated data structures) are
> > > allocated on demand. It is one extra layer of indirection, but global
> > > locking will be done only on partition allocation, rather than on
> > > each access. All other accesses only take local (per-partition)
> > > locks, or are completely lockless (such as partition lookup).
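
A rough userspace sketch of what such an on-demand partition scheme
could look like. All names, the fixed-size directory and the partition
size are assumptions for illustration only; a real implementation would
need a growable directory (an xarray, say) and kernel locking
primitives instead of pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define PART_SHIFT 10	/* 1024 slots per partition (arbitrary) */
#define MAX_PARTS  64	/* fixed directory, for the sketch only */

struct vswap_part {
	pthread_mutex_t lock;	/* local lock for this partition */
	unsigned long nr_used;	/* stand-in for swap cache / zswap tree state */
};

static _Atomic(struct vswap_part *) part_dir[MAX_PARTS];
static pthread_mutex_t part_alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lockless lookup: one atomic load, NULL if the partition does not exist. */
static struct vswap_part *part_lookup(unsigned long slot)
{
	unsigned long idx = slot >> PART_SHIFT;

	if (idx >= MAX_PARTS)
		return NULL;
	return atomic_load_explicit(&part_dir[idx], memory_order_acquire);
}

/* Slow path: the global lock is taken only when a partition is missing. */
static struct vswap_part *part_get_or_create(unsigned long slot)
{
	unsigned long idx = slot >> PART_SHIFT;
	struct vswap_part *p = part_lookup(slot);

	if (p || idx >= MAX_PARTS)
		return p;

	pthread_mutex_lock(&part_alloc_lock);
	/* Re-check under the lock: another thread may have won the race. */
	p = atomic_load_explicit(&part_dir[idx], memory_order_relaxed);
	if (!p) {
		p = calloc(1, sizeof(*p));
		if (p) {
			pthread_mutex_init(&p->lock, NULL);
			/* Publish only after the partition is fully set up. */
			atomic_store_explicit(&part_dir[idx], p,
					      memory_order_release);
		}
	}
	pthread_mutex_unlock(&part_alloc_lock);
	return p;
}

/* Regular access: only the local (per-partition) lock is taken. */
static void part_account_slot(unsigned long slot)
{
	struct vswap_part *p = part_get_or_create(slot);

	assert(p != NULL);
	pthread_mutex_lock(&p->lock);
	p->nr_used++;
	pthread_mutex_unlock(&p->lock);
}
```

The release store paired with the acquire load in part_lookup() is what
lets readers skip the global lock entirely. Partition teardown is left
out; it would need some form of deferred freeing so that lockless
readers never see a freed partition.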
> > >
> > >
> > > V. Benchmarking
> > >
> > > As a proof of concept, I ran the prototype through some simple
> > > benchmarks:
> > >
> > > 1. usemem: 16 threads, 2G each, memory.max = 16G
> > >
> > > I benchmarked the following usemem command:
> > >
> > > time usemem --init-time -w -O -s 10 -n 16 2g
> > >
> > > Baseline:
> > > real: 33.96s
> > > user: 25.31s
> > > sys: 341.09s
> > > average throughput: 111295.45 KB/s
> > > average free time: 2079258.68 usecs
> > >
> > > New Design:
> > > real: 35.87s
> > > user: 25.15s
> > > sys: 373.01s
> > > average throughput: 106965.46 KB/s
> > > average free time: 3192465.62 usecs
> > >
> > > To root cause this regression, I ran perf on the usemem program, as
> > > well as on the following stress-ng program:
> > >
> > > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng --pageswap $(nproc) --pageswap-ops 100000
> > >
> > > and observed the (predicted) increase in lock contention on swap cache
> > > accesses. This regression is alleviated if I put together the
> > > following hack: limit the virtual swap space to a sufficient size for
> > > the benchmark, range partition the swap-related data structures (swap
> > > cache, zswap tree, etc.) based on the limit, and distribute the
> > > allocation of virtual swap slots among these partitions (on a per-CPU
> > > basis):
> > >
> > > real: 34.94s
> > > user: 25.28s
> > > sys: 360.25s
> > > average throughput: 108181.15 KB/s
> > > average free time: 2680890.24 usecs
> > >
> > > As mentioned above, I will implement proper dynamic swap range
> > > partitioning in a follow-up work.
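
For reference, the relative deltas against the baseline can be computed
from the numbers quoted above. The small helper below is only a
convenience sketch (the struct and function names are made up); the
figures themselves are copied from the three usemem runs.

```c
#include <assert.h>
#include <stdio.h>

/* One usemem run, using the figures reported in this thread. */
struct usemem_run {
	const char *name;
	double real_s;
	double sys_s;
	double throughput_kbs;
};

static const struct usemem_run baseline =
	{ "baseline", 33.96, 341.09, 111295.45 };
static const struct usemem_run new_design =
	{ "new design", 35.87, 373.01, 106965.46 };
static const struct usemem_run partition_hack =
	{ "partition hack", 34.94, 360.25, 108181.15 };

/* Percentage change of value relative to base. */
static double pct_change(double base, double value)
{
	assert(base > 0.0);
	return (value - base) / base * 100.0;
}

/* Print the real/sys/throughput deltas of one run against the baseline. */
static void print_delta(const struct usemem_run *base,
			const struct usemem_run *run)
{
	printf("%-15s real %+.2f%%  sys %+.2f%%  throughput %+.2f%%\n",
	       run->name,
	       pct_change(base->real_s, run->real_s),
	       pct_change(base->sys_s, run->sys_s),
	       pct_change(base->throughput_kbs, run->throughput_kbs));
}
```

Calling print_delta(&baseline, &new_design) and
print_delta(&baseline, &partition_hack) puts the new design at roughly
+5.6% real time, +9.4% sys time and -3.9% throughput, and the
partitioning hack at roughly +2.9%, +5.6% and -2.8% respectively.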
> >
> > I thought there would be some improvements with the new design once the
> > lock contention is gone, due to the colocation of all swap metadata. Do
> > we know why this isn't the case?
>
> The lock contention is reduced on access, but increased in the
> allocation and free steps (because we have to go through a global lock
> now due to the loss of swap space partitioning).
>
> Virtual swap allocation optimization will be the next step, or it can
> be done concurrently, if we can figure out a way to make Kairui's work
> compatible with this.

To clarify a bit - what Kairui's proposal gives us (IIUC) is a dynamic
clustered approach on swap slot allocation. It's already done at the
physical level.

This is precisely what this RFC is missing. So if there is a way to
combine the work, I think it will go a long way in reducing the
regression.

That said, I haven't looked closely at his code yet, so I don't know
how easy/hard it is to combine the efforts :)
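
To illustrate why clustering helps with the allocation-side contention,
here is a generic sketch of cluster-based slot allocation. This is NOT
Kairui's implementation (which I have not read yet); the names and the
cluster size are invented, and freeing/reuse of slots is ignored. The
point is only that the global lock is taken once per cluster instead of
once per slot.

```c
#include <assert.h>
#include <pthread.h>

#define CLUSTER_SLOTS 512UL	/* slots handed out per global-lock round trip */

static pthread_mutex_t global_alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_cluster_start;	/* protected by global_alloc_lock */

/* Per-CPU (here: per-caller) cache of one cluster of contiguous slots. */
struct slot_cache {
	unsigned long next;	/* next free slot in the current cluster */
	unsigned long end;	/* one past the last slot of the cluster */
};

/* Slow path: grab a fresh cluster under the global lock. */
static void slot_cache_refill(struct slot_cache *c)
{
	pthread_mutex_lock(&global_alloc_lock);
	c->next = next_cluster_start;
	next_cluster_start += CLUSTER_SLOTS;
	pthread_mutex_unlock(&global_alloc_lock);
	c->end = c->next + CLUSTER_SLOTS;
}

/* Fast path: no global lock while the local cluster still has slots. */
static unsigned long slot_alloc(struct slot_cache *c)
{
	if (c->next == c->end)
		slot_cache_refill(c);
	assert(c->next < c->end);
	return c->next++;
}
```

A zero-initialized struct slot_cache starts out empty, so the first
slot_alloc() call on it takes the slow path and every following call in
that cluster stays local.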