From: Kairui Song
Date: Wed, 5 Feb 2025 02:38:39 +0800
Subject: Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
To: Yosry Ahmed
Cc: Johannes Weiner, lsf-pc@lists.linux-foundation.org, linux-mm,
    Andrew Morton, Chris Li, Chengming Zhou, Shakeel Butt, Hugh Dickins,
    Matthew Wilcox, Barry Song <21cnbao@gmail.com>, Nhat Pham, Usama Arif,
    Ryan Roberts, "Huang, Ying"
References: <20250204162426.GB705532@cmpxchg.org>

On Wed, Feb 5, 2025 at 2:11 AM Yosry Ahmed wrote:
>
> On Wed, Feb 05, 2025 at 12:46:26AM +0800, Kairui Song wrote:
> > On Wed, Feb 5, 2025 at 12:24 AM Johannes Weiner wrote:
> > >
> > > Hi Kairui,
> > >
> > > On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > > > Hi all, sorry for the late submission.
> > > >
> > > > Following previous work and topics with the SWAP allocator
> > > > [1][2][3][4], this topic would propose a way to redesign and integrate
> > > > multiple swap data into the swap allocator, which should be a
> > > > future-proof design, achieving the following benefits:
> > > > - Even lower memory usage than the current design
> > > > - Higher performance (remove the HAS_CACHE pin trampoline)
> > > > - Dynamic allocation and growth support, further reducing idle memory usage
> > > > - Unifying the swapin path for a more maintainable code base (remove SYNC_IO)
> > > > - More extensible, providing a clean bedrock for implementing things
> > > > like discontinuous swapout, readahead-based mTHP swapin and more.
> > > >
> > > > People have been complaining about the SWAP management subsystem [5].
> > > > Many incremental workarounds and optimizations have been added, but they
> > > > cause many other problems, e.g. [6][7][8][9], and make implementing new
> > > > features more difficult. One reason is that the current design already has
> > > > nearly minimal memory usage (1-byte swap map) with acceptable
> > > > performance, so it's hard to beat with incremental changes. But
> > > > actually, as more code and features are added, there are already lots
> > > > of duplicated parts. So I'm proposing this idea to overhaul the whole SWAP
> > > > slot management from a different angle, as follow-up work to the
> > > > SWAP allocator [2].
> > > >
> > > > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> > > > unifying swap data; we worked together to implement the short-term
> > > > solution first: the swap allocator was the bottleneck for performance
> > > > and fragmentation issues. The new cluster allocator solved these
> > > > issues, and turned the cluster into a basic swap management unit.
> > > > It also removed the slot cache freeing path, and I'll post another series
> > > > soon to remove the slot cache allocation path, so folios will always
> > > > interact with the SWAP allocator directly, preparing for this long
> > > > term goal:
> > > >
> > > > A brief intro of the new design
> > > > ===============================
> > > >
> > > > It will first be a drop-in replacement for swap cache, using a per
> > > > cluster table to handle all things required for SWAP management.
> > > > Compared to the previous attempt to unify swap cache [11], this will
> > > > have lower overhead with more features achievable:
> > > >
> > > > struct swap_cluster_info {
> > > >         spinlock_t lock;
> > > >         u16 count;
> > > >         u8 flags;
> > > >         u8 order;
> > > > +       void *table; /* 512 entries */
> > > >         struct list_head list;
> > > > };
> > > >
> > > > The table itself can have variants of format, but for basic usage,
> > > > each void* could be in one of the following types:
> > > >
> > > > /*
> > > >  * a NULL:    | ----------- 0 ------------|     - Empty slot
> > > >  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
> > > >  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
> > > >  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> > > >  * SWAP_COUNT is still 8 bits.
> > > >  */
> > > >
> > > > Clearly it can hold both cache and swap count. The shadow still has
> > > > enough bits for distance (using 16M as buckets for 52-bit VA) or gen
> > > > counting. For COUNT_CONTINUED, it can simply allocate another 512
> > > > atomics for one cluster.
> > > >
> > > > The table is protected by ci->lock, which has little to no contention.
> > > > It also gets rid of the "HAS_CACHE bit setting vs Cache Insert" and
> > > > "HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO,
> > > > and removes the "multiple smaller files in one big swapfile" design.
> > > >
> > > > It will further remove the swap cgroup map. Cached folio (stored as
> > > > PFN) or shadow can provide such info. Some careful audit and workflow
> > > > redesign might be needed.
> > > >
> > > > Each entry will be 8 bytes, smaller than the current (8 bytes cache) + (2
> > > > bytes cgroup map) + (1 byte SWAP map) = 11 bytes.
> > > >
> > > > Shadow reclaim and high-order storing are still doable too, by
> > > > introducing dense cluster table formats. We can even optimize it
> > > > specially for shmem to have 1 bit per entry. And empty clusters can
> > > > have their table freed. This part might be optional.
> > > >
> > > > And it can have more types for supporting things like entry migrations
> > > > or virtual swapfiles. The example formats above show four types. The last
> > > > three or more bits can be used as a type indicator, as HAS_CACHE and
> > > > COUNT_CONTINUED will be gone.
> > >
> >
> > Hi Johannes
> >
> > > My understanding is that this would still tie the swap space to
> > > configured swapfiles. That aspect of the current design has more and
> > > more turned into a problem, because we now have several categories of
> > > swap entries that either permanently or for extended periods of time
> > > live in memory. Such entries should not occupy actual disk space.
> > >
> > > The oldest one is probably partially refaulted entries (where one out
> > > of N swapped page tables faults back in). We currently have to spend
> > > full pages of both memory AND disk space for these.
> > >
> > > The newest ones are zero-filled entries which are stored in a bitmap.
> > >
> > > Then there is zswap. You mention ghost swapfiles - I know some setups
> > > do this to use zswap purely for compression. But zswap is a writeback
> > > cache for real swapfiles primarily, and it is used as such. That means
> > > entries need to be able to move from the compressed pool to disk at
> > > some point, but might not for a long time. Tying the compressed pool
> > > size to disk space is hugely wasteful and an operational headache.
> > >
> > > So I think any future-proof design for the swap allocator needs to
> > > decouple the virtual memory layer (page table count, swapcache, memcg
> > > linkage, shadow info) from the physical layer (swapfile slot).
> > >
> > > Can you touch on that concern?
> >
> > Yes, I fully understand your concern. The purpose of this swap table
> > design is to provide a base for building other parts, including
> > decoupling the virtual layer from the physical layer.
> >
> > The table entry can have different types, so a virtual file/space can
> > leverage this too. For example, the virtual layer can have something
> > like a "redirection entry" pointing to a physical device layer, or
> > just a pointer to anything that could possibly be used (in the four
> > example formats I provided, one type is a pointer). A swap space will need
> > something to index its data.
> > We have already internally deployed a very similar solution for
> > multi-layer swapout, and it's working well; we expect to implement it
> > upstream and deprecate the downstream solution.
> >
> > Using an optional layer for doing so still consumes very little memory
> > (16 bytes per entry for two layers, and this might be doable with just a
> > single layer). And there are setups that don't need an extra layer;
> > such setups can ignore that part and have only 8 bytes per entry,
> > keeping the overhead very low.
>
> IIUC with this design we still have a fixed-size swap space, but it's
> not directly tied to the physical swap layer (i.e. it can be backed with
> a swap slot on disk, zswap, zero-filled pages, etc). Did I get this
> right?
>
> In this case, using clusters to manage this should be an implementation
> detail that is not visible to userspace. Ideally the kernel would
> allocate more clusters dynamically as needed, and when a swap entry is
> being allocated in that cluster the kernel chooses the backing for that
> swap entry based on the available options.
>
> I see the benefit of managing things on the cluster level to reduce
> memory overhead (e.g. one lock per cluster vs. per entry), and to
> leverage existing code where it makes sense.

Yes, agreed. A cluster-based map means we can have many empty clusters
without consuming any pre-reserved map memory. And extending the
cluster array should be doable too.

> However, what we should *not* do is have these clusters be tied to the
> disk swap space with the ability to redirect some entries to use
> something like zswap. This does not fix the problem Johannes is
> describing.

Yes, a virtual swap file can have its own swap space, which is indexed
by the cache / table, and reuses all the logic. As long as we don't
dramatically change the kernel swapout path, adding a folio to the swap
cache seems a very reasonable way to avoid redundant IO, synchronize it
upon swapin/swapout, and reuse a lot of infrastructure, even if that's
a virtual file. For example, a current busy-loop issue can be fixed
just by leveraging the folio lock:
https://lore.kernel.org/lkml/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/

The virtual file/space can be decoupled from the lower device. But the
virtual file/space's table entry can point to an underlying physical
SWAP device or some meta struct.
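To make the entry layout quoted above concrete, here is a minimal C
sketch of how such an 8-byte table entry could be packed and unpacked.
It assumes the low 3 bits hold the type tag and the top 8 bits hold
SWAP_COUNT, following the four example formats in the proposal; all
macro and helper names below are hypothetical illustrations, not taken
from an actual patch.

#include <stdint.h>
#include <stdbool.h>

/* Type tag in the low 3 bits (HAS_CACHE / COUNT_CONTINUED are gone). */
#define SWP_TABLE_TYPE_MASK   0x7ULL
#define SWP_TABLE_SHADOW      0x1ULL   /* ...XX1: swapped out, holds a workingset shadow */
#define SWP_TABLE_PFN         0x2ULL   /* ...X10: folio still cached, holds its PFN */
#define SWP_TABLE_PTR         0x4ULL   /* ...100: pointer to auxiliary metadata */

/* SWAP_COUNT stays 8 bits, assumed to live in the top byte of the entry. */
#define SWP_COUNT_SHIFT       56
#define SWP_COUNT_MASK        (0xffULL << SWP_COUNT_SHIFT)
/* Payload (shadow value or PFN) occupies bits 3..55 between tag and count. */
#define SWP_PAYLOAD_MASK      (~(SWP_COUNT_MASK | SWP_TABLE_TYPE_MASK))

static inline uint64_t swp_table_type(uint64_t ent)
{
        return ent & SWP_TABLE_TYPE_MASK;       /* which of the formats this is */
}

static inline bool swp_table_is_empty(uint64_t ent)
{
        return ent == 0;                        /* NULL entry: free slot */
}

static inline unsigned int swp_table_count(uint64_t ent)
{
        return ent >> SWP_COUNT_SHIFT;          /* swap count of this slot */
}

static inline uint64_t swp_table_make_pfn(uint64_t pfn, unsigned int count)
{
        /* Assumes the PFN fits in the 53 payload bits. */
        return ((uint64_t)count << SWP_COUNT_SHIFT) |
               ((pfn << 3) & SWP_PAYLOAD_MASK) | SWP_TABLE_PFN;
}

static inline uint64_t swp_table_pfn(uint64_t ent)
{
        return (ent & SWP_PAYLOAD_MASK) >> 3;   /* recover the cached folio's PFN */
}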
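Along the same lines, a sketch of a lookup through the per-cluster
table described above: one 512-entry array per cluster, read under the
cluster lock, with empty clusters carrying no table at all. A pthread
mutex stands in for the kernel spinlock so the snippet is
self-contained in userspace; the struct mirrors the one quoted above,
but the helper name and the modulo indexing are assumptions for
illustration, not the proposed kernel API.

#include <stdint.h>
#include <pthread.h>

#define SWAPFILE_CLUSTER 512            /* slots (table entries) per cluster */

/* Userspace stand-in for the struct swap_cluster_info quoted above. */
struct swap_cluster_info {
        pthread_mutex_t lock;           /* plays the role of the ci->lock spinlock */
        uint16_t count;
        uint8_t flags;
        uint8_t order;
        uint64_t *table;                /* SWAPFILE_CLUSTER entries; NULL when the cluster is empty */
};

/*
 * Read the table entry for a swap offset covered by @ci (caller has
 * initialized ci->lock). A NULL table means the cluster is empty, so
 * every slot reads back as the empty entry (0).
 */
static uint64_t swap_table_get(struct swap_cluster_info *ci, unsigned long offset)
{
        uint64_t ent = 0;

        pthread_mutex_lock(&ci->lock);
        if (ci->table)
                ent = ci->table[offset % SWAPFILE_CLUSTER];
        pthread_mutex_unlock(&ci->lock);
        return ent;
}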