From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 17 Jan 2025 08:51:41 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
To: Nhat Pham
Cc: lsf-pc@lists.linux-foundation.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
 ryncsn@gmail.com, chengming.zhou@linux.dev, chrisl@kernel.org, linux-mm@kvack.org,
 kernel-team@meta.com, linux-kernel@vger.kernel.org, shakeel.butt@linux.dev,
 hch@infradead.org, hughd@google.com, 21cnbao@gmail.com, usamaarif642@gmail.com
References: <20250116092254.204549-1-nphamcs@gmail.com>

On Thu, Jan 16, 2025 at 6:47 PM Nhat Pham wrote:
>
> On Fri, Jan 17, 2025 at 1:48 AM Yosry Ahmed wrote:
> >
> > On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham wrote:
> > >
> > > My apologies if I missed any interested party in the cc list -
> > > hopefully the mailing list cc's suffice :)
> > >
> > > I'd like to (re-)propose the topic of a swap abstraction layer for the
> > > conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> > > (see [1], [2], [3]).
> > >
> > > (AFAICT, the same idea has been floated by Rik van Riel since at
> > > least 2011 - see [8]).
> > >
> > > I have a working(-ish) prototype, which hopefully will be
> > > submission-ready soon. For now, I'd like to give the motivation/context
> > > for the topic, as well as some high level design:
> >
> > I would obviously be interested in attending this, albeit virtually if
> > possible. Just sharing some random thoughts below from my cold cache.
>
> Your inputs are always appreciated :)
>
> >
> > >
> > > I. Motivation
> > >
> > > Currently, when an anon page is swapped out, a slot in a backing swap
> > > device is allocated and stored in the page table entries that refer to
> > > the original page. This slot is also used as the "key" to find the
> > > swapped out content, as well as the index into swap data structures, such
> > > as the swap cache or the swap cgroup mapping. Tying a swap entry to its
> > > backing slot in this way is performant and efficient when swap is purely
> > > just disk space, and swapoff is rare.
> > >
> > > However, the advent of many swap optimizations has exposed major
> > > drawbacks of this design. The first problem is that we occupy a physical
> > > slot in the swap space, even for pages that are NEVER expected to hit
> > > the disk: pages compressed and stored in the zswap pool, zero-filled
> > > pages, or pages rejected by both of these optimizations when zswap
> > > writeback is disabled. This is arguably the central shortcoming of
> > > zswap:
> > > * In deployments where no disk space can be afforded for swap (such as
> > >   mobile and embedded devices), users cannot adopt zswap, and are forced
> > >   to use zram. This is confusing for users, and creates extra burdens
> > >   for developers, having to develop and maintain similar features for
> > >   two separate swap backends (writeback, cgroup charging, THP support,
> > >   etc.). For instance, see the discussion in [4].
> > > * Resource-wise, it is hugely wasteful in terms of disk usage, and it
> > >   limits the memory saving potential of these optimizations to the
> > >   static size of the swapfile, especially in high memory systems that
> > >   can have up to terabytes worth of memory. It also creates significant
> > >   challenges for users who rely on swap utilization as an early OOM
> > >   signal.
> > >
> > > Another motivation for a swap redesign is to simplify swapoff, which
> > > is complicated and expensive in the current design.
> > > Tight coupling between a swap entry and its backing storage means that it
> > > requires a whole page table walk to update all the page table entries that
> > > refer to this swap entry, as well as updating all the associated swap data
> > > structures (swap cache, etc.).
> > >
> > >
> > > II. High Level Design Overview
> > >
> > > To fix the aforementioned issues, we need an abstraction that separates
> > > a swap entry from its physical backing storage. IOW, we need to
> > > "virtualize" the swap space: swap clients will work with a virtual swap
> > > slot (that is dynamically allocated on-demand), storing it in page
> > > table entries, and using it to index into various swap-related data
> > > structures.
> > >
> > > The backing storage is decoupled from this slot, and the newly
> > > introduced layer will "resolve" the ID to the actual storage, as well
> > > as cooperate with the swap cache to handle all the required
> > > synchronization. This layer also manages other metadata of the swap
> > > entry, such as its lifetime information (swap count), via a dynamically
> > > allocated per-entry swap descriptor:
> >
> > Do you plan to allocate one per-folio or per-page? I suppose it's
> > per-page based on the design, but I am wondering if you explored
> > having it per-folio. To make it work we'd need to support splitting a
> > swp_desc, and figuring out which slot or zswap_entry corresponds to a
> > certain page in a folio.
>
> Per-page, for now. Per-folio requires allocating these swp_descs on
> huge page splitting etc., which is more complex.

We'd also need to allocate them during swapin. If a folio is swapped out
as a 16K chunk with a single swp_desc, and then we try to swap in one 4K
page in the middle, we may need to split the swp_desc into 2.

>
> And yeah, we need to chain these zswap_entry's somehow. Not impossible
> certainly, but more overhead and more complexity :)
>
> >
> > > struct swp_desc {
> > >         swp_entry_t vswap;
> > >         union {
> > >                 swp_slot_t slot;
> > >                 struct folio *folio;
> > >                 struct zswap_entry *zswap_entry;
> > >         };
> > >         struct rcu_head rcu;
> > >
> > >         rwlock_t lock;
> > >         enum swap_type type;
> > >
> > > #ifdef CONFIG_MEMCG
> > >         atomic_t memcgid;
> > > #endif
> > >
> > >         atomic_t in_swapcache;
> > >         struct kref refcnt;
> > >         atomic_t swap_count;
> > > };
> >
> > That seems a bit large. I am assuming this is for the purpose of the
> > prototype and we can reduce its size eventually, right?
>
> Yup. I copied and pasted this from the prototype. Originally I
> squeezed all the state (in_swapcache and the swap type) into an
> integer-type "flag" field + 1 separate swap count field, and protected
> them all with a single rw lock. That gets really ugly/confusing, so
> for the sake of the prototype I just separate them all out into their
> own fields, and play with atomicity to see if it's possible to do
> things locklessly. So far so good (i.e. no crashes yet), but the final
> form is TBD :) Maybe we can discuss in closer detail once I send out
> this prototype as an RFC?

Yeah, I just had some passing comments.

>
> (I will say though it looks cleaner when all these fields are
> separated. So it's going to be a tradeoff in that sense too.)

It's a tradeoff, but I think we should be able to hide a lot of the
complexity behind neat helpers. It's not pretty, but I think the memory
overhead is an important factor here.
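
To make the "neat helpers" point concrete, something along these lines is
the kind of accessor I'm picturing for the "resolve" step from section II
above - purely a sketch, with made-up names (vswap_backing_slot,
VSWAP_SWAPFILE) that are not from the prototype:

/*
 * Illustrative only: hide the type/union handling of struct swp_desc
 * behind one accessor, so callers never touch the union or the lock
 * directly. VSWAP_SWAPFILE is a hypothetical enum swap_type value.
 */
static bool vswap_backing_slot(struct swp_desc *desc, swp_slot_t *slot)
{
	bool backed = false;

	read_lock(&desc->lock);
	if (desc->type == VSWAP_SWAPFILE) {
		*slot = desc->slot;
		backed = true;
	}
	read_unlock(&desc->lock);

	return backed;
}

Callers that only care whether an entry is disk-backed would go through
something like this instead of poking at the union themselves.
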
>
> >
> > Particularly, I remember looking into merging the swap_count and
> > refcnt, and I am not sure what in_swapcache is (is this a bit? Why
> > can't we use a bit from swap_count?).
>
> Yup. That's a single bit - it's a (partial) replacement for
> SWAP_HAS_CACHE state in the existing swap map.
>
> No particular reason why we can't squeeze it into swap counts other
> than clarity :) It's going to be a bit annoying working with swap
> count values (swap count increment is now * 2 instead of ++ etc.).

Nothing a nice helper cannot hide :)
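
Something like this, roughly - again just a sketch with made-up names, and
the choice of bit 0 for the cache flag is an assumption on my part, not
something taken from the prototype:

/*
 * Illustrative only: fold the SWAP_HAS_CACHE-style bit into the low bit
 * of swap_count and hide the shifting behind helpers, so callers never
 * see the "* 2 instead of ++" detail.
 */
#define VSWAP_CACHE_BIT		1	/* hypothetical: bit 0 = in swap cache */
#define VSWAP_COUNT_SHIFT	1

static inline int vswap_count(struct swp_desc *desc)
{
	return atomic_read(&desc->swap_count) >> VSWAP_COUNT_SHIFT;
}

static inline bool vswap_in_swapcache(struct swp_desc *desc)
{
	return atomic_read(&desc->swap_count) & VSWAP_CACHE_BIT;
}

static inline void vswap_count_inc(struct swp_desc *desc)
{
	/* each new reference adds 2, leaving bit 0 untouched */
	atomic_add(1 << VSWAP_COUNT_SHIFT, &desc->swap_count);
}

With the increment/decrement wrapped like that, the packing shouldn't leak
outside these helpers.
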