From: Kairui Song
Date: Wed, 5 Feb 2025 02:38:39 +0800
Subject: Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
To: Yosry Ahmed
Cc: Johannes Weiner, lsf-pc@lists.linux-foundation.org, linux-mm,
    Andrew Morton, Chris Li, Chengming Zhou, Shakeel Butt, Hugh Dickins,
    Matthew Wilcox, Barry Song <21cnbao@gmail.com>, Nhat Pham, Usama Arif,
    Ryan Roberts, "Huang, Ying"
References: <20250204162426.GB705532@cmpxchg.org>

On Wed, Feb 5, 2025 at 2:11 AM Yosry Ahmed wrote:
>
> On Wed, Feb 05, 2025 at 12:46:26AM +0800, Kairui Song wrote:
> > On Wed, Feb 5, 2025 at 12:24 AM Johannes Weiner wrote:
> > >
> > > Hi Kairui,
> > >
> > > On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > > > Hi all, sorry for the late submission.
> > > >
> > > > Following previous work and topics with the SWAP allocator
> > > > [1][2][3][4], this topic would propose a way to redesign and integrate
> > > > multiple swap data into the swap allocator, which should be a
> > > > future-proof design, achieving the following benefits:
> > > > - Even lower memory usage than the current design
> > > > - Higher performance (remove the HAS_CACHE pin trampoline)
> > > > - Dynamic allocation and growth support, further reducing idle memory usage
> > > > - Unifying the swapin path for a more maintainable code base (remove SYNC_IO)
> > > > - More extensible, providing a clean bedrock for implementing things
> > > > like discontinuous swapout, readahead-based mTHP swapin and more.
> > > >
> > > > People have been complaining about the SWAP management subsystem [5].
> > > > Many incremental workarounds and optimizations have been added, but they
> > > > cause many other problems, e.g. [6][7][8][9], and make implementing new
> > > > features more difficult. One reason is that the current design already has
> > > > nearly minimal memory usage (1-byte swap map) with acceptable
> > > > performance, so it's hard to beat with incremental changes. But
> > > > actually, as more code and features are added, there are already lots
> > > > of duplicated parts. So I'm proposing this idea to overhaul the whole SWAP
> > > > slot management from a different angle, as follow-up work to the
> > > > SWAP allocator [2].
> > > >
> > > > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> > > > unifying swap data; we worked together to implement the short-term
> > > > solution first: the swap allocator was the bottleneck for performance
> > > > and fragmentation issues. The new cluster allocator solved these
> > > > issues, and turned the cluster into a basic swap management unit.
> > > > It also removed the slot cache freeing path, and I'll post another series
> > > > soon to remove the slot cache allocation path, so folios will always
> > > > interact with the SWAP allocator directly, preparing for this long
> > > > term goal:
> > > >
> > > > A brief intro of the new design
> > > > ===============================
> > > >
> > > > It will first be a drop-in replacement for swap cache, using a per
> > > > cluster table to handle all things required for SWAP management.
> > > > Compared to the previous attempt to unify swap cache [11], this will
> > > > have lower overhead with more features achievable:
> > > >
> > > > struct swap_cluster_info {
> > > >         spinlock_t lock;
> > > >         u16 count;
> > > >         u8 flags;
> > > >         u8 order;
> > > > +       void *table; /* 512 entries */
> > > >         struct list_head list;
> > > > };
> > > >
> > > > The table itself can have variants of format, but for basic usage,
> > > > each void* could be in one of the following types:
> > > >
> > > > /*
> > > >  * a NULL:    | ----------- 0 ------------|     - Empty slot
> > > >  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
> > > >  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
> > > >  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> > > >  * SWAP_COUNT is still 8 bits.
> > > >  */
> > > >
> > > > Clearly it can hold both cache and swap count. The shadow still has
> > > > enough bits for distance (using 16M as buckets for 52-bit VA) or gen
> > > > counting. For COUNT_CONTINUED, it can simply allocate another 512
> > > > atomics for one cluster.
> > > >
> > > > The table is protected by ci->lock, which has little to no contention.
> > > > It also gets rid of the "HAS_CACHE bit setting vs Cache Insert" and
> > > > "HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO,
> > > > and removes the "multiple smaller files in one big swapfile" design.
> > > >
> > > > It will further remove the swap cgroup map. Cached folio (stored as
> > > > PFN) or shadow can provide such info. Some careful audit and workflow
> > > > redesign might be needed.
> > > >
> > > > Each entry will be 8 bytes, smaller than the current (8 bytes cache) + (2
> > > > bytes cgroup map) + (1 byte SWAP map) = 11 bytes.
> > > >
> > > > Shadow reclaim and high-order storing are still doable too, by
> > > > introducing dense cluster table formats. We can even optimize it
> > > > specially for shmem to have 1 bit per entry. And empty clusters can
> > > > have their table freed. This part might be optional.
> > > >
> > > > And it can have more types for supporting things like entry migrations
> > > > or virtual swapfiles. The example formats above show four types. The last
> > > > three or more bits can be used as a type indicator, as HAS_CACHE and
> > > > COUNT_CONTINUED will be gone.
> > >
> >
> > Hi Johannes
> >
> > > My understanding is that this would still tie the swap space to
> > > configured swapfiles. That aspect of the current design has more and
> > > more turned into a problem, because we now have several categories of
> > > swap entries that either permanently or for extended periods of time
> > > live in memory. Such entries should not occupy actual disk space.
> > >
> > > The oldest one is probably partially refaulted entries (where one out
> > > of N swapped page tables faults back in). We currently have to spend
> > > full pages of both memory AND disk space for these.
> > >
> > > The newest ones are zero-filled entries which are stored in a bitmap.
> > >
> > > Then there is zswap. You mention ghost swapfiles - I know some setups
> > > do this to use zswap purely for compression. But zswap is a writeback
> > > cache for real swapfiles primarily, and it is used as such. That means
> > > entries need to be able to move from the compressed pool to disk at
> > > some point, but might not for a long time. Tying the compressed pool
> > > size to disk space is hugely wasteful and an operational headache.
> > >
> > > So I think any future-proof design for the swap allocator needs to
> > > decouple the virtual memory layer (page table count, swapcache, memcg
> > > linkage, shadow info) from the physical layer (swapfile slot).
> > >
> > > Can you touch on that concern?
> >
> > Yes, I fully understand your concern. The purpose of this swap table
> > design is to provide a base for building other parts, including
> > decoupling the virtual layer from the physical layer.
> >
> > The table entry can have different types, so a virtual file/space can
> > leverage this too. For example, the virtual layer can have something
> > like a "redirection entry" pointing to a physical device layer, or
> > just a pointer to anything that could possibly be used (in the four
> > example formats I provided, one type is a pointer). A swap space will need
> > something to index its data.
> > We have already internally deployed a very similar solution for
> > multi-layer swapout, and it's working well; we expect to implement it
> > upstream and deprecate the downstream solution.
> >
> > Using an optional layer for doing so still consumes very little memory
> > (16 bytes per entry for two layers, and this might be doable with just a
> > single layer). And there are setups that don't need an extra layer;
> > such setups can ignore that part and have only 8 bytes per entry,
> > keeping the overhead very low.
>
> IIUC with this design we still have a fixed-size swap space, but it's
> not directly tied to the physical swap layer (i.e. it can be backed with
> a swap slot on disk, zswap, zero-filled pages, etc). Did I get this
> right?
>
> In this case, using clusters to manage this should be an implementation
> detail that is not visible to userspace. Ideally the kernel would
> allocate more clusters dynamically as needed, and when a swap entry is
> being allocated in that cluster the kernel chooses the backing for that
> swap entry based on the available options.
>
> I see the benefit of managing things on the cluster level to reduce
> memory overhead (e.g. one lock per cluster vs. per entry), and to
> leverage existing code where it makes sense.

Yes, agreed. A cluster-based map means we can have many empty clusters
without consuming any pre-reserved map memory. And extending the
cluster array should be doable too.

> However, what we should *not* do is have these clusters be tied to the
> disk swap space with the ability to redirect some entries to use
> something like zswap. This does not fix the problem Johannes is
> describing.

Yes, a virtual swap file can have its own swap space, which is indexed
by the cache / table, and reuses all the logic. As long as we don't
dramatically change the kernel swapout path, adding a folio to the swap
cache seems a very reasonable way to avoid redundant IO, synchronize it
upon swapin/swapout, and reuse a lot of infrastructure, even if that's
a virtual file. For example, a current busy-loop issue can be fixed
just by leveraging the folio lock:
https://lore.kernel.org/lkml/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/

The virtual file/space can be decoupled from the lower device. But the
virtual file/space's table entry can point to an underlying physical
SWAP device or some meta struct.
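To make the entry layout quoted above concrete, here is a minimal C
sketch of how such an 8-byte table entry could be packed and unpacked.
It assumes the low 3 bits hold the type tag and the top 8 bits hold
SWAP_COUNT, following the four example formats in the proposal; all
macro and helper names below are hypothetical illustrations, not taken
from an actual patch.

#include <stdint.h>
#include <stdbool.h>

/* Type tag in the low 3 bits (HAS_CACHE / COUNT_CONTINUED are gone). */
#define SWP_TABLE_TYPE_MASK   0x7ULL
#define SWP_TABLE_SHADOW      0x1ULL   /* ...XX1: swapped out, holds a workingset shadow */
#define SWP_TABLE_PFN         0x2ULL   /* ...X10: folio still cached, holds its PFN */
#define SWP_TABLE_PTR         0x4ULL   /* ...100: pointer to auxiliary metadata */

/* SWAP_COUNT stays 8 bits, assumed to live in the top byte of the entry. */
#define SWP_COUNT_SHIFT       56
#define SWP_COUNT_MASK        (0xffULL << SWP_COUNT_SHIFT)
/* Payload (shadow value or PFN) occupies bits 3..55 between tag and count. */
#define SWP_PAYLOAD_MASK      (~(SWP_COUNT_MASK | SWP_TABLE_TYPE_MASK))

static inline uint64_t swp_table_type(uint64_t ent)
{
        return ent & SWP_TABLE_TYPE_MASK;       /* which of the formats this is */
}

static inline bool swp_table_is_empty(uint64_t ent)
{
        return ent == 0;                        /* NULL entry: free slot */
}

static inline unsigned int swp_table_count(uint64_t ent)
{
        return ent >> SWP_COUNT_SHIFT;          /* swap count of this slot */
}

static inline uint64_t swp_table_make_pfn(uint64_t pfn, unsigned int count)
{
        /* Assumes the PFN fits in the 53 payload bits. */
        return ((uint64_t)count << SWP_COUNT_SHIFT) |
               ((pfn << 3) & SWP_PAYLOAD_MASK) | SWP_TABLE_PFN;
}

static inline uint64_t swp_table_pfn(uint64_t ent)
{
        return (ent & SWP_PAYLOAD_MASK) >> 3;   /* recover the cached folio's PFN */
}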
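Along the same lines, a sketch of a lookup through the per-cluster
table described above: one 512-entry array per cluster, read under the
cluster lock, with empty clusters carrying no table at all. A pthread
mutex stands in for the kernel spinlock so the snippet is
self-contained in userspace; the struct mirrors the one quoted above,
but the helper name and the modulo indexing are assumptions for
illustration, not the proposed kernel API.

#include <stdint.h>
#include <pthread.h>

#define SWAPFILE_CLUSTER 512            /* slots (table entries) per cluster */

/* Userspace stand-in for the struct swap_cluster_info quoted above. */
struct swap_cluster_info {
        pthread_mutex_t lock;           /* plays the role of the ci->lock spinlock */
        uint16_t count;
        uint8_t flags;
        uint8_t order;
        uint64_t *table;                /* SWAPFILE_CLUSTER entries; NULL when the cluster is empty */
};

/*
 * Read the table entry for a swap offset covered by @ci (caller has
 * initialized ci->lock). A NULL table means the cluster is empty, so
 * every slot reads back as the empty entry (0).
 */
static uint64_t swap_table_get(struct swap_cluster_info *ci, unsigned long offset)
{
        uint64_t ent = 0;

        pthread_mutex_lock(&ci->lock);
        if (ci->table)
                ent = ci->table[offset % SWAPFILE_CLUSTER];
        pthread_mutex_unlock(&ci->lock);
        return ent;
}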