From: Kairui Song <ryncsn@gmail.com>
Date: Tue, 25 Mar 2025 23:23:51 -0400
Subject: Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
To: lsf-pc@lists.linux-foundation.org, linux-mm
Cc: Andrew Morton, Chris Li, Johannes Weiner, Chengming Zhou, Yosry Ahmed, Shakeel Butt, Hugh Dickins, Matthew Wilcox, Barry Song <21cnbao@gmail.com>, Nhat Pham, Usama Arif, Ryan Roberts, "Huang, Ying"
On Tue, Feb 4, 2025 at 6:44 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi all, sorry for the late submission.
>
> Following previous work and topics on the SWAP allocator
> [1][2][3][4], this topic proposes a way to redesign and integrate
> multiple kinds of swap data into the swap allocator, which should be a
> future-proof design, achieving the following benefits:
> - Even lower memory usage than the current design
> - Higher performance (remove the HAS_CACHE pin trampoline)
> - Dynamic allocation and growth support, further reducing idle memory usage
> - A unified swapin path for a more maintainable code base (remove SYNC_IO)
> - More extensibility, providing a clean bedrock for implementing things
>   like discontinuous swapout, readahead-based mTHP swapin, and more.
>
> People have been complaining about the SWAP management subsystem [5].
> Many incremental workarounds and optimizations have been added, but
> they cause many other problems, e.g. [6][7][8][9], and make
> implementing new features more difficult. One reason is that the
> current design already has close to minimal memory usage (the 1-byte
> swap map) with acceptable performance, so it is hard to beat with
> incremental changes. But as more code and features are added, there
> are already lots of duplicated parts. So I'm proposing this idea to
> overhaul the whole SWAP slot management from a different angle, as the
> follow-up to the work on the SWAP allocator [2].
>
> Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> unifying swap data, and we worked together to implement the short-term
> solution first: the swap allocator was the bottleneck for performance
> and fragmentation issues. The new cluster allocator solved these
> issues and turned the cluster into a basic swap management unit.
> It also removed the slot cache freeing path, and I'll post another
> series soon to remove the slot cache allocation path, so folios will
> always interact with the SWAP allocator directly, preparing for this
> long-term goal:
>
> A brief intro of the new design
> ===============================
>
> It will first be a drop-in replacement for the swap cache, using a
> per-cluster table to handle everything required for SWAP management.
> Compared to the previous attempt to unify the swap cache [11], this
> will have lower overhead with more features achievable:
>
> struct swap_cluster_info {
>         spinlock_t lock;
>         u16 count;
>         u8 flags;
>         u8 order;
> +       void *table; /* 512 entries */
>         struct list_head list;
> };
>
> The table itself can have variant formats, but for basic usage, each
> void * can be one of the following types:
>
> /*
>  * a NULL:    | ----------- 0 ------------|     - Empty slot
>  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
>  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
>  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
>  * SWAP_COUNT is still 8 bits.
>  */
>
> Clearly it can hold both the cache and the swap count. The shadow
> still has enough bits for distance (using 16M buckets for a 52-bit VA)
> or generation counting. For COUNT_CONTINUED, it can simply allocate
> another 512 atomics for one cluster.
>
> The table is protected by ci->lock, which has little to no contention.
> It also gets rid of the "HAS_CACHE bit setting vs cache insert" and
> "HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO,
> and removes the "multiple smaller files in one big swapfile" design.
>
> It will further remove the swap cgroup map. A cached folio (stored as
> a PFN) or a shadow can provide that info. Some careful audit and
> workflow redesign might be needed.
>
> Each entry will be 8 bytes, smaller than the current (8-byte cache
> entry) + (2-byte cgroup map) + (1-byte swap map) = 11 bytes.
>
> Shadow reclaim and high-order storing are still doable too, by
> introducing dense cluster-table formats. We can even optimize it
> specially for shmem to have 1 bit per entry. And empty clusters can
> have their tables freed. This part might be optional.
>
> And it can have more types for supporting things like entry migration
> or virtual swapfiles. The example formats above showed four types. The
> last three or more bits can be used as a type indicator, as HAS_CACHE
> and COUNT_CONTINUED will be gone.
>
> Issues
> ======
> There are unresolved problems or issues that may be worth addressing:
> - Is workingset node reclaim really worth doing? We didn't do that
> until 5649d113ffce in 2023, especially considering the fragmentation
> of slab and the limited amount of SWAP compared to file cache.
> - Userspace API changes? This new design will allow dynamic growth of
> swap size (especially for non-physical devices like ZRAM or a
> virtual/ghost swapfile). It may be worth thinking about how this can
> be used.
> - Advanced usage and extensions for issues like "Swap Min Order" and
> "discontinuous swapout". For example, the "Swap Min Order" issue might
> be solvable by allocating only a specific order using the new cluster
> allocator, then having an abstract/virtual file as a batch layer.
> This layer may use some "redirection entries" in its table, with very
> low overhead, and be optional in real-world usage. Details are yet to
> be decided.
> - Note that this will allow all swapins to no longer bypass the swap
> cache (just like the previous series) with better performance. This
> may provide an opportunity to implement a tunable readahead-based
> large folio swapin.
> [12]
>
> [1] https://lwn.net/Articles/974587/
> [2] https://lpc.events/event/18/contributions/1769/
> [3] https://lwn.net/Articles/984090/
> [4] https://lwn.net/Articles/1005081/
> [5] https://lwn.net/Articles/932077/
> [6] https://lore.kernel.org/linux-mm/20240206182559.32264-1-ryncsn@gmail.com/
> [7] https://lore.kernel.org/lkml/20240324210447.956973-1-hannes@cmpxchg.org/
> [8] https://lore.kernel.org/lkml/20240926211936.75373-1-21cnbao@gmail.com/
> [9] https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> [10] https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/
> [11] https://lwn.net/Articles/966845/
> [12] https://lore.kernel.org/lkml/874j7zfqkk.fsf@yhuang6-desk2.ccr.corp.intel.com/

Hi all,

Here are the slides presented today:
https://drive.google.com/file/d/1_QKlXErUkQ-TXmJJy79fJoLPui9TGK1S/view?usp=sharing