From: Kairui Song <ryncsn@gmail.com>
Date: Tue, 4 Feb 2025 19:44:46 +0800
Subject: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
To: lsf-pc@lists.linux-foundation.org, linux-mm
Cc: Andrew Morton, Chris Li, Johannes Weiner, Chengming Zhou, Yosry Ahmed, Shakeel Butt, Hugh Dickins, Matthew Wilcox, Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, "Huang, Ying"

Hi all, sorry for the late submission.
Following previous work and topics on the SWAP allocator [1][2][3][4], this topic proposes a way to redesign and integrate multiple kinds of swap data into the swap allocator. This should be a future-proof design, achieving the following benefits:

- Even lower memory usage than the current design.
- Higher performance (removes the HAS_CACHE pin trampoline).
- Dynamic allocation and growth support, further reducing idle memory usage.
- A unified swapin path for a more maintainable code base (removes SYNC_IO).
- More extensible: provides a clean bedrock for implementing things like
  discontinuous swapout, readahead-based mTHP swapin, and more.

People have been complaining about the SWAP management subsystem [5]. Many incremental workarounds and optimizations have been added, but they cause many other problems, e.g. [6][7][8][9], and make implementing new features more difficult. One reason is that the current design already has close to minimal memory usage (a 1-byte swap map) with acceptable performance, so it is hard to beat with incremental changes. But as more code and features are added, there are already lots of duplicated parts. So I'm proposing this idea to overhaul the whole SWAP slot management from a different angle, as the follow-up work on the SWAP allocator [2].

Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of unifying swap data, and we worked together to implement the short-term solution first: the swap allocator was the bottleneck for performance and fragmentation issues. The new cluster allocator solved these issues and turned the cluster into a basic swap management unit.
It also removed the slot cache freeing path, and I'll post another series soon to remove the slot cache allocation path, so folios will always interact with the SWAP allocator directly, preparing for this long-term goal:

A brief intro of the new design
===============================

It will first be a drop-in replacement for the swap cache, using a per-cluster table to handle everything required for SWAP management. Compared to the previous attempt to unify the swap cache [11], this will have lower overhead with more features achievable:

struct swap_cluster_info {
	spinlock_t lock;
	u16 count;
	u8 flags;
	u8 order;
+	void *table; /* 512 entries */
	struct list_head list;
};

The table itself can have variants of format, but for basic usage, each void * could be of one of the following types:

/*
 * a NULL:    |----------- 0 -------------|     - Empty slot
 * a Shadow:  | SWAP_COUNT |--- Shadow ---|XX1| - Swapped out
 * a PFN:     | SWAP_COUNT |---- PFN -----|X10| - Cached
 * a Pointer: |-------- Pointer ----------|100| - Reserved / unused yet
 *
 * SWAP_COUNT is still 8 bits.
 */

Clearly it can hold both the cache and the swap count. The shadow still has enough bits for distance (using 16M buckets for a 52-bit VA) or generation counting. For COUNT_CONTINUED, it can simply allocate another 512 atomics for one cluster.

The table is protected by ci->lock, which has little to no contention. It also gets rid of the "HAS_CACHE bit setting vs cache insert" and "HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO, and removes the "multiple smaller files in one big swapfile" design.

It will further remove the swap cgroup map: a cached folio (stored as a PFN) or a shadow can provide that info. Some careful audit and workflow redesign might be needed.

Each entry will be 8 bytes, smaller than the current (8-byte cache) + (2-byte cgroup map) + (1-byte SWAP map) = 11 bytes. Shadow reclaim and high-order storing are still doable too, by introducing dense cluster table formats.
We can even optimize it specially for shmem to use 1 bit per entry, and empty clusters can have their table freed. This part might be optional.

It can also gain more types to support things like entry migration or a virtual swapfile. The example formats above show four types; the last three or more bits can be used as a type indicator, as HAS_CACHE and COUNT_CONTINUED will be gone.

Issues
======

There are unresolved problems or issues that may be worth some discussion:

- Is workingset node reclaim really worth doing? We didn't do it until
  5649d113ffce in 2023, especially considering slab fragmentation and the
  limited amount of SWAP compared to the file cache.

- Userspace API changes? This new design will allow dynamic growth of the
  swap size (especially for non-physical devices like ZRAM or a
  virtual/ghost swapfile). It may be worth thinking about how this could
  be used.

- Advanced usage and extensions for issues like "Swap Min Order" and
  "discontinuous swapout". For example, the "Swap Min Order" issue might
  be solvable by allocating only a specific order using the new cluster
  allocator, then having an abstract / virtual file as a batch layer.
  This layer may use some "redirection entries" in its table, with very
  low overhead, and be optional in real-world usage. Details are yet to
  be decided.

- Note that this will allow all swapin to no longer bypass the swap cache
  (just like a previous series [12]) with better performance. This may
  provide an opportunity to implement tunable readahead-based large folio
  swapin.

[1] https://lwn.net/Articles/974587/
[2] https://lpc.events/event/18/contributions/1769/
[3] https://lwn.net/Articles/984090/
[4] https://lwn.net/Articles/1005081/
[5] https://lwn.net/Articles/932077/
[6] https://lore.kernel.org/linux-mm/20240206182559.32264-1-ryncsn@gmail.com/
[7] https://lore.kernel.org/lkml/20240324210447.956973-1-hannes@cmpxchg.org/
[8] https://lore.kernel.org/lkml/20240926211936.75373-1-21cnbao@gmail.com/
[9] https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
[10] https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/
[11] https://lwn.net/Articles/966845/
[12] https://lore.kernel.org/lkml/874j7zfqkk.fsf@yhuang6-desk2.ccr.corp.intel.com/