Date: Tue, 4 Feb 2025 11:24:26 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Kairui Song
Cc: lsf-pc@lists.linux-foundation.org, linux-mm, Andrew Morton, Chris Li,
    Chengming Zhou, Yosry Ahmed, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
    Barry Song <21cnbao@gmail.com>, Nhat Pham, Usama Arif, Ryan Roberts,
    "Huang, Ying"
Subject: Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
Message-ID: <20250204162426.GB705532@cmpxchg.org>

Hi Kairui,

On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> Hi all, sorry for the late submission.
>
> Following previous work and topics on the SWAP allocator [1][2][3][4],
> this topic proposes a way to redesign and integrate multiple swap data
> structures into the swap allocator, which should be a future-proof
> design, achieving the following benefits:
> - Even lower memory usage than the current design
> - Higher performance (Remove HAS_CACHE pin trampoline)
> - Dynamic allocation and growth support, further reducing idle memory usage
> - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> - More extensible, providing a clean bedrock for implementing things
>   like discontinuous swapout, readahead-based mTHP swapin and more.
>
> People have been complaining about the SWAP management subsystem [5].
> Many incremental workarounds and optimizations have been added, but
> they cause many other problems, e.g. [6][7][8][9], and make
> implementing new features more difficult. One reason is that the
> current design already has close to minimal memory usage (1 byte swap
> map) with acceptable performance, so it's hard to beat with
> incremental changes. But as more code and features are added, there
> are already lots of duplicated parts. So I'm proposing this idea to
> overhaul the whole SWAP slot management from a different angle, as
> follow-up work on the SWAP allocator [2].
>
> Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> unifying swap data; we worked together to implement the short-term
> solution first: the swap allocator was the bottleneck for performance
> and fragmentation issues. The new cluster allocator solved these
> issues and turned the cluster into a basic swap management unit.
> It also removed the slot cache freeing path, and I'll post another
> series soon to remove the slot cache allocation path, so folios will
> always interact with the SWAP allocator directly, preparing for this
> long-term goal:
>
> A brief intro of the new design
> ===============================
>
> It will first be a drop-in replacement for the swap cache, using a
> per-cluster table to handle everything required for SWAP management.
> Compared to the previous attempt to unify the swap cache [11], this
> will have lower overhead with more features achievable:
>
> struct swap_cluster_info {
>         spinlock_t lock;
>         u16 count;
>         u8 flags;
>         u8 order;
> +       void *table;    /* 512 entries */
>         struct list_head list;
> };
>
> The table itself can have variant formats, but for basic usage, each
> void * could be one of the following types:
>
> /*
>  * a NULL:    | ----------- 0 ------------|      - Empty slot
>  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
>  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
>  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
>  * SWAP_COUNT is still 8 bits.
>  */
>
> Clearly it can hold both the cache and the swap count. The shadow
> still has enough bits for distance (using 16M buckets for a 52-bit VA)
> or generation counting. For COUNT_CONTINUED, it can simply allocate
> another 512 atomics for one cluster.
>
> The table is protected by ci->lock, which has little to no contention.
> It also gets rid of the "HAS_CACHE bit setting vs cache insert" and
> "HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO,
> and removes the "multiple smaller files in one big swapfile" design.
>
> It will further remove the swap cgroup map. The cached folio (stored
> as a PFN) or the shadow can provide that info. Some careful audit and
> workflow redesign might be needed.
>
> Each entry will be 8 bytes, smaller than the current (8 bytes cache) +
> (2 bytes cgroup map) + (1 byte SWAP map) = 11 bytes.
>
> Shadow reclaim and high-order storing are still doable too, by
> introducing dense cluster table formats. We can even optimize it
> specially for shmem to use 1 bit per entry. And empty clusters can
> have their table freed. This part might be optional.
>
> And it can have more types to support things like entry migration or
> virtual swapfiles. The example formats above showed four types. The
> last three or more bits can be used as a type indicator, as HAS_CACHE
> and COUNT_CONTINUED will be gone.

My understanding is that this would still tie the swap space to
configured swapfiles. That aspect of the current design has more and
more turned into a problem, because we now have several categories of
swap entries that either permanently or for extended periods of time
live in memory. Such entries should not occupy actual disk space.

The oldest one is probably partially refaulted entries (where one out
of N swapped page tables faults back in). We currently have to spend
full pages of both memory AND disk space for these.

The newest ones are zero-filled entries, which are stored in a bitmap.

Then there is zswap. You mention ghost swapfiles - I know some setups
do this to use zswap purely for compression. But zswap is primarily a
writeback cache for real swapfiles, and it is used as such. That means
entries need to be able to move from the compressed pool to disk at
some point, but might not for a long time. Tying the compressed pool
size to disk space is hugely wasteful and an operational headache.
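To make the coupling concrete, a rough sketch (the example_* helpers
and the bit split are made up for illustration; the real encoding
lives in swapops.h and differs per config):

	/* A swap entry today names a (swapfile, slot-on-disk) pair. */
	typedef struct { unsigned long val; } swp_entry_t;

	#define EXAMPLE_OFFSET_BITS	58	/* illustrative split only */

	static inline swp_entry_t example_swp_entry(unsigned long type,
						    unsigned long offset)
	{
		return (swp_entry_t){ (type << EXAMPLE_OFFSET_BITS) | offset };
	}

	static inline unsigned long example_swp_type(swp_entry_t e)
	{
		return e.val >> EXAMPLE_OFFSET_BITS;		/* which swapfile */
	}

	static inline unsigned long example_swp_offset(swp_entry_t e)
	{
		return e.val & ((1UL << EXAMPLE_OFFSET_BITS) - 1); /* disk slot */
	}

	/*
	 * Because page tables, the zeromap and zswap all key their state
	 * by such an entry, the slot example_swp_offset() points at stays
	 * reserved in the swapfile even while the data never leaves memory.
	 */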
So I think any future-proof design for the swap allocator needs to
decouple the virtual memory layer (page table count, swapcache, memcg
linkage, shadow info) from the physical layer (swapfile slot). Can you
touch on that concern?
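To illustrate the kind of split I have in mind (all names below are
made up; this is a sketch of the indirection, not a concrete proposal):

	/*
	 * Virtual layer: owns everything the VM side cares about.  A
	 * physical swapfile slot is attached only if/when the entry is
	 * actually written out to disk.
	 */
	struct vswap_desc {
		unsigned int	swap_count;	/* page table references */
		void		*cache;		/* folio, shadow, or zswap object */
		unsigned short	memcg_id;	/* cgroup ownership */
		unsigned char	backing;	/* NONE, ZEROMAP, ZSWAP, or DISK */
		unsigned long	disk_slot;	/* valid only when backing == DISK */
	};

That way a zero-filled, zswapped or partially refaulted entry consumes
only the descriptor, and a swapfile slot is allocated at writeback time
rather than at swapout time.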