From: Kairui Song
Date: Thu, 22 May 2025 12:13:37 +0800
Subject: Re: [PATCH 28/28] mm, swap: implement dynamic allocation of swap table
To: Nhat Pham
Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Hugh Dickins,
 Chris Li, David Hildenbrand, Yosry Ahmed, "Huang, Ying", Johannes Weiner,
 Baolin Wang, Baoquan He, Barry Song, Kalesh Singh, Kemeng Shi, Tim Chen,
 Ryan Roberts, linux-kernel@vger.kernel.org
References: <20250514201729.48420-1-ryncsn@gmail.com>
 <20250514201729.48420-29-ryncsn@gmail.com>

On Thu, May 22, 2025 at 3:38 AM Nhat Pham wrote:
>
> On Wed, May 14, 2025 at 1:20 PM Kairui Song wrote:
> >
> > From: Kairui Song
> >
> > Now swap table is cluster based, which means free clusters can free its
> > table since no one should modify it.
> >
> > There could be speculative readers, like swap cache look up, protect
> > them by making them RCU safe.
> > All swap table should be filled with null
> > entries before free, so such readers will either see a NULL pointer or
> > a null filled table being lazy freed.
> >
> > On allocation, allocate the table when a cluster is used by any order.
> >
> > This way, we can reduce the memory usage of large swap device
> > significantly.
> >
> > This idea to dynamically release unused swap cluster data was initially
> > suggested by Chris Li while proposing the cluster swap allocator and
> > I found it suits the swap table idea very well.
> >
> > Suggested-by: Chris Li
> > Signed-off-by: Kairui Song
>
> Nice optimization! Thanks!
>
> However, please correct me if I'm wrong - but we are only dynamically
> allocating the swap table with this patch. What we are getting here is
> the dynamic allocation of the swap entries' metadata (through the swap
> table), which my virtual swap prototype already provides. The cluster
> metadata struct (struct swap_cluster_info) itself is statically
> allocated still (at swapon time), correct?

That's true for now, but note that the static data is much smaller and
unified now, which enables more work in the following ways (I didn't
include it in this series because it is getting too long already):

The static data is only 48 bytes per 2M of swap space, so for example
with a 1TB swap device / space it is only 20M in total. Previously it
would have been at least 768M (and could be much higher, as I'm only
counting swap_map and the cgroup array here). Now the memory overhead is
0.0019% of the swap space.

The static data is also now just an intermediate cluster table that is
used in only one place (si->cluster_info), so reallocating it is doable:
readers of the actual swap table are protected by RCU and won't modify
the cluster metadata. The only updaters of the cluster metadata are
allocation and freeing, and they can be organized in better ways to
allow the cluster data to be reallocated.

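As a rough check of the overhead figures above, the arithmetic can be
sketched as follows. This is only a sketch under stated assumptions:
binary units (1M = 2**20 bytes), 4K pages, and 1 byte of swap_map plus
2 bytes of cgroup id per swap slot for the old layout. The function
names are made up for illustration. Under these assumptions, 20M per 1TB
and 0.0019% correspond to 40 bytes per cluster, while 48 bytes per
cluster works out to 24M per 1TB and about 0.0023%, so the quoted
figures are presumably rounded or based on a slightly different struct
size.

```python
# Back-of-the-envelope check of the static swap metadata overhead.
# Assumptions (not taken from the patch): binary units, 4K pages,
# old layout = 1 byte swap_map + 2 bytes cgroup id per swap slot.
# All names below are hypothetical.

MIB = 1 << 20
TIB = 1 << 40
PAGE_BYTES = 4096
CLUSTER_BYTES = 2 * MIB


def static_overhead_bytes(swap_bytes, per_cluster_bytes):
    """Static cluster metadata for a swap space of the given size."""
    clusters = swap_bytes // CLUSTER_BYTES
    return clusters * per_cluster_bytes


def old_overhead_bytes(swap_bytes, per_slot_bytes=1 + 2):
    """Old per-slot metadata: swap_map plus the cgroup array."""
    slots = swap_bytes // PAGE_BYTES
    return slots * per_slot_bytes


def overhead_percent(per_cluster_bytes):
    """Cluster metadata as a percentage of the space it describes."""
    return 100.0 * per_cluster_bytes / CLUSTER_BYTES


print(f"old layout: {old_overhead_bytes(TIB) // MIB}M per 1TB")
for size in (40, 48):
    total = static_overhead_bytes(TIB, size)
    print(f"{size} bytes/cluster: {total // MIB}M per 1TB, "
          f"{overhead_percent(size):.4f}% of swap space")
```

The same function applied to a larger space simply scales linearly,
since the cost is a fixed number of bytes per 2M cluster.
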
And due to the low memory overhead of the cluster metadata, it's now
totally acceptable to preallocate a much larger space. For example, we
can always preallocate a 4TB space on boot; that's 80M in total. That
might not seem trivial, but there is another planned series to make the
vmalloc space dynamic too, leveraging the page table directly, so the
20M per TB overhead can be avoided as well. Not sure if it will be
needed though, the overhead is so tiny already.

So in summary, what I have in mind is that we can either:

- Extend the cluster data when it's not enough (or getting fragmented).
  The table data is still accessible during the reallocation and the
  copied data is minimal, so it shouldn't be a heavy operation.
- Preallocate a larger amount of cluster data on swapon; the overhead
  is still very controllable.
- (Once we have a dynamic vmalloc) preallocate a super large space for
  swap and allocate each page when needed.

These ideas can be combined, and are related to each other.

> That will not work for a
> large virtual swap space :( So unfortunately, even with this swap
> table series, swap virtualization is still not trivial - definitely
> not as trivial as a new swap device type...
>
> Reading your physical swapfile allocator gives me some ideas though -
> let me build it into my prototype :) I'll send it out once it's ready.
>

Yeah, a virtual swap is definitely not trivial; it's challenging and
very important, just like you have demonstrated. It requires quite a lot
of work beyond metadata-level things, and I never expected it to be as
simple as "just another swap table entry type" :)

What I meant is that for this to be done with minimal overhead and
better flexibility, swap needs better infrastructure, which is what this
series is working on.
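
For illustration, the first option in the list above (extending the
cluster data when it runs out) can be modeled with a toy sketch. This is
not kernel code and not the series' implementation: the ClusterInfo
record, its fields, and grow_cluster_info() are hypothetical, and the
model ignores locking and RCU entirely. It only shows why the copy is
cheap: each per-cluster record is small and merely refers to its swap
table, so growing the array copies the records while the tables
themselves stay where they are.

```python
# Toy model of "extend the cluster data when it's not enough".
# Hypothetical names throughout; concurrency is deliberately ignored.
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class ClusterInfo:
    count: int = 0                     # used slots in this cluster
    table: Optional[List[int]] = None  # allocated only while in use


def grow_cluster_info(old: List[ClusterInfo], new_len: int) -> List[ClusterInfo]:
    """Return a larger array; existing tables are shared, not copied."""
    if new_len < len(old):
        raise ValueError("shrinking is not modeled here")
    # Shallow-copy each small record; the table reference is carried over.
    grown = [replace(ci) for ci in old]
    # New clusters start empty, with no table allocated yet.
    grown.extend(ClusterInfo() for _ in range(new_len - len(old)))
    return grown


clusters = [ClusterInfo() for _ in range(2)]
clusters[0].table = [0] * 512  # 512 slots: one 2M cluster of 4K pages
clusters[0].count = 1
clusters = grow_cluster_info(clusters, 4)
print(len(clusters), clusters[0].count, clusters[3].table)
```

A real implementation would also have to publish the new array safely
to concurrent readers and retire the old one afterwards, which this
model does not attempt to show.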