From: Kairui Song
Date: Wed, 19 Feb 2025 19:12:05 +0800
Subject: Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
To: Baoquan He
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
 Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Kalesh Singh,
 linux-kernel@vger.kernel.org
References: <20250214175709.76029-1-ryncsn@gmail.com>
 <20250214175709.76029-6-ryncsn@gmail.com>

On Wed, Feb 19, 2025 at 6:55 PM Baoquan He wrote:
>
> On 02/19/25 at 04:34pm, Kairui Song wrote:
> > On Wed, Feb 19, 2025 at 3:54 PM Baoquan He wrote:
> >
> > Hi Baoquan,
> >
> > Thanks for the review!
> >
> > >
> > > On 02/15/25 at 01:57am, Kairui Song wrote:
> > > > From: Kairui Song
> > > >
> > > > Current allocation workflow first traverses the plist with a global lock
> > > > held; after choosing a device, it uses the percpu cluster on that swap
> > > > device. This commit moves the percpu cluster variable out of being tied
> > > > to individual swap devices, making it a global percpu variable, which will
> > > > be used directly for allocation as a fast path.
> > > >
> > > > The global percpu cluster variable will never point to a HDD device, and
> > > > allocation on HDD devices is still globally serialized.
> > > >
> > > > This improves the allocator performance and prepares for removal of the
> > > > slot cache in later commits. There shouldn't be much observable behavior
> > > > change, except one thing: this changes how swap device allocation
> > > > rotation works.
> > > >
> > > > Currently, each allocation rotates the plist, and because of the
> > > > existence of the slot cache (64 entries), swap devices of the same
> > > > priority are rotated for every 64 entries consumed. High order
> > > > allocations are different: they bypass the slot cache, so the swap
> > > > device is rotated for every 16K, 32K, or up to 2M allocation.
> > > >
> > > > The rotation rule was never clearly defined or documented, and it was
> > > > changed several times without mention.
> > > >
> > > > After this commit, once the slot cache is gone in later commits, swap
> > > > device rotation will happen for every consumed cluster. Ideally non-HDD
> > > > devices will be rotated if 2M space has been consumed for each order, this seems
> > >
> > > This breaks the rule where the high priority swap device is always taken
> > > for allocation as long as there's free space in the device. After this
> > > patch, it will try the percpu cluster first, which may be lower priority,
> > > even though the higher priority device has free space. However, this only
> > > happens when the higher priority device is exhausted, so it is not the
> > > generic case. If this is expected, it may need to be mentioned in the log
> > > or a doc somewhere at least.
> >
> > Hmm, actually this rule was already broken if you are very strict
> > about it. The current percpu slot cache does a pre-allocation, so the
> > high priority device can be removed from the plist while some CPU's
> > slot cache is still holding usable entries.
> >
> > If the high priority device is exhausted, some CPU's percpu cluster
> > will indeed point to a low priority device, and keep using it until
> > the percpu cluster is drained. I think this should be OK. The high
> > priority device is already full, so the swapouts that fall back to
> > the low priority device are only a performance issue; I think it's a
> > tiny change for a rare case.
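
To make the fast path ordering concrete, here is a toy userspace model
(my illustration only, not the patch code; names like swap_alloc and
alloc_from are made up, and __thread stands in for DEFINE_PER_CPU):
the cached per-CPU device is tried first, regardless of plist priority,
and the priority list is only walked when the fast path fails.

/* toy model: percpu cluster fast path, priority list slow path */
#include <stdbool.h>
#include <stddef.h>

struct swap_dev {
        int prio;               /* higher prio sits earlier in the list */
        long free_slots;        /* free slots left on this device */
};

struct pcpu_cluster {
        struct swap_dev *si;    /* cached device, NULL when invalid */
};

/* one cache per CPU; __thread models DEFINE_PER_CPU in userspace */
static __thread struct pcpu_cluster pcpu_cluster;

static bool alloc_from(struct swap_dev *si, long *slot)
{
        if (si->free_slots <= 0)
                return false;
        *slot = --si->free_slots;       /* grab one slot */
        return true;
}

static long swap_alloc(struct swap_dev **devs, size_t n)
{
        long slot;

        /* fast path: keep using the cached device until it fails,
         * even if a higher priority device has free space again */
        if (pcpu_cluster.si && alloc_from(pcpu_cluster.si, &slot))
                return slot;

        /* slow path: walk devices in priority order and re-cache */
        for (size_t i = 0; i < n; i++) {
                if (alloc_from(devs[i], &slot)) {
                        pcpu_cluster.si = devs[i];
                        return slot;
                }
        }
        return -1;              /* all devices are full */
}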
> >
> > > >
> > > > reasonable. HDD devices are rotated for every allocation regardless of
> > > > the allocation order, which should be OK and trivial.
> > > >
> > > > Signed-off-by: Kairui Song
> > > > ---
> > > >  include/linux/swap.h |  11 ++--
> > > >  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------
> > > >  2 files changed, 79 insertions(+), 52 deletions(-)
> > > ......
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index ae3bd0a862fc..791cd7ed5bdf 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> > > > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> > > >
> > > ......snip....
> > > >  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > > >  {
> > > >       int order = swap_entry_order(entry_order);
> > > > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > > >       int n_ret = 0;
> > > >       int node;
> > > >
> > > > +     /* Fast path using percpu cluster */
> > > > +     local_lock(&percpu_swap_cluster.lock);
> > > > +     n_ret = swap_alloc_fast(swp_entries,
> > > > +                             SWAP_HAS_CACHE,
> > > > +                             order, n_goal);
> > > > +     if (n_ret == n_goal)
> > > > +             goto out;
> > > > +
> > > > +     n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
> > >
> > > Here, the behaviour is changed too. In the old allocation, a partial
> > > allocation will jump out and return. In this patch, you try the percpu
> > > cluster first, then call scan_swap_map_slots() to try its best, and will
> > > jump out even though the partial allocation succeeded. But the allocation
> > > from scan_swap_map_slots() could happen on a different si device, which
> > > looks bizarre. Do you think we need to reconsider the design?
> >
> > Right, that's a behavior change, but it only temporarily affects the slot
> > cache. get_swap_pages will only be called with size > 1 when order == 0,
> > and only by the slot cache. (Large order allocations always use size == 1,
> > and other users only use order == 0 && size == 1.) So I didn't notice
> > it, as this series is removing the slot cache.
> >
> > The partial side effect would be "returned slots may be from different
> > devices" and "the slot cache may get drained faster, as get_swap_pages
> > may return fewer slots when the percpu cluster is drained". This might
> > be a performance issue but it seems slight and trivial; the slot cache
> > can still work. And the next commit will just remove the slot cache,
> > and the problem will be gone. I think I can add a comment about it
> > here?
>
> By the way, another thing I suddenly think of is the percpu cluster
> becoming global over all devices. If one order on the stored si of the
> percpu cluster is used up, then the whole percpu cluster is drained and
> rewritten. Won't this impact performance compared with the old embedding
> of the percpu cluster in each si? In the log you said "Ideally non-HDD
> devices will be rotated if 2M space has been consumed for each order,
> this seems reasonable.", while in reality it may be very difficult to
> achieve the 'each 2M space has been consumed for each order'; more often
> it will happen when very little of an order's space has been consumed,
> and then the percpu cluster is rewritten. Wonder what I have missed
> about this point.

Hi Baoquan,

> then the whole percpu cluster is drained and rewritten

Not sure what you mean. Swap IO doesn't happen in units of clusters;
a cluster is only a management unit for slots, so only allocated /
freed slots will be written. Discard is a different thing, and this
should have very little effect on that.

Or do you mean the percpu struct array getting updated? It should be
even faster than before: with a global percpu variable it is easier to
calculate the offset at runtime, using GS.
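
For reference, this is roughly the difference in how the percpu data is
reached (a sketch from memory, not the exact patch structures; before
this series swap_info_struct carried an alloc_percpu()'d cluster
pointer, the patch switches to one statically defined percpu variable):

	/* before: per-device dynamic percpu data; two dependent
	 * loads, first si->percpu_cluster, then the percpu offset */
	struct percpu_cluster __percpu *percpu_cluster; /* in si */
	cluster = this_cpu_ptr(si->percpu_cluster);

	/* after: one global static percpu variable; its address is a
	 * link-time constant plus the percpu offset, a single
	 * %gs-relative access on x86-64 */
	static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster);
	cluster = this_cpu_ptr(&percpu_swap_cluster);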
> in reality it may be very difficult to achieve the 'each 2M space has
> been consumed for each order',

Very true, but notice that for order >= 1, the slot cache never worked
before. And for order == 0, it's very likely that a cluster will have
more than 64 usable slots. The test result I posted should be a good
example: the device was very full during the test, and performance was
basically identical to before.

My only concern was about device rotation: as the slot cache never
worked for order >= 1, devices rotated very frequently there. But it
still seems no one really cared about it; mTHP swapout is a new thing,
and the previous rotation rule seems even more confusing than this new
idea.
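
For concreteness, a back-of-envelope comparison of the rotation
granularity (my numbers, assuming 4K pages and 2M clusters of 512
slots, as in the commit log; the 64 is the old slot cache size):

/* rotation granularity, old rule vs new rule */
#include <stdio.h>

int main(void)
{
        const long page_kib = 4, cluster_slots = 512, slot_cache = 64;

        /* old, order 0: slot cache batches 64 slots per device */
        printf("old order-0: rotate every %ld KiB\n",
               slot_cache * page_kib);          /* 256 KiB */
        /* old, order 9: bypasses slot cache, rotates per allocation */
        printf("old order-9: rotate every %ld KiB\n",
               (1L << 9) * page_kib);           /* 2048 KiB */
        /* new: rotate once a whole cluster is consumed, any order */
        printf("new: rotate every %ld KiB at most\n",
               cluster_slots * page_kib);       /* 2048 KiB */
        return 0;
}

So for order 0 the rotation gets coarser (up to one cluster instead of
64 slots), and for high orders it stays at the same 2M granularity in
the ideal case quoted above.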