Subject: Re: [RFC PATCH v3 2/5] mm: swap: introduce swap_nr_free() for batched swap_free()
From: Chuanhua Han <chuanhuahan@gmail.com>
Date: Mon, 18 Mar 2024 09:28:15 +0800
To: Ryan Roberts
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, chengming.zhou@linux.dev, chrisl@kernel.org, david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, mhocko@suse.com, nphamcs@gmail.com, shy828301@gmail.com, steven.price@arm.com, surenb@google.com, wangkefeng.wang@huawei.com, willy@infradead.org, xiang@kernel.org, ying.huang@intel.com, yosryahmed@google.com, yuzhao@google.com, Chuanhua Han, Barry Song
In-Reply-To: <76c16222-78fd-4d96-b9f7-13264bb37747@arm.com>
References: <20240304081348.197341-1-21cnbao@gmail.com> <20240304081348.197341-3-21cnbao@gmail.com> <499a60c6-eeb8-4bbd-8563-9717c0d2e43d@arm.com> <716bb29c-d2a2-4eef-b300-b037f08f458f@arm.com> <76c16222-78fd-4d96-b9f7-13264bb37747@arm.com>
Ryan Roberts wrote on Fri, 15 Mar 2024 at
18:57:
>
> On 15/03/2024 08:34, Chuanhua Han wrote:
> > Ryan Roberts wrote on Thu, 14 Mar 2024 at 21:43:
> >>
> >> On 14/03/2024 13:12, Chuanhua Han wrote:
> >>> Ryan Roberts wrote on Tue, 12 Mar 2024 at 02:51:
> >>>>
> >>>> On 04/03/2024 08:13, Barry Song wrote:
> >>>>> From: Chuanhua Han
> >>>>>
> >>>>> While swapping in a large folio, we need to free swaps related to the whole
> >>>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
> >>>>> to introduce an API for batched free.
> >>>>>
> >>>>> Signed-off-by: Chuanhua Han
> >>>>> Co-developed-by: Barry Song
> >>>>> Signed-off-by: Barry Song
> >>>>> ---
> >>>>>  include/linux/swap.h |  6 ++++++
> >>>>>  mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
> >>>>>  2 files changed, 41 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>>>> index 2955f7a78d8d..d6ab27929458 100644
> >>>>> --- a/include/linux/swap.h
> >>>>> +++ b/include/linux/swap.h
> >>>>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >>>>>  extern int swap_duplicate(swp_entry_t);
> >>>>>  extern int swapcache_prepare(swp_entry_t);
> >>>>>  extern void swap_free(swp_entry_t);
> >>>>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> >>>>
> >>>> nit: In my swap-out v4 series, I've created a batched version of
> >>>> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
> >>>> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
> >>>> scheme doesn't really work when applied to free_swap_and_cache().
> >>> Thanks for your suggestions; for the next version, we'll see which
> >>> naming is more appropriate!
> >>>>
> >>>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>>>>  extern int free_swap_and_cache(swp_entry_t);
> >>>>>  int swap_type_of(dev_t device, sector_t offset);
> >>>>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> >>>>>  {
> >>>>>  }
> >>>>>
> >>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>>>> +{
> >>>>> +
> >>>>> +}
> >>>>> +
> >>>>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >>>>>  {
> >>>>>  }
> >>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>>>> index 3f594be83b58..244106998a69 100644
> >>>>> --- a/mm/swapfile.c
> >>>>> +++ b/mm/swapfile.c
> >>>>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> >>>>>  		__swap_entry_free(p, entry);
> >>>>>  }
> >>>>>
> >>>>> +/*
> >>>>> + * Called after swapping in a large folio, batched free swap entries
> >>>>> + * for this large folio, entry should be for the first subpage and
> >>>>> + * its offset is aligned with nr_pages
> >>>>> + */
> >>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>>>> +{
> >>>>> +	int i;
> >>>>> +	struct swap_cluster_info *ci;
> >>>>> +	struct swap_info_struct *p;
> >>>>> +	unsigned type = swp_type(entry);
> >>>>
> >>>> nit: checkpatch.pl will complain about bare "unsigned", preferring "unsigned
> >>>> int" - or at least it did for me when I did something similar in my swap-out
> >>>> patch set.
> >>> Gee, thanks for pointing that out!
> >>>>
> >>>>> +	unsigned long offset = swp_offset(entry);
> >>>>> +	DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> >>>>
> >>>> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
> >>>> increases. But the only other way I can think of is to explicitly loop over
> >>>> fixed size chunks, and that's not much better.
> >>> Is it possible to save kernel stack better by using a bitmap here? If
> >>> SWAPFILE_CLUSTER=512, we consume only (512/64)*8 = 64 bytes.
> >>
> >> I'm not sure I've understood what you are saying?
> >> You're already using
> >> DECLARE_BITMAP(), so it's already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?
> >>
> >> I actually did a bad job of trying to express a couple of different points:
> >>
> >> - Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
> >> Certainly not for arm64, but not sure about other architectures. For example, if
> >> an arch had 64K pages with 8192 entries per THP and supports SWAP_THP, that's 1K
> >> for the bitmap, which is now looking pretty big for the stack.
> > I agree with you. The current bitmap grows linearly with
> > SWAPFILE_CLUSTER, which may cause the kernel stack to swell.
> > I need to think of a way to save more memory.
> >>
> >> - Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
> >> define a fixed stack size (e.g. 64 bytes -> 512 entries), then free the range of
> >> entries in batches no bigger than this size? This approach could also allow
> >> removing the constraint that the range has to be aligned and fit in a single
> >> cluster. Personally I think an approach like this would be much more robust, in
> >> return for a tiny bit more complexity.
> > Because we cannot determine how many swap entries a cluster has for a given
> > architecture or configuration, we do not know how large the variable
> > needs to be defined?

My point is that we could define a fixed size, then loop through the passed-in
range, operating on batches of that fixed size. You could even take into
consideration the cluster boundaries, so that you take the correct lock for every
batch and can drop the "must be naturally aligned, must be no bigger than
cluster size" constraint.

Thank you.
I understand it!
>
> >>
> >>>>
> >>>>> +
> >>>>> +	/* all swap entries are within a cluster for mTHP */
> >>>>> +	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >>>>> +
> >>>>> +	if (nr_pages == 1) {
> >>>>> +		swap_free(entry);
> >>>>> +		return;
> >>>>> +	}
> >>>>> +
> >>>>> +	p = _swap_info_get(entry);
> >>>>
> >>>> You need to handle this returning NULL, like swap_free() does.
> >>> Yes, you're right! We did forget to check for NULL here.
> >>>>
> >>>>> +
> >>>>> +	ci = lock_cluster(p, offset);
> >>>>
> >>>> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
> >>>> by rotating media, and clusters are not in use, it will lock the whole swap
> >>>> info. But your new version only calls lock_cluster(), which won't lock anything
> >>>> if clusters are not in use. So I think this is a locking bug.
> >>> Again, you're right, it's a bug!
> >>>>
> >>>>> +	for (i = 0; i < nr_pages; i++) {
> >>>>> +		if (__swap_entry_free_locked(p, offset + i, 1))
> >>>>> +			__bitmap_set(usage, i, 1);
> >>>>> +	}
> >>>>> +	unlock_cluster(ci);
> >>>>> +
> >>>>> +	for_each_clear_bit(i, usage, nr_pages)
> >>>>> +		free_swap_slot(swp_entry(type, offset + i));
> >>>>> +}
> >>>>> +
> >>>>>  /*
> >>>>>   * Called after dropping swapcache to decrease refcnt to swap entries.
> >>>>>   */
> >>>>
> >>>> Thanks,
> >>>> Ryan
>

-- 
Thanks,
Chuanhua