From: Barry Song <21cnbao@gmail.com>
Date: Tue, 20 Feb 2024 23:26:57 +1300
Subject: Re: [PATCH v4] mm/swap: fix race when skipping swapcache
To: Kairui Song
Cc: Andrew Morton, linux-mm@kvack.org, "Huang, Ying", Chris Li,
 Minchan Kim, Barry Song, Yu Zhao, SeongJae Park, David Hildenbrand,
 Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
 Yosry Ahmed, Aaron Lu, stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20240219082040.7495-1-ryncsn@gmail.com>
 <20240219173147.3f4b50b7c9ae554008f50b66@linux-foundation.org>
On Tue, Feb 20, 2024 at 5:56 PM Kairui Song wrote:
>
> On Tue, Feb 20, 2024 at 12:01 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Feb 20, 2024 at 4:42 PM Kairui Song wrote:
> > >
> > > On Tue, Feb 20, 2024 at 9:31 AM Andrew Morton wrote:
> > > >
> > > > On Mon, 19 Feb 2024 16:20:40 +0800 Kairui Song wrote:
> > > > >
> > > > > From: Kairui Song
> > > > >
> > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> > > > > swapin the same entry at the same time, they get different pages (A, B).
> > > > > Before one thread (T0) finishes the swapin and installs page (A)
> > > > > to the PTE, another thread (T1) could finish swapin of page (B),
> > > > > swap_free the entry, then swap out the possibly modified page
> > > > > reusing the same entry. It breaks the pte_same check in (T0) because
> > > > > the PTE value is unchanged, causing an ABA problem. Thread (T0) will
> > > > > install a stale page (A) into the PTE and cause data corruption.
> > > > >
> > > > > @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > >  	if (!folio) {
> > > > >  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > > > >  		    __swap_count(entry) == 1) {
> > > > > +			/*
> > > > > +			 * Prevent parallel swapin from proceeding with
> > > > > +			 * the cache flag. Otherwise, another thread may
> > > > > +			 * finish swapin first, free the entry, and swapout
> > > > > +			 * reusing the same entry. It's undetectable as
> > > > > +			 * pte_same() returns true due to entry reuse.
> > > > > +			 */
> > > > > +			if (swapcache_prepare(entry)) {
> > > > > +				/* Relax a bit to prevent rapid repeated page faults */
> > > > > +				schedule_timeout_uninterruptible(1);
> > > >
> > > > Well this is unpleasant.  How often can we expect this to occur?
> > > >
> > >
> > > The chance is very low: using the current mainline kernel and ZRAM,
> > > even with threads set to race on purpose using the reproducer I
> > > provided, it occurred 1528 times out of 647132 page faults (~0.2%).
> > >
> > > If I run MySQL and sysbench with 128 threads and a 16G buffer pool,
> > > with a 6G cgroup limit and 32G ZRAM, it occurred 1372 times in 40 min,
> > > out of 109930201 page faults in total (~0.001%).
>
> Hi Barry,
>
> > it might not be a problem for throughput, but for real-time and tail
> > latency this hurts. For example, this might increase dropped frames in
> > the UI, which is an important parameter when evaluating performance :-)

Hi Kairui,

> That's a true issue. As Chris mentioned before, I think we need to
> think of some clever data structure to solve this more naturally in
> the future; a similar issue exists for cached swapin as well and has
> been there for a while. On the other hand, I think maybe applications
> that are extremely latency sensitive should try to avoid swap on
> fault? A swapin could cause other issues like reclaim, throttling or
> contention with many other things; these seem to have a higher chance
> than this race.
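To restate the window under discussion, here is a rough userspace
analogy, illustrative only (plain variables stand in for the PTE and
the swap entry; this is not kernel code):

/* aba.c - build with: cc -pthread aba.c -o aba */
#include <stdio.h>
#include <pthread.h>

static unsigned long pte = 0x1234;	/* swap PTE referring to one entry */

static void *t1_swapin(void *arg)
{
	pte = 0;	/* T1 finishes its swapin, swap_free()s the entry */
	pte = 0x1234;	/* the possibly modified page is swapped out again,
			 * the allocator reuses the same entry, so the PTE
			 * takes exactly its old value */
	return NULL;
}

int main(void)
{
	unsigned long orig = pte;	/* T0 samples the PTE, starts swapin */
	pthread_t t1;

	/* T1's whole swapin + swapout happens inside T0's window */
	pthread_create(&t1, NULL, t1_swapin, NULL);
	pthread_join(&t1, NULL);

	if (pte == orig)	/* the pte_same()-style recheck passes... */
		printf("ABA: recheck cannot see the reuse, stale data wins\n");
	return 0;
}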
Ideally, if memory were very, very large, we could avoid swap or mlock
a lot of things. On Android phones, most anon memory is actually in
swap, since these are systems with limited memory. For example, users
might switch between a couple of applications; some cold app could be
entirely swapped out, yet those applications can be re-activated by
users all of a sudden. We do mlock some limited memory, but we don't
abuse mlock() everywhere :-)

For a soft real-time system, a lot of other optimization is involved
to make sure RT/UI tasks get priority on locks, memory, etc. Overall,
we live together with swap but still try our best to give important
tasks low latency. The current patch, to me, seems to add a new place
where high-priority tasks have no way to get done faster. But I do
understand the percentage is not high, and I have no doubt you have
done your best work on this.

I'm just curious whether the number will increase many times over for
large folio swap-in, since the conflicting memory range is enlarged,
and also what its impact on UI and RT tasks will be. Thus, I have
followed up your work and made it support large folio swap-in[1] as
below. I will get phones to run it and update you with the result
(could be 3-4 weeks later).
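Back-of-envelope on the enlarged window, assuming 4KiB base pages: a
64KiB large folio spans 16 contiguous swap entries, so the fault has to
win SWAP_HAS_CACHE on all 16 slots at once, and a racing fault anywhere
in that range now collides instead of only on the single entry. Naively
that is a ~16x wider window (512x for a 2MiB folio); whether the real
numbers scale like that is exactly what the phone runs should show.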
Subject: [PATCH] mm: swap: introduce swapcache_prepare_nr and
 swapcache_clear_nr for large folios swap-in

Apply Kairui's work to large folio swap-in.

Signed-off-by: Barry Song
---
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a293ef17c2b6..f1cf64c9ccb5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -483,6 +483,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
+extern int swapcache_prepare_nr(swp_entry_t, int nr);
 extern void swap_free(swp_entry_t);
 extern void swap_nr_free(swp_entry_t entry, int nr_pages);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
diff --git a/mm/memory.c b/mm/memory.c
index 2d27c087a39e..9cfd806a8236 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3905,7 +3905,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			 * reusing the same entry. It's undetectable as
 			 * pte_same() returns true due to entry reuse.
 			 */
-			if (swapcache_prepare(entry)) {
+			if (swapcache_prepare_nr(entry, nr_pages)) {
 				/* Relax a bit to prevent rapid repeated page faults */
 				schedule_timeout_uninterruptible(1);
 				goto out;
@@ -4194,7 +4194,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache)
-		swapcache_clear(si, entry);
+		swapcache_clear_nr(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4210,7 +4210,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_put(swapcache);
 	}
 	if (need_clear_cache)
-		swapcache_clear(si, entry);
+		swapcache_clear_nr(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
diff --git a/mm/swap.h b/mm/swap.h
index 693d1b281559..a457496bd669 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -39,6 +39,7 @@ void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
+void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6ecee63cf678..8c9d53f9f068 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3322,65 +3322,76 @@ void si_swapinfo(struct sysinfo *val)
  * - swap-cache reference is requested but the entry is not used. -> ENOENT
  * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
  */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)
 {
 	struct swap_info_struct *p;
 	struct swap_cluster_info *ci;
 	unsigned long offset;
-	unsigned char count;
-	unsigned char has_cache;
-	int err;
+	unsigned char count[SWAPFILE_CLUSTER];
+	unsigned char has_cache[SWAPFILE_CLUSTER];
+	int err, i;
 
 	p = swp_swap_info(entry);
 
 	offset = swp_offset(entry);
 	ci = lock_cluster_or_swap_info(p, offset);
 
-	count = p->swap_map[offset];
+	for (i = 0; i < nr; i++) {
+		count[i] = p->swap_map[offset + i];
 
-	/*
-	 * swapin_readahead() doesn't check if a swap entry is valid, so the
-	 * swap entry could be SWAP_MAP_BAD. Check here with lock held.
-	 */
-	if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
-		err = -ENOENT;
-		goto unlock_out;
+		/*
+		 * swapin_readahead() doesn't check if a swap entry is valid, so the
+		 * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+		 */
+		if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
+			err = -ENOENT;
+			goto unlock_out;
+		}
+
+		has_cache[i] = count[i] & SWAP_HAS_CACHE;
+		count[i] &= ~SWAP_HAS_CACHE;
+		err = 0;
+
+		if (usage == SWAP_HAS_CACHE) {
+			/* set SWAP_HAS_CACHE if there is no cache and entry is used */
+			if (!has_cache[i] && count[i])
+				has_cache[i] = SWAP_HAS_CACHE;
+			else if (has_cache[i])	/* someone else added cache */
+				err = -EEXIST;
+			else			/* no users remaining */
+				err = -ENOENT;
+		} else if (count[i] || has_cache[i]) {
+			if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+				count[i] += usage;
+			else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
+				err = -EINVAL;
+			else if (swap_count_continued(p, offset + i, count[i]))
+				count[i] = COUNT_CONTINUED;
+			else
+				err = -ENOMEM;
+		} else
+			err = -ENOENT;		/* unused swap entry */
+
+		if (err)
+			goto unlock_out;
 	}
 
-	has_cache = count & SWAP_HAS_CACHE;
-	count &= ~SWAP_HAS_CACHE;
-	err = 0;
-
-	if (usage == SWAP_HAS_CACHE) {
-
-		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
-		if (!has_cache && count)
-			has_cache = SWAP_HAS_CACHE;
-		else if (has_cache)		/* someone else added cache */
-			err = -EEXIST;
-		else				/* no users remaining */
-			err = -ENOENT;
-
-	} else if (count || has_cache) {
-
-		if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
-			count += usage;
-		else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
-			err = -EINVAL;
-		else if (swap_count_continued(p, offset, count))
-			count = COUNT_CONTINUED;
-		else
-			err = -ENOMEM;
-	} else
-		err = -ENOENT;			/* unused swap entry */
-
-	WRITE_ONCE(p->swap_map[offset], count | has_cache);
-
+	for (i = 0; i < nr; i++)
+		WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
 unlock_out:
 	unlock_cluster_or_swap_info(p, ci);
 	return err;
 }
 
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+	return __swap_duplicate_nr(entry, 1, usage);
+}
+
 /*
  * Help swapoff by noting that swap entry belongs to shmem/tmpfs
  * (in which case its reference count is never incremented).
@@ -3419,17 +3430,33 @@ int swapcache_prepare(swp_entry_t entry)
 	return __swap_duplicate(entry, SWAP_HAS_CACHE);
 }
 
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+int swapcache_prepare_nr(swp_entry_t entry, int nr)
+{
+	return __swap_duplicate_nr(entry, nr, SWAP_HAS_CACHE);
+}
+
+void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr)
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
-	unsigned char usage;
+	unsigned char usage[SWAPFILE_CLUSTER];
+	int i;
 
 	ci = lock_cluster_or_swap_info(si, offset);
-	usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
+	for (i = 0; i < nr; i++)
+		usage[i] = __swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE);
 	unlock_cluster_or_swap_info(si, ci);
-	if (!usage)
-		free_swap_slot(entry);
+	for (i = 0; i < nr; i++) {
+		if (!usage[i])
+			free_swap_slot(entry);
+		/* advance even when this entry wasn't freed */
+		entry.val++;
+	}
+}
+
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+{
+	swapcache_clear_nr(si, entry, 1);
 }
 
 struct swap_info_struct *swp_swap_info(swp_entry_t entry)

[1] https://lore.kernel.org/linux-mm/20240118111036.72641-1-21cnbao@gmail.com/
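For reference, the caller-side shape in do_swap_page() that the hunks
above aim at -- a condensed sketch only, with error paths, the PTL
handling and the actual mapping elided; nr_pages is the number of PTEs
covered by the large folio:

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1) {
		/* pin all nr_pages entries, or back off if any is held */
		if (swapcache_prepare_nr(entry, nr_pages)) {
			/* Relax a bit to prevent rapid repeated page faults */
			schedule_timeout_uninterruptible(1);
			goto out;
		}
		need_clear_cache = true;
		/* ... allocate the large folio, swap_read_folio() ... */
	}
	/* ... pte_same() recheck under the PTL, install nr_pages PTEs ... */
out:
	/* Clear the swap cache pins for direct swapin after PTL unlock */
	if (need_clear_cache)
		swapcache_clear_nr(si, entry, nr_pages);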
> > BTW, I wonder if Ying's previous proposal - moving swapcache_prepare()
> > after swap_read_folio() - will further help decrease the number?
>
> We can move the swapcache_prepare after folio alloc or cgroup charge,
> but I didn't see an observable change in the statistics; for some
> workloads the reading is even worse. I think that's mostly due to
> noise, or to a higher swap-out rate, since all raced threads will now
> allocate an extra folio. Applications that have many pages swapped out
> due to a memory limit are already on the edge of triggering another
> reclaim, so a dozen more folio allocations could just trigger that...
> sometimes.

The system might be pushed to that edge not because of this app, but
because users launch another app in the foreground and this one becomes
background.

> And we can't move it after swap_read_folio()... That's exactly what we
> want to protect.

understood, thanks!

Barry