From: Kairui Song <ryncsn@gmail.com>
Date: Wed, 2 Jul 2025 02:49:41 +0800
Subject: Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
To: Baolin Wang
Cc: linux-mm@kvack.org, Andrew Morton, Hugh Dickins, Matthew Wilcox,
 Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
 linux-kernel@vger.kernel.org
References: <20250627062020.534-1-ryncsn@gmail.com>
 <20250627062020.534-5-ryncsn@gmail.com>
 <3d317ce9-2304-4361-adda-32abcc06e3dd@linux.alibaba.com>
 <1102fb2b-3e2e-4ae2-8609-cff6a4b0d821@linux.alibaba.com>
Content-Type: text/plain; charset="UTF-8"

On Tue, Jul 1, 2025 at 9:57 AM Baolin Wang wrote:
>
> On 2025/7/1 02:19, Kairui Song wrote:
> > On Mon, Jun 30, 2025 at 7:59 PM Baolin Wang wrote:
> >> On 2025/6/30 18:06, Kairui Song wrote:

... snip ...

> >>> It can only allocate an order 0 folio, but it can hit a large folio:
> >>> e.g. a parallel swapin/swapout happened, and the folio stays in the
> >>> swap cache while we are handling a swapin here.
> >>
> >> Yes, I know this. This is exactly the issue that patch 1 wants to fix.
> >
> > Hmm, patch 1 prevents the hang but doesn't prevent things like a
> > duplicated fault.
> >
> > Starts with:
> > swap entry val = 0x400, order = 4, covering index 0 - 15, faulting index 3:
> >
> > Before this patch:
> > CPU0                                    CPU1
> > shmem_swapin_folio (sees order = 4)
> >                                         shmem_swapin_folio (sees order = 4)
> > /* fallback to order 0 due to */
> > /* mem pressure / temporary pin / etc */
> > shmem_split_large_entry
> > /* split to order 0 */
> > /* interrupted */
> >                                         /* swapin done with order = 4 folio */
> >                                         /* swapout again, leave the large
> >                                            folio in cache temporarily */
> > folio = swap_cluster_readahead(0x403)
> > /* Gets folio order = 4, folio->swap = 0x400
> >    since swap_cluster_readahead uses the swap cache */
> > folio->swap.val != swap.val
> > /* ! Above test failed ! */
> > /* It shouldn't fail; the round down is needed */
>
> OK. Thanks for the explanation. Yes, this might cause a refault.
>
> > This patch moved the split after the swapin so it should be OK now,
>
> Yes, I agree with 'moving the split after the swapin'.
>
> > but still the split_order could be unstable, see below:
>
> >>>>> And I'm not sure if split_order is always reliable here; for example,
> >>>>> a concurrent split may return an inaccurate value here.
> >>>>
> >>>> We've held the xas lock to ensure the split is reliable; even though
> >>>> concurrent splits may occur, only one split can get the large
> >>>> 'split_order', another will return 0 (since it will see the large swap
> >>>> entry has already been split).
> >>>
> >>> Yes, it may return 0, so we can get a large folio here, but get
> >>> `split_order = 0`?
> >>
> >> If a split happens, that means 'order' > folio_order(), right? How
> >> can you get a large folio in this context?
> >>
> >>> And if a concurrent swapout/swapin happened, the `split_order` could
> >>> be a different value?
> >>
> >> What do you mean by a different value? The large swap entry can only be
> >> split once, so the 'split_order' must be 0 or the original large order.
> >
> > Since d53c78fffe7ad, shmem_split_large_entry doesn't split every slot
> > into order 0 IIUC, so things get complex if two CPUs are faulting on
> > different indexes landing in two different splitting zones:
> >
> > Before this patch:
> > swap entry val = 0x400, order = 9, covering index 0 - 511, faulting index 3:
> >
> > CPU0                                    CPU1
> > shmem_swapin_folio (index = 3)
> >                                         shmem_swapin_folio (index = 510)
> > /* Gets swap = 0x400 */                 /* Gets swap = 0x400 */
> > /* fallback to order 0 */               /* fallback to order 0 */
> > split_order = shmem_split_large_entry
> > /* gets split_order = 9 */
> > /* offset = 3 */
> >                                         split_order = shmem_split_large_entry
> >                                         /* gets split_order = 0, but no split */
> >                                         /* map order is 8, offset = 0 */
> >                                         /* wrong offset */
> >                                         shmem_swapin_cluster(0x400)
> >                                         /* It should swapin 0x5fe */
>
> Not true. IIUC, CPU1 will fail to split due to
> 'swp_to_radix_entry(swap) != old' in shmem_split_large_entry(), and will
> retry again to fix this race.
>
> > After this patch (with the appended fix which was left in the latest
> > patch by mistake) it won't swapin the wrong entry now, but
> > shmem_split_large_entry may still return an outdated order.
>
> Like I said above, I don't think we can get an outdated order, right?
>
> > Those are two previous races I could come up with. They no longer exist
> > after this patch. It's not a bug though, just redundant IO as far as I
> > can see, because other checks will fall back; it looks a bit fragile
> > though. But shmem_split_large_entry may still return an invalid order,
> > just much less likely.
> >
> > I think the idea here is: because `order =
> > shmem_confirm_swap(mapping, index, swap)` ensures the order is stable
> > and corresponds to the entry value at one point, keeping that value
> > is better (and so this patch does the offset calculation using the
> > `order` retrieved there before issuing the swapin).
>
> I don't think the 'order' obtained through shmem_confirm_swap()
> is stable, because shmem_confirm_swap() is only protected by RCU.
> However, I think the 'split_order' obtained from
> shmem_split_large_entry() under the xas lock is stable.
>
> > And after the swapin has brought a folio in, simply round down using
> > the folio's order, which should ensure the folio can be added
> > successfully in any case, as long as folio->swap and the index fit the
> > shmem mapping fine.
>
> >>>> Based on your current patch, would the following modifications be clearer?
> >>>>
> >>>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>>> index 5be9c905396e..91c071fb7b67 100644
> >>>> --- a/mm/shmem.c
> >>>> +++ b/mm/shmem.c
> >>>> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> >>>>         if (xas_error(&xas))
> >>>>                 return xas_error(&xas);
> >>>>
> >>>> -       return 0;
> >>>> +       return split_order;
> >>>>  }

Sorry about the bad example, I got confused looking at the wrong code
base and deduced a wrong stack. Before this series it was returning
entry_order here, so yeah, it's OK.
> >>>>
> >>>>  /*
> >>>> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>>                 error = shmem_split_swap_entry(inode, index, swap, gfp);
> >>>>                 if (error)
> >>>>                         goto failed_nolock;
> >>>> -       }
> >>>>
> >>>> -       index = round_down(index, 1 << swap_order);
> >>>> -       swap.val = round_down(swap.val, 1 << swap_order);
> >>>> +       /*
> >>>> +        * If the large swap entry has already been split, it is
> >>>> +        * necessary to recalculate the new swap entry based on
> >>>> +        * the old order alignment.
> >>>> +        */
> >>>> +       if (split_order > 0) {
> >
> > The split_order could still be an outdated value, e.g. we may even get
> > split_order = 0 with a large folio loaded here.
>
> Ditto. I don't see how split_order can be an outdated value.

The problem is: could split_order be 0 while we get a large folio here,
which is like the problem in patch 1? This patch changed the race
window a lot; even if that happens, shmem_split_swap_entry will just
fail and retry.

Or, can we get a split_order != 0 while the folio order == 0?

So assume we do this (checking split_order instead, trimming all comments):

if (order > folio_order(folio)) {
        split_order = shmem_split_swap_entry(inode, index, swap, gfp);
        if (split_order < 0)
                goto failed_nolock;
        if (split_order > 0) {
                offset = index - round_down(index, 1 << split_order);
                swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
                /* Could we get an unnecessary `folio->swap != swap`? */
        } else {
                /* Do we need any rounding here? */
                /* e.g. swap.val = round_down(swap.val, 1 << folio_order(folio)) */
                /* Could be helpful */
        }
} else if (order < folio_order(folio)) {
        swap.val = round_down(swap.val, 1 << folio_order(folio));
}

I was scratching my head late into the night to figure out whether we
need a round down in the /* Do we need any rounding here */ part. And
it seems we don't? There could be a lot of races, but
shmem_split_swap_entry is now actually guarded by the swap cache (or
swap_map's HAS_CACHE bit): to split an index, the folio must first be
loaded into the swap cache (or the HAS_CACHE bit set). So if a large
folio is loaded here, the underlying entries must remain unsplit, as
the folio prevents any parallel swapin of the entries it covers.

But this also requires us to call shmem_split_swap_entry with the
right swap and index values covered by the folio (if not, it's still
OK, just a redundant split or a redundant fault in the worst case it
seems; this cleanup can be done later instead).

But there is still a missing issue that I posted previously: this patch
really should do an early fixup of the swap entry for the cached swapin
(it's delayed to the final patch in this series).

So the shmem_swapin_cluster call should be:

/* Cached swapin currently only supports order 0 swapin */
offset = index - round_down(index, 1 << order);
folio = shmem_swapin_cluster(swp_entry(swp_type(swap),
                swp_offset(swap) + offset), gfp, info, index);

So far I think that's enough for this patch. I can send a V4 to keep
this part's change as minimal as possible, but in later commits things
may still have to change a bit... especially for avoiding redundant
calculation of offset and swap, minimizing the overhead, and ensuring
shmem_split_swap_entry is called with the right values to avoid
re-faults.

I'll try my best to keep this part as unchanged as possible, but I do
find it somewhat harder to track as the swapin path gets more
refactoring.
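By the way, a quick sanity check of the offset math above, as a tiny
standalone userspace sketch (illustration only, not kernel code:
round_down is re-defined locally, and the numbers are taken from the
order = 9 example earlier in this thread):

/*
 * Illustration only: recompute the entry CPU1 should swap in, using
 * the numbers from the order = 9 example above (base entry val 0x400,
 * covering index 0 - 511, faulting at index 510).
 */
#include <stdio.h>

/* local stand-in for the kernel's round_down(); y must be a power of 2 */
#define round_down(x, y) ((x) & ~((y) - 1UL))

int main(void)
{
        unsigned long swap_val = 0x400; /* base of the large swap entry */
        unsigned long index = 510;      /* faulting page index */
        unsigned int order = 9;         /* order of the large swap entry */

        /* offset of the faulting index inside the large entry */
        unsigned long offset = index - round_down(index, 1UL << order);

        /* the order 0 entry the cached swapin must actually read */
        printf("swapin entry = 0x%lx\n", swap_val + offset);
        return 0;
}

This prints 0x5fe, matching the entry CPU1 should have swapped in
above, rather than the base value 0x400 it gets with the wrong offset.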
My bad for making it confusing in the first place; I think the patches
are still not split well enough for easy review. My apologies if this
has caused you any inconvenience.