From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Sat, 24 May 2025 04:01:13 +0800
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, Baolin Wang, Baoquan He, Chris Li,
 David Hildenbrand, Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML,
 linux-mm, Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen,
 Matthew Wilcox, "Huang, Ying", Yosry Ahmed
References: <20250514201729.48420-6-ryncsn@gmail.com>
 <20250519043847.1806-1-21cnbao@gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, May 21, 2025 at 2:45 PM Kairui Song wrote:
> >
> > Barry Song <21cnbao@gmail.com> wrote on Wed, May 21, 2025 at 06:33:
> > >
> > > On Wed, May 21, 2025 at 7:10 AM Kairui Song wrote:
> > > >
> > > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song wrote:
> > > > > >
> > > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > > From: Kairui Song
> > > > > > > >
> > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > > --- a/mm/userfaultfd.c
> > > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > > >                                 goto retry;
> > > > > > > >                         }
> > > > > > > >                 }
> > > > > > > > +               if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > > +                       err = -EBUSY;
> > > > > > > > +                       goto out;
> > > > > > > > +               }
> > > > > > >
> > > > > > > It seems we don't need this. In move_swap_pte(), we have been
> > > > > > > checking that the pte pages are stable:
> > > > > > >
> > > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > >                                  dst_pmd, dst_pmdval)) {
> > > > > > >                 double_pt_unlock(dst_ptl, src_ptl);
> > > > > > >                 return -EAGAIN;
> > > > > > >         }
> > > > > >
> > > > > > The tricky part is that when swap_cache_get_folio returns the folio,
> > > > > > both the folio and the ptes are unlocked. So is it possible that
> > > > > > someone else swapped in the entries, then swapped them out again
> > > > > > using the same entries?
> > > > > >
> > > > > > The folio will be different here, but the PTEs still have the same
> > > > > > value, so they will pass the is_pte_pages_stable check; we previously
> > > > > > saw similar races with anon fault or shmem. I think more strict
> > > > > > checking won't hurt here.
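
(A note on why entry reuse slips past that check: is_pte_pages_stable()
only re-compares the raw page table values under the PTL, so a swap
entry that was freed and then reused with the same bit pattern passes.
Roughly, from my reading of mm/userfaultfd.c -- treat this as an
illustrative sketch, not the exact code:)

static bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
				pte_t orig_dst_pte, pte_t orig_src_pte,
				pmd_t *dst_pmd, pmd_t dst_pmdval)
{
	/* Same bit pattern counts as "stable", even if the entry was
	 * freed and reused in between -- a classic ABA window. */
	return pte_same(ptep_get(src_pte), orig_src_pte) &&
	       pte_same(ptep_get(dst_pte), orig_dst_pte) &&
	       pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
}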
> > > > >
> > > > > This doesn't seem to be the same case as the one you fixed in
> > > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > > case, there was no one hitting the swap cache, and you used
> > > > > swap_prepare() to set up the cache to fix the issue.
> > > > >
> > > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > > not guard against that case either.
> > > >
> > > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > > above. Thanks for catching this!
> > > >
> > > > > But I suspect we won't have a problem, since we're not swapping in --
> > > > > we didn't read any stale data, right? Swap-in will only occur after we
> > > > > move the PTEs.
> > > >
> > > > My concern is that a parallel swapin / swapout could result in the
> > > > folio being a completely irrelevant or invalid folio.
> > > >
> > > > It's not about the dst, but on the move src side, something like:
> > > >
> > > > CPU1                            CPU2
> > > > move_pages_pte
> > > >   folio = swap_cache_get_folio(...)
> > > >   | Got folio A here
> > > > move_swap_pte
> > > >                                 <swapin src_pte, then swapout it
> > > >                                  again, reusing the same entry>
> > > >   | Now folio A is no longer valid.
> > > >   | It's very unlikely, but here SWAP
> > > >   | could reuse the same entry as above.
> > >
> > > swap_cache_get_folio() does increment the folio's refcount, but it seems
> > > this doesn't prevent do_swap_page() from freeing the swap entry after
> > > swapping in src_pte with folio A, if it's a read fault.
> > > For a write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > > will be false:
> > >
> > > static inline bool should_try_to_free_swap(struct folio *folio,
> > >                                            struct vm_area_struct *vma,
> > >                                            unsigned int fault_flags)
> > > {
> > >         ...
> > >
> > >         /*
> > >          * If we want to map a page that's in the swapcache writable, we
> > >          * have to detect via the refcount if we're really the exclusive
> > >          * user. Try freeing the swapcache to get rid of the swapcache
> > >          * reference only in case it's likely that we'll be the exclusive user.
> > >          */
> > >         return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > >                 folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > > }
> > >
> > > and for swapout, __remove_mapping() does check the refcount as well:
> > >
> > > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > >                             bool reclaimed, struct mem_cgroup *target_memcg)
> > > {
> > >         refcount = 1 + folio_nr_pages(folio);
> > >         if (!folio_ref_freeze(folio, refcount))
> > >                 goto cannot_free;
> > > }
> > >
> > > However, since __remove_mapping() occurs after pageout(), it seems
> > > this also doesn't prevent swapout from allocating a new swap entry to
> > > fill src_pte.
> > >
> > > It seems your concern is valid, unless I'm missing something.
> > > Do you have a reproducer? If so, this will likely need a separate fix
> > > patch rather than being hidden in this patchset.
> >
> > Thanks for the analysis. I don't have a reproducer yet. I did some
> > local experiments and it seems possible, but the race window is so
> > tiny that it's very difficult to make the swap entry reuse collide
> > with it. I'll try more, but in theory this seems possible, or at
> > least looks very fragile.
> >
> > And yeah, let's patch the kernel first if that's a real issue.
> >
> > > > double_pt_lock
> > > > is_pte_pages_stable
> > > > | Passed because of entry reuse.
> > > > folio_move_anon_rmap(...)
> > > > | Moved invalid folio A.
> > > >
> > > > And could it be possible that swap_cache_get_folio returns NULL
> > > > here, but later, right before the double_pt_lock, a folio is added
> > > > to the swap cache? Maybe we'd better check the swap cache after the
> > > > clear and after releasing the dst lock, but before releasing the
> > > > src lock?
> > >
> > > It seems you're suggesting that a parallel swap-in allocates and adds
> > > a folio to the swap cache, but the PTE has not yet been updated from
> > > a swap entry to a present mapping?
> > >
> > > As long as do_swap_page() adds the folio to the swap cache
> > > before updating the PTE to present, this scenario seems possible.
> >
> > Yes, there are two kinds of problems here. I suspected there could be an
> > ABA problem while working on the series, but wasn't certain. And just
> > realised there could be another missed cache read here thanks to your
> > review and discussion :)
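
(To be precise about why the refcount from the lookup doesn't save us:
it keeps the folio's memory alive, but not its swap cache membership.
Once the swap count reaches zero, a locked folio can be dropped from the
swap cache no matter how many references are held -- abridged from
mm/swapfile.c as I read it, shortened here for illustration:)

bool folio_free_swap(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return false;
	if (folio_test_writeback(folio))
		return false;
	if (folio_swapped(folio))	/* swap count still non-zero? */
		return false;
	...
	delete_from_swap_cache(folio);	/* entry is now free to be reused */
	folio_set_dirty(folio);
	return true;
}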
> > >
> > > It seems we need to double-check:
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..976053bd2bf1 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >         if (src_folio) {
> > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > +       } else {
> > > +               struct folio *folio = filemap_get_folio(swap_address_space(entry),
> > > +                                                       swap_cache_index(entry));
> > > +               if (!IS_ERR_OR_NULL(folio)) {
> > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > +                       return -EAGAIN;
> > > +               }
> > >         }
> > > -
> > >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > >  #ifdef CONFIG_MEM_SOFT_DIRTY
> > >         orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
> >
> > Maybe it has to get even dirtier here and call swapcache_prepare() too,
> > to cover the SYNC_IO case?
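
(For completeness, the "dirtier" variant I mean would look roughly like
the following -- untested, with error handling elided, and assuming the
current swapcache_prepare()/swapcache_clear() semantics, where claiming
SWAP_HAS_CACHE fails if a concurrent swapin already owns the entry even
though no folio is in the swap cache yet:)

	if (!src_folio) {
		/* No folio in the swap cache: also rule out a SYNC IO
		 * swapin that bypasses the cache but holds the entry. */
		if (swapcache_prepare(entry, 1)) {
			double_pt_unlock(dst_ptl, src_ptl);
			return -EAGAIN;
		}
		/* ... move the swap PTE as before ... */
		swapcache_clear(si, entry, 1);
	}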
> > > Let me run test case [1] to check whether this ever happens. I guess
> > > I need to hack the kernel a bit to always add the folio to the
> > > swapcache even for SYNC IO.
> >
> > That will cause quite a performance regression, I think. The good thing
> > is that this is exactly the problem this series solves, by dropping the
> > SYNC IO swapin path and never bypassing the swap cache, while improving
> > performance and eliminating things like this. One more reason to justify
> > the approach :)

Hi Barry,

> I attempted to reproduce the scenario where a folio is added to the
> swapcache after filemap_get_folio() returns NULL but before
> move_swap_pte() moves the swap PTE, using non-synchronized I/O.
> Technically, this seems possible; however, I was unable to reproduce it,
> likely because the time window between swapin_readahead and taking the
> page table lock within do_swap_page() is too short.

Thank you so much for trying this! I have been trying to reproduce it
too, and so far I didn't observe any crash or warning.

I added the following debug code:

static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma,
                                             unsigned long dst_end)
@@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
        pmd_t dummy_pmdval;
        pmd_t dst_pmdval;
        struct folio *src_folio = NULL;
+       struct folio *tmp_folio = NULL;
        struct anon_vma *src_anon_vma = NULL;
        struct mmu_notifier_range range;
        int err = 0;
@@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
                if (!src_folio)
                        folio = filemap_get_folio(swap_address_space(entry),
                                                  swap_cache_index(entry));
+               udelay(get_random_u32_below(1000));
+               tmp_folio = filemap_get_folio(swap_address_space(entry),
+                                             swap_cache_index(entry));
+               if (!IS_ERR_OR_NULL(tmp_folio)) {
+                       if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
+                               pr_err("UFFDIO_MOVE: UNSTABLE folio %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio, tmp_folio->swap.val);
+                       }
+                       folio_put(tmp_folio);
+               }
                if (!IS_ERR_OR_NULL(folio)) {
                        if (folio_test_large(folio)) {
                                err = -EBUSY;
@@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
                err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
                                    orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
                                    dst_ptl, src_ptl, src_folio);
+               if (tmp_folio != folio && !err)
+                       pr_err("UFFDIO_MOVE: UNSTABLE folio passed check: %lx -> %lx\n", folio, tmp_folio);
        }

And I saw the two prints getting triggered like this (not a real issue
though, it just helps to understand the problem):

...
[ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) -> fffffdffc7ccac80 (51)
[ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) -> fffffdffc3435e00 (3b)
[ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) -> fffffdffc34ab8c0 (76)
[ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) -> fffffdffc34ab8c0 (6d)
[ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) -> fffffdffc34eda80 (32)
[ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) -> fffffdffc38cedc0 (b)
[ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
[ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) -> fffffdffc3465cc0 (21)
...

The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) -> fffffdffc3853540
(53)" prints worried me at first. At first glance it seems the folio is
indeed freed completely from the swap cache after the first lookup, so
another swapout can reuse the entry. But as you mentioned,
__remove_mapping() won't release a folio if the refcount check fails, so
they must have been freed by folio_free_swap() or __try_to_reclaim_swap();
there are many places where that can happen. But these two helpers won't
free a folio from the swap cache if its swap count is not zero. And the
folio will either be swapped out (swap count non-zero) or mapped (freeing
it is fine: the PTE is then non-swap, and another swapout will still use
the same folio).

So after more investigation and dumping the pages, it's actually the
second lookup (tmp_folio) seeing the entry being reused by another page
table entry, after the first folio was swapped back in and released. So
the page table check below will simply fail, which is fine.
But this also proves that the first lookup can see a completely irrelevant
folio too: if the src folio is swapped out, then swapped back in and
freed, another folio B could shortly be added to the swap cache, reusing
the src folio's old swap entry. Folio B is then seen by the lookup here,
and later freed from the swap cache; the src folio then gets swapped out
again, also reusing the same entry. Now we have a problem: the PTE indeed
seems untouched, but we grabbed the wrong folio.

This seems possible, if I'm not wrong. Something like this:

CPU1                                     CPU2
move_pages_pte
  entry = pte_to_swp_entry(orig_src_pte);
  | Got swap entry S1 from src_pte
  ...                                    ...
  folio = swap_cache_get_folio(S1)
  | Got folio B here !!!
move_swap_pte
  | Holding a reference doesn't pin the cache
  | as we have demonstrated
  double_pt_lock
  is_pte_pages_stable
  | Passed because S1 is reused
  folio_move_anon_rmap(...)
  | Moved invalid folio B here !!!

But this is extremely hard to reproduce, even when trying deliberately...
So I think a "folio_swap_contains" or equivalent check here is a good
thing to have, to make this more robust and easier to understand. The
check after locking the folio has very little overhead and can definitely
ensure the folio's swap entry is valid and stable.

The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
print might seem problematic, but it's still not a real problem. That's
the case where a swapin of the src region happens after the lookup but
before taking the PTE lock. It will pass the PTE check without moving the
folio. But the folio is guaranteed to be a completely new folio here,
because a folio can't be added back to the page table without holding the
PTE lock, and if that happens, the following PTE check here will fail.

So I think we should patch the current kernel by only adding a
"folio_swap_contains" equivalent check here, and maybe more comments.
What do you think?
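
For reference, this is roughly what I have in mind -- an illustrative
sketch only (the name is a placeholder, and it assumes the swap cache
convention that folio->swap holds the entry of the folio's first page;
the folio must be locked so its swap state is stable):

/* Does @entry still belong to @folio? Call with @folio locked. */
static inline bool folio_swap_contains(struct folio *folio, swp_entry_t entry)
{
	pgoff_t offset = swp_offset(entry);

	if (!folio_test_swapcache(folio))
		return false;
	if (swp_type(folio->swap) != swp_type(entry))
		return false;
	return offset >= swp_offset(folio->swap) &&
	       offset < swp_offset(folio->swap) + folio_nr_pages(folio);
}

move_pages_pte() would then bail out with -EBUSY (as in the hunk quoted
at the top) when the locked src_folio no longer covers the entry read
from the src PTE.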