From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 10258C54F30
	for <linux-mm@archiver.kernel.org>; Tue, 27 May 2025 07:59:13 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id A2EF46B0085; Tue, 27 May 2025 03:59:12 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id A05F56B008A; Tue, 27 May 2025 03:59:12 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 8F4F96B008C; Tue, 27 May 2025 03:59:12 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id 6BD246B0085
	for <linux-mm@kvack.org>; Tue, 27 May 2025 03:59:12 -0400 (EDT)
Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay07.hostedemail.com (Postfix) with ESMTP id 1C84616026E
	for <linux-mm@kvack.org>; Tue, 27 May 2025 07:59:12 +0000 (UTC)
X-FDA: 83487937344.13.537C44C
Received: from mail-qk1-f173.google.com (mail-qk1-f173.google.com [209.85.222.173])
	by imf27.hostedemail.com (Postfix) with ESMTP id 68CDD40002
	for <linux-mm@kvack.org>; Tue, 27 May 2025 07:59:10 +0000 (UTC)
Authentication-Results: imf27.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=Qr4sbksd;
	spf=pass (imf27.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.173 as permitted sender) smtp.mailfrom=21cnbao@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1748332750;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=hplfmEAI0ObABv6iennLOuBJUZpDNYU6guT+clGZ648=;
	b=NogH4d+QTau3CnCpgo+Y5uPLvQbZ5mkCFyaNIdh+zJKUWb3si/LE8pQlaYX6VHujyx7+u1
	NBpzncBwY+RdyPJ5nRKzoX2w3C6wZbTWkFZKA5ljXV7rGiinqZ3oklWJJlI9uADfHdyduh
	DYew5wXLNp7/7NB/Wf7fuzRijF/JsLE=
ARC-Authentication-Results: i=1;
	imf27.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=Qr4sbksd;
	spf=pass (imf27.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.173 as permitted sender) smtp.mailfrom=21cnbao@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748332750; a=rsa-sha256;
	cv=none;
	b=8OqkqWOF9Z2Os+jsNlOg8xtYakgcG2i5ZOXTerjwyqTwt+9oftK4Ig154bL9zjemQd0Zzw
	A7svPzhL1LQVmhONqsdQ1VCdi9JewMYv3SBQmWAKqzX4D656OvUPV+NCHP6RceRU71ZxzX
	PWsiE56fRvyYm73sh5SxO55tZk7Ljk4=
Received: by mail-qk1-f173.google.com with SMTP id af79cd13be357-7c9376c4dbaso294562685a.0
        for <linux-mm@kvack.org>; Tue, 27 May 2025 00:59:10 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1748332749; x=1748937549; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=hplfmEAI0ObABv6iennLOuBJUZpDNYU6guT+clGZ648=;
        b=Qr4sbksdF90TJKDU1CYFKV3Y+BTxZXfVAfZXl48FZXbYaQct3aDbkwJXBfltgOvOOG
         AjMVVfJOIWCbU0Pj/UPwOTdPaGyy25P2Xnf9btUMl58Pj4uNXnfd0pjpVQ8PQhNdynZW
         16Gg4cME73sokXkZKxvNsGITSq6JU9BECFDowDlq4xM1tr66vwyStF2Vxb1JgAGB3qbq
         y79uWzZ2oERZMLh39GT1r+dfY2Q3lDK+Hdm0Al5gqfE8/uvwaqdgfDczpt28DbvSvmco
         HkQlbLl/RYwiIdGvYK7agW9BkEYWRAexPPbrKxIlrCm6AtP/zefHc5nRArcEQI8zHzP7
         MQOg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1748332749; x=1748937549;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=hplfmEAI0ObABv6iennLOuBJUZpDNYU6guT+clGZ648=;
        b=CWTRVgg8pkR02hCJhyC+LmmFyGe+P3F1lwFxNAMPONc07ZxJMulQRfs30SBfYeudDI
         dKNGooFPTBoIteX1PO6rIIhO1K0KTpo6B/wEjyDmwb8iEJAiFM49zTHfTt9IIr9hTgyQ
         L94Ns4KcSrG71ixVVa2/oPT8C1T70TOQYNAkEGXOCzXqcwdC5JP7+e6rwOgH2swP+Ls7
         vid+pDxL5WUCUMSMv6Xn90KclFl8YNYsLLYRSDeaRx4yUKUXu+3pnihVjBhv0F60uM2L
         WM89W9SWSRBBJIlQlhd6CgTZXrbepLVr14M/cCGlf1EiUXozbBYX4WInzW2AXO6vOwjs
         aOLA==
X-Forwarded-Encrypted: i=1; AJvYcCVwunaoObfGVOx759x+AdgrEJZRnXtfee0OptlI57f9DWapUean9SLo3eE7xTqsc2CCiLY+ldTfSQ==@kvack.org
X-Gm-Message-State: AOJu0Yz91LwvbIwcXszUuuw4ndYCoDFm38K4BOHcYwe8wtUhrtdbgwQp
	juBXo68/HKdjVqbnMSSxKPPry7AZ/WgAGs5fcwoQ9djXNIRar8uprxt70qWNYhAz+RyAktQ5zB/
	u94Y6Trh2cdzeFd1bAVSuF9tnlpYb9BXWDBAx
X-Gm-Gg: ASbGncvqvnLE+XUJRumm2HrdWWTcvY1GGeAJVKlm/+qxDx+ve8EN4dGrkYkzvUzmrDq
	C0YEIWhF1PzYOZcS5aSoUPCFWjGz7AxV1byi5QMl+v1y66mGw0sM7Kv4gFbmYP3flZkkTN9il40
	Vl994WB/1B/0sG5fhi0Cyq2pWUuya7OEsyjw==
X-Google-Smtp-Source: AGHT+IGKzpgcFn4eQbWsqR5WJWfumDy1Xr9B/icnNkP390i2UgIIEEB1AFM6EhiUFaVHt+jC+iU0a1EqCrY90kksQjg=
X-Received: by 2002:a05:6102:dd2:b0:4dd:ab6c:7654 with SMTP id
 ada2fe7eead31-4e423e601e5mr9168751137.8.1748332738801; Tue, 27 May 2025
 00:58:58 -0700 (PDT)
MIME-Version: 1.0
References: <20250514201729.48420-6-ryncsn@gmail.com> <20250519043847.1806-1-21cnbao@gmail.com>
 <CAMgjq7BpfueOn9ms8apRX-6dF8rZGtbC=MuZzSD7hbZxtw=Kdg@mail.gmail.com>
 <CAGsJ_4wC5_YSMLNoY5q4hUsZTpD+YPHSBtzCAdWRFH65EJA_iw@mail.gmail.com>
 <CAMgjq7AO__8TFE8ibwQswWmmf4tTGg2NBEJp0aEn32vN+Dy8uw@mail.gmail.com>
 <CAGsJ_4z1cJfOCcpZDt4EuHK7+SON1r0ptRJNv1h=cDv+eOcdSQ@mail.gmail.com>
 <CAMgjq7CJ4_9bB=46TVzByFRuOwxNs4da=sN==x8cc++YsV+ETQ@mail.gmail.com>
 <CAGsJ_4wo6u1WSXdzj8RUUDNdk5_YCfLV_mcJtvhiv2UonXw+nw@mail.gmail.com> <CAMgjq7Bc0-eXZ8G=bN8bo2NG1ndtPmCUvxCi0bdM+HdqmOjaPQ@mail.gmail.com>
In-Reply-To: <CAMgjq7Bc0-eXZ8G=bN8bo2NG1ndtPmCUvxCi0bdM+HdqmOjaPQ@mail.gmail.com>
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 27 May 2025 19:58:47 +1200
X-Gm-Features: AX0GCFtSjRdfbo_re4Zx58GpTYN0U7VxlXrr-1Kmf-HNfqbkzeyTydFvtPqllfU
Message-ID: <CAGsJ_4ymRwXhQdzabstHhkK0OM0JEWtvR3tjeyQppm7sKZ8FUw@mail.gmail.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
To: Kairui Song <ryncsn@gmail.com>
Cc: akpm@linux-foundation.org, Baolin Wang <baolin.wang@linux.alibaba.com>, 
	Baoquan He <bhe@redhat.com>, Chris Li <chrisl@kernel.org>, David Hildenbrand <david@redhat.com>, 
	Johannes Weiner <hannes@cmpxchg.org>, Hugh Dickins <hughd@google.com>, 
	Kalesh Singh <kaleshsingh@google.com>, LKML <linux-kernel@vger.kernel.org>, 
	linux-mm <linux-mm@kvack.org>, Nhat Pham <nphamcs@gmail.com>, 
	Ryan Roberts <ryan.roberts@arm.com>, Kemeng Shi <shikemeng@huaweicloud.com>, 
	Tim Chen <tim.c.chen@linux.intel.com>, Matthew Wilcox <willy@infradead.org>, 
	"Huang, Ying" <ying.huang@linux.alibaba.com>, Yosry Ahmed <yosryahmed@google.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Rspamd-Server: rspam03
X-Rspamd-Queue-Id: 68CDD40002
X-Stat-Signature: td9wuwpztpkcoy3mk8sxy5x19gci13ts
X-Rspam-User: 
X-HE-Tag: 1748332750-532163
X-HE-Meta: U2FsdGVkX1+BtFoCDS3pkJNQd8xucJlrsFH5hV90QYZZWsKgJBj4v7ik5YhwmB8LnamP2wR8MghWNfU15BeWmkXKpRnKzrGJ377HJsLmFaTGSzQecnSsAgCbl4engBfyum6FDVSxXuh4ACjNZL1aucehz76klEzNjhn81Mh+H3JPUd4hcr4cqeSiZGPQ7rcUQkc4VLd6Ep6Nb4rBvPb8LtzB+9cXOA9v8VIli/vr39VOEKcZCTo/4Ov/xM2MWaRIAQMs15sbQUAjpc9ovn9ckQmrFioVsiG0h8lIZmyR5/IttJ+uv+LQyGXAFco04yEatrfLvdDaMI3lYjNmbUtF7V6+ZBR+SPLwl+++352sv/eQotuoEnpFgwVTZZvdVAhCx8A5T0nGyZM/lNP6GT9+Te+TxWZyXys0UnpF42KPhO0EpGc9s/21i6TOCTZ8rcgjqBeNMQPbbT03yO9NA2rGLF+7TUHIvCmaitt1N0CMH7fmK7tOte1m3AGZpv4ebyYloAH5P6Z2I6ilHYGMH0w4KbVhK1Fs5ufVY1fPjcWxqh7vTmBIqbei+wJ/L609jxNqJ9XUQ+wGZ/hH0nJlGVoDvgN4d6Ggy+K5IyQbK2vwjzWg48Iel+c3rZsV8nz4LYhGuizszl/d6UG+FpTfoHamg8G0czQ3FTKAU3A1qLgDKmxkfPQgupDHJ7nFniGnBLlGociwNMVpEdZnae2JbpHIDB7ew3p8+NDRTofX87xDR0q2fdn4NbEGp8LF2G+Y2HmDV5ncoF14x8+0e4OXyI95qi3Dm9UacoZp/1/Wx9xMFMVgZtgKAtoR0E9e6M2pQ3PbOF+diLC/01HRGQQGg0aN1xb62MMQCiDG+TdVDYrVdU93tyxiJmEb2YF65TrCTaZ1OROP9ZBX6uN7ZyJvGoYrp8UZMMKjdqhEk1Uyuv8DvAP8mBEA0A6ZFHaPPGgA4jhVV9Zi/PJkWh0=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Sat, May 24, 2025 at 8:01=E2=80=AFAM Kairui Song <ryncsn@gmail.com> wrot=
e:
>
> On Fri, May 23, 2025 at 10:30=E2=80=AFAM Barry Song <21cnbao@gmail.com> w=
rote:
> >
> > On Wed, May 21, 2025 at 2:45=E2=80=AFPM Kairui Song <ryncsn@gmail.com> =
wrote:
> > >
> > > Barry Song <21cnbao@gmail.com> =E4=BA=8E 2025=E5=B9=B45=E6=9C=8821=E6=
=97=A5=E5=91=A8=E4=B8=89 06:33=E5=86=99=E9=81=93=EF=BC=9A
> > > >
> > > > On Wed, May 21, 2025 at 7:10=E2=80=AFAM Kairui Song <ryncsn@gmail.c=
om> wrote:
> > > > >
> > > > > On Tue, May 20, 2025 at 12:41=E2=80=AFPM Barry Song <21cnbao@gmai=
l.com> wrote:
> > > > > >
> > > > > > On Tue, May 20, 2025 at 3:31=E2=80=AFPM Kairui Song <ryncsn@gma=
il.com> wrote:
> > > > > > >
> > > > > > > On Mon, May 19, 2025 at 12:38=E2=80=AFPM Barry Song <21cnbao@=
gmail.com> wrote:
> > > > > > > >
> > > > > > > > > From: Kairui Song <kasong@tencent.com>
> > > > > > > >
> > > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > > > --- a/mm/userfaultfd.c
> > > > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct m=
m_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > > > >                               goto retry;
> > > > > > > > >                       }
> > > > > > > > >               }
> > > > > > > > > +             if (!folio_swap_contains(src_folio, entry))=
 {
> > > > > > > > > +                     err =3D -EBUSY;
> > > > > > > > > +                     goto out;
> > > > > > > > > +             }
> > > > > > > >
> > > > > > > > It seems we don't need this. In move_swap_pte(), we have be=
en checking pte pages
> > > > > > > > are stable:
> > > > > > > >
> > > > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst=
_pte, orig_src_pte,
> > > > > > > >                                  dst_pmd, dst_pmdval)) {
> > > > > > > >                 double_pt_unlock(dst_ptl, src_ptl);
> > > > > > > >                 return -EAGAIN;
> > > > > > > >         }
> > > > > > >
> > > > > > > The tricky part is when swap_cache_get_folio returns the foli=
o, both
> > > > > > > folio and ptes are unlocked. So is it possible that someone e=
lse
> > > > > > > swapped in the entries, then swapped them out again using the=
 same
> > > > > > > entries?
> > > > > > >
> > > > > > > The folio will be different here but PTEs are still the same =
value to
> > > > > > > they will pass the is_pte_pages_stable check, we previously s=
aw
> > > > > > > similar races with anon fault or shmem. I think more strict c=
hecking
> > > > > > > won't hurt here.
> > > > > >
> > > > > > This doesn't seem to be the same case as the one you fixed in
> > > > > > do_swap_page(). Here, we're hitting the swap cache, whereas in =
that
> > > > > > case, there was no one hitting the swap cache, and you used
> > > > > > swap_prepare() to set up the cache to fix the issue.
> > > > > >
> > > > > > By the way, if we're not hitting the swap cache, src_folio will=
 be
> > > > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry)=
 does
> > > > > > not guard against that case either.
> > > > >
> > > > > Ah, that's true, it should be moved inside the if (folio) {...} b=
lock
> > > > > above. Thanks for catching this!
> > > > >
> > > > > > But I suspect we won't have a problem, since we're not swapping=
 in =E2=80=94
> > > > > > we didn't read any stale data, right? Swap-in will only occur a=
fter we
> > > > > > move the PTEs.
> > > > >
> > > > > My concern is that a parallel swapin / swapout could result in th=
e
> > > > > folio to be a completely irrelevant or invalid folio.
> > > > >
> > > > > It's not about the dst, but in the move src side, something like:
> > > > >
> > > > > CPU1                             CPU2
> > > > > move_pages_pte
> > > > >   folio =3D swap_cache_get_folio(...)
> > > > >     | Got folio A here
> > > > >   move_swap_pte
> > > > >                                  <swapin src_pte, using folio A>
> > > > >                                  <swapout src_pte, put folio A>
> > > > >                                    | Now folio A is no longer val=
id.
> > > > >                                    | It's very unlikely but here =
SWAP
> > > > >                                    | could reuse the same entry a=
s above.
> > > >
> > > >
> > > > swap_cache_get_folio() does increment the folio's refcount, but it =
seems this
> > > > doesn't prevent do_swap_page() from freeing the swap entry after sw=
apping
> > > > in src_pte with folio A, if it's a read fault.
> > > > for write fault, folio_ref_count(folio) =3D=3D (1 + folio_nr_pages(=
folio))
> > > > will be false:
> > > >
> > > > static inline bool should_try_to_free_swap(struct folio *folio,
> > > >                                            struct vm_area_struct *v=
ma,
> > > >                                            unsigned int fault_flags=
)
> > > > {
> > > >        ...
> > > >
> > > >         /*
> > > >          * If we want to map a page that's in the swapcache writabl=
e, we
> > > >          * have to detect via the refcount if we're really the excl=
usive
> > > >          * user. Try freeing the swapcache to get rid of the swapca=
che
> > > >          * reference only in case it's likely that we'll be the exl=
usive user.
> > > >          */
> > > >         return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(=
folio) &&
> > > >                 folio_ref_count(folio) =3D=3D (1 + folio_nr_pages(f=
olio));
> > > > }
> > > >
> > > > and for swapout, __removing_mapping does check refcount as well:
> > > >
> > > > static int __remove_mapping(struct address_space *mapping, struct f=
olio *folio,
> > > >                             bool reclaimed, struct mem_cgroup *targ=
et_memcg)
> > > > {
> > > >         refcount =3D 1 + folio_nr_pages(folio);
> > > >         if (!folio_ref_freeze(folio, refcount))
> > > >                 goto cannot_free;
> > > >
> > > > }
> > > >
> > > > However, since __remove_mapping() occurs after pageout(), it seems
> > > > this also doesn't prevent swapout from allocating a new swap entry =
to
> > > > fill src_pte.
> > > >
> > > > It seems your concern is valid=E2=80=94unless I'm missing something=
.
> > > > Do you have a reproducer? If so, this will likely need a separate f=
ix
> > > > patch rather than being hidden in this patchset.
> > >
> > > Thanks for the analysis. I don't have a reproducer yet, I did some
> > > local experiments and that seems possible, but the race window is so
> > > tiny and it's very difficult to make the swap entry reuse to collide
> > > with that, I'll try more but in theory this seems possible, or at
> > > least looks very fragile.
> > >
> > > And yeah, let's patch the kernel first if that's a real issue.
> > >
> > > >
> > > > >     double_pt_lock
> > > > >     is_pte_pages_stable
> > > > >       | Passed because of entry reuse.
> > > > >     folio_move_anon_rmap(...)
> > > > >       | Moved invalid folio A.
> > > > >
> > > > > And could it be possible that the swap_cache_get_folio returns NU=
LL
> > > > > here, but later right before the double_pt_lock, a folio is added=
 to
> > > > > swap cache? Maybe we better check the swap cache after clear and
> > > > > releasing dst lock, but before releasing src lock?
> > > >
> > > > It seems you're suggesting that a parallel swap-in allocates and ad=
ds
> > > > a folio to the swap cache, but the PTE has not yet been updated fro=
m
> > > > a swap entry to a present mapping?
> > > >
> > > > As long as do_swap_page() adds the folio to the swap cache
> > > > before updating the PTE to present, this scenario seems possible.
> > >
> > > Yes, that's two kinds of problems here. I suspected there could be an
> > > ABA problem while working on the series, but wasn't certain. And just
> > > realised there could be another missed cache read here thanks to your
> > > review and discussion :)
> > >
> > > >
> > > > It seems we need to double-check:
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index bc473ad21202..976053bd2bf1 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *m=
m,
> > > > struct vm_area_struct *dst_vma,
> > > >         if (src_folio) {
> > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > >                 src_folio->index =3D linear_page_index(dst_vma, dst=
_addr);
> > > > +       } else {
> > > > +               struct folio *folio =3D
> > > > filemap_get_folio(swap_address_space(entry),
> > > > +                                       swap_cache_index(entry));
> > > > +               if (!IS_ERR_OR_NULL(folio)) {
> > > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > > +                       return -EAGAIN;
> > > > +               }
> > > >         }
> > > > -
> > > >         orig_src_pte =3D ptep_get_and_clear(mm, src_addr, src_pte);
> > > >  #ifdef CONFIG_MEM_SOFT_DIRTY
> > > >         orig_src_pte =3D pte_swp_mksoft_dirty(orig_src_pte);
> > >
> > > Maybe it has to get even dirtier here to call swapcache_prepare too t=
o
> > > cover the SYNC_IO case?
> > >
> > > >
> > > > Let me run test case [1] to check whether this ever happens. I gues=
s I need to
> > > > hack kernel a bit to always add folio to swapcache even for SYNC IO=
.
> > >
> > > That will cause quite a performance regression I think. Good thing is=
,
> > > that's exactly the problem this series is solving by dropping the SYN=
C
> > > IO swapin path and never bypassing the swap cache, while improving th=
e
> > > performance, eliminating things like this. One more reason to justify
> > > the approach :)
>
> Hi Barry,
>
> >
> > I attempted to reproduce the scenario where a folio is added to the swa=
pcache
> > after filemap_get_folio() returns NULL but before move_swap_pte()
> > moves the swap PTE
> > using non-synchronized I/O. Technically, this seems possible; however,
> > I was unable
> > to reproduce it, likely because the time window between swapin_readahea=
d and
> > taking the page table lock within do_swap_page() is too short.
>
> Thank you so much for trying this!
>
> I have been trying to reproduce it too, and so far I didn't observe
> any crash or warn. I added following debug code:
>
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_=
end)
> @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>         pmd_t dummy_pmdval;
>         pmd_t dst_pmdval;
>         struct folio *src_folio =3D NULL;
> +       struct folio *tmp_folio =3D NULL;
>         struct anon_vma *src_anon_vma =3D NULL;
>         struct mmu_notifier_range range;
>         int err =3D 0;
> @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>                 if (!src_folio)
>                         folio =3D filemap_get_folio(swap_address_space(en=
try),
>                                         swap_cache_index(entry));
> +               udelay(get_random_u32_below(1000));
> +               tmp_folio =3D filemap_get_folio(swap_address_space(entry)=
,
> +                                       swap_cache_index(entry));
> +               if (!IS_ERR_OR_NULL(tmp_folio)) {
> +                       if (!IS_ERR_OR_NULL(folio) && tmp_folio !=3D foli=
o) {
> +                               pr_err("UFFDIO_MOVE: UNSTABLE folio
> %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> tmp_folio->swap.val);
> +                       }
> +                       folio_put(tmp_folio);
> +               }
>                 if (!IS_ERR_OR_NULL(folio)) {
>                         if (folio_test_large(folio)) {
>                                 err =3D -EBUSY;
> @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>                 err =3D move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> dst_pte, src_pte,
>                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_=
pmdval,
>                                 dst_ptl, src_ptl, src_folio);
> +               if (tmp_folio !=3D folio && !err)
> +                       pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> check: %lx -> %lx\n", folio, tmp_folio);
>         }
>
> And I saw these two prints are getting triggered like this (not a real
> issue though, just help to understand the problem)
> ...
> [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> fffffdffc7ccac80 (51)
> [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> fffffdffc3435e00 (3b)
> [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> fffffdffc34ab8c0 (76)
> [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> fffffdffc34ab8c0 (6d)
> [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> fffffdffc34eda80 (32)
> [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> fffffdffc38cedc0 (b)
> [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34=
db140
> [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> fffffdffc3465cc0 (21)
> ...
>
> The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> fffffdffc3853540 (53)" worries me at first. On first look it seems the
> folio is indeed freed completely from the swap cache after the first
> lookup, so another swapout can reuse the entry. But as you mentioned
> __remove_mapping won't release a folio if the refcount check fails, so
> they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> are many places that can happen. But these two helpers won't free a
> folio from swap cache if its swap count is not zero. And the folio
> will either be swapped out (swap count non zero), or mapped (freeing
> it is fine, PTE is non_swap, and another swapout will still use the
> same folio).
>
> So after more investigation and dumping the pages, it's actually the
> second lookup (tmp_folio) seeing the entry being reused by another
> page table entry, after the first folio is swapped back and released.
> So the page table check below will always fail just fine.
>
> But this also proves the first look up can see a completely irrelevant
> folio too: If the src folio is swapped out, but got swapped back and
> freed, then another folio B shortly got added to swap cache reuse the
> src folio's old swap entry, then the folio B got seen by the look up
> here and get freed from swap cache, then src folio got swapped out
> again also reusing the same entry, then we have a problem as PTE seems
> untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> not wrong:
>
> Something like this:
> CPU1                             CPU2
> move_pages_pte
>   entry =3D pte_to_swp_entry(orig_src_pte);
>     | Got Swap Entry S1 from src_pte
>   ...
>                                  <swapin src_pte, using folio A>

I=E2=80=99m assuming you mean `<swapin src_pte, using folio B>`, since I=E2=
=80=99m not
sure where folio B comes from in the statement `<someone else tried to
swap out folio B>`.

If that assumption is correct, and folio A is still in the swapcache,
how could someone swap in folio B without hitting folio A? That would
suggest folio A must have been removed from the swapcache earlier=E2=80=94r=
ight?

>                                  <free folio A from swap cache freeing S1=
>
>                                  <someone else try swap out folio B >
>                                  <put folio B to swap cache using S1 >
>                                 ...
>   folio =3D swap_cache_get_folio(S1)
>     | Got folio B here !!!
>   move_swap_pte
>                                  <free folio B from swap cache>
>                                    | Holding a reference doesn't pin the =
cache
>                                    | as we have demonstrated
>                                  <Swapout folio A also using S1>
>     double_pt_lock
>     is_pte_pages_stable
>       | Passed because of S1 is reused
>     folio_move_anon_rmap(...)
>       | Moved invalid folio B here !!!
>
> But this is extremely hard to reproduce though, even if doing it
> deliberately...
>
> So I think a "folio_swap_contains" or equivalent check here is a good
> thing to have, to make it more robust and easier to understand. The
> checking after locking a folio has very tiny overhead and can
> definitely ensure the folio's swap entry is valid and stable.
>
> The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> here might seem problematic, but it's still not a real problem. That's
> the case where the swapin in src region happens after the lookup, and
> before the PTE lock. It will pass the PTE check without moving the
> folio. But the folio is guaranteed to be a completely new folio here
> because the folio can't be added back to the page table without
> holding the PTE lock, and if that happens the following PTE check here
> will fail.
>
> So I think we should patch the current kernel only adding a
> "folio_swap_contains" equivalent check here, and maybe more comments,
> how do you think?

The description appears to have some inconsistencies.
Would you mind rephrasing it?

Thanks
barry