From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 27 May 2025 19:58:47 +1200
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
To: Kairui Song
Cc: akpm@linux-foundation.org, Baolin Wang, Baoquan He, Chris Li,
 David Hildenbrand, Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML,
 linux-mm, Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
 "Huang, Ying", Yosry Ahmed
References: <20250514201729.48420-6-ryncsn@gmail.com>
 <20250519043847.1806-1-21cnbao@gmail.com>
On Sat, May 24, 2025 at 8:01 AM Kairui Song wrote:
>
> On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, May 21, 2025 at 2:45 PM Kairui Song wrote:
> > >
> > > Barry Song <21cnbao@gmail.com> wrote on Wed, May 21, 2025 at 06:33:
> > > >
> > > > On Wed, May 21, 2025 at 7:10 AM Kairui Song wrote:
> > > > >
> > > > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song wrote:
> > > > > > >
> > > > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > From: Kairui Song
> > > > > > > > >
> > > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > > > --- a/mm/userfaultfd.c
> > > > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > > > >                                 goto retry;
> > > > > > > > >                         }
> > > > > > > > >                 }
> > > > > > > > > +               if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > > > +                       err = -EBUSY;
> > > > > > > > > +                       goto out;
> > > > > > > > > +               }
> > > > > > > >
> > > > > > > > It seems we don't need this. In move_swap_pte(), we have been
> > > > > > > > checking that the PTE pages are stable:
> > > > > > > >
> > > > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > > >                                  dst_pmd, dst_pmdval)) {
> > > > > > > >                 double_pt_unlock(dst_ptl, src_ptl);
> > > > > > > >                 return -EAGAIN;
> > > > > > > >         }
> > > > > > >
> > > > > > > The tricky part is that when swap_cache_get_folio() returns the
> > > > > > > folio, both the folio and the PTEs are unlocked. So is it possible
> > > > > > > that someone else swapped in the entries, then swapped them out
> > > > > > > again using the same entries?
> > > > > > >
> > > > > > > The folio will be different here, but the PTEs still hold the same
> > > > > > > value, so they will pass the is_pte_pages_stable() check; we
> > > > > > > previously saw similar races with anon fault or shmem. I think
> > > > > > > stricter checking won't hurt here.
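For reference, the check under discussion boils down to something like the
sketch below. This is hypothetical code rather than the actual patch: it
assumes the caller holds the folio lock, so the folio's swap cache state
cannot change underneath us, and that folio->swap holds the folio's first
swap entry, as in current mainline:

static inline bool folio_swap_contains(struct folio *folio, swp_entry_t entry)
{
        swp_entry_t folio_entry = folio->swap;
        long nr_pages = folio_nr_pages(folio);

        /* Not in the swap cache at all, so no entry can belong to it. */
        if (!folio_test_swapcache(folio))
                return false;
        /* Must be on the same swap device... */
        if (swp_type(entry) != swp_type(folio_entry))
                return false;
        /* ...and the offset must fall within the folio's swap slots. */
        return swp_offset(entry) >= swp_offset(folio_entry) &&
               swp_offset(entry) < swp_offset(folio_entry) + nr_pages;
}

If a racy lookup returned a folio whose entry has since been freed and
reused, a check along these lines would fail and the move could bail out
with -EBUSY, as in the hunk above.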
> > > > > >
> > > > > > This doesn't seem to be the same case as the one you fixed in
> > > > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > > > case, no one was hitting the swap cache, and you used swap_prepare()
> > > > > > to set up the cache to fix the issue.
> > > > > >
> > > > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > > > not guard against that case either.
> > > > >
> > > > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > > > above. Thanks for catching this!
> > > > >
> > > > > > But I suspect we won't have a problem, since we're not swapping in;
> > > > > > we didn't read any stale data, right? Swap-in will only occur after
> > > > > > we move the PTEs.
> > > > >
> > > > > My concern is that a parallel swapin / swapout could result in the
> > > > > folio being a completely irrelevant or invalid folio.
> > > > >
> > > > > It's not about the dst, but on the move src side, something like:
> > > > >
> > > > > CPU1                                CPU2
> > > > > move_pages_pte
> > > > >   folio = swap_cache_get_folio(...)
> > > > >   | Got folio A here
> > > > > move_swap_pte
> > > > >                                     (swap in the entry, then swap it
> > > > >                                      out again, reusing the same entry)
> > > > >   | Now folio A is no longer valid.
> > > > >   | It's very unlikely, but here SWAP
> > > > >   | could reuse the same entry as above.
> > > >
> > > > swap_cache_get_folio() does increment the folio's refcount, but it seems
> > > > this doesn't prevent do_swap_page() from freeing the swap entry after
> > > > swapping in src_pte with folio A, if it's a read fault.
> > > > For a write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > > > will be false:
> > > >
> > > > static inline bool should_try_to_free_swap(struct folio *folio,
> > > >                                            struct vm_area_struct *vma,
> > > >                                            unsigned int fault_flags)
> > > > {
> > > >         ...
> > > >
> > > >         /*
> > > >          * If we want to map a page that's in the swapcache writable, we
> > > >          * have to detect via the refcount if we're really the exclusive
> > > >          * user. Try freeing the swapcache to get rid of the swapcache
> > > >          * reference only in case it's likely that we'll be the exclusive user.
> > > >          */
> > > >         return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > > >                 folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > > > }
> > > >
> > > > And for swapout, __remove_mapping() does check the refcount as well:
> > > >
> > > > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > > >                             bool reclaimed, struct mem_cgroup *target_memcg)
> > > > {
> > > >         refcount = 1 + folio_nr_pages(folio);
> > > >         if (!folio_ref_freeze(folio, refcount))
> > > >                 goto cannot_free;
> > > > }
> > > >
> > > > However, since __remove_mapping() occurs after pageout(), it seems
> > > > this also doesn't prevent swapout from allocating a new swap entry to
> > > > fill src_pte.
> > > >
> > > > It seems your concern is valid, unless I'm missing something.
> > > > Do you have a reproducer? If so, this will likely need a separate fix
> > > > patch rather than being hidden in this patchset.
> > >
> > > Thanks for the analysis. I don't have a reproducer yet; I did some
> > > local experiments and it seems possible, but the race window is so
> > > tiny that it's very difficult to make the swap entry reuse collide
> > > with it. I'll try more, but in theory this seems possible, or at
> > > least it looks very fragile.
> > >
> > > And yeah, let's patch the kernel first if that's a real issue.
> > >
> > > > >   double_pt_lock
> > > > >   is_pte_pages_stable
> > > > >   | Passed because of entry reuse.
> > > > >   folio_move_anon_rmap(...)
> > > > >   | Moved invalid folio A.
> > > > >
> > > > > And could it be possible that swap_cache_get_folio() returns NULL
> > > > > here, but later, right before the double_pt_lock, a folio is added to
> > > > > the swap cache? Maybe we had better check the swap cache after clearing
> > > > > and releasing the dst lock, but before releasing the src lock?
> > > >
> > > > It seems you're suggesting that a parallel swap-in allocates and adds
> > > > a folio to the swap cache, but the PTE has not yet been updated from
> > > > a swap entry to a present mapping?
> > > >
> > > > As long as do_swap_page() adds the folio to the swap cache
> > > > before updating the PTE to present, this scenario seems possible.
> > >
> > > Yes, those are two kinds of problems here. I suspected there could be an
> > > ABA problem while working on the series, but wasn't certain. And I just
> > > realised there could be another missed cache read here, thanks to your
> > > review and discussion :)
> > >
> > > > It seems we need to double-check:
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index bc473ad21202..976053bd2bf1 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1102,8 +1102,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >         if (src_folio) {
> > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > +       } else {
> > > > +               struct folio *folio =
> > > > +                       filemap_get_folio(swap_address_space(entry),
> > > > +                                         swap_cache_index(entry));
> > > > +               if (!IS_ERR_OR_NULL(folio)) {
> > > > +                       folio_put(folio);
> > > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > > +                       return -EAGAIN;
> > > > +               }
> > > >         }
> > > > -
> > > >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > > >  #ifdef CONFIG_MEM_SOFT_DIRTY
> > > >         orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
> > >
> > > Maybe it has to get even dirtier here and call swapcache_prepare() too,
> > > to cover the SYNC_IO case?
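To make that idea concrete, the combined re-check could look roughly like
the sketch below. This is an assumption-laden illustration, not a tested
patch: the helper name is made up, and the swapcache_prepare() and
swapcache_clear() signatures with the extra nr argument are assumed from
recent mm-internal code and may differ on other trees:

/*
 * Hypothetical sketch: with dst_ptl/src_ptl held and the PTEs already
 * validated, reject the move if a concurrent swapin has claimed the
 * src entry, either via the swap cache or via the SWAP_HAS_CACHE
 * guard used by the SYNC IO swapin path.
 */
static bool src_swap_entry_idle(struct swap_info_struct *si, swp_entry_t entry)
{
        struct folio *folio;

        /* A folio appearing after the earlier lookup means a swapin is in flight. */
        folio = filemap_get_folio(swap_address_space(entry),
                                  swap_cache_index(entry));
        if (!IS_ERR_OR_NULL(folio)) {
                folio_put(folio);
                return false;
        }

        /* Assumed to fail if a parallel swapin already set SWAP_HAS_CACHE. */
        if (swapcache_prepare(entry, 1))
                return false;

        /* Only the check was wanted; drop the guard again right away. */
        swapcache_clear(si, entry, 1);
        return true;
}

If this returned false, move_swap_pte() would unlock and return -EAGAIN,
mirroring the existing is_pte_pages_stable() failure path.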
> > > > Let me run test case [1] to check whether this ever happens. I guess I
> > > > need to hack the kernel a bit to always add the folio to the swapcache,
> > > > even for SYNC IO.
> > >
> > > That will cause quite a performance regression, I think. The good thing
> > > is that this is exactly the problem this series solves, by dropping the
> > > SYNC IO swapin path and never bypassing the swap cache, while improving
> > > performance and eliminating things like this. One more reason to justify
> > > the approach :)
>
> Hi Barry,
>
> > I attempted to reproduce the scenario where a folio is added to the
> > swapcache after filemap_get_folio() returns NULL, but before
> > move_swap_pte() moves the swap PTE, using non-synchronized I/O.
> > Technically, this seems possible; however, I was unable to reproduce it,
> > likely because the time window between swapin_readahead() and taking the
> > page table lock within do_swap_page() is too short.
>
> Thank you so much for trying this!
>
> I have been trying to reproduce it too, and so far I haven't observed
> any crash or warning.
> I added the following debug code:
>
> static __always_inline
> bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>         pmd_t dummy_pmdval;
>         pmd_t dst_pmdval;
>         struct folio *src_folio = NULL;
> +       struct folio *tmp_folio = NULL;
>         struct anon_vma *src_anon_vma = NULL;
>         struct mmu_notifier_range range;
>         int err = 0;
> @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                 if (!src_folio)
>                         folio = filemap_get_folio(swap_address_space(entry),
>                                                   swap_cache_index(entry));
> +               udelay(get_random_u32_below(1000));
> +               tmp_folio = filemap_get_folio(swap_address_space(entry),
> +                                             swap_cache_index(entry));
> +               if (!IS_ERR_OR_NULL(tmp_folio)) {
> +                       if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> +                               pr_err("UFFDIO_MOVE: UNSTABLE folio %lx (%lx) -> %lx (%lx)\n",
> +                                      folio, folio->swap.val, tmp_folio, tmp_folio->swap.val);
> +                       }
> +                       folio_put(tmp_folio);
> +               }
>                 if (!IS_ERR_OR_NULL(folio)) {
>                         if (folio_test_large(folio)) {
>                                 err = -EBUSY;
> @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>                                     orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
>                                     dst_ptl, src_ptl, src_folio);
> +               if (tmp_folio != folio && !err)
> +                       pr_err("UFFDIO_MOVE: UNSTABLE folio passed check: %lx -> %lx\n",
> +                              folio, tmp_folio);
>         }
>
> And I saw these two prints getting triggered like this (not a real
> issue though, it just helps to understand the problem):
> ...
> [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) -> fffffdffc7ccac80 (51)
> [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) -> fffffdffc3435e00 (3b)
> [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) -> fffffdffc34ab8c0 (76)
> [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) -> fffffdffc34ab8c0 (6d)
> [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) -> fffffdffc34eda80 (32)
> [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) -> fffffdffc38cedc0 (b)
> [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) -> fffffdffc3465cc0 (21)
> ...
>
> The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> fffffdffc3853540 (53)" line worried me at first. On first look it seems
> the folio is indeed freed completely from the swap cache after the first
> lookup, so another swapout can reuse the entry. But as you mentioned,
> __remove_mapping() won't release a folio if the refcount check fails, so
> it must have been freed by folio_free_swap() or __try_to_reclaim_swap();
> there are many places where that can happen. But these two helpers won't
> free a folio from the swap cache if its swap count is not zero. And the
> folio will either be swapped out (swap count non-zero) or mapped (freeing
> it is fine, the PTE is non-swap, and another swapout will still use the
> same folio).
>
> So after more investigation and dumping the pages, it's actually the
> second lookup (tmp_folio) seeing the entry being reused by another
> page table entry, after the first folio was swapped back in and released.
> So the page table check below will always fail, which is fine.
>
> But this also proves the first lookup can see a completely irrelevant
> folio too: if the src folio is swapped out, but gets swapped back in and
> freed, then another folio B shortly afterwards gets added to the swap
> cache, reusing the src folio's old swap entry; then folio B is seen by
> the lookup here and freed from the swap cache; then the src folio is
> swapped out again, also reusing the same entry. Then we have a problem,
> as the PTE indeed seems untouched but we grabbed a wrong folio. Seems
> possible, if I'm not wrong.
>
> Something like this:
> CPU1                                  CPU2
> move_pages_pte
>   entry = pte_to_swp_entry(orig_src_pte);
>   | Got Swap Entry S1 from src_pte
>   ...
>                                       (src folio swapped in and freed;
>                                        folio B swapped out, reusing S1)

I'm assuming you mean the src folio here, since I'm not sure where folio B
comes from in that statement. If that assumption is correct, and folio A
is still in the swapcache, how could someone swap in folio B without
hitting folio A? That would suggest folio A must have been removed from
the swapcache earlier, right?

>
> folio = swap_cache_get_folio(S1)
>   | Got folio B here !!!
> move_swap_pte
>   | Holding a reference doesn't pin the cache,
>   | as we have demonstrated.
>                                       (folio B freed from the swap cache;
>                                        src folio swapped out again, reusing S1)
>   double_pt_lock
>   is_pte_pages_stable
>   | Passed because S1 is reused.
>   folio_move_anon_rmap(...)
>   | Moved invalid folio B here !!!
>
> But this is extremely hard to reproduce, even when trying deliberately...
>
> So I think a "folio_swap_contains" or equivalent check here is a good
> thing to have, to make this more robust and easier to understand. The
> check after locking a folio has very tiny overhead and can definitely
> ensure the folio's swap entry is valid and stable.
>
> The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> line might seem problematic, but it's still not a real problem. That's
> the case where the swapin in the src region happens after the lookup and
> before the PTE lock. It will pass the PTE check without moving the
> folio. But the folio is guaranteed to be a completely new folio here,
> because a folio can't be added back to the page table without holding
> the PTE lock, and if that happens, the following PTE check will fail.
>
> So I think we should patch the current kernel by only adding a
> "folio_swap_contains" equivalent check here, and maybe more comments.
> What do you think?

The description appears to have some inconsistencies. Would you mind
rephrasing it?

Thanks
Barry