linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Kairui Song <ryncsn@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	 Baoquan He <bhe@redhat.com>, Chris Li <chrisl@kernel.org>,
	David Hildenbrand <david@redhat.com>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>,
	 Kalesh Singh <kaleshsingh@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	 linux-mm <linux-mm@kvack.org>, Nhat Pham <nphamcs@gmail.com>,
	 Ryan Roberts <ryan.roberts@arm.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	 Tim Chen <tim.c.chen@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	 "Huang, Ying" <ying.huang@linux.alibaba.com>,
	Yosry Ahmed <yosryahmed@google.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
Date: Tue, 27 May 2025 23:11:28 +0800	[thread overview]
Message-ID: <CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com> (raw)
In-Reply-To: <CAGsJ_4ymRwXhQdzabstHhkK0OM0JEWtvR3tjeyQppm7sKZ8FUw@mail.gmail.com>

On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > >
> > > > That will cause quite a performance regression I think. Good thing is,
> > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > performance, eliminating things like this. One more reason to justify
> > > > the approach :)
> >
> > Hi Barry,
> >
> > >
> > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > moves the swap PTE
> > > using non-synchronized I/O. Technically, this seems possible; however,
> > > I was unable
> > > to reproduce it, likely because the time window between swapin_readahead and
> > > taking the page table lock within do_swap_page() is too short.
> >
> > Thank you so much for trying this!
> >
> > I have been trying to reproduce it too, and so far I didn't observe
> > any crash or warn. I added following debug code:
> >
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> >         pmd_t dummy_pmdval;
> >         pmd_t dst_pmdval;
> >         struct folio *src_folio = NULL;
> > +       struct folio *tmp_folio = NULL;
> >         struct anon_vma *src_anon_vma = NULL;
> >         struct mmu_notifier_range range;
> >         int err = 0;
> > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> >                 if (!src_folio)
> >                         folio = filemap_get_folio(swap_address_space(entry),
> >                                         swap_cache_index(entry));
> > +               udelay(get_random_u32_below(1000));
> > +               tmp_folio = filemap_get_folio(swap_address_space(entry),
> > +                                       swap_cache_index(entry));
> > +               if (!IS_ERR_OR_NULL(tmp_folio)) {
> > +                       if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > +                               pr_err("UFFDIO_MOVE: UNSTABLE folio
> > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > tmp_folio->swap.val);
> > +                       }
> > +                       folio_put(tmp_folio);
> > +               }
> >                 if (!IS_ERR_OR_NULL(folio)) {
> >                         if (folio_test_large(folio)) {
> >                                 err = -EBUSY;
> > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > dst_pte, src_pte,
> >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> >                                 dst_ptl, src_ptl, src_folio);
> > +               if (tmp_folio != folio && !err)
> > +                       pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > check: %lx -> %lx\n", folio, tmp_folio);
> >         }
> >
> > And I saw these two prints are getting triggered like this (not a real
> > issue though, just help to understand the problem)
> > ...
> > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > fffffdffc7ccac80 (51)
> > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > fffffdffc3435e00 (3b)
> > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > fffffdffc34ab8c0 (76)
> > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > fffffdffc34ab8c0 (6d)
> > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > fffffdffc34eda80 (32)
> > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > fffffdffc38cedc0 (b)
> > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > fffffdffc3465cc0 (21)
> > ...
> >
> > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > folio is indeed freed completely from the swap cache after the first
> > lookup, so another swapout can reuse the entry. But as you mentioned
> > __remove_mapping won't release a folio if the refcount check fails, so
> > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > are many places that can happen. But these two helpers won't free a
> > folio from swap cache if its swap count is not zero. And the folio
> > will either be swapped out (swap count non zero), or mapped (freeing
> > it is fine, PTE is non_swap, and another swapout will still use the
> > same folio).
> >
> > So after more investigation and dumping the pages, it's actually the
> > second lookup (tmp_folio) seeing the entry being reused by another
> > page table entry, after the first folio is swapped back and released.
> > So the page table check below will always fail just fine.
> >
> > But this also proves the first look up can see a completely irrelevant
> > folio too: If the src folio is swapped out, but got swapped back and
> > freed, then another folio B shortly got added to swap cache reuse the
> > src folio's old swap entry, then the folio B got seen by the look up
> > here and get freed from swap cache, then src folio got swapped out
> > again also reusing the same entry, then we have a problem as PTE seems
> > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > not wrong:
> >
> > Something like this:
> > CPU1                             CPU2
> > move_pages_pte
> >   entry = pte_to_swp_entry(orig_src_pte);
> >     | Got Swap Entry S1 from src_pte
> >   ...
> >                                  <swapin src_pte, using folio A>
>
> I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> sure where folio B comes from in the statement `<someone else tried to
> swap out folio B>`.
>
> If that assumption is correct, and folio A is still in the swapcache,
> how could someone swap in folio B without hitting folio A? That would
> suggest folio A must have been removed from the swapcache earlier—right?
>
> >                                  <free folio A from swap cache freeing S1>
> >                                  <someone else try swap out folio B >

Sorry my bad, I think I made people think folio B is related to
src_pte at this point. What I actually mean is that: Another random
folio B, unrelated to src_pte, could got swapped out, and using the
swap entry S1.

> >                                  <put folio B to swap cache using S1 >
> >                                 ...
> >   folio = swap_cache_get_folio(S1)
> >     | Got folio B here !!!
> >   move_swap_pte
> >                                  <free folio B from swap cache>
> >                                    | Holding a reference doesn't pin the cache
> >                                    | as we have demonstrated
> >                                  <Swapout folio A also using S1>
> >     double_pt_lock
> >     is_pte_pages_stable
> >       | Passed because of S1 is reused
> >     folio_move_anon_rmap(...)
> >       | Moved invalid folio B here !!!
> >
> > But this is extremely hard to reproduce though, even if doing it
> > deliberately...
> >
> > So I think a "folio_swap_contains" or equivalent check here is a good
> > thing to have, to make it more robust and easier to understand. The
> > checking after locking a folio has very tiny overhead and can
> > definitely ensure the folio's swap entry is valid and stable.
> >
> > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > here might seem problematic, but it's still not a real problem. That's
> > the case where the swapin in src region happens after the lookup, and
> > before the PTE lock. It will pass the PTE check without moving the
> > folio. But the folio is guaranteed to be a completely new folio here
> > because the folio can't be added back to the page table without
> > holding the PTE lock, and if that happens the following PTE check here
> > will fail.
> >
> > So I think we should patch the current kernel only adding a
> > "folio_swap_contains" equivalent check here, and maybe more comments,
> > how do you think?
>
> The description appears to have some inconsistencies.
> Would you mind rephrasing it?

Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
fffffdffc385fb00" part first, as both you and me have come into a
conclusion that "filemap_get_folio() returns NULL before
move_swap_pte, but a folio was added to swap cache" is OK, and this
output only proves that happens.

So the problematic race is:

Here move_pages_pte is moving src_pte to dst_pte, and it begins with
src_pte == swap entry S1, and S1 isn't cached.

CPU1                             CPU2
move_pages_pte()
  entry = pte_to_swp_entry(orig_src_pte);
    | src_pte is absent, and got entry == S1
  ... < Somehow interrupted> ...
                                 <swapin src_pte, using folio A>
                                   | folio A is just a new allocated folio
                                   | for resolving the swap in fault.
                                 <free folio A from swap cache freeing S1>
                                   | swap in fault is resolved, src_pte
                                   | now points to folio A, so folio A
                                   | can get freed just fine.
                                   | And now S1 is free to be used
                                   | by anyone.
                                 <someone else try swap out another folio B >
                                   | Folio B is a completely unrelated
                                   | folio swapped out by random process.
                                   | (has nothing to do with src_pte)
                                   | But S1 is freed so it may use S1
                                   | as its swap entry.
                                 <put folio B to swap cache with index S1 >
                                 ...
  folio = filemap_get_folio(S1)
    | The lookup is using S1, so it
    | got folio B here !!!
  ... < Somehow interrupted> ...
                                 <free folio B from swap cache>
                                   | Folio B could fail to be swapped out
                                   | or got swapped in again, so it can
                                   | be freed by folio_free_swap or
                                   | swap cache reclaim.
                                   | CPU1 is holding a reference but it
                                   | doesn't pin the swap cache folio
                                   | as I have demonstrated with the
                                   | test C program previously.
                                   | New S1 is free to be used again.
                                 <Swapout src_pte again using S1>
                                   | No thing blocks this from happening
                                   | The swapout is still using folio A,
                                   | and src_pte == S1.
  folio_trylock(folio)
  move_swap_pte
    double_pt_lock
    is_pte_pages_stable
      | Passed because of S1 is reused so src_pte == S1.
    folio_move_anon_rmap(...)
      | Moved invalid folio B here !!!

It's a long and complex one, I don't think it's practically possible
to happen in reality but in theory doable, once in a million maybe...
Still we have to fix it, or did I got anything wrong here?

>
> Thanks
> barry


  reply	other threads:[~2025-05-27 15:11 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
2025-05-14 20:17 ` [PATCH 01/28] mm, swap: don't scan every fragment cluster Kairui Song
2025-05-14 20:17 ` [PATCH 02/28] mm, swap: consolidate the helper for mincore Kairui Song
2025-05-14 20:17 ` [PATCH 03/28] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
2025-05-14 20:17 ` [PATCH 04/28] mm, swap: split readahead update out of swap cache lookup Kairui Song
2025-05-14 20:17 ` [PATCH 05/28] mm, swap: sanitize swap cache lookup convention Kairui Song
2025-05-19  4:38   ` Barry Song
2025-05-20  3:31     ` Kairui Song
2025-05-20  4:41       ` Barry Song
2025-05-20 19:09         ` Kairui Song
2025-05-20 22:33           ` Barry Song
2025-05-21  2:45             ` Kairui Song
2025-05-21  3:24               ` Barry Song
2025-05-23  2:29               ` Barry Song
2025-05-23 20:01                 ` Kairui Song
2025-05-27  7:58                   ` Barry Song
2025-05-27 15:11                     ` Kairui Song [this message]
2025-05-30  8:49                       ` Kairui Song
2025-05-30 19:24                         ` Kairui Song
2025-05-14 20:17 ` [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers Kairui Song
2025-05-19  6:26   ` Barry Song
2025-05-20  3:50     ` Kairui Song
2025-05-14 20:17 ` [PATCH 07/28] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-05-14 20:17 ` [PATCH 08/28] mm, swap: use swap table for the swap cache and switch API Kairui Song
2025-05-14 20:17 ` [PATCH 09/28] mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc Kairui Song
2025-05-14 20:17 ` [PATCH 10/28] mm, swap: add a swap helper for bypassing only read ahead Kairui Song
2025-05-14 20:17 ` [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check Kairui Song
2025-05-15  9:31   ` Klara Modin
2025-05-15  9:39     ` Kairui Song
2025-05-19  7:08   ` Barry Song
2025-05-19 11:09     ` Kairui Song
2025-05-19 11:57       ` Barry Song
2025-05-14 20:17 ` [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-05-14 20:17 ` [PATCH 13/28] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
2025-05-14 20:17 ` [PATCH 14/28] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-05-14 20:17 ` [PATCH 15/28] mm, swap: split locked entry freeing into a standalone helper Kairui Song
2025-05-14 20:17 ` [PATCH 16/28] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
2025-05-14 20:17 ` [PATCH 17/28] mm, swap: sanitize swap entry management workflow Kairui Song
2025-05-14 20:17 ` [PATCH 18/28] mm, swap: rename and introduce folio_free_swap_cache Kairui Song
2025-05-14 20:17 ` [PATCH 19/28] mm, swap: clean up and improve swap entries batch freeing Kairui Song
2025-05-14 20:17 ` [PATCH 20/28] mm, swap: check swap table directly for checking cache Kairui Song
2025-06-19 10:38   ` Baoquan He
2025-06-19 10:50     ` Kairui Song
2025-06-20  8:04       ` Baoquan He
2025-05-14 20:17 ` [PATCH 21/28] mm, swap: add folio to swap cache directly on allocation Kairui Song
2025-05-14 20:17 ` [PATCH 22/28] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
2025-05-14 20:17 ` [PATCH 23/28] mm, swap: remove no longer needed _swap_info_get Kairui Song
2025-05-14 20:17 ` [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table Kairui Song
2025-05-15  9:40   ` Klara Modin
2025-05-16  2:35     ` Kairui Song
2025-05-14 20:17 ` [PATCH 25/28] mm/workingset: leave highest 8 bits empty for anon shadow Kairui Song
2025-05-14 20:17 ` [PATCH 26/28] mm, swap: minor clean up for swapon Kairui Song
2025-05-14 20:17 ` [PATCH 27/28] mm, swap: use swap table to track swap count Kairui Song
2025-05-14 20:17 ` [PATCH 28/28] mm, swap: implement dynamic allocation of swap table Kairui Song
2025-05-21 18:36   ` Nhat Pham
2025-05-22  4:13     ` Kairui Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com' \
    --to=ryncsn@gmail.com \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bhe@redhat.com \
    --cc=chrisl@kernel.org \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=kaleshsingh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=ryan.roberts@arm.com \
    --cc=shikemeng@huaweicloud.com \
    --cc=tim.c.chen@linux.intel.com \
    --cc=willy@infradead.org \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox