From: Kairui Song <ryncsn@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Baoquan He <bhe@redhat.com>, Chris Li <chrisl@kernel.org>,
David Hildenbrand <david@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Hugh Dickins <hughd@google.com>,
Kalesh Singh <kaleshsingh@google.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>, Nhat Pham <nphamcs@gmail.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Matthew Wilcox <willy@infradead.org>,
"Huang, Ying" <ying.huang@linux.alibaba.com>,
Yosry Ahmed <yosryahmed@google.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
Date: Tue, 27 May 2025 23:11:28 +0800 [thread overview]
Message-ID: <CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com> (raw)
In-Reply-To: <CAGsJ_4ymRwXhQdzabstHhkK0OM0JEWtvR3tjeyQppm7sKZ8FUw@mail.gmail.com>
On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > >
> > > > That will cause quite a performance regression I think. Good thing is,
> > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > performance, eliminating things like this. One more reason to justify
> > > > the approach :)
> >
> > Hi Barry,
> >
> > >
> > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > moves the swap PTE
> > > using non-synchronized I/O. Technically, this seems possible; however,
> > > I was unable
> > > to reproduce it, likely because the time window between swapin_readahead and
> > > taking the page table lock within do_swap_page() is too short.
> >
> > Thank you so much for trying this!
> >
> > I have been trying to reproduce it too, and so far I didn't observe
> > any crash or warn. I added following debug code:
> >
> > static __always_inline
> > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > pmd_t dummy_pmdval;
> > pmd_t dst_pmdval;
> > struct folio *src_folio = NULL;
> > + struct folio *tmp_folio = NULL;
> > struct anon_vma *src_anon_vma = NULL;
> > struct mmu_notifier_range range;
> > int err = 0;
> > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > if (!src_folio)
> > folio = filemap_get_folio(swap_address_space(entry),
> > swap_cache_index(entry));
> > + udelay(get_random_u32_below(1000));
> > + tmp_folio = filemap_get_folio(swap_address_space(entry),
> > + swap_cache_index(entry));
> > + if (!IS_ERR_OR_NULL(tmp_folio)) {
> > + if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > + pr_err("UFFDIO_MOVE: UNSTABLE folio
> > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > tmp_folio->swap.val);
> > + }
> > + folio_put(tmp_folio);
> > + }
> > if (!IS_ERR_OR_NULL(folio)) {
> > if (folio_test_large(folio)) {
> > err = -EBUSY;
> > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > dst_pte, src_pte,
> > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > dst_ptl, src_ptl, src_folio);
> > + if (tmp_folio != folio && !err)
> > + pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > check: %lx -> %lx\n", folio, tmp_folio);
> > }
> >
> > And I saw these two prints are getting triggered like this (not a real
> > issue though, just help to understand the problem)
> > ...
> > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > fffffdffc7ccac80 (51)
> > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > fffffdffc3435e00 (3b)
> > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > fffffdffc34ab8c0 (76)
> > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > fffffdffc34ab8c0 (6d)
> > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > fffffdffc34eda80 (32)
> > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > fffffdffc38cedc0 (b)
> > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > fffffdffc3465cc0 (21)
> > ...
> >
> > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > folio is indeed freed completely from the swap cache after the first
> > lookup, so another swapout can reuse the entry. But as you mentioned
> > __remove_mapping won't release a folio if the refcount check fails, so
> > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > are many places that can happen. But these two helpers won't free a
> > folio from swap cache if its swap count is not zero. And the folio
> > will either be swapped out (swap count non zero), or mapped (freeing
> > it is fine, PTE is non_swap, and another swapout will still use the
> > same folio).
> >
> > So after more investigation and dumping the pages, it's actually the
> > second lookup (tmp_folio) seeing the entry being reused by another
> > page table entry, after the first folio is swapped back and released.
> > So the page table check below will always fail just fine.
> >
> > But this also proves the first look up can see a completely irrelevant
> > folio too: If the src folio is swapped out, but got swapped back and
> > freed, then another folio B shortly got added to swap cache reuse the
> > src folio's old swap entry, then the folio B got seen by the look up
> > here and get freed from swap cache, then src folio got swapped out
> > again also reusing the same entry, then we have a problem as PTE seems
> > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > not wrong:
> >
> > Something like this:
> > CPU1 CPU2
> > move_pages_pte
> > entry = pte_to_swp_entry(orig_src_pte);
> > | Got Swap Entry S1 from src_pte
> > ...
> > <swapin src_pte, using folio A>
>
> I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> sure where folio B comes from in the statement `<someone else tried to
> swap out folio B>`.
>
> If that assumption is correct, and folio A is still in the swapcache,
> how could someone swap in folio B without hitting folio A? That would
> suggest folio A must have been removed from the swapcache earlier—right?
>
> > <free folio A from swap cache freeing S1>
> > <someone else try swap out folio B >
Sorry my bad, I think I made people think folio B is related to
src_pte at this point. What I actually mean is that: Another random
folio B, unrelated to src_pte, could got swapped out, and using the
swap entry S1.
> > <put folio B to swap cache using S1 >
> > ...
> > folio = swap_cache_get_folio(S1)
> > | Got folio B here !!!
> > move_swap_pte
> > <free folio B from swap cache>
> > | Holding a reference doesn't pin the cache
> > | as we have demonstrated
> > <Swapout folio A also using S1>
> > double_pt_lock
> > is_pte_pages_stable
> > | Passed because of S1 is reused
> > folio_move_anon_rmap(...)
> > | Moved invalid folio B here !!!
> >
> > But this is extremely hard to reproduce though, even if doing it
> > deliberately...
> >
> > So I think a "folio_swap_contains" or equivalent check here is a good
> > thing to have, to make it more robust and easier to understand. The
> > checking after locking a folio has very tiny overhead and can
> > definitely ensure the folio's swap entry is valid and stable.
> >
> > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > here might seem problematic, but it's still not a real problem. That's
> > the case where the swapin in src region happens after the lookup, and
> > before the PTE lock. It will pass the PTE check without moving the
> > folio. But the folio is guaranteed to be a completely new folio here
> > because the folio can't be added back to the page table without
> > holding the PTE lock, and if that happens the following PTE check here
> > will fail.
> >
> > So I think we should patch the current kernel only adding a
> > "folio_swap_contains" equivalent check here, and maybe more comments,
> > how do you think?
>
> The description appears to have some inconsistencies.
> Would you mind rephrasing it?
Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
fffffdffc385fb00" part first, as both you and me have come into a
conclusion that "filemap_get_folio() returns NULL before
move_swap_pte, but a folio was added to swap cache" is OK, and this
output only proves that happens.
So the problematic race is:
Here move_pages_pte is moving src_pte to dst_pte, and it begins with
src_pte == swap entry S1, and S1 isn't cached.
CPU1 CPU2
move_pages_pte()
entry = pte_to_swp_entry(orig_src_pte);
| src_pte is absent, and got entry == S1
... < Somehow interrupted> ...
<swapin src_pte, using folio A>
| folio A is just a new allocated folio
| for resolving the swap in fault.
<free folio A from swap cache freeing S1>
| swap in fault is resolved, src_pte
| now points to folio A, so folio A
| can get freed just fine.
| And now S1 is free to be used
| by anyone.
<someone else try swap out another folio B >
| Folio B is a completely unrelated
| folio swapped out by random process.
| (has nothing to do with src_pte)
| But S1 is freed so it may use S1
| as its swap entry.
<put folio B to swap cache with index S1 >
...
folio = filemap_get_folio(S1)
| The lookup is using S1, so it
| got folio B here !!!
... < Somehow interrupted> ...
<free folio B from swap cache>
| Folio B could fail to be swapped out
| or got swapped in again, so it can
| be freed by folio_free_swap or
| swap cache reclaim.
| CPU1 is holding a reference but it
| doesn't pin the swap cache folio
| as I have demonstrated with the
| test C program previously.
| New S1 is free to be used again.
<Swapout src_pte again using S1>
| No thing blocks this from happening
| The swapout is still using folio A,
| and src_pte == S1.
folio_trylock(folio)
move_swap_pte
double_pt_lock
is_pte_pages_stable
| Passed because of S1 is reused so src_pte == S1.
folio_move_anon_rmap(...)
| Moved invalid folio B here !!!
It's a long and complex one, I don't think it's practically possible
to happen in reality but in theory doable, once in a million maybe...
Still we have to fix it, or did I got anything wrong here?
>
> Thanks
> barry
next prev parent reply other threads:[~2025-05-27 15:11 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
2025-05-14 20:17 ` [PATCH 01/28] mm, swap: don't scan every fragment cluster Kairui Song
2025-05-14 20:17 ` [PATCH 02/28] mm, swap: consolidate the helper for mincore Kairui Song
2025-05-14 20:17 ` [PATCH 03/28] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
2025-05-14 20:17 ` [PATCH 04/28] mm, swap: split readahead update out of swap cache lookup Kairui Song
2025-05-14 20:17 ` [PATCH 05/28] mm, swap: sanitize swap cache lookup convention Kairui Song
2025-05-19 4:38 ` Barry Song
2025-05-20 3:31 ` Kairui Song
2025-05-20 4:41 ` Barry Song
2025-05-20 19:09 ` Kairui Song
2025-05-20 22:33 ` Barry Song
2025-05-21 2:45 ` Kairui Song
2025-05-21 3:24 ` Barry Song
2025-05-23 2:29 ` Barry Song
2025-05-23 20:01 ` Kairui Song
2025-05-27 7:58 ` Barry Song
2025-05-27 15:11 ` Kairui Song [this message]
2025-05-30 8:49 ` Kairui Song
2025-05-30 19:24 ` Kairui Song
2025-05-14 20:17 ` [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers Kairui Song
2025-05-19 6:26 ` Barry Song
2025-05-20 3:50 ` Kairui Song
2025-05-14 20:17 ` [PATCH 07/28] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-05-14 20:17 ` [PATCH 08/28] mm, swap: use swap table for the swap cache and switch API Kairui Song
2025-05-14 20:17 ` [PATCH 09/28] mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc Kairui Song
2025-05-14 20:17 ` [PATCH 10/28] mm, swap: add a swap helper for bypassing only read ahead Kairui Song
2025-05-14 20:17 ` [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check Kairui Song
2025-05-15 9:31 ` Klara Modin
2025-05-15 9:39 ` Kairui Song
2025-05-19 7:08 ` Barry Song
2025-05-19 11:09 ` Kairui Song
2025-05-19 11:57 ` Barry Song
2025-05-14 20:17 ` [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-05-14 20:17 ` [PATCH 13/28] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
2025-05-14 20:17 ` [PATCH 14/28] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-05-14 20:17 ` [PATCH 15/28] mm, swap: split locked entry freeing into a standalone helper Kairui Song
2025-05-14 20:17 ` [PATCH 16/28] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
2025-05-14 20:17 ` [PATCH 17/28] mm, swap: sanitize swap entry management workflow Kairui Song
2025-05-14 20:17 ` [PATCH 18/28] mm, swap: rename and introduce folio_free_swap_cache Kairui Song
2025-05-14 20:17 ` [PATCH 19/28] mm, swap: clean up and improve swap entries batch freeing Kairui Song
2025-05-14 20:17 ` [PATCH 20/28] mm, swap: check swap table directly for checking cache Kairui Song
2025-06-19 10:38 ` Baoquan He
2025-06-19 10:50 ` Kairui Song
2025-06-20 8:04 ` Baoquan He
2025-05-14 20:17 ` [PATCH 21/28] mm, swap: add folio to swap cache directly on allocation Kairui Song
2025-05-14 20:17 ` [PATCH 22/28] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
2025-05-14 20:17 ` [PATCH 23/28] mm, swap: remove no longer needed _swap_info_get Kairui Song
2025-05-14 20:17 ` [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table Kairui Song
2025-05-15 9:40 ` Klara Modin
2025-05-16 2:35 ` Kairui Song
2025-05-14 20:17 ` [PATCH 25/28] mm/workingset: leave highest 8 bits empty for anon shadow Kairui Song
2025-05-14 20:17 ` [PATCH 26/28] mm, swap: minor clean up for swapon Kairui Song
2025-05-14 20:17 ` [PATCH 27/28] mm, swap: use swap table to track swap count Kairui Song
2025-05-14 20:17 ` [PATCH 28/28] mm, swap: implement dynamic allocation of swap table Kairui Song
2025-05-21 18:36 ` Nhat Pham
2025-05-22 4:13 ` Kairui Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com' \
--to=ryncsn@gmail.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bhe@redhat.com \
--cc=chrisl@kernel.org \
--cc=david@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=kaleshsingh@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=ryan.roberts@arm.com \
--cc=shikemeng@huaweicloud.com \
--cc=tim.c.chen@linux.intel.com \
--cc=willy@infradead.org \
--cc=ying.huang@linux.alibaba.com \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox