From: Kairui Song <ryncsn@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Peter Xu <peterx@redhat.com>,
Suren Baghdasaryan <surenb@google.com>,
Andrea Arcangeli <aarcange@redhat.com>,
David Hildenbrand <david@redhat.com>,
stable@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
Date: Sat, 31 May 2025 15:00:31 +0800
Message-ID: <CAMgjq7BzBFTOb-urmfuF5y6qsWxwFMy0Eq=Fym+2x2pjcqg1fQ@mail.gmail.com>
In-Reply-To: <CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com>
On Sat, May 31, 2025 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 6:25 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > >
> > > > On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > From: Kairui Song <kasong@tencent.com>
> > > > >
> > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > > lookup, and tries to move the found folio to the faulting vma.
> > > > > Currently, it relies on a PTE value check to ensure the moved folio
> > > > > still belongs to the src swap entry, which turns out not to be reliable.
> > > > >
> > > > > While working on and reviewing the swap table series with Barry, the
> > > > > following existing race was observed and reproduced [1]:
> > > > >
> > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > > > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > > >
> > > > > CPU1                               CPU2
> > > > > userfaultfd_move
> > > > >   move_pages_pte()
> > > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > > >     // Here it got entry = S1
> > > > >     ... < Somehow interrupted> ...
> > > > >                                    <swapin src_pte, alloc and use folio A>
> > > > >                                    // folio A is a newly allocated folio
> > > > >                                    // and gets installed into src_pte
> > > > >                                    <frees swap entry S1>
> > > > >                                    // src_pte now points to folio A, S1
> > > > >                                    // has swap count == 0, it can be freed
> > > > >                                    // by folio_free_swap or the swap
> > > > >                                    // allocator's reclaim.
> > > > >                                    <try to swap out another folio B>
> > > > >                                    // folio B is a folio in another VMA.
> > > > >                                    <put folio B to swap cache using S1>
> > > > >                                    // S1 is freed, folio B could use it
> > > > >                                    // for swap out with no problem.
> > > > >     ...
> > > > >     folio = filemap_get_folio(S1)
> > > > >     // Got folio B here !!!
> > > > >     ... < Somehow interrupted again> ...
> > > > >                                    <swapin folio B and free S1>
> > > > >                                    // Now S1 is free to be used again.
> > > > >                                    <swapout src_pte & folio A using S1>
> > > > >                                    // Now src_pte is a swap entry pte
> > > > >                                    // holding S1 again.
> > > > >     folio_trylock(folio)
> > > > >     move_swap_pte
> > > > >       double_pt_lock
> > > > >       is_pte_pages_stable
> > > > >       // Check passed because src_pte == S1
> > > > >       folio_move_anon_rmap(...)
> > > > >       // Moved invalid folio B here !!!
> > > > >
> > > > > The race window is very short and requires multiple rare events to
> > > > > collide, so it's very unlikely to happen, but with a deliberately
> > > > > constructed reproducer and an increased time window, it can be
> > > > > reproduced [1].
> > > >
> > > > Thanks for catching and fixing this. Just to clarify a few things
> > > > about your reproducer:
> > > > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> > > > MAP_PRIVATE will work as well?
> > > > 2. You mentioned that the 'current dir is on a block device'. Are you
> > > > indicating that if we are using zram for swap then it doesn't
> > > > reproduce?
> > > >
> > > > >
> > > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > > after the filemap_get_folio lookup, in which case folio (A) may stay in
> > > > > the swap cache, so it needs to be moved too. In this case we should also
> > > > > retry, so the kernel won't miss a folio move.
> > > > >
> > > > > Fix this by checking whether the folio is still the valid swap cache
> > > > > folio after acquiring the folio lock, and by checking the swap cache
> > > > > again after acquiring the src_pte lock.
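> > > > >
> > > > > For the first check, the logic is roughly the following (a minimal
> > > > > sketch for illustration, not the exact hunk; helper names follow
> > > > > current kernels and may differ across versions):
> > > > >
> > > > > 	/*
> > > > > 	 * After folio_trylock() succeeds, re-verify that the locked
> > > > > 	 * folio is still the swap cache folio backing the src entry;
> > > > > 	 * otherwise drop it and let the caller retry.
> > > > > 	 */
> > > > > 	entry = pte_to_swp_entry(orig_src_pte);
> > > > > 	if (!folio_test_swapcache(folio) || folio->swap.val != entry.val) {
> > > > > 		folio_unlock(folio);
> > > > > 		folio_put(folio);
> > > > > 		return -EAGAIN;
> > > > > 	}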
> > > > >
> > > > > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > > > > far we don't need to worry about it, since folios might only get exposed
> > > > > to the swap cache in the swap out path, and that is covered by this patch
> > > > > too, by checking the swap cache again after acquiring the src_pte lock.
> > > > >
> > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > > ---
> > > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > > >  1 file changed, 26 insertions(+)
> > > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index bc473ad21202..a1564d205dfb 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -15,6 +15,7 @@
> > > > >  #include <linux/mmu_notifier.h>
> > > > >  #include <linux/hugetlb.h>
> > > > >  #include <linux/shmem_fs.h>
> > > > > +#include <linux/delay.h>
> > > > I guess you mistakenly left this in from your reproducer code :)
> > > > >  #include <asm/tlbflush.h>
> > > > >  #include <asm/tlb.h>
> > > > >  #include "internal.h"
> > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > >  			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > >  			 struct folio *src_folio)
> > > > >  {
> > > > > +	swp_entry_t entry;
> > > > > +
> > > > >  	double_pt_lock(dst_ptl, src_ptl);
> > > > >
> > > > >  	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > >  	if (src_folio) {
> > > > >  		folio_move_anon_rmap(src_folio, dst_vma);
> > > > >  		src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > +	} else {
> > > > > +		/*
> > > > > +		 * Check again after acquiring the src_pte lock. Or we might
> > > > > +		 * miss a newly loaded swap cache folio.
> > > > > +		 */
> > > > > +		entry = pte_to_swp_entry(orig_src_pte);
> > > > > +		src_folio = filemap_get_folio(swap_address_space(entry),
> > > > > +					      swap_cache_index(entry));
> > > >
> > > > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > > > would work if filemap_get_folio() were called only once, after locking
> > > > src_ptl? Please correct me if my assumption about the overhead is wrong.
> > >
> > > Not quite sure, as we have a folio_lock(src_folio) before move_swap_pte().
> > > Can we safely do folio_move_anon_rmap() + set src_folio->index while not
> > > holding the folio lock?
> >
> > I think not; we can't even be sure the folio is still in the swap
> > cache, so it could be a freed folio that doesn't belong to any VMA
> > while the folio lock isn't held.
>
> Right, but would the following be sufficient, given that we don't really
> care about the folio itself, only whether there's a new swap cache entry?
>
> 	if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
> 		double_pt_unlock(dst_ptl, src_ptl);
> 		return -EAGAIN;
> 	}
The problem is that reading swap_map without locking the cluster seems
unstable and has strange false positives: a swapin will set this bit
first, before adding the folio to the swap cache, or even when skipping
the swap cache entirely, so that check could make things more complex.
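
To illustrate, a swapin path roughly does the following (a simplified
sketch, not the exact kernel code; signatures vary across versions):

	/*
	 * swapcache_prepare() sets SWAP_HAS_CACHE on the entry *before*
	 * any folio is added to the swap cache, and the SWP_SYNCHRONOUS_IO
	 * path may never add one at all. So a reader testing only this bit
	 * can see a "cache" that filemap_get_folio() will never find.
	 */
	if (swapcache_prepare(entry, 1))	/* SWAP_HAS_CACHE set here */
		return -EEXIST;			/* raced with another swapin */
	folio = folio_alloc(GFP_HIGHUSER_MOVABLE, 0);
	/* ... window: the bit is set, but the swap cache is still empty ... */
	add_to_swap_cache(folio, entry, GFP_KERNEL, &shadow);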