From: Barry Song <21cnbao@gmail.com>
To: Lokesh Gidra <lokeshgidra@google.com>
Cc: Peter Xu <peterx@redhat.com>,
akpm@linux-foundation.org, aarcange@redhat.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
ngeoffray@google.com, Suren Baghdasaryan <surenb@google.com>,
Kalesh Singh <kaleshsingh@google.com>,
Barry Song <v-songbaohua@oppo.com>,
David Hildenbrand <david@redhat.com>
Subject: Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
Date: Wed, 13 Aug 2025 17:03:02 +0800
Message-ID: <CAGsJ_4y98H-8aK9r_5YrSPV=SCU=-rZf7YPMz32K0C8oFnUCNA@mail.gmail.com>
In-Reply-To: <CA+EESO763JtY3jjmgGGbJjSQcnJJFNpDs2iPaUwjP44VpyyzdQ@mail.gmail.com>
On Tue, Aug 12, 2025 at 11:44 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Tue, Aug 12, 2025 at 7:44 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> > > Hi Lokesh,
[...]
> > > >
> > > > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > > > 1 file changed, 127 insertions(+), 51 deletions(-)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index cbed91b09640..39d81d2972db 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > > }
> > > >
> > > > -static int move_present_pte(struct mm_struct *mm,
> > > > - struct vm_area_struct *dst_vma,
> > > > - struct vm_area_struct *src_vma,
> > > > - unsigned long dst_addr, unsigned long src_addr,
> > > > - pte_t *dst_pte, pte_t *src_pte,
> > > > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > > > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > - struct folio *src_folio)
> > > > +/*
> > > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > > > + *
> > > > + * NOTE: folio's reference is not required as the whole operation is within
> > > > + * PTL's critical section.
> > > > + */
> > > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > > + unsigned long src_addr,
> > > > + pte_t *src_pte, pte_t *dst_pte,
> > > > + struct anon_vma *src_anon_vma)
> > > > +{
> > > > + pte_t orig_dst_pte, orig_src_pte;
> > > > + struct folio *folio;
> > > > +
> > > > + orig_dst_pte = ptep_get(dst_pte);
> > > > + if (!pte_none(orig_dst_pte))
> > > > + return NULL;
> > > > +
> > > > + orig_src_pte = ptep_get(src_pte);
> > > > + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > > > + return NULL;
> > > > +
> > > > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > > + if (!folio || !folio_trylock(folio))
> > > > + return NULL;
> > > > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > > + folio_anon_vma(folio) != src_anon_vma) {
> > > > + folio_unlock(folio);
> > > > + return NULL;
> > > > + }
> > > > + return folio;
> > > > +}
> > > > +
> > >
> > > I’m still quite confused by the code. Before move_present_ptes(), we’ve
> > > already performed all the checks—pte_same(), vm_normal_folio(),
> > > folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> > > and anon_vma_lock_write()—at least for the first PTE. Now we’re
> > > duplicating them again for all PTEs. Does this mean we’re doing those
> > > operations for the first PTE twice? It feels like the old non-batch check
> > > code should be removed?
> >
> > This function should only start to work on the 2nd (or more) continuous
> > ptes to move within the same pgtable lock held. We'll still need the
> > original path because that was sleepable, this one isn't, and it's only
> > best-effort fast path only. E.g. if trylock() fails above, it would
> > fallback to the slow path.
> >
> Thanks Peter. I was about to give exactly the same reasoning :)

Apologies, I overlooked this part:

                src_addr += PAGE_SIZE;
                if (src_addr == addr_end)
                        break;
                dst_addr += PAGE_SIZE;
                dst_pte++;
                src_pte++;

                folio_unlock(src_folio);
                src_folio = check_ptes_for_batched_move(src_vma,
                                        src_addr, src_pte,
                                        dst_pte, src_anon_vma);

I still find this a little tricky to follow. Couldn’t we handle it the
same way as the other batched cases, for example:

static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
                        struct page_vma_mapped_walk *pvmw,
                        enum ttu_flags flags, pte_t pte)

There, we pass the first PTE to a helper that determines how many PTEs
can be batched together. That way, we don’t need a special path for the
first PTE.
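
To make the idea concrete, here is a rough userspace sketch of a
"count first, then move" helper. It is not kernel code: struct pte_model,
pte_model_batchable() and count_batchable() are hypothetical names, and
each field only stands in for a check the patch performs
(pte_none(), pte_present(), PageAnonExclusive(), folio_test_large(),
folio_trylock(), folio_anon_vma()). It also ignores how the folio locks
would have to be held between counting and moving, which the real code
would need to solve.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for what one src/dst PTE pair would tell us. */
struct pte_model {
	bool dst_none;        /* destination slot is empty */
	bool src_present;     /* source maps a present, non-zero page */
	bool anon_exclusive;  /* page is exclusively owned anon memory */
	bool large;           /* page belongs to a large folio */
	bool trylock_ok;      /* a non-sleeping folio lock attempt would succeed */
	int  anon_vma_id;     /* stand-in for the folio's anon_vma pointer */
};

/* True if this entry passes every non-sleeping check for joining a batch. */
static bool pte_model_batchable(const struct pte_model *p, int first_anon_vma_id)
{
	return p->dst_none && p->src_present && p->anon_exclusive &&
	       !p->large && p->trylock_ok &&
	       p->anon_vma_id == first_anon_vma_id;
}

/*
 * Count how many consecutive entries, starting at ptes[0], could move
 * together. ptes[0] is assumed to have been validated already by the
 * sleepable slow path, so it always counts; the scan stops at the first
 * later entry that fails a check.
 */
static size_t count_batchable(const struct pte_model *ptes, size_t max_nr)
{
	size_t nr;

	if (max_nr == 0)
		return 0;
	for (nr = 1; nr < max_nr; nr++) {
		if (!pte_model_batchable(&ptes[nr], ptes[0].anon_vma_id))
			break;
	}
	return nr;
}
```

With four entries where the third belongs to a large folio,
count_batchable() reports a batch of two, and the caller would fall back
to the slow path for the third entry.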

I guess the challenge is that the first PTE needs to handle
split_folio(), folio_trylock() with -EAGAIN, and
anon_vma_trylock_write(), while the other PTEs don’t?

If so, could we add a clear comment explaining that move_present_ptes()
keeps moving PTEs only while the mapped folio shares the same anon_vma as
the first PTE’s folio, is not a large folio, and can be locked with
folio_trylock()? As soon as any of these conditions isn’t met, the batch
stops.

Thanks
Barry