Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE

From: Barry Song <21cnbao@gmail.com>
To: Lokesh Gidra <lokeshgidra@google.com>
Cc: Peter Xu <peterx@redhat.com>,
	akpm@linux-foundation.org, aarcange@redhat.com,
	 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	ngeoffray@google.com,  Suren Baghdasaryan <surenb@google.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	 Barry Song <v-songbaohua@oppo.com>,
	David Hildenbrand <david@redhat.com>
Subject: Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
Date: Wed, 13 Aug 2025 17:03:02 +0800
Message-ID: <CAGsJ_4y98H-8aK9r_5YrSPV=SCU=-rZf7YPMz32K0C8oFnUCNA@mail.gmail.com>
In-Reply-To: <CA+EESO763JtY3jjmgGGbJjSQcnJJFNpDs2iPaUwjP44VpyyzdQ@mail.gmail.com>

On Tue, Aug 12, 2025 at 11:44 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Tue, Aug 12, 2025 at 7:44 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> > > Hi Lokesh,
[...]
> > > >
> > > >  mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > > >  1 file changed, 127 insertions(+), 51 deletions(-)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index cbed91b09640..39d81d2972db 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > >                pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > >  }
> > > >
> > > > -static int move_present_pte(struct mm_struct *mm,
> > > > -                           struct vm_area_struct *dst_vma,
> > > > -                           struct vm_area_struct *src_vma,
> > > > -                           unsigned long dst_addr, unsigned long src_addr,
> > > > -                           pte_t *dst_pte, pte_t *src_pte,
> > > > -                           pte_t orig_dst_pte, pte_t orig_src_pte,
> > > > -                           pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > -                           spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > -                           struct folio *src_folio)
> > > > +/*
> > > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > > > + *
> > > > + * NOTE: folio's reference is not required as the whole operation is within
> > > > + * PTL's critical section.
> > > > + */
> > > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > > +                                                unsigned long src_addr,
> > > > +                                                pte_t *src_pte, pte_t *dst_pte,
> > > > +                                                struct anon_vma *src_anon_vma)
> > > > +{
> > > > +       pte_t orig_dst_pte, orig_src_pte;
> > > > +       struct folio *folio;
> > > > +
> > > > +       orig_dst_pte = ptep_get(dst_pte);
> > > > +       if (!pte_none(orig_dst_pte))
> > > > +               return NULL;
> > > > +
> > > > +       orig_src_pte = ptep_get(src_pte);
> > > > +       if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > > > +               return NULL;
> > > > +
> > > > +       folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > > +       if (!folio || !folio_trylock(folio))
> > > > +               return NULL;
> > > > +       if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > > +           folio_anon_vma(folio) != src_anon_vma) {
> > > > +               folio_unlock(folio);
> > > > +               return NULL;
> > > > +       }
> > > > +       return folio;
> > > > +}
> > > > +
> > >
> > > I’m still quite confused by the code. Before move_present_ptes(), we’ve
> > > already performed all the checks—pte_same(), vm_normal_folio(),
> > > folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> > > and anon_vma_lock_write()—at least for the first PTE. Now we’re
> > > duplicating them again for all PTEs. Does this mean we’re doing those
> > > operations for the first PTE twice? It feels like the old non-batch check
> > > code should be removed?
> >
> > This function should only start to work on the 2nd (or more) continuous
> > ptes to move within the same pgtable lock held.  We'll still need the
> > > original path because that was sleepable, this one isn't, and it's a
> > > best-effort fast path only. E.g. if trylock() fails above, it would
> > > fall back to the slow path.
> >
> Thanks Peter. I was about to give exactly the same reasoning :)

Apologies, I overlooked this part:
                src_addr += PAGE_SIZE;
                if (src_addr == addr_end)
                        break;
                dst_addr += PAGE_SIZE;
                dst_pte++;
                src_pte++;
                folio_unlock(src_folio);
                src_folio = check_ptes_for_batched_move(src_vma, src_addr,
                                                        src_pte, dst_pte,
                                                        src_anon_vma);

I still find this a little tricky to follow. Couldn’t we handle it the
way the other batched cases do, for example:

static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
                        struct page_vma_mapped_walk *pvmw,
                        enum ttu_flags flags, pte_t pte)

We pass the first PTE and use a function to determine how many PTEs we
can batch together. That way, we don’t need a special path for the first
PTE.
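
As a rough userspace sketch of that shape (not kernel code, and not what
the patch does): struct pte_model, pte_batchable() and move_pte_batch()
are hypothetical names made up for illustration, and the model
deliberately leaves out folio locking, which is the part that makes the
real case harder.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one source/destination PTE pair. */
struct pte_model {
	bool src_present;    /* source PTE maps a page */
	bool dst_none;       /* destination PTE is still empty */
	bool anon_exclusive; /* page is exclusively owned */
	bool large_folio;    /* page belongs to a large folio */
	int  anon_vma_id;    /* stand-in for the folio's anon_vma */
};

/* True if this entry may join a batch that belongs to anon_vma_id. */
static bool pte_batchable(const struct pte_model *p, int anon_vma_id)
{
	return p->src_present && p->dst_none && p->anon_exclusive &&
	       !p->large_folio && p->anon_vma_id == anon_vma_id;
}

/*
 * Count how many consecutive entries, starting at ptes[0], can be moved
 * as one batch. ptes[0] is assumed to have been validated by the caller
 * already, so it is always counted when max_nr is non-zero.
 */
static size_t move_pte_batch(const struct pte_model *ptes, size_t max_nr,
			     int anon_vma_id)
{
	size_t nr;

	if (max_nr == 0)
		return 0;
	for (nr = 1; nr < max_nr; nr++) {
		if (!pte_batchable(&ptes[nr], anon_vma_id))
			break;
	}
	return nr;
}
```

The caller would then move the first `nr` entries in one go and issue a
single TLB flush for the whole range, instead of checking and moving one
entry at a time.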

I guess the challenge is that the first PTE needs to handle
split_folio(), folio_trylock() with -EAGAIN, and
anon_vma_trylock_write(), while the other PTEs don’t?

If so, could we add a clear comment explaining that move_present_ptes()
only moves PTEs that share the same anon_vma as the first PTE, do not map
large folios, and whose folio can be locked with folio_trylock()?
As soon as any of these conditions isn’t met, the batch stops.
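
To make the stop condition concrete, here is a small userspace model of
the loop as I understand it from the hunk above. It is only a sketch:
struct folio_model and the *_model() helpers are invented for
illustration, the lock is a plain flag rather than a real folio lock, and
the first entry is assumed to have been locked and validated by the
sleepable slow path before the loop starts.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the folio behind one source PTE. */
struct folio_model {
	bool locked;      /* models the folio lock as a simple flag */
	bool large;       /* models folio_test_large() */
	int  anon_vma_id; /* stands in for folio_anon_vma() */
};

static bool folio_model_trylock(struct folio_model *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

static void folio_model_unlock(struct folio_model *f)
{
	f->locked = false;
}

/*
 * Precondition: folios[0] is already locked and validated by the caller.
 * Returns how many entries were "moved". The batch ends at the first
 * entry that cannot be trylocked, is large, or has another anon_vma.
 */
static size_t move_present_ptes_model(struct folio_model *folios, size_t nr,
				      int anon_vma_id)
{
	size_t moved = 0;

	while (moved < nr) {
		struct folio_model *next;

		/* folios[moved] is locked and eligible here: move it. */
		folio_model_unlock(&folios[moved]);
		moved++;
		if (moved == nr)
			break;

		/* Best-effort check of the next entry; never sleeps. */
		next = &folios[moved];
		if (!folio_model_trylock(next))
			break;
		if (next->large || next->anon_vma_id != anon_vma_id) {
			folio_model_unlock(next);
			break;
		}
	}
	return moved;
}
```

Whatever is left after the batch stops goes back through the slow path on
the next iteration of the caller, so stopping early costs performance but
not correctness.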

Thanks
Barry
