linux-mm.kvack.org archive mirror
From: Dev Jain <dev.jain@arm.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, Liam.Howlett@oracle.com,
	lorenzo.stoakes@oracle.com, vbabka@suse.cz, jannh@google.com,
	pfalcato@suse.de, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, david@redhat.com,
	peterx@redhat.com, ryan.roberts@arm.com, mingo@kernel.org,
	libang.li@antgroup.com, maobibo@loongson.cn,
	zhengqi.arch@bytedance.com, anshuman.khandual@arm.com,
	willy@infradead.org, ioworker0@gmail.com,
	yang@os.amperecomputing.com, baolin.wang@linux.alibaba.com,
	ziy@nvidia.com, hughd@google.com
Subject: Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
Date: Tue, 10 Jun 2025 13:14:45 +0530	[thread overview]
Message-ID: <bf185ecc-8310-48ad-b9cc-5c78e3da6d0b@arm.com> (raw)
In-Reply-To: <CAGsJ_4xPq-eJ7JE-SFhhO2TboH8HKGifaYCwKw8cqd_2K=uD4w@mail.gmail.com>


On 10/06/25 12:33 pm, Barry Song wrote:
> Hi Dev,
>
> On Tue, Jun 10, 2025 at 3:51 PM Dev Jain <dev.jain@arm.com> wrote:
>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig blocks.
>>
>> For split folios, there will be no pte batching; nr_ptes will be 1. For
>> pagetable splitting, the ptes will still point to the same large folio;
>> for arm64, this results in the optimization described above, and for other
>> arches (including the general case), a minor improvement is expected due to
>> a reduction in the number of function calls.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>>   1 file changed, 32 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 180b12225368..18b215521ada 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>          return pte;
>>   }
>>
>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> +               pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +       struct folio *folio;
>> +
>> +       if (max_nr == 1)
>> +               return 1;
>> +
>> +       folio = vm_normal_folio(vma, addr, pte);
>> +       if (!folio || !folio_test_large(folio))
> I'm curious about the following case:
> If the addr/ptep is not the first subpage of the folio—for example, the
> 14th subpage—will mremap_folio_pte_batch() return 3?

It will return the number of PTEs, starting from the PTE pointing to the 14th
subpage, that point to consecutive pages of the same large folio, up to max_nr.
For example, if we are operating on a single large folio of order 4, then max_nr
will be 16 - 14 + 1 = 3. So in this case we will return 3, since the 14th, 15th
and 16th PTEs point to consecutive pages of the same large folio.
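
To make the walk concrete, here is a rough sketch of what the batching
boils down to (ignoring the fpb_t flag handling, the folio-boundary check
and the out-parameters of the real folio_pte_batch()):

/*
 * Illustrative sketch only -- not the real folio_pte_batch().
 * Count how many PTEs, starting at ptep, map consecutive pages
 * of the same large folio, capped at max_nr.
 */
static int batch_count_sketch(pte_t *ptep, pte_t pte, int max_nr)
{
	unsigned long pfn = pte_pfn(pte);
	int nr = 1;

	while (nr < max_nr) {
		pte_t next = ptep_get(ptep + nr);

		/* Stop at the first hole or non-consecutive pfn. */
		if (!pte_present(next) || pte_pfn(next) != pfn + nr)
			break;
		nr++;
	}
	return nr;
}

So for the order-4 folio above, starting at the 14th subpage with
max_nr == 3, the loop matches the 15th and 16th entries and returns 3.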

> If so, get_and_clear_full_ptes() would operate on 3 subpages of the folio.
> In that case, can unfold still work correctly?

Yes. First we unfold, i.e. we do a BBM (break-before-make) sequence:
contig -> clear -> non-contig. Then, on this now non-contig block, we clear
only the PTEs we were asked to clear.
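
Roughly, per contig block of CONT_PTES entries, the unfold amounts to
the following (sketch only; the real arm64 contpte code batches the
TLBI differently and takes more care with a/d-bit collection):

	pte_t saved[CONT_PTES];
	int i;

	/* Collect the entries first, so a/d bits are not lost. */
	for (i = 0; i < CONT_PTES; i++)
		saved[i] = ptep_get(ptep + i);

	/* Break: clear the whole contig block... */
	for (i = 0; i < CONT_PTES; i++)
		ptep_clear(mm, addr + i * PAGE_SIZE, ptep + i);

	/* ...and issue one TLBI covering the block. */
	flush_tlb_range(vma, addr, addr + CONT_PTES * PAGE_SIZE);

	/* Make: write the same PTEs back without PTE_CONT. */
	for (i = 0; i < CONT_PTES; i++)
		set_pte(ptep + i, pte_mknoncont(saved[i]));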

>
> Similarly, if the addr/ptep points to the first subpage, but max_nr is
> less than CONT_PTES, what will happen in that case?
>
>
>> +               return 1;
>> +
>> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>> +                              NULL, NULL);
>> +}
>> +
>>   static int move_ptes(struct pagetable_move_control *pmc,
>>                  unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>   {
>> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>          bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>          struct mm_struct *mm = vma->vm_mm;
>>          pte_t *old_ptep, *new_ptep;
>> -       pte_t pte;
>> +       pte_t old_pte, pte;
>>          pmd_t dummy_pmdval;
>>          spinlock_t *old_ptl, *new_ptl;
>>          bool force_flush = false;
>> @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>          unsigned long new_addr = pmc->new_addr;
>>          unsigned long old_end = old_addr + extent;
>>          unsigned long len = old_end - old_addr;
>> +       int max_nr_ptes;
>> +       int nr_ptes;
>>          int err = 0;
>>
>>          /*
>> @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>          flush_tlb_batched_pending(vma->vm_mm);
>>          arch_enter_lazy_mmu_mode();
>>
>> -       for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> -                                  new_ptep++, new_addr += PAGE_SIZE) {
>> +       for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
>> +               new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>>                  VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>>
>> -               if (pte_none(ptep_get(old_ptep)))
>> +               nr_ptes = 1;
>> +               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> +               old_pte = ptep_get(old_ptep);
>> +               if (pte_none(old_pte))
>>                          continue;
>>
>> -               pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>                  /*
>>                   * If we are remapping a valid PTE, make sure
>>                   * to flush TLB before we drop the PTL for the
>> @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>                   * the TLB entry for the old mapping has been
>>                   * flushed.
>>                   */
>> -               if (pte_present(pte))
>> +               if (pte_present(old_pte)) {
>> +                       nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> +                                                        old_pte, max_nr_ptes);
>>                          force_flush = true;
>> +               }
>> +               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>>                  pte = move_pte(pte, old_addr, new_addr);
>>                  pte = move_soft_dirty_pte(pte);
>>
>> @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>                                  else if (is_swap_pte(pte))
>>                                          pte = pte_swp_clear_uffd_wp(pte);
>>                          }
>> -                       set_pte_at(mm, new_addr, new_ptep, pte);
>> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>                  }
>>          }
>>
>> --
>> 2.30.2
>>
> Thanks
> Barry



Thread overview: 25+ messages
2025-06-10  3:50 [PATCH v4 0/2] Optimize mremap() for large folios Dev Jain
2025-06-10  3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
2025-06-11 13:23   ` David Hildenbrand
2025-06-11 13:25     ` Dev Jain
2025-06-11 13:29       ` Lorenzo Stoakes
2025-06-11 13:31         ` David Hildenbrand
2025-06-12 12:05   ` Pedro Falcato
2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
2025-06-10  7:03   ` Barry Song
2025-06-10  7:44     ` Dev Jain [this message]
2025-06-10  8:11       ` Barry Song
2025-06-16 21:27         ` Ryan Roberts
2025-06-10  8:37   ` Barry Song
2025-06-10 13:18   ` Lorenzo Stoakes
2025-06-11 14:00   ` David Hildenbrand
2025-06-13  4:24     ` Dev Jain
2025-06-17  8:02       ` David Hildenbrand
2025-06-13 12:32     ` Lorenzo Stoakes
2025-06-16 16:13       ` David Hildenbrand
2025-06-12 12:13   ` Pedro Falcato
2025-10-27 21:40   ` David Hildenbrand
2025-10-28  5:32     ` Dev Jain
2025-10-28  7:14       ` David Hildenbrand
2025-06-10 12:11 ` [PATCH v4 0/2] Optimize mremap() for large folios Lorenzo Stoakes
2025-06-10 12:33   ` Dev Jain
