From: Barry Song <21cnbao@gmail.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, david@kernel.org,
catalin.marinas@arm.com, will@kernel.org,
lorenzo.stoakes@oracle.com, ryan.roberts@arm.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, riel@surriel.com,
harry.yoo@oracle.com, jannh@google.com, willy@infradead.org,
dev.jain@arm.com, linux-mm@kvack.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
Date: Sat, 7 Mar 2026 16:02:47 +0800
Message-ID: <CAGsJ_4woBVQdYcCbN1Btr2vxOL8OAujaHkNZZ41=S-TdNh-wbw@mail.gmail.com>
In-Reply-To: <a4d7cf56-eab4-431d-886b-a32456e44736@linux.alibaba.com>
On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 3/7/26 5:07 AM, Barry Song wrote:
> > On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >> Currently, folio_referenced_one() always checks the young flag for each PTE
> >> sequentially, which is inefficient for large folios. This inefficiency is
> >> especially noticeable when reclaiming clean file-backed large folios, where
> >> folio_referenced() is observed as a significant performance hotspot.
> >>
> >> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> >> an optimization to clear the young flags for PTEs within a contiguous range.
> >> However, this is not sufficient. We can extend this to perform batched operations
> >> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> >>
> >> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> >> of the young flags and flushing TLB entries, thereby improving performance
> >> during large folio reclamation. And it will be overridden by the architecture
> >> that implements a more efficient batch operation in the following patches.
> >>
> >> While we are at it, rename ptep_clear_flush_young_notify() to
> >> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
> >>
> >> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > LGTM,
> >
> > Reviewed-by: Barry Song <baohua@kernel.org>
>
> Thanks.
>
> >> ---
> >> include/linux/mmu_notifier.h | 9 +++++----
> >> include/linux/pgtable.h | 35 +++++++++++++++++++++++++++++++++++
> >> mm/rmap.c | 28 +++++++++++++++++++++++++---
> >> 3 files changed, 65 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >> index d1094c2d5fb6..07a2bbaf86e9 100644
> >> --- a/include/linux/mmu_notifier.h
> >> +++ b/include/linux/mmu_notifier.h
> >> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> >> range->owner = owner;
> >> }
> >>
> >> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> >> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
> >> ({ \
> >> int __young; \
> >> struct vm_area_struct *___vma = __vma; \
> >> unsigned long ___address = __address; \
> >> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> >> + unsigned int ___nr = __nr; \
> >> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
> >> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> >> ___address, \
> >> ___address + \
> >> - PAGE_SIZE); \
> >> + ___nr * PAGE_SIZE); \
> >> __young; \
> >> })
> >>
> >> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
> >>
> >> #define mmu_notifier_range_update_to_read_only(r) false
> >>
> >> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> >> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
> >> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
> >> #define ptep_clear_young_notify ptep_test_and_clear_young
> >> #define pmdp_clear_young_notify pmdp_test_and_clear_young
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 21b67d937555..a50df42a893f 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> >> }
> >> #endif
> >>
> >> +#ifndef clear_flush_young_ptes
> >> +/**
> >> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
> >> + * folio as old and flush the TLB.
> >> + * @vma: The virtual memory area the pages are mapped into.
> >> + * @addr: Address the first page is mapped at.
> >> + * @ptep: Page table pointer for the first entry.
> >> + * @nr: Number of entries to clear access bit.
> >> + *
> >> + * May be overridden by the architecture; otherwise, implemented as a simple
> >> + * loop over ptep_clear_flush_young().
> >> + *
> >> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> >> + * some PTEs might be write-protected.
> >> + *
> >> + * Context: The caller holds the page table lock. The PTEs map consecutive
> >> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> >> + */
> >> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> >> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> >> +{
> >> +	int young = 0;
> >> +
> >> +	for (;;) {
> >> +		young |= ptep_clear_flush_young(vma, addr, ptep);
> >> +		if (--nr == 0)
> >> +			break;
> >> +		ptep++;
> >> +		addr += PAGE_SIZE;
> >> +	}
> >> +
> >> +	return young;
> >> +}
> >> +#endif
> >
> > We might have an opportunity to batch the TLB synchronization,
> > using flush_tlb_range() instead of calling flush_tlb_page()
> > one by one. Not sure the benefit would be significant though,
> > especially if only one entry among nr has the young bit set.
>
> Yes. In addition, this will involve many architectures’ implementations
> and their differing TLB flush mechanisms, so it’s difficult to make a
> reasonable per-architecture measurement. If any architecture has a more
> efficient flush method, I’d prefer to implement an architecture‑specific
> clear_flush_young_ptes().

Right! Since TLBI is usually quite expensive, I wonder if a generic
implementation for architectures lacking clear_flush_young_ptes()
might benefit from something like the below (just a very rough idea):

int clear_flush_young_ptes(struct vm_area_struct *vma,
		unsigned long addr, pte_t *ptep, unsigned int nr)
{
	unsigned long curr_addr = addr;
	int young = 0;

	while (nr--) {
		young |= ptep_test_and_clear_young(vma, curr_addr, ptep);
		ptep++;
		curr_addr += PAGE_SIZE;
	}

	if (young)
		flush_tlb_range(vma, addr, curr_addr);

	return young;
}

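To make the trade-off concrete, here is a small userspace toy model (not kernel code) that counts TLB invalidations for the two approaches. It assumes the generic ptep_clear_flush_young() from mm/pgtable-generic.c, which flushes a single page only when that entry was young; architectures that override it may behave differently. The names toy_pte, toy_tlb_stats, toy_flush_per_page() and toy_flush_batched() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only: a "PTE" is just an access (young) bit, and TLB
 * invalidations are counted rather than performed. All names are
 * hypothetical.
 */
struct toy_pte {
	bool young;
};

struct toy_tlb_stats {
	unsigned int page_flushes;	/* stand-in for flush_tlb_page() calls */
	unsigned int range_flushes;	/* stand-in for flush_tlb_range() calls */
};

/* Models ptep_test_and_clear_young(): clear the bit, return its old value. */
static int toy_test_and_clear_young(struct toy_pte *pte)
{
	int young = pte->young;

	pte->young = false;
	return young;
}

/*
 * Models the generic loop in the patch: each ptep_clear_flush_young()
 * call flushes its own page, but only if that entry was young.
 */
static int toy_flush_per_page(struct toy_pte *ptep, unsigned int nr,
			      struct toy_tlb_stats *st)
{
	int young = 0;

	assert(nr > 0);	/* the generic loop assumes at least one entry */
	for (;;) {
		int y = toy_test_and_clear_young(ptep);

		if (y)
			st->page_flushes++;
		young |= y;
		if (--nr == 0)
			break;
		ptep++;
	}
	return young;
}

/*
 * Models the proposal: clear every access bit first, then issue a
 * single range flush only if at least one entry was young.
 */
static int toy_flush_batched(struct toy_pte *ptep, unsigned int nr,
			     struct toy_tlb_stats *st)
{
	int young = 0;

	while (nr--) {
		young |= toy_test_and_clear_young(ptep);
		ptep++;
	}
	if (young)
		st->range_flushes++;
	return young;
}
```

With 16 entries of which 3 are young, the per-page model issues 3 page flushes while the batched model issues 1 range flush; with exactly one young entry both issue a single flush, which matches the earlier caveat that the gain may be small in that case. The model counts calls only: the relative cost of one range invalidation versus several single-page invalidations is architecture dependent and would still need measuring.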
Thanks
Barry
Thread overview: 18+ messages
2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping " Baolin Wang
2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
2026-02-09 15:25 ` David Hildenbrand (Arm)
2026-03-06 21:07 ` Barry Song
2026-03-07 2:22 ` Baolin Wang
2026-03-07 8:02 ` Barry Song [this message]
2026-02-09 14:07 ` [PATCH v6 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
2026-02-09 14:07 ` [PATCH v6 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
2026-02-09 15:30 ` David Hildenbrand (Arm)
2026-02-10 0:39 ` Baolin Wang
2026-03-06 21:20 ` Barry Song
2026-03-07 2:14 ` Baolin Wang
2026-03-07 7:41 ` Barry Song
2026-02-09 14:07 ` [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
2026-02-09 15:31 ` David Hildenbrand (Arm)
2026-02-10 1:53 ` [PATCH v6 0/5] support batch checking of references and unmapping for " Andrew Morton
2026-02-10 2:01 ` Baolin Wang