From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Baolin Wang <baolin.wang@linux.alibaba.com>,
	akpm@linux-foundation.org, catalin.marinas@arm.com,
	will@kernel.org
Cc: lorenzo.stoakes@oracle.com, ryan.roberts@arm.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, riel@surriel.com,
	harry.yoo@oracle.com, jannh@google.com, willy@infradead.org,
	baohua@kernel.org, dev.jain@arm.com, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 1/5] mm: rmap: support batched checks of the references for large folios
Date: Mon, 9 Feb 2026 09:49:36 +0100	[thread overview]
Message-ID: <3d5cb9a4-6604-4302-a110-3d8ff91baa56@kernel.org> (raw)
In-Reply-To: <18b3eb9c730d16756e5d23c7be22efe2f6219911.1766631066.git.baolin.wang@linux.alibaba.com>

On 12/26/25 07:07, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
> 
> Moreover, on the Arm64 architecture, which supports contiguous PTEs, there is
> already an optimization to clear the young flags for PTEs within a contiguous
> range. However, this is not sufficient. We can extend this to perform batched
> operations for the entire large folio (which might exceed the contiguous range:
> CONT_PTE_SIZE).
> 
> Introduce a new API, clear_flush_young_ptes(), to facilitate batched checking
> of the young flags and flushing of TLB entries, thereby improving performance
> during large folio reclamation. Architectures that implement a more efficient
> batch operation will override it in the following patches.
> 
> While we are at it, rename ptep_clear_flush_young_notify() to
> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   include/linux/mmu_notifier.h |  9 +++++----
>   include/linux/pgtable.h      | 31 +++++++++++++++++++++++++++++++
>   mm/rmap.c                    | 31 ++++++++++++++++++++++++++++---
>   3 files changed, 64 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..07a2bbaf86e9 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>   	range->owner = owner;
>   }
>   
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr)	\
>   ({									\
>   	int __young;							\
>   	struct vm_area_struct *___vma = __vma;				\
>   	unsigned long ___address = __address;				\
> -	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
> +	unsigned int ___nr = __nr;					\
> +	__young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr);	\
>   	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
>   						  ___address,		\
>   						  ___address +		\
> -							PAGE_SIZE);	\
> +						  ___nr * PAGE_SIZE);	\
>   	__young;							\
>   })
>   

Man, that's ugly. Not your fault, but could this possibly be turned into an
inline function in a follow-up patch?
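
Something along these lines, I'd guess (completely untested, and the
!CONFIG_MMU_NOTIFIER variant would need the same treatment):

static inline int clear_flush_young_ptes_notify(struct vm_area_struct *vma,
		unsigned long address, pte_t *ptep, unsigned int nr)
{
	int young;

	young = clear_flush_young_ptes(vma, address, ptep, nr);
	young |= mmu_notifier_clear_flush_young(vma->vm_mm, address,
						address + nr * PAGE_SIZE);
	return young;
}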

[...]

>   
> +#ifndef clear_flush_young_ptes
> +/**
> + * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
> + *			    that map consecutive pages of the same folio.

With the clear_young_dirty_ptes() description in mind, this should probably
be "Mark PTEs that map consecutive pages of the same folio as old and flush
the TLB"?

> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear access bit.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_clear_flush_young().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> +					 unsigned long addr, pte_t *ptep,
> +					 unsigned int nr)

Please use two-tab alignment on the second+ lines, like all the similar
functions here.
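
I.e. something like this (same signature, just re-indented):

static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
		unsigned long addr, pte_t *ptep, unsigned int nr)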

> +{
> +	int i, young = 0;
> +
> +	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
> +		young |= ptep_clear_flush_young(vma, addr, ptep);
> +

Why don't we use a loop similar to the one in clear_young_dirty_ptes(),
clear_full_ptes(), etc.? It's not only more consistent but also optimizes out
the initial check of nr:


for (;;) {
	young |= ptep_clear_flush_young(vma, addr, ptep);
	if (--nr == 0)
		break;
	ptep++;
	addr += PAGE_SIZE;
}

> +	return young;
> +}
> +#endif
> +
>   /*
>    * On some architectures hardware does not set page access bit when accessing
>    * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e805ddc5a27b..985ab0b085ba 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -828,9 +828,11 @@ static bool folio_referenced_one(struct folio *folio,
>   	struct folio_referenced_arg *pra = arg;
>   	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>   	int ptes = 0, referenced = 0;
> +	unsigned int nr;
>   
>   	while (page_vma_mapped_walk(&pvmw)) {
>   		address = pvmw.address;
> +		nr = 1;
>   
>   		if (vma->vm_flags & VM_LOCKED) {
>   			ptes++;
> @@ -875,9 +877,24 @@ static bool folio_referenced_one(struct folio *folio,
>   			if (lru_gen_look_around(&pvmw))
>   				referenced++;
>   		} else if (pvmw.pte) {
> -			if (ptep_clear_flush_young_notify(vma, address,
> -						pvmw.pte))
> +			if (folio_test_large(folio)) {
> +				unsigned long end_addr =
> +					pmd_addr_end(address, vma->vm_end);
> +				unsigned int max_nr =
> +					(end_addr - address) >> PAGE_SHIFT;

Good news: you can fit each of these onto a single line, as we are allowed to
exceed 80 characters if it aids readability.
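
I.e. something like (untested, ignoring the surrounding indentation):

	unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
	unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;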

> +				pte_t pteval = ptep_get(pvmw.pte);
> +
> +				nr = folio_pte_batch(folio, pvmw.pte,
> +						     pteval, max_nr);
> +			}
> +
> +			ptes += nr;

I'm not sure whether we should mess with the "ptes" variable, which is so far
only used for VM_LOCKED VMAs. See below; maybe we can just avoid that.

> +			if (clear_flush_young_ptes_notify(vma, address,
> +						pvmw.pte, nr))

You could maybe fit that into a single line as well; whatever you prefer.
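
E.g. (untested):

			if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
				referenced++;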

>   				referenced++;
> +			/* Skip the batched PTEs */
> +			pvmw.pte += nr - 1;
> +			pvmw.address += (nr - 1) * PAGE_SIZE;
>   		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>   			if (pmdp_clear_flush_young_notify(vma, address,
>   						pvmw.pmd))
> @@ -887,7 +904,15 @@ static bool folio_referenced_one(struct folio *folio,
>   			WARN_ON_ONCE(1);
>   		}
>   
> -		pra->mapcount--;
> +		pra->mapcount -= nr;
> +		/*
> +		 * If we are sure that we batched the entire folio,
> +		 * we can just optimize and stop right here.
> +		 */
> +		if (ptes == pvmw.nr_pages) {
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}

Why not check for !pra->mapcount? Then you can also drop the comment, because
it's exactly the same thing we check after the loop to indicate what to return
to the caller.

And you won't have to mess with the "ptes" variable.
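
I.e. something like (untested):

		pra->mapcount -= nr;
		if (!pra->mapcount) {
			page_vma_mapped_walk_done(&pvmw);
			break;
		}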



Only minor stuff.

-- 
Cheers,

David



Thread overview: 52+ messages
2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping " Baolin Wang
2025-12-26  6:07 ` [PATCH v5 1/5] mm: rmap: support batched checks of the references " Baolin Wang
2026-01-07  6:01   ` Harry Yoo
2026-02-09  8:49   ` David Hildenbrand (Arm) [this message]
2026-02-09  9:14     ` Baolin Wang
2026-02-09  9:20       ` David Hildenbrand (Arm)
2026-02-09  9:25         ` Baolin Wang
2025-12-26  6:07 ` [PATCH v5 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
2026-02-09  8:50   ` David Hildenbrand (Arm)
2025-12-26  6:07 ` [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
2026-01-02 12:21   ` Ryan Roberts
2026-02-09  9:02   ` David Hildenbrand (Arm)
2025-12-26  6:07 ` [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
2026-01-28 11:47   ` Chris Mason
2026-01-29  1:42     ` Baolin Wang
2026-02-09  9:09       ` David Hildenbrand (Arm)
2026-02-09  9:36         ` Baolin Wang
2026-02-09  9:55           ` David Hildenbrand (Arm)
2026-02-09 10:13             ` Baolin Wang
2026-02-16  0:24               ` Alistair Popple
2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
2026-01-06 13:22   ` Wei Yang
2026-01-06 21:29     ` Barry Song
2026-01-07  1:46       ` Wei Yang
2026-01-07  2:21         ` Barry Song
2026-01-07  2:29           ` Baolin Wang
2026-01-07  3:31             ` Wei Yang
2026-01-16  9:53         ` Dev Jain
2026-01-16 11:14           ` Lorenzo Stoakes
2026-01-16 14:28           ` Barry Song
2026-01-16 15:23             ` Barry Song
2026-01-16 15:49             ` Baolin Wang
2026-01-18  5:46             ` Dev Jain
2026-01-19  5:50               ` Baolin Wang
2026-01-19  6:36                 ` Dev Jain
2026-01-19  7:22                   ` Baolin Wang
2026-01-16 15:14           ` Barry Song
2026-01-18  5:48             ` Dev Jain
2026-01-07  6:54   ` Harry Yoo
2026-01-16  8:42   ` Lorenzo Stoakes
2026-01-16 16:26   ` [PATCH] mm: rmap: skip batched unmapping for UFFD vmas Baolin Wang
2026-02-09  9:54     ` David Hildenbrand (Arm)
2026-02-09 10:49       ` Barry Song
2026-02-09 10:58         ` David Hildenbrand (Arm)
2026-02-10 12:01         ` Dev Jain
2026-02-09  9:38   ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios David Hildenbrand (Arm)
2026-02-09  9:43     ` Baolin Wang
2026-02-13  5:19       ` Barry Song
2026-02-18 12:26         ` Dev Jain
2026-01-16  8:41 ` [PATCH v5 0/5] support batch checking of references and unmapping for " Lorenzo Stoakes
2026-01-16 10:53   ` David Hildenbrand (Red Hat)
2026-01-16 10:52 ` David Hildenbrand (Red Hat)
