Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching

linux-mm.kvack.org archive mirror

From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Dev Jain <dev.jain@arm.com>
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com, ryan.roberts@arm.com, baohua@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
Date: Wed, 18 Jun 2025 18:50:38 +0100
Message-ID: <cd871535-f606-4f2b-8fb2-e3520a2b000f@lucifer.local>
In-Reply-To: <20250618155608.18580-1-dev.jain@arm.com>

This patch has a lot of duplication in it, especially compared with your other
series [0], but perhaps that's something we can tackle in a follow-up.

It'd be nice if we could find a way to de-duplicate some of the near-identical
code though.

But it's a 'maybe' on that because hey, the code in this file is hideous anyway
and needs a mega rework in any case...

[0]: https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/

On Wed, Jun 18, 2025 at 09:26:08PM +0530, Dev Jain wrote:
> Use PTE batching to optimize collapse_pte_mapped_thp().
>
> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
> Then, calling ptep_clear() for every pte will cause a TLB flush for every
> contpte block. Instead, clear_full_ptes() does a
> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
> starting and ending contpte block, if they partially overlap with the range
> khugepaged is looking at.
>
> For all arches, there should be a benefit due to batching atomic operations
> on mapcounts due to folio_remove_rmap_ptes().
>
> Note that we do not need to make a change to the check
> "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
> pages of the folio will be equal to the corresponding pages of our
> batch mapping consecutive pages.
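
To spell out the argument above, here is a minimal userspace sketch, not
kernel code. It assumes pages can be modelled as plain consecutive numbers and
that a batch maps consecutive pages, which is what the paragraph above relies
on. The names batch_matches_folio(), folio_first and batch_first are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model (assumption): a folio is a run of consecutive page numbers starting
 * at folio_first, and a batch maps nr consecutive page numbers starting at
 * batch_first.
 */
static bool batch_matches_folio(unsigned long folio_first, int i,
				unsigned long batch_first, int nr)
{
	int k;

	/* The single check the loop performs: folio page i vs. first batch page. */
	if (folio_first + i != batch_first)
		return false;

	/* Both sides are consecutive, so the remaining pages line up as well. */
	for (k = 1; k < nr; k++)
		assert(folio_first + i + k == batch_first + k);

	return true;
}
```

If the first comparison passes, the inner assertions can never fire, which is
why the existing check needs no change under these assumptions.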
>
> No issues were observed with mm-selftests.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>
> This is rebased on:
> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
> If there will be a v2 of either version I'll send them together.

Hmmm, I say again - slow down a bit :) There's no need to send out multiple
patches in a single day, and you'd maybe avoid rebase dependencies like this one.

It's really preferable to avoid possible conflicts like this or at least reduce
the chance by having review on one thing done first.

I mean, why not just put both of these in a series for the respin? Just a
thought ;) In fact this is probably an ideal use of a series, as you can deal
with both patches together if any conflicts arise.

>
>  mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
>  1 file changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 649ccb2670f8..7d37058eda5b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  			    bool install_pmd)
>  {
> +	int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;

NIT: I don't know why you're moving this. It's subjective, but I'd rather the
initialised and uninitialised variables be declared on separate lines (yes, I
know this codebase violates this with the pml, ptl below, but hey :P)

>  	struct mmu_notifier_range range;
>  	bool notified = false;
>  	unsigned long haddr = addr & HPAGE_PMD_MASK;
> +	unsigned long end = haddr + HPAGE_PMD_SIZE;
>  	struct vm_area_struct *vma = vma_lookup(mm, haddr);
>  	struct folio *folio;
>  	pte_t *start_pte, *pte;
>  	pmd_t *pmd, pgt_pmd;
>  	spinlock_t *pml = NULL, *ptl;
> -	int nr_ptes = 0, result = SCAN_FAIL;
>  	int i;
>
>  	mmap_assert_locked(mm);
> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>  		goto abort;
>
> +	i = 0, addr = haddr, pte = start_pte;

This is horrid; no, absolutely not. This is not how we do assignment in
ordinary C code.

I don't know why we need a do/while here at all. I think the for loop should
still work OK, no?
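
As a rough illustration of what I mean, here is a minimal userspace sketch,
not kernel code, of a plain for loop that advances by a variable batch size.
Everything in it is an assumption made for the example: NR_SLOTS stands in for
HPAGE_PMD_NR, an int array stands in for the page table, batch_len() is a
hypothetical stand-in for folio_pte_batch(), and zeroing the slots stands in
for clear_full_ptes().

```c
#include <assert.h>

#define NR_SLOTS 16	/* stand-in for HPAGE_PMD_NR */

/*
 * Hypothetical stand-in for folio_pte_batch(): count how many slots starting
 * at idx hold consecutive positive values, capped at max_nr.
 */
static int batch_len(const int *slots, int idx, int max_nr)
{
	int nr = 1;

	while (nr < max_nr && slots[idx + nr] == slots[idx] + nr)
		nr++;
	return nr;
}

/*
 * Clear every populated slot using a for loop whose stride is the batch
 * size. Returns the number of slots cleared; *nr_batches is set to the
 * number of batched clears performed.
 */
static int clear_slots(int *slots, int *nr_batches)
{
	int nr_cleared = 0;
	int nr_batch = 1;
	int i, j;

	*nr_batches = 0;
	for (i = 0; i < NR_SLOTS; i += nr_batch) {
		/* Reset first, so that 'continue' advances by exactly one. */
		nr_batch = 1;

		if (!slots[i])
			continue;

		nr_batch = batch_len(slots, i, NR_SLOTS - i);
		assert(nr_batch >= 1 && i + nr_batch <= NR_SLOTS);

		for (j = 0; j < nr_batch; j++)
			slots[i + j] = 0;

		nr_cleared += nr_batch;
		(*nr_batches)++;
	}
	return nr_cleared;
}
```

The stride lives in the for header and is reset at the top of the body, so
no comma-separated assignment before the loop and no comma expression in a
while condition is needed.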

>  	/* step 2: clear page table and adjust rmap */
> -	for (i = 0, addr = haddr, pte = start_pte;
> -	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> +	do {
> +		const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +		int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
> +		struct folio *this_folio;

Hate this name. We are not C#... ;)

Just call it folio no? The 'this_' is redundant.

>  		struct page *page;
>  		pte_t ptent = ptep_get(pte);
>
> +		nr_batch_ptes = 1;
> +
>  		if (pte_none(ptent))
>  			continue;
>  		/*
> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  			goto abort;
>  		}
>  		page = vm_normal_page(vma, addr, ptent);
> +		this_folio = page_folio(page);
> +		if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
> +			nr_batch_ptes = folio_pte_batch(this_folio, addr, pte, ptent,
> +					max_nr_batch_ptes, flags, NULL, NULL, NULL);
> +
>  		if (folio_page(folio, i) != page)
>  			goto abort;
>
> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		 * TLB flush can be left until pmdp_collapse_flush() does it.
>  		 * PTE dirty? Shmem page is already dirty; file is read-only.
>  		 */
> -		ptep_clear(mm, addr, pte);
> -		folio_remove_rmap_pte(folio, page, vma);
> -		nr_ptes++;
> -	}
> +		clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
> +		folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
> +		nr_mapped_ptes += nr_batch_ptes;
> +	} while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
> +		 pte += nr_batch_ptes, i < HPAGE_PMD_NR);
>
>  	if (!pml)
>  		spin_unlock(ptl);
>
>  	/* step 3: set proper refcount and mm_counters. */
> -	if (nr_ptes) {
> -		folio_ref_sub(folio, nr_ptes);
> -		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> +	if (nr_mapped_ptes) {
> +		folio_ref_sub(folio, nr_mapped_ptes);
> +		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>  	}
>
>  	/* step 4: remove empty page table */
> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  			: SCAN_SUCCEED;
>  	goto drop_folio;
>  abort:
> -	if (nr_ptes) {
> +	if (nr_mapped_ptes) {
>  		flush_tlb_mm(mm);
> -		folio_ref_sub(folio, nr_ptes);
> -		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> +		folio_ref_sub(folio, nr_mapped_ptes);
> +		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>  	}
>  unlock:
>  	if (start_pte)
> --
> 2.30.2
>

Logic looks generally sane though... :)

Thread overview: 9+ messages
2025-06-18 15:56 Dev Jain
2025-06-18 17:50 ` Lorenzo Stoakes [this message]
2025-06-19  3:48   ` Dev Jain
2025-06-19 12:55     ` Lorenzo Stoakes
2025-06-23 14:01       ` David Hildenbrand
2025-06-23  6:40 ` Baolin Wang
2025-06-23  7:16   ` Dev Jain
2025-06-23  7:21     ` Baolin Wang
2025-06-23  7:25       ` Dev Jain
