From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 7FE58C3ABDD
	for <linux-mm@archiver.kernel.org>; Tue, 20 May 2025 09:21:52 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 22AC56B0088; Tue, 20 May 2025 05:21:52 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 1DBAE6B008C; Tue, 20 May 2025 05:21:52 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 0F18D6B0092; Tue, 20 May 2025 05:21:52 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id E17E96B0088
	for <linux-mm@kvack.org>; Tue, 20 May 2025 05:21:51 -0400 (EDT)
Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id B4684BC4EC
	for <linux-mm@kvack.org>; Tue, 20 May 2025 09:21:50 +0000 (UTC)
X-FDA: 83462743980.15.0135F59
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by imf06.hostedemail.com (Postfix) with ESMTP id F3437180008
	for <linux-mm@kvack.org>; Tue, 20 May 2025 09:21:48 +0000 (UTC)
Authentication-Results: imf06.hostedemail.com;
	dkim=none;
	spf=pass (imf06.hostedemail.com: domain of dev.jain@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=dev.jain@arm.com;
	dmarc=pass (policy=none) header.from=arm.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1747732909;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=ephqeQ2nRRekhOey0Ad2XAZ62RoWN2xfMJ7+RGzz3vM=;
	b=8FNxyCVweM5m3UTAzwyFdg0NPieB+nndmKgjk03TOjqhIX2NhYQ9tbFVqombnCYlgWvP82
	bIpnafS2Y+jf/TGkgRRe54EFlJHP/6cvbLFFxhXdqKuVD9IoowlVs0Pjf9tOOW1PeVmg9A
	8Gqccpu1qMCzVngY5i7a2cPU25Ba/2Y=
ARC-Authentication-Results: i=1;
	imf06.hostedemail.com;
	dkim=none;
	spf=pass (imf06.hostedemail.com: domain of dev.jain@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=dev.jain@arm.com;
	dmarc=pass (policy=none) header.from=arm.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1747732909; a=rsa-sha256;
	cv=none;
	b=anRCovl/d3ePWHycVUnA4Y8+1udliQTcjbIFOJV4MDATGyqOHhLGG+7YH4lTXpH7vjlLAa
	PpcKfgOAyYt/AmrkTdDrJ4XQhresuow4jP6EW4WiQGW37TNOL+zYmQfxxNRfOh5s3IYgwy
	l8td/PKVj4fZrRHg7OIoi9y/+n4iw6g=
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id A4BC1150C;
	Tue, 20 May 2025 02:21:34 -0700 (PDT)
Received: from [10.164.18.48] (unknown [10.164.18.48])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 568403F5A1;
	Tue, 20 May 2025 02:21:42 -0700 (PDT)
Message-ID: <712f21ea-e2e6-4a56-8bdb-6ad071b6972e@arm.com>
Date: Tue, 20 May 2025 14:51:39 +0530
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: akpm@linux-foundation.org, Liam.Howlett@oracle.com, vbabka@suse.cz,
 jannh@google.com, pfalcato@suse.de, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, david@redhat.com, peterx@redhat.com,
 ryan.roberts@arm.com, mingo@kernel.org, libang.li@antgroup.com,
 maobibo@loongson.cn, zhengqi.arch@bytedance.com, baohua@kernel.org,
 anshuman.khandual@arm.com, willy@infradead.org, ioworker0@gmail.com,
 yang@os.amperecomputing.com, baolin.wang@linux.alibaba.com, ziy@nvidia.com,
 hughd@google.com
References: <20250507060256.78278-1-dev.jain@arm.com>
 <20250507060256.78278-3-dev.jain@arm.com>
 <fdb76016-396a-4ee4-9c9d-beb18c86cfdb@lucifer.local>
 <c038b7cf-3b72-403f-b988-bf3009287502@arm.com>
 <55cc1cef-1ece-48c3-af17-22ece8df5ed3@lucifer.local>
Content-Language: en-US
From: Dev Jain <dev.jain@arm.com>
In-Reply-To: <55cc1cef-1ece-48c3-af17-22ece8df5ed3@lucifer.local>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Rspam-User: 
X-Rspamd-Queue-Id: F3437180008
X-Rspamd-Server: rspam09
X-Stat-Signature: 8ia9ih16bodg14n6bzo9rphkdrzcdqwy
X-HE-Tag: 1747732908-932124
X-HE-Meta: U2FsdGVkX1+lIKc8zMaQXufE/29fG7ct7FnMahK39Uz1sfteDhMC6j9zF/0ljNBsS5S1cnvhrEq8Itb5aNymgduh/1zKcck4P3ScE8NoXojdVAgGXeB7Veerq5XySfFtoAVqARZ1n2Eh1bAJs0gvGKC1QlVMRcRieyoA1iM2xadHX2ZqMvzYkhdfg5nARsyaRBaKyhWzouKBblZr1EU4V/AAZf/17iK1YUY/cPyt/8jlnKNIMBTAssZllWjaVHwBybtmkOHJ3G3a/Q4I9hRvde/jhSqYLGSREIe/fX+qR90ME0bbQbtZc8wQk6SYckDvANWmXDT6AmBQ6Sqj3IZj3+cyFg4Dc7+e0ttbYOSqFgrkR+l/ywX3rsgur/kZQwUDDU8mMrErV9sGoYp+heqAhFQM6N2ll8pP1jn/FDgxPfoJh+CY2R2XdVm4Gx5LgcU314x9u/bhySp1eIhMr9uE7vZPkzUuXlAD4embfomH8mJhknG9Ysv4eRWK+PzbtFsaiZoufMpN1lVMZefbnLA04sGWEHeKyS5Hz+TkUHaFmKVWIeXfi4yBunR0Mh98N0UOcKMwqLdVsB+L/8TEbrGq0VD/UVHgXOr+EZxJiU6vnkjPDUCDAhJ/F+U0WSzurx+2ulPS3t9bgxlN6iI764L+aXwo+lBJjda1GtHS8CboWQaq0/71Aqa2KQP7XmXxecxvkAjrJZllwzGufpUwn13NU3pYNGq6uWI1LEAMd6/vTKWNXASjVuMz1y1flWZqWeE/Bns8RnkPlVoWm9Y2xVGzBGtGMlZoV3dIZhzBoFxcL5P+JjBcO6sKDAVswS3oYO6JKnZT7HjhraTh7B8SzP011xgX4DhpPJzCjT1DTi1Ekg9n189gXibfW8L/sYlA/zISzNDX0Yj5pcgI7BXbyY5gSLZwMppUwFrlE2UHkNL1po+yhrO798FC2ybHsO/cOos5nikeZTlSu6E5CgC9fZa
 dj+7jtqv
 j/Dj8deOFdmu8UNa2sAWHdGvzEiioGNEBNyeFcjBKM96WNdkfRdp7GccUwk1f2cdj2i0+5ylRwsFQHWgm3u1RzcrXfqkA3P7KW5UU9YCBAyXf4ltgjpFYWcXNyt1RWCboJyhNjkH1BZi2JGErck4ZQYDv/ItOhdFsnTOUy0njHsWL8/5zImW//cwvWbadWTUatH8D5hbzOQrffyXFJTCPjdYVLqZd65KHrrePbremY8L3NwZDGazSCsnhCBgYKke2/fio
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>


On 19/05/25 2:34 pm, Lorenzo Stoakes wrote:
> On Sun, May 18, 2025 at 01:47:35PM +0530, Dev Jain wrote:
>>
>>
>> On 08/05/25 3:34 pm, Lorenzo Stoakes wrote:
>>> Before getting into the review, just to say thanks for refactoring as per
>>> my (and of course other's) comments, much appreciated and big improvement!
>>> :)
>>>
>>> We're getting there...
>>>
>>> On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:
>>>> To use PTE batching, we want to determine whether the folio mapped by
>>>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>>>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>>>> require the folio. For arm64, pte_batch_hint() does the job. To generalize
>>>> this hint, add a helper which will determine whether two consecutive PTEs
>>>> point to consecutive PFNs, in which case there is a high probability that
>>>> the underlying folio is large.
>>>> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>>>> are painted with the contig bit, then ptep_get() will iterate through all 16
>>>> entries to collect a/d bits. Hence this optimization will result in a 16x
>>>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>>>> will eventually call contpte_try_unfold() on every contig block, thus
>>>> flushing the TLB for the complete large folio range. Instead, use
>>>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>>>> do them on the starting and ending contig block.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>>>>    mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>>>>    2 files changed, 59 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index b50447ef1c92..38dab1f562ed 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>>>    }
>>>>    #endif
>>>>
>>>> +/**
>>>> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
>>>> + * to a large folio.
>>>> + * @ptep: Pointer to the page table entry.
>>>> + * @pte: The page table entry.
>>>> + *
>>>> + * This helper is invoked when the caller wants to batch over a set of ptes
>>>> + * mapping a large folio, but the concerned code path does not already have
>>>> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
>>>> + * the underlying folio was small; i.e keep the small folio case as fast as
>>>> + * possible.
>>>> + *
>>>> + * The caller must ensure that ptep + 1 exists.
>>>> + */
>>>> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	pte_t *next_ptep, next_pte;
>>>> +
>>>> +	if (pte_batch_hint(ptep, pte) != 1)
>>>> +		return true;
>>>> +
>>>> +	next_ptep = ptep + 1;
>>>> +	next_pte = ptep_get(next_ptep);
>>>> +	if (!pte_present(next_pte))
>>>> +		return false;
>>>> +
>>>> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
>>>
>>> Let's not do unlikely()'s unless we have data for them... it shouldn't mean
>>> 'what the programmer believes' :)
>>>
>>>> +}
>>>
>>> Yeah I'm with Andrew and Anshuman, I mean this is kind of a nasty interface
>>> (I mean _perhaps_ unavoidably) and we've done the relevant check in
>>> mremap_folio_pte_batch(), so let's just move it there with comments, as this
>>>
>>>> +
>>>>    #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>>>    static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>    					    unsigned long address,
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 0163e02e5aa8..9c88a276bec4 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>>>    	return pte;
>>>>    }
>>>>
>>>> +/* mremap a batch of PTEs mapping the same large folio */
>>>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>>> +		pte_t *ptep, pte_t pte, int max_nr)
>>>> +{
>>>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +	struct folio *folio;
>>>> +	int nr = 1;
>>>> +
>>>> +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
>>>> +		folio = vm_normal_folio(vma, addr, pte);
>>>> +		if (folio && folio_test_large(folio))
>>>> +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
>>>> +					     flags, NULL, NULL, NULL);
>>>> +	}
>>>
>>> This needs some refactoring, avoid nesting at all costs :)
>>>
>>> We'll want to move the maybe_contiguous_pte_pfns() function over here, so
>>> that'll change things, but in general let's use a guard clause.
>>>
>>> So an if block like:
>>>
>>> if (foo) {
>>> 	... bunch of logic ...
>>> }
>>>
>>> Is better replaced with a guard clause so you have:
>>>
>>> if (!foo)
>>> 	return ...;
>>>
>>> ... bunch of logic ...
>>>
>>> Here we could really expand things out to make things SUPER clear like:
>>>
>>> static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>> 		pte_t *ptep, pte_t pte, int max_nr)
>>> {
>>> 	const fpb_t flags;
>>> 	struct folio *folio;
>>>
>>> 	if (max_nr == 1)
>>> 		return 1;
>>> 	if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
>>> 		return 1;
>>>
>>> 	folio = vm_normal_folio(vma, addr, pte);
>>> 	if (!folio)
>>> 		return 1;
>>> 	if (!folio_test_large(folio))
>>> 		return 1;
>>>
>>> 	flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> 	return folio_pte_batch(folio, addr, ptep, pte, max_nr,
>>> 		flags, NULL, NULL, NULL);
>>> }
>>>
>>> I mean you could argue assign nr would be neater here, but you get the point!
>>>
>>> David mentioned a point about this code over in v1 discussion (see
>>> [0]). Trying to bring converastion here to avoid it being split across
>>> old/new series. There he said:
>>>
>>> David H:
>>>> (2) Do we really need "must be part of the same folio", or could be just batch over present
>>>> ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
>>>> might be better.
>>>
>>> Hm, if we didn't do the batch test, can we batch a split large folio here ok?
>>> I'm guessing we can in which case this check is actually limiting...
>>>
>>> Are we _explicitly_ only considering the cont pte case and ignoring the
>>> split THP case?
>>>
>>> [0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@redhat.com/
>>>
>>> And in what circumstances will the hint be set, with a present subsequent
>>> PTE but !folio_test_large()?
>>>
>>> I guess the hint might not be taken? But then isn't the valid check just
>>> folio_test_large() and we don't need this batched check at all?
>>>
>>> Is it only to avoid the split THP case?
>>>
>>> We definitely need some clarity here, and a comment in the code explaining
>>> what's going on as this is subtle stuff.
>>
>> I am focussed only on batching large folios. Split THPs won't be batched;
>> you can use pte_batch() (from David's refactoring) and
>> figure the split THP batch out, but then get_and_clear_full_ptes()
>> will be gathering a/d bits and smearing them across the batch, which will be
>> incorrect. Even if we introduce a new version of get_and_clear_full_ptes()
>> which does not gather a/d bits, then if the pte_batch actually belongs to a
>> folio, then we will *not* be smearing a/d bits, which is again wrong. So in
>> any case we must know what the underlying folio looks like :) So my agenda
>> for v3 is,
> 
> Right, ack, there being a large folio per se doesn't mean A/D collection is
> appropriate in THP case.
> 
> I guess then this is really _only_ about the mTHP case, where essentially
> the other PTEs are to be disregarded going forwards?

Yes.

> 
>>
>> - Incorporate your refactoring comments
>> - Remove maybe_contiguous_pte_pfns and just use vm_normal_folio +
>> folio_test_large
>> - Fix indentation
>>
>> Sounds good?
> 
> Sure, but can we hold off until the mprotect() stuff is done first please?
> I mean obviously you're free to do things as you like, but this will help
> workload-wise on my side :>)
> 
> Thanks!

No problem, thanks for reviewing!

> 
>>
>>>
>>>> +	return nr;
>>>> +}
>>>> +
>>>>    static int move_ptes(struct pagetable_move_control *pmc,
>>>>    		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>>>    {
>>>> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>>>    	struct mm_struct *mm = vma->vm_mm;
>>>>    	pte_t *old_ptep, *new_ptep;
>>>> -	pte_t pte;
>>>> +	pte_t old_pte, pte;
>>>>    	pmd_t dummy_pmdval;
>>>>    	spinlock_t *old_ptl, *new_ptl;
>>>>    	bool force_flush = false;
>>>> @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	unsigned long old_end = old_addr + extent;
>>>>    	unsigned long len = old_end - old_addr;
>>>>    	int err = 0;
>>>> +	int max_nr;
>>>>
>>>>    	/*
>>>>    	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
>>>> @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	flush_tlb_batched_pending(vma->vm_mm);
>>>>    	arch_enter_lazy_mmu_mode();
>>>>
>>>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>>>> -				   new_ptep++, new_addr += PAGE_SIZE) {
>>>> -		if (pte_none(ptep_get(old_ptep)))
>>>> +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
>>>> +				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {
>>>
>>> Really nitty thing here but the indentation is all messed up here, I mean
>>> nothing is going to be nice but maybe indent by two tabs below 'for'.
>>>
>>> I'm not a fan of this declaration of nr, typically in a for loop a declaration
>>> here would be the counter, so this is just confusing.
>>>
>>> In the old implementation, declaring nr in the for loop would make sense,
>>> but in the newly refactored one you should just declare it at the top.
>>>
>>> Also as per Anshuman review, I think nr_ptes, max_nr_ptes would be better.
>>>
>>> I don't think 'nr' needs to be initialised either, since the conditional is
>>> 'old_addr < old_end' and you _should_ only perform the
>>>
>>>> +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
>>>> +		old_pte = ptep_get(old_ptep);
>>>> +		if (pte_none(old_pte))
>>>
>>> This seems broken.
>>>
>>> You're missing a nr assignment here, so you'll happen to offset by the
>>> number of pages of the last folio you encountered?
>>>
>>> Should be:
>>>
>>> 	if (pte_none(old_pte)) {
>>> 		nr_ptes = 1;
>>> 		continue;
>>> 	}
>>>
>>> Or, alternatively, you can reset nr_ptes to 1 at the start of each loop.
>>>
>>>
>>>>    			continue;
>>>>
>>>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>>
>>>>    		/*
>>>>    		 * If we are remapping a valid PTE, make sure
>>>>    		 * to flush TLB before we drop the PTL for the
>>>> @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    		 * the TLB entry for the old mapping has been
>>>>    		 * flushed.
>>>>    		 */
>>>> -		if (pte_present(pte))
>>>> +		if (pte_present(old_pte)) {
>>>> +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>>>> +						    old_pte, max_nr);
>>>>    			force_flush = true;
>>>> +		}
>>>
>>> Thanks this is much clearer compared to v1
>>>
>>>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>>>
>>> Nit but...
>>>
>>> Can we have a comment indicating what the last parameter refers to? I think
>>> David maybe doens't like this so obviously if he prefers not that fine, but
>>> I'm thinking something like:
>>>
>>> pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, /*full=*/false);
>>>
>>> I think we are good to just use 'false' here right? As it's only an int for
>>> historical purposes...
>>>
>>>>    		pte = move_pte(pte, old_addr, new_addr);
>>>>    		pte = move_soft_dirty_pte(pte);
>>>>
>>>> @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    				else if (is_swap_pte(pte))
>>>>    					pte = pte_swp_clear_uffd_wp(pte);
>>>>    			}
>>>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>>>> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>>>>    		}
>>>>    	}
>>>>
>>>> --
>>>> 2.30.2
>>>>
>>