From: Dev Jain <dev.jain@arm.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Pedro Falcato <pfalcato@suse.de>, Barry Song <baohua@kernel.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
Date: Fri, 8 Aug 2025 10:48:25 +0530 [thread overview]
Message-ID: <9db37970-abfa-49ab-bab6-081d26e587ee@arm.com> (raw)
In-Reply-To: <ae01c6bc-019a-4263-8083-8ab073e72729@lucifer.local>
On 08/08/25 2:28 am, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 08:56:44PM +0100, Ryan Roberts wrote:
>> On 07/08/2025 20:20, Lorenzo Stoakes wrote:
>>> +cc Ryan for ContPTE stuff.
>> Appologies, I was aware of the other thread and on-going issues but haven't had
>> the bandwidth to follow too closely.
>>
>>> On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>> Thanks!
>>>
>>>> Wondering whether we could then just use the patch hint instead of going via
>>>> the folio.
>>>>
>>>> IOW,
>>>>
>>>> return pte_batch_hint(ptep, pte);
>>> Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
>>> have some conflicting flags potentially. The check is empirical:
>>>
>>> static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
>>> {
>>> if (!pte_valid_cont(pte))
>>> return 1;
>>>
>>> return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
>>> }
>>>
>>> So it's 'the most number of PTEs that _might_ coalesce'.
>> No that's not correct; It's "at least this number of ptes _do_ coalesce".
>> folio_pte_batch() may end up returning a larger batch, but never smaller.
> Yup David explained.
>
> I suggest you rename this from 'hint', because that's not what hint means
> :) unless I'm really misunderstanding what this word means (it's 10pm and I
> started work at 6am so it's currently rather possible).
>
> I understand the con PTE bit is a 'hint' but as I recall you saying at
> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
> from, but that's kinda deceptive.
>
> Anyway the reason I was emphatic here is on the basis that I believe I had
> this explained to met his way, which obviously I or whoever it was (don't
> recall) must have misunderstood. Or perhaps I hallucinated it... :)
>
> I see that folio_pte_batch() can get _more_, is this on the basis of there
> being adjacent, physically contiguous contPTE entries that can also be
> batched up?
>
>> This function is looking to see if ptep is inside a conpte mapping, and if it
>> is, it's returning the number of ptes to the end of the contpte mapping (which
>> is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
>> if the physical memory is appropriately aligned/sized and all belongs to a
>> single folio.
>>
>>> (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
>>>
>>> I suppose we could not even bother with checking if same folio and _just_ check
>>> if PTEs have consecutive PFNs, which is not very likely if different folio
>>> but... could that break something?
>> Yes something could break; the batch must *all* belong to the same folio.
>> Functions like set_ptes() require that in their documentation, and arm64 depends
>> upon it in order not to screw up the access/dirty bits.
> Turning this around - is a cont pte range guaranteed to belong to only one
> folio?
>
> If so then we can just limit the range to one batched block for the sake of
> mremap that perhaps doesn't necessarily hugely benefit from further
> batching anyway?
>
> Let's take the time to check performance on arm64 hardware.
>
> Are we able to check to see how things behave if we have small folios only
> in the tested range on arm64
>
>>> It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
>>> thing' for a contPTE batch (I may be missing something - please correct me if so
>>> Dev/Ryan).
>> It will all do the right thing functionally no matter how you call it. But if
>> you can set_ptes() (and friends) on full contpte mappings, things are more
>> efficient.
> Yup this is what I was... hinting at ;)
>
>>> So actually do we even really care that much about folio?
>> From arm64's perspective, we're happy enough with batches the size of
>> pte_batch_hint(). folio_pte_batch() is a bonus, but certainly not a deal-breaker
>> for this location.
> OK, so I think we should definitely refactor this.
>
> David pointed out off-list we are duplicating the a/d handing _anyway_ in
> get_and_clear_ptes(). So that bit is just wasted effort, so there's really
> no need to do much that.
>
>> For the record, I'm pretty sure I was the person pushing for protecting
>> vm_normal_folio() with pte_batch_hint() right at the start of this process :)
> I think you didn't give your hint clearly enough ;)
>
>> Thanks,
>> Ryan
>>
>>>>
>>>> Not sure if that was discussed at some point before we went into the
>>>> direction of using folios. But there really doesn't seem to be anything
>>>> gained for other architectures here (as raised by Jann).
>>> Yup... I wonder about the other instances of this... ruh roh.
>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>> already needed the folio. I haven't actually looked at how mprotect ended up,
>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
> mprotect didn't? I mean let's check.
Yeah it didn't, it took the folio only for prot_numa case. For that reason I had first come up
with that maybe_contiguous_pte_pfns() [1] thingy for mremap, not sure how useful that is - It will
increase ptep_get() calls potentially and also won't work for large folios split into small
folios, since they will have contiguous pfns but not useful for batching.
But, I think even that won't work - I think the regression we see here is because batching
isn't actually saving anything here as Jann mentions, so maybe_contiguous_pte_pfns will
still regress for large folios due to retrieving the folio.
So fundamentally the optimization to be made in this specific case is only for arm64 - save
on ptep_get calls and TLB flushes.
[1] https://lore.kernel.org/all/20250507060256.78278-3-dev.jain@arm.com/
>
> We definitely need to be careful about other arches.
>
>> Thanks,
>> Ryan
> Cheers, Lorenzo
next prev parent reply other threads:[~2025-08-08 5:19 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-07 18:58 Lorenzo Stoakes
2025-08-07 19:10 ` David Hildenbrand
2025-08-07 19:20 ` Lorenzo Stoakes
2025-08-07 19:41 ` David Hildenbrand
2025-08-07 20:11 ` Lorenzo Stoakes
2025-08-07 21:01 ` Lorenzo Stoakes
2025-08-07 19:56 ` Ryan Roberts
2025-08-07 20:58 ` Lorenzo Stoakes
2025-08-08 5:18 ` Dev Jain [this message]
2025-08-08 7:19 ` Ryan Roberts
2025-08-08 7:45 ` David Hildenbrand
2025-08-08 7:56 ` Ryan Roberts
2025-08-08 8:44 ` Dev Jain
2025-08-08 9:50 ` Lorenzo Stoakes
2025-08-08 9:45 ` Lorenzo Stoakes
2025-08-08 9:40 ` Lorenzo Stoakes
2025-08-07 19:14 ` Pedro Falcato
2025-08-07 19:22 ` Lorenzo Stoakes
2025-08-07 19:33 ` David Hildenbrand
2025-08-08 5:19 ` Dev Jain
2025-08-08 9:56 ` Vlastimil Babka
2025-08-11 2:40 ` Barry Song
2025-08-11 4:57 ` Lorenzo Stoakes
2025-08-11 6:52 ` Barry Song
2025-08-11 15:08 ` Lorenzo Stoakes
2025-08-11 15:19 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9db37970-abfa-49ab-bab6-081d26e587ee@arm.com \
--to=dev.jain@arm.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=david@redhat.com \
--cc=jannh@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=pfalcato@suse.de \
--cc=ryan.roberts@arm.com \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox