linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dev Jain <dev.jain@arm.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>, Barry Song <baohua@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
Date: Fri, 8 Aug 2025 10:48:25 +0530	[thread overview]
Message-ID: <9db37970-abfa-49ab-bab6-081d26e587ee@arm.com> (raw)
In-Reply-To: <ae01c6bc-019a-4263-8083-8ab073e72729@lucifer.local>


On 08/08/25 2:28 am, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 08:56:44PM +0100, Ryan Roberts wrote:
>> On 07/08/2025 20:20, Lorenzo Stoakes wrote:
>>> +cc Ryan for ContPTE stuff.
>> Appologies, I was aware of the other thread and on-going issues but haven't had
>> the bandwidth to follow too closely.
>>
>>> On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>> Thanks!
>>>
>>>> Wondering whether we could then just use the patch hint instead of going via
>>>> the folio.
>>>>
>>>> IOW,
>>>>
>>>> return pte_batch_hint(ptep, pte);
>>> Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
>>> have some conflicting flags potentially. The check is empirical:
>>>
>>> static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
>>> {
>>> 	if (!pte_valid_cont(pte))
>>> 		return 1;
>>>
>>> 	return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
>>> }
>>>
>>> So it's 'the most number of PTEs that _might_ coalesce'.
>> No that's not correct; It's "at least this number of ptes _do_ coalesce".
>> folio_pte_batch() may end up returning a larger batch, but never smaller.
> Yup David explained.
>
> I suggest you rename this from 'hint', because that's not what hint means
> :) unless I'm really misunderstanding what this word means (it's 10pm and I
> started work at 6am so it's currently rather possible).
>
> I understand the con PTE bit is a 'hint' but as I recall you saying at
> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
> from, but that's kinda deceptive.
>
> Anyway the reason I was emphatic here is on the basis that I believe I had
> this explained to met his way, which obviously I or whoever it was (don't
> recall) must have misunderstood. Or perhaps I hallucinated it... :)
>
> I see that folio_pte_batch() can get _more_, is this on the basis of there
> being adjacent, physically contiguous contPTE entries that can also be
> batched up?
>
>> This function is looking to see if ptep is inside a conpte mapping, and if it
>> is, it's returning the number of ptes to the end of the contpte mapping (which
>> is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
>> if the physical memory is appropriately aligned/sized and all belongs to a
>> single folio.
>>
>>> (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
>>>
>>> I suppose we could not even bother with checking if same folio and _just_ check
>>> if PTEs have consecutive PFNs, which is not very likely if different folio
>>> but... could that break something?
>> Yes something could break; the batch must *all* belong to the same folio.
>> Functions like set_ptes() require that in their documentation, and arm64 depends
>> upon it in order not to screw up the access/dirty bits.
> Turning this around - is a cont pte range guaranteed to belong to only one
> folio?
>
> If so then we can just limit the range to one batched block for the sake of
> mremap that perhaps doesn't necessarily hugely benefit from further
> batching anyway?
>
> Let's take the time to check performance on arm64 hardware.
>
> Are we able to check to see how things behave if we have small folios only
> in the tested range on arm64
>
>>> It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
>>> thing' for a contPTE batch (I may be missing something - please correct me if so
>>> Dev/Ryan).
>> It will all do the right thing functionally no matter how you call it. But if
>> you can set_ptes() (and friends) on full contpte mappings, things are more
>> efficient.
> Yup this is what I was... hinting at ;)
>
>>> So actually do we even really care that much about folio?
>>  From arm64's perspective, we're happy enough with batches the size of
>> pte_batch_hint(). folio_pte_batch() is a bonus, but certainly not a deal-breaker
>> for this location.
> OK, so I think we should definitely refactor this.
>
> David pointed out off-list we are duplicating the a/d handing _anyway_ in
> get_and_clear_ptes(). So that bit is just wasted effort, so there's really
> no need to do much that.
>
>> For the record, I'm pretty sure I was the person pushing for protecting
>> vm_normal_folio() with pte_batch_hint() right at the start of this process :)
> I think you didn't give your hint clearly enough ;)
>
>> Thanks,
>> Ryan
>>
>>>>
>>>> Not sure if that was discussed at some point before we went into the
>>>> direction of using folios. But there really doesn't seem to be anything
>>>> gained for other architectures here (as raised by Jann).
>>> Yup... I wonder about the other instances of this... ruh roh.
>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>> already needed the folio. I haven't actually looked at how mprotect ended up,
>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
> mprotect didn't? I mean let's check.

Yeah it didn't, it took the folio only for prot_numa case. For that reason I had first come up
with that maybe_contiguous_pte_pfns() [1] thingy for mremap, not sure how useful that is - It will
increase ptep_get() calls potentially and also won't work for large folios split into small
folios, since they will have contiguous pfns but not useful for batching.

But, I think even that won't work - I think the regression we see here is because batching
isn't actually saving anything here as Jann mentions, so maybe_contiguous_pte_pfns will
still regress for large folios due to retrieving the folio.

So fundamentally the optimization to be made in this specific case is only for arm64 - save
on ptep_get calls and TLB flushes.

[1] https://lore.kernel.org/all/20250507060256.78278-3-dev.jain@arm.com/

>
> We definitely need to be careful about other arches.
>
>> Thanks,
>> Ryan
> Cheers, Lorenzo


  reply	other threads:[~2025-08-08  5:19 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-07 18:58 Lorenzo Stoakes
2025-08-07 19:10 ` David Hildenbrand
2025-08-07 19:20   ` Lorenzo Stoakes
2025-08-07 19:41     ` David Hildenbrand
2025-08-07 20:11       ` Lorenzo Stoakes
2025-08-07 21:01         ` Lorenzo Stoakes
2025-08-07 19:56     ` Ryan Roberts
2025-08-07 20:58       ` Lorenzo Stoakes
2025-08-08  5:18         ` Dev Jain [this message]
2025-08-08  7:19         ` Ryan Roberts
2025-08-08  7:45           ` David Hildenbrand
2025-08-08  7:56             ` Ryan Roberts
2025-08-08  8:44               ` Dev Jain
2025-08-08  9:50                 ` Lorenzo Stoakes
2025-08-08  9:45             ` Lorenzo Stoakes
2025-08-08  9:40           ` Lorenzo Stoakes
2025-08-07 19:14 ` Pedro Falcato
2025-08-07 19:22   ` Lorenzo Stoakes
2025-08-07 19:33     ` David Hildenbrand
2025-08-08  5:19 ` Dev Jain
2025-08-08  9:56 ` Vlastimil Babka
2025-08-11  2:40 ` Barry Song
2025-08-11  4:57   ` Lorenzo Stoakes
2025-08-11  6:52     ` Barry Song
2025-08-11 15:08       ` Lorenzo Stoakes
2025-08-11 15:19         ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9db37970-abfa-49ab-bab6-081d26e587ee@arm.com \
    --to=dev.jain@arm.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=david@redhat.com \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=pfalcato@suse.de \
    --cc=ryan.roberts@arm.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox