linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dev Jain <dev.jain@arm.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
	David Hildenbrand <david@redhat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>, Barry Song <baohua@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
Date: Fri, 8 Aug 2025 14:14:08 +0530	[thread overview]
Message-ID: <20cc4429-58a2-43f9-87e5-959980d7ba5e@arm.com> (raw)
In-Reply-To: <8391c672-1123-4499-8d28-a731f2d88a9e@arm.com>


On 08/08/25 1:26 pm, Ryan Roberts wrote:
> On 08/08/2025 08:45, David Hildenbrand wrote:
>>> Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
>>> sure David named this function, but for me the name makes sense. The arch is
>>> saying "I know that the pte batch is at least N ptes. It's up to you if you use
>>> that information. I'll still work correctly if you ignore it".
>> The last one is the important bit I think.
>>
>>> For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
>>> would be a guess, not a hint.
>> I'm not a native speaker, so I'll let both of you figure that out. To me it
>> makes sense as well ... but well, I was involved when creating that function. :)
>>
>>>> I understand the con PTE bit is a 'hint' but as I recall you saying at
>>>> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
>>>> from, but that's kinda deceptive.
>>>>
>>>> Anyway the reason I was emphatic here is on the basis that I believe I had
>>>> this explained to met his way, which obviously I or whoever it was (don't
>>>> recall) must have misunderstood. Or perhaps I hallucinated it... :)
>>> FWIW, this is the documentation for the function:
>>>
>>> /**
>>>    * pte_batch_hint - Number of pages that can be added to batch without scanning.
>>>    * @ptep: Page table pointer for the entry.
>>>    * @pte: Page table entry.
>>>    *
>>>    * Some architectures know that a set of contiguous ptes all map the same
>>>    * contiguous memory with the same permissions. In this case, it can provide a
>>>    * hint to aid pte batching without the core code needing to scan every pte.
>>>    *
>>>    * An architecture implementation may ignore the PTE accessed state. Further,
>>>    * the dirty state must apply atomically to all the PTEs described by the hint.
>>>    *
>>>    * May be overridden by the architecture, else pte_batch_hint is always 1.
>>>    */
>> It's actually ... surprisingly good after reading it again after at least a year.
>>
>>>> I see that folio_pte_batch() can get _more_, is this on the basis of there
>>>> being adjacent, physically contiguous contPTE entries that can also be
>>>> batched up?
>> [...]
>>
>>>>>>>
>>>>>>> Not sure if that was discussed at some point before we went into the
>>>>>>> direction of using folios. But there really doesn't seem to be anything
>>>>>>> gained for other architectures here (as raised by Jann).
>>>>>> Yup... I wonder about the other instances of this... ruh roh.
>>>>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>>>>> already needed the folio. I haven't actually looked at how mprotect ended up,
>>>>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
>>>> mprotect didn't? I mean let's check.
>>> I think for mprotect, the folio was only previously needed for the numa case. I
>>> have a vague memory that either Dev of I proposed wrapping folio_pte_batch() to
>>> only get the folio and call it if the next PTE had an adjacent PFN (or something
>>> like that). But it was deemed to complex. I might be misremembering... could
>>> have been an internal conversation. I'll chat with Dev about it and revisit.
>>>
>> I am probably to blame here, because I think I rejected early to have arm64-only
>> optimization, assuming other arch could benefit here as well with batching. But
>> as it seems, batching in mremap() code really only serves the cont-pte managing
>> code, and the folio_pte_batch() is really entirely unnecessary.
>>
>> In case of mprotect(), I think really only (a) NUMA and (b) anon-folio write-
>> upgrade required the folio. So it's a bit more tricky than mremap() here
>> where ... the folio is entirely irrelevant.
>>
>> One could detect the "anon write-upgrade possible" case early as well, and only
>> lookup the folio in that case, otherwise use the straight pte hint.
>>
>> So I think there is some room for further improvement.
> ACK; Dev, perhaps you can take another look at this and work up a patch to more
> agressively avoid vm_normal_folio() for mprotect?

Yup I'll investigate this in a few weeks time perhaps - try to use pte_batch_hint(),
and when we have to unconditionally retrieve the folio, then use that instead.

I'll also look into whether even for arm64, there is any use in retrieving
the folio at all - the only benefit we get is to batch across contig blocks, but I don't
think there are any atomic operations (refcount/mapcount manipulation) which will be saved.
By batching across a single contig block we save on ptep_get calls and TLBIs, which was
our objective.

>
> Thanks,
> Ryan
>
>


  reply	other threads:[~2025-08-08  8:44 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-07 18:58 Lorenzo Stoakes
2025-08-07 19:10 ` David Hildenbrand
2025-08-07 19:20   ` Lorenzo Stoakes
2025-08-07 19:41     ` David Hildenbrand
2025-08-07 20:11       ` Lorenzo Stoakes
2025-08-07 21:01         ` Lorenzo Stoakes
2025-08-07 19:56     ` Ryan Roberts
2025-08-07 20:58       ` Lorenzo Stoakes
2025-08-08  5:18         ` Dev Jain
2025-08-08  7:19         ` Ryan Roberts
2025-08-08  7:45           ` David Hildenbrand
2025-08-08  7:56             ` Ryan Roberts
2025-08-08  8:44               ` Dev Jain [this message]
2025-08-08  9:50                 ` Lorenzo Stoakes
2025-08-08  9:45             ` Lorenzo Stoakes
2025-08-08  9:40           ` Lorenzo Stoakes
2025-08-07 19:14 ` Pedro Falcato
2025-08-07 19:22   ` Lorenzo Stoakes
2025-08-07 19:33     ` David Hildenbrand
2025-08-08  5:19 ` Dev Jain
2025-08-08  9:56 ` Vlastimil Babka
2025-08-11  2:40 ` Barry Song
2025-08-11  4:57   ` Lorenzo Stoakes
2025-08-11  6:52     ` Barry Song
2025-08-11 15:08       ` Lorenzo Stoakes
2025-08-11 15:19         ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20cc4429-58a2-43f9-87e5-959980d7ba5e@arm.com \
    --to=dev.jain@arm.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=david@redhat.com \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=pfalcato@suse.de \
    --cc=ryan.roberts@arm.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox