From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Dev Jain <dev.jain@arm.com>, Pedro Falcato <pfalcato@suse.de>
Cc: Luke Yang <luyang@redhat.com>,
surenb@google.com, jhladky@redhat.com, akpm@linux-foundation.org,
Liam.Howlett@oracle.com, willy@infradead.org, vbabka@suse.cz,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
Date: Wed, 18 Feb 2026 11:46:29 +0100
Message-ID: <340be2bc-cf9b-4e22-b557-dfde6efa9de8@kernel.org>
In-Reply-To: <eaa6be47-f1fc-4b88-b267-5aa38e3ba2a9@arm.com>
On 2/18/26 11:38, Dev Jain wrote:
>
> On 18/02/26 3:36 pm, Pedro Falcato wrote:
>> On Wed, Feb 18, 2026 at 10:31:19AM +0530, Dev Jain wrote:
>>> Thanks for working on this. Some comments -
>>>
>>> 1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large
>>> folios on arm64, since the cont bit only applies from 64K upwards. Not sure how important this is.
>> I don't understand what you mean. Is ARM64 doing large folio optimization,
>> even when there's no special MMU support for it (the aforementioned 16K and
>> 32K cases)? If so, perhaps it's time for an ARCH_SUPPORTS_PTE_BATCHING flag.
>> Though if you could provide numbers in that case it would be much appreciated.
>
> There are two things at play here:
>
> 1. All arches are expected to benefit from pte batching on large folios, because
> of doing similar operations together in one shot. For code paths except mprotect
> and mremap, that benefit is far more clear due to:
>
> a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
> Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.
>
> b) vm_normal_folio was already being invoked. So, all in all the only new overhead
> we introduce is of folio_pte_batch(_flags). In fact, since we already have the
> folio, I recall that we even just special case the large folio case, out from
> the small folio case. Thus 4K folio processing will have no overhead.
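Point 1(a) above can be illustrated outside the kernel. The following is a minimal userspace sketch, assuming C11 atomics as a stand-in for the folio refcount; `struct toy_folio` and both helpers are hypothetical names, not the kernel's `folio_ref_add()` implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a folio's reference count; not the kernel struct. */
struct toy_folio {
	atomic_int refcount;
};

/* Unbatched: one atomic read-modify-write per PTE mapping the folio. */
static void toy_ref_add_per_pte(struct toy_folio *folio, int nr)
{
	assert(nr > 0);
	for (int i = 0; i < nr; i++)
		atomic_fetch_add(&folio->refcount, 1);
}

/* Batched: a single atomic read-modify-write covering all nr PTEs. */
static void toy_ref_add_batched(struct toy_folio *folio, int nr)
{
	assert(nr > 0);
	atomic_fetch_add(&folio->refcount, nr);
}
```

Both helpers leave the counter at the same value; the batched variant simply issues one atomic operation instead of nr.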
>
> 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
> across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
> it becomes critical to batch on arm64.
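Point 2 above can be pictured with a toy model. This is a sketch only, assuming a 16-entry contiguous block and illustrative flag bit positions; `toy_ptep_get()` and the `TOY_*` constants are hypothetical and not the arm64 implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_pte_t;

#define TOY_CONT_PTES	16		/* entries per contiguous block (from the text above) */
#define TOY_PTE_AF	(1ull << 10)	/* illustrative "accessed" bit */
#define TOY_PTE_CONT	(1ull << 52)	/* illustrative "contiguous" bit */
#define TOY_PTE_DIRTY	(1ull << 55)	/* illustrative "dirty" bit */

/*
 * Read one entry. If it belongs to a contiguous block, gather the
 * accessed/dirty bits from every entry in that block, because the
 * hardware may have set them on any of the 16 entries.
 * *extra_reads counts the additional table accesses this causes.
 */
static toy_pte_t toy_ptep_get(const toy_pte_t *table, size_t idx,
			      unsigned int *extra_reads)
{
	toy_pte_t pte = table[idx];

	assert(extra_reads != NULL);
	if (!(pte & TOY_PTE_CONT))
		return pte;

	size_t base = idx & ~(size_t)(TOY_CONT_PTES - 1);
	for (size_t i = 0; i < TOY_CONT_PTES; i++) {
		pte |= table[base + i] & (TOY_PTE_AF | TOY_PTE_DIRTY);
		(*extra_reads)++;
	}
	return pte;
}
```

In this model, calling `toy_ptep_get()` once per entry of a contiguous block costs 16 extra reads each time, which is the overhead that processing the block as one batch avoids.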
>
>
>>
>>> 2. Did you measure if there is an optimization due to just the first commit ("prefetch the next pte")?
>> Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on
>> zen5 (which is a pretty beefy uarch) and the loop is so full of ~~crap~~
>> features that the prefetcher seems to be doing a poor job, at least per my
>> results.
>
> Nice.
>
>>
>>> I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
>>> to optimize the call to vm_normal_folio()?
>> Certainly possible, but I suspect it doesn't make too much sense. You want to
>> avoid bringing in the cacheline if possible. In the pte's case, I know we're
>> probably going to look at it and modify it, and if I'm wrong it's just one
>> cacheline we misprefetched (though I had some parallel convos and it might
>> be that we need a branch there to avoid prefetching out of the PTE table).
>> We would like to avoid bringing in the folio cacheline at all, even if we
>> don't stall through some fancy prefetching or sheer CPU magic.
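The "prefetch the next pte" idea, including the branch that keeps the prefetch inside the table, might look roughly like the userspace sketch below. It assumes the GCC/Clang `__builtin_prefetch()` builtin and a 512-entry table; `toy_change_range()` and `TOY_PTRS_PER_TABLE` are hypothetical names, and this is not the patch under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_pte_t;

#define TOY_PTRS_PER_TABLE 512	/* assumption: entries per page table */

/*
 * Set set_bits in entries [start, end) of one table and return how many
 * entries actually changed. Before touching entry i, hint the CPU to
 * fetch entry i + 1 for writing, but only if it is still in the table.
 */
static size_t toy_change_range(toy_pte_t *table, size_t start, size_t end,
			       toy_pte_t set_bits)
{
	size_t changed = 0;

	assert(end <= TOY_PTRS_PER_TABLE);
	for (size_t i = start; i < end; i++) {
		/* The branch avoids prefetching past the end of the table. */
		if (i + 1 < TOY_PTRS_PER_TABLE)
			__builtin_prefetch(&table[i + 1], 1, 3);

		toy_pte_t old = table[i];
		table[i] = old | set_bits;
		changed += (table[i] != old);
	}
	return changed;
}
```

The prefetch is only a hint, so the function returns the same result with or without it; whether it helps is a measurement question, as the quoted 5% figure on zen5 suggests.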
>
> I dunno, need other opinions.

Let me repeat my question: what, besides the micro-benchmark in some
cases with all small folios, are we trying to optimize here? No hand
waving ("Android does this or that"), please.

--
Cheers,
David