From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Pedro Falcato <pfalcato@suse.de>
Cc: Dev Jain <dev.jain@arm.com>, Luke Yang <luyang@redhat.com>,
surenb@google.com, jhladky@redhat.com, akpm@linux-foundation.org,
Liam.Howlett@oracle.com, willy@infradead.org, vbabka@suse.cz,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
Date: Wed, 18 Feb 2026 13:24:28 +0100
Message-ID: <624496ee-4709-497f-9ac1-c63bcf4724d6@kernel.org>
In-Reply-To: <cdrrvtzy76f7wplcrls3pbfe37kzrvzsrlaed7glg2cq6j3yob@wjbjklvovpl2>
On 2/18/26 12:58, Pedro Falcato wrote:
> On Wed, Feb 18, 2026 at 11:46:29AM +0100, David Hildenbrand (Arm) wrote:
>> On 2/18/26 11:38, Dev Jain wrote:
>>>
>>>
>>> There are two things at play here:
>>>
>>> 1. All arches are expected to benefit from pte batching on large folios, because
>>> of doing similar operations together in one shot. For code paths except mprotect
>>> and mremap, that benefit is far more clear due to:
>>>
>>> a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
>>> Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.
>>>
>>> b) vm_normal_folio was already being invoked. So, all in all the only new overhead
>>> we introduce is of folio_pte_batch(_flags). In fact, since we already have the
>>> folio, I recall that we even just special case the large folio case, out from
>>> the small folio case. Thus 4K folio processing will have no overhead.
>>>
>>> 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
>>> across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
>>> it becomes critical to batch on arm64.
>>>
>>>
>>>
>>> Nice.
>>>
>>>
>>> I dunno, need other opinions.
>>
>> Let's repeat my question: what, besides the micro-benchmark in some cases
>> with all small folios, are we trying to optimize here? No hand waving
>> (Android does this or that) please.
>
> I don't understand what you're looking for. An mprotect-based workload? Those
> obviously don't really exist, apart from something like a JIT engine cranking
> out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of that
> usage of mprotect that our DB friends like to use sometimes (discussed in
> $OTHER_CONTEXTS), though those are generally hugepages.
>
Anything besides a homemade micro-benchmark that highlights why we
should care about this exact fast and repeated sequence of events.

I'm surprised that such a "large regression" does not show up in any
other non-homemade benchmark that people/bots are running. That's
really what I am questioning.

Having said that, I'm all for optimizing it if there is a real problem
there.

> I don't see how this can justify large performance regressions in a system
> call, for something every-architecture-not-named-arm64 does not have.

Take a look at the reported performance improvements on AMD with large
folios.

The issue really is that small folios don't perform well, on any
architecture. But to detect large vs. small folios we need the ... folio.

So once we optimize for small folios (== don't try to detect large
folios) we'll degrade large folios.

For fork() and unmap() we were able to avoid most of the performance
regressions for small folios by special-casing the implementation into two
variants: nr_pages == 1 (incl. small folios) vs. nr_pages != 1 (large
folios).
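
To illustrate the two-variant split, here is a rough userspace sketch, not
kernel code. Every model_* name is made up for illustration; only the
nr_pages == 1 vs. nr_pages != 1 distinction and the "bump the reference by
nr in one shot" idea come from this thread. It assumes the folio pointer is
already at hand, so it does not model the cost of vm_normal_folio() itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: only what this model needs. */
struct model_folio {
	size_t nr_pages;	/* 1 for a small folio */
	long refcount;
	long ref_updates;	/* number of separate refcount writes */
};

/* Hypothetical stand-in for a PTE: which folio page it maps. */
struct model_pte {
	struct model_folio *folio;
	size_t page_idx;
};

/* Count leading entries that map consecutive pages of the same folio. */
static size_t model_pte_batch(const struct model_pte *ptes, size_t max_nr)
{
	size_t nr = 1;

	assert(max_nr >= 1);
	while (nr < max_nr &&
	       ptes[nr].folio == ptes[0].folio &&
	       ptes[nr].page_idx == ptes[0].page_idx + nr)
		nr++;
	return nr;
}

/* One refcount write covers the whole batch. */
static void model_folio_ref_add(struct model_folio *folio, long nr)
{
	folio->refcount += nr;
	folio->ref_updates++;
}

/*
 * Walk a range of entries and take one reference per mapped page.
 * Small folios (nr_pages == 1) skip the batch scan entirely; large
 * folios pay for one scan and get one refcount write per batch.
 * Returns the number of batch scans that were performed.
 */
static size_t model_copy_range(const struct model_pte *ptes, size_t count)
{
	size_t i = 0, scans = 0;

	while (i < count) {
		struct model_folio *folio = ptes[i].folio;
		size_t nr = 1;

		if (folio->nr_pages != 1) {
			nr = model_pte_batch(&ptes[i], count - i);
			scans++;
		}
		model_folio_ref_add(folio, (long)nr);
		i += nr;
	}
	return scans;
}
```

With one small folio followed by a fully mapped 4-page folio, the walk does
a single batch scan and a single refcount write per folio. Note that the
branch on nr_pages still needs the folio, which is exactly the cost that
cannot be skipped.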

We cannot avoid the vm_normal_folio() call. Maybe the function-call overhead
could be avoided by providing an inlined variant -- if that is the real
problem.

But likely the cost is also just the access to the folio itself, which in
some cases we really don't need.

--
Cheers,
David