From: Pedro Falcato <pfalcato@suse.de>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>, Luke Yang <luyang@redhat.com>,
	 surenb@google.com, jhladky@redhat.com,
	akpm@linux-foundation.org,  Liam.Howlett@oracle.com,
	willy@infradead.org, vbabka@suse.cz, linux-mm@kvack.org,
	 linux-kernel@vger.kernel.org
Subject: Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
Date: Thu, 19 Feb 2026 12:15:09 +0000
Message-ID: <r2b2cjuqicmrw3zdwruacpelulhjhfdawrtbgzph5vsf6h5omj@dhrga7p62hju>
In-Reply-To: <624496ee-4709-497f-9ac1-c63bcf4724d6@kernel.org>

On Wed, Feb 18, 2026 at 01:24:28PM +0100, David Hildenbrand (Arm) wrote:
> On 2/18/26 12:58, Pedro Falcato wrote:
> > On Wed, Feb 18, 2026 at 11:46:29AM +0100, David Hildenbrand (Arm) wrote:
> > > On 2/18/26 11:38, Dev Jain wrote:
> > > > 
> > > > 
> > > > There are two things at play here:
> > > > 
> > > > 1. All arches are expected to benefit from pte batching on large folios, because
> > > > similar operations are done together in one shot. For code paths other than mprotect
> > > > and mremap, that benefit is far clearer due to:
> > > > 
> > > > a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
> > > >      Instead of bumping the refcount by 1, nr times, we bump it by nr in one shot.
> > > > 
> > > > b) vm_normal_folio() was already being invoked. So, all in all, the only new overhead
> > > >      we introduce is folio_pte_batch(_flags). In fact, since we already have the
> > > >      folio, I recall that we even special-case the large folio path separately from
> > > >      the small folio path. Thus 4K folio processing will have no overhead.
> > > > 
> > > > 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
> > > > across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
> > > > it becomes critical to batch on arm64.
> > > > 
> > > > 
> > > > 
> > > > Nice.
> > > > 
> > > > 
> > > > I dunno, need other opinions.
> > > 
> > > Let me repeat my question: what, besides the micro-benchmark in some cases
> > > with all small folios, are we trying to optimize here? No hand waving
> > > (Android does this or that) please.
> > 
> > I don't understand what you're looking for. An mprotect-based workload? Those
> > obviously don't really exist, apart from something like a JIT engine cranking
> > out a lot of mprotect() calls in an aggressive fashion. Or perhaps the kind of
> > mprotect usage that our DB friends like to do sometimes (discussed in
> > $OTHER_CONTEXTS), though those are generally hugepages.
> > 
> 
> Anything besides a homemade micro-benchmark that highlights why we should
> care about this exact fast and repeated sequence of events.
> 
> I'm surprised that such a "large regression" does not show up in any other
> non-home-made benchmark that people/bots are running. That's really what I
> am questioning.

I don't know, perhaps there isn't a will-it-scale test for this. That's
alright. Even the standard will-it-scale and stress-ng tests people use
to detect regressions usually have glaring problems and are insanely
microbenchey.

> 
> Having that said, I'm all for optimizing it if there is a real problem
> there.
> 
> > I don't see how this can justify large performance regressions in a system
> > call, for something every-architecture-not-named-arm64 does not have.
> 
> Take a look at the reported performance improvements on AMD with large
> folios.

Sure, but pte-mapped 2M folios are almost a worst case (why not a PMD at that
point...)

> 
> The issue really is that small folios don't perform well, on any
> architecture. But to detect large vs. small folios we need the ... folio.
> 
> So once we optimize for small folios (== don't try to detect large folios)
> we'll degrade large folios.

I suspect it's not that huge of a deal. Worst case you can always provide a
software PTE_CONT bit that would e.g. be set when mapping a large folio. Or
perhaps "if this pte has a PFN, and the next pte has PFN + 1, then we're
probably in a large folio, thus do the proper batching stuff". I think that
could satisfy everyone. There are heuristics we can use, and perhaps
pte_batch_hint() does not need to be that simple and useless in the !arm64
case then. I'll try to look into a cromulent solution for everyone.
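Something along these lines, purely as an untested sketch
(maybe_part_of_large_folio() is a made-up name, and it assumes the caller's
[addr, end) stays within one page table, like change_pte_range()'s range does):

/*
 * Made-up helper, untested: guess whether @pte likely maps part of a large
 * folio by peeking at the next entry. If the next PTE is present and maps
 * PFN + 1, odds are both belong to the same large folio, so it's worth
 * paying for vm_normal_folio()/folio_pte_batch(). Adjacent small folios
 * with consecutive PFNs give a false positive, which only costs us the
 * folio lookup we would have done anyway. ptep + 1 is fine because the
 * caller's [addr, end) stays within one page table.
 */
static inline bool maybe_part_of_large_folio(pte_t *ptep, pte_t pte,
                                             unsigned long addr,
                                             unsigned long end)
{
        pte_t next;

        if (addr + PAGE_SIZE >= end)
                return false;

        next = ptep_get(ptep + 1);
        return pte_present(next) && pte_pfn(next) == pte_pfn(pte) + 1;
}

Whether that actually beats just calling vm_normal_folio() unconditionally is
something I'd have to measure, of course.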

(shower thought: do we always get wins when batching large folios, or do they
need to be of a significant order for the batching to pay off?)

But personally I would err on the side of small folios, like we did for mremap()
a few months back.

> 
> 
> For fork() and unmap() we were able to avoid most of the performance
> regressions for small folios by special-casing the implementation on two
> variants: nr_pages == 1 (incl. small folios) vs. nr_pages != 1 (large
> folios).
> 
> We cannot avoid the vm_normal_folio(). Maybe the function-call overhead
> could be avoided by providing an inlined variant -- if that is the real
> problem.
> 
> But likely it's also just access to the folio when we really don't need it
> in some cases.

/me shrieks at the thought of the extra cacheline accesses in the glorious
memdesc future :)

-- 
Pedro


