From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 19 Feb 2026 15:00:39 +0000
From: Pedro Falcato <pfalcato@suse.de>
To: "David Hildenbrand (Arm)"
Cc: Dev Jain, Luke Yang, surenb@google.com, jhladky@redhat.com,
	akpm@linux-foundation.org, Liam.Howlett@oracle.com, willy@infradead.org,
	vbabka@suse.cz, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions
	since PTE batching (cac1db8c3aad)
Message-ID: 
References: <8315cbde-389c-40c5-ac72-92074625489a@arm.com>
	<5dso4ctke4baz7hky62zyfdzyg27tcikdbg5ecnrqmnluvmxzo@sciiqgatpqqv>
	<340be2bc-cf9b-4e22-b557-dfde6efa9de8@kernel.org>
	<624496ee-4709-497f-9ac1-c63bcf4724d6@kernel.org>
	<9209d642-a495-4c13-9ec3-10ced1d2a04c@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <9209d642-a495-4c13-9ec3-10ced1d2a04c@kernel.org>
On Thu, Feb 19, 2026 at 02:02:42PM +0100, David Hildenbrand (Arm) wrote:
> On 2/19/26 13:15, Pedro Falcato wrote:
> > On Wed, Feb 18, 2026 at 01:24:28PM +0100, David Hildenbrand (Arm) wrote:
> > > On 2/18/26 12:58, Pedro Falcato wrote:
> > > >
> > > > I don't understand what you're looking for. An mprotect-based workload? Those
> > > > obviously don't really exist, apart from something like a JIT engine cranking
> > > > out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of that
> > > > usage of mprotect that our DB friends like to use sometimes (discussed in
> > > > $OTHER_CONTEXTS), though those are generally hugepages.
> > > >
> > >
> > > Anything besides a homemade micro-benchmark that highlights why we should
> > > care about this exact fast and repeated sequence of events.
> > >
> > > I'm surprised that such a "large regression" does not show up in any other
> > > non-home-made benchmark that people/bots are running. That's really what I
> > > am questioning.
> >
> > I don't know, perhaps there isn't a will-it-scale test for this. That's
> > alright. Even the standard will-it-scale and stress-ng tests people use
> > to detect regressions usually have glaring problems and are insanely
> > microbenchey.
>
> My theory is that most heavy (high frequency, where it would really hit
> performance) mprotect users (like JITs) perform mprotect on very small ranges
> (e.g., a single page), where all the other overhead (syscall, TLB flush)
> dominates.
>
> That's why I was wondering which use cases that behave similarly to the
> reproducer exist.
> > >
> > > Having that said, I'm all for optimizing it if there is a real problem
> > > there.
> >
> > > > I don't see how this can justify large performance regressions in a system
> > > > call, for something every-architecture-not-named-arm64 does not have.
> > >
> > > Take a look at the reported performance improvements on AMD with large
> > > folios.
> >
> > Sure, but pte-mapped 2M folios is almost a worst-case (why not a PMD at that
> > point...)
>
> Well, 1M and all the way down will similarly benefit. 2M is just always the
> extreme case.
>
> > > The issue really is that small folios don't perform well, on any
> > > architecture. But to detect large vs. small folios we need the ... folio.
> > >
> > > So once we optimize for small folios (== don't try to detect large folios)
> > > we'll degrade large folios.
> >
> > I suspect it's not that huge of a deal. Worst case you can always provide a
> > software PTE_CONT bit that would e.g. be set when mapping a large folio. Or
> > perhaps "if this pte has a PFN, and the next pte has PFN + 1, then we're
> > probably in a large folio, thus do the proper batching stuff". I think that
> > could satisfy everyone. There are heuristics we can use, and perhaps
> > pte_batch_hint() does not need to be that simple and useless in the !arm64
> > case then. I'll try to look into a cromulent solution for everyone.
>
> Software bits are generally -ENOSPC, but maybe we are lucky on some
> architectures.
>
> We'd run into similar issues like aarch64 when shattering contiguity etc., so
> there is quite some complexity to it that might not be worth it.
>
> > (shower thought: do we always get wins when batching large folios, or do
> > these need to be of a significant order to get wins?)
>
> For mprotect(), I don't know. For fork() and unmap() batching there was always
> a win even with order-2 folios.
> (never measured order-1, because they don't apply to anonymous memory)
>
> I assume for mprotect() it depends whether we really needed the folio before,
> or whether it's just not required, like for mremap().
>
> > But personally I would err on the side of small folios, like we did for
> > mremap() a few months back.
>
> The following (completely untested) might make most people happy by looking up
> the folio only if (a) required or (b) the architecture indicates that there is
> a large folio.
>
> I assume for some large folio use cases it might perform worse than before.
> But for the write-upgrade case with large anon folios the performance
> improvement should remain.
>
> Not sure if some regression would remain for which we'd have to special-case
> the implementation to take a separate path for nr_ptes == 1.
>
> Maybe you had something similar already:
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c0571445bef7..0b3856ad728e 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -211,6 +211,25 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>  		commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>  }
>  
> +static bool mprotect_wants_folio_for_pte(unsigned long cp_flags, pte_t *ptep,
> +		pte_t pte, unsigned long max_nr_ptes)
> +{
> +	/* NUMA hinting needs to decide whether working on the folio is ok. */
> +	if (cp_flags & MM_CP_PROT_NUMA)
> +		return true;
> +
> +	/* We want the folio for possible write-upgrade. */
> +	if (!pte_write(pte) && (cp_flags & MM_CP_TRY_CHANGE_WRITABLE))
> +		return true;
> +
> +	/* There is nothing to batch. */
> +	if (max_nr_ptes == 1)
> +		return false;
> +
> +	/* For guaranteed large folios it's usually a win. */
> +	return pte_batch_hint(ptep, pte) > 1;
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -241,16 +260,18 @@ static long change_pte_range(struct mmu_gather *tlb,
>  		const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>  		int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>  		struct folio *folio = NULL;
> -		struct page *page;
> +		struct page *page = NULL;
>  		pte_t ptent;
>  
>  		/* Already in the desired state. */
>  		if (prot_numa && pte_protnone(oldpte))
>  			continue;
>  
> -		page = vm_normal_page(vma, addr, oldpte);
> -		if (page)
> -			folio = page_folio(page);
> +		if (mprotect_wants_folio_for_pte(cp_flags, pte, oldpte, max_nr_ptes)) {
> +			page = vm_normal_page(vma, addr, oldpte);
> +			if (page)
> +				folio = page_folio(page);
> +		}
>  
>  		/*
>  		 * Avoid trapping faults against the zero or KSM

Yes, this is a better version than what I had; I'll take this hunk if you
don't mind :)

Note that it still doesn't handle large folios on !contpte architectures,
which is partly the issue. I suspect some sort of PTE lookahead might work
well in practice, aside from the issues where e.g. two order-0 folios that
are contiguous in memory are separately mapped. Though perhaps inlining
vm_normal_folio() might also be interesting and side-step most of the issue.
I'll play around with that.

-- 
Pedro
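P.S. To make the lookahead idea concrete, here's a userspace sketch;
guess_batch_len() and the raw PFN array are made-up stand-ins for walking
real PTEs, not kernel API:

```c
#include <stddef.h>

/*
 * Sketch of the PFN lookahead heuristic: if the next PTE maps PFN + 1,
 * assume we're probably still inside one large folio and keep batching.
 * guess_batch_len() and the plain PFN array are hypothetical stand-ins
 * for reading PFNs out of real page table entries.
 */
static size_t guess_batch_len(const unsigned long *pfns, size_t max_nr)
{
	size_t nr = 1;

	while (nr < max_nr && pfns[nr] == pfns[0] + nr)
		nr++;
	return nr;
}
```

The known false positive is exactly the one above: two order-0 folios that
happen to sit in physically contiguous pages are indistinguishable from a
single order-1 folio at this level.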