From: Dev Jain <dev.jain@arm.com>
Date: Wed, 18 Feb 2026 16:08:11 +0530
Subject: Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
To: Pedro Falcato
Cc: Luke Yang, david@kernel.org, surenb@google.com, jhladky@redhat.com, akpm@linux-foundation.org, Liam.Howlett@oracle.com, willy@infradead.org, vbabka@suse.cz, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <5dso4ctke4baz7hky62zyfdzyg27tcikdbg5ecnrqmnluvmxzo@sciiqgatpqqv>
References: <764792ea-6029-41d8-b079-5297ca62505a@kernel.org> <71fbee21-f1b4-4202-a790-5076850d8d00@arm.com> <8315cbde-389c-40c5-ac72-92074625489a@arm.com> <5dso4ctke4baz7hky62zyfdzyg27tcikdbg5ecnrqmnluvmxzo@sciiqgatpqqv>
On 18/02/26 3:36 pm, Pedro Falcato wrote:
> On Wed, Feb 18, 2026 at 10:31:19AM +0530, Dev Jain wrote:
>> On 17/02/26 11:38 pm, Pedro Falcato wrote:
>>> On Tue, Feb 17, 2026 at 12:43:38PM -0500, Luke Yang wrote:
>>>> On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
>>>>> On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
>>>>>> On 2/13/26 18:16, Suren Baghdasaryan wrote:
>>>>>>> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato wrote:
>>>>>>>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> Micro-benchmark results are nice. But what is the real-world impact? IOW, why should we care?
>>>>>>>> Well, mprotect is widely used in thread spawning, code JITting, and even process startup. And we don't want to pay for a feature we can't even use (on x86).
>>>>>>> I agree. When I straced Android's zygote a while ago, mprotect() came up #30 in the list of most frequently used syscalls and was one of the most used mm-related syscalls, due to its use during process creation. However, I don't know how often it's used on VMAs of size >=400KiB.
>>>>>> See my point? :) If this is apparently so widespread, then finding a real reproducer is likely not a problem. Otherwise it's just speculation.
>>>>>>
>>>>>> It would also be interesting to know whether the reproducer ran with any sort of mTHP enabled or not.
>>>>> Yes. Luke, can you experiment with the following microbenchmark:
>>>>>
>>>>> https://pastebin.com/3hNtYirT
>>>>>
>>>>> and see if there is an optimization for pte-mapped 2M folios, before and after the commit?
>>>>>
>>>>> (set transparent_hugepage/enabled=always, hugepages-2048kB/enabled=always)
>>> Since you're testing stuff, could you please test the changes in:
>>> https://github.com/heatd/linux/tree/mprotect-opt ?
>>>
>>> Not posting them yet since merge window, etc. Plus I think there's some further optimization work we can pull off.
>>>
>>> With the benchmark in https://gist.github.com/heatd/25eb2edb601719d22bfb514bcf06a132 (compiled with g++ -O2 file.cpp -lbenchmark; needs google/benchmark) I've measured about an 18% speedup between the original and the patched kernel.
>> Thanks for working on this. Some comments -
>>
>> 1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large folios on arm64, since the cont bit is set only starting at 64K. Not sure how important this is.
> I don't understand what you mean. Is arm64 doing the large folio optimization even when there's no special MMU support for it (the aforementioned 16K and 32K cases)? If so, perhaps it's time for an ARCH_SUPPORTS_PTE_BATCHING flag. Though if you could provide numbers for that case, it would be much appreciated.

There are two things at play here:

1.
All arches are expected to benefit from PTE batching on large folios, because similar operations get done together in one shot. For code paths other than mprotect and mremap, that benefit is far clearer, due to:

a) Batching across atomic operations, etc. For example, see copy_present_ptes -> folio_ref_add: instead of bumping the reference count by 1 nr times, we bump it by nr in one shot.

b) vm_normal_folio was already being invoked, so all in all the only new overhead we introduce is that of folio_pte_batch(_flags). In fact, since we already have the folio, I recall that we even special-case the large folio path out from the small folio path, so 4K folio processing has no overhead.

2. Due to the requirements of contpte, ptep_get() on arm64 needs to collect the access/dirty bits across the whole cont block. Thus, each ptep_get() does 16 PTE accesses. To avoid this, batching becomes critical on arm64.

>
>> 2. Did you measure if there is an optimization due to just the first commit ("prefetch the next pte")?
> Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on zen5 (which is a pretty beefy uarch), and the loop is so full of ~~crap~~ features that the prefetcher seems to be doing a poor job, at least per my results.

Nice.

>
>> I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte))) to optimize the call to vm_normal_folio()?
> Certainly possible, but I suspect it doesn't make too much sense. You want to avoid bringing in the cacheline if possible. In the pte's case, I know we're probably going to look at it and modify it, and if I'm wrong it's just one cacheline we misprefetched (though I had some parallel convos, and it might be that we need a branch there to avoid prefetching out of the PTE table). We would like to avoid bringing in the folio cacheline at all, even if we don't stall, through some fancy prefetching or sheer CPU magic.

I dunno, need other opinions.
The question here becomes: should we prefer performance on 4K folios or on large folios? As Luke reports in the other email, the benefit on pte-mapped THP was staggering. I believe that if the sysadmin is enabling CONFIG_TRANSPARENT_HUGEPAGE, they know that the kernel will contain code written around the fact that it will see large folios. So, is it reasonable to penalize the folio order-0 case in preference to folio order > 0? If yes, we can simply stop batching when !IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE).