Date: Thu, 12 Feb 2026 19:36:50 +0000
From: Catalin Marinas
To: Ryan Roberts
Cc: Yang Shi, Tejun Heo, lsf-pc@lists.linux-foundation.org, Linux MM,
	"Christoph Lameter (Ampere)", dennis@kernel.org, urezki@gmail.com,
	Will Deacon, Yang Shi
Subject: Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64
	(and potentially other architectures)
References: <82420c8c-d7b0-4ebf-870f-a6061fa4428f@arm.com>
In-Reply-To: <82420c8c-d7b0-4ebf-870f-a6061fa4428f@arm.com>

On Thu, Feb 12, 2026 at 06:45:19PM +0000, Ryan Roberts wrote:
> On 12/02/2026 17:54, Catalin Marinas wrote:
> > On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
> >> On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo wrote:
> >>> On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
> >>> ...
> >>>> Overhead
> >>>> ========
> >>>> 1. Some extra virtual memory space. But it shouldn’t be too much. I
> >>>> saw 960K with Fedora default kernel config. Given terabytes virtual
> >>>> memory space on 64 bit machine, 960K is negligible.
> >>>> 2. Some extra physical memory for percpu kernel page table. 4K *
> >>>> (nr_cpus – 1) for PGD pages, plus the page tables used by percpu local
> >>>> mapping area. A couple of megabytes with Fedora default kernel config
> >>>> on AmpereOne with 160 cores.
> >>>> 3. Percpu allocation and free will be slower due to extra virtual
> >>>> memory allocation and page table manipulation. However, percpu is
> >>>> allocated by chunk. One chunk typically holds a lot percpu variables.
> >>>> So the slowdown should be negligible. The test result below also
> >>>> proved it.
> > [...]
> >>> One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
> >>> with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
> >>> uses that outside preempt disable block (which is a bit odd but allowed),
> >>> the end result would be surprising. Hmm... I wonder whether it'd be
> >>> worthwhile to keep this_cpu_ptr() returning the global address - ie. make it
> >>> access global offset from local mapping and then return the computed global
> >>> address. This should still be pretty cheap and gets rid of surprising and
> >>> potentially extremely subtle corner cases.
> >>
> >> Yes, this is going to be a problem. So we don't change how
> >> this_cpu_ptr() works and keep it returning the global address. Because
> >> I noticed this may cause confusion for list APIs too. For example,
> >> when initializing a list embedded into a percpu variable, the ->next
> >> and ->prev will be initialized to global addresses by using
> >> per_cpu_ptr(), but if the list is accessed via this_cpu_ptr(), list
> >> head will be dereferenced by using local address, then list_empty()
> >> will complain, which compare the list head pointer and ->next pointer.
> >> This will cause some problems.
> >>
> >> So we just use the local address for this_cpu_add/sub/inc/dec and so
> >> on, which just manipulate a scalar counter.
> >
> > I wonder how much overhead is caused by calling into the scheduler on
> > preempt_enable(). It would be good to get some numbers for something
> > like the patch below (also removing the preempt disabling for
> > this_cpu_read() as I don't think it matters - a thread cannot
> > distinguish whether it was preempted between TPIDR read and variable
> > read or immediately after the variable read; we can't do this for writes
> > as other threads may notice unexpected updates).
> >
> > Another wild hack could be to read the kernel instruction at
> > (current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
> > return false if it's a read from TPIDR_EL1/2, together with removing the
> > preempt disabling. Or some other lighter way of detecting this_cpu_*
> > constructs without full preemption disabling.
>
> Could a sort of kernel version of restartable sequences help? i.e. detect
> preemption instead of preventing it?

Yes, in principle that's what we'd need but it's too expensive to check,
especially as those accessors are inlined.

For the write variants with LL/SC, we can check the TPIDR_EL2 again
between the LDXR and STXR and bail out if it's different from the one
read outside the loop. An interrupt would clear the exclusive monitor
anyway and the STXR would fail. This won't work for the theoretical
this_cpu_read() case.

-- 
Catalin
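
For illustration only, the re-check idea for the write variants can be
modelled in portable userspace C. This is a sketch under stated
assumptions, not kernel code and not the patch discussed in the thread:
current_cpu stands in for the TPIDR_EL2 value, maybe_migrate() is a
made-up hook that simulates preemption plus migration, and a C11
compare-and-swap stands in for the LDXR/STXR pair. All names below are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS 2

/* Hypothetical stand-ins: one counter per simulated CPU, and the CPU the
 * thread is currently running on (the role TPIDR_ELx plays in the email). */
static atomic_long counters[NR_CPUS];
static int current_cpu;
static int migrations_left;	/* how many times the hook will migrate us */
static int retries;		/* how often the sequence had to bail out */

/* Simulated preemption point: may move the thread to the other CPU. */
static void maybe_migrate(void)
{
	if (migrations_left > 0) {
		migrations_left--;
		current_cpu = (current_cpu + 1) % NR_CPUS;
	}
}

/* Detect migration instead of preventing it: re-check the CPU between the
 * load and the store, and restart from the top if it changed. */
static void sim_this_cpu_add(long val)
{
	for (;;) {
		int cpu = current_cpu;	/* "TPIDR" read outside the loop */
		long old;

		assert(cpu >= 0 && cpu < NR_CPUS);
		old = atomic_load(&counters[cpu]);	/* stands in for LDXR */

		for (;;) {
			maybe_migrate();	/* preemption window */

			if (current_cpu != cpu)
				break;		/* CPU changed: bail out */

			/* Stands in for STXR; on failure 'old' is reloaded. */
			if (atomic_compare_exchange_strong(&counters[cpu],
							   &old, old + val))
				return;
		}
		retries++;	/* restart against the new CPU's counter */
	}
}
```

With current_cpu = 0 and migrations_left = 1, a call to
sim_this_cpu_add(5) bails out once and then updates the counter of the
CPU the thread ended up on. The model only covers a migration between
the load and the re-check. A migration after the re-check is what the
exclusive monitor being cleared on an interrupt handles on real
hardware, and the compare-and-swap here does not reproduce that. There
is no store to fail in a plain read, which is why the same trick does
not carry over to this_cpu_read().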