* [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
@ 2026-02-15 3:39 Jisheng Zhang
2026-02-16 10:59 ` Dev Jain
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Jisheng Zhang @ 2026-02-15 3:39 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Dennis Zhou, Tejun Heo, Christoph Lameter
Cc: linux-arm-kernel, linux-kernel, linux-mm

It turns out that the generic disable/enable-IRQ this_cpu_cmpxchg
implementation is faster than the LL/SC or LSE implementation. Remove
HAVE_CMPXCHG_LOCAL for better performance on arm64.

Tested on a quad-core 1.9GHz Cortex-A55 platform:
the average mod_node_page_state() cost decreases from 167ns to 103ns;
the spawn (30s duration) benchmark in UnixBench improves
from 147494 lps to 150561 lps, i.e. by 2.1%.

Tested on a quad-core 2.1GHz Cortex-A73 platform:
the average mod_node_page_state() cost decreases from 113ns to 85ns;
the spawn (30s duration) benchmark in UnixBench improves
from 209844 lps to 212581 lps, i.e. by 1.3%.

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
arch/arm64/Kconfig | 1 -
arch/arm64/include/asm/percpu.h | 24 ------------------------
2 files changed, 25 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 38dba5f7e4d2..5e7e2e65d5a5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -205,7 +205,6 @@ config ARM64
 	select HAVE_EBPF_JIT
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_CMPXCHG_DOUBLE
-	select HAVE_CMPXCHG_LOCAL
 	select HAVE_CONTEXT_TRACKING_USER
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_DMA_CONTIGUOUS
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index b57b2bb00967..70ffe566cb4b 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -232,30 +232,6 @@ PERCPU_RET_OP(add, add, ldadd)
 #define this_cpu_xchg_8(pcp, val)	\
	_pcp_protect_return(xchg_relaxed, pcp, val)
 
-#define this_cpu_cmpxchg_1(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
-#define this_cpu_cmpxchg_2(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
-#define this_cpu_cmpxchg_4(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
-#define this_cpu_cmpxchg_8(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
-
-#define this_cpu_cmpxchg64(pcp, o, n)	this_cpu_cmpxchg_8(pcp, o, n)
-
-#define this_cpu_cmpxchg128(pcp, o, n)					\
-({									\
-	typedef typeof(pcp) pcp_op_T__;					\
-	u128 old__, new__, ret__;					\
-	pcp_op_T__ *ptr__;						\
-	old__ = o;							\
-	new__ = n;							\
-	preempt_disable_notrace();					\
-	ptr__ = raw_cpu_ptr(&(pcp));					\
-	ret__ = cmpxchg128_local((void *)ptr__, old__, new__);		\
-	preempt_enable_notrace();					\
-	ret__;								\
-})
 
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);
--
2.51.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-15  3:39 [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL Jisheng Zhang
@ 2026-02-16 10:59 ` Dev Jain
  2026-02-16 11:00 ` Will Deacon
  2026-02-18 22:07 ` Shakeel Butt
  2 siblings, 0 replies; 14+ messages in thread
From: Dev Jain @ 2026-02-16 10:59 UTC (permalink / raw)
  To: Jisheng Zhang, Catalin Marinas, Will Deacon, Dennis Zhou, Tejun Heo,
	Christoph Lameter
  Cc: linux-arm-kernel, linux-kernel, linux-mm

On 15/02/26 9:09 am, Jisheng Zhang wrote:
> It turns out the generic disable/enable irq this_cpu_cmpxchg
> implementation is faster than LL/SC or lse implementation. Remove
> HAVE_CMPXCHG_LOCAL for better performance on arm64.
>
> Tested on Quad 1.9GHZ CA55 platform:
> average mod_node_page_state() cost decreases from 167ns to 103ns
> the spawn (30 duration) benchmark in unixbench is improved
> from 147494 lps to 150561 lps, improved by 2.1%
>
> Tested on Quad 2.1GHZ CA73 platform:
> average mod_node_page_state() cost decreases from 113ns to 85ns
> the spawn (30 duration) benchmark in unixbench is improved
> from 209844 lps to 212581 lps, improved by 1.3%
>
> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> ---

Thanks. This concurs with my investigation on [1]. The problem isn't
really LL/SC/LSE but preempt_disable()/enable() in this_cpu_* [1, 2].

I think you should only remove the selection of the config, but keep
the code? We may want to switch this on again if the real issue gets
solved.

[1] https://lore.kernel.org/all/5a6782f3-d758-4d9c-975b-5ae4b5d80d4e@arm.com/
[2] https://lore.kernel.org/all/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/

> arch/arm64/Kconfig | 1 -
> arch/arm64/include/asm/percpu.h | 24 ------------------------
> 2 files changed, 25 deletions(-)
[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread
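
A paraphrased sketch of the arm64 fast path referred to above: the
this_cpu_cmpxchg_* wrappers built on _pcp_protect_return() boil down to
the pattern below, modelled on the removed this_cpu_cmpxchg128() macro.
The _sketch name and the exact expansion are illustrative, not the
literal kernel source:

/* Illustrative expansion of this_cpu_cmpxchg_8(pcp, o, n) on arm64 */
#define this_cpu_cmpxchg_8_sketch(pcp, o, n)				\
({									\
	typeof(pcp) ret__;						\
	preempt_disable_notrace();	/* suspected fast-path cost #1 */ \
	ret__ = cmpxchg_relaxed(raw_cpu_ptr(&(pcp)), o, n); /* LL/SC or LSE CAS */ \
	preempt_enable_notrace();	/* cost #2: may branch towards the scheduler */ \
	ret__;								\
})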

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-15  3:39 [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL Jisheng Zhang
  2026-02-16 10:59 ` Dev Jain
@ 2026-02-16 11:00 ` Will Deacon
  2026-02-16 15:29 ` Dev Jain
  2026-02-18 22:07 ` Shakeel Butt
  2 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2026-02-16 11:00 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Catalin Marinas, Dennis Zhou, Tejun Heo, Christoph Lameter,
	linux-arm-kernel, linux-kernel, linux-mm, maz

On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> It turns out the generic disable/enable irq this_cpu_cmpxchg
> implementation is faster than LL/SC or lse implementation. Remove
> HAVE_CMPXCHG_LOCAL for better performance on arm64.
>
> Tested on Quad 1.9GHZ CA55 platform:
> average mod_node_page_state() cost decreases from 167ns to 103ns
> the spawn (30 duration) benchmark in unixbench is improved
> from 147494 lps to 150561 lps, improved by 2.1%
>
> Tested on Quad 2.1GHZ CA73 platform:
> average mod_node_page_state() cost decreases from 113ns to 85ns
> the spawn (30 duration) benchmark in unixbench is improved
> from 209844 lps to 212581 lps, improved by 1.3%
>
> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> ---
> arch/arm64/Kconfig | 1 -
> arch/arm64/include/asm/percpu.h | 24 ------------------------
> 2 files changed, 25 deletions(-)

That is _entirely_ dependent on the system, so this isn't the right
approach. I also don't think it's something we particularly want to
micro-optimise to accommodate systems that suck at atomics.

Will

[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-16 11:00 ` Will Deacon
@ 2026-02-16 15:29 ` Dev Jain
  2026-02-17 13:53 ` Catalin Marinas
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Dev Jain @ 2026-02-16 15:29 UTC (permalink / raw)
  To: Will Deacon, Jisheng Zhang
  Cc: Catalin Marinas, Dennis Zhou, Tejun Heo, Christoph Lameter,
	linux-arm-kernel, linux-kernel, linux-mm, maz

On 16/02/26 4:30 pm, Will Deacon wrote:
> On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
>> It turns out the generic disable/enable irq this_cpu_cmpxchg
>> implementation is faster than LL/SC or lse implementation. Remove
>> HAVE_CMPXCHG_LOCAL for better performance on arm64.
>>
>> Tested on Quad 1.9GHZ CA55 platform:
>> average mod_node_page_state() cost decreases from 167ns to 103ns
>> the spawn (30 duration) benchmark in unixbench is improved
>> from 147494 lps to 150561 lps, improved by 2.1%
>>
>> Tested on Quad 2.1GHZ CA73 platform:
>> average mod_node_page_state() cost decreases from 113ns to 85ns
>> the spawn (30 duration) benchmark in unixbench is improved
>> from 209844 lps to 212581 lps, improved by 1.3%
>>
>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
>> ---
>> arch/arm64/Kconfig | 1 -
>> arch/arm64/include/asm/percpu.h | 24 ------------------------
>> 2 files changed, 25 deletions(-)
> That is _entirely_ dependent on the system, so this isn't the right
> approach. I also don't think it's something we particularly want to
> micro-optimise to accomodate systems that suck at atomics.

Hi Will,

As I mention in the other email, the suspect is not the atomics, but
preempt_disable(). On Apple M3, the regression reported in [1] resolves
by removing preempt_disable/enable in _pcp_protect_return. To prove
this another way, I disabled CONFIG_ARM64_HAS_LSE_ATOMICS and the
regression worsened, indicating that at least on Apple M3 the
atomics are faster.

It may help to confirm this hypothesis on other hardware - perhaps
Jisheng can test with this change on his hardware and confirm
whether he gets the same performance improvement.

By coincidence, Yang Shi has been discussing the this_cpu_* overhead
at [2].

[1] https://lore.kernel.org/all/1052a452-9ba3-4da7-be47-7d27d27b3d1d@arm.com/
[2] https://lore.kernel.org/all/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/

>
> Will
>
[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-16 15:29 ` Dev Jain
@ 2026-02-17 13:53 ` Catalin Marinas
  2026-02-17 15:00 ` Will Deacon
  0 siblings, 1 reply; 14+ messages in thread
From: Catalin Marinas @ 2026-02-17 13:53 UTC (permalink / raw)
  To: Dev Jain
  Cc: Will Deacon, Jisheng Zhang, Dennis Zhou, Tejun Heo,
	Christoph Lameter, linux-arm-kernel, linux-kernel, linux-mm, maz

On Mon, Feb 16, 2026 at 08:59:17PM +0530, Dev Jain wrote:
> On 16/02/26 4:30 pm, Will Deacon wrote:
> > On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> >> It turns out the generic disable/enable irq this_cpu_cmpxchg
> >> implementation is faster than LL/SC or lse implementation. Remove
> >> HAVE_CMPXCHG_LOCAL for better performance on arm64.
> >>
> >> Tested on Quad 1.9GHZ CA55 platform:
> >> average mod_node_page_state() cost decreases from 167ns to 103ns
> >> the spawn (30 duration) benchmark in unixbench is improved
> >> from 147494 lps to 150561 lps, improved by 2.1%
> >>
> >> Tested on Quad 2.1GHZ CA73 platform:
> >> average mod_node_page_state() cost decreases from 113ns to 85ns
> >> the spawn (30 duration) benchmark in unixbench is improved
> >> from 209844 lps to 212581 lps, improved by 1.3%
> >>
> >> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> >> ---
> >> arch/arm64/Kconfig | 1 -
> >> arch/arm64/include/asm/percpu.h | 24 ------------------------
> >> 2 files changed, 25 deletions(-)
> > That is _entirely_ dependent on the system, so this isn't the right
> > approach. I also don't think it's something we particularly want to
> > micro-optimise to accomodate systems that suck at atomics.
> 
> Hi Will,
> 
> As I mention in the other email, the suspect is not the atomics, but
> preempt_disable(). On Apple M3, the regression reported in [1] resolves
> by removing preempt_disable/enable in _pcp_protect_return. To prove
> this another way, I disabled CONFIG_ARM64_HAS_LSE_ATOMICS and the
> regression worsened, indicating that at least on Apple M3 the
> atomics are faster.

Then why don't we replace the preempt disabling with local_irq_save()
in the arm64 code and still use the LSE atomics?

IIUC (lots of macro indirection), the generic cmpxchg is not atomic, so
another CPU is allowed to mess this up if it accesses current CPU's
variable via per_cpu_ptr().

-- 
Catalin

^ permalink raw reply	[flat|nested] 14+ messages in thread
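
For comparison, the generic fallback that dropping HAVE_CMPXCHG_LOCAL
selects looks roughly like the sketch below (paraphrased from the
asm-generic per-cpu code; the exact helper names there may differ).
Interrupts are masked around a plain load/compare/store rather than a
single atomic instruction, which is the non-atomicity noted above:

/* Rough, illustrative shape of the generic this_cpu_cmpxchg() fallback */
#define this_cpu_cmpxchg_generic_sketch(pcp, oval, nval)		\
({									\
	typeof(pcp) *ptr__, ret__;					\
	unsigned long flags__;						\
	raw_local_irq_save(flags__);	/* masks IRQs on this CPU, not NMIs */ \
	ptr__ = raw_cpu_ptr(&(pcp));					\
	ret__ = *ptr__;			/* plain load ... */		\
	if (ret__ == (oval))						\
		*ptr__ = (nval);	/* ... plain store, not one atomic op */ \
	raw_local_irq_restore(flags__);					\
	ret__;								\
})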

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-17 13:53 ` Catalin Marinas
@ 2026-02-17 15:00 ` Will Deacon
  2026-02-17 16:48 ` Catalin Marinas
  0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2026-02-17 15:00 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Dev Jain, Jisheng Zhang, Dennis Zhou, Tejun Heo, Christoph Lameter,
	linux-arm-kernel, linux-kernel, linux-mm, maz

On Tue, Feb 17, 2026 at 01:53:19PM +0000, Catalin Marinas wrote:
> On Mon, Feb 16, 2026 at 08:59:17PM +0530, Dev Jain wrote:
> > On 16/02/26 4:30 pm, Will Deacon wrote:
> > > On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> > >> It turns out the generic disable/enable irq this_cpu_cmpxchg
> > >> implementation is faster than LL/SC or lse implementation. Remove
> > >> HAVE_CMPXCHG_LOCAL for better performance on arm64.
> > >>
> > >> Tested on Quad 1.9GHZ CA55 platform:
> > >> average mod_node_page_state() cost decreases from 167ns to 103ns
> > >> the spawn (30 duration) benchmark in unixbench is improved
> > >> from 147494 lps to 150561 lps, improved by 2.1%
> > >>
> > >> Tested on Quad 2.1GHZ CA73 platform:
> > >> average mod_node_page_state() cost decreases from 113ns to 85ns
> > >> the spawn (30 duration) benchmark in unixbench is improved
> > >> from 209844 lps to 212581 lps, improved by 1.3%
> > >>
> > >> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> > >> ---
> > >> arch/arm64/Kconfig | 1 -
> > >> arch/arm64/include/asm/percpu.h | 24 ------------------------
> > >> 2 files changed, 25 deletions(-)
> > > That is _entirely_ dependent on the system, so this isn't the right
> > > approach. I also don't think it's something we particularly want to
> > > micro-optimise to accomodate systems that suck at atomics.
> > 
> > Hi Will,
> > 
> > As I mention in the other email, the suspect is not the atomics, but
> > preempt_disable(). On Apple M3, the regression reported in [1] resolves
> > by removing preempt_disable/enable in _pcp_protect_return. To prove
> > this another way, I disabled CONFIG_ARM64_HAS_LSE_ATOMICS and the
> > regression worsened, indicating that at least on Apple M3 the
> > atomics are faster.
> 
> Then why don't we replace the preempt disabling with local_irq_save()
> in the arm64 code and still use the LSE atomics?

Even better, work on making preempt_disable() faster as it's used in many
other places. Of course, if people want to hack the .config, they could
also change the preemption mode...

Will

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-17 15:00 ` Will Deacon
@ 2026-02-17 16:48 ` Catalin Marinas
  2026-02-18  4:01 ` K Prateek Nayak
  0 siblings, 1 reply; 14+ messages in thread
From: Catalin Marinas @ 2026-02-17 16:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Dev Jain, Jisheng Zhang, Dennis Zhou, Tejun Heo, Christoph Lameter,
	linux-arm-kernel, linux-kernel, linux-mm, maz

On Tue, Feb 17, 2026 at 03:00:22PM +0000, Will Deacon wrote:
> On Tue, Feb 17, 2026 at 01:53:19PM +0000, Catalin Marinas wrote:
> > On Mon, Feb 16, 2026 at 08:59:17PM +0530, Dev Jain wrote:
> > > On 16/02/26 4:30 pm, Will Deacon wrote:
> > > > On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> > > >> It turns out the generic disable/enable irq this_cpu_cmpxchg
> > > >> implementation is faster than LL/SC or lse implementation. Remove
> > > >> HAVE_CMPXCHG_LOCAL for better performance on arm64.
> > > >>
> > > >> Tested on Quad 1.9GHZ CA55 platform:
> > > >> average mod_node_page_state() cost decreases from 167ns to 103ns
> > > >> the spawn (30 duration) benchmark in unixbench is improved
> > > >> from 147494 lps to 150561 lps, improved by 2.1%
> > > >>
> > > >> Tested on Quad 2.1GHZ CA73 platform:
> > > >> average mod_node_page_state() cost decreases from 113ns to 85ns
> > > >> the spawn (30 duration) benchmark in unixbench is improved
> > > >> from 209844 lps to 212581 lps, improved by 1.3%
[...]
> > > > That is _entirely_ dependent on the system, so this isn't the right
> > > > approach. I also don't think it's something we particularly want to
> > > > micro-optimise to accomodate systems that suck at atomics.
> > > 
> > > As I mention in the other email, the suspect is not the atomics, but
> > > preempt_disable(). On Apple M3, the regression reported in [1] resolves
> > > by removing preempt_disable/enable in _pcp_protect_return. To prove
> > > this another way, I disabled CONFIG_ARM64_HAS_LSE_ATOMICS and the
> > > regression worsened, indicating that at least on Apple M3 the
> > > atomics are faster.
> > 
> > Then why don't we replace the preempt disabling with local_irq_save()
> > in the arm64 code and still use the LSE atomics?
> 
> Even better, work on making preempt_disable() faster as it's used in many
> other places.

Yes, that would be good. It's the preempt_enable_notrace() path that
ends up calling preempt_schedule_notrace() -> __schedule() pretty much
unconditionally. Not sure what would go wrong but some simple change
like this (can be done at a higher level in the preempt macros to even
avoid getting here):

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 854984967fe2..d9a5d6438303 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7119,7 +7119,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
 	if (likely(!preemptible()))
 		return;
 
-	do {
+	while (need_resched()) {
 		/*
 		 * Because the function tracer can trace preempt_count_sub()
 		 * and it also uses preempt_enable/disable_notrace(), if
@@ -7146,7 +7146,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
 
 		preempt_latency_stop(1);
 		preempt_enable_no_resched_notrace();
-	} while (need_resched());
+	}
 }
 EXPORT_SYMBOL_GPL(preempt_schedule_notrace);

Of course, changing the preemption model solves this by making the
macros no-ops but I assume people want to keep preemption on.

-- 
Catalin

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-17 16:48 ` Catalin Marinas
@ 2026-02-18  4:01 ` K Prateek Nayak
  2026-02-18  9:29 ` Catalin Marinas
  0 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2026-02-18 4:01 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Dev Jain, Jisheng Zhang, Dennis Zhou, Tejun Heo, Christoph Lameter,
	linux-arm-kernel, linux-kernel, linux-mm, maz

Hello Catalin,

On 2/17/2026 10:18 PM, Catalin Marinas wrote:
> Yes, that would be good. It's the preempt_enable_notrace() path that
> ends up calling preempt_schedule_notrace() -> __schedule() pretty much
> unconditionally.

What do you mean by unconditionally? We always check
__preempt_count_dec_and_test() before calling into __schedule().

On x86, we use the MSB of preempt_count to indicate a resched, and
set_preempt_need_resched() would just clear this MSB.

If the preempt_count() turns 0, we immediately go into schedule, or the
next preempt_enable() -> __preempt_count_dec_and_test() would see the
entire preempt_count being clear and will call into schedule.

The arm64 implementation seems to be doing something similar too
with a separate "ti->preempt.need_resched" bit which is part of
the "ti->preempt_count"'s union so it isn't really unconditional.

> Not sure what would go wrong but some simple change
> like this (can be done at a higher in the preempt macros to even avoid
> getting here):
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 854984967fe2..d9a5d6438303 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7119,7 +7119,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
>  	if (likely(!preemptible()))
>  		return;
>  
> -	do {
> +	while (need_resched()) {

Essentially you are simply checking it twice now on entry since
need_resched() state would have already been communicated by
__preempt_count_dec_and_test().

>  		/*
>  		 * Because the function tracer can trace preempt_count_sub()
>  		 * and it also uses preempt_enable/disable_notrace(), if
> @@ -7146,7 +7146,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
>  
>  		preempt_latency_stop(1);
>  		preempt_enable_no_resched_notrace();
> -	} while (need_resched());
> +	}
>  }
>  EXPORT_SYMBOL_GPL(preempt_schedule_notrace);

-- 
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 14+ messages in thread
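
A condensed, hypothetical illustration of the counting scheme described
above, using simplified kernel-style types; field order, endianness and
the notrace variants are glossed over, so this is not the real x86 or
arm64 definition:

/* need_resched folded into the preempt count (simplified sketch) */
union preempt_sketch {
	u64	preempt_count;		/* read/written as one word on the fast path */
	struct {
		u32	count;		/* preemption-disable depth */
		u32	need_resched;	/* inverted polarity: 0 means "resched needed" */
	};
};

/* roughly the decision __preempt_count_dec_and_test() has to make */
static inline bool preempt_dec_and_test_sketch(union preempt_sketch *p)
{
	p->count--;
	/*
	 * The combined word is zero only when the disable depth hits 0
	 * *and* a reschedule is pending, so a single test decides
	 * whether preempt_enable() falls through or calls __schedule().
	 */
	return p->preempt_count == 0;
}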

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-18  4:01 ` K Prateek Nayak
@ 2026-02-18  9:29 ` Catalin Marinas
  0 siblings, 0 replies; 14+ messages in thread
From: Catalin Marinas @ 2026-02-18 9:29 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Will Deacon, Dev Jain, Jisheng Zhang, Dennis Zhou, Tejun Heo,
	Christoph Lameter, linux-arm-kernel, linux-kernel, linux-mm, maz

Hi Prateek,

On Wed, Feb 18, 2026 at 09:31:19AM +0530, K Prateek Nayak wrote:
> On 2/17/2026 10:18 PM, Catalin Marinas wrote:
> > Yes, that would be good. It's the preempt_enable_notrace() path that
> > ends up calling preempt_schedule_notrace() -> __schedule() pretty much
> > unconditionally.
> 
> What do you mean by unconditionally? We always check
> __preempt_count_dec_and_test() before calling into __schedule().
> 
> On x86, We use MSB of preempt_count to indicate a resched and
> set_preempt_need_resched() would just clear this MSB.
> 
> If the preempt_count() turns 0, we immediately go into schedule
> or the next preempt_enable() -> __preempt_count_dec_and_test()
> would see the entire preempt_count being clear and will call into
> schedule.
> 
> The arm64 implementation seems to be doing something similar too
> with a separate "ti->preempt.need_resched" bit which is part of
> the "ti->preempt_count"'s union so it isn't really unconditional.

Ah, yes, you are right. I got the polarity of need_resched in
thread_info wrong (we should have named it no_need_to_resched).

So in the common case, the overhead is caused by the additional pointer
chase and preempt_count update, on top of the cpu offset read. Not sure
we can squeeze any more cycles out of these without some large overhaul
like:

https://git.kernel.org/mark/c/84ee5f23f93d4a650e828f831da9ed29c54623c5

or Yang's per-CPU page tables. Well, there are more ideas like in-kernel
restartable sequences but they move the overhead elsewhere.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-16 15:29 ` Dev Jain
  2026-02-17 13:53 ` Catalin Marinas
@ 2026-02-17 17:19 ` Christoph Lameter (Ampere)
  2026-02-20  6:14 ` Jisheng Zhang
  2 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-17 17:19 UTC (permalink / raw)
  To: Dev Jain
  Cc: Will Deacon, Jisheng Zhang, Catalin Marinas, Dennis Zhou, Tejun Heo,
	linux-arm-kernel, linux-kernel, linux-mm, maz

On Mon, 16 Feb 2026, Dev Jain wrote:

> By coincidence, Yang Shi has been discussing the this_cpu_* overhead
> at [2].

Yang Shi is on vacation but we have a patchset that removes
preempt_enable/disable from this_cpu operations on ARM64.

The performance of cmpxchg varies by platform in use and with the
kernel config. The measurements that I did 2 years ago indicated that
the cmpxchg use with Ampere processors did not cause a regression.

Note that distro kernels often do not enable PREEMPT_FULL and therefore
preempt_disable/enable overhead is not incurred in production systems.
PREEMPT_VOLUNTARY does not use preemption for this_cpu ops.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-16 15:29 ` Dev Jain
  2026-02-17 13:53 ` Catalin Marinas
  2026-02-17 17:19 ` Christoph Lameter (Ampere)
@ 2026-02-20  6:14 ` Jisheng Zhang
  2 siblings, 0 replies; 14+ messages in thread
From: Jisheng Zhang @ 2026-02-20 6:14 UTC (permalink / raw)
  To: Dev Jain
  Cc: Will Deacon, Catalin Marinas, Dennis Zhou, Tejun Heo,
	Christoph Lameter, linux-arm-kernel, linux-kernel, linux-mm, maz

On Mon, Feb 16, 2026 at 08:59:17PM +0530, Dev Jain wrote:
> 
> On 16/02/26 4:30 pm, Will Deacon wrote:
> > On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> >> It turns out the generic disable/enable irq this_cpu_cmpxchg
> >> implementation is faster than LL/SC or lse implementation. Remove
> >> HAVE_CMPXCHG_LOCAL for better performance on arm64.
> >>
> >> Tested on Quad 1.9GHZ CA55 platform:
> >> average mod_node_page_state() cost decreases from 167ns to 103ns
> >> the spawn (30 duration) benchmark in unixbench is improved
> >> from 147494 lps to 150561 lps, improved by 2.1%
> >>
> >> Tested on Quad 2.1GHZ CA73 platform:
> >> average mod_node_page_state() cost decreases from 113ns to 85ns
> >> the spawn (30 duration) benchmark in unixbench is improved
> >> from 209844 lps to 212581 lps, improved by 1.3%
> >>
> >> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> >> ---
> >> arch/arm64/Kconfig | 1 -
> >> arch/arm64/include/asm/percpu.h | 24 ------------------------
> >> 2 files changed, 25 deletions(-)
> > That is _entirely_ dependent on the system, so this isn't the right
> > approach. I also don't think it's something we particularly want to
> > micro-optimise to accomodate systems that suck at atomics.

Hi Will,

I read this as an implication that the cmpxchg_local version is better
than the generic disable/enable irq version on newer arm64 systems. Is
my understanding correct?

> 
> Hi Will,
> 
> As I mention in the other email, the suspect is not the atomics, but
> preempt_disable(). On Apple M3, the regression reported in [1] resolves
> by removing preempt_disable/enable in _pcp_protect_return. To prove
> this another way, I disabled CONFIG_ARM64_HAS_LSE_ATOMICS and the
> regression worsened, indicating that at least on Apple M3 the
> atomics are faster.
> 
> It may help to confirm this hypothesis on other hardware - perhaps
> Jisheng can test with this change on his hardware and confirm
> whether he gets the same performance improvement.

Hi Dev,

Thanks for the hints. I tried removing the preempt_disable/enable from
_pcp_protect_return; it helps, but the HAVE_CMPXCHG_LOCAL version is
still worse than the generic disable/enable irq version on CA55 and
CA73.

> 
> By coincidence, Yang Shi has been discussing the this_cpu_* overhead
> at [2].
> 
> [1] https://lore.kernel.org/all/1052a452-9ba3-4da7-be47-7d27d27b3d1d@arm.com/
> [2] https://lore.kernel.org/all/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/
> 
> > 
> > Will
> > 
[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-15  3:39 [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL Jisheng Zhang
  2026-02-16 10:59 ` Dev Jain
  2026-02-16 11:00 ` Will Deacon
@ 2026-02-18 22:07 ` Shakeel Butt
  2026-02-20  6:20 ` Jisheng Zhang
  2 siblings, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2026-02-18 22:07 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Catalin Marinas, Will Deacon, Dennis Zhou, Tejun Heo,
	Christoph Lameter, linux-arm-kernel, linux-kernel, linux-mm

On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> It turns out the generic disable/enable irq this_cpu_cmpxchg
> implementation is faster than LL/SC or lse implementation. Remove
> HAVE_CMPXCHG_LOCAL for better performance on arm64.
>
> Tested on Quad 1.9GHZ CA55 platform:
> average mod_node_page_state() cost decreases from 167ns to 103ns
> the spawn (30 duration) benchmark in unixbench is improved
> from 147494 lps to 150561 lps, improved by 2.1%
>
> Tested on Quad 2.1GHZ CA73 platform:
> average mod_node_page_state() cost decreases from 113ns to 85ns
> the spawn (30 duration) benchmark in unixbench is improved
> from 209844 lps to 212581 lps, improved by 1.3%
>
> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>

Please note that mod_node_page_state() can be called in NMI context,
and the generic disable/enable irq implementation is not safe against
NMIs (newer Arm architectures support NMIs).

^ permalink raw reply	[flat|nested] 14+ messages in thread
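
To make the NMI hazard concrete, a hypothetical interleaving with an
IRQ-disable based read-modify-write of a per-CPU counter (made-up
values; the actual vmstat code path is not shown):

/*
 *   task context on CPU0                    NMI handler on CPU0
 *   --------------------                    -------------------
 *   raw_local_irq_save()    <- masks IRQs, but NMIs are still taken
 *   old = *counter          (reads 5)
 *                                           *counter += 1  (counter is now 6)
 *   if (old == expected)
 *           *counter = new  (derived from 5: the NMI's update is lost)
 *   raw_local_irq_restore()
 */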

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-18 22:07 ` Shakeel Butt
@ 2026-02-20  6:20 ` Jisheng Zhang
  2026-02-20 23:27 ` Shakeel Butt
  0 siblings, 1 reply; 14+ messages in thread
From: Jisheng Zhang @ 2026-02-20 6:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Catalin Marinas, Will Deacon, Dennis Zhou, Tejun Heo,
	Christoph Lameter, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Feb 18, 2026 at 02:07:57PM -0800, Shakeel Butt wrote:
> On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> > It turns out the generic disable/enable irq this_cpu_cmpxchg
> > implementation is faster than LL/SC or lse implementation. Remove
> > HAVE_CMPXCHG_LOCAL for better performance on arm64.
> >
> > Tested on Quad 1.9GHZ CA55 platform:
> > average mod_node_page_state() cost decreases from 167ns to 103ns
> > the spawn (30 duration) benchmark in unixbench is improved
> > from 147494 lps to 150561 lps, improved by 2.1%
> >
> > Tested on Quad 2.1GHZ CA73 platform:
> > average mod_node_page_state() cost decreases from 113ns to 85ns
> > the spawn (30 duration) benchmark in unixbench is improved
> > from 209844 lps to 212581 lps, improved by 1.3%
> >
> > Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> 
> Please note that mod_node_page_state() can be called in NMI context and
> generic disable/enable irq are not safe against NMIs (newer arm arch supports
> NMI).

hmm, interesting...

fgrep HAVE_NMI arch/*/Kconfig
then
fgrep HAVE_CMPXCHG_LOCAL arch/*/Kconfig

shows that only x86, arm64, s390 and loongarch are safe, while arm,
powerpc and mips enable HAVE_NMI but are missing HAVE_CMPXCHG_LOCAL, so
they rely on the generic disable/enable irq version. So you are implying
that these three architectures are not safe for mod_node_page_state()
in NMI context.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
  2026-02-20  6:20 ` Jisheng Zhang
@ 2026-02-20 23:27 ` Shakeel Butt
  0 siblings, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2026-02-20 23:27 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Catalin Marinas, Will Deacon, Dennis Zhou, Tejun Heo,
	Christoph Lameter, linux-arm-kernel, linux-kernel, linux-mm

On Fri, Feb 20, 2026 at 02:20:54PM +0800, Jisheng Zhang wrote:
> On Wed, Feb 18, 2026 at 02:07:57PM -0800, Shakeel Butt wrote:
> > On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> > > It turns out the generic disable/enable irq this_cpu_cmpxchg
> > > implementation is faster than LL/SC or lse implementation. Remove
> > > HAVE_CMPXCHG_LOCAL for better performance on arm64.
> > >
> > > Tested on Quad 1.9GHZ CA55 platform:
> > > average mod_node_page_state() cost decreases from 167ns to 103ns
> > > the spawn (30 duration) benchmark in unixbench is improved
> > > from 147494 lps to 150561 lps, improved by 2.1%
> > >
> > > Tested on Quad 2.1GHZ CA73 platform:
> > > average mod_node_page_state() cost decreases from 113ns to 85ns
> > > the spawn (30 duration) benchmark in unixbench is improved
> > > from 209844 lps to 212581 lps, improved by 1.3%
> > >
> > > Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> > 
> > Please note that mod_node_page_state() can be called in NMI context and
> > generic disable/enable irq are not safe against NMIs (newer arm arch supports
> > NMI).
> 
> hmm, interesting...
> 
> fgrep HAVE_NMI arch/*/Kconfig
> then
> fgrep HAVE_CMPXCHG_LOCAL arch/*/Kconfig
> 
> shows that only x86, arm64, s390 and loongarch are safe, while arm,
> powerpc and mips enable HAVE_NMI but missing HAVE_CMPXCHG_LOCAL, so
> they rely on the generic disable/enable irq version, so you imply
> that these three arch are not safe considering mod_node_page_state()
> in NMI context.

Yes, it seems like it. For memcg stats, we use the
ARCH_HAVE_NMI_SAFE_CMPXCHG and ARCH_HAS_NMI_SAFE_THIS_CPU_OPS config
options to correctly handle the updates from NMI context. Maybe we need
something similar for vmstat as well.

So arm, powerpc and mips do not have ARCH_HAS_NMI_SAFE_THIS_CPU_OPS, but
powerpc does have ARCH_HAVE_NMI_SAFE_CMPXCHG and arm has it for CPU_V7,
CPU_V7M & CPU_V6K models. I wonder if we need to add complexity for
these archs.

^ permalink raw reply	[flat|nested] 14+ messages in thread