* [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
@ 2026-02-11 23:14 Yang Shi
2026-02-11 23:29 ` Tejun Heo
2026-02-12 18:41 ` Ryan Roberts
0 siblings, 2 replies; 21+ messages in thread
From: Yang Shi @ 2026-02-11 23:14 UTC (permalink / raw)
To: lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, Tejun Heo, urezki, Catalin Marinas, Will Deacon,
Ryan Roberts
Cc: Yang Shi, Yang Shi
Background
=========
The this_cpu_*() APIs operate on the current processor's local copy of
a percpu variable. In order to obtain the address of this CPU-specific
copy, a CPU-specific offset has to be added to the variable's base
address.
On x86 this address calculation can be done by prefixing an
instruction with a segment register, so x86 can increment a percpu
counter with a single instruction. Since the address calculation and
the RMW operation occur within one instruction, the operation is atomic
with respect to the scheduler, so no preemption disabling is needed.
For example:
        INC %gs:[my_counter]
See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more details.
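As a concrete (if trivial) usage sketch in kernel C; my_counter is just
an illustrative name, not an existing kernel symbol:

#include <linux/percpu.h>

/* illustrative only; my_counter is not an existing kernel symbol */
static DEFINE_PER_CPU(unsigned long, my_counter);

static void count_event(void)
{
	/* on x86 this becomes a single %gs-prefixed increment */
	this_cpu_inc(my_counter);
}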
ARM64 and some other non-x86 architectures don't have a segment
register. The address of the current CPU's percpu variable has to be
calculated first, and only then can that address be used for an
operation on percpu data. This sequence must be atomic with respect to
the scheduler. Therefore, it is necessary to disable preemption,
perform the address calculation and then the increment. On ARM64 the
CPU-specific offset lives in a system register (TPIDR_EL1/EL2) that
also has to be read. The code flow looks like:
        Disable preemption
        Calculate the current CPU copy address by using the offset
        Manipulate the counter
        Enable preemption
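In C, the pattern looks roughly like the following (a simplified sketch
of what the arm64 helpers do, not the literal macros; the real
_pcp_protect macro is quoted later in this thread):

#include <linux/percpu.h>
#include <linux/preempt.h>

/* simplified sketch, not the actual arch/arm64 implementation */
static inline void this_cpu_add_sketch(unsigned long __percpu *pcp,
				       unsigned long val)
{
	unsigned long *ptr;

	preempt_disable();	/* pin the task to this CPU */
	ptr = raw_cpu_ptr(pcp);	/* base + per-CPU offset (TPIDR_EL1/2) */
	*ptr += val;		/* real code uses an atomic add so the op
				 * is also safe against interrupt handlers */
	preempt_enable();	/* may call into the scheduler */
}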
This process is inefficient relative to x86 and has to be repeated for
every access to percpu data.
ARM64 does have atomic increment instructions, but they cannot fold in
a per-CPU offset the way the x86 segment prefix does. So a separate
address calculation is always necessary even when an atomic
instruction is used.
A page table allows us to remap addresses. If the atomic instruction
used a fixed virtual address, and the local processor's page tables
mapped that area to the local percpu data, then ARM64 (and hopefully
some other non-x86 architectures too) could also use a single
instruction and be as efficient as x86.
So, the code flow should just become:
        INC VIRTUAL_BASE + percpu_variable_offset
In order to do that we need to have the same virtual address mapped
differently for each processor. This means we need different page
tables for each processor. These page tables
can map almost all of the address space in the same way. The only area
that will be special is the area starting at VIRTUAL_BASE.
In addition, percpu counters can also be accessed from other CPUs by
using the per_cpu_ptr() APIs. This is usually done by counter
initialization code. For example:
        for_each_possible_cpu(cpu) {
                p = per_cpu_ptr(ptr, cpu);
                initialize(p);
        }
Percpu allocator
=============
When alloc_percpu() is called, the kernel allocates a contiguous
virtual memory area, called a "chunk", from the vmalloc area. The chunk
looks like:

        | CPU 0 | CPU 1 | ... | CPU n |

The size of the chunk is percpu_unit_size * nr_cpus. The kernel then
maps it to physical memory, and alloc_percpu() returns an offset.
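So the value returned by alloc_percpu() is really an offset-like
cookie, not a directly dereferenceable pointer. A simplified sketch of
today's conversion (per_cpu_ptr(), this_cpu_ptr() and __per_cpu_offset[]
are the real kernel helpers; the arithmetic in the comments is a
simplification):

#include <linux/percpu.h>

static void percpu_conversion_sketch(int cpu)
{
	unsigned long __percpu *ctr = alloc_percpu(unsigned long);
	unsigned long *remote, *local;

	/* global mapping: usable from any CPU; roughly
	 * (void *)ctr + __per_cpu_offset[cpu] */
	remote = per_cpu_ptr(ctr, cpu);

	/* same conversion, but the per-CPU offset comes from
	 * TPIDR_EL1/2 on arm64 */
	local = this_cpu_ptr(ctr);

	*remote = 0;
	*local = 0;

	free_percpu(ctr);
}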
Design
======
To improve the performance of this_cpu_ops on ARM64 and potentially
some other non-x86 architectures, Christoph Lameter and I propose the
solution below.
To remove the preemption disable/enable, we need to guarantee that the
this_cpu_*() APIs convert the offset returned by alloc_percpu() to a
pointer that is the same on all CPUs, without breaking the
per_cpu_ptr() use case.
To achieve this, we need to modify the percpu allocator to allocate
extra virtual memory other than the virtual memory area shown in the
above diagram. The size of the extra allocation is percpu_unit_size.
The this_cpu_*() APIs will convert the offset returned by
alloc_percpu() to a pointer into this area, which is the same for all
CPUs. To simplify the discussion, I call the extra allocated area the
"local mapping" and the original area the "global mapping". So the
percpu chunk will look like:

        | CPU 0 | CPU 1 | ... | CPU n | xxxxxxxxx | CPU |
        |<------ global mapping ----->|           |<- local mapping ->|
this_cpu_*() APIs will just access the local mapping, per_cpu_ptr()
APIs continue to use the global mapping.
The local mapping has to map to different physical memory on each CPU
(the same physical memory already mapped by the global mapping, so no
extra physical memory is allocated) so that each CPU manipulates the
right copy. This can be achieved with per-CPU kernel page tables in
arch-dependent code: each CPU sees its own copy of the kernel page
table instead of sharing a single one. However, most of the page table
contents can still be shared; only the percpu local mapping area
differs. So the CPUs can basically share the PUD/PMD/PTE levels and
only need separate PGDs.
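A heavily simplified, hypothetical sketch of what the arch-dependent
setup might look like (pgd_page_alloc() and map_local_percpu_area() are
made-up names, not existing kernel code; only swapper_pg_dir,
PTRS_PER_PGD and the percpu accessors are real):

/* hypothetical sketch of per-CPU kernel page table setup */
static DEFINE_PER_CPU(pgd_t *, pcpu_pgd);

static void __init setup_percpu_pgds(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		pgd_t *pgd = pgd_page_alloc();	/* one PGD page per CPU */

		/* share all PUD/PMD/PTE levels with the normal kernel table */
		memcpy(pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));

		/* install CPU-private PGD entries for the local mapping area */
		map_local_percpu_area(pgd, cpu);

		per_cpu(pcpu_pgd, cpu) = pgd;	/* loaded into TTBR1 per CPU */
	}
}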
The kernel also maintains a base address for the global mapping in
order to convert the offset returned by alloc_percpu() to the correct
pointer. The local mapping also needs a base address, and the offset
between the local mapping base address and the allocated local mapping
area must be the same as the offset returned by alloc_percpu(). So the
local mapping has to live in a specific address range. This may require
a dedicated percpu local mapping area that vmalloc() cannot use, in
order to avoid conflicts.
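To make the intended address arithmetic concrete, a hypothetical sketch
(GLOBAL_BASE, LOCAL_BASE and PERCPU_UNIT_SIZE are illustrative
placeholders, not existing kernel symbols):

/* illustrative placeholder values, not real kernel constants */
#define GLOBAL_BASE		0xffff800080000000UL
#define LOCAL_BASE		0xffff8000f0000000UL
#define PERCPU_UNIT_SIZE	(512 * 1024UL)

/* per_cpu_ptr(): global mapping, a different address for each CPU */
static void *pcpu_global_ptr(unsigned long offset, int cpu)
{
	return (void *)(GLOBAL_BASE + cpu * PERCPU_UNIT_SIZE + offset);
}

/*
 * this_cpu_*(): local mapping, the same virtual address on every CPU.
 * The per-CPU page tables alias it to that CPU's copy, so no
 * preempt_disable()/TPIDR read is needed for a simple counter update.
 */
static void *pcpu_local_ptr(unsigned long offset)
{
	return (void *)(LOCAL_BASE + offset);
}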
I have done a PoC on ARM64. Hopefully I can post it to the mailing
list before the conference to ease the discussion.
Overhead
========
1. Some extra virtual memory space, but it shouldn't be much. I saw
960K with the Fedora default kernel config. Given the terabytes of
virtual memory space on a 64-bit machine, 960K is negligible.
2. Some extra physical memory for the percpu kernel page tables: 4K *
(nr_cpus - 1) for PGD pages, plus the page tables used by the percpu
local mapping area. A couple of megabytes with the Fedora default
kernel config on AmpereOne with 160 cores.
3. Percpu allocation and free will be slower due to the extra virtual
memory allocation and page table manipulation. However, percpu memory
is allocated by chunk, and one chunk typically holds a lot of percpu
variables, so the slowdown should be negligible. The test results
below also bear this out.
Performance Test
==============
I have done a PoC on ARM64. So all the tests were run on AmpereOne
with 160 cores.
1. Kernel build
--------------------
Run kernel build (make -j160) with default Fedora kernel config in a memcg.
Roughly 13% - 15% systime improvement for my kernel build workload.
2. stress-ng
----------------
stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
6% improvement for systime
3. vm-scalability
----------------------
Single-digit (0% - 8%) systime improvement for some vm-scalability test cases
4. will-it-scale
------------------
3% - 8% improvement for pagefault cases from will-it-scale
Profiling of page_fault3_processes from will-it-scale also shows the
reduction in percpu counter manipulation (perf diff output):
5.91% -1.82% [kernel.kallsyms] [k] mod_memcg_lruvec_state
2.84% -1.30% [kernel.kallsyms] [k] percpu_counter_add_batch
Regression Test
=============
Create 10K cgroups.
Creating cgroups calls the percpu allocator multiple times. For
example, creating one memcg needs to allocate the percpu refcnt, rstat
and the objcg percpu refcnt.
It consumed 2112K more virtual memory for the percpu local mapping. A
few more megabytes were consumed by the percpu page tables to map the
local mapping. The memory consumption depends on the number of CPUs.
Execution time is basically the same. No noticeable regression is
found. The profiling shows (perf diff):
0.35%  -0.33%  [kernel.kallsyms]  [k] percpu_ref_get_many
0.61%  -0.30%  [kernel.kallsyms]  [k] percpu_counter_add_batch
0.34%  +0.02%  [kernel.kallsyms]  [k] pcpu_alloc_noprof
0.00%  +0.05%  [kernel.kallsyms]  [k] free_percpu.part.0
The gain from manipulating percpu counters outweighs the slowdown from
percpu allocation and free. There is even a small net gain.
Future usecases
=============
Some other use cases may be unlocked by percpu page tables, for
example kernel text replication, off the top of my head. Anyway, this
is not the main point of this proposal.
Key attendees
===========
This work will require changes to the percpu allocator, vmalloc (we
just need to add a new interface that takes a pgdir pointer as an
argument) and arch-dependent code (the percpu page table
implementation is arch-specific). So the percpu allocator maintainers,
vmalloc maintainers and arch experts (for example, ARM64) should be
key attendees. I don't know who can attend, so I just list all of them.
Christoph Lameter (co-presenter and percpu allocator maintainer)
Dennis Zhou/Tejun Heo (percpu allocator maintainer)
Uladzislau Rezki (vmalloc maintainer)
Catalin Marinas/Will Deacon/Ryan Roberts (ARM64 memory management)
Thanks,
Yang
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-11 23:14 [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures) Yang Shi
@ 2026-02-11 23:29 ` Tejun Heo
2026-02-11 23:39 ` Christoph Lameter (Ampere)
2026-02-11 23:58 ` Yang Shi
2026-02-12 18:41 ` Ryan Roberts
1 sibling, 2 replies; 21+ messages in thread
From: Tejun Heo @ 2026-02-11 23:29 UTC (permalink / raw)
To: Yang Shi
Cc: lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Catalin Marinas, Will Deacon, Ryan Roberts,
Yang Shi
Hello,
On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
...
> Overhead
> ========
> 1. Some extra virtual memory space. But it shouldn’t be too much. I
> saw 960K with Fedora default kernel config. Given terabytes virtual
> memory space on 64 bit machine, 960K is negligible.
> 2. Some extra physical memory for percpu kernel page table. 4K *
> (nr_cpus – 1) for PGD pages, plus the page tables used by percpu local
> mapping area. A couple of megabytes with Fedora default kernel config
> on AmpereOne with 160 cores.
> 3. Percpu allocation and free will be slower due to extra virtual
> memory allocation and page table manipulation. However, percpu is
> allocated by chunk. One chunk typically holds a lot percpu variables.
> So the slowdown should be negligible. The test result below also
> proved it.
It will also add a bit of TLB pressure as a lot of percpu allocations are
currently embedded in the linear address space backed by large page
mappings. Likely immaterial compared to the reduced overhead of
this_cpu_*().
One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
with this_cpu_ptr(). E.g. if there are users that take this_cpu_ptr() and
use it outside a preempt-disable block (which is a bit odd but allowed),
the end result would be surprising. Hmm... I wonder whether it'd be
worthwhile to keep this_cpu_ptr() returning the global address - i.e. make it
access the global offset from the local mapping and then return the computed
global address. This should still be pretty cheap and gets rid of surprising
and potentially extremely subtle corner cases.
Generally sounds like a great solution for !x86.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-11 23:29 ` Tejun Heo
@ 2026-02-11 23:39 ` Christoph Lameter (Ampere)
2026-02-11 23:40 ` Tejun Heo
2026-02-11 23:58 ` Yang Shi
1 sibling, 1 reply; 21+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-11 23:39 UTC (permalink / raw)
To: Tejun Heo
Cc: Yang Shi, lsf-pc, Linux MM, dennis, urezki, Catalin Marinas,
Will Deacon, Ryan Roberts, Yang Shi
On Wed, 11 Feb 2026, Tejun Heo wrote:
> One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
> with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
> uses that outside preempt disable block (which is a bit odd but allowed),
this_cpu_ptr converts a percpu variable offset to a normal pointer that
can be used without preemption.
If the scheduler migrates the task to another CPU then you would be
accessing that percpu data remotely, from a different CPU. Read-only
access is fine. Writing could be dicey in some cases.
> Generally sounds like a great solution for !x86.
Thanks.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-11 23:39 ` Christoph Lameter (Ampere)
@ 2026-02-11 23:40 ` Tejun Heo
2026-02-12 0:05 ` Christoph Lameter (Ampere)
0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2026-02-11 23:40 UTC (permalink / raw)
To: Christoph Lameter (Ampere)
Cc: Yang Shi, lsf-pc, Linux MM, dennis, urezki, Catalin Marinas,
Will Deacon, Ryan Roberts, Yang Shi
On Wed, Feb 11, 2026 at 03:39:02PM -0800, Christoph Lameter (Ampere) wrote:
> On Wed, 11 Feb 2026, Tejun Heo wrote:
>
> > One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
> > with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
> > uses that outside preempt disable block (which is a bit odd but allowed),
>
> this_cpu_ptr converts a percpu variable offset to a normal pointer that
> can be used without preemption.
Ah, no problem then.
> If the scheduler changes the cpu then you would remotely access that per
> cpu data in a remote cpu. Read only acces is fine. Writing could be dicey
> in some cases.
Yeah, that's the current behavior.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-11 23:29 ` Tejun Heo
2026-02-11 23:39 ` Christoph Lameter (Ampere)
@ 2026-02-11 23:58 ` Yang Shi
2026-02-12 17:54 ` Catalin Marinas
1 sibling, 1 reply; 21+ messages in thread
From: Yang Shi @ 2026-02-11 23:58 UTC (permalink / raw)
To: Tejun Heo
Cc: lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Catalin Marinas, Will Deacon, Ryan Roberts,
Yang Shi
On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
> ...
> > Overhead
> > ========
> > 1. Some extra virtual memory space. But it shouldn’t be too much. I
> > saw 960K with Fedora default kernel config. Given terabytes virtual
> > memory space on 64 bit machine, 960K is negligible.
> > 2. Some extra physical memory for percpu kernel page table. 4K *
> > (nr_cpus – 1) for PGD pages, plus the page tables used by percpu local
> > mapping area. A couple of megabytes with Fedora default kernel config
> > on AmpereOne with 160 cores.
> > 3. Percpu allocation and free will be slower due to extra virtual
> > memory allocation and page table manipulation. However, percpu is
> > allocated by chunk. One chunk typically holds a lot percpu variables.
> > So the slowdown should be negligible. The test result below also
> > proved it.
>
> It will also add a bit of TLB pressure as a lot of percpu allocations are
> currently embedded in the linear address space backed by large page
> mappings. Likely immaterial compared to the reduced overhead of
> this_cpu_*().
Yes, this should not be noticeable. It can be optimized further by
using cont PTEs on ARM64 if it turns out to be a problem. The percpu
area is typically larger than 64K (the cont PTE size with 4K pages on
arm64).
Also, the linear address space may not be backed by large page mappings
on ARM64. If rodata=on (the default, it was called "full" before) and
the machine doesn't support BBML2_NOABORT, the linear address space is
backed by PTEs.
>
> One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
> with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
> uses that outside preempt disable block (which is a bit odd but allowed),
> the end result would be surprising. Hmm... I wonder whether it'd be
> worthwhile to keep this_cpu_ptr() returning the global address - ie. make it
> access global offset from local mapping and then return the computed global
> address. This should still be pretty cheap and gets rid of surprising and
> potentially extremely subtle corner cases.
Yes, this is going to be a problem. So we don't change how
this_cpu_ptr() works and keep it returning the global address. I
noticed this may cause confusion for the list APIs too. For example,
when a list embedded in a percpu variable is initialized, ->next and
->prev are set to global addresses via per_cpu_ptr(); but if the list
is then accessed via this_cpu_ptr(), the list head is dereferenced
through the local address, and list_empty(), which compares the list
head pointer with ->next, will give the wrong answer. This would cause
problems.
So we only use the local address for this_cpu_add/sub/inc/dec and so
on, which just manipulate a scalar counter.
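For illustration, a sketch of the kind of mismatch that would occur if
this_cpu_ptr() returned the local-mapping alias (pcpu_list is an
illustrative percpu variable, not existing code):

#include <linux/list.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(struct list_head, pcpu_list);

static void init_lists(void)
{
	int cpu;

	/* init path: ->next/->prev are set to *global* addresses */
	for_each_possible_cpu(cpu)
		INIT_LIST_HEAD(per_cpu_ptr(&pcpu_list, cpu));
}

static bool local_list_empty(void)
{
	/* if this returned the local-mapping alias, head would never
	 * compare equal to head->next, so an empty list would not be
	 * reported as empty */
	return list_empty(this_cpu_ptr(&pcpu_list));
}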
>
> Generally sounds like a great solution for !x86.
Thank you.
Yang
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-11 23:40 ` Tejun Heo
@ 2026-02-12 0:05 ` Christoph Lameter (Ampere)
0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-12 0:05 UTC (permalink / raw)
To: Tejun Heo
Cc: Yang Shi, lsf-pc, Linux MM, dennis, urezki, Catalin Marinas,
Will Deacon, Ryan Roberts, Yang Shi
On Wed, 11 Feb 2026, Tejun Heo wrote:
> Yeah, that's the current behavior.
There is no change to the existing API. Sorry if we gave that impression.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-11 23:58 ` Yang Shi
@ 2026-02-12 17:54 ` Catalin Marinas
2026-02-12 18:43 ` Catalin Marinas
2026-02-12 18:45 ` Ryan Roberts
0 siblings, 2 replies; 21+ messages in thread
From: Catalin Marinas @ 2026-02-12 17:54 UTC (permalink / raw)
To: Yang Shi
Cc: Tejun Heo, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Will Deacon, Ryan Roberts, Yang Shi
On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
> On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo <tj@kernel.org> wrote:
> > On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
> > ...
> > > Overhead
> > > ========
> > > 1. Some extra virtual memory space. But it shouldn’t be too much. I
> > > saw 960K with Fedora default kernel config. Given terabytes virtual
> > > memory space on 64 bit machine, 960K is negligible.
> > > 2. Some extra physical memory for percpu kernel page table. 4K *
> > > (nr_cpus – 1) for PGD pages, plus the page tables used by percpu local
> > > mapping area. A couple of megabytes with Fedora default kernel config
> > > on AmpereOne with 160 cores.
> > > 3. Percpu allocation and free will be slower due to extra virtual
> > > memory allocation and page table manipulation. However, percpu is
> > > allocated by chunk. One chunk typically holds a lot percpu variables.
> > > So the slowdown should be negligible. The test result below also
> > > proved it.
[...]
> > One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
> > with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
> > uses that outside preempt disable block (which is a bit odd but allowed),
> > the end result would be surprising. Hmm... I wonder whether it'd be
> > worthwhile to keep this_cpu_ptr() returning the global address - ie. make it
> > access global offset from local mapping and then return the computed global
> > address. This should still be pretty cheap and gets rid of surprising and
> > potentially extremely subtle corner cases.
>
> Yes, this is going to be a problem. So we don't change how
> this_cpu_ptr() works and keep it returning the global address. Because
> I noticed this may cause confusion for list APIs too. For example,
> when initializing a list embedded into a percpu variable, the ->next
> and ->prev will be initialized to global addresses by using
> per_cpu_ptr(), but if the list is accessed via this_cpu_ptr(), list
> head will be dereferenced by using local address, then list_empty()
> will complain, which compare the list head pointer and ->next pointer.
> This will cause some problems.
>
> So we just use the local address for this_cpu_add/sub/inc/dec and so
> on, which just manipulate a scalar counter.
I wonder how much overhead is caused by calling into the scheduler on
preempt_enable(). It would be good to get some numbers for something
like the patch below (also removing the preempt disabling for
this_cpu_read() as I don't think it matters - a thread cannot
distinguish whether it was preempted between TPIDR read and variable
read or immediately after the variable read; we can't do this for writes
as other threads may notice unexpected updates).
Another wild hack could be to read the kernel instruction at
(current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
return false if it's a read from TPIDR_EL1/2, together with removing the
preempt disabling. Or some other lighter way of detecting this_cpu_*
constructs without full preemption disabling.
-----------------8<------------------------------------
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index b57b2bb00967..7194cc997293 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -153,11 +153,17 @@ PERCPU_RET_OP(add, add, ldadd)
* disabled.
*/
+#ifdef preempt_enable_no_resched_notrace
+#define _pcp_preempt_enable_notrace preempt_enable_no_resched_notrace
+#else
+#define _pcp_preempt_enable_notrace preempt_enable_notrace
+#endif
+
#define _pcp_protect(op, pcp, ...) \
({ \
preempt_disable_notrace(); \
op(raw_cpu_ptr(&(pcp)), __VA_ARGS__); \
- preempt_enable_notrace(); \
+ _pcp_preempt_enable_notrace(); \
})
#define _pcp_protect_return(op, pcp, args...) \
@@ -165,18 +171,21 @@ PERCPU_RET_OP(add, add, ldadd)
typeof(pcp) __retval; \
preempt_disable_notrace(); \
__retval = (typeof(pcp))op(raw_cpu_ptr(&(pcp)), ##args); \
- preempt_enable_notrace(); \
+ _pcp_preempt_enable_notrace(); \
__retval; \
})
+#define _pcp_return(op, pcp, args...) \
+ ((typeof(pcp))op(raw_cpu_ptr(&(pcp)), ##args))
+
#define this_cpu_read_1(pcp) \
- _pcp_protect_return(__percpu_read_8, pcp)
+ _pcp_return(__percpu_read_8, pcp)
#define this_cpu_read_2(pcp) \
- _pcp_protect_return(__percpu_read_16, pcp)
+ _pcp_return(__percpu_read_16, pcp)
#define this_cpu_read_4(pcp) \
- _pcp_protect_return(__percpu_read_32, pcp)
+ _pcp_return(__percpu_read_32, pcp)
#define this_cpu_read_8(pcp) \
- _pcp_protect_return(__percpu_read_64, pcp)
+ _pcp_return(__percpu_read_64, pcp)
#define this_cpu_write_1(pcp, val) \
_pcp_protect(__percpu_write_8, pcp, (unsigned long)val)
@@ -253,7 +262,7 @@ PERCPU_RET_OP(add, add, ldadd)
preempt_disable_notrace(); \
ptr__ = raw_cpu_ptr(&(pcp)); \
ret__ = cmpxchg128_local((void *)ptr__, old__, new__); \
- preempt_enable_notrace(); \
+ _pcp_preempt_enable_notrace(); \
ret__; \
})
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-11 23:14 [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures) Yang Shi
2026-02-11 23:29 ` Tejun Heo
@ 2026-02-12 18:41 ` Ryan Roberts
2026-02-12 18:55 ` Christoph Lameter (Ampere)
2026-02-13 18:42 ` Yang Shi
1 sibling, 2 replies; 21+ messages in thread
From: Ryan Roberts @ 2026-02-12 18:41 UTC (permalink / raw)
To: Yang Shi, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, Tejun Heo, urezki, Catalin Marinas, Will Deacon
Cc: Yang Shi
On 11/02/2026 23:14, Yang Shi wrote:
> Background
> =========
> The APIs using this_cpu_*() operate on a local copy of a percpu
> variable for the current processor. In order to obtain the address of
> this cpu specific variable a cpu specific offset has to be added to
> the address.
> On x86 this address calculation can be created by prefixing an
> instruction with a segment register. x86 can increment a percpu
> counter with a single instruction. Since the address calculation and
> the RMV operation occurs within one instruction it is atomic vs the
> scheduler. So no preemption is needed.
> f.e
> INC %gs:[my_counter]
> See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more details.
>
> ARM64 and some other non-x86 architectures don't have a segment
> register. The address of the current percpu variable has to be
> calculated and then that address can be used for an operation on
> percpu data. This process must be atomic vs the scheduler. Therefore,
> it is necessary to disable preemption, perform the address calculation
> and then the increment operation. The cpu specific offset is in a MSR
> that also needs to be accessed on ARM64. The code flow looks like:
> Disable preemption
> Calculate the current CPU copy address by using the offset
> Manipulate the counter
> Enable preemption
By massive coincidence, Dev Jain and I have been investigating a large
regression seen in a munmap micro-benchmark in 6.19, which is root-caused to a
change that ends up using this_cpu_*() a lot more on that path.
We have concluded that we can simplify this_cpu_read() to not bother
disabling/enabling preemption, since it is read-only and a migration between
the 2 ops vs after the second op is indistinguishable. I believe Dev is
planning to post a patch to the list soon. This will solve our immediate
regression issue.
But we can't do the same trick for ops that write. See [1].
[1] https://lore.kernel.org/all/20190311164837.GD24275@lakrids.cambridge.arm.com/
>
> This process is inefficient relative to x86 and has to be repeated for
> every access to per cpu data.
> ARM64 has an increment instruction but this increment does not allow
> the use of a base register or a segment register like on x86. So an
> address calculation is always necessary even if the atomic instruction
> is used.
> A page table allows us to do remapping of addresses. So if the atomic
> instruction would be using a virtual address and the page tables for
> the local processor would map this area to the local per cpu data then
> we can also create a single instruction on ARM64 (hopefully for some
> other non-x86 architectures too) and be as efficient as x86 is.
>
> So, the code flow should just become:
> INC VIRTUAL_BASE + percpu_variable_offset
>
> In order to do that we need to have the same virtual address mapped
> differently for each processor. This means we need different page
> tables for each processor. These page tables
> can map almost all of the address space in the same way. The only area
> that will be special is the area starting at VIRTUAL_BASE.
This is an interesting idea. I'm keen to be involved in discussions.
My immediate concern is that this would not be compatible with FEAT_TTCNP, which
allows multiple PEs (ARM speak for CPU) to share a TLB - e.g. for SMT. I'm not
sure if that would be the end of the world; the perf numbers below are
compelling. I'll defer to others' opinions on that.
Thanks,
Ryan
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 17:54 ` Catalin Marinas
@ 2026-02-12 18:43 ` Catalin Marinas
2026-02-13 0:23 ` Yang Shi
2026-02-12 18:45 ` Ryan Roberts
1 sibling, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2026-02-12 18:43 UTC (permalink / raw)
To: Yang Shi
Cc: Tejun Heo, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Will Deacon, Ryan Roberts, Yang Shi
More thoughts...
On Thu, Feb 12, 2026 at 05:54:19PM +0000, Catalin Marinas wrote:
> On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
> > So we just use the local address for this_cpu_add/sub/inc/dec and so
> > on, which just manipulate a scalar counter.
>
> I wonder how much overhead is caused by calling into the scheduler on
> preempt_enable(). It would be good to get some numbers for something
> like the patch below
In case it wasn't obvious, the patch messes up the scheduling, so I
don't propose it as such, only to get some idea of where the bottleneck
is. Maybe it could be made to work with some need_resched() checks.
> (also removing the preempt disabling for
> this_cpu_read() as I don't think it matters - a thread cannot
> distinguish whether it was preempted between TPIDR read and variable
> read or immediately after the variable read; we can't do this for writes
> as other threads may notice unexpected updates).
There's a theoretical case where even this_cpu_read() needs preemption
disabling, e.g.:
thread0:
	preempt_disable();
	this_cpu_write(var, unique_val);
	// check that no-one has seen unique_val;
	this_cpu_write(var, other_val);
	preempt_enable();

thread1:
	this_cpu_read(var);
thread1 is not supposed to see unique_val, but it could if it were
preempted in the middle of its per-cpu read and migrated to another CPU.
> Another wild hack could be to read the kernel instruction at
> (current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
> return false if it's a read from TPIDR_EL1/2, together with removing the
> preempt disabling.
This one also breaks the kernel scheduling just like using
preempt_enable_no_resched(). It might be possible but in combination
with additional need_resched() checks.
--
Catalin
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 17:54 ` Catalin Marinas
2026-02-12 18:43 ` Catalin Marinas
@ 2026-02-12 18:45 ` Ryan Roberts
2026-02-12 19:36 ` Catalin Marinas
1 sibling, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2026-02-12 18:45 UTC (permalink / raw)
To: Catalin Marinas, Yang Shi
Cc: Tejun Heo, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Will Deacon, Yang Shi
On 12/02/2026 17:54, Catalin Marinas wrote:
> On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
>> On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo <tj@kernel.org> wrote:
>>> On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
>>> ...
>>>> Overhead
>>>> ========
>>>> 1. Some extra virtual memory space. But it shouldn’t be too much. I
>>>> saw 960K with Fedora default kernel config. Given terabytes virtual
>>>> memory space on 64 bit machine, 960K is negligible.
>>>> 2. Some extra physical memory for percpu kernel page table. 4K *
>>>> (nr_cpus – 1) for PGD pages, plus the page tables used by percpu local
>>>> mapping area. A couple of megabytes with Fedora default kernel config
>>>> on AmpereOne with 160 cores.
>>>> 3. Percpu allocation and free will be slower due to extra virtual
>>>> memory allocation and page table manipulation. However, percpu is
>>>> allocated by chunk. One chunk typically holds a lot percpu variables.
>>>> So the slowdown should be negligible. The test result below also
>>>> proved it.
> [...]
>>> One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
>>> with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
>>> uses that outside preempt disable block (which is a bit odd but allowed),
>>> the end result would be surprising. Hmm... I wonder whether it'd be
>>> worthwhile to keep this_cpu_ptr() returning the global address - ie. make it
>>> access global offset from local mapping and then return the computed global
>>> address. This should still be pretty cheap and gets rid of surprising and
>>> potentially extremely subtle corner cases.
>>
>> Yes, this is going to be a problem. So we don't change how
>> this_cpu_ptr() works and keep it returning the global address. Because
>> I noticed this may cause confusion for list APIs too. For example,
>> when initializing a list embedded into a percpu variable, the ->next
>> and ->prev will be initialized to global addresses by using
>> per_cpu_ptr(), but if the list is accessed via this_cpu_ptr(), list
>> head will be dereferenced by using local address, then list_empty()
>> will complain, which compare the list head pointer and ->next pointer.
>> This will cause some problems.
>>
>> So we just use the local address for this_cpu_add/sub/inc/dec and so
>> on, which just manipulate a scalar counter.
>
> I wonder how much overhead is caused by calling into the scheduler on
> preempt_enable(). It would be good to get some numbers for something
> like the patch below (also removing the preempt disabling for
> this_cpu_read() as I don't think it matters - a thread cannot
> distinguish whether it was preempted between TPIDR read and variable
> read or immediately after the variable read; we can't do this for writes
> as other threads may notice unexpected updates).
>
> Another wild hack could be to read the kernel instruction at
> (current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
> return false if it's a read from TPIDR_EL1/2, together with removing the
> preempt disabling. Or some other lighter way of detecting this_cpu_*
> constructs without full preemption disabling.
Could a sort of kernel version of restartable sequences help? i.e. detect
preemption instead of preventing it?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 18:41 ` Ryan Roberts
@ 2026-02-12 18:55 ` Christoph Lameter (Ampere)
2026-02-12 18:58 ` Ryan Roberts
2026-02-13 18:42 ` Yang Shi
1 sibling, 1 reply; 21+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-12 18:55 UTC (permalink / raw)
To: Ryan Roberts
Cc: Yang Shi, lsf-pc, Linux MM, dennis, Tejun Heo, urezki,
Catalin Marinas, Will Deacon, Yang Shi
On Thu, 12 Feb 2026, Ryan Roberts wrote:
> This is an interesting idea. I'm keen to be involved in discussions.
Note also that the percpu kernel page tables that are used in this
proposal will also enable additional performance optimizations in the
future
- Kernel text / readonly / readmostly replication for NUMA configurations
to limit the volume of cacheline transfers across an interconnect.
- Per-node memory allocator to optimize access paths and RMW operations
for data that is NUMA node specific.
We think that these optimizations are especially relevant for high core
count and multi-NUMA-node setups, which may be inefficient right now. With
these additional scaling improvements, more distributed SoC designs become
possible.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 18:55 ` Christoph Lameter (Ampere)
@ 2026-02-12 18:58 ` Ryan Roberts
0 siblings, 0 replies; 21+ messages in thread
From: Ryan Roberts @ 2026-02-12 18:58 UTC (permalink / raw)
To: Christoph Lameter (Ampere)
Cc: Yang Shi, lsf-pc, Linux MM, dennis, Tejun Heo, urezki,
Catalin Marinas, Will Deacon, Yang Shi
On 12/02/2026 18:55, Christoph Lameter (Ampere) wrote:
> On Thu, 12 Feb 2026, Ryan Roberts wrote:
>
>> This is an interesting idea. I'm keen to be involved in discussions.
>
> Note also that the percpu kernel page tables that are used in this
> proposal will also enable additional performance optimizations in the
> future
>
> - Kernel text / readonly / readmostly replication for NUMA configurations
> to limit the volume of cacheline transfers across an interconnect.
I'm aware of Russell King's series to map kernel text locally for each node. I
guess that's the shape of what you're describing here?
>
> - Per node memory allocator to optimize access paths and RMV operations
> for data that is NUMA node specific.
>
> We think that these optimizations are especially relevant for high core
> count and high NUMA setups. These may be inefficient right now. With these
> additional scaling improvements more distributed SOC designs become
> possible.
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 18:45 ` Ryan Roberts
@ 2026-02-12 19:36 ` Catalin Marinas
2026-02-12 21:12 ` Ryan Roberts
0 siblings, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2026-02-12 19:36 UTC (permalink / raw)
To: Ryan Roberts
Cc: Yang Shi, Tejun Heo, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Will Deacon, Yang Shi
On Thu, Feb 12, 2026 at 06:45:19PM +0000, Ryan Roberts wrote:
> On 12/02/2026 17:54, Catalin Marinas wrote:
> > On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
> >> On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo <tj@kernel.org> wrote:
> >>> On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
> >>> ...
> >>>> Overhead
> >>>> ========
> >>>> 1. Some extra virtual memory space. But it shouldn’t be too much. I
> >>>> saw 960K with Fedora default kernel config. Given terabytes virtual
> >>>> memory space on 64 bit machine, 960K is negligible.
> >>>> 2. Some extra physical memory for percpu kernel page table. 4K *
> >>>> (nr_cpus – 1) for PGD pages, plus the page tables used by percpu local
> >>>> mapping area. A couple of megabytes with Fedora default kernel config
> >>>> on AmpereOne with 160 cores.
> >>>> 3. Percpu allocation and free will be slower due to extra virtual
> >>>> memory allocation and page table manipulation. However, percpu is
> >>>> allocated by chunk. One chunk typically holds a lot percpu variables.
> >>>> So the slowdown should be negligible. The test result below also
> >>>> proved it.
> > [...]
> >>> One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
> >>> with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
> >>> uses that outside preempt disable block (which is a bit odd but allowed),
> >>> the end result would be surprising. Hmm... I wonder whether it'd be
> >>> worthwhile to keep this_cpu_ptr() returning the global address - ie. make it
> >>> access global offset from local mapping and then return the computed global
> >>> address. This should still be pretty cheap and gets rid of surprising and
> >>> potentially extremely subtle corner cases.
> >>
> >> Yes, this is going to be a problem. So we don't change how
> >> this_cpu_ptr() works and keep it returning the global address. Because
> >> I noticed this may cause confusion for list APIs too. For example,
> >> when initializing a list embedded into a percpu variable, the ->next
> >> and ->prev will be initialized to global addresses by using
> >> per_cpu_ptr(), but if the list is accessed via this_cpu_ptr(), list
> >> head will be dereferenced by using local address, then list_empty()
> >> will complain, which compare the list head pointer and ->next pointer.
> >> This will cause some problems.
> >>
> >> So we just use the local address for this_cpu_add/sub/inc/dec and so
> >> on, which just manipulate a scalar counter.
> >
> > I wonder how much overhead is caused by calling into the scheduler on
> > preempt_enable(). It would be good to get some numbers for something
> > like the patch below (also removing the preempt disabling for
> > this_cpu_read() as I don't think it matters - a thread cannot
> > distinguish whether it was preempted between TPIDR read and variable
> > read or immediately after the variable read; we can't do this for writes
> > as other threads may notice unexpected updates).
> >
> > Another wild hack could be to read the kernel instruction at
> > (current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
> > return false if it's a read from TPIDR_EL1/2, together with removing the
> > preempt disabling. Or some other lighter way of detecting this_cpu_*
> > constructs without full preemption disabling.
>
> Could a sort of kernel version of restartable sequences help? i.e. detect
> preemption instead of preventing it?
Yes, in principle that's what we'd need but it's too expensive to check,
especially as those accessors are inlined.
For the write variants with LL/SC, we can check the TPIDR_EL2 again
between the LDXR and STXR and bail out if it's different from the one
read outside the loop. An interrupt would clear the exclusive monitor
anyway and the STXR would fail. This won't work for the theoretical
this_cpu_read() case.
--
Catalin
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 19:36 ` Catalin Marinas
@ 2026-02-12 21:12 ` Ryan Roberts
2026-02-16 10:37 ` Catalin Marinas
0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2026-02-12 21:12 UTC (permalink / raw)
To: Catalin Marinas
Cc: Yang Shi, Tejun Heo, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Will Deacon, Yang Shi
On 12/02/2026 19:36, Catalin Marinas wrote:
> On Thu, Feb 12, 2026 at 06:45:19PM +0000, Ryan Roberts wrote:
>> On 12/02/2026 17:54, Catalin Marinas wrote:
>>> On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
>>>> On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo <tj@kernel.org> wrote:
>>>>> On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
>>>>> ...
>>>>>> Overhead
>>>>>> ========
>>>>>> 1. Some extra virtual memory space. But it shouldn’t be too much. I
>>>>>> saw 960K with Fedora default kernel config. Given terabytes virtual
>>>>>> memory space on 64 bit machine, 960K is negligible.
>>>>>> 2. Some extra physical memory for percpu kernel page table. 4K *
>>>>>> (nr_cpus – 1) for PGD pages, plus the page tables used by percpu local
>>>>>> mapping area. A couple of megabytes with Fedora default kernel config
>>>>>> on AmpereOne with 160 cores.
>>>>>> 3. Percpu allocation and free will be slower due to extra virtual
>>>>>> memory allocation and page table manipulation. However, percpu is
>>>>>> allocated by chunk. One chunk typically holds a lot percpu variables.
>>>>>> So the slowdown should be negligible. The test result below also
>>>>>> proved it.
>>> [...]
>>>>> One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
>>>>> with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
>>>>> uses that outside preempt disable block (which is a bit odd but allowed),
>>>>> the end result would be surprising. Hmm... I wonder whether it'd be
>>>>> worthwhile to keep this_cpu_ptr() returning the global address - ie. make it
>>>>> access global offset from local mapping and then return the computed global
>>>>> address. This should still be pretty cheap and gets rid of surprising and
>>>>> potentially extremely subtle corner cases.
>>>>
>>>> Yes, this is going to be a problem. So we don't change how
>>>> this_cpu_ptr() works and keep it returning the global address. Because
>>>> I noticed this may cause confusion for list APIs too. For example,
>>>> when initializing a list embedded into a percpu variable, the ->next
>>>> and ->prev will be initialized to global addresses by using
>>>> per_cpu_ptr(), but if the list is accessed via this_cpu_ptr(), list
>>>> head will be dereferenced by using local address, then list_empty()
>>>> will complain, which compare the list head pointer and ->next pointer.
>>>> This will cause some problems.
>>>>
>>>> So we just use the local address for this_cpu_add/sub/inc/dec and so
>>>> on, which just manipulate a scalar counter.
>>>
>>> I wonder how much overhead is caused by calling into the scheduler on
>>> preempt_enable(). It would be good to get some numbers for something
>>> like the patch below (also removing the preempt disabling for
>>> this_cpu_read() as I don't think it matters - a thread cannot
>>> distinguish whether it was preempted between TPIDR read and variable
>>> read or immediately after the variable read; we can't do this for writes
>>> as other threads may notice unexpected updates).
>>>
>>> Another wild hack could be to read the kernel instruction at
>>> (current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
>>> return false if it's a read from TPIDR_EL1/2, together with removing the
>>> preempt disabling. Or some other lighter way of detecting this_cpu_*
>>> constructs without full preemption disabling.
>>
>> Could a sort of kernel version of restartable sequences help? i.e. detect
>> preemption instead of preventing it?
>
> Yes, in principle that's what we'd need but it's too expensive to check,
> especially as those accessors are inlined.
Could we use bit 63 of tpidr_el[12] to indicate "don't preempt"? A sort of
arch-specific preemption-disable mechanism that doesn't require a load/store...
>
> For the write variants with LL/SC, we can check the TPIDR_EL2 again
> between the LDXR and STXR and bail out if it's different from the one
> read outside the loop. An interrupt would clear the exclusive monitor
> anyway and STXR fail. This won't work for the theoretical
> this_cpu_read() case.
Could you clarify that last sentence? - we don't need it to work for
this_cpu_read() because we don't need to disable preemption for that case, right?
Thanks,
Ryan
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 18:43 ` Catalin Marinas
@ 2026-02-13 0:23 ` Yang Shi
0 siblings, 0 replies; 21+ messages in thread
From: Yang Shi @ 2026-02-13 0:23 UTC (permalink / raw)
To: Catalin Marinas
Cc: Tejun Heo, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Will Deacon, Ryan Roberts, Yang Shi
On Thu, Feb 12, 2026 at 10:43 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> More thoughts...
>
> On Thu, Feb 12, 2026 at 05:54:19PM +0000, Catalin Marinas wrote:
> > On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
> > > So we just use the local address for this_cpu_add/sub/inc/dec and so
> > > on, which just manipulate a scalar counter.
> >
> > I wonder how much overhead is caused by calling into the scheduler on
> > preempt_enable(). It would be good to get some numbers for something
> > like the patch below
>
> In case it wasn't obvious, the patch messes up the scheduling, so I
> don't propose it as such, only to get some idea of where the bottleneck
> is. Maybe it could be made to work with some need_resched() checks.
Yeah, I was wondering whether it would break something, because I
noticed the comment right before _pcp_protect().
I saw some confusing results when running the kernel build workload
with the suggested patch, which should be caused by the messed-up
scheduling. I got much more stable results with "page_fault3_processes
-s 20 -t 1" from will-it-scale. That test launches just one process,
which minimizes the impact of the messed-up scheduling. The baseline
is the mainline kernel.
systime improvement:
              baseline    no schedule    no preemption
              1           0.96           0.92
profiling diff (perf diff)
baseline vs no schedule
5.48% -1.40% [kernel.kallsyms] [k] mod_memcg_lruvec_state
baseline vs no preemption
5.48% -2.21% [kernel.kallsyms] [k] mod_memcg_lruvec_state
>
> > (also removing the preempt disabling for
> > this_cpu_read() as I don't think it matters - a thread cannot
> > distinguish whether it was preempted between TPIDR read and variable
> > read or immediately after the variable read; we can't do this for writes
> > as other threads may notice unexpected updates).
>
> There's a theoretical case where even this_cpu_read() needs preemption
> disabling, e.g.:
>
> thread0:
> preempt_disable();
> this_cpu_write(var, unique_val);
> // check that no-one has seen unique_value;
> this_cpu_write(var, other_val);
> preempt_enable();
>
> thread1:
> this_cpu_read(var);
>
> thread1 is not supposed to see the unique_val but it would if it was
> preempted in the middle of the per-cpu op and migrated to another CPU.
I'm not sure whether the kernel makes any decisions based on a counter
read with this_cpu_read(). If it does, reading the wrong CPU's counter
may mess something up.
Thanks,
Yang
>
> > Another wild hack could be to read the kernel instruction at
> > (current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
> > return false if it's a read from TPIDR_EL1/2, together with removing the
> > preempt disabling.
>
> This one also breaks the kernel scheduling just like using
> preempt_enable_no_resched(). It might be possible but in combination
> with additional need_resched() checks.
>
> --
> Catalin
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 18:41 ` Ryan Roberts
2026-02-12 18:55 ` Christoph Lameter (Ampere)
@ 2026-02-13 18:42 ` Yang Shi
2026-02-16 11:39 ` Catalin Marinas
1 sibling, 1 reply; 21+ messages in thread
From: Yang Shi @ 2026-02-13 18:42 UTC (permalink / raw)
To: Ryan Roberts
Cc: lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, Tejun Heo, urezki, Catalin Marinas, Will Deacon,
Yang Shi
On Thu, Feb 12, 2026 at 10:42 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/02/2026 23:14, Yang Shi wrote:
> > Background
> > =========
> > The APIs using this_cpu_*() operate on a local copy of a percpu
> > variable for the current processor. In order to obtain the address of
> > this cpu specific variable a cpu specific offset has to be added to
> > the address.
> > On x86 this address calculation can be created by prefixing an
> > instruction with a segment register. x86 can increment a percpu
> > counter with a single instruction. Since the address calculation and
> > the RMV operation occurs within one instruction it is atomic vs the
> > scheduler. So no preemption is needed.
> > f.e
> > INC %gs:[my_counter]
> > See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more details.
> >
> > ARM64 and some other non-x86 architectures don't have a segment
> > register. The address of the current percpu variable has to be
> > calculated and then that address can be used for an operation on
> > percpu data. This process must be atomic vs the scheduler. Therefore,
> > it is necessary to disable preemption, perform the address calculation
> > and then the increment operation. The cpu specific offset is in a MSR
> > that also needs to be accessed on ARM64. The code flow looks like:
> > Disable preemption
> > Calculate the current CPU copy address by using the offset
> > Manipulate the counter
> > Enable preemption
>
> By massive coincidence, Dev Jain and I have been investigating a large
> regression seen in a munmap micro-benchmark in 6.19, which is root caused to a
> change that ends up using this_cpu_*() a lot more on the path.
>
> We have concluded that we can simplify this_cpu_read() to not bother
> disabling/enabling preemption, since it is read-only and a migration between the
> 2 ops vs after the second op is indistinguishable. I believe Dev is planning to
> post a patch to list soon. This will solve our immediate regression issue.
>
> But we can't do the same trick for ops that write. See [1].
>
> [1] https://lore.kernel.org/all/20190311164837.GD24275@lakrids.cambridge.arm.com/
Thank you for sharing this. We didn't know Mark had worked on it. I
thought about using atomic instructions to generate the address, but I
suspected the cost would be too high. It looks like Mark's attempt
confirms that.
>
> >
> > This process is inefficient relative to x86 and has to be repeated for
> > every access to per cpu data.
> > ARM64 has an increment instruction but this increment does not allow
> > the use of a base register or a segment register like on x86. So an
> > address calculation is always necessary even if the atomic instruction
> > is used.
> > A page table allows us to do remapping of addresses. So if the atomic
> > instruction would be using a virtual address and the page tables for
> > the local processor would map this area to the local per cpu data then
> > we can also create a single instruction on ARM64 (hopefully for some
> > other non-x86 architectures too) and be as efficient as x86 is.
> >
> > So, the code flow should just become:
> > INC VIRTUAL_BASE + percpu_variable_offset
> >
> > In order to do that we need to have the same virtual address mapped
> > differently for each processor. This means we need different page
> > tables for each processor. These page tables
> > can map almost all of the address space in the same way. The only area
> > that will be special is the area starting at VIRTUAL_BASE.
>
> This is an interesting idea. I'm keen to be involved in discussions.
>
> My immediate concern is that this would not be compatible with FEAT_TTCNP, which
> allows multiple PEs (ARM speak for CPU) to share a TLB - e.g. for SMT. I'm not
> sure if that would be the end of the world; the perf numbers below are
> compelling. I'll defer to others' opions on that.
Thank you for joining the discussion. The concern is definitely
valid. The shared TLB sounds like a microarchitecture feature or
design choice; AmpereOne supports CNP but doesn't share a TLB. As long
as it doesn't generate TLB conflict aborts, a shared TLB should be
fine, though it may suffer from more frequent TLB invalidation. Anyway,
I think it is solvable. We can make the percpu page table opt-in for
machines that can handle the TLB conflicts, just like what we did for
bbml2_noabort.
Thanks,
Yang
>
> Thanks,
> Ryan
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-12 21:12 ` Ryan Roberts
@ 2026-02-16 10:37 ` Catalin Marinas
2026-02-18 8:59 ` Ryan Roberts
0 siblings, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2026-02-16 10:37 UTC (permalink / raw)
To: Ryan Roberts
Cc: Yang Shi, Tejun Heo, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Will Deacon, Yang Shi
On Thu, Feb 12, 2026 at 09:12:55PM +0000, Ryan Roberts wrote:
> On 12/02/2026 19:36, Catalin Marinas wrote:
> > On Thu, Feb 12, 2026 at 06:45:19PM +0000, Ryan Roberts wrote:
> >> On 12/02/2026 17:54, Catalin Marinas wrote:
> >>> On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
> >>>> On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo <tj@kernel.org> wrote:
> >>>>> On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
> >>>>> ...
> >>>>>> Overhead
> >>>>>> ========
> >>>>>> 1. Some extra virtual memory space. But it shouldn’t be too much. I
> >>>>>> saw 960K with Fedora default kernel config. Given terabytes virtual
> >>>>>> memory space on 64 bit machine, 960K is negligible.
> >>>>>> 2. Some extra physical memory for percpu kernel page table. 4K *
> >>>>>> (nr_cpus – 1) for PGD pages, plus the page tables used by percpu local
> >>>>>> mapping area. A couple of megabytes with Fedora default kernel config
> >>>>>> on AmpereOne with 160 cores.
> >>>>>> 3. Percpu allocation and free will be slower due to extra virtual
> >>>>>> memory allocation and page table manipulation. However, percpu is
> >>>>>> allocated by chunk. One chunk typically holds a lot percpu variables.
> >>>>>> So the slowdown should be negligible. The test result below also
> >>>>>> proved it.
> >>> [...]
> >>>>> One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
> >>>>> with this_cpu_ptr(). e.g. If there are users that take this_cpu_ptr() and
> >>>>> uses that outside preempt disable block (which is a bit odd but allowed),
> >>>>> the end result would be surprising. Hmm... I wonder whether it'd be
> >>>>> worthwhile to keep this_cpu_ptr() returning the global address - ie. make it
> >>>>> access global offset from local mapping and then return the computed global
> >>>>> address. This should still be pretty cheap and gets rid of surprising and
> >>>>> potentially extremely subtle corner cases.
> >>>>
> >>>> Yes, this is going to be a problem. So we don't change how
> >>>> this_cpu_ptr() works and keep it returning the global address. Because
> >>>> I noticed this may cause confusion for list APIs too. For example,
> >>>> when initializing a list embedded into a percpu variable, the ->next
> >>>> and ->prev will be initialized to global addresses by using
> >>>> per_cpu_ptr(), but if the list is accessed via this_cpu_ptr(), list
> >>>> head will be dereferenced by using local address, then list_empty()
> >>>> will complain, which compare the list head pointer and ->next pointer.
> >>>> This will cause some problems.
> >>>>
> >>>> So we just use the local address for this_cpu_add/sub/inc/dec and so
> >>>> on, which just manipulate a scalar counter.
> >>>
> >>> I wonder how much overhead is caused by calling into the scheduler on
> >>> preempt_enable(). It would be good to get some numbers for something
> >>> like the patch below (also removing the preempt disabling for
> >>> this_cpu_read() as I don't think it matters - a thread cannot
> >>> distinguish whether it was preempted between TPIDR read and variable
> >>> read or immediately after the variable read; we can't do this for writes
> >>> as other threads may notice unexpected updates).
> >>>
> >>> Another wild hack could be to read the kernel instruction at
> >>> (current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
> >>> return false if it's a read from TPIDR_EL1/2, together with removing the
> >>> preempt disabling. Or some other lighter way of detecting this_cpu_*
> >>> constructs without full preemption disabling.
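As a rough sketch of that last idea (the MRS encodings below are my
reading of the Arm ARM - mrs xN, tpidr_el1 is 0xd538d080 | N and
mrs xN, tpidr_el2 is 0xd53cd040 | N - and the helper name is made up;
treat the whole thing as an untested assumption rather than a patch):

#include <linux/types.h>
#include <linux/uaccess.h>

#define MRS_TPIDR_MASK  0xffffffe0U     /* ignore the Rt field */
#define MRS_TPIDR_EL1   0xd538d080U
#define MRS_TPIDR_EL2   0xd53cd040U

static bool insn_before_pc_reads_tpidr(unsigned long pc)
{
        u32 insn;

        /* look at the instruction just before the interrupted one */
        if (get_kernel_nofault(insn, (u32 *)(pc - 4)))
                return false;

        insn &= MRS_TPIDR_MASK;
        return insn == MRS_TPIDR_EL1 || insn == MRS_TPIDR_EL2;
}

arch_irqentry_exit_need_resched() would then return false when this
helper says the interrupted code had just read TPIDR, deferring the
reschedule to the next preemption point.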
> >>
> >> Could a sort of kernel version of restartable sequences help? i.e. detect
> >> preemption instead of preventing it?
> >
> > Yes, in principle that's what we'd need but it's too expensive to check,
> > especially as those accessors are inlined.
>
> Could we use bit 63 of tpidr_el[12] to indicate "don't preempt"? A sort of
> arch-specific preemption-disable mechanism that doesn't require a load/store...
As long as it doesn't nest with interrupts, in which case some refcount
would be needed.
But I need to check Yang's emails to see whether the actual TPIDR access
is problematic.
> > For the write variants with LL/SC, we can check the TPIDR_EL2 again
> > between the LDXR and STXR and bail out if it's different from the one
> > read outside the loop. An interrupt would clear the exclusive monitor
> > anyway and the STXR would fail. This won't work for the theoretical
> > this_cpu_read() case.
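To make that concrete, the LL/SC sequence could look roughly like the
sketch below (using TPIDR_EL1 for simplicity; the kernel would pick
TPIDR_EL1 or TPIDR_EL2 depending on VHE, and the function name is made
up - this is only an illustration of the idea, not a tested
implementation):

#include <linux/types.h>

static inline void this_cpu_add_sketch(unsigned long *pcp, unsigned long val)
{
        unsigned long off, now, tmp, *addr;
        u32 fail;

        do {
                /* per-CPU offset, read outside the LL/SC loop */
                asm volatile("mrs %0, tpidr_el1" : "=r" (off));
                addr = (unsigned long *)((unsigned long)pcp + off);

                asm volatile(
                "       ldxr    %[tmp], %[var]\n"
                "       mrs     %[now], tpidr_el1\n"    /* same CPU? */
                "       cmp     %[now], %[off]\n"
                "       b.ne    1f\n"                   /* migrated: bail out */
                "       add     %[tmp], %[tmp], %[val]\n"
                "       stxr    %w[fail], %[tmp], %[var]\n"
                "       b       2f\n"
                "1:     clrex\n"
                "       mov     %w[fail], #1\n"         /* retry with new offset */
                "2:\n"
                : [tmp] "=&r" (tmp), [now] "=&r" (now),
                  [fail] "=&r" (fail), [var] "+Q" (*addr)
                : [off] "r" (off), [val] "r" (val)
                : "cc");
        } while (fail);
}

If the task migrated between reading the offset and the STXR, either the
TPIDR re-check or the cleared exclusive monitor forces a retry, which
then recomputes the address with the new CPU's offset.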
>
> Could you clarify that last sentence? - we don't need it to work for
> this_cpu_read() because we don't need to disable preemption for that case, right?
Mostly right, but there can be some theoretical scenario where a thread
expects to be the only one running on a CPU and expects any sequence of
modifications to a per-cpu variable to be atomic:
https://lore.kernel.org/r/aY4fQOgyx3meku3b@arm.com
--
Catalin
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-13 18:42 ` Yang Shi
@ 2026-02-16 11:39 ` Catalin Marinas
2026-02-17 17:28 ` Christoph Lameter (Ampere)
0 siblings, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2026-02-16 11:39 UTC (permalink / raw)
To: Yang Shi
Cc: Ryan Roberts, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, Tejun Heo, urezki, Will Deacon, Yang Shi
On Fri, Feb 13, 2026 at 10:42:21AM -0800, Yang Shi wrote:
> On Thu, Feb 12, 2026 at 10:42 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > On 11/02/2026 23:14, Yang Shi wrote:
> > > So, the code flow should just become:
> > > INC VIRTUAL_BASE + percpu_variable_offset
> > >
> > > In order to do that we need to have the same virtual address mapped
> > > differently for each processor. This means we need different page
> > > tables for each processor. These page tables
> > > can map almost all of the address space in the same way. The only area
> > > that will be special is the area starting at VIRTUAL_BASE.
> >
> > This is an interesting idea. I'm keen to be involved in discussions.
> >
> > My immediate concern is that this would not be compatible with FEAT_TTCNP, which
> > allows multiple PEs (ARM speak for CPU) to share a TLB - e.g. for SMT. I'm not
> > sure if that would be the end of the world; the perf numbers below are
> > compelling. I'll defer to others' opinions on that.
>
> Thank you for joining the discussion. The concern is definitely valid.
> The shared TLB sounds like a microarchitecture feature or design
> choice. AmpereOne supports CNP but doesn't share the TLB. As long as
> sharing doesn't generate a TLB conflict abort, a shared TLB should be
> fine, though it may suffer from frequent TLB invalidation. Anyway, I
> think it should be solvable. We can make the percpu page table opt-in
> for machines that can handle TLB conflicts, just like what we did for
> bbml2_noabort.
It's not about TLB conflicts but rather using the wrong translation for
a per-CPU variable with CnP.
--
Catalin
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-16 11:39 ` Catalin Marinas
@ 2026-02-17 17:28 ` Christoph Lameter (Ampere)
2026-02-18 9:18 ` Ryan Roberts
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-17 17:28 UTC (permalink / raw)
To: Catalin Marinas
Cc: Yang Shi, Ryan Roberts, lsf-pc, Linux MM, dennis, Tejun Heo,
urezki, Will Deacon, Yang Shi
On Mon, 16 Feb 2026, Catalin Marinas wrote:
> It's not about TLB conflicts but rather using the wrong translation for
> a per-CPU variable with CnP.
These conflicts could only come about if each PE were unable to have its
own page table and had to share page tables with other processors.
That is not the case from what I can tell. The ARM64 code does not
support a shared page table, and if I remember right, Windows actually
requires a per-processor page table.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-16 10:37 ` Catalin Marinas
@ 2026-02-18 8:59 ` Ryan Roberts
0 siblings, 0 replies; 21+ messages in thread
From: Ryan Roberts @ 2026-02-18 8:59 UTC (permalink / raw)
To: Catalin Marinas
Cc: Yang Shi, Tejun Heo, lsf-pc, Linux MM, Christoph Lameter (Ampere),
dennis, urezki, Will Deacon, Yang Shi
On 16/02/2026 10:37, Catalin Marinas wrote:
> On Thu, Feb 12, 2026 at 09:12:55PM +0000, Ryan Roberts wrote:
>> On 12/02/2026 19:36, Catalin Marinas wrote:
>>> On Thu, Feb 12, 2026 at 06:45:19PM +0000, Ryan Roberts wrote:
>>>> On 12/02/2026 17:54, Catalin Marinas wrote:
>>>>> On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
>>>>>> On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo <tj@kernel.org> wrote:
>>>>>>> On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
>>>>>>> ...
>>>>>>>> Overhead
>>>>>>>> ========
>>>>>>>> 1. Some extra virtual memory space, but it shouldn’t be too much. I
>>>>>>>> saw 960K with the Fedora default kernel config. Given the terabytes of
>>>>>>>> virtual memory space on a 64-bit machine, 960K is negligible.
>>>>>>>> 2. Some extra physical memory for the percpu kernel page tables: 4K *
>>>>>>>> (nr_cpus - 1) for PGD pages, plus the page tables used by the percpu
>>>>>>>> local mapping area. A couple of megabytes with the Fedora default
>>>>>>>> kernel config on AmpereOne with 160 cores.
>>>>>>>> 3. Percpu allocation and free will be slower due to the extra virtual
>>>>>>>> memory allocation and page table manipulation. However, percpu memory
>>>>>>>> is allocated by chunk, and one chunk typically holds a lot of percpu
>>>>>>>> variables, so the slowdown should be negligible. The test results
>>>>>>>> below also prove it.
>>>>> [...]
>>>>>>> One property that this breaks is per_cpu_ptr() of a given CPU disagreeing
>>>>>>> with this_cpu_ptr(). E.g. if there are users that take this_cpu_ptr() and
>>>>>>> use it outside a preempt-disable block (which is a bit odd but allowed),
>>>>>>> the end result would be surprising. Hmm... I wonder whether it'd be
>>>>>>> worthwhile to keep this_cpu_ptr() returning the global address - ie. make it
>>>>>>> access global offset from local mapping and then return the computed global
>>>>>>> address. This should still be pretty cheap and gets rid of surprising and
>>>>>>> potentially extremely subtle corner cases.
>>>>>>
>>>>>> Yes, this is going to be a problem, so we don't change how
>>>>>> this_cpu_ptr() works and keep it returning the global address. I
>>>>>> noticed this may cause confusion for the list APIs too. For example,
>>>>>> when initializing a list embedded in a percpu variable, ->next and
>>>>>> ->prev are initialized to global addresses via per_cpu_ptr(); but if
>>>>>> the list is later accessed via this_cpu_ptr(), the list head would be
>>>>>> dereferenced through the local address, and list_empty(), which
>>>>>> compares the list head pointer with the ->next pointer, would give the
>>>>>> wrong result. This would cause some problems.
>>>>>>
>>>>>> So we just use the local address for this_cpu_add/sub/inc/dec and so
>>>>>> on, which just manipulate a scalar counter.
>>>>>
>>>>> I wonder how much overhead is caused by calling into the scheduler on
>>>>> preempt_enable(). It would be good to get some numbers for something
>>>>> like the patch below (also removing the preempt disabling for
>>>>> this_cpu_read() as I don't think it matters - a thread cannot
>>>>> distinguish whether it was preempted between TPIDR read and variable
>>>>> read or immediately after the variable read; we can't do this for writes
>>>>> as other threads may notice unexpected updates).
>>>>>
>>>>> Another wild hack could be to read the kernel instruction at
>>>>> (current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
>>>>> return false if it's a read from TPIDR_EL1/2, together with removing the
>>>>> preempt disabling. Or some other lighter way of detecting this_cpu_*
>>>>> constructs without full preemption disabling.
>>>>
>>>> Could a sort of kernel version of restartable sequences help? i.e. detect
>>>> preemption instead of preventing it?
>>>
>>> Yes, in principle that's what we'd need but it's too expensive to check,
>>> especially as those accessors are inlined.
>>
>> Could we use bit 63 of tpidr_el[12] to indicate "don't preempt"? A sort of
>> arch-specific preemption-disable mechanism that doesn't require a load/store...
>
> As long as it doesn't nest with interrupts, in which case some refcount
> would be needed.
>
> But I need to check Yang's emails to see whether the actual TPIDR access
> is problematic.
We can't set the bit atomically, so we could still be preempted between
the read and the write-back. So this is no good. Ignore my rambling...
>
>>> For the write variants with LL/SC, we can check the TPIDR_EL2 again
>>> between the LDXR and STXR and bail out if it's different from the one
>>> read outside the loop. An interrupt would clear the exclusive monitor
>>> anyway and the STXR would fail. This won't work for the theoretical
>>> this_cpu_read() case.
>>
>> Could you clarify that last sentence? - we don't need it to work for
>> this_cpu_read() because we don't need to disable preemption for that case, right?
>
> Mostly right, but there can be some theoretical scenario where a thread
> expects to be the only one running on a CPU and expects any sequence of
> modifications to a per-cpu variable to be atomic:
>
> https://lore.kernel.org/r/aY4fQOgyx3meku3b@arm.com
Yeah, I saw that after posting this. Thanks for the education :)
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
2026-02-17 17:28 ` Christoph Lameter (Ampere)
@ 2026-02-18 9:18 ` Ryan Roberts
0 siblings, 0 replies; 21+ messages in thread
From: Ryan Roberts @ 2026-02-18 9:18 UTC (permalink / raw)
To: Christoph Lameter (Ampere), Catalin Marinas
Cc: Yang Shi, lsf-pc, Linux MM, dennis, Tejun Heo, urezki,
Will Deacon, Yang Shi
On 17/02/2026 17:28, Christoph Lameter (Ampere) wrote:
> On Mon, 16 Feb 2026, Catalin Marinas wrote:
>
>> It's not about TLB conflicts but rather using the wrong translation for
>> a per-CPU variable with CnP.
>
> These conflicts could only come about if each PE would not be able to have
> its own page table but share page tables between processors.
>
> That is not the case from what I can tell. The ARM64 code does not
> support a shared page table and if I remember right Windows actually requires a per
> processor page table.
Each PE can be configured to use its own page table; I'm not denying
that. It is definitely possible to implement what you are proposing.
However, there is a feature called FEAT_TTCNP, which allows SW to hint
to the PE that the entries in its pgtable are the same as the entries in
another PE's pgtable. If the hint (bit 0 in TTBR1) is applied for
multiple PEs, then those PEs are permitted to share a TLB.
Today, if the kernel is compiled with CNP support and the CPU supports it, then
the bit is set.
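For reference, the hint is just bit 0 of the TTBR value (the arm64
kernel names it TTBR_CNP_BIT, if I remember correctly; the helper below
is only an illustration, not the actual enable path):

#include <linux/types.h>

#define TTBR_CNP_BIT    (1UL << 0)      /* Common not Private */

static inline u64 ttbr1_with_cnp(u64 ttbr1)
{
        /*
         * Tell the PE that this TTBR1's translations are common with
         * other PEs, permitting them to share TLB entries for those
         * translations.
         */
        return ttbr1 | TTBR_CNP_BIT;
}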
AIUI, there are a number of CPUs out there that can share TLB to some extent and
do take advantage of this feature.
Your proposed per-CPU pgtable would be incompatible with CNP because if PE0 and
PE1 share a TLB, PE1 would end up mis-translating a per-CPU VA to PE0's copy if
PE0 previously accessed it and caused the (shared) TLB entry to be created.
So I believe there will be a performance cost for some CPUs if we take your
proposed approach and disable CNP. We would need to evaluate that cost and
decide which of the two mutually exclusive features provides more value.
Thanks,
Ryan
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2026-02-18 9:18 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-11 23:14 [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures) Yang Shi
2026-02-11 23:29 ` Tejun Heo
2026-02-11 23:39 ` Christoph Lameter (Ampere)
2026-02-11 23:40 ` Tejun Heo
2026-02-12 0:05 ` Christoph Lameter (Ampere)
2026-02-11 23:58 ` Yang Shi
2026-02-12 17:54 ` Catalin Marinas
2026-02-12 18:43 ` Catalin Marinas
2026-02-13 0:23 ` Yang Shi
2026-02-12 18:45 ` Ryan Roberts
2026-02-12 19:36 ` Catalin Marinas
2026-02-12 21:12 ` Ryan Roberts
2026-02-16 10:37 ` Catalin Marinas
2026-02-18 8:59 ` Ryan Roberts
2026-02-12 18:41 ` Ryan Roberts
2026-02-12 18:55 ` Christoph Lameter (Ampere)
2026-02-12 18:58 ` Ryan Roberts
2026-02-13 18:42 ` Yang Shi
2026-02-16 11:39 ` Catalin Marinas
2026-02-17 17:28 ` Christoph Lameter (Ampere)
2026-02-18 9:18 ` Ryan Roberts
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox