From: Yang Shi <shy828301@gmail.com>
To: lsf-pc@lists.linux-foundation.org, Linux MM <linux-mm@kvack.org>,
"Christoph Lameter (Ampere)" <cl@gentwo.org>,
dennis@kernel.org, Tejun Heo <tj@kernel.org>,
urezki@gmail.com, Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>, Yang Shi <shy828301@gmail.com>
Subject: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
Date: Wed, 11 Feb 2026 15:14:57 -0800
Message-ID: <CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com>

Background
=========
The this_cpu_*() APIs operate on the local copy of a percpu variable
for the current processor. In order to obtain the address of this
CPU-specific variable, a CPU-specific offset has to be added to the
variable's address.

On x86 this address calculation can be folded into the instruction
itself by prefixing it with a segment register, so x86 can increment
a percpu counter with a single instruction. Since the address
calculation and the RMW operation occur within one instruction, the
sequence is atomic vs the scheduler and no preemption disabling is
needed. For example:

INC %gs:[my_counter]
See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more details.
ARM64 and some other non-x86 architectures don't have a segment
register. The address of the current CPU's percpu variable has to be
calculated first, and only then can that address be used for an
operation on the percpu data. This sequence must be atomic vs the
scheduler, so it is necessary to disable preemption, perform the
address calculation and then the increment. The CPU-specific offset
also lives in an MSR that needs to be read on ARM64. The code flow
looks like:
Disable preemption
Calculate the current CPU copy address by using the offset
Manipulate the counter
Enable preemption
This process is inefficient relative to x86 and has to be repeated
for every access to percpu data.
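Concretely, that sequence corresponds to something like the following
C sketch (illustrative only; the real kernel macros differ in
detail):

DEFINE_PER_CPU(unsigned long, my_counter);

preempt_disable();
/* add this CPU's offset (read from an MSR on ARM64) to the percpu
 * cookie to get the address of this CPU's copy */
unsigned long *p = raw_cpu_ptr(&my_counter);
*p += 1;	/* the RMW itself */
preempt_enable();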
ARM64 does have atomic increment instructions, but they do not allow
folding in a segment-register-style base the way x86 does. So an
address calculation is always necessary even when the atomic
instruction is used.
Page tables allow us to remap addresses. So if the atomic instruction
used a fixed virtual address, and the local processor's page tables
mapped that area to the local percpu data, then we could also get
down to a single instruction on ARM64 (hopefully on some other
non-x86 architectures too) and be as efficient as x86. The code flow
would just become:

INC VIRTUAL_BASE + percpu_variable_offset
In order to do that, we need the same virtual address to be mapped
differently on each processor, which means each processor needs its
own page tables. These page tables can map almost all of the address
space identically; the only special area is the one starting at
VIRTUAL_BASE.
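With such a mapping in place, the fast path reduces to an access
through a fixed address. A hypothetical C sketch (PCPU_LOCAL_BASE is
an assumed name for VIRTUAL_BASE, not an existing kernel symbol):

/* the same virtual address is used on every CPU; each CPU's page
 * table steers it to that CPU's own copy, so neither preemption
 * disabling nor a per-access offset read is needed */
#define my_cpu_ptr(off)	((void *)(PCPU_LOCAL_BASE + (unsigned long)(off)))

*(unsigned long *)my_cpu_ptr(off) += 1;	/* the arch code would emit a
					 * single atomic RMW here,
					 * e.g. LDADD on ARM64 */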
In addition, percpu counters can also be accessed from other CPUs via
the per_cpu_ptr() APIs. This is usually done by counter
initialization code. For example:

for_each_possible_cpu(cpu) {
	p = per_cpu_ptr(ptr, cpu);
	initialize(p);
}

Percpu allocator
=============
When alloc_percpu() is called, the kernel allocates a contiguous
virtual memory area, called a "chunk", from the vmalloc area. The
chunk looks like:

| CPU 0 | CPU 1 | …… | CPU n |

The size of the chunk is percpu_unit_size * nr_cpus. The kernel then
maps the chunk to physical memory and returns an offset to the
caller.
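For reference, typical use of the existing API looks like this (a
hedged sketch; struct percpu_stat is just an illustrative type):

struct percpu_stat { unsigned long cnt; };
struct percpu_stat __percpu *stat;
unsigned long total = 0;
int cpu;

stat = alloc_percpu(struct percpu_stat);  /* returns an offset-style cookie */
this_cpu_inc(stat->cnt);                  /* fast path: this CPU's copy */
for_each_possible_cpu(cpu)                /* slow path: walk every copy */
	total += per_cpu_ptr(stat, cpu)->cnt;
free_percpu(stat);
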
Design
======
To improve the performance of this_cpu ops on ARM64 and potentially
some other non-x86 architectures, Christoph Lameter and I propose the
solution below.

To remove the preemption disable/enable, we need to guarantee that
the this_cpu_*() APIs convert the offset returned by alloc_percpu()
to a pointer that is the same on all CPUs, without breaking the
per_cpu_ptr() use case.
To achieve this, we modify the percpu allocator to allocate extra
virtual memory beyond the virtual memory area shown in the diagram
above. The size of the extra allocation is percpu_unit_size. The
this_cpu_*() APIs convert the offset returned by alloc_percpu() to a
pointer into this extra area, and that pointer is the same on all
CPUs. To simplify the discussion, I call the extra area the "local
mapping" and the original area the "global mapping". The percpu chunk
then looks like:

| CPU 0 | CPU 1 | …… | CPU n | xxxxxxxxx |       CPU        |
|<------ global mapping ---->|           |<- local mapping ->|

The this_cpu_*() APIs access only the local mapping; the
per_cpu_ptr() APIs continue to use the global mapping.
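The two paths then differ only in which base address is added to the
offset. A sketch of the address math (the base names here are
assumptions for this proposal, not existing kernel symbols):

/* per_cpu_ptr(ptr, cpu): global mapping, a different address per CPU */
addr = global_base + cpu * percpu_unit_size + off;

/* this_cpu_*(): local mapping, the identical address on every CPU;
 * each CPU's page table steers it to that CPU's physical copy */
addr = local_base + off;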
On each CPU, the local mapping must map to different physical memory
(the same physical memory already mapped by the global mapping, so no
extra physical memory needs to be allocated) in order to manipulate
the right copy. This can be achieved with a percpu kernel page table
in arch-dependent code: each CPU sees its own copy of the kernel page
table instead of sharing one single kernel page table. However, most
of the page table contents can still be shared; only the area
covering the percpu local mapping differs. So the CPUs can basically
share the PUD/PMD/PTE levels and differ only in the PGD.
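A hedged sketch of the arch-side setup under these assumptions
(pcpu_pgd[] and map_local_unit() are hypothetical names;
swapper_pg_dir is the existing shared kernel PGD on ARM64):

for_each_possible_cpu(cpu) {
	/* clone the shared kernel PGD for this CPU ... */
	memcpy(pcpu_pgd[cpu], swapper_pg_dir, PGD_SIZE);
	/* ... then point the PGD entries covering the local mapping
	 * area at page tables that map it to this CPU's unit */
	map_local_unit(pcpu_pgd[cpu], cpu);
}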
The kernel also maintains a base address for the global mapping in
order to convert the offset returned by alloc_percpu() to the correct
pointer. The local mapping needs a base address too, and the offset
between the local mapping base address and an allocated area in the
local mapping must be the same as the offset returned by
alloc_percpu(). So the local mapping has to live in a specific
address range; this may require a dedicated percpu local mapping area
which can't be used by vmalloc(), in order to avoid conflicts.
I have done a PoC on ARM64 and hope to post it to the mailing list
before the conference to ease the discussion.

Overhead
========
1. Some extra virtual address space, but not much: I saw 960K with
the default Fedora kernel config. Given terabytes of virtual address
space on a 64-bit machine, 960K is negligible.
2. Some extra physical memory for the percpu kernel page tables:
4K * (nr_cpus - 1) for PGD pages (about 636K on a 160-core machine),
plus the page tables used for the percpu local mapping area. A couple
of megabytes in total with the default Fedora kernel config on
AmpereOne with 160 cores.
3. Percpu allocation and free will be slower due to the extra virtual
memory allocation and page table manipulation. However, percpu memory
is allocated in chunks, and one chunk typically holds a lot of percpu
variables, so the slowdown should be negligible. The test results
below bear this out.

Performance Test
==============
The PoC is on ARM64, so all tests were run on AmpereOne with 160
cores.
1. Kernel build
--------------------
Ran a kernel build (make -j160) with the default Fedora kernel config
in a memcg. Roughly 13% - 15% systime improvement.
2. stress-ng
----------------
stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
6% improvement in systime.
3. vm-scalability
----------------------
Single digit (0 - 8%) improvement in systime for some vm-scalability
test cases.
4. will-it-scale
------------------
3% - 8% improvement for pagefault cases from will-it-scale
Profiling page_fault3_processes from will-it-scale also shows the
reduction in percpu counter manipulation (perf diff output):
5.91% -1.82% [kernel.kallsyms] [k] mod_memcg_lruvec_state
2.84% -1.30% [kernel.kallsyms] [k] percpu_counter_add_batch
Regression Test
=============
Created 10K cgroups. Creating a cgroup calls the percpu allocator
multiple times; for example, creating one memcg allocates a percpu
refcnt, rstat, and an objcg percpu refcnt.

This consumed 2112K more virtual memory for the percpu local mapping,
plus a few more megabytes for the percpu page tables that map the
local mapping. The memory consumption depends on the number of CPUs.
Execution time is basically the same; no noticeable regression was
found. The profiling shows (perf diff):

0.35%  -0.33%  [kernel.kallsyms]  [k] percpu_ref_get_many
0.61%  -0.30%  [kernel.kallsyms]  [k] percpu_counter_add_batch
0.34%  +0.02%  [kernel.kallsyms]  [k] pcpu_alloc_noprof
0.00%  +0.05%  [kernel.kallsyms]  [k] free_percpu.part.0
The gain from manipulating percpu counters outweighs the slowdown
from percpu allocation and free; there is even a small net gain.

Future usecases
=============
A percpu page table may also unlock other use cases, for example
kernel text replication, off the top of my head. Anyway, this is not
the main point of this proposal.

Key attendees
===========
This work will require changes to the percpu allocator, to vmalloc
(just a new interface that takes a pgdir pointer as an argument) and
to arch-dependent code (the percpu page table implementation is
arch-specific). So the percpu allocator maintainers, vmalloc
maintainers and arch experts (for example, ARM64) should be key
attendees. I don't know who can attend, so I just list all of them:
Christoph Lameter (co-presenter and percpu allocator maintainer)
Dennis Zhou/Tejun Heo (percpu allocator maintainer)
Uladzislau Rezki (vmalloc maintainer)
Catalin Marinas/Will Deacon/Ryan Roberts (ARM64 memory management)
Thanks,
Yang