From: Yang Shi <shy828301@gmail.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: lsf-pc@lists.linux-foundation.org, Linux MM <linux-mm@kvack.org>,
"Christoph Lameter (Ampere)" <cl@gentwo.org>,
dennis@kernel.org, Tejun Heo <tj@kernel.org>,
urezki@gmail.com, Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Yang Shi <yang@os.amperecomputing.com>
Subject: Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
Date: Fri, 13 Feb 2026 10:42:21 -0800
Message-ID: <CAHbLzkqRjuWg7ybdkuv5xdoPz7z_q2n+0Eh0pRNoVxq-WskRvA@mail.gmail.com>
In-Reply-To: <5a648f49-97b1-4195-a825-47f3261225eb@arm.com>
On Thu, Feb 12, 2026 at 10:42 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/02/2026 23:14, Yang Shi wrote:
> > Background
> > =========
> > The APIs using this_cpu_*() operate on a local copy of a percpu
> > variable for the current processor. In order to obtain the address of
> > this cpu specific variable a cpu specific offset has to be added to
> > the address.
> > On x86 this address calculation can be created by prefixing an
> > instruction with a segment register. x86 can increment a percpu
> > counter with a single instruction. Since the address calculation and
> > the RMW operation occur within one instruction, it is atomic vs the
> > scheduler, so no preemption disabling is needed.
> > e.g.
> > INC %gs:[my_counter]
> > See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more details.
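For concreteness, the x86 fast path looks roughly like this (a sketch,
not the exact arch/x86 percpu macros):

  DEFINE_PER_CPU(unsigned long, my_counter);

  static inline void my_counter_inc(void)
  {
          /* %gs points at this CPU's percpu area, so one insn does it all */
          asm volatile("incq %%gs:%0" : "+m" (my_counter));
  }

The offset add and the RMW happen inside one instruction, so neither
preemption nor interrupts can split them.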
> >
> > ARM64 and some other non-x86 architectures don't have a segment
> > register. The address of the current percpu variable has to be
> > calculated and then that address can be used for an operation on
> > percpu data. This process must be atomic vs the scheduler. Therefore,
> > it is necessary to disable preemption, perform the address calculation
> > and then the increment operation. The cpu specific offset is held in a
> > system register that also needs to be read on ARM64. The code flow looks like:
> > Disable preemption
> > Calculate the current CPU copy address by using the offset
> > Manipulate the counter
> > Enable preemption
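In code, every such access ends up roughly like this (a simplified
sketch, not the exact asm-generic/arm64 implementation):

  unsigned long *ptr;

  preempt_disable();
  ptr = raw_cpu_ptr(&my_counter);  /* base + per-CPU offset from TPIDR_EL1 */
  *ptr += 1;                       /* the real code uses an atomic RMW
                                      (LSE/LL-SC) so the op is also IRQ-safe */
  preempt_enable();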
>
> By massive coincidence, Dev Jain and I have been investigating a large
> regression seen in a munmap micro-benchmark in 6.19, which is root caused to a
> change that ends up using this_cpu_*() a lot more on the path.
>
> We have concluded that we can simplify this_cpu_read() to not bother
> disabling/enabling preemption, since it is read-only and a migration between
> the two ops is indistinguishable from a migration after the second op. I
> believe Dev is planning to post a patch to the list soon. This will solve our
> immediate regression issue.
>
> But we can't do the same trick for ops that write. See [1].
>
> [1] https://lore.kernel.org/all/20190311164837.GD24275@lakrids.cambridge.arm.com/
Thank you for sharing this. We didn't know Mark had worked on this
before. I had thought about using an atomic instruction to generate the
address, but suspected the cost would be too high. It looks like Mark's
attempt confirms that suspicion.
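If I read the read-only idea right, the relaxation is roughly this
(sketch only, hypothetical name, not Dev's actual patch):

  /*
   * Safe without preempt_disable(): if we migrate between computing the
   * per-CPU address and the load, we read the previous CPU's copy, which
   * is indistinguishable from having done the load just before migrating.
   */
  #define this_cpu_read_relaxed(pcp)  READ_ONCE(*raw_cpu_ptr(&(pcp)))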
>
> >
> > This process is inefficient relative to x86 and has to be repeated for
> > every access to per cpu data.
> > ARM64 has an atomic increment instruction, but it does not allow a
> > base-plus-offset addressing mode or a segment register like on x86. So an
> > address calculation is always necessary even if the atomic instruction
> > is used.
> > Page tables allow us to remap addresses. So if the atomic instruction
> > used a fixed virtual address, and each processor's page tables mapped
> > that area to its local per cpu data, then we could get down to a single
> > instruction on ARM64 too (hopefully on some other non-x86
> > architectures as well) and be as efficient as x86.
> >
> > So, the code flow should just become:
> > INC VIRTUAL_BASE + percpu_variable_offset
> >
> > In order to do that we need to have the same virtual address mapped
> > differently for each processor. This means we need different page
> > tables for each processor. These page tables
> > can map almost all of the address space in the same way. The only area
> > that will be special is the area starting at VIRTUAL_BASE.
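To make the intended end state concrete, the increment could then look
something like this (a sketch, assuming LSE atomics; PERCPU_ALIAS_BASE
and my_counter_offset are hypothetical names, not an actual implementation):

  static inline void my_counter_inc(void)
  {
          /* hypothetical: every CPU maps PERCPU_ALIAS_BASE to its own area */
          unsigned long *p = (unsigned long *)(PERCPU_ALIAS_BASE +
                                               my_counter_offset);

          /* one IRQ- and preemption-safe RMW, no per-CPU offset read needed */
          asm volatile("stadd %[i], %[v]"
                       : [v] "+Q" (*p)
                       : [i] "r" (1UL)
                       : "memory");
  }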
>
> This is an interesting idea. I'm keen to be involved in discussions.
>
> My immediate concern is that this would not be compatible with FEAT_TTCNP, which
> allows multiple PEs (ARM speak for CPU) to share a TLB - e.g. for SMT. I'm not
> sure if that would be the end of the world; the perf numbers below are
> compelling. I'll defer to others' opinions on that.
Thank you for joining the discussion. The concern is definitely valid.
A shared TLB sounds like a microarchitecture feature or design choice:
AmpereOne supports CNP but doesn't share a TLB. As long as it doesn't
generate TLB conflict aborts, a shared TLB should be fine, though it may
suffer from frequent TLB invalidations. Anyway, I think it should be
solvable. We can make the per-CPU page table opt-in on machines that can
handle TLB conflicts, just like what we did for bbml2_noabort.
Thanks,
Yang
>
> Thanks,
> Ryan
>