On Sat, 28 Feb 2026 at 19:27, Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> This attached patch is ENTIRELY UNTESTED.

Here's a slightly cleaned up and further simplified version, which is
also actually tested, although only in the "it boots for me" sense.

It generates good code at least with clang:

  .LBB76_7:
          movl    $1, %eax
  .LBB76_8:
          leal    1(%rax), %ecx
          lock cmpxchgl   %ecx, 52(%rdi)
          sete    %cl
          je      .LBB76_10
          testl   %eax, %eax
          jne     .LBB76_8
  .LBB76_10:

which actually looks both simple and fairly optimal for that sequence.
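
Read back into C, that asm looks like an "increment unless zero" loop
that guesses the old value is 1 instead of loading it first. Below is a
rough userspace sketch of that shape using C11 atomics. It is only a
reading of the asm above, not the code from the patch: the name
inc_not_zero_guess_one() and the bare atomic_uint counter are made up
for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not the function from the patch.
 * Returns true if the counter was incremented, false if it was zero.
 */
static bool inc_not_zero_guess_one(atomic_uint *cnt)
{
	unsigned int old = 1;	/* optimistic guess: movl $1, %eax */

	do {
		/*
		 * lock cmpxchgl: on failure, 'old' is overwritten with
		 * the value actually found in memory.
		 */
		if (atomic_compare_exchange_strong(cnt, &old, old + 1))
			return true;
	} while (old != 0);	/* testl %eax, %eax ; jne */

	return false;
}
```

In this sketch the uncontended 1 -> 2 case costs a single locked
cmpxchg with no prior load, and a wrong guess costs one extra trip
through the loop with the value the failed cmpxchg returned.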

Of course, since this is very much about cacheline access patterns,
actual performance will depend on random microarchitectural issues
(and not just the CPU core, but the whole memory subsystem).

Can somebody with a good - and relevant - benchmark system try this out?

               Linus