On Sat, 28 Feb 2026 at 19:27, Linus Torvalds wrote:
>
> This attached patch is ENTIRELY UNTESTED.

Here's a slightly cleaned up and further simplified version, which is
also actually tested, although only in the "it boots for me" sense.

It generates good code at least with clang:

  .LBB76_7:
        movl    $1, %eax
  .LBB76_8:
        leal    1(%rax), %ecx
        lock cmpxchgl   %ecx, 52(%rdi)
        sete    %cl
        je      .LBB76_10
        testl   %eax, %eax
        jne     .LBB76_8
  .LBB76_10:

which actually looks both simple and fairly optimal for that sequence.

Of course, since this is very much about cacheline access patterns,
actual performance will depend on random microarchitectural issues
(and not just the CPU core, but the whole memory subsystem).

Can somebody with a good - and relevant - benchmark system try this out?

              Linus
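
For readers following the listing above, here is a rough C sketch of the loop shape the assembly appears to implement: start from an optimistic guess of 1, try a compare-and-exchange to guess + 1, and on failure retry with the value actually observed unless that value is zero. This is not the patch itself. The function name inc_not_zero_guess_one() and the use of C11 <stdatomic.h> are assumptions made for illustration; the kernel has its own atomic primitives and memory-ordering choices, which may differ from the sequentially consistent default used here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not from the patch): increment *count unless it
 * is zero. Returns true if the increment happened.
 */
bool inc_not_zero_guess_one(atomic_int *count)
{
    int old = 1;    /* optimistic guess, like "movl $1, %eax" */

    do {
        /*
         * On failure, 'old' is overwritten with the value currently in
         * *count, much as cmpxchg leaves the observed value in %eax.
         */
        if (atomic_compare_exchange_strong(count, &old, old + 1))
            return true;
    } while (old != 0);     /* like "testl %eax, %eax; jne" */

    return false;           /* counter was zero: leave it alone */
}

/* Small usage sketch: the guess of 1 hits on the first attempt. */
void example_usage(void)
{
    atomic_int refs = 1;

    assert(inc_not_zero_guess_one(&refs));
    assert(atomic_load(&refs) == 2);
}
```

When the counter really is 1, the first compare-and-exchange succeeds without a separate load of the cacheline beforehand, which is presumably the access pattern the benchmark request is about.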