On Sat, 28 Feb 2026 at 14:19, Andrew Morton wrote:
>
> Well it's nice to see the performance benefits from Kiryl's ill-fated
> patch
> (https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/)
>
> And this approach looks far simpler.

This attempt does something completely different, in that it doesn't
actually remove any atomics at all.

Quite the opposite, in fact. It adds *new* atomics - just in a different
place.

But if it helps performance, that is certainly still interesting. It's
basically saying that it's not the atomic op itself that is so expensive,
it's literally just the "read + cmpxchg" in atomic_add_unless() that
accounts for most of the expense.

And that, in turn, is probably due to the fact that the read in that loop
first brings the cacheline in shared, and then the cmpxchg has to turn the
shared cacheline exclusive, so you have two cache operations - and possibly
then many more because of the bouncing this all causes.

Fine, I can believe that.

But if it's purely about the cacheline shared/exclusive behavior, I think
there's a much simpler patch.

That much simpler patch is something we've done before: do *not* read the
old value before the cmpxchg loop. Do the cmpxchg with a default value,
and if we guessed wrong, just do the extra loop iteration.

The attached patch is ENTIRELY UNTESTED. I might have gotten something
wrong. A quick look at the assembler seems to say it generates the
expected code (gcc is not great at this), with the loop being

        mov    $0x1,%eax
        lea    0x34(%rdi),%rdx
        lea    0x1(%rax),%ecx
        lock cmpxchg %ecx,(%rdx)
        ...

ie the first time through we just assume the count is one. And yes, that
assumption may be wrong, but at least we don't go through the shared
state, and since we got the cacheline exclusive the first time around the
loop, the second time around we will get it right.

What do the numbers look like with this much simpler patch?
(All assuming I didn't screw some logic up and get some conditional the
wrong way around - please check me).

               Linus