linux-mm.kvack.org archive mirror
From: Pedro Falcato <pfalcato@suse.de>
To: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 Gladyshev Ilya <gladyshev.ilya1@h-partners.com>,
	David Hildenbrand <david@kernel.org>,
	 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	 Vlastimil Babka <vbabka@suse.cz>,
	Mike Rapoport <rppt@kernel.org>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Zi Yan <ziy@nvidia.com>,
	 Harry Yoo <harry.yoo@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Yu Zhao <yuzhao@google.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	 Alistair Popple <apopple@nvidia.com>,
	Gorbunov Ivan <gorbunov.ivan@h-partners.com>,
	 Muchun Song <muchun.song@linux.dev>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 Kiryl Shutsemau <kirill@shutemov.name>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH 0/1] mm: improve folio refcount scalability
Date: Sun, 1 Mar 2026 20:26:50 +0000	[thread overview]
Message-ID: <m73zmq6jbdzqmxdjgnmfmd5h4w5a2khur7we3e64r357o32hfj@mvwcqgnuhzqh> (raw)
In-Reply-To: <CAHk-=wgOxtAFf5XeYTcoUHSpWfMrgZaXKc_hrmqt6yKBdr=3Zw@mail.gmail.com>

On Sun, Mar 01, 2026 at 10:52:57AM -0800, Linus Torvalds wrote:
> On Sat, 28 Feb 2026 at 19:27, Linus Torvalds
> <torvalds@linuxfoundation.org> wrote:
> >
> > This attached patch is ENTIRELY UNTESTED.
> 
> Here's a slightly cleaned up and further simplified version, which is
> also actually tested, although only in the "it boots for me" sense.
> 
> It generates good code at least with clang:
> 
>   .LBB76_7:
>           movl    $1, %eax
>   .LBB76_8:
>           leal    1(%rax), %ecx
>           lock cmpxchgl   %ecx, 52(%rdi)
>           sete    %cl
>           je      .LBB76_10
>           testl   %eax, %eax
>           jne     .LBB76_8
>   .LBB76_10:
> 
> which actually looks both simple and fairly optimal for that sequence.
> 
> Of course, since this is very much about cacheline access patterns,
> actual performance will depend on random microarchitectural issues
> (and not just the CPU core, but the whole memory subsystem).
> 
> Can somebody with a good - and relevant - benchmark system try this out?
> 
>                Linus

Here are some perhaps interesting numbers from an extremely synthetic
benchmark[1] I wrote just now:

note: xadd_bench uses a plain lock addl, cmpxchg_bench is the typical load +
lock cmpxchg loop, and optimistic_cmpxchg_bench is similar to what you wrote,
where we optimistically assume a refcount of 1 and only fall back to the
actual load + cmpxchg loop on failure. I also don't claim this is
representative of page cache performance, but it is quite a lot simpler
to set up and play around with.
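For reference, the three variants boil down to roughly the following (a hedged
C11 sketch of my own, using <stdatomic.h> rather than the kernel's atomic_t
API; function names are mine, not the gist's):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* xadd: unconditional atomic increment; compiles to lock addl/xaddl on x86 */
static void inc_xadd(atomic_int *ref)
{
	atomic_fetch_add_explicit(ref, 1, memory_order_relaxed);
}

/* cmpxchg: the classic load + lock cmpxchg retry loop */
static void inc_cmpxchg(atomic_int *ref)
{
	int old = atomic_load_explicit(ref, memory_order_relaxed);

	while (!atomic_compare_exchange_weak_explicit(ref, &old, old + 1,
						      memory_order_relaxed,
						      memory_order_relaxed))
		; /* on failure, 'old' was updated with the observed value */
}

/*
 * optimistic: guess the count is 1 and cmpxchg immediately, skipping the
 * initial load; fall back to the retry loop on a miss. Returns false if
 * the count is observed at 0 (dead object), matching try-get semantics.
 */
static bool tryget_optimistic(atomic_int *ref)
{
	int old = 1; /* optimistic guess: uncontended, refcount == 1 */

	do {
		if (atomic_compare_exchange_weak_explicit(ref, &old, old + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;
		/* failure updated 'old' with the value actually observed */
	} while (old != 0); /* count hit zero: object is dead, give up */

	return false;
}
```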

On my zen4 AMD Ryzen 7 PRO 7840U laptop:
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
xadd_bench/threads:1                      2.76 ns         2.76 ns    250435782
xadd_bench/threads:4                      42.1 ns         42.1 ns     15969296
xadd_bench/threads:8                      84.8 ns         84.8 ns      8920800
xadd_bench/threads:16                      226 ns          211 ns      2446928
cmpxchg_bench/threads:1                   3.12 ns         3.12 ns    220339301
cmpxchg_bench/threads:4                   51.1 ns         51.1 ns     12372808
cmpxchg_bench/threads:8                    112 ns          112 ns      6228056
cmpxchg_bench/threads:16                   679 ns          648 ns       930832
optimistic_cmpxchg_bench/threads:1        2.95 ns         2.95 ns    233704391
optimistic_cmpxchg_bench/threads:4        56.2 ns         56.2 ns     11780588
optimistic_cmpxchg_bench/threads:8         140 ns          140 ns      4606440
optimistic_cmpxchg_bench/threads:16        789 ns          746 ns       806400

Here we can see that the optimistic cmpxchg still can't match the xadd/lock addl
performance single-threaded, and under load it degrades quickly, and worse than
the straight-up cmpxchg loop (presumably because of the extra failed cmpxchg).

On our internal large 160-core Intel(R) Xeon(R) CPU E7-8891 v4 (older uarch,
sad) machine:
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
xadd_bench/threads:1                       13.6 ns         13.6 ns     51445934
xadd_bench/threads:4                       41.4 ns          166 ns      4211940
xadd_bench/threads:8                       30.3 ns          242 ns      2190488
xadd_bench/threads:16                      37.3 ns          596 ns      1162336
xadd_bench/threads:64                      24.9 ns         1376 ns       640000
xadd_bench/threads:128                     27.3 ns         3108 ns      1054592
cmpxchg_bench/threads:1                    17.9 ns         17.9 ns     38992029
cmpxchg_bench/threads:4                    54.8 ns          219 ns      3431076
cmpxchg_bench/threads:8                    39.0 ns          312 ns      1698712
cmpxchg_bench/threads:16                   62.2 ns          994 ns       530672
cmpxchg_bench/threads:64                   28.5 ns         1479 ns       665280
cmpxchg_bench/threads:128                  17.2 ns         1838 ns       517376
optimistic_cmpxchg_bench/threads:1         13.6 ns         13.6 ns     51384286
optimistic_cmpxchg_bench/threads:4         70.2 ns          281 ns      2585092
optimistic_cmpxchg_bench/threads:8         58.1 ns          465 ns      1598592
optimistic_cmpxchg_bench/threads:16         106 ns         1694 ns       420832
optimistic_cmpxchg_bench/threads:64        30.8 ns         1767 ns       499264
optimistic_cmpxchg_bench/threads:128       39.3 ns         4632 ns       447104

Here, optimistic seems to match xadd single-threaded, but then degrades very
quickly. In general optimistic_cmpxchg seems to degrade worse than cmpxchg,
but there is a lot of variance here (and other users were lightly loading the
machine), so results (particularly those at higher thread counts) should be
taken with a grain of salt (for example, lock add scaling drastically worse
than cmpxchg seems to be a fluke).

TL;DR I don't think the idea quite works, particularly when a folio is under
contention: if you have traffic on a cacheline, then you almost certainly have
a couple of threads trying to grab a refcount, so the optimistic guess of 1
misses, and doing two cmpxchgs just increases traffic and pessimises things.
It's also perhaps worth noting that neither solution really scales.
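To make the two-cmpxchg point concrete, here's a hypothetical instrumented
variant (my own illustration, not code from the gist or the patch) that counts
the locked RMW operations the optimistic path issues when its guess of 1 is
wrong:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Like the optimistic try-get, but count every lock cmpxchg issued.
 * Uses the strong variant so the count is deterministic for illustration.
 */
static bool tryget_optimistic_counted(atomic_int *ref, int *rmw_ops)
{
	int old = 1; /* optimistic guess */

	*rmw_ops = 0;
	do {
		++*rmw_ops;
		if (atomic_compare_exchange_strong_explicit(ref, &old, old + 1,
							    memory_order_relaxed,
							    memory_order_relaxed))
			return true;
	} while (old != 0);

	return false;
}
```

With the count already at, say, 5 (another thread holds references), the first
cmpxchg misses and a second one succeeds: two locked RMWs on a contended line
where a single lock addl would have done one, which is exactly where the extra
coherence traffic comes from.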

[1] https://gist.github.com/heatd/2a6e6c778c3cfd4aa6804b2d598c7a4c (excuse my C++)
-- 
Pedro


Thread overview: 7+ messages
2026-02-26 16:27 Gladyshev Ilya
2026-02-26 16:27 ` [PATCH 1/1] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
2026-02-28 22:19 ` [PATCH 0/1] mm: improve folio refcount scalability Andrew Morton
2026-03-01  3:27   ` Linus Torvalds
2026-03-01 18:52     ` Linus Torvalds
2026-03-01 20:26       ` Pedro Falcato [this message]
2026-03-01 21:16         ` Linus Torvalds
