From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
Matthew Wilcox <willy@infradead.org>,
"Christoph Lameter (Ampere)" <cl@gentwo.org>,
Sweet Tea Dorminy <sweettea@google.com>,
Mateusz Guzik <mjguzik@gmail.com>,
linux-kernel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
Martin Liu <liumartin@google.com>,
David Rientjes <rientjes@google.com>,
christian.koenig@amd.com, Shakeel Butt <shakeel.butt@linux.dev>,
Johannes Weiner <hannes@cmpxchg.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Suren Baghdasaryan <surenb@google.com>,
Vlastimil Babka <vbabka@suse.cz>,
Christian Brauner <brauner@kernel.org>,
Wei Yang <richard.weiyang@gmail.com>,
David Hildenbrand <david@redhat.com>,
Miaohe Lin <linmiaohe@huawei.com>,
Al Viro <viro@zeniv.linux.org.uk>,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
Yu Zhao <yuzhao@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>
Subject: Re: [RFC PATCH v2] Introduce Hierarchical Per-CPU Counters
Date: Tue, 8 Apr 2025 15:40:19 -0400 [thread overview]
Message-ID: <bcebaacb-ef9e-409d-b770-3057a96c3d11@efficios.com> (raw)
In-Reply-To: <iqa7gvbcwyw76jd6cgimp5jbu4szxob5ptjvi3rzll6amfjygg@cunm5zrgm27e>
On 2025-04-08 13:41, Liam R. Howlett wrote:
> * Matthew Wilcox <willy@infradead.org> [250408 13:03]:
>> On Tue, Apr 08, 2025 at 09:37:18AM -0700, Christoph Lameter (Ampere) wrote:
>>>> The hierarchical per-CPU counters propagate a sum approximation through
>>>> a binary tree. When reaching the batch size, the carry is propagated
>>>> through a binary tree which consists of log2(nr_cpu_ids) levels. The
>>>> batch size for each level is twice the batch size of the prior level.
>>>
>>> A binary tree? Could we do this N-way? Otherwise the tree will be 8 levels
>>> on a 512 cpu machine. Given the inflation of the number of cpus this
>>> scheme better work up to 8K cpus.
>>
>> I find that a fan-out somewhere between 8 and 16 works well in practice.
>> log16(512) gives a 3 level tree as does a log8 tree. log16(8192) is a 4
>> level tree whereas log8(8192) is a 5 level tree. Not a big difference
>> either way.
>>
>> Somebody was trying to persuade me that a new tree type that maintained
>> additional information at each level of the tree to make some operations
>> log(log(N)) would be a better idea than a B-tree that is log(N). I
>> countered that a wider tree made the argument unsound at any size tree
>> up to 100k. And we don't tend to have _that_ many objects in a
>> data structure inside the kernel.
>
> I still maintain vEB trees are super cool, but I am glad we didn't try
> to implement an RCU safe version.
>
>>
>> ceil(log14(100,000)) = 5
>> ceil(log2(log2(100,000))) = 5
>>
>> at a million, there's actually a gap, 6 vs 5. But constant factors
>> become a much larger factor than scalability arguments at that point.
>
> In retrospect, it seems more of a math win than a practical win - and
> only really the O(n) bounds. Beyond what willy points out, writes
> rippling up the tree should be a concern for most users since it will
> impact the restart of readers and negatively affect the writer speed -
> but probably not here (hot plug?).
This implementation of hierarchical per-cpu counters is lock-free
for increment/decrement *and* for precise/approximate sums.
The increment/decrement use:
- this_cpu_add_return on the fast-path,
- atomic_add_return_relaxed for intermediate levels carry propagation,
- atomic_add for approximate sum updates.
The precise sum iterates on all possible cpus, loading their current
value. The approximate sum simply loads the current value of the
approximate sum.
So I'm unsure about your concern of writers restarting readers, because
this tree does not rely on mutual exclusion between updaters and
readers, nor does it rely on cmpxchg-based retry mechanisms in readers.
I agree with you that updates going all the way up the tree may
negatively affect the updater and approximate sum reader performance due
to bouncing of the counter cache line across CPUs.
>
> Working in (multiples of) cacheline sized b-tree nodes makes the most
> sense, in my experience.
I'm confused. Can you explain how this recommendation can practically
apply to the hierarchical counters ?
Thanks!
Mathieu
>
> Thanks,
> Liam
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
next prev parent reply other threads:[~2025-04-08 19:40 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-08 16:05 Mathieu Desnoyers
2025-04-08 16:37 ` Christoph Lameter (Ampere)
2025-04-08 17:01 ` Matthew Wilcox
2025-04-08 17:41 ` Liam R. Howlett
2025-04-08 19:40 ` Mathieu Desnoyers [this message]
2025-04-08 20:08 ` Liam R. Howlett
2025-04-08 19:29 ` Mathieu Desnoyers
2025-04-08 20:44 ` Christoph Lameter (Ampere)
2025-04-08 21:00 ` Paul E. McKenney
2025-04-08 21:21 ` Christoph Lameter (Ampere)
2025-04-08 21:46 ` Paul E. McKenney
2025-04-08 22:12 ` Roman Gushchin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=bcebaacb-ef9e-409d-b770-3057a96c3d11@efficios.com \
--to=mathieu.desnoyers@efficios.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=christian.koenig@amd.com \
--cc=cl@gentwo.org \
--cc=david@redhat.com \
--cc=dennis@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=linmiaohe@huawei.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=liumartin@google.com \
--cc=lorenzo.stoakes@oracle.com \
--cc=mhiramat@kernel.org \
--cc=mjguzik@gmail.com \
--cc=paulmck@kernel.org \
--cc=richard.weiyang@gmail.com \
--cc=rientjes@google.com \
--cc=roman.gushchin@linux.dev \
--cc=rostedt@goodmis.org \
--cc=shakeel.butt@linux.dev \
--cc=surenb@google.com \
--cc=sweettea@google.com \
--cc=tj@kernel.org \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox