From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "Christoph Lameter (Ampere)" <cl@gentwo.org>
Cc: Sweet Tea Dorminy <sweettea@google.com>,
Mateusz Guzik <mjguzik@gmail.com>,
linux-kernel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
Martin Liu <liumartin@google.com>,
David Rientjes <rientjes@google.com>,
christian.koenig@amd.com, Shakeel Butt <shakeel.butt@linux.dev>,
Johannes Weiner <hannes@cmpxchg.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
"Liam R . Howlett" <Liam.Howlett@Oracle.com>,
Suren Baghdasaryan <surenb@google.com>,
Vlastimil Babka <vbabka@suse.cz>,
Christian Brauner <brauner@kernel.org>,
Wei Yang <richard.weiyang@gmail.com>,
David Hildenbrand <david@redhat.com>,
Miaohe Lin <linmiaohe@huawei.com>,
Al Viro <viro@zeniv.linux.org.uk>,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
Yu Zhao <yuzhao@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Matthew Wilcox <willy@infradead.org>
Subject: Re: [RFC PATCH v2] Introduce Hierarchical Per-CPU Counters
Date: Tue, 8 Apr 2025 15:29:40 -0400 [thread overview]
Message-ID: <f014a9bb-0653-4682-8608-7fe6e2ad5ee6@efficios.com> (raw)
In-Reply-To: <f90d7646-99a3-48f5-ba2e-850c73592080@gentwo.org>
On 2025-04-08 12:37, Christoph Lameter (Ampere) wrote:
> On Tue, 8 Apr 2025, Mathieu Desnoyers wrote:
>
>> - Minimize contention when incrementing and decrementing counters,
>> - Provide fast access to a sum approximation,
>
> In general I like this as a abstraction of the Zoned VM counters in
> vmstat.c that will make the scalable counters there useful elsewhere.
I'm glad you like it! :)
>
>> It aims at fixing the per-mm RSS tracking which has become too
>> inaccurate for OOM killer purposes on large many-core systems [1].
>
> There are numerous cases where these issues occur. I know of a few I could
> use something like this.
I'm looking forward to see it used.
>
>> The hierarchical per-CPU counters propagate a sum approximation through
>> a binary tree. When reaching the batch size, the carry is propagated
>> through a binary tree which consists of log2(nr_cpu_ids) levels. The
>> batch size for each level is twice the batch size of the prior level.
>
> A binary tree? Could we do this N-way? Otherwise the tree will be 8 levels
> on a 512 cpu machine. Given the inflation of the number of cpus this
> scheme better work up to 8K cpus.
That's a good point. We should probably tune the tree n-arity to limit
the tree depth to about 5. Here is a table I've done showing a possible
n-arity choice for each power of 2 nr_cpu_ids:
nr_cpu_ids that can be represented at depth N for tree n-arity
2, 4, 8, and 16:
N 2^N 4^N 8^N 16^N
--------------------------------------------------------------
0 1 1 1 1
1 2 4 8 16
2 4 16 64 256
3 8 64 512 4096
4 16 256 4096 *65536*
5 32 1024 *32768* 1048576
6 64 4096 262144 16777216
7 128 *16384* 2097152 268435456
8 256 65536 16777216 4294967296
9 512 262144 134217728 68719476736
10 1024 1048576 1073741824 1099511627776
11 2048 4194304 8589934592 17592186044416
12 4096 16777216 68719476736 281474976710656
13 *8192* 67108864 549755813888 4503599627370496
14 16384 268435456 4398046511104 72057594037927936
15 32768 1073741824 35184372088832 1152921504606846976
16 65536 4294967296 281474976710656 18446744073709551616
nr_levels is the number of tree levels for carry propagation
(excludes the final accumulation counter).
nr_cpu_ids n-arity nr_levels
2 2 1
4 2 2
8 2 3
16 4 3
32 4 3
64 4 3
128 8 3
256 8 3
512 8 3
1024 8 4
2048 8 4
4096 8 4
8192 16 4
16384 16 4
32768 16 4
65536 16 4
131072 16 5
262144 16 5
524288 16 5
1048576 16 5
>
>> +int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter);
>> +int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
>> +int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);
>
> Precise? Concurrent counter updates can occur while determining the global
> value. People may get confused.
If counters are incremented or decremented concurrently, of course a "sum"
reader will either observe the value before or after the increment.
This is inherent to reading a counter concurrently with its updates.
However, the same can be said of reading a global counter concurrently
with its updates.
If you introduce an external synchronization mechanism that makes the
updaters quiescent (e.g. RCU) before reading the counter value, then
you have a sum which is guaranteed to include all prior increments/decrements.
So using the term "precise" does not appear misleading to me from that
perspective.
> Also maybe there would be a need for a function to collape the values into
> the global if f.e. a cpu goes off line or in order to switch off OS
> activities on a cpu.
Currently percpu_counter_tree_precise_sum_unbiased() iterates on each
possible cpu, which does not require cpu hotplug integration.
Indeed, as an improvement, we could integrate with cpu hotplug notifiers
and move the offlining CPU counters to a global accumulator, but this would
require the precise sum to synchronize against concurrent CPU hotplug. That
would allow the precise sum iteration to only consider online cpus.
Note that the hierarchical counter tree algorithms for both the
increment/decrement and for approximate/precise sums are conveniently
lock-free at the moment.
We'd have to consider whether it's preferable to keep lock-freedom or
introduce locking for the precise sum to iterate on fewer CPUs.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
next prev parent reply other threads:[~2025-04-08 19:29 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-08 16:05 Mathieu Desnoyers
2025-04-08 16:37 ` Christoph Lameter (Ampere)
2025-04-08 17:01 ` Matthew Wilcox
2025-04-08 17:41 ` Liam R. Howlett
2025-04-08 19:40 ` Mathieu Desnoyers
2025-04-08 20:08 ` Liam R. Howlett
2025-04-08 19:29 ` Mathieu Desnoyers [this message]
2025-04-08 20:44 ` Christoph Lameter (Ampere)
2025-04-08 21:00 ` Paul E. McKenney
2025-04-08 21:21 ` Christoph Lameter (Ampere)
2025-04-08 21:46 ` Paul E. McKenney
2025-04-08 22:12 ` Roman Gushchin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f014a9bb-0653-4682-8608-7fe6e2ad5ee6@efficios.com \
--to=mathieu.desnoyers@efficios.com \
--cc=Liam.Howlett@Oracle.com \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=christian.koenig@amd.com \
--cc=cl@gentwo.org \
--cc=david@redhat.com \
--cc=dennis@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=linmiaohe@huawei.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=liumartin@google.com \
--cc=lorenzo.stoakes@oracle.com \
--cc=mhiramat@kernel.org \
--cc=mjguzik@gmail.com \
--cc=paulmck@kernel.org \
--cc=richard.weiyang@gmail.com \
--cc=rientjes@google.com \
--cc=roman.gushchin@linux.dev \
--cc=rostedt@goodmis.org \
--cc=shakeel.butt@linux.dev \
--cc=surenb@google.com \
--cc=sweettea@google.com \
--cc=tj@kernel.org \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox