From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Shakeel Butt <shakeel.butt@linux.dev>,
Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Masami Hiramatsu" <mhiramat@kernel.org>,
"Dennis Zhou" <dennis@kernel.org>, "Tejun Heo" <tj@kernel.org>,
"Christoph Lameter" <cl@linux.com>,
"Martin Liu" <liumartin@google.com>,
"David Rientjes" <rientjes@google.com>,
"Christian König" <christian.koenig@amd.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Sweet Tea Dorminy" <sweettea@google.com>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
"Suren Baghdasaryan" <surenb@google.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
"Christian Brauner" <brauner@kernel.org>,
"Wei Yang" <richard.weiyang@gmail.com>,
"David Hildenbrand" <david@redhat.com>,
"Miaohe Lin" <linmiaohe@huawei.com>,
"Al Viro" <viro@zeniv.linux.org.uk>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, "Yu Zhao" <yuzhao@google.com>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Mateusz Guzik" <mjguzik@gmail.com>
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters
Date: Thu, 3 Apr 2025 13:59:39 -0400 [thread overview]
Message-ID: <55c89f03-6120-43d1-a620-46d8ca8aba4e@efficios.com> (raw)
In-Reply-To: <2m3wwqpha2jlo4zjn6xbucahfufej75gbaxxgh4j4h67pgrw7b@diodkog7ygk3>
On 2025-04-02 20:00, Shakeel Butt wrote:
> On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
[...]
> I am still not buying the 'good performance' point. To me we might need
> to go with reduced batch size of existing approach or multi level
> approach from Mathieu (I still have to see Mateusz and Kairui's
> proposals).
Here is an initial userspace prototype of my hierarchical split counters:
https://github.com/compudj/librseq/blob/percpu-counter/include/rseq/percpu-counter.h
https://github.com/compudj/librseq/blob/percpu-counter/src/percpu-counter.c
How to try it out:
* Install liburcu
* Clone & build https://github.com/compudj/librseq branch: percpu-counter
(note: it's currently a POC, very lightly tested.)
* Run tests/percpu_counter_test.tap:
ok 1 - Registered current thread with rseq
Counter init: approx: 0 precise: 0 inaccuracy: ±2048
Counter after sum: approx: 2998016 precise: 2996800 inaccuracy: ±2048
Counter after set=0: approx: 1216 precise: 0 inaccuracy: ±2048
ok 2 - Unregistered current thread with rseq
1..2
It implements the following operations:
Fast paths:
- counter_add
- counter_approx_sum
Function call APIs:
- counter_add_slowpath: propagate approximations to levels > 0.
- counter_precise_sum: iterate over all per-cpu counters.
- counter_set: set a bias to bring the precise sum to a given target value.
- counter_inaccuracy: return the maximum inaccuracy of the approximation for
  this counter configuration.
- counter_compare: compare a counter against a value. Use the approximation when
  the value is further away than the inaccuracy limit, else use the precise sum.
Porting it to the Linux kernel and replacing lib/percpu_counter.c should be
straightforward. AFAIU, the only thing I have not implemented is a replacement
for percpu_counter_limited_add, and I'm not so sure how useful it is.
The most relevant piece of the algorithm is within counter_add as follows:
static inline
void counter_add(struct percpu_counter *counter, long inc)
{
	unsigned long bit_mask = counter->level0_bit_mask;
	intptr_t orig, res;
	int ret, cpu;

	if (!inc)
		return;

	// This is basically a percpu_add_return() in userspace with rseq.
	do {
		cpu = rseq_cpu_start();
		orig = *rseq_percpu_ptr(counter->level0, cpu);
		ret = rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID,
						rseq_percpu_ptr(counter->level0, cpu),
						orig, orig + inc, cpu);
	} while (ret);
	res = orig + inc;
	counter_dbg_printf("counter_add: inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
			   inc, bit_mask, orig, res);
	if (inc < 0) {
		inc = -(-inc & ~((bit_mask << 1) - 1));
		/* xor bit_mask, same sign: underflow */
		if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
			inc -= bit_mask;
	} else {
		inc &= ~((bit_mask << 1) - 1);
		/* xor bit_mask, same sign: overflow */
		if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
			inc += bit_mask;
	}
	if (inc)
		counter_add_slowpath(counter, inc);
}
void counter_add_slowpath(struct percpu_counter *counter, long inc)
{
	struct percpu_counter_level_item *item = counter->items;
	unsigned int level_items = counter->nr_cpus >> 1;
	unsigned int level, nr_levels = counter->nr_levels;
	long bit_mask = counter->level0_bit_mask;
	int cpu = rseq_current_cpu_raw();

	for (level = 1; level < nr_levels; level++) {
		long orig, res;
		long *count = &item[cpu & (level_items - 1)].count;

		bit_mask <<= 1;
		res = uatomic_add_return(count, inc, CMM_RELAXED);
		orig = res - inc;
		counter_dbg_printf("counter_add_slowpath: level %d, inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
				   level, inc, bit_mask, orig, res);
		if (inc < 0) {
			inc = -(-inc & ~((bit_mask << 1) - 1));
			/* xor bit_mask, same sign: underflow */
			if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
				inc -= bit_mask;
		} else {
			inc &= ~((bit_mask << 1) - 1);
			/* xor bit_mask, same sign: overflow */
			if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
				inc += bit_mask;
		}
		item += level_items;
		level_items >>= 1;
		if (!inc)
			return;
	}
	counter_dbg_printf("counter_add_slowpath: last level add %ld\n", inc);
	uatomic_add(&item->count, inc, CMM_RELAXED);
}
Feedback is welcome!
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Thread overview: 9+ messages
2025-03-31 22:35 Sweet Tea Dorminy
2025-04-01 3:26 ` Kairui Song
2025-04-03 14:31 ` Mateusz Guzik
2025-04-04 16:51 ` Kairui Song
2025-04-08 7:46 ` Mateusz Guzik
2025-04-03 0:00 ` Shakeel Butt
2025-04-03 17:59 ` Mathieu Desnoyers [this message]
2025-04-04 16:02 ` Mathieu Desnoyers
2025-04-03 16:39 ` Mateusz Guzik