From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Shakeel Butt <shakeel.butt@linux.dev>,
Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Masami Hiramatsu" <mhiramat@kernel.org>,
"Dennis Zhou" <dennis@kernel.org>, "Tejun Heo" <tj@kernel.org>,
"Christoph Lameter" <cl@linux.com>,
"Martin Liu" <liumartin@google.com>,
"David Rientjes" <rientjes@google.com>,
"Christian König" <christian.koenig@amd.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Sweet Tea Dorminy" <sweettea@google.com>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
"Suren Baghdasaryan" <surenb@google.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
"Christian Brauner" <brauner@kernel.org>,
"Wei Yang" <richard.weiyang@gmail.com>,
"David Hildenbrand" <david@redhat.com>,
"Miaohe Lin" <linmiaohe@huawei.com>,
"Al Viro" <viro@zeniv.linux.org.uk>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, "Yu Zhao" <yuzhao@google.com>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Mateusz Guzik" <mjguzik@gmail.com>
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters
Date: Thu, 3 Apr 2025 13:59:39 -0400 [thread overview]
Message-ID: <55c89f03-6120-43d1-a620-46d8ca8aba4e@efficios.com> (raw)
In-Reply-To: <2m3wwqpha2jlo4zjn6xbucahfufej75gbaxxgh4j4h67pgrw7b@diodkog7ygk3>
On 2025-04-02 20:00, Shakeel Butt wrote:
> On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
[...]
> I am still not buying the 'good performance' point. To me we might need
> to go with reduced batch size of existing approach or multi level
> approach from Mathieu (I still have to see Mateusz and Kairui's
> proposals).
Here is an initial userspace prototype of my hierarchical split counters:
https://github.com/compudj/librseq/blob/percpu-counter/include/rseq/percpu-counter.h
https://github.com/compudj/librseq/blob/percpu-counter/src/percpu-counter.c
How to try it out:
* Install liburcu
* Clone & build https://github.com/compudj/librseq branch: percpu-counter
(note: it's currently a POC, very lightly tested.)
* Run tests/percpu_counter_test.tap:
ok 1 - Registered current thread with rseq
Counter init: approx: 0 precise: 0 inaccuracy: ±2048
Counter after sum: approx: 2998016 precise: 2996800 inaccuracy: ±2048
Counter after set=0: approx: 1216 precise: 0 inaccuracy: ±2048
ok 2 - Unregistered current thread with rseq
1..2
It implements the following operations:
Fast paths:
- counter_add
- counter_approx_sum
Function call APIs:
- counter_add_slowpath: propagate approximations to levels > 0.
- counter_precise_sum: iterate over all per-cpu counters.
- counter_set: set a bias to bring the precise sum to a given target value.
- counter_inaccuracy: return the maximum inaccuracy of the approximation for
  this counter configuration.
- counter_compare: compare a counter against a value. Use the approximation when
  the value is further away than the inaccuracy limit, else use the precise sum.
Porting it to the Linux kernel and replacing lib/percpu_counter.c should be
straightforward. AFAIU, the only thing I have not implemented is a replacement
for percpu_counter_limited_add, and I'm not so sure how useful it is.
The most relevant piece of the algorithm is within counter_add as follows:
static inline
void counter_add(struct percpu_counter *counter, long inc)
{
	unsigned long bit_mask = counter->level0_bit_mask;
	intptr_t orig, res;
	int ret, cpu;

	if (!inc)
		return;

	// This is basically a percpu_add_return() in userspace with rseq.
	do {
		cpu = rseq_cpu_start();
		orig = *rseq_percpu_ptr(counter->level0, cpu);
		ret = rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID,
						rseq_percpu_ptr(counter->level0, cpu),
						orig, orig + inc, cpu);
	} while (ret);
	res = orig + inc;
	counter_dbg_printf("counter_add: inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
			   inc, bit_mask, orig, res);
	if (inc < 0) {
		inc = -(-inc & ~((bit_mask << 1) - 1));
		/* xor bit_mask, same sign: underflow */
		if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
			inc -= bit_mask;
	} else {
		inc &= ~((bit_mask << 1) - 1);
		/* xor bit_mask, same sign: overflow */
		if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
			inc += bit_mask;
	}
	if (inc)
		counter_add_slowpath(counter, inc);
}
void counter_add_slowpath(struct percpu_counter *counter, long inc)
{
	struct percpu_counter_level_item *item = counter->items;
	unsigned int level_items = counter->nr_cpus >> 1;
	unsigned int level, nr_levels = counter->nr_levels;
	long bit_mask = counter->level0_bit_mask;
	int cpu = rseq_current_cpu_raw();

	for (level = 1; level < nr_levels; level++) {
		long orig, res;
		long *count = &item[cpu & (level_items - 1)].count;

		bit_mask <<= 1;
		res = uatomic_add_return(count, inc, CMM_RELAXED);
		orig = res - inc;
		counter_dbg_printf("counter_add_slowpath: level %d, inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
				   level, inc, bit_mask, orig, res);
		if (inc < 0) {
			inc = -(-inc & ~((bit_mask << 1) - 1));
			/* xor bit_mask, same sign: underflow */
			if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
				inc -= bit_mask;
		} else {
			inc &= ~((bit_mask << 1) - 1);
			/* xor bit_mask, same sign: overflow */
			if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
				inc += bit_mask;
		}
		item += level_items;
		level_items >>= 1;
		if (!inc)
			return;
	}
	counter_dbg_printf("counter_add_slowpath: last level add %ld\n", inc);
	uatomic_add(&item->count, inc, CMM_RELAXED);
}
Feedback is welcome!
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Thread overview: 9+ messages
2025-03-31 22:35 Sweet Tea Dorminy
2025-04-01 3:26 ` Kairui Song
2025-04-03 14:31 ` Mateusz Guzik
2025-04-04 16:51 ` Kairui Song
2025-04-08 7:46 ` Mateusz Guzik
2025-04-03 0:00 ` Shakeel Butt
2025-04-03 17:59 ` Mathieu Desnoyers [this message]
2025-04-04 16:02 ` Mathieu Desnoyers
2025-04-03 16:39 ` Mateusz Guzik