Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters

linux-mm.kvack.org archive mirror

From: Mateusz Guzik <mjguzik@gmail.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: "Sweet Tea Dorminy" <sweettea-kernel@dorminy.me>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Masami Hiramatsu" <mhiramat@kernel.org>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Dennis Zhou" <dennis@kernel.org>, "Tejun Heo" <tj@kernel.org>,
	"Christoph Lameter" <cl@linux.com>,
	"Martin Liu" <liumartin@google.com>,
	"David Rientjes" <rientjes@google.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Shakeel Butt" <shakeel.butt@linux.dev>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Sweet Tea Dorminy" <sweettea@google.com>,
	"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Christian Brauner" <brauner@kernel.org>,
	"Wei Yang" <richard.weiyang@gmail.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Miaohe Lin" <linmiaohe@huawei.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org, "Yu Zhao" <yuzhao@google.com>,
	"Roman Gushchin" <roman.gushchin@linux.dev>
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters
Date: Tue, 8 Apr 2025 09:46:15 +0200
Message-ID: <CAGudoHHQ4y0Z_A0yzpfim_wGFVUuF3NaLgNtWUiquiCby6Ppkg@mail.gmail.com>
In-Reply-To: <CAMgjq7C_W3dfYQ6DJT4QCza1DCtCE7yUdiManQSxCKOENxTm_g@mail.gmail.com>

On Fri, Apr 4, 2025 at 6:51 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Apr 3, 2025 at 10:31 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > Note there are 2 unrelated components in that patchset:
> > - one per-cpu instance of rss counters which is rolled up on context
> > switches, avoiding the costly counter alloc/free on mm
> > creation/teardown
> > - cpu iteration in get_mm_counter
> >
> > The allocation problem is fixable without abandoning the counters, see
> > my other e-mail (tl;dr let mm's hanging out in slab caches *keep* the
> > counters). This aspect has to be solved anyway due to mm_alloc_cid().
> > Providing a way to sort it out covers *both* the rss counters and the
> > cid thing.
>
> It's not just about fork performance: on some servers there could
> be ~100K processes and ~200 CPUs, which would mean hundreds of MBs of
> memory just for the counters.
>
> And nowadays it's not uncommon for a desktop to have ~64
> CPUs and ~10K processes.
>
> If we use a single shared "per-cpu" counter (as in the patch), the
> total consumption will always be just dozens of bytes.
>

I agree there is a tradeoff here and your approach saves memory in
exchange for more work during a context switch.

I have no opinion on which way to go here.
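
For a rough sense of scale, the quoted estimate can be reproduced with
back-of-the-envelope arithmetic. The sketch below assumes 4 RSS counter
types per mm and one 4-byte slot per counter per possible CPU. Both
constants and the helper name are illustrative assumptions, and real
percpu allocations carry overhead that is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 RSS counter types per mm. */
#define ASSUMED_COUNTERS_PER_MM 4
/* Assumption: each counter takes one 4-byte slot per possible CPU. */
#define ASSUMED_SLOT_BYTES 4

/*
 * Hypothetical helper: bytes spent on per-CPU RSS counter slots when
 * every mm carries its own set of percpu counters.
 */
uint64_t percpu_rss_bytes(uint64_t nr_mms, uint64_t nr_cpus)
{
	assert(nr_mms > 0 && nr_cpus > 0);
	return nr_mms * nr_cpus * ASSUMED_COUNTERS_PER_MM * ASSUMED_SLOT_BYTES;
}
```

Under these assumptions, 100K mms on 200 CPUs come to 320,000,000 bytes
(about 305 MiB), and 10K mms on 64 CPUs to 10,240,000 bytes (just under
10 MiB).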

> >
> > In your patchset the accuracy increase comes at the expense of walking
> > all CPUs every time, while a big part of the point of using percpu
> > counters is to have a good enough approximation somewhere that this is
> > not necessary.
>
> It usually doesn't walk all CPUs, only the CPUs that actually used
> that mm_struct, by checking mm_struct's cpu_bitmap. I didn't check
> whether all architectures use that bitmap, though.
>
> It's true that a CPU having its bit set in an mm_struct's
> cpu_bitmap doesn't mean it updated the RSS counter, so there will be
> false positives. The false positive rate is low, though, as schedulers
> don't shuffle processes between processors randomly, and not every
> process will run in a given period.
>
> Also per my observation the reader side is much colder compared to
> updater for /proc.
>

Per my earlier comment, the read side is hit a lot on mmap and munmap,
so it cannot be taken lightly. You can check for yourself with bpftrace.

While I agree that the vast majority of processes are not very
thread-heavy and the vast majority of machines out there don't have
hundreds of cores, this does have to behave sanely for the cases which
*do* exhibit these conditions. For example, a box with > 200 cores and
200+ threads to boot, all running on the entirety of the box.

In your patch as posted, fetching the value will force the walk *a lot*
and is consequently a no-go. This aspect needs to be dealt with for
the patchset to be ok. Otherwise a few months down the road someone
else will show up and complain about a new slowdown stemming from it.

-- 
Mateusz Guzik <mjguzik gmail.com>

Thread overview: 9+ messages
2025-03-31 22:35 Sweet Tea Dorminy
2025-04-01  3:26 ` Kairui Song
2025-04-03 14:31   ` Mateusz Guzik
2025-04-04 16:51     ` Kairui Song
2025-04-08  7:46       ` Mateusz Guzik [this message]
2025-04-03  0:00 ` Shakeel Butt
2025-04-03 17:59   ` Mathieu Desnoyers
2025-04-04 16:02     ` Mathieu Desnoyers
2025-04-03 16:39 ` Mateusz Guzik
