From: Mateusz Guzik <mjguzik@gmail.com>
To: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: "Greg Thelen" <gthelen@google.com>, "Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Eric Dumazet" <edumzaet@google.com>,
cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
"Eric Dumazet" <edumazet@google.com>
Subject: Re: [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu)
Date: Thu, 27 Mar 2025 15:38:50 +0100 [thread overview]
Message-ID: <2vznaaotzkgkrfoi2qitiwdjinpl7ozhpz7w6n7577kaa2hpki@okh2mkqqhbkq> (raw)
In-Reply-To: <Z9r8TX0WiPWVffI0@google.com>
On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
> >
> > Would your problem go away toggling this every -- say -- 8 cpus?
>
> I was concerned about this too, and about more lock bouncing, but the
> testing suggests that this actually overall improves the latency of
> cgroup_rstat_flush_locked() (at least on tested HW).
>
> So I don't think we need to do something like this unless a regression
> is observed.
>
To my reading it reduces the max time spent with irqs disabled, which of
course it does -- after all it toggles them for every CPU.

Per my other e-mail in the thread, the irq + lock trips remain not cheap,
at least on Sapphire Rapids.

In my testing outlined below I see an 11% increase in total execution
time with the irq + lock trip done for every CPU in a 24-way vm.

So I stand by doing this every n CPUs instead -- call it 8 or whatever.
How to repro:
I employed a poor-man's profiler like so:
bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'
This patch or not, execution time varies wildly even while the box is idle.
The above runs for a minute, collecting 23 samples (you may get
"lucky" and get one extra, in that case remove it for comparison).
A sysctl was added to toggle the new behavior vs the old one; the patch
is at the end. "enabled" (1) means the new behavior, "disabled" (0) the
old one.
Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
disabled: 903610
enabled: 1006833 (+11.4%)
Toggle at runtime with:
sysctl fs.magic_tunable=0 # disabled, no mandatory relocks
sysctl fs.magic_tunable=1 # enabled, relock for every CPU
I attached the stats I got for reference.
I patched v6.14 with the following:
diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..441f89421413 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -106,6 +106,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname	= "file-nr",
@@ -123,6 +125,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1		= SYSCTL_LONG_ZERO,
 		.extra2		= SYSCTL_LONG_MAX,
 	},
+	{
+		.procname	= "magic_tunable",
+		.data		= &magic_tunable,
+		.maxlen		= sizeof(magic_tunable),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= SYSCTL_LONG_ZERO,
+		.extra2		= SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname	= "nr_open",
 		.data		= &sysctl_nr_open,
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 3e01781aeb7b..f6444bf25b2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,6 +299,8 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+extern unsigned long magic_tunable;
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -323,12 +325,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+		if (READ_ONCE(magic_tunable)) {
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
 			__cgroup_rstat_lock(cgrp, cpu);
+		} else {
+			if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+				__cgroup_rstat_unlock(cgrp, cpu);
+				if (!cond_resched())
+					cpu_relax();
+				__cgroup_rstat_lock(cgrp, cpu);
+			}
 		}
 	}
 }
[-- Attachment #2: disabled --]
69869
30473
64670
30544
30950
36445
36235
29920
51179
35760
33424
42426
30177
31211
44974
34450
37871
72642
33016
29518
31800
35730
30326
[-- Attachment #3: enabled --]
63507
50113
36280
35148
63329
41232
51265
41341
41418
42824
35200
35550
54684
41597
55325
36120
48675
41179
39339
35794
38826
37411
40676