linux-mm.kvack.org archive mirror
From: Mateusz Guzik <mjguzik@gmail.com>
To: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: "Greg Thelen" <gthelen@google.com>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Eric Dumazet" <edumazet@google.com>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu)
Date: Thu, 27 Mar 2025 15:38:50 +0100	[thread overview]
Message-ID: <2vznaaotzkgkrfoi2qitiwdjinpl7ozhpz7w6n7577kaa2hpki@okh2mkqqhbkq> (raw)
In-Reply-To: <Z9r8TX0WiPWVffI0@google.com>

[-- Attachment #1: Type: text/plain, Size: 4037 bytes --]

On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > Is not this going a little too far?
> > 
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
> > 
> > Would your problem go away toggling this every -- say -- 8 cpus?
> 
> I was concerned about this too, and about more lock bouncing, but the
> testing suggests that this actually overall improves the latency of
> cgroup_rstat_flush_locked() (at least on tested HW).
> 
> So I don't think we need to do something like this unless a regression
> is observed.
> 

To my reading the patch reduces the max time spent with irqs disabled,
which of course it does -- after all it drops and re-takes the lock for
every CPU.

Per my other e-mail in the thread, the irq + lock trips remain far from
cheap, at least on Sapphire Rapids.

In my testing outlined below I see an 11% increase in total execution
time with the irq + lock trip for every CPU in a 24-way vm.

So I stand by doing this every n CPUs instead -- call it 8 or whatever.

How to repro:

I employed a poor-man's profiler like so:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'

With or without this patch, execution time varies wildly even while the
box is idle.

The above runs for a minute, collecting 23 samples (you may get
"lucky" and get one extra; in that case drop it for comparison).

A sysctl was added to toggle the new behavior vs old one. Patch at the
end.

"enabled"(1) means new behavior, "disabled"(0) means the old one.

Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
disabled:	903610
enabled:	1006833 (+11.4%)

Toggle at runtime with:
sysctl fs.magic_tunable=0 # disabled, no mandatory relocks
sysctl fs.magic_tunable=1 # enabled, relock for every CPU

I attached the stats I got for reference.

I patched v6.14 with the following:
diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..441f89421413 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -106,6 +106,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname	= "file-nr",
@@ -123,6 +125,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1		= SYSCTL_LONG_ZERO,
 		.extra2		= SYSCTL_LONG_MAX,
 	},
+	{
+		.procname	= "magic_tunable",
+		.data		= &magic_tunable,
+		.maxlen		= sizeof(magic_tunable),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= SYSCTL_LONG_ZERO,
+		.extra2		= SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname	= "nr_open",
 		.data		= &sysctl_nr_open,
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 3e01781aeb7b..f6444bf25b2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,6 +299,8 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+extern unsigned long magic_tunable;
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -323,12 +325,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+		if (READ_ONCE(magic_tunable)) {
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
 			__cgroup_rstat_lock(cgrp, cpu);
+		} else {
+			if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+				__cgroup_rstat_unlock(cgrp, cpu);
+				if (!cond_resched())
+					cpu_relax();
+				__cgroup_rstat_lock(cgrp, cpu);
+			}
 		}
 	}
 }


[-- Attachment #2: disabled --]
[-- Type: text/plain, Size: 138 bytes --]

69869
30473
64670
30544
30950
36445
36235
29920
51179
35760
33424
42426
30177
31211
44974
34450
37871
72642
33016
29518
31800
35730
30326

[-- Attachment #3: enabled --]
[-- Type: text/plain, Size: 138 bytes --]

63507
50113
36280
35148
63329
41232
51265
41341
41418
42824
35200
35550
54684
41597
55325
36120
48675
41179
39339
35794
38826
37411
40676
