On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
> >
> > Would your problem go away toggling this every -- say -- 8 cpus?
>
> I was concerned about this too, and about more lock bouncing, but the
> testing suggests that this actually overall improves the latency of
> cgroup_rstat_flush_locked() (at least on tested HW).
>
> So I don't think we need to do something like this unless a regression
> is observed.
>

To my reading it reduces the max time spent with irqs disabled, which of
course it does -- after all it toggles them for every CPU.

Per my other e-mail in the thread, the irq + lock trips are still not
cheap, at least on Sapphire Rapids.

In my testing outlined below I see an 11% increase in total execution
time with the irq + lock trip taken for every CPU in a 24-way vm.

So I stand by instead doing this every n CPUs, call it 8 or whatever. A
rough sketch of what I mean follows the patch at the end of this mail.

How to repro:

I employed a poor-man's profiler like so:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'

This patch or not, execution time varies wildly even while the box is
idle.

The above runs for a minute, collecting 23 samples (you may get "lucky"
and get one extra; in that case remove it for comparison).

A sysctl was added to toggle the new behavior vs the old one. Patch at
the end. "enabled" (1) means the new behavior, "disabled" (0) means the
old one.

Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
disabled: 903610
enabled:  1006833 (+11.4%)

Toggle at runtime with:
sysctl fs.magic_tunable=0 # disabled, no mandatory relocks
sysctl fs.magic_tunable=1 # enabled, relock for every CPU

I attached the stats I got for reference.
I patched v6.14 with the following:

diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..441f89421413 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -106,6 +106,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname	= "file-nr",
@@ -123,6 +125,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1		= SYSCTL_LONG_ZERO,
 		.extra2		= SYSCTL_LONG_MAX,
 	},
+	{
+		.procname	= "magic_tunable",
+		.data		= &magic_tunable,
+		.maxlen		= sizeof(magic_tunable),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= SYSCTL_LONG_ZERO,
+		.extra2		= SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname	= "nr_open",
 		.data		= &sysctl_nr_open,
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 3e01781aeb7b..f6444bf25b2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,6 +299,8 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+extern unsigned long magic_tunable;
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -323,12 +325,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+		if (READ_ONCE(magic_tunable)) {
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
 			__cgroup_rstat_lock(cgrp, cpu);
+		} else {
+			if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+				__cgroup_rstat_unlock(cgrp, cpu);
+				if (!cond_resched())
+					cpu_relax();
+				__cgroup_rstat_lock(cgrp, cpu);
+			}
 		}
 	}
 }
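
For illustration, the every-n-CPUs variant could look roughly like the
sketch below. This is untested and only meant to show the shape of it:
RSTAT_RELOCK_BATCH and since_relock are made-up names, 8 is an arbitrary
batch size, and the per-cpu flush body is elided.

/*
 * Untested sketch: pay the unlock + irq trip at most once every
 * RSTAT_RELOCK_BATCH CPUs instead of once per CPU.
 */
#define RSTAT_RELOCK_BATCH	8

static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
	int cpu, since_relock = 0;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		/* ... per-cpu flush as in the current code ... */

		/*
		 * Bound irq-off time without relocking for every CPU:
		 * only drop the lock once per RSTAT_RELOCK_BATCH CPUs.
		 */
		if (++since_relock < RSTAT_RELOCK_BATCH)
			continue;
		since_relock = 0;

		__cgroup_rstat_unlock(cgrp, cpu);
		if (!cond_resched())
			cpu_relax();
		__cgroup_rstat_lock(cgrp, cpu);
	}
}

Whether to also keep the need_resched()/spin_needbreak() check for the
in-between CPUs is a separate knob.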