* [PATCH 0/4 v3] cgroup: separate rstat trees
@ 2025-03-19 22:21 JP Kobryn
  2025-03-19 22:21 ` [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs JP Kobryn
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: JP Kobryn @ 2025-03-19 22:21 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

The current design of rstat takes the approach that if one subsystem is to
be flushed, all other subsystems with pending updates should also be
flushed. A flush may be initiated by reading specific stats (like cpu.stat),
and any other subsystems with pending updates are flushed alongside. The
complexity of flushing some subsystems has grown to the point where the
overhead of these side flushes unnecessarily delays the fetching of the
desired stats.

One big area where the issue comes up is system telemetry, where programs
periodically sample cpu stats while the memory controller is enabled.
Programs sampling cpu.stat would benefit if the overhead of also having to
flush memory (and io) stats were eliminated. It would save cpu cycles for
existing stat reader programs and improve scalability in terms of sampling
frequency and the number of hosts sampled.

This series changes the approach of "flush all subsystems" to "flush only
the requested subsystem". The core design change is moving from a unified
model, where rstat trees are shared by subsystems, to separate trees for
each subsystem. On a per-cpu basis, there will be a separate tree for each
enabled subsystem that implements css_rstat_flush and one tree for the base
stats. In order to do this, the rstat list pointers were moved off the
cgroup and onto the css, and their type was changed from cgroup to
cgroup_subsys_state. Finally, the updated/flush API was changed to accept a
reference to a css instead of a cgroup, which associates a specific
subsystem with a given update or flush. The result is that rstat trees are
now made up of css nodes, and a given tree will only contain nodes
associated with a specific subsystem.
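
As an illustration of the new calling convention, the memcontrol.c call
sites change roughly as follows (condensed from the patch 2 diff below):

	/* before: keyed on the cgroup, so a flush drags every subsystem along */
	cgroup_rstat_updated(memcg->css.cgroup, cpu);
	cgroup_rstat_flush(memcg->css.cgroup);

	/* after: keyed on the css, so only the memcg tree is updated/flushed */
	css_rstat_updated(&memcg->css, cpu);
	css_rstat_flush(&memcg->css);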

Since separate trees will now be in use, the locking scheme was adjusted.
The global locks were split up in such a way that there are separate locks
for the base stats (cgroup::self) and also for each subsystem (memory, io,
etc). This allows different subsystems (and base stats) to use rstat in
parallel with no contention.
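
Lock selection reduces to a small pair of helpers (shown in full in patch 3
below; reproduced here in simplified form with comments added):

	static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
	{
		/* subsystem-specific flush lock, or the base-stat lock for cgroup::self */
		if (ss)
			return &ss->lock;
		return &rstat_base_lock;
	}

	static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
	{
		/* same selection for the per-cpu update-tree locks */
		if (ss)
			return per_cpu_ptr(ss->percpu_lock, cpu);
		return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
	}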

Breaking up the unified tree into separate trees eliminates the overhead
and scalability issue explained in the first section, but comes at the
expense of additional memory. In an effort to minimize this overhead, new
rstat structs are introduced and a conditional allocation is performed.
The cgroup_rstat_cpu struct, which originally contained the rstat list
pointers and the base stat entities, was renamed to cgroup_rstat_base_cpu
and is only allocated when the associated css is cgroup::self. For non-self
css's, a new compact struct was added that contains only the rstat list
pointers; it is allocated during initialization when the given css belongs
to an actual subsystem (not cgroup::self). With this conditional allocation,
the per-cpu memory overhead before and after the series is shown below.

memory overhead before:
sizeof(struct cgroup_rstat_cpu) =~ 144 bytes /* can vary based on config */

nr_cgroups * sizeof(struct cgroup_rstat_cpu)
nr_cgroups * 144 bytes

memory overhead after:
sizeof(struct cgroup_rstat_cpu) == 16 bytes
sizeof(struct cgroup_rstat_base_cpu) =~ 144 bytes

nr_cgroups * (
	sizeof(struct cgroup_rstat_base_cpu) +
		sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
	)

nr_cgroups * (144 + 16 * nr_rstat_controllers)

... where nr_rstat_controllers is the number of enabled cgroup controllers
that implement css_rstat_flush(). On a host where both memory and io are
enabled:

nr_cgroups * (144 + 16 * 2)
nr_cgroups * 176 bytes

This leaves us with an increase in memory overhead of:
	32 bytes per cgroup per cpu
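
For reference, the compact per-css struct (called cgroup_rstat_cpu in the
sizes above and renamed css_rstat_cpu by patch 2) ends up holding only the
two tree-link pointers visible in the patch 2 diff, which is where the 16
bytes on a 64-bit build come from; the bstat fields move to the base-only
struct. Rough sketch:

	struct css_rstat_cpu {
		struct cgroup_subsys_state *updated_children;	/* terminated by self */
		struct cgroup_subsys_state *updated_next;	/* NULL if not on list */
	};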

Validation was performed by reading some *.stat files of a target parent
cgroup while the system was under different workloads. A test program was
made to loop 1M times, reading the files cgroup.stat, cpu.stat, io.stat,
and memory.stat of the parent cgroup on each iteration. Using an unpatched
kernel as the control and this series as the experiment, the results show
performance gains when reading stats with this series.
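
For reference, a minimal sketch of such a reader is below. The actual test
program is not part of this series, and the cgroup path used here is only a
placeholder:

	#include <fcntl.h>
	#include <unistd.h>

	static void read_file(const char *path)
	{
		char buf[16384];
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return;
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}

	int main(void)
	{
		/* placeholder parent cgroup path */
		static const char * const files[] = {
			"/sys/fs/cgroup/parent/cgroup.stat",
			"/sys/fs/cgroup/parent/cpu.stat",
			"/sys/fs/cgroup/parent/io.stat",
			"/sys/fs/cgroup/parent/memory.stat",
		};

		for (int i = 0; i < 1000000; i++)
			for (int f = 0; f < 4; f++)
				read_file(files[f]);

		return 0;
	}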

The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created and
within each child cgroup a process was spawned to frequently update the
memory cgroup stats by creating and then reading a file of size 1T
(encouraging reclaim). The test program was run alongside these 26 tasks in
parallel. The results showed time and perf gains for the reader test
program.

test program elapsed time
control:
real    1m13.663s
user    0m0.948s
sys     1m12.356s

experiment:
real    0m42.498s
user    0m0.764s
sys     0m41.546s

test program perf
control:
31.75% mem_cgroup_css_rstat_flush
 5.49% __blkcg_rstat_flush
 0.10% cpu_stat_show
 0.05% cgroup_base_stat_cputime_show

experiment:
8.60% mem_cgroup_css_rstat_flush
0.15% blkcg_print_stat
0.12% cgroup_base_stat_cputime_show

It's worth noting that memcg uses heuristics to optimize flushing:
depending on the state of updated stats at a given time, a memcg flush may
be deemed unnecessary and skipped. That opportunity to skip a flush is lost
when memcg is flushed only as a side effect of sharing the tree with
another controller.

A second experiment was set up on the same host using a parent cgroup with
two child cgroups. The same swap and memory max were used as in the
previous experiment. In the two child cgroups, kernel builds were done in
parallel, each using "-j 20". The perf comparison is shown below.

test program elapsed time
control:
real    1m22.809s
user    0m1.142s
sys     1m21.138s

experiment:
real    0m42.504s
user    0m1.000s
sys     0m41.220s

test program perf
control:
37.16% mem_cgroup_css_rstat_flush
 3.68% __blkcg_rstat_flush
 0.09% cpu_stat_show
 0.06% cgroup_base_stat_cputime_show

experiment:
2.02% mem_cgroup_css_rstat_flush
0.20% blkcg_print_stat
0.14% cpu_stat_show
0.08% cgroup_base_stat_cputime_show

The final experiment differs from the previous two in that it measures
performance from the stat updater perspective. A kernel build was run in a
child node with -j 20 on the same host and cgroup setup. A baseline was
established by having the build run while no stats were read. The builds
were then repeated while stats were constantly being read. In all cases,
perf appeared similar in cycles spent on cgroup_rstat_updated()
(insignificant compared to the other recorded events). As for the elapsed
build times, the results of the different scenarios are shown below,
indicating no significant drawbacks of the split tree approach.

control with no readers
real    5m12.307s
user    84m52.037s
sys     3m54.000s

control with constant readers of {memory,io,cpu,cgroup}.stat
real    5m13.209s
user    84m47.949s
sys     4m9.260s

experiment with no readers
real    5m11.961s
user    84m41.750s
sys     3m54.058s

experiment with constant readers of {memory,io,cpu,cgroup}.stat
real    5m12.626s
user    85m0.323s
sys     3m56.167s

changelog
v3:
	new bpf kfunc api for updated/flush
	rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
	check for ss->css_rstat_flush existence where applicable
	rename locks for base stats
	move subsystem locks to cgroup_subsys struct
	change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
	change lock helpers to accept css and perform lock selection within
	fix comments that had outdated lock names
	add open css_is_cgroup() helper
	rename rstatc to rstatbc to reflect base stats in use
	rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
	add comments in early init code to explain deferred allocation
	misc formatting fixes

v2:
	drop the patch creating a new cgroup_rstat struct and related code
	drop bpf-specific patches. instead just use cgroup::self in bpf progs
	drop the cpu lock patches. instead select cpu lock in updated_list func
	relocate the cgroup_rstat_init() call to inside css_create()
	relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
		to css_free_rwork_fn()
v1:
	https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/

JP Kobryn (4):
  cgroup: separate rstat api for bpf programs
  cgroup: use separate rstat trees for each subsystem
  cgroup: use subsystem-specific rstat locks to avoid contention
  cgroup: split up cgroup_rstat_cpu into base stat and non base stat
    versions

 block/blk-cgroup.c                            |   6 +-
 include/linux/cgroup-defs.h                   |  80 ++--
 include/linux/cgroup.h                        |  16 +-
 include/trace/events/cgroup.h                 |  10 +-
 kernel/cgroup/cgroup-internal.h               |   6 +-
 kernel/cgroup/cgroup.c                        |  69 +--
 kernel/cgroup/rstat.c                         | 412 +++++++++++-------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 10 files changed, 363 insertions(+), 253 deletions(-)

-- 
2.47.1




* [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs
  2025-03-19 22:21 [PATCH 0/4 v3] cgroup: separate rstat trees JP Kobryn
@ 2025-03-19 22:21 ` JP Kobryn
  2025-03-24 17:47   ` Michal Koutný
  2025-03-19 22:21 ` [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem JP Kobryn
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: JP Kobryn @ 2025-03-19 22:21 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

The rstat updated/flush API functions are exported as kfuncs so bpf
programs can make the same calls that in-kernel code can. Split these API
functions into separate in-kernel and bpf versions. Function signatures
remain unchanged. The kfuncs are named with the prefix "bpf_". This
non-functional change allows for future commits which will modify the
signature of the in-kernel API without impacting bpf call sites. The
implementations of the kfuncs serve as adapters to the in-kernel API.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup.h                        |  3 +++
 kernel/cgroup/rstat.c                         | 19 ++++++++++++++-----
 .../bpf/progs/cgroup_hierarchical_stats.c     |  8 ++++----
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f8ef47f8a634..13fd82a4336d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -692,6 +692,9 @@ void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(struct cgroup *cgrp);
 
+void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
+void bpf_cgroup_rstat_flush(struct cgroup *cgrp);
+
 /*
  * Basic resource stats.
  */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index aac91466279f..0d66cfc53061 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -82,7 +82,7 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
  * rstat_cpu->updated_children list.  See the comment on top of
  * cgroup_rstat_cpu definition for details.
  */
-__bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 {
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
@@ -129,6 +129,11 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
 }
 
+__bpf_kfunc void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+{
+	cgroup_rstat_updated(cgrp, cpu);
+}
+
 /**
  * cgroup_rstat_push_children - push children cgroups into the given list
  * @head: current head of the list (= subtree root)
@@ -346,7 +351,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
  *
  * This function may block.
  */
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
+void cgroup_rstat_flush(struct cgroup *cgrp)
 {
 	might_sleep();
 
@@ -355,6 +360,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
+__bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
+{
+	cgroup_rstat_flush(cgrp);
+}
+
 /**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
@@ -644,10 +654,9 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 	cgroup_force_idle_show(seq, &cgrp->bstat);
 }
 
-/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
 BTF_KFUNCS_START(bpf_rstat_kfunc_ids)
-BTF_ID_FLAGS(func, cgroup_rstat_updated)
-BTF_ID_FLAGS(func, cgroup_rstat_flush, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_cgroup_rstat_updated)
+BTF_ID_FLAGS(func, bpf_cgroup_rstat_flush, KF_SLEEPABLE)
 BTF_KFUNCS_END(bpf_rstat_kfunc_ids)
 
 static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
index c74362854948..24450dd4d3f3 100644
--- a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -37,8 +37,8 @@ struct {
 	__type(value, struct attach_counter);
 } attach_counters SEC(".maps");
 
-extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
-extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+extern void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
+extern void bpf_cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
 
 static uint64_t cgroup_id(struct cgroup *cgrp)
 {
@@ -75,7 +75,7 @@ int BPF_PROG(counter, struct cgroup *dst_cgrp, struct task_struct *leader,
 	else if (create_percpu_attach_counter(cg_id, 1))
 		return 0;
 
-	cgroup_rstat_updated(dst_cgrp, bpf_get_smp_processor_id());
+	bpf_cgroup_rstat_updated(dst_cgrp, bpf_get_smp_processor_id());
 	return 0;
 }
 
@@ -141,7 +141,7 @@ int BPF_PROG(dumper, struct bpf_iter_meta *meta, struct cgroup *cgrp)
 		return 1;
 
 	/* Flush the stats to make sure we get the most updated numbers */
-	cgroup_rstat_flush(cgrp);
+	bpf_cgroup_rstat_flush(cgrp);
 
 	total_counter = bpf_map_lookup_elem(&attach_counters, &cg_id);
 	if (!total_counter) {
-- 
2.47.1




* [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem
  2025-03-19 22:21 [PATCH 0/4 v3] cgroup: separate rstat trees JP Kobryn
  2025-03-19 22:21 ` [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs JP Kobryn
@ 2025-03-19 22:21 ` JP Kobryn
  2025-03-20 21:31   ` Tejun Heo
  2025-03-24 17:48   ` Michal Koutný
  2025-03-19 22:21 ` [PATCH 3/4 v3] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
  2025-03-19 22:21 ` [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions JP Kobryn
  3 siblings, 2 replies; 17+ messages in thread
From: JP Kobryn @ 2025-03-19 22:21 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

Different subsystems may call cgroup_rstat_updated() within the same
cgroup, resulting in a tree of pending updates from multiple subsystems.
When one of these subsystems is flushed via cgroup_rstat_flush(), all
other subsystems with pending updates on the tree will also be flushed.

Change the paradigm of having a single rstat tree for all subsystems to
having separate trees for each subsystem. This separation allows for
subsystems to perform flushes without the side effects of other subsystems.
As an example, flushing the cpu stats will no longer cause the memory stats
to be flushed and vice versa.

In order to achieve subsystem-specific trees, change the tree node type
from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
the cgroup and instead place them on the css. Finally, change the
updated/flush APIs to accept a reference to a css instead of a cgroup.
This allows a specific subsystem to be associated with an update or flush.
Separate rstat trees will now exist for each unique subsystem.

Since updating/flushing will now be done at the subsystem level, there is
no longer a need to keep track of updated css nodes at the cgroup level.
The list management of these nodes done within the cgroup (rstat_css_list
and related) has been removed accordingly. There was also padding in the
cgroup to keep rstat_css_list on a cacheline different from
rstat_flush_next and the base stats. This padding has also been removed.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 block/blk-cgroup.c                            |   4 +-
 include/linux/cgroup-defs.h                   |  41 ++--
 include/linux/cgroup.h                        |  13 +-
 kernel/cgroup/cgroup-internal.h               |   4 +-
 kernel/cgroup/cgroup.c                        |  63 +++---
 kernel/cgroup/rstat.c                         | 212 +++++++++---------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 8 files changed, 177 insertions(+), 169 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9ed93d91d754..cd9521f4f607 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1201,7 +1201,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
-		cgroup_rstat_flush(blkcg->css.cgroup);
+		css_rstat_flush(&blkcg->css);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -2186,7 +2186,7 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
-	cgroup_rstat_updated(blkcg->css.cgroup, cpu);
+	css_rstat_updated(&blkcg->css, cpu);
 	put_cpu();
 }
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 17960a1e858d..031f55a9ac49 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -169,6 +169,9 @@ struct cgroup_subsys_state {
 	/* reference count - access via css_[try]get() and css_put() */
 	struct percpu_ref refcnt;
 
+	/* per-cpu recursive resource statistics */
+	struct css_rstat_cpu __percpu *rstat_cpu;
+
 	/*
 	 * siblings list anchored at the parent's ->children
 	 *
@@ -177,9 +180,6 @@ struct cgroup_subsys_state {
 	struct list_head sibling;
 	struct list_head children;
 
-	/* flush target list anchored at cgrp->rstat_css_list */
-	struct list_head rstat_css_node;
-
 	/*
 	 * PI: Subsys-unique ID.  0 is unused and root is always 1.  The
 	 * matching css can be looked up using css_from_id().
@@ -219,6 +219,13 @@ struct cgroup_subsys_state {
 	 * Protected by cgroup_mutex.
 	 */
 	int nr_descendants;
+
+	/*
+	 * A singly-linked list of css structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup_subsys_state *rstat_flush_next;
 };
 
 /*
@@ -329,10 +336,10 @@ struct cgroup_base_stat {
 
 /*
  * rstat - cgroup scalable recursive statistics.  Accounting is done
- * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
+ * per-cpu in css_rstat_cpu which is then lazily propagated up the
  * hierarchy on reads.
  *
- * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
+ * When a stat gets updated, the css_rstat_cpu and its ancestors are
  * linked into the updated tree.  On the following read, propagation only
  * considers and consumes the updated tree.  This makes reading O(the
  * number of descendants which have been active since last read) instead of
@@ -347,7 +354,7 @@ struct cgroup_base_stat {
  * updated_children and updated_next - and the fields which track basic
  * resource statistics on top of it - bsync, bstat and last_bstat.
  */
-struct cgroup_rstat_cpu {
+struct css_rstat_cpu {
 	/*
 	 * ->bsync protects ->bstat.  These are the only fields which get
 	 * updated in the hot path.
@@ -386,8 +393,8 @@ struct cgroup_rstat_cpu {
 	 *
 	 * Protected by per-cpu cgroup_rstat_cpu_lock.
 	 */
-	struct cgroup *updated_children;	/* terminated by self cgroup */
-	struct cgroup *updated_next;		/* NULL iff not on the list */
+	struct cgroup_subsys_state *updated_children;	/* terminated by self */
+	struct cgroup_subsys_state *updated_next;	/* NULL if not on list */
 };
 
 struct cgroup_freezer_state {
@@ -516,24 +523,6 @@ struct cgroup {
 	struct cgroup *dom_cgrp;
 	struct cgroup *old_dom_cgrp;		/* used while enabling threaded */
 
-	/* per-cpu recursive resource statistics */
-	struct cgroup_rstat_cpu __percpu *rstat_cpu;
-	struct list_head rstat_css_list;
-
-	/*
-	 * Add padding to separate the read mostly rstat_cpu and
-	 * rstat_css_list into a different cacheline from the following
-	 * rstat_flush_next and *bstat fields which can have frequent updates.
-	 */
-	CACHELINE_PADDING(_pad_);
-
-	/*
-	 * A singly-linked list of cgroup structures to be rstat flushed.
-	 * This is a scratch field to be used exclusively by
-	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
-	 */
-	struct cgroup	*rstat_flush_next;
-
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 13fd82a4336d..4e71ae9858d3 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -346,6 +346,11 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css)
 	return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
 }
 
+static inline bool css_is_cgroup(struct cgroup_subsys_state *css)
+{
+	return css->ss == NULL;
+}
+
 static inline void cgroup_get(struct cgroup *cgrp)
 {
 	css_get(&cgrp->self);
@@ -687,10 +692,10 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 /*
  * cgroup scalable recursive statistics.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
-void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(struct cgroup *cgrp);
+void css_rstat_updated(struct cgroup_subsys_state *css, int cpu);
+void css_rstat_flush(struct cgroup_subsys_state *css);
+void css_rstat_flush_hold(struct cgroup_subsys_state *css);
+void css_rstat_flush_release(struct cgroup_subsys_state *css);
 
 void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void bpf_cgroup_rstat_flush(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..d4b75fba9a54 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -269,8 +269,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
 /*
  * rstat.c
  */
-int cgroup_rstat_init(struct cgroup *cgrp);
-void cgroup_rstat_exit(struct cgroup *cgrp);
+int css_rstat_init(struct cgroup_subsys_state *css);
+void css_rstat_exit(struct cgroup_subsys_state *css);
 void cgroup_rstat_boot(void);
 void cgroup_base_stat_cputime_show(struct seq_file *seq);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index afc665b7b1fe..1e21065dec0e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -161,10 +161,12 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
 };
 #undef SUBSYS
 
-static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
+static DEFINE_PER_CPU(struct css_rstat_cpu, root_self_rstat_cpu);
 
 /* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root = {
+	.cgrp.self.rstat_cpu = &root_self_rstat_cpu
+};
 EXPORT_SYMBOL_GPL(cgrp_dfl_root);
 
 /*
@@ -1358,7 +1360,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	cgroup_unlock();
 
-	cgroup_rstat_exit(cgrp);
+	css_rstat_exit(&cgrp->self);
 	kernfs_destroy_root(root->kf_root);
 	cgroup_free_root(root);
 }
@@ -1863,13 +1865,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 		}
 		spin_unlock_irq(&css_set_lock);
 
-		if (ss->css_rstat_flush) {
-			list_del_rcu(&css->rstat_css_node);
-			synchronize_rcu();
-			list_add_rcu(&css->rstat_css_node,
-				     &dcgrp->rstat_css_list);
-		}
-
 		/* default hierarchy doesn't enable controllers by default */
 		dst_root->subsys_mask |= 1 << ssid;
 		if (dst_root == &cgrp_dfl_root) {
@@ -2052,7 +2047,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	cgrp->dom_cgrp = cgrp;
 	cgrp->max_descendants = INT_MAX;
 	cgrp->max_depth = INT_MAX;
-	INIT_LIST_HEAD(&cgrp->rstat_css_list);
 	prev_cputime_init(&cgrp->prev_cputime);
 
 	for_each_subsys(ss, ssid)
@@ -2132,7 +2126,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	if (ret)
 		goto destroy_root;
 
-	ret = cgroup_rstat_init(root_cgrp);
+	ret = css_rstat_init(&root_cgrp->self);
 	if (ret)
 		goto destroy_root;
 
@@ -2174,7 +2168,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	goto out;
 
 exit_stats:
-	cgroup_rstat_exit(root_cgrp);
+	css_rstat_exit(&root_cgrp->self);
 destroy_root:
 	kernfs_destroy_root(root->kf_root);
 	root->kf_root = NULL;
@@ -5407,6 +5401,9 @@ static void css_free_rwork_fn(struct work_struct *work)
 		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;
 
+		if (ss->css_rstat_flush)
+			css_rstat_exit(css);
+
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
@@ -5431,7 +5428,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
 			psi_cgroup_free(cgrp);
-			cgroup_rstat_exit(cgrp);
+			css_rstat_exit(&cgrp->self);
 			kfree(cgrp);
 		} else {
 			/*
@@ -5459,11 +5456,8 @@ static void css_release_work_fn(struct work_struct *work)
 	if (ss) {
 		struct cgroup *parent_cgrp;
 
-		/* css release path */
-		if (!list_empty(&css->rstat_css_node)) {
-			cgroup_rstat_flush(cgrp);
-			list_del_rcu(&css->rstat_css_node);
-		}
+		if (ss->css_rstat_flush)
+			css_rstat_flush(css);
 
 		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
 		if (ss->css_released)
@@ -5489,7 +5483,7 @@ static void css_release_work_fn(struct work_struct *work)
 		/* cgroup release path */
 		TRACE_CGROUP_PATH(release, cgrp);
 
-		cgroup_rstat_flush(cgrp);
+		css_rstat_flush(&cgrp->self);
 
 		spin_lock_irq(&css_set_lock);
 		for (tcgrp = cgroup_parent(cgrp); tcgrp;
@@ -5537,7 +5531,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	css->id = -1;
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
-	INIT_LIST_HEAD(&css->rstat_css_node);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
 
@@ -5546,9 +5539,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 		css_get(css->parent);
 	}
 
-	if (ss->css_rstat_flush)
-		list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
-
 	BUG_ON(cgroup_css(cgrp, ss));
 }
 
@@ -5641,6 +5631,12 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 		goto err_free_css;
 	css->id = err;
 
+	if (ss->css_rstat_flush) {
+		err = css_rstat_init(css);
+		if (err)
+			goto err_free_css;
+	}
+
 	/* @css is ready to be brought online now, make it visible */
 	list_add_tail_rcu(&css->sibling, &parent_css->children);
 	cgroup_idr_replace(&ss->css_idr, css, css->id);
@@ -5654,7 +5650,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_list_del:
 	list_del_rcu(&css->sibling);
 err_free_css:
-	list_del_rcu(&css->rstat_css_node);
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
 	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
@@ -5682,7 +5677,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (ret)
 		goto out_free_cgrp;
 
-	ret = cgroup_rstat_init(cgrp);
+	ret = css_rstat_init(&cgrp->self);
 	if (ret)
 		goto out_cancel_ref;
 
@@ -5775,7 +5770,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 out_kernfs_remove:
 	kernfs_remove(cgrp->kn);
 out_stat_exit:
-	cgroup_rstat_exit(cgrp);
+	css_rstat_exit(&cgrp->self);
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
@@ -6082,11 +6077,16 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	css->flags |= CSS_NO_REF;
 
 	if (early) {
-		/* allocation can't be done safely during early init */
+		/* allocation can't be done safely during early init.
+		 * defer idr and rstat allocations until cgroup_init().
+		 */
 		css->id = 1;
 	} else {
 		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
 		BUG_ON(css->id < 0);
+
+		if (ss->css_rstat_flush)
+			BUG_ON(css_rstat_init(css));
 	}
 
 	/* Update the init_css_set to contain a subsys
@@ -6185,9 +6185,16 @@ int __init cgroup_init(void)
 			struct cgroup_subsys_state *css =
 				init_css_set.subsys[ss->id];
 
+			/* it is now safe to perform allocations.
+			 * finish setting up subsystems that previously
+			 * deferred idr and rstat allocations.
+			 */
 			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
 						   GFP_KERNEL);
 			BUG_ON(css->id < 0);
+
+			if (ss->css_rstat_flush)
+				BUG_ON(css_rstat_init(css));
 		} else {
 			cgroup_init_subsys(ss, false);
 		}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 0d66cfc53061..a28c00b11736 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -14,9 +14,10 @@ static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
-static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
+static struct css_rstat_cpu *css_rstat_cpu(
+		struct cgroup_subsys_state *css, int cpu)
 {
-	return per_cpu_ptr(cgrp->rstat_cpu, cpu);
+	return per_cpu_ptr(css->rstat_cpu, cpu);
 }
 
 /*
@@ -74,16 +75,17 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
 }
 
 /**
- * cgroup_rstat_updated - keep track of updated rstat_cpu
- * @cgrp: target cgroup
+ * css_rstat_updated - keep track of updated rstat_cpu
+ * @css: target cgroup subsystem state
  * @cpu: cpu on which rstat_cpu was updated
  *
- * @cgrp's rstat_cpu on @cpu was updated.  Put it on the parent's matching
+ * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
  * rstat_cpu->updated_children list.  See the comment on top of
- * cgroup_rstat_cpu definition for details.
+ * css_rstat_cpu definition for details.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
+	struct cgroup *cgrp = css->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
 
@@ -92,19 +94,19 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 	 * temporary inaccuracies, which is fine.
 	 *
 	 * Because @parent's updated_children is terminated with @parent
-	 * instead of NULL, we can tell whether @cgrp is on the list by
+	 * instead of NULL, we can tell whether @css is on the list by
 	 * testing the next pointer for NULL.
 	 */
-	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
+	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
 		return;
 
 	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
 
-	/* put @cgrp and all ancestors on the corresponding updated lists */
+	/* put @css and all ancestors on the corresponding updated lists */
 	while (true) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
-		struct cgroup *parent = cgroup_parent(cgrp);
-		struct cgroup_rstat_cpu *prstatc;
+		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
+		struct cgroup_subsys_state *parent = css->parent;
+		struct css_rstat_cpu *prstatc;
 
 		/*
 		 * Both additions and removals are bottom-up.  If a cgroup
@@ -115,15 +117,15 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 
 		/* Root has no parent to link it to, but mark it busy */
 		if (!parent) {
-			rstatc->updated_next = cgrp;
+			rstatc->updated_next = css;
 			break;
 		}
 
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = css_rstat_cpu(parent, cpu);
 		rstatc->updated_next = prstatc->updated_children;
-		prstatc->updated_children = cgrp;
+		prstatc->updated_children = css;
 
-		cgrp = parent;
+		css = parent;
 	}
 
 	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
@@ -131,7 +133,7 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 
 __bpf_kfunc void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 {
-	cgroup_rstat_updated(cgrp, cpu);
+	css_rstat_updated(&cgrp->self, cpu);
 }
 
 /**
@@ -141,18 +143,19 @@ __bpf_kfunc void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
  * @cpu: target cpu
  * Return: A new singly linked list of cgroups to be flush
  *
- * Iteratively traverse down the cgroup_rstat_cpu updated tree level by
+ * Iteratively traverse down the css_rstat_cpu updated tree level by
  * level and push all the parents first before their next level children
  * into a singly linked list built from the tail backward like "pushing"
  * cgroups into a stack. The root is pushed by the caller.
  */
-static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
-						 struct cgroup *child, int cpu)
+static struct cgroup_subsys_state *cgroup_rstat_push_children(
+		struct cgroup_subsys_state *head,
+		struct cgroup_subsys_state *child, int cpu)
 {
-	struct cgroup *chead = child;	/* Head of child cgroup level */
-	struct cgroup *ghead = NULL;	/* Head of grandchild cgroup level */
-	struct cgroup *parent, *grandchild;
-	struct cgroup_rstat_cpu *crstatc;
+	struct cgroup_subsys_state *chead = child;	/* Head of child css level */
+	struct cgroup_subsys_state *ghead = NULL;	/* Head of grandchild css level */
+	struct cgroup_subsys_state *parent, *grandchild;
+	struct css_rstat_cpu *crstatc;
 
 	child->rstat_flush_next = NULL;
 
@@ -160,13 +163,13 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
 	while (chead) {
 		child = chead;
 		chead = child->rstat_flush_next;
-		parent = cgroup_parent(child);
+		parent = child->parent;
 
 		/* updated_next is parent cgroup terminated */
 		while (child != parent) {
 			child->rstat_flush_next = head;
 			head = child;
-			crstatc = cgroup_rstat_cpu(child, cpu);
+			crstatc = css_rstat_cpu(child, cpu);
 			grandchild = crstatc->updated_children;
 			if (grandchild != child) {
 				/* Push the grand child to the next level */
@@ -188,31 +191,33 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
 }
 
 /**
- * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
- * @root: root of the cgroup subtree to traverse
+ * css_rstat_updated_list - return a list of updated cgroups to be flushed
+ * @root: root of the css subtree to traverse
  * @cpu: target cpu
  * Return: A singly linked list of cgroups to be flushed
  *
  * Walks the updated rstat_cpu tree on @cpu from @root.  During traversal,
- * each returned cgroup is unlinked from the updated tree.
+ * each returned css is unlinked from the updated tree.
  *
  * The only ordering guarantee is that, for a parent and a child pair
  * covered by a given traversal, the child is before its parent in
  * the list.
  *
  * Note that updated_children is self terminated and points to a list of
- * child cgroups if not empty. Whereas updated_next is like a sibling link
- * within the children list and terminated by the parent cgroup. An exception
+ * child css's if not empty. Whereas updated_next is like a sibling link
+ * within the children list and terminated by the parent css. An exception
  * here is the cgroup root whose updated_next can be self terminated.
  */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+static struct cgroup_subsys_state *css_rstat_updated_list(
+		struct cgroup_subsys_state *root, int cpu)
 {
+	struct cgroup *cgrp = root->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
-	struct cgroup *head = NULL, *parent, *child;
+	struct css_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
+	struct cgroup_subsys_state *head = NULL, *parent, *child;
 	unsigned long flags;
 
-	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
+	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -222,17 +227,17 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	 * Unlink @root from its parent. As the updated_children list is
 	 * singly linked, we have to walk it to find the removal point.
 	 */
-	parent = cgroup_parent(root);
+	parent = root->parent;
 	if (parent) {
-		struct cgroup_rstat_cpu *prstatc;
-		struct cgroup **nextp;
+		struct css_rstat_cpu *prstatc;
+		struct cgroup_subsys_state **nextp;
 
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = css_rstat_cpu(parent, cpu);
 		nextp = &prstatc->updated_children;
 		while (*nextp != root) {
-			struct cgroup_rstat_cpu *nrstatc;
+			struct css_rstat_cpu *nrstatc;
 
-			nrstatc = cgroup_rstat_cpu(*nextp, cpu);
+			nrstatc = css_rstat_cpu(*nextp, cpu);
 			WARN_ON_ONCE(*nextp == parent);
 			nextp = &nrstatc->updated_next;
 		}
@@ -249,14 +254,14 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	if (child != root)
 		head = cgroup_rstat_push_children(head, child, cpu);
 unlock_ret:
-	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
+	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
 	return head;
 }
 
 /*
  * A hook for bpf stat collectors to attach to and flush their stats.
- * Together with providing bpf kfuncs for cgroup_rstat_updated() and
- * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
+ * Together with providing bpf kfuncs for css_rstat_updated() and
+ * css_rstat_flush(), this enables a complete workflow where bpf progs that
  * collect cgroup stats can integrate with rstat for efficient flushing.
  *
  * A static noinline declaration here could cause the compiler to optimize away
@@ -304,28 +309,26 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
-/* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
+/* see css_rstat_flush() */
+static void css_rstat_flush_locked(struct cgroup_subsys_state *css)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
 {
+	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
+		struct cgroup_subsys_state *pos;
 
+		pos = css_rstat_updated_list(css, cpu);
 		for (; pos; pos = pos->rstat_flush_next) {
-			struct cgroup_subsys_state *css;
-
-			cgroup_base_stat_flush(pos, cpu);
-			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
-
-			rcu_read_lock();
-			list_for_each_entry_rcu(css, &pos->rstat_css_list,
-						rstat_css_node)
-				css->ss->css_rstat_flush(css, cpu);
-			rcu_read_unlock();
+			if (css_is_cgroup(pos)) {
+				cgroup_base_stat_flush(pos->cgroup, cpu);
+				bpf_rstat_flush(pos->cgroup,
+						cgroup_parent(pos->cgroup), cpu);
+			} else if (pos->ss->css_rstat_flush)
+				pos->ss->css_rstat_flush(pos, cpu);
 		}
 
 		/* play nice and yield if necessary */
@@ -339,98 +342,101 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 }
 
 /**
- * cgroup_rstat_flush - flush stats in @cgrp's subtree
- * @cgrp: target cgroup
+ * css_rstat_flush - flush stats in @css's rstat subtree
+ * @css: target cgroup subsystem state
  *
- * Collect all per-cpu stats in @cgrp's subtree into the global counters
- * and propagate them upwards.  After this function returns, all cgroups in
- * the subtree have up-to-date ->stat.
+ * Collect all per-cpu stats in @css's subtree into the global counters
+ * and propagate them upwards. After this function returns, all rstat
+ * nodes in the subtree have up-to-date ->stat.
  *
- * This also gets all cgroups in the subtree including @cgrp off the
+ * This also gets all rstat nodes in the subtree including @css off the
  * ->updated_children lists.
  *
  * This function may block.
  */
-void cgroup_rstat_flush(struct cgroup *cgrp)
+void css_rstat_flush(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	might_sleep();
 
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+	css_rstat_flush_locked(css);
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
 __bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
 {
-	cgroup_rstat_flush(cgrp);
+	css_rstat_flush(&cgrp->self);
 }
 
 /**
- * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
- * @cgrp: target cgroup
+ * css_rstat_flush_hold - flush stats in @css's rstat subtree and hold
+ * @css: target subsystem state
  *
- * Flush stats in @cgrp's subtree and prevent further flushes.  Must be
- * paired with cgroup_rstat_flush_release().
+ * Flush stats in @css's rstat subtree and prevent further flushes. Must be
+ * paired with css_rstat_flush_release().
  *
  * This function may block.
  */
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
-	__acquires(&cgroup_rstat_lock)
+void css_rstat_flush_hold(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	might_sleep();
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+	css_rstat_flush_locked(css);
 }
 
 /**
- * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
- * @cgrp: cgroup used by tracepoint
+ * css_rstat_flush_release - release css_rstat_flush_hold()
+ * @css: css that was previously used for the call to flush hold
  */
-void cgroup_rstat_flush_release(struct cgroup *cgrp)
-	__releases(&cgroup_rstat_lock)
+void css_rstat_flush_release(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
-int cgroup_rstat_init(struct cgroup *cgrp)
+int css_rstat_init(struct cgroup_subsys_state *css)
 {
 	int cpu;
 
-	/* the root cgrp has rstat_cpu preallocated */
-	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
-		if (!cgrp->rstat_cpu)
+	/* the root cgrp's self css has rstat_cpu preallocated */
+	if (!css->rstat_cpu) {
+		css->rstat_cpu = alloc_percpu(struct css_rstat_cpu);
+		if (!css->rstat_cpu)
 			return -ENOMEM;
 	}
 
 	/* ->updated_children list is self terminated */
 	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-		rstatc->updated_children = cgrp;
+		rstatc->updated_children = css;
 		u64_stats_init(&rstatc->bsync);
 	}
 
 	return 0;
 }
 
-void cgroup_rstat_exit(struct cgroup *cgrp)
+void css_rstat_exit(struct cgroup_subsys_state *css)
 {
 	int cpu;
 
-	cgroup_rstat_flush(cgrp);
+	css_rstat_flush(css);
 
 	/* sanity check */
 	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-		if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
+		if (WARN_ON_ONCE(rstatc->updated_children != css) ||
 		    WARN_ON_ONCE(rstatc->updated_next))
 			return;
 	}
 
-	free_percpu(cgrp->rstat_cpu);
-	cgrp->rstat_cpu = NULL;
+	free_percpu(css->rstat_cpu);
+	css->rstat_cpu = NULL;
 }
 
 void __init cgroup_rstat_boot(void)
@@ -471,9 +477,9 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
-	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+	struct css_rstat_cpu *rstatc = css_rstat_cpu(&cgrp->self, cpu);
 	struct cgroup *parent = cgroup_parent(cgrp);
-	struct cgroup_rstat_cpu *prstatc;
+	struct css_rstat_cpu *prstatc;
 	struct cgroup_base_stat delta;
 	unsigned seq;
 
@@ -501,35 +507,35 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 
 		delta = rstatc->subtree_bstat;
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = css_rstat_cpu(&parent->self, cpu);
 		cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
 		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
 		cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
 	}
 }
 
-static struct cgroup_rstat_cpu *
+static struct css_rstat_cpu *
 cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct css_rstat_cpu *rstatc;
 
-	rstatc = get_cpu_ptr(cgrp->rstat_cpu);
+	rstatc = get_cpu_ptr(cgrp->self.rstat_cpu);
 	*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
 	return rstatc;
 }
 
 static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
-						 struct cgroup_rstat_cpu *rstatc,
+						 struct css_rstat_cpu *rstatc,
 						 unsigned long flags)
 {
 	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
-	cgroup_rstat_updated(cgrp, smp_processor_id());
+	css_rstat_updated(&cgrp->self, smp_processor_id());
 	put_cpu_ptr(rstatc);
 }
 
 void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct css_rstat_cpu *rstatc;
 	unsigned long flags;
 
 	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
@@ -540,7 +546,7 @@ void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
 void __cgroup_account_cputime_field(struct cgroup *cgrp,
 				    enum cpu_usage_stat index, u64 delta_exec)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct css_rstat_cpu *rstatc;
 	unsigned long flags;
 
 	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
@@ -625,12 +631,12 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 	u64 usage, utime, stime, ntime;
 
 	if (cgroup_parent(cgrp)) {
-		cgroup_rstat_flush_hold(cgrp);
+		css_rstat_flush_hold(&cgrp->self);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
 		ntime = cgrp->bstat.ntime;
-		cgroup_rstat_flush_release(cgrp);
+		css_rstat_flush_release(&cgrp->self);
 	} else {
 		/* cgrp->bstat of root is not actually used, reuse it */
 		root_cgroup_cputime(&cgrp->bstat);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4de6acb9b8ec..fe86d7efe372 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -579,7 +579,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+	css_rstat_updated(&memcg->css, cpu);
 	statc = this_cpu_ptr(memcg->vmstats_percpu);
 	for (; statc; statc = statc->parent) {
 		stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
@@ -611,7 +611,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
-	cgroup_rstat_flush(memcg->css.cgroup);
+	css_rstat_flush(&memcg->css);
 }
 
 /*
diff --git a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
index 38f78d9345de..f362f7d41b9e 100644
--- a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
+++ b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
@@ -45,7 +45,7 @@ int BPF_PROG(test_percpu2, struct bpf_testmod_btf_type_tag_2 *arg)
 SEC("tp_btf/cgroup_mkdir")
 int BPF_PROG(test_percpu_load, struct cgroup *cgrp, const char *path)
 {
-	g = (__u64)cgrp->rstat_cpu->updated_children;
+	g = (__u64)cgrp->self.rstat_cpu->updated_children;
 	return 0;
 }
 
@@ -56,7 +56,8 @@ int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
 	__u32 cpu;
 
 	cpu = bpf_get_smp_processor_id();
-	rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat_cpu, cpu);
+	rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(
+			cgrp->self.rstat_cpu, cpu);
 	if (rstat) {
 		/* READ_ONCE */
 		*(volatile int *)rstat;
-- 
2.47.1




* [PATCH 3/4 v3] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-03-19 22:21 [PATCH 0/4 v3] cgroup: separate rstat trees JP Kobryn
  2025-03-19 22:21 ` [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs JP Kobryn
  2025-03-19 22:21 ` [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem JP Kobryn
@ 2025-03-19 22:21 ` JP Kobryn
  2025-03-20 21:38   ` Tejun Heo
  2025-03-24 17:48   ` Michal Koutný
  2025-03-19 22:21 ` [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions JP Kobryn
  3 siblings, 2 replies; 17+ messages in thread
From: JP Kobryn @ 2025-03-19 22:21 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

It is possible to eliminate contention between subsystems when
updating/flushing stats by using subsystem-specific locks. Let the existing
rstat locks be dedicated to the base stats and rename them to reflect it.
Add similar locks to the cgroup_subsys struct for use with individual
subsystems. To make use of the new locks, change the existing lock helper
functions to accept a reference to a css and use css->ss to access the
locks or use the static locks for base stats when css->ss is NULL.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 block/blk-cgroup.c              |   2 +-
 include/linux/cgroup-defs.h     |  12 ++-
 include/trace/events/cgroup.h   |  10 ++-
 kernel/cgroup/cgroup-internal.h |   2 +-
 kernel/cgroup/cgroup.c          |  10 ++-
 kernel/cgroup/rstat.c           | 145 +++++++++++++++++++++-----------
 6 files changed, 122 insertions(+), 59 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index cd9521f4f607..34d72bbdd5e5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1022,7 +1022,7 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
 	/*
 	 * For covering concurrent parent blkg update from blkg_release().
 	 *
-	 * When flushing from cgroup, cgroup_rstat_lock is always held, so
+	 * When flushing from cgroup, the subsystem lock is always held, so
 	 * this lock won't cause contention most of time.
 	 */
 	raw_spin_lock_irqsave(&blkg_stat_lock, flags);
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 031f55a9ac49..0ffc8438c6d9 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -223,7 +223,10 @@ struct cgroup_subsys_state {
 	/*
 	 * A singly-linked list of css structures to be rstat flushed.
 	 * This is a scratch field to be used exclusively by
-	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 * cgroup_rstat_flush_locked().
+	 *
+	 * protected by rstat_base_lock when css is cgroup::self
+	 * protected by css->ss->lock otherwise
 	 */
 	struct cgroup_subsys_state *rstat_flush_next;
 };
@@ -391,7 +394,9 @@ struct css_rstat_cpu {
 	 * to the cgroup makes it unnecessary for each per-cpu struct to
 	 * point back to the associated cgroup.
 	 *
-	 * Protected by per-cpu cgroup_rstat_cpu_lock.
+	 * Protected by per-cpu rstat_base_cpu_lock when css->ss == NULL
+	 * otherwise,
+	 * Protected by per-cpu css->ss->rstat_cpu_lock
 	 */
 	struct cgroup_subsys_state *updated_children;	/* terminated by self */
 	struct cgroup_subsys_state *updated_next;	/* NULL if not on list */
@@ -779,6 +784,9 @@ struct cgroup_subsys {
 	 * specifies the mask of subsystems that this one depends on.
 	 */
 	unsigned int depends_on;
+
+	spinlock_t lock;
+	raw_spinlock_t __percpu *percpu_lock;
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index af2755bda6eb..ec3a95bf4981 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -231,7 +231,10 @@ DECLARE_EVENT_CLASS(cgroup_rstat,
 		  __entry->cpu, __entry->contended)
 );
 
-/* Related to global: cgroup_rstat_lock */
+/* Related to locks:
+ * rstat_base_lock when handling cgroup::self
+ * css->ss->lock otherwise
+ */
 DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
 
 	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
@@ -253,7 +256,10 @@ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
 	TP_ARGS(cgrp, cpu, contended)
 );
 
-/* Related to per CPU: cgroup_rstat_cpu_lock */
+/* Related to per CPU locks:
+ * rstat_base_cpu_lock when handling cgroup::self
+ * css->ss->cpu_lock otherwise
+ */
 DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
 
 	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index d4b75fba9a54..513bfce3bc23 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -271,7 +271,7 @@ int cgroup_task_count(const struct cgroup *cgrp);
  */
 int css_rstat_init(struct cgroup_subsys_state *css);
 void css_rstat_exit(struct cgroup_subsys_state *css);
-void cgroup_rstat_boot(void);
+int ss_rstat_init(struct cgroup_subsys *ss);
 void cgroup_base_stat_cputime_show(struct seq_file *seq);
 
 /*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1e21065dec0e..3e8948805f67 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6085,8 +6085,10 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
 		BUG_ON(css->id < 0);
 
-		if (ss->css_rstat_flush)
+		if (ss->css_rstat_flush) {
+			BUG_ON(ss_rstat_init(ss));
 			BUG_ON(css_rstat_init(css));
+		}
 	}
 
 	/* Update the init_css_set to contain a subsys
@@ -6163,7 +6165,7 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_psi_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
 
-	cgroup_rstat_boot();
+	BUG_ON(ss_rstat_init(NULL));
 
 	get_user_ns(init_cgroup_ns.user_ns);
 
@@ -6193,8 +6195,10 @@ int __init cgroup_init(void)
 						   GFP_KERNEL);
 			BUG_ON(css->id < 0);
 
-			if (ss->css_rstat_flush)
+			if (ss->css_rstat_flush) {
+				BUG_ON(ss_rstat_init(ss));
 				BUG_ON(css_rstat_init(css));
+			}
 		} else {
 			cgroup_init_subsys(ss, false);
 		}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a28c00b11736..ffd7ac6bcefc 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,8 +9,8 @@
 
 #include <trace/events/cgroup.h>
 
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
-static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static DEFINE_SPINLOCK(rstat_base_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -20,8 +20,24 @@ static struct css_rstat_cpu *css_rstat_cpu(
 	return per_cpu_ptr(css->rstat_cpu, cpu);
 }
 
+static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
+{
+	if (ss)
+		return &ss->lock;
+
+	return &rstat_base_lock;
+}
+
+static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
+{
+	if (ss)
+		return per_cpu_ptr(ss->percpu_lock, cpu);
+
+	return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
+}
+
 /*
- * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
+ * Helper functions for rstat per CPU locks.
  *
  * This makes it easier to diagnose locking issues and contention in
  * production environments. The parameter @fast_path determine the
@@ -29,20 +45,23 @@ static struct css_rstat_cpu *css_rstat_cpu(
  * operations without handling high-frequency fast-path "update" events.
  */
 static __always_inline
-unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
-				     struct cgroup *cgrp, const bool fast_path)
+unsigned long _css_rstat_cpu_lock(struct cgroup_subsys_state *css, int cpu,
+		const bool fast_path)
 {
+	struct cgroup *cgrp = css->cgroup;
+	raw_spinlock_t *cpu_lock;
 	unsigned long flags;
 	bool contended;
 
 	/*
-	 * The _irqsave() is needed because cgroup_rstat_lock is
-	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-	 * this lock with the _irq() suffix only disables interrupts on
-	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-	 * interrupts on both configurations. The _irqsave() ensures
-	 * that interrupts are always disabled and later restored.
+	 * The _irqsave() is needed because the locks used for flushing are
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
+	 * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
+	 * kernel. The raw_spinlock_t below disables interrupts on both
+	 * configurations. The _irqsave() ensures that interrupts are always
+	 * disabled and later restored.
 	 */
+	cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
 	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
 	if (contended) {
 		if (fast_path)
@@ -62,15 +81,18 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
 }
 
 static __always_inline
-void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
-			      struct cgroup *cgrp, unsigned long flags,
-			      const bool fast_path)
+void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
+		unsigned long flags, const bool fast_path)
 {
+	struct cgroup *cgrp = css->cgroup;
+	raw_spinlock_t *cpu_lock;
+
 	if (fast_path)
 		trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
 	else
 		trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
 
+	cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
 	raw_spin_unlock_irqrestore(cpu_lock, flags);
 }
 
@@ -85,8 +107,6 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
  */
 void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
-	struct cgroup *cgrp = css->cgroup;
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
 
 	/*
@@ -100,7 +120,7 @@ void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
 		return;
 
-	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
+	flags = _css_rstat_cpu_lock(css, cpu, true);
 
 	/* put @css and all ancestors on the corresponding updated lists */
 	while (true) {
@@ -128,7 +148,7 @@ void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 		css = parent;
 	}
 
-	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
+	_css_rstat_cpu_unlock(css, cpu, flags, true);
 }
 
 __bpf_kfunc void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
@@ -211,13 +231,11 @@ static struct cgroup_subsys_state *cgroup_rstat_push_children(
 static struct cgroup_subsys_state *css_rstat_updated_list(
 		struct cgroup_subsys_state *root, int cpu)
 {
-	struct cgroup *cgrp = root->cgroup;
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	struct css_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
 	struct cgroup_subsys_state *head = NULL, *parent, *child;
 	unsigned long flags;
 
-	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
+	flags = _css_rstat_cpu_lock(root, cpu, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -254,7 +272,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
 	if (child != root)
 		head = cgroup_rstat_push_children(head, child, cpu);
 unlock_ret:
-	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
+	_css_rstat_cpu_unlock(root, cpu, flags, false);
 	return head;
 }
 
@@ -281,7 +299,7 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 __bpf_hook_end();
 
 /*
- * Helper functions for locking cgroup_rstat_lock.
+ * Helper functions for locking.
  *
  * This makes it easier to diagnose locking issues and contention in
  * production environments.  The parameter @cpu_in_loop indicate lock
@@ -289,35 +307,44 @@ __bpf_hook_end();
  * value -1 is used when obtaining the main lock else this is the CPU
  * number processed last.
  */
-static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
-	__acquires(&cgroup_rstat_lock)
+static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
+		int cpu_in_loop)
+	__acquires(lock)
 {
+	struct cgroup *cgrp = css->cgroup;
+	spinlock_t *lock;
 	bool contended;
 
-	contended = !spin_trylock_irq(&cgroup_rstat_lock);
+	lock = ss_rstat_lock(css->ss);
+	contended = !spin_trylock_irq(lock);
 	if (contended) {
 		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
-		spin_lock_irq(&cgroup_rstat_lock);
+		spin_lock_irq(lock);
 	}
 	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
 
-static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
-	__releases(&cgroup_rstat_lock)
+static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
+		int cpu_in_loop)
+	__releases(lock)
 {
+	struct cgroup *cgrp = css->cgroup;
+	spinlock_t *lock;
+
+	lock = ss_rstat_lock(css->ss);
 	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	spin_unlock_irq(lock);
 }
 
-/* see css_rstat_flush() */
+/* see css_rstat_flush()
+ *
+ * it is required that callers have previously acquired a lock via
+ * __css_rstat_lock(css)
+ */
 static void css_rstat_flush_locked(struct cgroup_subsys_state *css)
-	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
 {
-	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
-	lockdep_assert_held(&cgroup_rstat_lock);
-
 	for_each_possible_cpu(cpu) {
 		struct cgroup_subsys_state *pos;
 
@@ -332,11 +359,11 @@ static void css_rstat_flush_locked(struct cgroup_subsys_state *css)
 		}
 
 		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			__cgroup_rstat_unlock(cgrp, cpu);
+		if (need_resched() || spin_needbreak(ss_rstat_lock(css->ss))) {
+			__css_rstat_unlock(css, cpu);
 			if (!cond_resched())
 				cpu_relax();
-			__cgroup_rstat_lock(cgrp, cpu);
+			__css_rstat_lock(css, cpu);
 		}
 	}
 }
@@ -356,13 +383,10 @@ static void css_rstat_flush_locked(struct cgroup_subsys_state *css)
  */
 void css_rstat_flush(struct cgroup_subsys_state *css)
 {
-	struct cgroup *cgrp = css->cgroup;
-
 	might_sleep();
-
-	__cgroup_rstat_lock(cgrp, -1);
+	__css_rstat_lock(css, -1);
 	css_rstat_flush_locked(css);
-	__cgroup_rstat_unlock(cgrp, -1);
+	__css_rstat_unlock(css, -1);
 }
 
 __bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
@@ -381,10 +405,8 @@ __bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
  */
 void css_rstat_flush_hold(struct cgroup_subsys_state *css)
 {
-	struct cgroup *cgrp = css->cgroup;
-
 	might_sleep();
-	__cgroup_rstat_lock(cgrp, -1);
+	__css_rstat_lock(css, -1);
 	css_rstat_flush_locked(css);
 }
 
@@ -394,8 +416,7 @@ void css_rstat_flush_hold(struct cgroup_subsys_state *css)
  */
 void css_rstat_flush_release(struct cgroup_subsys_state *css)
 {
-	struct cgroup *cgrp = css->cgroup;
-	__cgroup_rstat_unlock(cgrp, -1);
+	__css_rstat_unlock(css, -1);
 }
 
 int css_rstat_init(struct cgroup_subsys_state *css)
@@ -439,12 +460,36 @@ void css_rstat_exit(struct cgroup_subsys_state *css)
 	css->rstat_cpu = NULL;
 }
 
-void __init cgroup_rstat_boot(void)
+/**
+ * ss_rstat_init - subsystem-specific rstat initialization
+ * @ss: target subsystem
+ *
+ * If @ss is NULL, the static locks associated with the base stats
+ * are initialized. If @ss is non-NULL, the subsystem-specific locks
+ * are initialized.
+ */
+int __init ss_rstat_init(struct cgroup_subsys *ss)
 {
 	int cpu;
 
+	if (!ss) {
+		spin_lock_init(&rstat_base_lock);
+
+		for_each_possible_cpu(cpu)
+			raw_spin_lock_init(per_cpu_ptr(&rstat_base_cpu_lock, cpu));
+
+		return 0;
+	}
+
+	spin_lock_init(&ss->lock);
+	ss->percpu_lock = alloc_percpu(raw_spinlock_t);
+	if (!ss->percpu_lock)
+		return -ENOMEM;
+
 	for_each_possible_cpu(cpu)
-		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
+		raw_spin_lock_init(per_cpu_ptr(ss->percpu_lock, cpu));
+
+	return 0;
 }
 
 /*
-- 
2.47.1



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions
  2025-03-19 22:21 [PATCH 0/4 v3] cgroup: separate rstat trees JP Kobryn
                   ` (2 preceding siblings ...)
  2025-03-19 22:21 ` [PATCH 3/4 v3] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
@ 2025-03-19 22:21 ` JP Kobryn
  2025-03-20 21:44   ` Tejun Heo
  3 siblings, 1 reply; 17+ messages in thread
From: JP Kobryn @ 2025-03-19 22:21 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

The cgroup_rstat_cpu struct contains rstat node pointers and also the base
stat objects. Since ownership of the cgroup_rstat_cpu has shifted from
cgroup to cgroup_subsys_state, css's other than cgroup::self are now
carrying along these base stat objects which go unused. Eliminate this
wasted memory by splitting up cgroup_rstat_cpu into two separate structs.

The cgroup_rstat_cpu struct is modified so that it now contains only the
rstat node pointers. css's that are associated with a subsystem
(memory, io) use this compact struct to participate in rstat without the
memory overhead of the base stat objects.

As for css's represented by cgroup::self, a new cgroup_rstat_base_cpu
struct is introduced. It contains the new compact cgroup_rstat_cpu struct
as its first field followed by the base stat objects.

Because the rstat pointers sit at the same offset (the beginning) of both
structs, cgroup_subsys_state is modified to hold a union of (percpu)
pointers to the two structs. Where css initialization is done, the compact
struct is allocated when the css is associated with a subsystem; otherwise,
the full struct is allocated. The union allows the existing rstat
updated/flush routines to work with any css regardless of subsystem
association. The base stat routines, however, were modified to access the
full struct through its field in the union.
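
To illustrate the layout trick outside the kernel, below is a minimal
standalone sketch. It is not the kernel code: the names are made up and
plain calloc() stands in for the percpu allocations. The point is only
that the compact struct sits at the start of the full struct, so a single
union'd pointer can address either allocation.

/*
 * Compact vs. full per-cpu structs overlaid through a union of
 * pointers; the shared fields line up at offset zero.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct compact {                /* rstat list pointers only */
        void *updated_children;
        void *updated_next;
};

struct full {                   /* compact part first, then base stats */
        struct compact c;
        long bstat[8];
};

struct state {
        bool is_self;           /* stands in for css->ss == NULL */
        union {
                struct compact *compact;
                struct full *full;
        };
};

static int state_init(struct state *s, bool is_self)
{
        s->is_self = is_self;
        if (is_self)
                s->full = calloc(1, sizeof(*s->full));
        else
                s->compact = calloc(1, sizeof(*s->compact));
        /* both union members share storage, so checking either works */
        return s->compact ? 0 : -1;
}

int main(void)
{
        struct state self, sub;

        state_init(&self, true);
        state_init(&sub, false);

        /* update/flush style code only ever touches the compact fields */
        self.compact->updated_next = &self;
        sub.compact->updated_next = &sub;

        /* base stat style code checks is_self and uses the full view */
        if (self.is_self)
                self.full->bstat[0] += 42;

        printf("compact=%zu full=%zu\n",
               sizeof(struct compact), sizeof(struct full));
        return 0;
}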

The change in memory on a per-cpu basis is shown below.

before:
    struct size
    sizeof(cgroup_rstat_cpu) =~ 144 bytes /* can vary based on config */

    per-cpu overhead
    nr_cgroups * (
	sizeof(cgroup_rstat_cpu) * (1 + nr_rstat_subsystems)
    )
    nr_cgroups * (144 * (1 + 2))
    nr_cgroups * 432

    432 bytes per cgroup per cpu

after:
    struct sizes
    sizeof(cgroup_rstat_base_cpu) =~ 144 bytes
    sizeof(cgroup_rstat_cpu) = 16 bytes

    per-cpu overhead
    nr_cgroups * (
	sizeof(cgroup_rstat_base_cpu) +
	sizeof(cgroup_rstat_cpu) * (nr_rstat_subsystems)
    )
    nr_cgroups * (144 + 16 * 2)
    nr_cgroups * 176

    176 bytes per cgroup per cpu

savings:
    256 bytes per cgroup per cpu

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup-defs.h |  41 +++++++++------
 kernel/cgroup/rstat.c       | 100 ++++++++++++++++++++++--------------
 2 files changed, 86 insertions(+), 55 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 0ffc8438c6d9..f9b84e7f718d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -170,7 +170,10 @@ struct cgroup_subsys_state {
 	struct percpu_ref refcnt;
 
 	/* per-cpu recursive resource statistics */
-	struct css_rstat_cpu __percpu *rstat_cpu;
+	union {
+		struct css_rstat_cpu __percpu *rstat_cpu;
+		struct css_rstat_base_cpu __percpu *rstat_base_cpu;
+	};
 
 	/*
 	 * siblings list anchored at the parent's ->children
@@ -358,6 +361,26 @@ struct cgroup_base_stat {
  * resource statistics on top of it - bsync, bstat and last_bstat.
  */
 struct css_rstat_cpu {
+	/*
+	 * Child cgroups with stat updates on this cpu since the last read
+	 * are linked on the parent's ->updated_children through
+	 * ->updated_next.
+	 *
+	 * In addition to being more compact, singly-linked list pointing
+	 * to the cgroup makes it unnecessary for each per-cpu struct to
+	 * point back to the associated cgroup.
+	 *
+	 * Protected by per-cpu rstat_base_cpu_lock when css->ss == NULL
+	 * otherwise,
+	 * Protected by per-cpu css->ss->rstat_cpu_lock
+	 */
+	struct cgroup_subsys_state *updated_children;	/* terminated by self */
+	struct cgroup_subsys_state *updated_next;	/* NULL if not on list */
+};
+
+struct css_rstat_base_cpu {
+	struct css_rstat_cpu rstat_cpu;
+
 	/*
 	 * ->bsync protects ->bstat.  These are the only fields which get
 	 * updated in the hot path.
@@ -384,22 +407,6 @@ struct css_rstat_cpu {
 	 * deltas to propagate to the per-cpu subtree_bstat.
 	 */
 	struct cgroup_base_stat last_subtree_bstat;
-
-	/*
-	 * Child cgroups with stat updates on this cpu since the last read
-	 * are linked on the parent's ->updated_children through
-	 * ->updated_next.
-	 *
-	 * In addition to being more compact, singly-linked list pointing
-	 * to the cgroup makes it unnecessary for each per-cpu struct to
-	 * point back to the associated cgroup.
-	 *
-	 * Protected by per-cpu rstat_base_cpu_lock when css->ss == NULL
-	 * otherwise,
-	 * Protected by per-cpu css->ss->rstat_cpu_lock
-	 */
-	struct cgroup_subsys_state *updated_children;	/* terminated by self */
-	struct cgroup_subsys_state *updated_next;	/* NULL if not on list */
 };
 
 struct cgroup_freezer_state {
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index ffd7ac6bcefc..250f0987407e 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -20,6 +20,12 @@ static struct css_rstat_cpu *css_rstat_cpu(
 	return per_cpu_ptr(css->rstat_cpu, cpu);
 }
 
+static struct css_rstat_base_cpu *css_rstat_base_cpu(
+		struct cgroup_subsys_state *css, int cpu)
+{
+	return per_cpu_ptr(css->rstat_base_cpu, cpu);
+}
+
 static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
 {
 	if (ss)
@@ -425,17 +431,35 @@ int css_rstat_init(struct cgroup_subsys_state *css)
 
 	/* the root cgrp's self css has rstat_cpu preallocated */
 	if (!css->rstat_cpu) {
-		css->rstat_cpu = alloc_percpu(struct css_rstat_cpu);
-		if (!css->rstat_cpu)
-			return -ENOMEM;
+		/* One of the union fields must be initialized.
+		 * Allocate the larger rstat struct for base stats when css is
+		 * cgroup::self.
+		 * Otherwise, allocate the compact rstat struct since the css is
+		 * associated with a subsystem.
+		 */
+		if (css_is_cgroup(css)) {
+			css->rstat_base_cpu = alloc_percpu(struct css_rstat_base_cpu);
+			if (!css->rstat_base_cpu)
+				return -ENOMEM;
+		} else {
+			css->rstat_cpu = alloc_percpu(struct css_rstat_cpu);
+			if (!css->rstat_cpu)
+				return -ENOMEM;
+		}
 	}
 
-	/* ->updated_children list is self terminated */
 	for_each_possible_cpu(cpu) {
-		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
+		struct css_rstat_cpu *rstatc;
 
+		rstatc = css_rstat_cpu(css, cpu);
 		rstatc->updated_children = css;
-		u64_stats_init(&rstatc->bsync);
+
+		if (css_is_cgroup(css)) {
+			struct css_rstat_base_cpu *rstatbc;
+
+			rstatbc = css_rstat_base_cpu(css, cpu);
+			u64_stats_init(&rstatbc->bsync);
+		}
 	}
 
 	return 0;
@@ -522,9 +546,9 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
-	struct css_rstat_cpu *rstatc = css_rstat_cpu(&cgrp->self, cpu);
+	struct css_rstat_base_cpu *rstatbc = css_rstat_base_cpu(&cgrp->self, cpu);
 	struct cgroup *parent = cgroup_parent(cgrp);
-	struct css_rstat_cpu *prstatc;
+	struct css_rstat_base_cpu *prstatbc;
 	struct cgroup_base_stat delta;
 	unsigned seq;
 
@@ -534,15 +558,15 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 
 	/* fetch the current per-cpu values */
 	do {
-		seq = __u64_stats_fetch_begin(&rstatc->bsync);
-		delta = rstatc->bstat;
-	} while (__u64_stats_fetch_retry(&rstatc->bsync, seq));
+		seq = __u64_stats_fetch_begin(&rstatbc->bsync);
+		delta = rstatbc->bstat;
+	} while (__u64_stats_fetch_retry(&rstatbc->bsync, seq));
 
 	/* propagate per-cpu delta to cgroup and per-cpu global statistics */
-	cgroup_base_stat_sub(&delta, &rstatc->last_bstat);
+	cgroup_base_stat_sub(&delta, &rstatbc->last_bstat);
 	cgroup_base_stat_add(&cgrp->bstat, &delta);
-	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
-	cgroup_base_stat_add(&rstatc->subtree_bstat, &delta);
+	cgroup_base_stat_add(&rstatbc->last_bstat, &delta);
+	cgroup_base_stat_add(&rstatbc->subtree_bstat, &delta);
 
 	/* propagate cgroup and per-cpu global delta to parent (unless that's root) */
 	if (cgroup_parent(parent)) {
@@ -551,73 +575,73 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 		cgroup_base_stat_add(&parent->bstat, &delta);
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 
-		delta = rstatc->subtree_bstat;
-		prstatc = css_rstat_cpu(&parent->self, cpu);
-		cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
-		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
-		cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
+		delta = rstatbc->subtree_bstat;
+		prstatbc = css_rstat_base_cpu(&parent->self, cpu);
+		cgroup_base_stat_sub(&delta, &rstatbc->last_subtree_bstat);
+		cgroup_base_stat_add(&prstatbc->subtree_bstat, &delta);
+		cgroup_base_stat_add(&rstatbc->last_subtree_bstat, &delta);
 	}
 }
 
-static struct css_rstat_cpu *
+static struct css_rstat_base_cpu *
 cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
 {
-	struct css_rstat_cpu *rstatc;
+	struct css_rstat_base_cpu *rstatbc;
 
-	rstatc = get_cpu_ptr(cgrp->self.rstat_cpu);
-	*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
-	return rstatc;
+	rstatbc = get_cpu_ptr(cgrp->self.rstat_base_cpu);
+	*flags = u64_stats_update_begin_irqsave(&rstatbc->bsync);
+	return rstatbc;
 }
 
 static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
-						 struct css_rstat_cpu *rstatc,
+						 struct css_rstat_base_cpu *rstatbc,
 						 unsigned long flags)
 {
-	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
+	u64_stats_update_end_irqrestore(&rstatbc->bsync, flags);
 	css_rstat_updated(&cgrp->self, smp_processor_id());
-	put_cpu_ptr(rstatc);
+	put_cpu_ptr(rstatbc);
 }
 
 void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
 {
-	struct css_rstat_cpu *rstatc;
+	struct css_rstat_base_cpu *rstatbc;
 	unsigned long flags;
 
-	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
-	rstatc->bstat.cputime.sum_exec_runtime += delta_exec;
-	cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags);
+	rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
+	rstatbc->bstat.cputime.sum_exec_runtime += delta_exec;
+	cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
 }
 
 void __cgroup_account_cputime_field(struct cgroup *cgrp,
 				    enum cpu_usage_stat index, u64 delta_exec)
 {
-	struct css_rstat_cpu *rstatc;
+	struct css_rstat_base_cpu *rstatbc;
 	unsigned long flags;
 
-	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
+	rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
 
 	switch (index) {
 	case CPUTIME_NICE:
-		rstatc->bstat.ntime += delta_exec;
+		rstatbc->bstat.ntime += delta_exec;
 		fallthrough;
 	case CPUTIME_USER:
-		rstatc->bstat.cputime.utime += delta_exec;
+		rstatbc->bstat.cputime.utime += delta_exec;
 		break;
 	case CPUTIME_SYSTEM:
 	case CPUTIME_IRQ:
 	case CPUTIME_SOFTIRQ:
-		rstatc->bstat.cputime.stime += delta_exec;
+		rstatbc->bstat.cputime.stime += delta_exec;
 		break;
 #ifdef CONFIG_SCHED_CORE
 	case CPUTIME_FORCEIDLE:
-		rstatc->bstat.forceidle_sum += delta_exec;
+		rstatbc->bstat.forceidle_sum += delta_exec;
 		break;
 #endif
 	default:
 		break;
 	}
 
-	cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags);
+	cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
 }
 
 /*
-- 
2.47.1



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem
  2025-03-19 22:21 ` [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem JP Kobryn
@ 2025-03-20 21:31   ` Tejun Heo
  2025-03-21 17:22     ` JP Kobryn
  2025-03-24 17:48   ` Michal Koutný
  1 sibling, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2025-03-20 21:31 UTC (permalink / raw)
  To: JP Kobryn
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

Hello,

On Wed, Mar 19, 2025 at 03:21:48PM -0700, JP Kobryn wrote:
> Different subsystems may call cgroup_rstat_updated() within the same
> cgroup, resulting in a tree of pending updates from multiple subsystems.
> When one of these subsystems is flushed via cgroup_rstat_flush(), all
> other subsystems with pending updates on the tree will also be flushed.
> 
> Change the paradigm of having a single rstat tree for all subsystems to
> having separate trees for each subsystem. This separation allows for
> subsystems to perform flushes without the side effects of other subsystems.
> As an example, flushing the cpu stats will no longer cause the memory stats
> to be flushed and vice versa.
> 
> In order to achieve subsystem-specific trees, change the tree node type
> from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
> the cgroup and instead place them on the css. Finally, change the
> updated/flush API's to accept a reference to a css instead of a cgroup.
> This allows a specific subsystem to be associated with an update or flush.
> Separate rstat trees will now exist for each unique subsystem.
> 
> Since updating/flushing will now be done at the subsystem level, there is
> no longer a need to keep track of updated css nodes at the cgroup level.
> The list management of these nodes done within the cgroup (rstat_css_list
> and related) has been removed accordingly. There was also padding in the
> cgroup to keep rstat_css_list on a cacheline different from
> rstat_flush_next and the base stats. This padding has also been removed.

Overall, this looks okay but I think the patch should be split further.
There are too many cgroup -> css renames mixed in with the actual changes,
which makes it difficult to understand what the actual changes are. Can you
please separate it into a patch which makes everything css based while the
actual queueing and flushing still happens only on the cgroup css, and then
a following patch which actually splits out linking and flushing to each css?

> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 13fd82a4336d..4e71ae9858d3 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -346,6 +346,11 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css)
>  	return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
>  }
>  
> +static inline bool css_is_cgroup(struct cgroup_subsys_state *css)
> +{
> +	return css->ss == NULL;
> +}

Maybe introduce this in a prep patch and replace existing users?

...
> @@ -6082,11 +6077,16 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>  	css->flags |= CSS_NO_REF;
>  
>  	if (early) {
> -		/* allocation can't be done safely during early init */
> +		/* allocation can't be done safely during early init.
> +		 * defer idr and rstat allocations until cgroup_init().
> +		 */

Nit: Please use fully winged comment blocks for multi-line comments, with
capitalization.
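
ie. something like:

		/*
		 * Allocation can't be done safely during early init.
		 * Defer idr and rstat allocations until cgroup_init().
		 */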

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4 v3] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-03-19 22:21 ` [PATCH 3/4 v3] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
@ 2025-03-20 21:38   ` Tejun Heo
  2025-03-24 17:48   ` Michal Koutný
  1 sibling, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2025-03-20 21:38 UTC (permalink / raw)
  To: JP Kobryn
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Wed, Mar 19, 2025 at 03:21:49PM -0700, JP Kobryn wrote:
> @@ -779,6 +784,9 @@ struct cgroup_subsys {
>  	 * specifies the mask of subsystems that this one depends on.
>  	 */
>  	unsigned int depends_on;
> +
> +	spinlock_t lock;
> +	raw_spinlock_t __percpu *percpu_lock;

Please further qualify the names.

> -/* Related to global: cgroup_rstat_lock */
> +/* Related to locks:
> + * rstat_base_lock when handling cgroup::self
> + * css->ss->lock otherwise
> + */

Comment style, here and in other places.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions
  2025-03-19 22:21 ` [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions JP Kobryn
@ 2025-03-20 21:44   ` Tejun Heo
  2025-03-21 17:45     ` JP Kobryn
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2025-03-20 21:44 UTC (permalink / raw)
  To: JP Kobryn
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

Hello,

On Wed, Mar 19, 2025 at 03:21:50PM -0700, JP Kobryn wrote:
> The cgroup_rstat_cpu struct contains rstat node pointers and also the base
> stat objects. Since ownership of the cgroup_rstat_cpu has shifted from
> cgroup to cgroup_subsys_state, css's other than cgroup::self are now
> carrying along these base stat objects which go unused. Eliminate this
> wasted memory by splitting up cgroup_rstat_cpu into two separate structs.
> 
> The cgroup_rstat_cpu struct is modified in a way that it now contains only
> the rstat node pointers. css's that are associated with a subsystem
> (memory, io) use this compact struct to participate in rstat without the
> memory overhead of the base stat objects.
> 
> As for css's represented by cgroup::self, a new cgroup_rstat_base_cpu
> struct is introduced. It contains the new compact cgroup_rstat_cpu struct
> as its first field followed by the base stat objects.
> 
> Because the rstat pointers exist at the same offset (beginning) in both
> structs, cgroup_subsys_state is modified to contain a union of the two
> structs. Where css initialization is done, the compact struct is allocated
> when the css is associated with a subsystem. When the css is not associated
> with a subsystem, the full struct is allocated. The union allows the
> existing rstat updated/flush routines to work with any css regardless of
> subsystem association. The base stats routines however, were modified to
> access the full struct field in the union.

Can you do this as part of a prep patch? ie. move the bstat out of rstat_cpu
into the containing cgroup before switching to the css based structure? It's
rather odd to claim a memory saving after bloating it up due to patch
sequencing.

> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 0ffc8438c6d9..f9b84e7f718d 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -170,7 +170,10 @@ struct cgroup_subsys_state {
>  	struct percpu_ref refcnt;
>  
>  	/* per-cpu recursive resource statistics */
> -	struct css_rstat_cpu __percpu *rstat_cpu;
> +	union {
> +		struct css_rstat_cpu __percpu *rstat_cpu;
> +		struct css_rstat_base_cpu __percpu *rstat_base_cpu;
> +	};

I have a bit of a difficult time following this. All that bstat is is the
set of counters that the cgroup itself carries regardless of the subsystem,
and they would be collected and flushed the same way. Wouldn't that mean that
the cgroup css should carry the same css_rstat_cpu as other csses but have a
separate struct to carry the counters? Why is it multiplexed on
css_rstat_cpu?
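
IOW, something along these lines? Rough sketch only, with the field lists
trimmed and borrowed from this patch:

	/* every css, including cgroup::self, carries the same compact struct */
	struct css_rstat_cpu {
		struct cgroup_subsys_state *updated_children;
		struct cgroup_subsys_state *updated_next;
	};

	/* the base counters get their own struct with no rstat pointers */
	struct cgroup_rstat_base_cpu {
		struct u64_stats_sync bsync;
		struct cgroup_base_stat bstat;
		struct cgroup_base_stat last_bstat;
		struct cgroup_base_stat subtree_bstat;
		struct cgroup_base_stat last_subtree_bstat;
	};

	/* ... and hangs off the cgroup instead of being unioned into the css */
	struct cgroup {
		/* existing fields */
		struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
	};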

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem
  2025-03-20 21:31   ` Tejun Heo
@ 2025-03-21 17:22     ` JP Kobryn
  0 siblings, 0 replies; 17+ messages in thread
From: JP Kobryn @ 2025-03-21 17:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 3/20/25 2:31 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Mar 19, 2025 at 03:21:48PM -0700, JP Kobryn wrote:
>> Different subsystems may call cgroup_rstat_updated() within the same
>> cgroup, resulting in a tree of pending updates from multiple subsystems.
>> When one of these subsystems is flushed via cgroup_rstat_flush(), all
>> other subsystems with pending updates on the tree will also be flushed.
>>
>> Change the paradigm of having a single rstat tree for all subsystems to
>> having separate trees for each subsystem. This separation allows for
>> subsystems to perform flushes without the side effects of other subsystems.
>> As an example, flushing the cpu stats will no longer cause the memory stats
>> to be flushed and vice versa.
>>
>> In order to achieve subsystem-specific trees, change the tree node type
>> from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
>> the cgroup and instead place them on the css. Finally, change the
>> updated/flush API's to accept a reference to a css instead of a cgroup.
>> This allows a specific subsystem to be associated with an update or flush.
>> Separate rstat trees will now exist for each unique subsystem.
>>
>> Since updating/flushing will now be done at the subsystem level, there is
>> no longer a need to keep track of updated css nodes at the cgroup level.
>> The list management of these nodes done within the cgroup (rstat_css_list
>> and related) has been removed accordingly. There was also padding in the
>> cgroup to keep rstat_css_list on a cacheline different from
>> rstat_flush_next and the base stats. This padding has also been removed.
> Overall, this looks okay but I think the patch should be split further.
> There are too many cgroup -> css renames mixed in with the actual changes,
> which makes it difficult to understand what the actual changes are. Can you
> please separate it into a patch which makes everything css based while the
> actual queueing and flushing still happens only on the cgroup css, and then
> a following patch which actually splits out linking and flushing to each css?

Sure, no problem. I did something similar in the RFC series, so I'll
apply it again here.

>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 13fd82a4336d..4e71ae9858d3 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -346,6 +346,11 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css)
>>   	return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
>>   }
>>   
>> +static inline bool css_is_cgroup(struct cgroup_subsys_state *css)
>> +{
>> +	return css->ss == NULL;
>> +}
> Maybe introduce this in a prep patch and replace existing users?
Makes sense, will do.
> ...
>> @@ -6082,11 +6077,16 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>>   	css->flags |= CSS_NO_REF;
>>   
>>   	if (early) {
>> -		/* allocation can't be done safely during early init */
>> +		/* allocation can't be done safely during early init.
>> +		 * defer idr and rstat allocations until cgroup_init().
>> +		 */
> Nit: Please use fully winged comment blocks for multilines with
> captalizations.
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions
  2025-03-20 21:44   ` Tejun Heo
@ 2025-03-21 17:45     ` JP Kobryn
  2025-03-24 17:49       ` Michal Koutný
  2025-03-25 13:55       ` Shakeel Butt
  0 siblings, 2 replies; 17+ messages in thread
From: JP Kobryn @ 2025-03-21 17:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 3/20/25 2:44 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Mar 19, 2025 at 03:21:50PM -0700, JP Kobryn wrote:
>> The cgroup_rstat_cpu struct contains rstat node pointers and also the base
>> stat objects. Since ownership of the cgroup_rstat_cpu has shifted from
>> cgroup to cgroup_subsys_state, css's other than cgroup::self are now
>> carrying along these base stat objects which go unused. Eliminate this
>> wasted memory by splitting up cgroup_rstat_cpu into two separate structs.
>>
>> The cgroup_rstat_cpu struct is modified in a way that it now contains only
>> the rstat node pointers. css's that are associated with a subsystem
>> (memory, io) use this compact struct to participate in rstat without the
>> memory overhead of the base stat objects.
>>
>> As for css's represented by cgroup::self, a new cgroup_rstat_base_cpu
>> struct is introduced. It contains the new compact cgroup_rstat_cpu struct
>> as its first field followed by the base stat objects.
>>
>> Because the rstat pointers exist at the same offset (beginning) in both
>> structs, cgroup_subsys_state is modified to contain a union of the two
>> structs. Where css initialization is done, the compact struct is allocated
>> when the css is associated with a subsystem. When the css is not associated
>> with a subsystem, the full struct is allocated. The union allows the
>> existing rstat updated/flush routines to work with any css regardless of
>> subsystem association. The base stats routines however, were modified to
>> access the full struct field in the union.
> Can you do this as part of a prep patch? ie. move the bstat out of rstat_cpu
> into the containing cgroup before switching to the css based structure? It's
> rather odd to claim a memory saving after bloating it up due to patch
> sequencing.
I think it's possible. I'll give my thoughts in response to your 
question below.
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index 0ffc8438c6d9..f9b84e7f718d 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -170,7 +170,10 @@ struct cgroup_subsys_state {
>>   	struct percpu_ref refcnt;
>>   
>>   	/* per-cpu recursive resource statistics */
>> -	struct css_rstat_cpu __percpu *rstat_cpu;
>> +	union {
>> +		struct css_rstat_cpu __percpu *rstat_cpu;
>> +		struct css_rstat_base_cpu __percpu *rstat_base_cpu;
>> +	};
> I have a bit of a difficult time following this. All that bstat is is the
> set of counters that the cgroup itself carries regardless of the subsystem,
> and they would be collected and flushed the same way. Wouldn't that mean that
> the cgroup css should carry the same css_rstat_cpu as other csses but have a
> separate struct to carry the counters? Why is it multiplexed on
> css_rstat_cpu?

Originally the idea came from wanting to avoid having extra objects
(even if just one pointer) in the css that would not be used. This setup
allowed for what felt like (to me at least) an organized way of keeping
the objects used in rstat in the same container (the css). You initialize
everything within the css and also stay consistent with the css-based
rstat APIs: using the css for updated/flush lets you reach the needed
rstat objects via the css. That said, if we moved the base stat objects
to the cgroup, we could, where applicable, check for css_is_cgroup() and
then use css->cgroup to get the base stats, which we kind of do anyway.
I don't see a problem there. The only complication I see is on the init
side: we would have to perform a separate percpu allocation for these
base stats in the cgroup. I can give this a try for the next rev unless
anyone feels otherwise.
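
e.g. the css_rstat_base_cpu() helper added in this patch would become
something roughly like the below (untested sketch, and the
cgroup::rstat_base_cpu member is hypothetical at this point):

	static struct css_rstat_base_cpu *css_rstat_base_cpu(
			struct cgroup_subsys_state *css, int cpu)
	{
		/* only cgroup::self would carry base stats in that layout */
		if (WARN_ON_ONCE(!css_is_cgroup(css)))
			return NULL;

		return per_cpu_ptr(css->cgroup->rstat_base_cpu, cpu);
	}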


In general, I'll wait to see if Yosry, Michal, or Shakeel want any other 
changes before sending out the next rev.

>
> Thanks.
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs
  2025-03-19 22:21 ` [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs JP Kobryn
@ 2025-03-24 17:47   ` Michal Koutný
  2025-03-25 18:03     ` JP Kobryn
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Koutný @ 2025-03-24 17:47 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Wed, Mar 19, 2025 at 03:21:47PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> The rstat updated/flush API functions are exported as kfuncs so bpf
> programs can make the same calls that in-kernel code can. Split these API
> functions into separate in-kernel and bpf versions. Function signatures
> remain unchanged. The kfuncs are named with the prefix "bpf_". This
> non-functional change allows for future commits which will modify the
> signature of the in-kernel API without impacting bpf call sites. The
> implementations of the kfuncs serve as adapters to the in-kernel API.

This made me look up
https://docs.kernel.org/bpf/kfuncs.html#bpf-kfunc-lifecycle-expectations

The series reworks the existing kfuncs anyway, so is it necessary to have
the bpf_ versions? The semantics change too, from a base+all-subsystems
flush to a base-only flush (bpf_rstat_flush()).

I'd perhaps do the changes freely w/out taking the kfuncs into account and
then add any reconstructive patches towards the end of the series.
(I'm not sure whether the modified btf_type_tag_percpu.c selftest
survives the latest unionization of the base stats.)


Michal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem
  2025-03-19 22:21 ` [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem JP Kobryn
  2025-03-20 21:31   ` Tejun Heo
@ 2025-03-24 17:48   ` Michal Koutný
  1 sibling, 0 replies; 17+ messages in thread
From: Michal Koutný @ 2025-03-24 17:48 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Wed, Mar 19, 2025 at 03:21:48PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -161,10 +161,12 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
>  };
>  #undef SUBSYS
>  
> -static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
> +static DEFINE_PER_CPU(struct css_rstat_cpu, root_self_rstat_cpu);

root_base_rstat_cpu
cgrp_dfl_root_base_rstat_cpu

(not a big deal, it's only referenced once, next to its definition)

> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
...
> -static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
> -						 struct cgroup *child, int cpu)
> +static struct cgroup_subsys_state *cgroup_rstat_push_children(
> +		struct cgroup_subsys_state *head,
> +		struct cgroup_subsys_state *child, int cpu)

Forgotten rename?

> diff --git a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
> index 38f78d9345de..f362f7d41b9e 100644
> --- a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
> +++ b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
> @@ -45,7 +45,7 @@ int BPF_PROG(test_percpu2, struct bpf_testmod_btf_type_tag_2 *arg)
>  SEC("tp_btf/cgroup_mkdir")
>  int BPF_PROG(test_percpu_load, struct cgroup *cgrp, const char *path)
>  {
> -       g = (__u64)cgrp->rstat_cpu->updated_children;
> +       g = (__u64)cgrp->self.rstat_cpu->updated_children;
>         return 0;
>  }

There are also some comments above needing an update.


Michal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4 v3] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-03-19 22:21 ` [PATCH 3/4 v3] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
  2025-03-20 21:38   ` Tejun Heo
@ 2025-03-24 17:48   ` Michal Koutný
  1 sibling, 0 replies; 17+ messages in thread
From: Michal Koutný @ 2025-03-24 17:48 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Wed, Mar 19, 2025 at 03:21:49PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index a28c00b11736..ffd7ac6bcefc 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
...
> +static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
> +{
> +	if (ss)
> +		return &ss->lock;
> +
> +	return &rstat_base_lock;
> +}
> +
> +static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
> +{
> +	if (ss)
> +		return per_cpu_ptr(ss->percpu_lock, cpu);
> +
> +	return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
> +}

rstat_ss_lock
rstat_ss_cpu_lock

(judgment call only)


Michal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions
  2025-03-21 17:45     ` JP Kobryn
@ 2025-03-24 17:49       ` Michal Koutný
  2025-03-25 13:55       ` Shakeel Butt
  1 sibling, 0 replies; 17+ messages in thread
From: Michal Koutný @ 2025-03-24 17:49 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Tejun Heo, shakeel.butt, yosryahmed, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Fri, Mar 21, 2025 at 10:45:29AM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> In general, I'll wait to see if Yosry, Michal, or Shakeel want any other
> changes before sending out the next rev.

Done.

Thanks,
Michal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions
  2025-03-21 17:45     ` JP Kobryn
  2025-03-24 17:49       ` Michal Koutný
@ 2025-03-25 13:55       ` Shakeel Butt
  1 sibling, 0 replies; 17+ messages in thread
From: Shakeel Butt @ 2025-03-25 13:55 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Tejun Heo, yosryahmed, mkoutny, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Fri, Mar 21, 2025 at 10:45:29AM -0700, JP Kobryn wrote:
[...]
> 
> In general, I'll wait to see if Yosry, Michal, or Shakeel want any other
> changes before sending out the next rev.
> 

All good from my side.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs
  2025-03-24 17:47   ` Michal Koutný
@ 2025-03-25 18:03     ` JP Kobryn
  2025-03-26  0:39       ` Yosry Ahmed
  0 siblings, 1 reply; 17+ messages in thread
From: JP Kobryn @ 2025-03-25 18:03 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On 3/24/25 10:47 AM, Michal Koutný wrote:
> On Wed, Mar 19, 2025 at 03:21:47PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
>> The rstat updated/flush API functions are exported as kfuncs so bpf
>> programs can make the same calls that in-kernel code can. Split these API
>> functions into separate in-kernel and bpf versions. Function signatures
>> remain unchanged. The kfuncs are named with the prefix "bpf_". This
>> non-functional change allows for future commits which will modify the
>> signature of the in-kernel API without impacting bpf call sites. The
>> implementations of the kfuncs serve as adapters to the in-kernel API.
> 
> This made me look up
> https://docs.kernel.org/bpf/kfuncs.html#bpf-kfunc-lifecycle-expectations
> 
> The series reworks existing kfuncs anyway, is it necessary to have the
> bpf_ versions? The semantics is changed too from base+all subsystems
> flush to only base flush (bpf_rstat_flush()).

This patch was done based on some conversation in v2 around what to do
with the kfuncs (bottom of [0]). It's true the kfunc API deviates from
the flush-specific-subsystem approach. Dropping this patch is fine by me
and I assume it would be ok with Yosry (based on his comments at the end
of [0]), but I'll wait and see if he has any input.

[0] https://lore.kernel.org/all/Z8IIxUdRpqxZyIHO@google.com/

> 
> I'd perhaps do the changes freely w/out taking kfuncs into account and
> then add possible reconstructive patches towards the end of the series.
> (I'm not unsure whether the modified btf_type_tag_percpu.c selftest
> survives the latest unionization of base stats.)
> 
> 
> Michal



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs
  2025-03-25 18:03     ` JP Kobryn
@ 2025-03-26  0:39       ` Yosry Ahmed
  0 siblings, 0 replies; 17+ messages in thread
From: Yosry Ahmed @ 2025-03-26  0:39 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Michal Koutný,
	tj, shakeel.butt, hannes, akpm, linux-mm, cgroups, kernel-team

On Tue, Mar 25, 2025 at 11:03:26AM -0700, JP Kobryn wrote:
> On 3/24/25 10:47 AM, Michal Koutný wrote:
> > On Wed, Mar 19, 2025 at 03:21:47PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> > > The rstat updated/flush API functions are exported as kfuncs so bpf
> > > programs can make the same calls that in-kernel code can. Split these API
> > > functions into separate in-kernel and bpf versions. Function signatures
> > > remain unchanged. The kfuncs are named with the prefix "bpf_". This
> > > non-functional change allows for future commits which will modify the
> > > signature of the in-kernel API without impacting bpf call sites. The
> > > implementations of the kfuncs serve as adapters to the in-kernel API.
> > 
> > This made me look up
> > https://docs.kernel.org/bpf/kfuncs.html#bpf-kfunc-lifecycle-expectations
> > 
> > The series reworks existing kfuncs anyway, is it necessary to have the
> > bpf_ versions? The semantics is changed too from base+all subsystems
> > flush to only base flush (bpf_rstat_flush()).
> 
> This patch was done based on some conversation in v2 around what to do
> with the kfuncs (bottom of [0]). It's true the kfunc API deviates from
> flush-specific-subsystem approach. Dropping this patch is fine by me and
> I assume it would be ok with Yosry (based on comments at end of [0]),
> but I'll wait and see if he has any input.

Yeah I am okay with dropping this as well.

> 
> [0] https://lore.kernel.org/all/Z8IIxUdRpqxZyIHO@google.com/


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-03-26  0:39 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-19 22:21 [PATCH 0/4 v3] cgroup: separate rstat trees JP Kobryn
2025-03-19 22:21 ` [PATCH 1/4 v3] cgroup: use separate rstat api for bpf programs JP Kobryn
2025-03-24 17:47   ` Michal Koutný
2025-03-25 18:03     ` JP Kobryn
2025-03-26  0:39       ` Yosry Ahmed
2025-03-19 22:21 ` [PATCH 2/4 v3] cgroup: use separate rstat trees for each subsystem JP Kobryn
2025-03-20 21:31   ` Tejun Heo
2025-03-21 17:22     ` JP Kobryn
2025-03-24 17:48   ` Michal Koutný
2025-03-19 22:21 ` [PATCH 3/4 v3] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
2025-03-20 21:38   ` Tejun Heo
2025-03-24 17:48   ` Michal Koutný
2025-03-19 22:21 ` [PATCH 4/4 v3] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions JP Kobryn
2025-03-20 21:44   ` Tejun Heo
2025-03-21 17:45     ` JP Kobryn
2025-03-24 17:49       ` Michal Koutný
2025-03-25 13:55       ` Shakeel Butt
