* [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees
@ 2025-01-03 1:50 JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 1/9 v2] cgroup: change cgroup to css in rstat updated and flush api JP Kobryn
` (10 more replies)
0 siblings, 11 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
The current rstat model is set up to keep track of cgroup stats on a per-cpu
basis. When a stat (of any subsystem) is updated, the updater notes this change
using the cgroup_rstat_updated() API call. This change is propagated to the
cpu-specific rstat tree, by appending the updated cgroup to the tree (unless
it's already on the tree). So for each cpu, an rstat tree will consist of the
cgroups that reported one or more updated stats. Later on when a flush is
requested via cgroup_rstat_flush(), each per-cpu rstat tree is traversed
starting at the requested cgroup and the subsystem-specific flush callbacks
(via css_rstat_flush) are invoked along the way. During the flush, the section
of the tree from the requested cgroup through its descendants is removed.
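To make the flow concrete, here is a simplified sketch of how a controller
drives this API today, loosely based on mm/memcontrol.c (the wrapper functions
are only illustrative; the cgroup_rstat_*() calls are the real interface):

  /* illustrative wrappers around the existing rstat API */
  static void memcg_note_update(struct mem_cgroup *memcg, int cpu)
  {
          /* put this cgroup (and its ancestors) on the cpu's updated tree */
          cgroup_rstat_updated(memcg->css.cgroup, cpu);
  }

  static void memcg_read_stats(struct mem_cgroup *memcg)
  {
          /* walk every per-cpu updated tree, invoking the flush callbacks */
          cgroup_rstat_flush(memcg->css.cgroup);
  }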
Using the cgroup struct to represent nodes of change means that the changes
represented by a given tree are heterogeneous - the tree can consist of nodes
that have changes from different subsystems; i.e. changes in stats from the
memory subsystem and the io subsystem can coexist in the same tree. The
implication is that when a flush is requested, usually in the context of a
single subsystem, all other subsystems need to be flushed along with it. This
has become a drawback given how expensive the flushing of the memory-specific
stats has become [0][1]. Another implication is that when updates are
performed, subsystems may contend with each other over the locks involved.
I've been experimenting with an idea that allows for isolating the updating and
flushing of cgroup stats on a per-subsystem basis. The idea is that instead of
having a single per-cpu rstat tree for managing stats across all subsystems, we
could split it into separate per-cpu trees, one for each subsystem. This would
allow subsystems to update and flush their stats without any contention or
extra overhead from other subsystems. The core change is moving ownership of
the rstat entities from the cgroup struct onto the cgroup_subsys_state struct.
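As a rough before/after sketch of that change (field names taken from the
patches below, other members trimmed):

  /* before: rstat state hangs off the cgroup, shared by all subsystems */
  struct cgroup {
          /* ... */
          struct cgroup_rstat_cpu __percpu *rstat_cpu;
          struct cgroup *rstat_flush_next;
  };

  /* after: each css carries its own rstat state, one tree per subsystem */
  struct cgroup_subsys_state {
          /* ... */
          struct cgroup_rstat_cpu __percpu *rstat_cpu;
          struct cgroup_subsys_state *rstat_flush_next;
  };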
To complement the ownership change, the locking scheme was adjusted. The global
cgroup_rstat_lock for synchronizing updates and flushes was replaced with
per-subsystem locks (in the cgroup_subsys struct). An additional global lock
was added so that the base stats pseudo-subsystem can be synchronized in a
similar way. The per-cpu lock cgroup_rstat_cpu_lock became a per-cpu array of
locks indexed by subsystem id. Following suit, a dedicated per-cpu lock was
added for the base stats, since the base stats have a NULL subsystem and so
cannot use the subsystem id index.
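In rough terms, lock selection ends up looking like the sketch below. The lock
variables are the ones introduced later in the series; the helper itself is
only for illustration, since the patches open-code this check:

  /* illustrative: pick the per-cpu updater lock for a given css */
  static raw_spinlock_t *rstat_cpu_lock_for(struct cgroup_subsys_state *css,
                                            int cpu)
  {
          if (css->ss)    /* real subsystem: use its slot in the array */
                  return per_cpu_ptr(&cgroup_rstat_cpu_lock[css->ss->id], cpu);
          /* base stats pseudo-subsystem: css->ss is NULL */
          return per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
  }

The flush-side lock is chosen the same way: css->ss->rstat_lock when a
subsystem is present, the global base lock otherwise.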
I reached a point where this started to feel stable in my local testing, so I
wanted to share and get feedback on this approach.
[0] https://lore.kernel.org/all/CAOm-9arwY3VLUx5189JAR9J7B=Miad9nQjjet_VNdT3i+J+5FA@mail.gmail.com/
[1] https://github.blog/engineering/debugging-network-stalls-on-kubernetes/
Changelog
v2: updated cover letter and some patch text. no code changes.
JP Kobryn (9):
change cgroup to css in rstat updated and flush api
change cgroup to css in rstat internal flush and lock funcs
change cgroup to css in rstat init and exit api
split rstat from cgroup into separate css
separate locking between base css and others
isolate base stat flush
remove unneeded rcu list
remove bpf rstat flush from css generic flush
avoid allocating rstat when flush func not present
block/blk-cgroup.c | 4 +-
include/linux/cgroup-defs.h | 35 ++---
include/linux/cgroup.h | 8 +-
kernel/cgroup/cgroup-internal.h | 4 +-
kernel/cgroup/cgroup.c | 79 ++++++-----
kernel/cgroup/rstat.c | 225 +++++++++++++++++++-------------
mm/memcontrol.c | 4 +-
7 files changed, 203 insertions(+), 156 deletions(-)
--
2.47.1
* [RFC PATCH 1/9 v2] cgroup: change cgroup to css in rstat updated and flush api
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 2/9 v2] cgroup: change cgroup to css in rstat internal flush and lock funcs JP Kobryn
` (9 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
Change rstat flush and updated API calls to accept a cgroup_subsys_state
instead of a cgroup.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
block/blk-cgroup.c | 4 ++--
include/linux/cgroup.h | 8 ++++----
kernel/cgroup/cgroup.c | 4 ++--
kernel/cgroup/rstat.c | 28 +++++++++++++++++++---------
mm/memcontrol.c | 4 ++--
5 files changed, 29 insertions(+), 19 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 45a395862fbc..98e586ae21ec 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1200,7 +1200,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
if (!seq_css(sf)->parent)
blkcg_fill_root_iostats();
else
- cgroup_rstat_flush(blkcg->css.cgroup);
+ cgroup_rstat_flush(&blkcg->css);
rcu_read_lock();
hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -2183,7 +2183,7 @@ void blk_cgroup_bio_start(struct bio *bio)
}
u64_stats_update_end_irqrestore(&bis->sync, flags);
- cgroup_rstat_updated(blkcg->css.cgroup, cpu);
+ cgroup_rstat_updated(&blkcg->css, cpu);
put_cpu();
}
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f8ef47f8a634..eec970622419 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -687,10 +687,10 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
/*
* cgroup scalable recursive statistics.
*/
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
-void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(struct cgroup *cgrp);
+void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu);
+void cgroup_rstat_flush(struct cgroup_subsys_state *css);
+void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css);
+void cgroup_rstat_flush_release(struct cgroup_subsys_state *css);
/*
* Basic resource stats.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 43c949824814..fdddd5ec5f3c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5465,7 +5465,7 @@ static void css_release_work_fn(struct work_struct *work)
/* css release path */
if (!list_empty(&css->rstat_css_node)) {
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(css);
list_del_rcu(&css->rstat_css_node);
}
@@ -5493,7 +5493,7 @@ static void css_release_work_fn(struct work_struct *work)
/* cgroup release path */
TRACE_CGROUP_PATH(release, cgrp);
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(css);
spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 5877974ece92..1ed0f3aab0d9 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -82,8 +82,9 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
* rstat_cpu->updated_children list. See the comment on top of
* cgroup_rstat_cpu definition for details.
*/
-__bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+__bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
+ struct cgroup *cgrp = css->cgroup;
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
unsigned long flags;
@@ -346,8 +347,10 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
*
* This function may block.
*/
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
+__bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
+ struct cgroup *cgrp = css->cgroup;
+
might_sleep();
__cgroup_rstat_lock(cgrp, -1);
@@ -364,9 +367,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
*
* This function may block.
*/
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
+void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
__acquires(&cgroup_rstat_lock)
{
+ struct cgroup *cgrp = css->cgroup;
+
might_sleep();
__cgroup_rstat_lock(cgrp, -1);
cgroup_rstat_flush_locked(cgrp);
@@ -376,9 +381,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
* cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
* @cgrp: cgroup used by tracepoint
*/
-void cgroup_rstat_flush_release(struct cgroup *cgrp)
+void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
__releases(&cgroup_rstat_lock)
{
+ struct cgroup *cgrp = css->cgroup;
+
__cgroup_rstat_unlock(cgrp, -1);
}
@@ -408,7 +415,7 @@ void cgroup_rstat_exit(struct cgroup *cgrp)
{
int cpu;
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(&cgrp->self);
/* sanity check */
for_each_possible_cpu(cpu) {
@@ -512,8 +519,10 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
struct cgroup_rstat_cpu *rstatc,
unsigned long flags)
{
+ struct cgroup_subsys_state *css = &cgrp->self;
+
u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
- cgroup_rstat_updated(cgrp, smp_processor_id());
+ cgroup_rstat_updated(css, smp_processor_id());
put_cpu_ptr(rstatc);
}
@@ -612,16 +621,17 @@ static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat
void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
- struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct cgroup_subsys_state *css = seq_css(seq);
+ struct cgroup *cgrp = css->cgroup;
u64 usage, utime, stime, ntime;
if (cgroup_parent(cgrp)) {
- cgroup_rstat_flush_hold(cgrp);
+ cgroup_rstat_flush_hold(css);
usage = cgrp->bstat.cputime.sum_exec_runtime;
cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
&utime, &stime);
ntime = cgrp->bstat.ntime;
- cgroup_rstat_flush_release(cgrp);
+ cgroup_rstat_flush_release(css);
} else {
/* cgrp->bstat of root is not actually used, reuse it */
root_cgroup_cputime(&cgrp->bstat);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..97552476b844 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -579,7 +579,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
if (!val)
return;
- cgroup_rstat_updated(memcg->css.cgroup, cpu);
+ cgroup_rstat_updated(&memcg->css, cpu);
statc = this_cpu_ptr(memcg->vmstats_percpu);
for (; statc; statc = statc->parent) {
stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
@@ -611,7 +611,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
if (mem_cgroup_is_root(memcg))
WRITE_ONCE(flush_last_time, jiffies_64);
- cgroup_rstat_flush(memcg->css.cgroup);
+ cgroup_rstat_flush(&memcg->css);
}
/*
--
2.47.1
* [RFC PATCH 2/9 v2] cgroup: change cgroup to css in rstat internal flush and lock funcs
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 1/9 v2] cgroup: change cgroup to css in rstat updated and flush api JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 3/9 v2] cgroup: change cgroup to css in rstat init and exit api JP Kobryn
` (8 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
Change internal rstat functions to accept a cgroup_subsys_state instead of a
cgroup.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 45 ++++++++++++++++++++++---------------------
1 file changed, 23 insertions(+), 22 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 1ed0f3aab0d9..1b7ef8690a09 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -201,8 +201,10 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
* within the children list and terminated by the parent cgroup. An exception
* here is the cgroup root whose updated_next can be self terminated.
*/
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup_subsys_state *root_css,
+ int cpu)
{
+ struct cgroup *root = root_css->cgroup;
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
struct cgroup *head = NULL, *parent, *child;
@@ -280,9 +282,11 @@ __bpf_hook_end();
* value -1 is used when obtaining the main lock else this is the CPU
* number processed last.
*/
-static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
+static inline void __cgroup_rstat_lock(struct cgroup_subsys_state *css,
+ int cpu_in_loop)
__acquires(&cgroup_rstat_lock)
{
+ struct cgroup *cgrp = css->cgroup;
bool contended;
contended = !spin_trylock_irq(&cgroup_rstat_lock);
@@ -293,15 +297,18 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
}
-static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
+static inline void __cgroup_rstat_unlock(struct cgroup_subsys_state *css,
+ int cpu_in_loop)
__releases(&cgroup_rstat_lock)
{
+ struct cgroup *cgrp = css->cgroup;
+
trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
spin_unlock_irq(&cgroup_rstat_lock);
}
/* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
+static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
int cpu;
@@ -309,27 +316,27 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
lockdep_assert_held(&cgroup_rstat_lock);
for_each_possible_cpu(cpu) {
- struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
+ struct cgroup *pos = cgroup_rstat_updated_list(css, cpu);
for (; pos; pos = pos->rstat_flush_next) {
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *css_iter;
cgroup_base_stat_flush(pos, cpu);
bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
rcu_read_lock();
- list_for_each_entry_rcu(css, &pos->rstat_css_list,
+ list_for_each_entry_rcu(css_iter, &pos->rstat_css_list,
rstat_css_node)
- css->ss->css_rstat_flush(css, cpu);
+ css_iter->ss->css_rstat_flush(css_iter, cpu);
rcu_read_unlock();
}
/* play nice and yield if necessary */
if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
- __cgroup_rstat_unlock(cgrp, cpu);
+ __cgroup_rstat_unlock(css, cpu);
if (!cond_resched())
cpu_relax();
- __cgroup_rstat_lock(cgrp, cpu);
+ __cgroup_rstat_lock(css, cpu);
}
}
}
@@ -349,13 +356,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
*/
__bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
- struct cgroup *cgrp = css->cgroup;
-
might_sleep();
- __cgroup_rstat_lock(cgrp, -1);
- cgroup_rstat_flush_locked(cgrp);
- __cgroup_rstat_unlock(cgrp, -1);
+ __cgroup_rstat_lock(css, -1);
+ cgroup_rstat_flush_locked(css);
+ __cgroup_rstat_unlock(css, -1);
}
/**
@@ -370,11 +375,9 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
__acquires(&cgroup_rstat_lock)
{
- struct cgroup *cgrp = css->cgroup;
-
might_sleep();
- __cgroup_rstat_lock(cgrp, -1);
- cgroup_rstat_flush_locked(cgrp);
+ __cgroup_rstat_lock(css, -1);
+ cgroup_rstat_flush_locked(css);
}
/**
@@ -384,9 +387,7 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
__releases(&cgroup_rstat_lock)
{
- struct cgroup *cgrp = css->cgroup;
-
- __cgroup_rstat_unlock(cgrp, -1);
+ __cgroup_rstat_unlock(css, -1);
}
int cgroup_rstat_init(struct cgroup *cgrp)
--
2.47.1
* [RFC PATCH 3/9 v2] cgroup: change cgroup to css in rstat init and exit api
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 1/9 v2] cgroup: change cgroup to css in rstat updated and flush api JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 2/9 v2] cgroup: change cgroup to css in rstat internal flush and lock funcs JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 4/9 v2] cgroup: split rstat from cgroup into separate css JP Kobryn
` (7 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
Change rstat init and exit API calls to accept a cgroup_subsys_state instead of
a cgroup.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/cgroup-internal.h | 4 ++--
kernel/cgroup/cgroup.c | 16 ++++++++++------
kernel/cgroup/rstat.c | 6 ++++--
3 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..87d062baff90 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -269,8 +269,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
/*
* rstat.c
*/
-int cgroup_rstat_init(struct cgroup *cgrp);
-void cgroup_rstat_exit(struct cgroup *cgrp);
+int cgroup_rstat_init(struct cgroup_subsys_state *css);
+void cgroup_rstat_exit(struct cgroup_subsys_state *css);
void cgroup_rstat_boot(void);
void cgroup_base_stat_cputime_show(struct seq_file *seq);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fdddd5ec5f3c..848e09f433c0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1358,7 +1358,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
cgroup_unlock();
- cgroup_rstat_exit(cgrp);
+ cgroup_rstat_exit(&cgrp->self);
kernfs_destroy_root(root->kf_root);
cgroup_free_root(root);
}
@@ -2132,7 +2132,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
if (ret)
goto destroy_root;
- ret = cgroup_rstat_init(root_cgrp);
+ ret = cgroup_rstat_init(&root_cgrp->self);
if (ret)
goto destroy_root;
@@ -2174,7 +2174,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
goto out;
exit_stats:
- cgroup_rstat_exit(root_cgrp);
+ cgroup_rstat_exit(&root_cgrp->self);
destroy_root:
kernfs_destroy_root(root->kf_root);
root->kf_root = NULL;
@@ -5435,7 +5435,7 @@ static void css_free_rwork_fn(struct work_struct *work)
cgroup_put(cgroup_parent(cgrp));
kernfs_put(cgrp->kn);
psi_cgroup_free(cgrp);
- cgroup_rstat_exit(cgrp);
+ cgroup_rstat_exit(css);
kfree(cgrp);
} else {
/*
@@ -5686,7 +5686,11 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
if (ret)
goto out_free_cgrp;
- ret = cgroup_rstat_init(cgrp);
+ /* init self cgroup early so css->cgroup is valid within cgroup_rstat_init()
+ * note that this will go away in a subsequent patch in this series
+ */
+ cgrp->self.cgroup = cgrp;
+ ret = cgroup_rstat_init(&cgrp->self);
if (ret)
goto out_cancel_ref;
@@ -5779,7 +5783,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
out_kernfs_remove:
kernfs_remove(cgrp->kn);
out_stat_exit:
- cgroup_rstat_exit(cgrp);
+ cgroup_rstat_exit(&cgrp->self);
out_cancel_ref:
percpu_ref_exit(&cgrp->self.refcnt);
out_free_cgrp:
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 1b7ef8690a09..01a5c185b02a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -390,8 +390,9 @@ void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
__cgroup_rstat_unlock(css, -1);
}
-int cgroup_rstat_init(struct cgroup *cgrp)
+int cgroup_rstat_init(struct cgroup_subsys_state *css)
{
+ struct cgroup *cgrp = css->cgroup;
int cpu;
/* the root cgrp has rstat_cpu preallocated */
@@ -412,8 +413,9 @@ int cgroup_rstat_init(struct cgroup *cgrp)
return 0;
}
-void cgroup_rstat_exit(struct cgroup *cgrp)
+void cgroup_rstat_exit(struct cgroup_subsys_state *css)
{
+ struct cgroup *cgrp = css->cgroup;
int cpu;
cgroup_rstat_flush(&cgrp->self);
--
2.47.1
* [RFC PATCH 4/9 v2] cgroup: split rstat from cgroup into separate css
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
` (2 preceding siblings ...)
2025-01-03 1:50 ` [RFC PATCH 3/9 v2] cgroup: change cgroup to css in rstat init and exit api JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 5/9 v2] cgroup: separate locking between base css and others JP Kobryn
` (6 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
Move the rstat entities off of the cgroup struct and onto the
cgroup_subsys_state struct. Adjust related code to reflect this new ownership.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/cgroup-defs.h | 40 ++++++++--------
kernel/cgroup/cgroup.c | 65 ++++++++++++++++++--------
kernel/cgroup/rstat.c | 92 ++++++++++++++++++-------------------
3 files changed, 111 insertions(+), 86 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1b20d2d8ef7c..1932f8ae7995 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -180,6 +180,24 @@ struct cgroup_subsys_state {
struct list_head sibling;
struct list_head children;
+ /* per-cpu recursive resource statistics */
+ struct cgroup_rstat_cpu __percpu *rstat_cpu;
+ struct list_head rstat_css_list;
+
+ /*
+ * Add padding to separate the read mostly rstat_cpu and
+ * rstat_css_list into a different cacheline from the following
+ * rstat_flush_next and *bstat fields which can have frequent updates.
+ */
+ CACHELINE_PADDING(_pad_);
+
+ /*
+ * A singly-linked list of cgroup structures to be rstat flushed.
+ * This is a scratch field to be used exclusively by
+ * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+ */
+ struct cgroup_subsys_state *rstat_flush_next;
+
/* flush target list anchored at cgrp->rstat_css_list */
struct list_head rstat_css_node;
@@ -389,8 +407,8 @@ struct cgroup_rstat_cpu {
*
* Protected by per-cpu cgroup_rstat_cpu_lock.
*/
- struct cgroup *updated_children; /* terminated by self cgroup */
- struct cgroup *updated_next; /* NULL iff not on the list */
+ struct cgroup_subsys_state *updated_children; /* terminated by self cgroup */
+ struct cgroup_subsys_state *updated_next; /* NULL iff not on the list */
};
struct cgroup_freezer_state {
@@ -516,24 +534,6 @@ struct cgroup {
struct cgroup *dom_cgrp;
struct cgroup *old_dom_cgrp; /* used while enabling threaded */
- /* per-cpu recursive resource statistics */
- struct cgroup_rstat_cpu __percpu *rstat_cpu;
- struct list_head rstat_css_list;
-
- /*
- * Add padding to separate the read mostly rstat_cpu and
- * rstat_css_list into a different cacheline from the following
- * rstat_flush_next and *bstat fields which can have frequent updates.
- */
- CACHELINE_PADDING(_pad_);
-
- /*
- * A singly-linked list of cgroup structures to be rstat flushed.
- * This is a scratch field to be used exclusively by
- * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
- */
- struct cgroup *rstat_flush_next;
-
/* cgroup basic resource statistics */
struct cgroup_base_stat last_bstat;
struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 848e09f433c0..96a2d15fe5e9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -164,7 +164,7 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
/* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root = { .cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
EXPORT_SYMBOL_GPL(cgrp_dfl_root);
/*
@@ -1826,6 +1826,7 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
struct cgroup_root *src_root = ss->root;
struct cgroup *scgrp = &src_root->cgrp;
struct cgroup_subsys_state *css = cgroup_css(scgrp, ss);
+ struct cgroup_subsys_state *dcss = cgroup_css(dcgrp, ss);
struct css_set *cset, *cset_pos;
struct css_task_iter *it;
@@ -1867,7 +1868,7 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
list_del_rcu(&css->rstat_css_node);
synchronize_rcu();
list_add_rcu(&css->rstat_css_node,
- &dcgrp->rstat_css_list);
+ &dcss->rstat_css_list);
}
/* default hierarchy doesn't enable controllers by default */
@@ -2052,7 +2053,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
cgrp->dom_cgrp = cgrp;
cgrp->max_descendants = INT_MAX;
cgrp->max_depth = INT_MAX;
- INIT_LIST_HEAD(&cgrp->rstat_css_list);
prev_cputime_init(&cgrp->prev_cputime);
for_each_subsys(ss, ssid)
@@ -2088,7 +2088,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
struct cgroup *root_cgrp = &root->cgrp;
struct kernfs_syscall_ops *kf_sops;
struct css_set *cset;
- int i, ret;
+ struct cgroup_subsys *ss;
+ int i, ret, ssid;
lockdep_assert_held(&cgroup_mutex);
@@ -2132,10 +2133,6 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
if (ret)
goto destroy_root;
- ret = cgroup_rstat_init(&root_cgrp->self);
- if (ret)
- goto destroy_root;
-
ret = rebind_subsystems(root, ss_mask);
if (ret)
goto exit_stats;
@@ -2174,7 +2171,10 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
goto out;
exit_stats:
- cgroup_rstat_exit(&root_cgrp->self);
+ for_each_subsys(ss, ssid) {
+ struct cgroup_subsys_state *css = init_css_set.subsys[ssid];
+ cgroup_rstat_exit(css);
+ }
destroy_root:
kernfs_destroy_root(root->kf_root);
root->kf_root = NULL;
@@ -3229,6 +3229,10 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
int ssid, ret;
cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
+ ret = cgroup_rstat_init(&dsct->self);
+ if (ret)
+ return ret;
+
for_each_subsys(ss, ssid) {
struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
@@ -3239,6 +3243,10 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
css = css_create(dsct, ss);
if (IS_ERR(css))
return PTR_ERR(css);
+
+ ret = cgroup_rstat_init(css);
+ if (ret)
+ goto err_free_css;
}
WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
@@ -3252,6 +3260,20 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
}
return 0;
+
+err_free_css:
+ cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
+ cgroup_rstat_exit(&dsct->self);
+
+ for_each_subsys(ss, ssid) {
+ struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
+
+ if (css != &dsct->self)
+ cgroup_rstat_exit(css);
+ }
+ }
+
+ return ret;
}
/**
@@ -5403,6 +5425,7 @@ static void css_free_rwork_fn(struct work_struct *work)
struct cgroup_subsys_state, destroy_rwork);
struct cgroup_subsys *ss = css->ss;
struct cgroup *cgrp = css->cgroup;
+ int ssid;
percpu_ref_exit(&css->refcnt);
@@ -5435,7 +5458,12 @@ static void css_free_rwork_fn(struct work_struct *work)
cgroup_put(cgroup_parent(cgrp));
kernfs_put(cgrp->kn);
psi_cgroup_free(cgrp);
- cgroup_rstat_exit(css);
+ for_each_subsys(ss, ssid) {
+ struct cgroup_subsys_state *css = cgrp->subsys[ssid];
+
+ if (css)
+ cgroup_rstat_exit(css);
+ }
kfree(cgrp);
} else {
/*
@@ -5541,6 +5569,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css->id = -1;
INIT_LIST_HEAD(&css->sibling);
INIT_LIST_HEAD(&css->children);
+ INIT_LIST_HEAD(&css->rstat_css_list);
INIT_LIST_HEAD(&css->rstat_css_node);
css->serial_nr = css_serial_nr_next++;
atomic_set(&css->online_cnt, 0);
@@ -5551,7 +5580,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
}
if (ss->css_rstat_flush)
- list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
+ list_add_rcu(&css->rstat_css_node, &css->rstat_css_list);
BUG_ON(cgroup_css(cgrp, ss));
}
@@ -5686,14 +5715,6 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
if (ret)
goto out_free_cgrp;
- /* init self cgroup early so css->cgroup is valid within cgroup_rstat_init()
- * note that this will go away in a subsequent patch in this series
- */
- cgrp->self.cgroup = cgrp;
- ret = cgroup_rstat_init(&cgrp->self);
- if (ret)
- goto out_cancel_ref;
-
/* create the directory */
kn = kernfs_create_dir_ns(parent->kn, name, mode,
current_fsuid(), current_fsgid(),
@@ -5784,7 +5805,6 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
kernfs_remove(cgrp->kn);
out_stat_exit:
cgroup_rstat_exit(&cgrp->self);
-out_cancel_ref:
percpu_ref_exit(&cgrp->self.refcnt);
out_free_cgrp:
kfree(cgrp);
@@ -6189,6 +6209,8 @@ int __init cgroup_init(void)
cgroup_unlock();
for_each_subsys(ss, ssid) {
+ struct cgroup_subsys_state *css;
+
if (ss->early_init) {
struct cgroup_subsys_state *css =
init_css_set.subsys[ss->id];
@@ -6200,6 +6222,9 @@ int __init cgroup_init(void)
cgroup_init_subsys(ss, false);
}
+ css = init_css_set.subsys[ss->id];
+ BUG_ON(cgroup_rstat_init(css));
+
list_add_tail(&init_css_set.e_cset_node[ssid],
&cgrp_dfl_root.cgrp.e_csets[ssid]);
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 01a5c185b02a..4381eb9ac426 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -14,9 +14,10 @@ static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
-static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
+static struct cgroup_rstat_cpu *css_rstat_cpu(
+ struct cgroup_subsys_state *css, int cpu)
{
- return per_cpu_ptr(cgrp->rstat_cpu, cpu);
+ return per_cpu_ptr(css->rstat_cpu, cpu);
}
/*
@@ -96,15 +97,16 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
* instead of NULL, we can tell whether @cgrp is on the list by
* testing the next pointer for NULL.
*/
- if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
+ if (data_race(css_rstat_cpu(css, cpu)->updated_next))
return;
flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
/* put @cgrp and all ancestors on the corresponding updated lists */
while (true) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
- struct cgroup *parent = cgroup_parent(cgrp);
+ struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
+ struct cgroup_subsys_state *parent = css->parent;
struct cgroup_rstat_cpu *prstatc;
/*
@@ -116,15 +118,15 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
/* Root has no parent to link it to, but mark it busy */
if (!parent) {
- rstatc->updated_next = cgrp;
+ rstatc->updated_next = css;
break;
}
- prstatc = cgroup_rstat_cpu(parent, cpu);
+ prstatc = css_rstat_cpu(parent, cpu);
rstatc->updated_next = prstatc->updated_children;
- prstatc->updated_children = cgrp;
+ prstatc->updated_children = css;
- cgrp = parent;
+ css = parent;
}
_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
@@ -142,12 +144,13 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
* into a singly linked list built from the tail backward like "pushing"
* cgroups into a stack. The root is pushed by the caller.
*/
-static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
- struct cgroup *child, int cpu)
+static struct cgroup_subsys_state *cgroup_rstat_push_children(
+ struct cgroup_subsys_state *head,
+ struct cgroup_subsys_state *child, int cpu)
{
- struct cgroup *chead = child; /* Head of child cgroup level */
- struct cgroup *ghead = NULL; /* Head of grandchild cgroup level */
- struct cgroup *parent, *grandchild;
+ struct cgroup_subsys_state *chead = child; /* Head of child cgroup level */
+ struct cgroup_subsys_state *ghead = NULL; /* Head of grandchild cgroup level */
+ struct cgroup_subsys_state *parent, *grandchild;
struct cgroup_rstat_cpu *crstatc;
child->rstat_flush_next = NULL;
@@ -156,13 +159,13 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
while (chead) {
child = chead;
chead = child->rstat_flush_next;
- parent = cgroup_parent(child);
+ parent = child->parent;
/* updated_next is parent cgroup terminated */
while (child != parent) {
child->rstat_flush_next = head;
head = child;
- crstatc = cgroup_rstat_cpu(child, cpu);
+ crstatc = css_rstat_cpu(child, cpu);
grandchild = crstatc->updated_children;
if (grandchild != child) {
/* Push the grand child to the next level */
@@ -201,16 +204,15 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
* within the children list and terminated by the parent cgroup. An exception
* here is the cgroup root whose updated_next can be self terminated.
*/
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup_subsys_state *root_css,
- int cpu)
+static struct cgroup_subsys_state *cgroup_rstat_updated_list(
+ struct cgroup_subsys_state *root, int cpu)
{
- struct cgroup *root = root_css->cgroup;
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
- struct cgroup *head = NULL, *parent, *child;
+ struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
+ struct cgroup_subsys_state *head = NULL, *parent, *child;
unsigned long flags;
- flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
+ flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root->cgroup, false);
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
@@ -220,17 +222,17 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup_subsys_state *root
* Unlink @root from its parent. As the updated_children list is
* singly linked, we have to walk it to find the removal point.
*/
- parent = cgroup_parent(root);
+ parent = root->parent;
if (parent) {
struct cgroup_rstat_cpu *prstatc;
- struct cgroup **nextp;
+ struct cgroup_subsys_state **nextp;
- prstatc = cgroup_rstat_cpu(parent, cpu);
+ prstatc = css_rstat_cpu(parent, cpu);
nextp = &prstatc->updated_children;
while (*nextp != root) {
struct cgroup_rstat_cpu *nrstatc;
- nrstatc = cgroup_rstat_cpu(*nextp, cpu);
+ nrstatc = css_rstat_cpu(*nextp, cpu);
WARN_ON_ONCE(*nextp == parent);
nextp = &nrstatc->updated_next;
}
@@ -247,7 +249,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup_subsys_state *root
if (child != root)
head = cgroup_rstat_push_children(head, child, cpu);
unlock_ret:
- _cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
+ _cgroup_rstat_cpu_unlock(cpu_lock, cpu, root->cgroup, flags, false);
return head;
}
@@ -316,13 +318,13 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
lockdep_assert_held(&cgroup_rstat_lock);
for_each_possible_cpu(cpu) {
- struct cgroup *pos = cgroup_rstat_updated_list(css, cpu);
+ struct cgroup_subsys_state *pos = cgroup_rstat_updated_list(css, cpu);
for (; pos; pos = pos->rstat_flush_next) {
struct cgroup_subsys_state *css_iter;
- cgroup_base_stat_flush(pos, cpu);
- bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
+ cgroup_base_stat_flush(pos->cgroup, cpu);
+ bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
rcu_read_lock();
list_for_each_entry_rcu(css_iter, &pos->rstat_css_list,
@@ -392,21 +394,20 @@ void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
int cgroup_rstat_init(struct cgroup_subsys_state *css)
{
- struct cgroup *cgrp = css->cgroup;
int cpu;
- /* the root cgrp has rstat_cpu preallocated */
- if (!cgrp->rstat_cpu) {
- cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
- if (!cgrp->rstat_cpu)
+ /* the root cgrp css has rstat_cpu preallocated */
+ if (!css->rstat_cpu) {
+ css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+ if (!css->rstat_cpu)
return -ENOMEM;
}
/* ->updated_children list is self terminated */
for_each_possible_cpu(cpu) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+ struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
- rstatc->updated_children = cgrp;
+ rstatc->updated_children = css;
u64_stats_init(&rstatc->bsync);
}
@@ -415,22 +416,21 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
void cgroup_rstat_exit(struct cgroup_subsys_state *css)
{
- struct cgroup *cgrp = css->cgroup;
int cpu;
- cgroup_rstat_flush(&cgrp->self);
+ cgroup_rstat_flush(css);
/* sanity check */
for_each_possible_cpu(cpu) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+ struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
- if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
+ if (WARN_ON_ONCE(rstatc->updated_children != css) ||
WARN_ON_ONCE(rstatc->updated_next))
return;
}
- free_percpu(cgrp->rstat_cpu);
- cgrp->rstat_cpu = NULL;
+ free_percpu(css->rstat_cpu);
+ css->rstat_cpu = NULL;
}
void __init cgroup_rstat_boot(void)
@@ -471,7 +471,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
{
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+ struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(&cgrp->self, cpu);
struct cgroup *parent = cgroup_parent(cgrp);
struct cgroup_rstat_cpu *prstatc;
struct cgroup_base_stat delta;
@@ -501,7 +501,7 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
cgroup_base_stat_add(&cgrp->last_bstat, &delta);
delta = rstatc->subtree_bstat;
- prstatc = cgroup_rstat_cpu(parent, cpu);
+ prstatc = css_rstat_cpu(&parent->self, cpu);
cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
@@ -513,7 +513,7 @@ cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags
{
struct cgroup_rstat_cpu *rstatc;
- rstatc = get_cpu_ptr(cgrp->rstat_cpu);
+ rstatc = get_cpu_ptr(cgrp->self.rstat_cpu);
*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
return rstatc;
}
--
2.47.1
* [RFC PATCH 5/9 v2] cgroup: separate locking between base css and others
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
` (3 preceding siblings ...)
2025-01-03 1:50 ` [RFC PATCH 4/9 v2] cgroup: split rstat from cgroup into separate css JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 6/9 v2] cgroup: isolate base stat flush JP Kobryn
` (5 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
Separate locks can be used to eliminate contention between subsystems that make
use of rstat. The base stats also get their own lock. Where applicable, the
presence of a subsystem pointer on the given cgroup_subsys_state determines
whether it is the base css, and thus which lock to take.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/cgroup-defs.h | 2 +
kernel/cgroup/rstat.c | 92 +++++++++++++++++++++++++------------
2 files changed, 65 insertions(+), 29 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1932f8ae7995..4d87519ff023 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -790,6 +790,8 @@ struct cgroup_subsys {
* specifies the mask of subsystems that this one depends on.
*/
unsigned int depends_on;
+
+ spinlock_t rstat_lock;
};
extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 4381eb9ac426..958bdccf0359 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,8 +9,9 @@
#include <trace/events/cgroup.h>
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
-static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock[CGROUP_SUBSYS_COUNT]);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -86,7 +87,7 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
__bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
struct cgroup *cgrp = css->cgroup;
- raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+ raw_spinlock_t *cpu_lock;
unsigned long flags;
/*
@@ -100,6 +101,11 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
if (data_race(css_rstat_cpu(css, cpu)->updated_next))
return;
+ if (css->ss)
+ cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock[css->ss->id], cpu);
+ else
+ cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
+
flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
/* put @cgrp and all ancestors on the corresponding updated lists */
@@ -207,11 +213,16 @@ static struct cgroup_subsys_state *cgroup_rstat_push_children(
static struct cgroup_subsys_state *cgroup_rstat_updated_list(
struct cgroup_subsys_state *root, int cpu)
{
- raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
struct cgroup_subsys_state *head = NULL, *parent, *child;
+ raw_spinlock_t *cpu_lock;
unsigned long flags;
+ if (root->ss)
+ cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock[root->ss->id], cpu);
+ else
+ cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
+
flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root->cgroup, false);
/* Return NULL if this subtree is not on-list */
@@ -285,37 +296,44 @@ __bpf_hook_end();
* number processed last.
*/
static inline void __cgroup_rstat_lock(struct cgroup_subsys_state *css,
- int cpu_in_loop)
- __acquires(&cgroup_rstat_lock)
+ spinlock_t *lock, int cpu_in_loop)
+ __acquires(lock)
{
struct cgroup *cgrp = css->cgroup;
bool contended;
- contended = !spin_trylock_irq(&cgroup_rstat_lock);
+ contended = !spin_trylock_irq(lock);
if (contended) {
trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
- spin_lock_irq(&cgroup_rstat_lock);
+ spin_lock_irq(lock);
}
trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
}
static inline void __cgroup_rstat_unlock(struct cgroup_subsys_state *css,
- int cpu_in_loop)
- __releases(&cgroup_rstat_lock)
+ spinlock_t *lock, int cpu_in_loop)
+ __releases(lock)
{
struct cgroup *cgrp = css->cgroup;
trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock_irq(lock);
}
/* see cgroup_rstat_flush() */
static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
- __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
+ __releases(&css->ss->rstat_lock) __acquires(&css->ss->rstat_lock)
{
+ spinlock_t *lock;
int cpu;
- lockdep_assert_held(&cgroup_rstat_lock);
+ if (!css->ss) {
+ pr_warn("cannot use generic flush on base subsystem\n");
+ return;
+ }
+
+ lock = &css->ss->rstat_lock;
+ lockdep_assert_held(lock);
for_each_possible_cpu(cpu) {
struct cgroup_subsys_state *pos = cgroup_rstat_updated_list(css, cpu);
@@ -334,11 +352,11 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
}
/* play nice and yield if necessary */
- if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
- __cgroup_rstat_unlock(css, cpu);
+ if (need_resched() || spin_needbreak(lock)) {
+ __cgroup_rstat_unlock(css, lock, cpu);
if (!cond_resched())
cpu_relax();
- __cgroup_rstat_lock(css, cpu);
+ __cgroup_rstat_lock(css, lock, cpu);
}
}
}
@@ -358,11 +376,22 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
*/
__bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
+ spinlock_t *lock;
+
+ if (!css->ss) {
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ cgroup_base_stat_flush(css->cgroup, cpu);
+ return;
+ }
+
might_sleep();
- __cgroup_rstat_lock(css, -1);
+ lock = &css->ss->rstat_lock;
+ __cgroup_rstat_lock(css, lock, -1);
cgroup_rstat_flush_locked(css);
- __cgroup_rstat_unlock(css, -1);
+ __cgroup_rstat_unlock(css, lock, -1);
}
/**
@@ -374,11 +403,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
*
* This function may block.
*/
-void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
- __acquires(&cgroup_rstat_lock)
+static void cgroup_rstat_base_flush_hold(struct cgroup_subsys_state *css)
+ __acquires(&cgroup_rstat_base_lock)
{
might_sleep();
- __cgroup_rstat_lock(css, -1);
+ __cgroup_rstat_lock(css, &cgroup_rstat_base_lock, -1);
cgroup_rstat_flush_locked(css);
}
@@ -386,10 +415,10 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
* cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
* @cgrp: cgroup used by tracepoint
*/
-void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
- __releases(&cgroup_rstat_lock)
+static void cgroup_rstat_base_flush_release(struct cgroup_subsys_state *css)
+ __releases(&cgroup_rstat_base_lock)
{
- __cgroup_rstat_unlock(css, -1);
+ __cgroup_rstat_unlock(css, &cgroup_rstat_base_lock, -1);
}
int cgroup_rstat_init(struct cgroup_subsys_state *css)
@@ -435,10 +464,15 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
void __init cgroup_rstat_boot(void)
{
- int cpu;
+ struct cgroup_subsys *ss;
+ int cpu, ssid;
+
+ for_each_possible_cpu(cpu) {
+ raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
- for_each_possible_cpu(cpu)
- raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
+ for_each_subsys(ss, ssid)
+ raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock[ssid], cpu));
+ }
}
/*
@@ -629,12 +663,12 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
u64 usage, utime, stime, ntime;
if (cgroup_parent(cgrp)) {
- cgroup_rstat_flush_hold(css);
+ cgroup_rstat_base_flush_hold(css);
usage = cgrp->bstat.cputime.sum_exec_runtime;
cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
&utime, &stime);
ntime = cgrp->bstat.ntime;
- cgroup_rstat_flush_release(css);
+ cgroup_rstat_base_flush_release(css);
} else {
/* cgrp->bstat of root is not actually used, reuse it */
root_cgroup_cputime(&cgrp->bstat);
--
2.47.1
* [RFC PATCH 6/9 v2] cgroup: isolate base stat flush
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
` (4 preceding siblings ...)
2025-01-03 1:50 ` [RFC PATCH 5/9 v2] cgroup: separate locking between base css and others JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 7/9 v2] cgroup: remove unneeded rcu list JP Kobryn
` (4 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
Remove the base stat flushing from the generic css flushing routine. Provide a
direct way to flush the base stats and make use of it when the cpu stats are
read.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 958bdccf0359..92a46b960be1 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -320,6 +320,14 @@ static inline void __cgroup_rstat_unlock(struct cgroup_subsys_state *css,
spin_unlock_irq(lock);
}
+static void cgroup_rstat_base_flush_locked(struct cgroup *cgrp)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ cgroup_base_stat_flush(cgrp, cpu);
+}
+
/* see cgroup_rstat_flush() */
static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
__releases(&css->ss->rstat_lock) __acquires(&css->ss->rstat_lock)
@@ -341,7 +349,6 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
for (; pos; pos = pos->rstat_flush_next) {
struct cgroup_subsys_state *css_iter;
- cgroup_base_stat_flush(pos->cgroup, cpu);
bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
rcu_read_lock();
@@ -408,7 +415,7 @@ static void cgroup_rstat_base_flush_hold(struct cgroup_subsys_state *css)
{
might_sleep();
__cgroup_rstat_lock(css, &cgroup_rstat_base_lock, -1);
- cgroup_rstat_flush_locked(css);
+ cgroup_rstat_base_flush_locked(css->cgroup);
}
/**
--
2.47.1
* [RFC PATCH 7/9 v2] cgroup: remove unneeded rcu list
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
` (5 preceding siblings ...)
2025-01-03 1:50 ` [RFC PATCH 6/9 v2] cgroup: isolate base stat flush JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 8/9 v2] cgroup: remove bpf rstat flush from css generic flush JP Kobryn
` (3 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
Since the cgroup_subsys_state now owns the rstat tree, the list management
previously done on the cgroup to keep track of which subsystem states are
participating in rstat is no longer needed.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/cgroup-defs.h | 11 -----------
kernel/cgroup/cgroup.c | 20 +-------------------
kernel/cgroup/rstat.c | 9 +--------
3 files changed, 2 insertions(+), 38 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 4d87519ff023..836260c422a0 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -182,14 +182,6 @@ struct cgroup_subsys_state {
/* per-cpu recursive resource statistics */
struct cgroup_rstat_cpu __percpu *rstat_cpu;
- struct list_head rstat_css_list;
-
- /*
- * Add padding to separate the read mostly rstat_cpu and
- * rstat_css_list into a different cacheline from the following
- * rstat_flush_next and *bstat fields which can have frequent updates.
- */
- CACHELINE_PADDING(_pad_);
/*
* A singly-linked list of cgroup structures to be rstat flushed.
@@ -198,9 +190,6 @@ struct cgroup_subsys_state {
*/
struct cgroup_subsys_state *rstat_flush_next;
- /* flush target list anchored at cgrp->rstat_css_list */
- struct list_head rstat_css_node;
-
/*
* PI: Subsys-unique ID. 0 is unused and root is always 1. The
* matching css can be looked up using css_from_id().
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 96a2d15fe5e9..a36ed3995c6f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1826,7 +1826,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
struct cgroup_root *src_root = ss->root;
struct cgroup *scgrp = &src_root->cgrp;
struct cgroup_subsys_state *css = cgroup_css(scgrp, ss);
- struct cgroup_subsys_state *dcss = cgroup_css(dcgrp, ss);
struct css_set *cset, *cset_pos;
struct css_task_iter *it;
@@ -1864,13 +1863,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
}
spin_unlock_irq(&css_set_lock);
- if (ss->css_rstat_flush) {
- list_del_rcu(&css->rstat_css_node);
- synchronize_rcu();
- list_add_rcu(&css->rstat_css_node,
- &dcss->rstat_css_list);
- }
-
/* default hierarchy doesn't enable controllers by default */
dst_root->subsys_mask |= 1 << ssid;
if (dst_root == &cgrp_dfl_root) {
@@ -5491,11 +5483,7 @@ static void css_release_work_fn(struct work_struct *work)
if (ss) {
struct cgroup *parent_cgrp;
- /* css release path */
- if (!list_empty(&css->rstat_css_node)) {
- cgroup_rstat_flush(css);
- list_del_rcu(&css->rstat_css_node);
- }
+ cgroup_rstat_flush(css);
cgroup_idr_replace(&ss->css_idr, NULL, css->id);
if (ss->css_released)
@@ -5569,8 +5557,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css->id = -1;
INIT_LIST_HEAD(&css->sibling);
INIT_LIST_HEAD(&css->children);
- INIT_LIST_HEAD(&css->rstat_css_list);
- INIT_LIST_HEAD(&css->rstat_css_node);
css->serial_nr = css_serial_nr_next++;
atomic_set(&css->online_cnt, 0);
@@ -5579,9 +5565,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css_get(css->parent);
}
- if (ss->css_rstat_flush)
- list_add_rcu(&css->rstat_css_node, &css->rstat_css_list);
-
BUG_ON(cgroup_css(cgrp, ss));
}
@@ -5687,7 +5670,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
err_list_del:
list_del_rcu(&css->sibling);
err_free_css:
- list_del_rcu(&css->rstat_css_node);
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
return ERR_PTR(err);
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 92a46b960be1..c52e8429c75d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -347,15 +347,8 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
struct cgroup_subsys_state *pos = cgroup_rstat_updated_list(css, cpu);
for (; pos; pos = pos->rstat_flush_next) {
- struct cgroup_subsys_state *css_iter;
-
bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
-
- rcu_read_lock();
- list_for_each_entry_rcu(css_iter, &pos->rstat_css_list,
- rstat_css_node)
- css_iter->ss->css_rstat_flush(css_iter, cpu);
- rcu_read_unlock();
+ pos->ss->css_rstat_flush(pos, cpu);
}
/* play nice and yield if necessary */
--
2.47.1
* [RFC PATCH 8/9 v2] cgroup: remove bpf rstat flush from css generic flush
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
` (6 preceding siblings ...)
2025-01-03 1:50 ` [RFC PATCH 7/9 v2] cgroup: remove unneeded rcu list JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 9/9 v2] cgroup: avoid allocating rstat when flush func not present JP Kobryn
` (2 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
Remove the bpf-specific flush call from the generic subsystem flush. Leave it
up to bpf programs to manually flush any subsystems desired by using the kfunc
cgroup_rstat_flush().
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index c52e8429c75d..03effaaf09a4 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -346,10 +346,8 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
for_each_possible_cpu(cpu) {
struct cgroup_subsys_state *pos = cgroup_rstat_updated_list(css, cpu);
- for (; pos; pos = pos->rstat_flush_next) {
- bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
+ for (; pos; pos = pos->rstat_flush_next)
pos->ss->css_rstat_flush(pos, cpu);
- }
/* play nice and yield if necessary */
if (need_resched() || spin_needbreak(lock)) {
--
2.47.1
* [RFC PATCH 9/9 v2] cgroup: avoid allocating rstat when flush func not present
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
` (7 preceding siblings ...)
2025-01-03 1:50 ` [RFC PATCH 8/9 v2] cgroup: remove bpf rstat flush from css generic flush JP Kobryn
@ 2025-01-03 1:50 ` JP Kobryn
2025-01-03 22:08 ` [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees Tejun Heo
2025-01-06 23:13 ` Yosry Ahmed
10 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-01-03 1:50 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups
If a given subsystem is not the base type and does not have a flush func, do
not perform any rstat allocation.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 03effaaf09a4..4feefa37fa46 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -423,6 +423,9 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
{
int cpu;
+ if (css->ss && !css->ss->css_rstat_flush)
+ return 0;
+
/* the root cgrp css has rstat_cpu preallocated */
if (!css->rstat_cpu) {
css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
@@ -445,6 +448,9 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
{
int cpu;
+ if (css->ss && !css->ss->css_rstat_flush)
+ return;
+
cgroup_rstat_flush(css);
/* sanity check */
--
2.47.1
* Re: [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
` (8 preceding siblings ...)
2025-01-03 1:50 ` [RFC PATCH 9/9 v2] cgroup: avoid allocating rstat when flush func not present JP Kobryn
@ 2025-01-03 22:08 ` Tejun Heo
2025-01-13 18:39 ` Shakeel Butt
2025-01-06 23:13 ` Yosry Ahmed
10 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2025-01-03 22:08 UTC (permalink / raw)
To: JP Kobryn
Cc: shakeel.butt, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups
Hello,
On Thu, Jan 02, 2025 at 05:50:11PM -0800, JP Kobryn wrote:
...
> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.
The rationale for using one tree to track all subsystems was that if one
subsys has been active (e.g. memory), it's likely that other subsyses have
been active too (e.g. cpu) and thus we might as well flush the whole thing
together. The approach can be useful for reducing the amount of work done
when e.g. there are a lot of cgroups which are only active periodically but
has drawbacks when one subsystem's stats are read a lot more actively than
others as you pointed out.
Intuitions go only so far and it's difficult to judge whether splitting the
trees would be a good idea without data. Can you please provide some
numbers along with rationales for the test setups?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
` (9 preceding siblings ...)
2025-01-03 22:08 ` [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees Tejun Heo
@ 2025-01-06 23:13 ` Yosry Ahmed
10 siblings, 0 replies; 14+ messages in thread
From: Yosry Ahmed @ 2025-01-06 23:13 UTC (permalink / raw)
To: JP Kobryn
Cc: shakeel.butt, tj, mhocko, hannes, akpm, linux-mm, cgroups,
Michal Koutný
On Thu, Jan 2, 2025 at 5:50 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> The current rstat model is set up to keep track of cgroup stats on a per-cpu
> basis. When a stat (of any subsystem) is updated, the updater notes this change
> using the cgroup_rstat_updated() API call. This change is propagated to the
> cpu-specific rstat tree, by appending the updated cgroup to the tree (unless
> it's already on the tree). So for each cpu, an rstat tree will consist of the
> cgroups that reported one or more updated stats. Later on when a flush is
> requested via cgroup_rstat_flush(), each per-cpu rstat tree is traversed
> starting at the requested cgroup and the subsystem-specific flush callbacks
> (via css_rstat_flush) are invoked along the way. During the flush, the section
> of the tree starting at the requested cgroup through its descendants are
> removed.
>
> Using the cgroup struct to represent nodes of change means that the changes
> represented by a given tree are heterogeneous - the tree can consist of nodes
> that have changes from different subsystems; i.e. changes in stats from the
> memory subsystem and the io subsystem can coexist in the same tree. The
> implication is that when a flush is requested, usually in the context of a
> single subsystem, all other subsystems need to be flushed along with it. This
> seems to have become a drawback due to how expensive the flushing of the
> memory-specific stats have become [0][1]. Another implication is when updates
> are performed, subsystems may contend with each other over the locks involved.
>
> I've been experimenting with an idea that allows for isolating the updating and
> flushing of cgroup stats on a per-subsystem basis. The idea was instead of
> having a per-cpu rstat tree for managing stats across all subsystems, we could
> split up the per-cpu trees into separate trees for each subsystem. So each cpu
> would have separate trees for each subsystem. It would allow subsystems to
> update and flush their stats without any contention or extra overhead from
> other subsystems. The core change is moving ownership of the the rstat entities
> from the cgroup struct onto the cgroup_subsystem_state struct.
>
> To complement the ownership change, the lockng scheme was adjusted. The global
> cgroup_rstat_lock for synchronizing updates and flushes was replaced with
> subsystem-specific locks (in the cgroup_subsystem struct). An additional global
> lock was added to allow the base stats pseudo-subsystem to be synchronized in a
> similar way. The per-cpu locks called cgroup_rstat_cpu_lock have changed to a
> per-cpu array of locks which is indexed by subsystem id. Following suit, there
> is also a per-cpu array of locks dedicated to the base subsystem. The dedicated
> locks for the base stats was added since the base stats have a NULL subsystem
> so it did not fit the subsystem id index approach.
>
> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.
I remember discussing this with Shakeel and Michal Koutný at LPC two
years ago. I suggested it multiple times over the last few years, most
recently in: https://lore.kernel.org/lkml/CAJD7tkbpFu8z1HaUgkaE6bup_fsD39QLPmgNyOnaTrm+hZ_9hA@mail.gmail.com/.
I think it conceptually makes sense. I took a stab at it when I was
working on fixing the hard lockups due to atomic flushing, but the
system I was working on was using cgroup v1, where different subsystems
already had different hierarchies (and hence different trees), so it
wouldn't have helped there.
Splitting the trees should especially help the MM subsystem, which
apparently flushes most often and has the most expensive flushes, so
other subsystems are probably being taxed unnecessarily.
>
> [0] https://lore.kernel.org/all/CAOm-9arwY3VLUx5189JAR9J7B=Miad9nQjjet_VNdT3i+J+5FA@mail.gmail.com/
> [1] https://github.blog/engineering/debugging-network-stalls-on-kubernetes/
>
> Changelog
> v2: updated cover letter and some patch text. no code changes.
>
> JP Kobryn (8):
> change cgroup to css in rstat updated and flush api
> change cgroup to css in rstat internal flush and lock funcs
> change cgroup to css in rstat init and exit api
> split rstat from cgroup into separate css
> separate locking between base css and others
> isolate base stat flush
> remove unneeded rcu list
> remove bpf rstat flush from css generic flush
>
> block/blk-cgroup.c | 4 +-
> include/linux/cgroup-defs.h | 35 ++---
> include/linux/cgroup.h | 8 +-
> kernel/cgroup/cgroup-internal.h | 4 +-
> kernel/cgroup/cgroup.c | 79 ++++++-----
> kernel/cgroup/rstat.c | 225 +++++++++++++++++++-------------
> mm/memcontrol.c | 4 +-
> 7 files changed, 203 insertions(+), 156 deletions(-)
>
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees
2025-01-03 22:08 ` [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees Tejun Heo
@ 2025-01-13 18:39 ` Shakeel Butt
2025-01-13 19:14 ` Tejun Heo
0 siblings, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2025-01-13 18:39 UTC (permalink / raw)
To: Tejun Heo; +Cc: JP Kobryn, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups
On Fri, Jan 03, 2025 at 12:08:39PM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Jan 02, 2025 at 05:50:11PM -0800, JP Kobryn wrote:
> ...
> > I reached a point where this started to feel stable in my local testing, so I
> > wanted to share and get feedback on this approach.
>
> The rationale for using one tree to track all subsystems was that if one
> subsys has been active (e.g. memory), it's likely that other subsyses have
> been active too (e.g. cpu) and thus we might as well flush the whole thing
> together. The approach can be useful for reducing the amount of work done
> when e.g. there are a lot of cgroups which are only active periodically but
> has drawbacks when one subsystem's stats are read a lot more actively than
> others as you pointed out.
I wanted to add two more points to the above: (1) one subsystem (memory)
has an in-kernel stats consumer with strict latency/performance
requirements, and (2) the flush cost of memory stats has drastically
increased due to the more than 100 stats it has to maintain.
>
> Intuitions go only so far and it's difficult to judge whether splitting the
> trees would be a good idea without data. Can you please provide some
> numbers along with rationales for the test setups?
Here I think the supporting data we can show is (1) that non-memory stat
readers no longer spend time flushing memory stats, and (2) whether, with
per-subsystem update trees, the cost of update-tree insertion has
increased in general.
Anything else you think will be needed?
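For (1), a minimal sketch like the following could be timed before and
after the series (the cgroup path and iteration count are placeholders):

/* sketch: time repeated reads of one subsystem's stat file */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/test/cpu.stat"; /* placeholder cgroup */
	char buf[8192];
	struct timespec start, end;
	long i, iters = 100000;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return 1;
		/* each read typically triggers an rstat flush for this subsystem */
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("%ld reads: %.1f us/read\n", iters,
	       ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / iters / 1e3);
	return 0;
}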
Thanks Tejun for taking a look.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees
2025-01-13 18:39 ` Shakeel Butt
@ 2025-01-13 19:14 ` Tejun Heo
0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2025-01-13 19:14 UTC (permalink / raw)
To: Shakeel Butt
Cc: JP Kobryn, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups
Hello,
On Mon, Jan 13, 2025 at 10:39:02AM -0800, Shakeel Butt wrote:
...
> Here I think the supporting data we can show is (1) that non-memory stat
> readers no longer spend time flushing memory stats, and (2) whether, with
> per-subsystem update trees, the cost of update-tree insertion has
> increased in general.
>
> Anything else you think will be needed?
Anything that shows:
1. It doesn't cause noticeable regressions for generic use cases.
2. It has noticeable benefits in the targeted use cases.
The above is a bit willy-nilly but 2 should be fairly straightforward. For
1, the main worry would be whether high(er) frequency monitoring using
something like systemd-cgtop or below becomes noticeably more expensive on a
large and busy system.
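As a rough way to approximate that load (the mount point and stat file
list are assumptions), one could time a full walk of the hierarchy that
reads every controller's stat file for every cgroup, which is roughly
what a cgtop-style refresh does:

/* sketch: one cgtop-like refresh pass over the whole hierarchy */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <ftw.h>
#include <stdio.h>
#include <unistd.h>

static const char *stat_files[] = { "cpu.stat", "memory.stat", "io.stat" };

static int visit(const char *path, const struct stat *sb, int type,
		 struct FTW *ftwbuf)
{
	char buf[16384];
	size_t i;

	if (type != FTW_D)
		return 0;
	for (i = 0; i < sizeof(stat_files) / sizeof(stat_files[0]); i++) {
		char name[4096];
		int fd;

		snprintf(name, sizeof(name), "%s/%s", path, stat_files[i]);
		fd = open(name, O_RDONLY);
		if (fd < 0)
			continue; /* controller not enabled on this cgroup */
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
	return 0;
}

int main(void)
{
	/* time this under a busy workload, before and after the series */
	return nftw("/sys/fs/cgroup", visit, 16, FTW_PHYS);
}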
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-01-13 19:14 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-03 1:50 [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 1/9 v2] cgroup: change cgroup to css in rstat updated and flush api JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 2/9 v2] cgroup: change cgroup to css in rstat internal flush and lock funcs JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 3/9 v2] cgroup: change cgroup to css in rstat init and exit api JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 4/9 v2] cgroup: split rstat from cgroup into separate css JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 5/9 v2] cgroup: separate locking between base css and others JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 6/9 v2] cgroup: isolate base stat flush JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 7/9 v2] cgroup: remove unneeded rcu list JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 8/9 v2] cgroup: remove bpf rstat flush from css generic flush JP Kobryn
2025-01-03 1:50 ` [RFC PATCH 9/9 v2] cgroup: avoid allocating rstat when flush func not present JP Kobryn
2025-01-03 22:08 ` [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees Tejun Heo
2025-01-13 18:39 ` Shakeel Butt
2025-01-13 19:14 ` Tejun Heo
2025-01-06 23:13 ` Yosry Ahmed
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox