linux-mm.kvack.org archive mirror
* [PATCH 0/9 RFC] cgroup: separate rstat trees
@ 2024-12-24  1:13 JP Kobryn
  2024-12-24  1:13 ` [PATCH 1/9 RFC] change cgroup to css in rstat updated and flush api JP Kobryn
                   ` (10 more replies)
  0 siblings, 11 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:13 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

I've been experimenting with these changes to allow separate updating and
flushing of cgroup stats per subsystem. The idea is that instead of a single
per-cpu rstat tree managing stats across all subsystems, the rstat trees are
split so that each cpu has an individual tree per subsystem. This allows
subsystems to update and flush their stats without contending with one
another, i.e. the io subsystem would not have to wait for an in-progress
memory subsystem flush to finish. The core change is moving the rstat
entities off of the cgroup struct and onto the cgroup_subsys_state struct.
Every patch revolves around that concept.
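
For illustration, callers now pass their css rather than the cgroup, so each
controller touches only its own per-cpu tree (a minimal sketch based on the
memcg call sites changed in the patches below; not a complete listing):

	/* before: all subsystems share one rstat tree keyed by the cgroup */
	cgroup_rstat_updated(memcg->css.cgroup, cpu);
	cgroup_rstat_flush(memcg->css.cgroup);

	/* after: each cgroup_subsys_state owns its own per-cpu rstat tree,
	 * so a memcg flush no longer contends with e.g. an io flush
	 */
	cgroup_rstat_updated(&memcg->css, cpu);
	cgroup_rstat_flush(&memcg->css);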

I reached a point where this started to feel stable in my local testing, so I
wanted to share and get feedback on this approach.

JP Kobryn (9):
  change cgroup to css in rstat updated and flush api
  change cgroup to css in rstat internal flush and lock funcs
  change cgroup to css in rstat init and exit api
  split rstat from cgroup into separate css
  separate locking between base css and others
  isolate base stat flush
  remove unneeded rcu list
  remove bpf rstat flush from css generic flush
  avoid allocating rstat when flush func not present

 block/blk-cgroup.c              |   4 +-
 include/linux/cgroup-defs.h     |  35 ++---
 include/linux/cgroup.h          |   8 +-
 kernel/cgroup/cgroup-internal.h |   4 +-
 kernel/cgroup/cgroup.c          |  79 ++++++-----
 kernel/cgroup/rstat.c           | 225 +++++++++++++++++++-------------
 mm/memcontrol.c                 |   4 +-
 7 files changed, 203 insertions(+), 156 deletions(-)

-- 
2.47.1




* [PATCH 1/9 RFC] change cgroup to css in rstat updated and flush api
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
@ 2024-12-24  1:13 ` JP Kobryn
  2024-12-24  1:13 ` [PATCH 2/9 RFC] cgroup: change cgroup to css in rstat internal flush and lock funcs JP Kobryn
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:13 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

Change API calls to accept a cgroup_subsys_state instead of a cgroup.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 block/blk-cgroup.c     |  4 ++--
 include/linux/cgroup.h |  8 ++++----
 kernel/cgroup/cgroup.c |  4 ++--
 kernel/cgroup/rstat.c  | 28 +++++++++++++++++++---------
 mm/memcontrol.c        |  4 ++--
 5 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 45a395862fbc..98e586ae21ec 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1200,7 +1200,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
-		cgroup_rstat_flush(blkcg->css.cgroup);
+		cgroup_rstat_flush(&blkcg->css);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -2183,7 +2183,7 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
-	cgroup_rstat_updated(blkcg->css.cgroup, cpu);
+	cgroup_rstat_updated(&blkcg->css, cpu);
 	put_cpu();
 }
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f8ef47f8a634..eec970622419 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -687,10 +687,10 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 /*
  * cgroup scalable recursive statistics.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
-void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(struct cgroup *cgrp);
+void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu);
+void cgroup_rstat_flush(struct cgroup_subsys_state *css);
+void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css);
+void cgroup_rstat_flush_release(struct cgroup_subsys_state *css);
 
 /*
  * Basic resource stats.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 43c949824814..fdddd5ec5f3c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5465,7 +5465,7 @@ static void css_release_work_fn(struct work_struct *work)
 
 		/* css release path */
 		if (!list_empty(&css->rstat_css_node)) {
-			cgroup_rstat_flush(cgrp);
+			cgroup_rstat_flush(css);
 			list_del_rcu(&css->rstat_css_node);
 		}
 
@@ -5493,7 +5493,7 @@ static void css_release_work_fn(struct work_struct *work)
 		/* cgroup release path */
 		TRACE_CGROUP_PATH(release, cgrp);
 
-		cgroup_rstat_flush(cgrp);
+		cgroup_rstat_flush(css);
 
 		spin_lock_irq(&css_set_lock);
 		for (tcgrp = cgroup_parent(cgrp); tcgrp;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 5877974ece92..1ed0f3aab0d9 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -82,8 +82,9 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
  * rstat_cpu->updated_children list.  See the comment on top of
  * cgroup_rstat_cpu definition for details.
  */
-__bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+__bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
+	struct cgroup *cgrp = css->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
 
@@ -346,8 +347,10 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
  *
  * This function may block.
  */
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
+__bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	might_sleep();
 
 	__cgroup_rstat_lock(cgrp, -1);
@@ -364,9 +367,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
  *
  * This function may block.
  */
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
+void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
 	__acquires(&cgroup_rstat_lock)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	might_sleep();
 	__cgroup_rstat_lock(cgrp, -1);
 	cgroup_rstat_flush_locked(cgrp);
@@ -376,9 +381,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
  * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
  * @cgrp: cgroup used by tracepoint
  */
-void cgroup_rstat_flush_release(struct cgroup *cgrp)
+void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
 	__releases(&cgroup_rstat_lock)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
@@ -408,7 +415,7 @@ void cgroup_rstat_exit(struct cgroup *cgrp)
 {
 	int cpu;
 
-	cgroup_rstat_flush(cgrp);
+	cgroup_rstat_flush(&cgrp->self);
 
 	/* sanity check */
 	for_each_possible_cpu(cpu) {
@@ -512,8 +519,10 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
 						 struct cgroup_rstat_cpu *rstatc,
 						 unsigned long flags)
 {
+	struct cgroup_subsys_state *css = &cgrp->self;
+
 	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
-	cgroup_rstat_updated(cgrp, smp_processor_id());
+	cgroup_rstat_updated(css, smp_processor_id());
 	put_cpu_ptr(rstatc);
 }
 
@@ -612,16 +621,17 @@ static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat
 
 void cgroup_base_stat_cputime_show(struct seq_file *seq)
 {
-	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct cgroup_subsys_state *css = seq_css(seq);
+	struct cgroup *cgrp = css->cgroup;
 	u64 usage, utime, stime, ntime;
 
 	if (cgroup_parent(cgrp)) {
-		cgroup_rstat_flush_hold(cgrp);
+		cgroup_rstat_flush_hold(css);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
 		ntime = cgrp->bstat.ntime;
-		cgroup_rstat_flush_release(cgrp);
+		cgroup_rstat_flush_release(css);
 	} else {
 		/* cgrp->bstat of root is not actually used, reuse it */
 		root_cgroup_cputime(&cgrp->bstat);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..97552476b844 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -579,7 +579,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+	cgroup_rstat_updated(&memcg->css, cpu);
 	statc = this_cpu_ptr(memcg->vmstats_percpu);
 	for (; statc; statc = statc->parent) {
 		stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
@@ -611,7 +611,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
-	cgroup_rstat_flush(memcg->css.cgroup);
+	cgroup_rstat_flush(&memcg->css);
 }
 
 /*
-- 
2.47.1




* [PATCH 2/9 RFC] cgroup: change cgroup to css in rstat internal flush and lock funcs
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
  2024-12-24  1:13 ` [PATCH 1/9 RFC] change cgroup to css in rstat updated and flush api JP Kobryn
@ 2024-12-24  1:13 ` JP Kobryn
  2024-12-24  1:13 ` [PATCH 3/9 RFC] cgroup: change cgroup to css in rstat init and exit api JP Kobryn
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:13 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

Change select function calls to accept a cgroup_subsys_state instead of a
cgroup.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 kernel/cgroup/rstat.c | 45 ++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 1ed0f3aab0d9..1b7ef8690a09 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -201,8 +201,10 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
  * within the children list and terminated by the parent cgroup. An exception
  * here is the cgroup root whose updated_next can be self terminated.
  */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup_subsys_state *root_css,
+				int cpu)
 {
+	struct cgroup *root = root_css->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
 	struct cgroup *head = NULL, *parent, *child;
@@ -280,9 +282,11 @@ __bpf_hook_end();
  * value -1 is used when obtaining the main lock else this is the CPU
  * number processed last.
  */
-static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
+static inline void __cgroup_rstat_lock(struct cgroup_subsys_state *css,
+				int cpu_in_loop)
 	__acquires(&cgroup_rstat_lock)
 {
+	struct cgroup *cgrp = css->cgroup;
 	bool contended;
 
 	contended = !spin_trylock_irq(&cgroup_rstat_lock);
@@ -293,15 +297,18 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
 	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
 
-static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
+static inline void __cgroup_rstat_unlock(struct cgroup_subsys_state *css,
+				int cpu_in_loop)
 	__releases(&cgroup_rstat_lock)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
 /* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
+static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
 {
 	int cpu;
@@ -309,27 +316,27 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
+		struct cgroup *pos = cgroup_rstat_updated_list(css, cpu);
 
 		for (; pos; pos = pos->rstat_flush_next) {
-			struct cgroup_subsys_state *css;
+			struct cgroup_subsys_state *css_iter;
 
 			cgroup_base_stat_flush(pos, cpu);
 			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
 
 			rcu_read_lock();
-			list_for_each_entry_rcu(css, &pos->rstat_css_list,
+			list_for_each_entry_rcu(css_iter, &pos->rstat_css_list,
 						rstat_css_node)
-				css->ss->css_rstat_flush(css, cpu);
+				css_iter->ss->css_rstat_flush(css_iter, cpu);
 			rcu_read_unlock();
 		}
 
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			__cgroup_rstat_unlock(cgrp, cpu);
+			__cgroup_rstat_unlock(css, cpu);
 			if (!cond_resched())
 				cpu_relax();
-			__cgroup_rstat_lock(cgrp, cpu);
+			__cgroup_rstat_lock(css, cpu);
 		}
 	}
 }
@@ -349,13 +356,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
  */
 __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
 {
-	struct cgroup *cgrp = css->cgroup;
-
 	might_sleep();
 
-	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
-	__cgroup_rstat_unlock(cgrp, -1);
+	__cgroup_rstat_lock(css, -1);
+	cgroup_rstat_flush_locked(css);
+	__cgroup_rstat_unlock(css, -1);
 }
 
 /**
@@ -370,11 +375,9 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
 void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
 	__acquires(&cgroup_rstat_lock)
 {
-	struct cgroup *cgrp = css->cgroup;
-
 	might_sleep();
-	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+	__cgroup_rstat_lock(css, -1);
+	cgroup_rstat_flush_locked(css);
 }
 
 /**
@@ -384,9 +387,7 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
 void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
 	__releases(&cgroup_rstat_lock)
 {
-	struct cgroup *cgrp = css->cgroup;
-
-	__cgroup_rstat_unlock(cgrp, -1);
+	__cgroup_rstat_unlock(css, -1);
 }
 
 int cgroup_rstat_init(struct cgroup *cgrp)
-- 
2.47.1




* [PATCH 3/9 RFC] cgroup: change cgroup to css in rstat init and exit api
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
  2024-12-24  1:13 ` [PATCH 1/9 RFC] change cgroup to css in rstat updated and flush api JP Kobryn
  2024-12-24  1:13 ` [PATCH 2/9 RFC] cgroup: change cgroup to css in rstat internal flush and lock funcs JP Kobryn
@ 2024-12-24  1:13 ` JP Kobryn
  2024-12-24  1:13 ` [PATCH 4/9 RFC] cgroup: split rstat from cgroup into separate css JP Kobryn
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:13 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

Change API calls to accept a cgroup_subsys_state instead of a cgroup.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 kernel/cgroup/cgroup-internal.h |  4 ++--
 kernel/cgroup/cgroup.c          | 16 ++++++++++------
 kernel/cgroup/rstat.c           |  6 ++++--
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..87d062baff90 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -269,8 +269,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
 /*
  * rstat.c
  */
-int cgroup_rstat_init(struct cgroup *cgrp);
-void cgroup_rstat_exit(struct cgroup *cgrp);
+int cgroup_rstat_init(struct cgroup_subsys_state *css);
+void cgroup_rstat_exit(struct cgroup_subsys_state *css);
 void cgroup_rstat_boot(void);
 void cgroup_base_stat_cputime_show(struct seq_file *seq);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fdddd5ec5f3c..848e09f433c0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1358,7 +1358,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	cgroup_unlock();
 
-	cgroup_rstat_exit(cgrp);
+	cgroup_rstat_exit(&cgrp->self);
 	kernfs_destroy_root(root->kf_root);
 	cgroup_free_root(root);
 }
@@ -2132,7 +2132,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	if (ret)
 		goto destroy_root;
 
-	ret = cgroup_rstat_init(root_cgrp);
+	ret = cgroup_rstat_init(&root_cgrp->self);
 	if (ret)
 		goto destroy_root;
 
@@ -2174,7 +2174,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	goto out;
 
 exit_stats:
-	cgroup_rstat_exit(root_cgrp);
+	cgroup_rstat_exit(&root_cgrp->self);
 destroy_root:
 	kernfs_destroy_root(root->kf_root);
 	root->kf_root = NULL;
@@ -5435,7 +5435,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
 			psi_cgroup_free(cgrp);
-			cgroup_rstat_exit(cgrp);
+			cgroup_rstat_exit(css);
 			kfree(cgrp);
 		} else {
 			/*
@@ -5686,7 +5686,11 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (ret)
 		goto out_free_cgrp;
 
-	ret = cgroup_rstat_init(cgrp);
+	/* init self cgroup early so css->cgroup is valid within cgroup_rstat_init()
+	 * note that this will go away in a subsequent patch in this series
+	 */
+	cgrp->self.cgroup = cgrp;
+	ret = cgroup_rstat_init(&cgrp->self);
 	if (ret)
 		goto out_cancel_ref;
 
@@ -5779,7 +5783,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 out_kernfs_remove:
 	kernfs_remove(cgrp->kn);
 out_stat_exit:
-	cgroup_rstat_exit(cgrp);
+	cgroup_rstat_exit(&cgrp->self);
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 1b7ef8690a09..01a5c185b02a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -390,8 +390,9 @@ void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
 	__cgroup_rstat_unlock(css, -1);
 }
 
-int cgroup_rstat_init(struct cgroup *cgrp)
+int cgroup_rstat_init(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
 	/* the root cgrp has rstat_cpu preallocated */
@@ -412,8 +413,9 @@ int cgroup_rstat_init(struct cgroup *cgrp)
 	return 0;
 }
 
-void cgroup_rstat_exit(struct cgroup *cgrp)
+void cgroup_rstat_exit(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
 	cgroup_rstat_flush(&cgrp->self);
-- 
2.47.1




* [PATCH 4/9 RFC] cgroup: split rstat from cgroup into separate css
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
                   ` (2 preceding siblings ...)
  2024-12-24  1:13 ` [PATCH 3/9 RFC] cgroup: change cgroup to css in rstat init and exit api JP Kobryn
@ 2024-12-24  1:13 ` JP Kobryn
  2024-12-24  1:13 ` [PATCH 5/9 RFC] cgroup: separate locking between base css and others JP Kobryn
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:13 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

Move the rstat entities off of the cgroup struct and onto the
cgroup_subsys_state struct. Adjust related code to reflect this new ownership.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup-defs.h | 40 ++++++++--------
 kernel/cgroup/cgroup.c      | 65 ++++++++++++++++++--------
 kernel/cgroup/rstat.c       | 92 ++++++++++++++++++-------------------
 3 files changed, 111 insertions(+), 86 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1b20d2d8ef7c..1932f8ae7995 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -180,6 +180,24 @@ struct cgroup_subsys_state {
 	struct list_head sibling;
 	struct list_head children;
 
+	/* per-cpu recursive resource statistics */
+	struct cgroup_rstat_cpu __percpu *rstat_cpu;
+	struct list_head rstat_css_list;
+
+	/*
+	 * Add padding to separate the read mostly rstat_cpu and
+	 * rstat_css_list into a different cacheline from the following
+	 * rstat_flush_next and *bstat fields which can have frequent updates.
+	 */
+	CACHELINE_PADDING(_pad_);
+
+	/*
+	 * A singly-linked list of cgroup structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup_subsys_state *rstat_flush_next;
+
 	/* flush target list anchored at cgrp->rstat_css_list */
 	struct list_head rstat_css_node;
 
@@ -389,8 +407,8 @@ struct cgroup_rstat_cpu {
 	 *
 	 * Protected by per-cpu cgroup_rstat_cpu_lock.
 	 */
-	struct cgroup *updated_children;	/* terminated by self cgroup */
-	struct cgroup *updated_next;		/* NULL iff not on the list */
+	struct cgroup_subsys_state *updated_children;	/* terminated by self cgroup */
+	struct cgroup_subsys_state *updated_next;		/* NULL iff not on the list */
 };
 
 struct cgroup_freezer_state {
@@ -516,24 +534,6 @@ struct cgroup {
 	struct cgroup *dom_cgrp;
 	struct cgroup *old_dom_cgrp;		/* used while enabling threaded */
 
-	/* per-cpu recursive resource statistics */
-	struct cgroup_rstat_cpu __percpu *rstat_cpu;
-	struct list_head rstat_css_list;
-
-	/*
-	 * Add padding to separate the read mostly rstat_cpu and
-	 * rstat_css_list into a different cacheline from the following
-	 * rstat_flush_next and *bstat fields which can have frequent updates.
-	 */
-	CACHELINE_PADDING(_pad_);
-
-	/*
-	 * A singly-linked list of cgroup structures to be rstat flushed.
-	 * This is a scratch field to be used exclusively by
-	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
-	 */
-	struct cgroup	*rstat_flush_next;
-
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 848e09f433c0..96a2d15fe5e9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -164,7 +164,7 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
 static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
 
 /* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root = { .cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
 EXPORT_SYMBOL_GPL(cgrp_dfl_root);
 
 /*
@@ -1826,6 +1826,7 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 		struct cgroup_root *src_root = ss->root;
 		struct cgroup *scgrp = &src_root->cgrp;
 		struct cgroup_subsys_state *css = cgroup_css(scgrp, ss);
+		struct cgroup_subsys_state *dcss = cgroup_css(dcgrp, ss);
 		struct css_set *cset, *cset_pos;
 		struct css_task_iter *it;
 
@@ -1867,7 +1868,7 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 			list_del_rcu(&css->rstat_css_node);
 			synchronize_rcu();
 			list_add_rcu(&css->rstat_css_node,
-				     &dcgrp->rstat_css_list);
+				     &dcss->rstat_css_list);
 		}
 
 		/* default hierarchy doesn't enable controllers by default */
@@ -2052,7 +2053,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	cgrp->dom_cgrp = cgrp;
 	cgrp->max_descendants = INT_MAX;
 	cgrp->max_depth = INT_MAX;
-	INIT_LIST_HEAD(&cgrp->rstat_css_list);
 	prev_cputime_init(&cgrp->prev_cputime);
 
 	for_each_subsys(ss, ssid)
@@ -2088,7 +2088,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	struct cgroup *root_cgrp = &root->cgrp;
 	struct kernfs_syscall_ops *kf_sops;
 	struct css_set *cset;
-	int i, ret;
+	struct cgroup_subsys *ss;
+	int i, ret, ssid;
 
 	lockdep_assert_held(&cgroup_mutex);
 
@@ -2132,10 +2133,6 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	if (ret)
 		goto destroy_root;
 
-	ret = cgroup_rstat_init(&root_cgrp->self);
-	if (ret)
-		goto destroy_root;
-
 	ret = rebind_subsystems(root, ss_mask);
 	if (ret)
 		goto exit_stats;
@@ -2174,7 +2171,10 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	goto out;
 
 exit_stats:
-	cgroup_rstat_exit(&root_cgrp->self);
+	for_each_subsys(ss, ssid) {
+		struct cgroup_subsys_state *css = init_css_set.subsys[ssid];
+		cgroup_rstat_exit(css);
+	}
 destroy_root:
 	kernfs_destroy_root(root->kf_root);
 	root->kf_root = NULL;
@@ -3229,6 +3229,10 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
 	int ssid, ret;
 
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
+		ret = cgroup_rstat_init(&dsct->self);
+		if (ret)
+			return ret;
+
 		for_each_subsys(ss, ssid) {
 			struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
 
@@ -3239,6 +3243,10 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
 				css = css_create(dsct, ss);
 				if (IS_ERR(css))
 					return PTR_ERR(css);
+
+				ret = cgroup_rstat_init(css);
+				if (ret)
+					goto err_free_css;
 			}
 
 			WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
@@ -3252,6 +3260,20 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
 	}
 
 	return 0;
+
+err_free_css:
+	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
+		cgroup_rstat_exit(&dsct->self);
+
+		for_each_subsys(ss, ssid) {
+			struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
+
+			if (css != &dsct->self)
+				cgroup_rstat_exit(css);
+		}
+	}
+
+	return ret;
 }
 
 /**
@@ -5403,6 +5425,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 				struct cgroup_subsys_state, destroy_rwork);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
+	int ssid;
 
 	percpu_ref_exit(&css->refcnt);
 
@@ -5435,7 +5458,12 @@ static void css_free_rwork_fn(struct work_struct *work)
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
 			psi_cgroup_free(cgrp);
-			cgroup_rstat_exit(css);
+			for_each_subsys(ss, ssid) {
+				struct cgroup_subsys_state *css = cgrp->subsys[ssid];
+
+				if (css)
+					cgroup_rstat_exit(css);
+			}
 			kfree(cgrp);
 		} else {
 			/*
@@ -5541,6 +5569,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	css->id = -1;
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
+	INIT_LIST_HEAD(&css->rstat_css_list);
 	INIT_LIST_HEAD(&css->rstat_css_node);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
@@ -5551,7 +5580,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	}
 
 	if (ss->css_rstat_flush)
-		list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
+		list_add_rcu(&css->rstat_css_node, &css->rstat_css_list);
 
 	BUG_ON(cgroup_css(cgrp, ss));
 }
@@ -5686,14 +5715,6 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (ret)
 		goto out_free_cgrp;
 
-	/* init self cgroup early so css->cgroup is valid within cgroup_rstat_init()
-	 * note that this will go away in a subsequent patch in this series
-	 */
-	cgrp->self.cgroup = cgrp;
-	ret = cgroup_rstat_init(&cgrp->self);
-	if (ret)
-		goto out_cancel_ref;
-
 	/* create the directory */
 	kn = kernfs_create_dir_ns(parent->kn, name, mode,
 				  current_fsuid(), current_fsgid(),
@@ -5784,7 +5805,6 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	kernfs_remove(cgrp->kn);
 out_stat_exit:
 	cgroup_rstat_exit(&cgrp->self);
-out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
 	kfree(cgrp);
@@ -6189,6 +6209,8 @@ int __init cgroup_init(void)
 	cgroup_unlock();
 
 	for_each_subsys(ss, ssid) {
+		struct cgroup_subsys_state *css;
+
 		if (ss->early_init) {
 			struct cgroup_subsys_state *css =
 				init_css_set.subsys[ss->id];
@@ -6200,6 +6222,9 @@ int __init cgroup_init(void)
 			cgroup_init_subsys(ss, false);
 		}
 
+		css = init_css_set.subsys[ss->id];
+		BUG_ON(cgroup_rstat_init(css));
+
 		list_add_tail(&init_css_set.e_cset_node[ssid],
 			      &cgrp_dfl_root.cgrp.e_csets[ssid]);
 
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 01a5c185b02a..4381eb9ac426 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -14,9 +14,10 @@ static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
-static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
+static struct cgroup_rstat_cpu *css_rstat_cpu(
+				struct cgroup_subsys_state *css, int cpu)
 {
-	return per_cpu_ptr(cgrp->rstat_cpu, cpu);
+	return per_cpu_ptr(css->rstat_cpu, cpu);
 }
 
 /*
@@ -96,15 +97,16 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 	 * instead of NULL, we can tell whether @cgrp is on the list by
 	 * testing the next pointer for NULL.
 	 */
-	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
+	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
 		return;
 
 	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
 
 	/* put @cgrp and all ancestors on the corresponding updated lists */
 	while (true) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
-		struct cgroup *parent = cgroup_parent(cgrp);
+		struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
+		struct cgroup_subsys_state *parent = css->parent;
 		struct cgroup_rstat_cpu *prstatc;
 
 		/*
@@ -116,15 +118,15 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 
 		/* Root has no parent to link it to, but mark it busy */
 		if (!parent) {
-			rstatc->updated_next = cgrp;
+			rstatc->updated_next = css;
 			break;
 		}
 
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = css_rstat_cpu(parent, cpu);
 		rstatc->updated_next = prstatc->updated_children;
-		prstatc->updated_children = cgrp;
+		prstatc->updated_children = css;
 
-		cgrp = parent;
+		css = parent;
 	}
 
 	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
@@ -142,12 +144,13 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
  * into a singly linked list built from the tail backward like "pushing"
  * cgroups into a stack. The root is pushed by the caller.
  */
-static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
-						 struct cgroup *child, int cpu)
+static struct cgroup_subsys_state *cgroup_rstat_push_children(
+				struct cgroup_subsys_state *head,
+				struct cgroup_subsys_state *child, int cpu)
 {
-	struct cgroup *chead = child;	/* Head of child cgroup level */
-	struct cgroup *ghead = NULL;	/* Head of grandchild cgroup level */
-	struct cgroup *parent, *grandchild;
+	struct cgroup_subsys_state *chead = child;	/* Head of child cgroup level */
+	struct cgroup_subsys_state *ghead = NULL;	/* Head of grandchild cgroup level */
+	struct cgroup_subsys_state *parent, *grandchild;
 	struct cgroup_rstat_cpu *crstatc;
 
 	child->rstat_flush_next = NULL;
@@ -156,13 +159,13 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
 	while (chead) {
 		child = chead;
 		chead = child->rstat_flush_next;
-		parent = cgroup_parent(child);
+		parent = child->parent;
 
 		/* updated_next is parent cgroup terminated */
 		while (child != parent) {
 			child->rstat_flush_next = head;
 			head = child;
-			crstatc = cgroup_rstat_cpu(child, cpu);
+			crstatc = css_rstat_cpu(child, cpu);
 			grandchild = crstatc->updated_children;
 			if (grandchild != child) {
 				/* Push the grand child to the next level */
@@ -201,16 +204,15 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
  * within the children list and terminated by the parent cgroup. An exception
  * here is the cgroup root whose updated_next can be self terminated.
  */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup_subsys_state *root_css,
-				int cpu)
+static struct cgroup_subsys_state *cgroup_rstat_updated_list(
+				struct cgroup_subsys_state *root, int cpu)
 {
-	struct cgroup *root = root_css->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
-	struct cgroup *head = NULL, *parent, *child;
+	struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
+	struct cgroup_subsys_state *head = NULL, *parent, *child;
 	unsigned long flags;
 
-	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
+	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root->cgroup, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -220,17 +222,17 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup_subsys_state *root
 	 * Unlink @root from its parent. As the updated_children list is
 	 * singly linked, we have to walk it to find the removal point.
 	 */
-	parent = cgroup_parent(root);
+	parent = root->parent;
 	if (parent) {
 		struct cgroup_rstat_cpu *prstatc;
-		struct cgroup **nextp;
+		struct cgroup_subsys_state **nextp;
 
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = css_rstat_cpu(parent, cpu);
 		nextp = &prstatc->updated_children;
 		while (*nextp != root) {
 			struct cgroup_rstat_cpu *nrstatc;
 
-			nrstatc = cgroup_rstat_cpu(*nextp, cpu);
+			nrstatc = css_rstat_cpu(*nextp, cpu);
 			WARN_ON_ONCE(*nextp == parent);
 			nextp = &nrstatc->updated_next;
 		}
@@ -247,7 +249,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup_subsys_state *root
 	if (child != root)
 		head = cgroup_rstat_push_children(head, child, cpu);
 unlock_ret:
-	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
+	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, root->cgroup, flags, false);
 	return head;
 }
 
@@ -316,13 +318,13 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-		struct cgroup *pos = cgroup_rstat_updated_list(css, cpu);
+		struct cgroup_subsys_state *pos = cgroup_rstat_updated_list(css, cpu);
 
 		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css_iter;
 
-			cgroup_base_stat_flush(pos, cpu);
-			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
+			cgroup_base_stat_flush(pos->cgroup, cpu);
+			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
 
 			rcu_read_lock();
 			list_for_each_entry_rcu(css_iter, &pos->rstat_css_list,
@@ -392,21 +394,20 @@ void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
 
 int cgroup_rstat_init(struct cgroup_subsys_state *css)
 {
-	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
-	/* the root cgrp has rstat_cpu preallocated */
-	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
-		if (!cgrp->rstat_cpu)
+	/* the root cgrp css has rstat_cpu preallocated */
+	if (!css->rstat_cpu) {
+		css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		if (!css->rstat_cpu)
 			return -ENOMEM;
 	}
 
 	/* ->updated_children list is self terminated */
 	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+		struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-		rstatc->updated_children = cgrp;
+		rstatc->updated_children = css;
 		u64_stats_init(&rstatc->bsync);
 	}
 
@@ -415,22 +416,21 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
 
 void cgroup_rstat_exit(struct cgroup_subsys_state *css)
 {
-	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
-	cgroup_rstat_flush(&cgrp->self);
+	cgroup_rstat_flush(css);
 
 	/* sanity check */
 	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+		struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-		if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
+		if (WARN_ON_ONCE(rstatc->updated_children != css) ||
 		    WARN_ON_ONCE(rstatc->updated_next))
 			return;
 	}
 
-	free_percpu(cgrp->rstat_cpu);
-	cgrp->rstat_cpu = NULL;
+	free_percpu(css->rstat_cpu);
+	css->rstat_cpu = NULL;
 }
 
 void __init cgroup_rstat_boot(void)
@@ -471,7 +471,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
-	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+	struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(&cgrp->self, cpu);
 	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_rstat_cpu *prstatc;
 	struct cgroup_base_stat delta;
@@ -501,7 +501,7 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 
 		delta = rstatc->subtree_bstat;
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = css_rstat_cpu(&parent->self, cpu);
 		cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
 		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
 		cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
@@ -513,7 +513,7 @@ cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags
 {
 	struct cgroup_rstat_cpu *rstatc;
 
-	rstatc = get_cpu_ptr(cgrp->rstat_cpu);
+	rstatc = get_cpu_ptr(cgrp->self.rstat_cpu);
 	*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
 	return rstatc;
 }
-- 
2.47.1




* [PATCH 5/9 RFC] cgroup: separate locking between base css and others
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
                   ` (3 preceding siblings ...)
  2024-12-24  1:13 ` [PATCH 4/9 RFC] cgroup: split rstat from cgroup into separate css JP Kobryn
@ 2024-12-24  1:13 ` JP Kobryn
  2024-12-24  1:13 ` [PATCH 6/9 RFC] cgroup: isolate base stat flush JP Kobryn
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:13 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

Separate locks can be used to eliminate contention across subsystems that make
use of rstat. The base stats also get their own lock. Where applicable, check
for the existence of a subsystem pointer to determine if the given
cgroup_subsys_state is the base css or not.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup-defs.h |  2 +
 kernel/cgroup/rstat.c       | 92 +++++++++++++++++++++++++------------
 2 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1932f8ae7995..4d87519ff023 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -790,6 +790,8 @@ struct cgroup_subsys {
 	 * specifies the mask of subsystems that this one depends on.
 	 */
 	unsigned int depends_on;
+
+	spinlock_t rstat_lock;
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 4381eb9ac426..958bdccf0359 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,8 +9,9 @@
 
 #include <trace/events/cgroup.h>
 
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
-static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock[CGROUP_SUBSYS_COUNT]);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -86,7 +87,7 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
 __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
 	struct cgroup *cgrp = css->cgroup;
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+	raw_spinlock_t *cpu_lock;
 	unsigned long flags;
 
 	/*
@@ -100,6 +101,11 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
 		return;
 
+	if (css->ss)
+		cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock[css->ss->id], cpu);
+	else
+		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
+
 	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
 
 	/* put @cgrp and all ancestors on the corresponding updated lists */
@@ -207,11 +213,16 @@ static struct cgroup_subsys_state *cgroup_rstat_push_children(
 static struct cgroup_subsys_state *cgroup_rstat_updated_list(
 				struct cgroup_subsys_state *root, int cpu)
 {
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	struct cgroup_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
 	struct cgroup_subsys_state *head = NULL, *parent, *child;
+	raw_spinlock_t *cpu_lock;
 	unsigned long flags;
 
+	if (root->ss)
+		cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock[root->ss->id], cpu);
+	else
+		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
+
 	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root->cgroup, false);
 
 	/* Return NULL if this subtree is not on-list */
@@ -285,37 +296,44 @@ __bpf_hook_end();
  * number processed last.
  */
 static inline void __cgroup_rstat_lock(struct cgroup_subsys_state *css,
-				int cpu_in_loop)
-	__acquires(&cgroup_rstat_lock)
+				spinlock_t *lock, int cpu_in_loop)
+	__acquires(lock)
 {
 	struct cgroup *cgrp = css->cgroup;
 	bool contended;
 
-	contended = !spin_trylock_irq(&cgroup_rstat_lock);
+	contended = !spin_trylock_irq(lock);
 	if (contended) {
 		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
-		spin_lock_irq(&cgroup_rstat_lock);
+		spin_lock_irq(lock);
 	}
 	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
 
 static inline void __cgroup_rstat_unlock(struct cgroup_subsys_state *css,
-				int cpu_in_loop)
-	__releases(&cgroup_rstat_lock)
+				spinlock_t *lock, int cpu_in_loop)
+	__releases(lock)
 {
 	struct cgroup *cgrp = css->cgroup;
 
 	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	spin_unlock_irq(lock);
 }
 
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
-	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
+	__releases(&css->ss->rstat_lock) __acquires(&css->ss->rstat_lock)
 {
+	spinlock_t *lock;
 	int cpu;
 
-	lockdep_assert_held(&cgroup_rstat_lock);
+	if (!css->ss) {
+		pr_warn("cannot use generic flush on base subsystem\n");
+		return;
+	}
+
+	lock = &css->ss->rstat_lock;
+	lockdep_assert_held(lock);
 
 	for_each_possible_cpu(cpu) {
 		struct cgroup_subsys_state *pos = cgroup_rstat_updated_list(css, cpu);
@@ -334,11 +352,11 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 		}
 
 		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			__cgroup_rstat_unlock(css, cpu);
+		if (need_resched() || spin_needbreak(lock)) {
+			__cgroup_rstat_unlock(css, lock, cpu);
 			if (!cond_resched())
 				cpu_relax();
-			__cgroup_rstat_lock(css, cpu);
+			__cgroup_rstat_lock(css, lock, cpu);
 		}
 	}
 }
@@ -358,11 +376,22 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
  */
 __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
 {
+	spinlock_t *lock;
+
+	if (!css->ss) {
+		int cpu;
+
+		for_each_possible_cpu(cpu)
+			cgroup_base_stat_flush(css->cgroup, cpu);
+		return;
+	}
+
 	might_sleep();
 
-	__cgroup_rstat_lock(css, -1);
+	lock = &css->ss->rstat_lock;
+	__cgroup_rstat_lock(css, lock, -1);
 	cgroup_rstat_flush_locked(css);
-	__cgroup_rstat_unlock(css, -1);
+	__cgroup_rstat_unlock(css, lock, -1);
 }
 
 /**
@@ -374,11 +403,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
  *
  * This function may block.
  */
-void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
-	__acquires(&cgroup_rstat_lock)
+static void cgroup_rstat_base_flush_hold(struct cgroup_subsys_state *css)
+	__acquires(&cgroup_rstat_base_lock)
 {
 	might_sleep();
-	__cgroup_rstat_lock(css, -1);
+	__cgroup_rstat_lock(css, &cgroup_rstat_base_lock, -1);
 	cgroup_rstat_flush_locked(css);
 }
 
@@ -386,10 +415,10 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
  * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
  * @cgrp: cgroup used by tracepoint
  */
-void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
-	__releases(&cgroup_rstat_lock)
+static void cgroup_rstat_base_flush_release(struct cgroup_subsys_state *css)
+	__releases(&cgroup_rstat_base_lock)
 {
-	__cgroup_rstat_unlock(css, -1);
+	__cgroup_rstat_unlock(css, &cgroup_rstat_base_lock, -1);
 }
 
 int cgroup_rstat_init(struct cgroup_subsys_state *css)
@@ -435,10 +464,15 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
 
 void __init cgroup_rstat_boot(void)
 {
-	int cpu;
+	struct cgroup_subsys *ss;
+	int cpu, ssid;
+
+	for_each_possible_cpu(cpu) {
+		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
 
-	for_each_possible_cpu(cpu)
-		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
+		for_each_subsys(ss, ssid)
+			raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock[ssid], cpu));
+	}
 }
 
 /*
@@ -629,12 +663,12 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 	u64 usage, utime, stime, ntime;
 
 	if (cgroup_parent(cgrp)) {
-		cgroup_rstat_flush_hold(css);
+		cgroup_rstat_base_flush_hold(css);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
 		ntime = cgrp->bstat.ntime;
-		cgroup_rstat_flush_release(css);
+		cgroup_rstat_base_flush_release(css);
 	} else {
 		/* cgrp->bstat of root is not actually used, reuse it */
 		root_cgroup_cputime(&cgrp->bstat);
-- 
2.47.1




* [PATCH 6/9 RFC] cgroup: isolate base stat flush
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
                   ` (4 preceding siblings ...)
  2024-12-24  1:13 ` [PATCH 5/9 RFC] cgroup: separate locking between base css and others JP Kobryn
@ 2024-12-24  1:13 ` JP Kobryn
  2024-12-24  1:14 ` [PATCH 7/9 RFC] cgroup: remove unneeded rcu list JP Kobryn
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:13 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

Remove the base stat flushing from the generic css flushing routine. Provide a
direct way to flush the base stats and make use of it when the cpu stats are
read.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 kernel/cgroup/rstat.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 958bdccf0359..92a46b960be1 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -320,6 +320,14 @@ static inline void __cgroup_rstat_unlock(struct cgroup_subsys_state *css,
 	spin_unlock_irq(lock);
 }
 
+static void cgroup_rstat_base_flush_locked(struct cgroup *cgrp)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		cgroup_base_stat_flush(cgrp, cpu);
+}
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 	__releases(&css->ss->rstat_lock) __acquires(&css->ss->rstat_lock)
@@ -341,7 +349,6 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css_iter;
 
-			cgroup_base_stat_flush(pos->cgroup, cpu);
 			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
 
 			rcu_read_lock();
@@ -408,7 +415,7 @@ static void cgroup_rstat_base_flush_hold(struct cgroup_subsys_state *css)
 {
 	might_sleep();
 	__cgroup_rstat_lock(css, &cgroup_rstat_base_lock, -1);
-	cgroup_rstat_flush_locked(css);
+	cgroup_rstat_base_flush_locked(css->cgroup);
 }
 
 /**
-- 
2.47.1




* [PATCH 7/9 RFC] cgroup: remove unneeded rcu list
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
                   ` (5 preceding siblings ...)
  2024-12-24  1:13 ` [PATCH 6/9 RFC] cgroup: isolate base stat flush JP Kobryn
@ 2024-12-24  1:14 ` JP Kobryn
  2024-12-24  1:14 ` [PATCH 8/9 RFC] cgroup: remove bpf rstat flush from css generic flush JP Kobryn
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:14 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

Since the cgroup_subsys_state now owns the rstat tree, the list management
previously done by the cgroup is no longer needed.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup-defs.h | 11 -----------
 kernel/cgroup/cgroup.c      | 20 +-------------------
 kernel/cgroup/rstat.c       |  9 +--------
 3 files changed, 2 insertions(+), 38 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 4d87519ff023..836260c422a0 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -182,14 +182,6 @@ struct cgroup_subsys_state {
 
 	/* per-cpu recursive resource statistics */
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
-	struct list_head rstat_css_list;
-
-	/*
-	 * Add padding to separate the read mostly rstat_cpu and
-	 * rstat_css_list into a different cacheline from the following
-	 * rstat_flush_next and *bstat fields which can have frequent updates.
-	 */
-	CACHELINE_PADDING(_pad_);
 
 	/*
 	 * A singly-linked list of cgroup structures to be rstat flushed.
@@ -198,9 +190,6 @@ struct cgroup_subsys_state {
 	 */
 	struct cgroup_subsys_state *rstat_flush_next;
 
-	/* flush target list anchored at cgrp->rstat_css_list */
-	struct list_head rstat_css_node;
-
 	/*
 	 * PI: Subsys-unique ID.  0 is unused and root is always 1.  The
 	 * matching css can be looked up using css_from_id().
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 96a2d15fe5e9..a36ed3995c6f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1826,7 +1826,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 		struct cgroup_root *src_root = ss->root;
 		struct cgroup *scgrp = &src_root->cgrp;
 		struct cgroup_subsys_state *css = cgroup_css(scgrp, ss);
-		struct cgroup_subsys_state *dcss = cgroup_css(dcgrp, ss);
 		struct css_set *cset, *cset_pos;
 		struct css_task_iter *it;
 
@@ -1864,13 +1863,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 		}
 		spin_unlock_irq(&css_set_lock);
 
-		if (ss->css_rstat_flush) {
-			list_del_rcu(&css->rstat_css_node);
-			synchronize_rcu();
-			list_add_rcu(&css->rstat_css_node,
-				     &dcss->rstat_css_list);
-		}
-
 		/* default hierarchy doesn't enable controllers by default */
 		dst_root->subsys_mask |= 1 << ssid;
 		if (dst_root == &cgrp_dfl_root) {
@@ -5491,11 +5483,7 @@ static void css_release_work_fn(struct work_struct *work)
 	if (ss) {
 		struct cgroup *parent_cgrp;
 
-		/* css release path */
-		if (!list_empty(&css->rstat_css_node)) {
-			cgroup_rstat_flush(css);
-			list_del_rcu(&css->rstat_css_node);
-		}
+		cgroup_rstat_flush(css);
 
 		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
 		if (ss->css_released)
@@ -5569,8 +5557,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	css->id = -1;
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
-	INIT_LIST_HEAD(&css->rstat_css_list);
-	INIT_LIST_HEAD(&css->rstat_css_node);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
 
@@ -5579,9 +5565,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 		css_get(css->parent);
 	}
 
-	if (ss->css_rstat_flush)
-		list_add_rcu(&css->rstat_css_node, &css->rstat_css_list);
-
 	BUG_ON(cgroup_css(cgrp, ss));
 }
 
@@ -5687,7 +5670,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_list_del:
 	list_del_rcu(&css->sibling);
 err_free_css:
-	list_del_rcu(&css->rstat_css_node);
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
 	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 92a46b960be1..c52e8429c75d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -347,15 +347,8 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 		struct cgroup_subsys_state *pos = cgroup_rstat_updated_list(css, cpu);
 
 		for (; pos; pos = pos->rstat_flush_next) {
-			struct cgroup_subsys_state *css_iter;
-
 			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
-
-			rcu_read_lock();
-			list_for_each_entry_rcu(css_iter, &pos->rstat_css_list,
-						rstat_css_node)
-				css_iter->ss->css_rstat_flush(css_iter, cpu);
-			rcu_read_unlock();
+			pos->ss->css_rstat_flush(pos, cpu);
 		}
 
 		/* play nice and yield if necessary */
-- 
2.47.1




* [PATCH 8/9 RFC] cgroup: remove bpf rstat flush from css generic flush
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
                   ` (6 preceding siblings ...)
  2024-12-24  1:14 ` [PATCH 7/9 RFC] cgroup: remove unneeded rcu list JP Kobryn
@ 2024-12-24  1:14 ` JP Kobryn
  2024-12-24  1:14 ` [PATCH 9/9 RFC] cgroup: avoid allocating rstat when flush func not present JP Kobryn
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:14 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

Remove the bpf rstat flush call from the generic subsystem flush. Leave it
up to bpf programs to explicitly flush any desired subsystems by using the
kfunc cgroup_rstat_flush().

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 kernel/cgroup/rstat.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index c52e8429c75d..03effaaf09a4 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -346,10 +346,8 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 	for_each_possible_cpu(cpu) {
 		struct cgroup_subsys_state *pos = cgroup_rstat_updated_list(css, cpu);
 
-		for (; pos; pos = pos->rstat_flush_next) {
-			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
+		for (; pos; pos = pos->rstat_flush_next)
 			pos->ss->css_rstat_flush(pos, cpu);
-		}
 
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(lock)) {
-- 
2.47.1




* [PATCH 9/9 RFC] cgroup: avoid allocating rstat when flush func not present
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
                   ` (7 preceding siblings ...)
  2024-12-24  1:14 ` [PATCH 8/9 RFC] cgroup: remove bpf rstat flush from css generic flush JP Kobryn
@ 2024-12-24  1:14 ` JP Kobryn
  2024-12-24  4:57 ` [PATCH 0/9 RFC] cgroup: separate rstat trees Shakeel Butt
  2025-01-08 18:16 ` Michal Koutný
  10 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2024-12-24  1:14 UTC (permalink / raw)
  To: shakeel.butt, hannes, yosryahmed, akpm; +Cc: linux-mm, cgroups

If a given css is not the base css and its subsystem does not provide a
flush function, skip allocating the per-cpu rstat structures.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 kernel/cgroup/rstat.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 03effaaf09a4..4feefa37fa46 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -423,6 +423,9 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
 {
 	int cpu;
 
+	if (css->ss && !css->ss->css_rstat_flush)
+		return 0;
+
 	/* the root cgrp css has rstat_cpu preallocated */
 	if (!css->rstat_cpu) {
 		css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
@@ -445,6 +448,9 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
 {
 	int cpu;
 
+	if (css->ss && !css->ss->css_rstat_flush)
+		return;
+
 	cgroup_rstat_flush(css);
 
 	/* sanity check */
-- 
2.47.1




* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
                   ` (8 preceding siblings ...)
  2024-12-24  1:14 ` [PATCH 9/9 RFC] cgroup: avoid allocating rstat when flush func not present JP Kobryn
@ 2024-12-24  4:57 ` Shakeel Butt
  2025-01-08 18:16 ` Michal Koutný
  10 siblings, 0 replies; 21+ messages in thread
From: Shakeel Butt @ 2024-12-24  4:57 UTC (permalink / raw)
  To: JP Kobryn
  Cc: hannes, yosryahmed, akpm, linux-mm, cgroups, Tejun Heo,
	Michal Koutný

Thanks JP for the awesome work. Please CC Tejun and Michal in the
followup revisions.

On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn wrote:
> I've been experimenting with these changes to allow for separate
> updating/flushing of cgroup stats per-subsystem. The idea was instead of having
> a single per-cpu rstat tree for managing stats across all subsystems, we could
> maybe split up the rstat trees into separate trees for each subsystem. So each
> cpu would have individual trees for each subsystem. It would allow subsystems
> to update and flush their stats without having any contention with others, i.e.
> the io subsystem would not have to wait for an in progress memory subsystem
> flush to finish. The core change is moving the rstat entities off of the cgroup
> struct and onto the cgroup_subsystem_state struct. Every patch revolves around
> that concept.
> 
> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

For the cover letter, please start with describing the problem you are
trying to solve and then possibly give an overview of the solution you
are proposing.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
                   ` (9 preceding siblings ...)
  2024-12-24  4:57 ` [PATCH 0/9 RFC] cgroup: separate rstat trees Shakeel Butt
@ 2025-01-08 18:16 ` Michal Koutný
  2025-01-13 18:25   ` Shakeel Butt
  10 siblings, 1 reply; 21+ messages in thread
From: Michal Koutný @ 2025-01-08 18:16 UTC (permalink / raw)
  To: JP Kobryn
  Cc: shakeel.butt, hannes, yosryahmed, akpm, linux-mm, cgroups, Tejun Heo

Hello JP.

On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> I've been experimenting with these changes to allow for separate
> updating/flushing of cgroup stats per-subsystem.

Nice.

> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

The split is not straightforwardly an improvement -- there's at least a
higher memory footprint, and flushing effectiveness depends on how
individual readers are correlated; OTOH writer correlation affects
updaters when extending the update tree. So a workload-dependent effect
can (in my theory) go either way.
There are also in-kernel consumers of stats, namely memory controller
that's been optimized over the years to balance the tradeoff between
precision and latency.

So do you have any measurements (or expectations) that show how readers
or writers are affected?

Thanks,
Michal


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-08 18:16 ` Michal Koutný
@ 2025-01-13 18:25   ` Shakeel Butt
  2025-01-15  1:33     ` JP Kobryn
  2025-01-16 15:19     ` Michal Koutný
  0 siblings, 2 replies; 21+ messages in thread
From: Shakeel Butt @ 2025-01-13 18:25 UTC (permalink / raw)
  To: Michal Koutný
  Cc: JP Kobryn, hannes, yosryahmed, akpm, linux-mm, cgroups, Tejun Heo

On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> Hello JP.
> 
> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> > I've been experimenting with these changes to allow for separate
> > updating/flushing of cgroup stats per-subsystem.
> 
> Nice.
> 
> > I reached a point where this started to feel stable in my local testing, so I
> > wanted to share and get feedback on this approach.
> 
> The split is not straight-forwardly an improvement --

The major improvement in my opinion is the performance isolation for
stats readers i.e. cpu stats readers do not need to flush memory stats.

> there's at least
> higher memory footprint 

Yes this is indeed the case. JP, can you please give a ballpark figure
for the memory overhead?

> and flushing efffectiveness depends on how
> individual readers are correlated, 

Sorry I am confused by the above statement, can you please expand on
what you meant by it?

> OTOH writer correlation affects
> updaters when extending the update tree.

Here I am confused about the difference between writer and updater.

> So a workload dependent effect
> can go (in my theory) both sides.
> There are also in-kernel consumers of stats, namely memory controller
> that's been optimized over the years to balance the tradeoff between
> precision and latency.

In-kernel memcg stats readers will be unaffected most of the time with
this change. The only difference will be when they flush, they will only
flush memcg stats.
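
As a rough illustration (assuming the css-based API from patch 1 of this
series; the memcg call site is paraphrased, not quoted from the patches),
an in-kernel memcg reader then only ever walks the memory tree:

	/* flush just the memory css; io and base-stat trees are untouched */
	cgroup_rstat_flush(&memcg->css);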

> 
> So do you have any measurements (or expectations) that show how readers
> or writers are affected?
> 

Here I am assuming you meant measurements in terms of cpu cost or do you
have something else in mind?


Thanks a lot Michal for taking a look.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-13 18:25   ` Shakeel Butt
@ 2025-01-15  1:33     ` JP Kobryn
  2025-01-15  1:39       ` Yosry Ahmed
  2025-01-16 15:19     ` Michal Koutný
  1 sibling, 1 reply; 21+ messages in thread
From: JP Kobryn @ 2025-01-15  1:33 UTC (permalink / raw)
  To: Shakeel Butt, Michal Koutný
  Cc: hannes, yosryahmed, akpm, linux-mm, cgroups, Tejun Heo

Hi Michal,

On 1/13/25 10:25 AM, Shakeel Butt wrote:
> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>> Hello JP.
>>
>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>> I've been experimenting with these changes to allow for separate
>>> updating/flushing of cgroup stats per-subsystem.
>>
>> Nice.
>>
>>> I reached a point where this started to feel stable in my local testing, so I
>>> wanted to share and get feedback on this approach.
>>
>> The split is not straight-forwardly an improvement --
> 
> The major improvement in my opinion is the performance isolation for
> stats readers i.e. cpu stats readers do not need to flush memory stats.
> 
>> there's at least
>> higher memory footprint
> 
> Yes this is indeed the case and JP, can you please give a ballmark on
> the memory overhead?

Yes, the trade-off is using more memory to allow for separate trees.
With these patches the changes in allocated memory for the 
cgroup_rstat_cpu instances and their associated locks are:
static
	reduced by 58%
dynamic
	increased by 344%

The threefold increase on the dynamic side is attributed to now having 3 
rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io), 
instead of originally just 1. The number will change if more subsystems 
start or stop using rstat in the future. Feel free to let me know if you 
would like to see the detailed breakdown of these values.

> 
>> and flushing efffectiveness depends on how
>> individual readers are correlated,
> 
> Sorry I am confused by the above statement, can you please expand on
> what you meant by it?
> 
>> OTOH writer correlation affects
>> updaters when extending the update tree.
> 
> Here I am confused about the difference between writer and updater.
> 
>> So a workload dependent effect
>> can go (in my theory) both sides.
>> There are also in-kernel consumers of stats, namely memory controller
>> that's been optimized over the years to balance the tradeoff between
>> precision and latency.
> 
> In-kernel memcg stats readers will be unaffected most of the time with
> this change. The only difference will be when they flush, they will only
> flush memcg stats.
> 
>>
>> So do you have any measurements (or expectations) that show how readers
>> or writers are affected?
>>
> 
> Here I am assuming you meant measurements in terms of cpu cost or do you
> have something else in mind?
> 
> 
> Thanks a lot Michal for taking a look.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-15  1:33     ` JP Kobryn
@ 2025-01-15  1:39       ` Yosry Ahmed
  2025-01-15 19:38         ` JP Kobryn
  0 siblings, 1 reply; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-15  1:39 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Shakeel Butt, Michal Koutný,
	hannes, akpm, linux-mm, cgroups, Tejun Heo

On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> Hi Michal,
>
> On 1/13/25 10:25 AM, Shakeel Butt wrote:
> > On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> >> Hello JP.
> >>
> >> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> >>> I've been experimenting with these changes to allow for separate
> >>> updating/flushing of cgroup stats per-subsystem.
> >>
> >> Nice.
> >>
> >>> I reached a point where this started to feel stable in my local testing, so I
> >>> wanted to share and get feedback on this approach.
> >>
> >> The split is not straight-forwardly an improvement --
> >
> > The major improvement in my opinion is the performance isolation for
> > stats readers i.e. cpu stats readers do not need to flush memory stats.
> >
> >> there's at least
> >> higher memory footprint
> >
> > Yes this is indeed the case and JP, can you please give a ballmark on
> > the memory overhead?
>
> Yes, the trade-off is using more memory to allow for separate trees.
> With these patches the changes in allocated memory for the
> cgroup_rstat_cpu instances and their associated locks are:
> static
>         reduced by 58%
> dynamic
>         increased by 344%
>
> The threefold increase on the dynamic side is attributed to now having 3
> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
> instead of originally just 1. The number will change if more subsystems
> start or stop using rstat in the future. Feel free to let me know if you
> would like to see the detailed breakdown of these values.

What is the absolute per-CPU memory usage?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-15  1:39       ` Yosry Ahmed
@ 2025-01-15 19:38         ` JP Kobryn
  2025-01-15 21:36           ` Yosry Ahmed
  0 siblings, 1 reply; 21+ messages in thread
From: JP Kobryn @ 2025-01-15 19:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Shakeel Butt, Michal Koutný,
	hannes, akpm, linux-mm, cgroups, Tejun Heo

Hi Yosry,

On 1/14/25 5:39 PM, Yosry Ahmed wrote:
> On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> Hi Michal,
>>
>> On 1/13/25 10:25 AM, Shakeel Butt wrote:
>>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>>>> Hello JP.
>>>>
>>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>> I've been experimenting with these changes to allow for separate
>>>>> updating/flushing of cgroup stats per-subsystem.
>>>>
>>>> Nice.
>>>>
>>>>> I reached a point where this started to feel stable in my local testing, so I
>>>>> wanted to share and get feedback on this approach.
>>>>
>>>> The split is not straight-forwardly an improvement --
>>>
>>> The major improvement in my opinion is the performance isolation for
>>> stats readers i.e. cpu stats readers do not need to flush memory stats.
>>>
>>>> there's at least
>>>> higher memory footprint
>>>
>>> Yes this is indeed the case and JP, can you please give a ballmark on
>>> the memory overhead?
>>
>> Yes, the trade-off is using more memory to allow for separate trees.
>> With these patches the changes in allocated memory for the
>> cgroup_rstat_cpu instances and their associated locks are:
>> static
>>          reduced by 58%
>> dynamic
>>          increased by 344%
>>
>> The threefold increase on the dynamic side is attributed to now having 3
>> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
>> instead of originally just 1. The number will change if more subsystems
>> start or stop using rstat in the future. Feel free to let me know if you
>> would like to see the detailed breakdown of these values.
> 
> What is the absolute per-CPU memory usage?

This is what I calculate as the combined per-cpu usage.
before:
	one cgroup_rstat_cpu instance for every cgroup
	sizeof(cgroup_rstat_cpu) * nr_cgroups
after:
	three cgroup_rstat_cpu instances for every cgroup + updater lock
	for every subsystem plus one for base stats
	sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
		sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)

Note that "every cgroup" includes the root cgroup. Also, 3 represents 
the number of current rstat clients: base stats, memory, and io 
(assuming all enabled).
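
As a rough worked example (the struct size and counts are picked purely
for illustration, not measured on my test machine): with a 160 byte
cgroup_rstat_cpu and 100 cgroups, per cpu that is

	before: 160 * 100     = ~16 KB
	after:  160 * 3 * 100 = ~48 KB
	        + sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1), a small
	        constant term that is negligible next to the struct cost
	        on typical configs

so roughly three times the per-cpu usage for these structs.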

As I'm writing this, I realize I might need to include the bpf cgroups 
as a fourth client and include this in my testing.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-15 19:38         ` JP Kobryn
@ 2025-01-15 21:36           ` Yosry Ahmed
  2025-01-16 18:20             ` JP Kobryn
  0 siblings, 1 reply; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-15 21:36 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Shakeel Butt, Michal Koutný,
	hannes, akpm, linux-mm, cgroups, Tejun Heo

On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> Hi Yosry,
>
> On 1/14/25 5:39 PM, Yosry Ahmed wrote:
> > On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
> >>
> >> Hi Michal,
> >>
> >> On 1/13/25 10:25 AM, Shakeel Butt wrote:
> >>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> >>>> Hello JP.
> >>>>
> >>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> >>>>> I've been experimenting with these changes to allow for separate
> >>>>> updating/flushing of cgroup stats per-subsystem.
> >>>>
> >>>> Nice.
> >>>>
> >>>>> I reached a point where this started to feel stable in my local testing, so I
> >>>>> wanted to share and get feedback on this approach.
> >>>>
> >>>> The split is not straight-forwardly an improvement --
> >>>
> >>> The major improvement in my opinion is the performance isolation for
> >>> stats readers i.e. cpu stats readers do not need to flush memory stats.
> >>>
> >>>> there's at least
> >>>> higher memory footprint
> >>>
> >>> Yes this is indeed the case and JP, can you please give a ballmark on
> >>> the memory overhead?
> >>
> >> Yes, the trade-off is using more memory to allow for separate trees.
> >> With these patches the changes in allocated memory for the
> >> cgroup_rstat_cpu instances and their associated locks are:
> >> static
> >>          reduced by 58%
> >> dynamic
> >>          increased by 344%
> >>
> >> The threefold increase on the dynamic side is attributed to now having 3
> >> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
> >> instead of originally just 1. The number will change if more subsystems
> >> start or stop using rstat in the future. Feel free to let me know if you
> >> would like to see the detailed breakdown of these values.
> >
> > What is the absolute per-CPU memory usage?
>
> This is what I calculate as the combined per-cpu usage.
> before:
>         one cgroup_rstat_cpu instance for every cgroup
>         sizeof(cgroup_rstat_cpu) * nr_cgroups
> after:
>         three cgroup_rstat_cpu instances for every cgroup + updater lock for
> every subsystem plus one for base stats
>         sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
>                 sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)
>
> Note that "every cgroup" includes the root cgroup. Also, 3 represents
> the number of current rstat clients: base stats, memory, and io
> (assuming all enabled).

On a config I have at hand sizeof(cgroup_rstat_cpu) is 160 bytes.
Ignoring the spinlock for a second because it doesn't scale with
cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
moderately large machine with 256 CPUs and 100 cgroups for example
that's ~8MB.

>
> As I'm writing this, I realize I might need to include the bpf cgroups
> as a fourth client and include this in my testing.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-13 18:25   ` Shakeel Butt
  2025-01-15  1:33     ` JP Kobryn
@ 2025-01-16 15:19     ` Michal Koutný
  2025-01-16 15:35       ` Yosry Ahmed
  2025-01-16 19:03       ` Shakeel Butt
  1 sibling, 2 replies; 21+ messages in thread
From: Michal Koutný @ 2025-01-16 15:19 UTC (permalink / raw)
  To: Shakeel Butt, JP Kobryn
  Cc: hannes, yosryahmed, akpm, linux-mm, cgroups, Tejun Heo

Hello.

On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > and flushing efffectiveness depends on how individual readers are
> > correlated, 
> 
> Sorry I am confused by the above statement, can you please expand on
> what you meant by it?
> 
> > OTOH writer correlation affects
> > updaters when extending the update tree.
> 
> Here I am confused about the difference between writer and updater.

reader -- a call site that'd need to call cgroup_rstat_flush() to
	calculate aggregated stats
writer (or updater) -- a call site that calls cgroup_rstat_updated()
	when it modifies whatever datum

By correlated readers I meant that stats for multiple controllers are
read close to each other (time-wise). The first such reader does the
heavy lifting; subsequent readers enjoy quick access.
(With per-controller flushing, each reader would need to do its own
flush, and I suspect the total time is non-linear wrt the parts.)

Similarly for writers: if multiple controllers' data change in a short
window, only the first one has to construct the rstat tree from the top
down to self; the others are updating the same tree.
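
As a rough illustration of that point (sketch only, not code from the
series): before the split, two back-to-back updates on the same cgroup
share one tree, so the second call is a cheap early-out; after the
split, each subsystem builds its own path:

	/* before: both land on the same per-cpu tree for this cgroup */
	cgroup_rstat_updated(cgrp, cpu);	/* memory datum changed */
	cgroup_rstat_updated(cgrp, cpu);	/* io datum changed just after, early-out */

	/* after (css-based API from patch 1): separate trees, separate walks */
	cgroup_rstat_updated(&memcg->css, cpu);
	cgroup_rstat_updated(&blkcg->css, cpu);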

> In-kernel memcg stats readers will be unaffected most of the time with
> this change. The only difference will be when they flush, they will only
> flush memcg stats.

That "most of the time" is what depends on how other controller's
readers are active.

> Here I am assuming you meant measurements in terms of cpu cost or do you
> have something else in mind?

I have in mind something like Tejun's point 2:
| 2. It has noticeable benefits in the targeted use cases.

The cover letter mentions some old problems (which may not be problems
nowadays with memcg flushing reworks) and it's not clear how the
separation into per-controller trees impacts (today's) problems.

(I can imagine that if the problem is stated as: io.stat readers are
unnecessarily waiting for memory.stat flushing, then the benefit can be
shown (unless io.stat readers could benefit from flushing triggered by
e.g. memory). But I didn't get whether _that_ is the problem.)

Thanks,
Michal


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-16 15:19     ` Michal Koutný
@ 2025-01-16 15:35       ` Yosry Ahmed
  2025-01-16 19:03       ` Shakeel Butt
  1 sibling, 0 replies; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-16 15:35 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Shakeel Butt, JP Kobryn, hannes, akpm, linux-mm, cgroups, Tejun Heo

On Thu, Jan 16, 2025 at 7:19 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > and flushing efffectiveness depends on how individual readers are
> > > correlated,
> >
> > Sorry I am confused by the above statement, can you please expand on
> > what you meant by it?
> >
> > > OTOH writer correlation affects
> > > updaters when extending the update tree.
> >
> > Here I am confused about the difference between writer and updater.
>
> reader -- a call site that'd need to call cgroup_rstat_flush() to
>         calculate aggregated stats
> writer (or updater) -- a call site that calls cgroup_rstat_updated()
>         when it modifies whatever datum
>
> By correlated readers I meant that stats for multiple controllers are
> read close to each other (time-wise). First such a reader does the heavy
> lifting, consequent readers enjoy quick access.
> (With per-controller flushing, each reader would need to do the flush
> and I'm suspecting the total time non-linear wrt parts.)

In this case, I actually think it's better if every reader pays for
the flush they asked for (and only that). There is a bit of repeated
work if we read memory stats then io stats right after, but in cases
where we don't, paying to flush all subsystems because they are likely
to be flushed soon is not necessarily a good thing imo.

>
> Similarly for writers, if multiple controller's data change in short
> window, only the first one has to construct the rstat tree from top down
> to self, the other are updating the same tree.

This I agree about. If we have consecutive updates from two different
subsystems to the same cgroup, almost all the work is repeated.
Whether that causes a tangible performance difference or not is
something the numbers should show. In my experience, real regressions
on the update side are usually caught by LKP and are somewhat easy to
surface in benchmarks (I used netperf in the past).

>
> > In-kernel memcg stats readers will be unaffected most of the time with
> > this change. The only difference will be when they flush, they will only
> > flush memcg stats.
>
> That "most of the time" is what depends on how other controller's
> readers are active.

Since readers of other controllers are only in userspace (AFAICT), I
think it's unlikely that they are correlated with in-kernel memcg stat
readers in general.

>
> > Here I am assuming you meant measurements in terms of cpu cost or do you
> > have something else in mind?
>
> I have in mind something like Tejun's point 2:
> | 2. It has noticeable benefits in the targeted use cases.
>
> The cover letter mentions some old problems (which may not be problems
> nowadays with memcg flushing reworks) and it's not clear how the
> separation into per-controller trees impacts (today's) problems.
>
> (I can imagine if the problem is stated like: io.stat readers are
> unnecessarily waiting for memory.stat flushing, the benefit can be shown
> (unless io.stat readers could benefit from flushing triggered by e.g.
> memory).  But I didn't get if _that_ is the problem.)

Yeah I hope/expect that numbers will show that reading memcg stats (in
userspace or the kernel) becomes a bit faster, while reading other
subsystem stats should be significantly faster (at least in some
cases). We will see how that turns out.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-15 21:36           ` Yosry Ahmed
@ 2025-01-16 18:20             ` JP Kobryn
  0 siblings, 0 replies; 21+ messages in thread
From: JP Kobryn @ 2025-01-16 18:20 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Shakeel Butt, Michal Koutný,
	hannes, akpm, linux-mm, cgroups, Tejun Heo



On 1/15/25 1:36 PM, Yosry Ahmed wrote:
> On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> Hi Yosry,
>>
>> On 1/14/25 5:39 PM, Yosry Ahmed wrote:
>>> On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>
>>>> Hi Michal,
>>>>
>>>> On 1/13/25 10:25 AM, Shakeel Butt wrote:
>>>>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>>>>>> Hello JP.
>>>>>>
>>>>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>>>> I've been experimenting with these changes to allow for separate
>>>>>>> updating/flushing of cgroup stats per-subsystem.
>>>>>>
>>>>>> Nice.
>>>>>>
>>>>>>> I reached a point where this started to feel stable in my local testing, so I
>>>>>>> wanted to share and get feedback on this approach.
>>>>>>
>>>>>> The split is not straight-forwardly an improvement --
>>>>>
>>>>> The major improvement in my opinion is the performance isolation for
>>>>> stats readers i.e. cpu stats readers do not need to flush memory stats.
>>>>>
>>>>>> there's at least
>>>>>> higher memory footprint
>>>>>
>>>>> Yes this is indeed the case and JP, can you please give a ballmark on
>>>>> the memory overhead?
>>>>
>>>> Yes, the trade-off is using more memory to allow for separate trees.
>>>> With these patches the changes in allocated memory for the
>>>> cgroup_rstat_cpu instances and their associated locks are:
>>>> static
>>>>           reduced by 58%
>>>> dynamic
>>>>           increased by 344%
>>>>
>>>> The threefold increase on the dynamic side is attributed to now having 3
>>>> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
>>>> instead of originally just 1. The number will change if more subsystems
>>>> start or stop using rstat in the future. Feel free to let me know if you
>>>> would like to see the detailed breakdown of these values.
>>>
>>> What is the absolute per-CPU memory usage?
>>
>> This is what I calculate as the combined per-cpu usage.
>> before:
>>          one cgroup_rstat_cpu instance for every cgroup
>>          sizeof(cgroup_rstat_cpu) * nr_cgroups
>> after:
>>          three cgroup_rstat_cpu instances for every cgroup + updater lock for
>> every subsystem plus one for base stats
>>          sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
>>                  sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)
>>
>> Note that "every cgroup" includes the root cgroup. Also, 3 represents
>> the number of current rstat clients: base stats, memory, and io
>> (assuming all enabled).
> 
> On a config I have at hand sizeof(cgroup_rstat_cpu) is 160 bytes.
> Ignoring the spinlock for a second because it doesn't scale with
> cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
> moderately large machine with 256 CPUs and 100 cgroups for example
> that's ~8MB.
> 

Revisiting the cgroup_rstat_cpu struct, there might be an opportunity to
save some memory here. This struct has several cgroup_base_stat fields
that are not useful to the actual subsystems (memory, io). So I'm
considering a scenario where the existing cgroup_rstat_cpu is used for
the base stats, while a new, lighter struct that stays compatible with
the rstat infrastructure is used for the other subsystems.
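
A very rough sketch of the idea (the struct names are placeholders and
the base-stat fields are paraphrased from today's cgroup_rstat_cpu; not
code from a revision):

	/* minimal per-cpu state every rstat user needs: the tree linkage */
	struct css_rstat_cpu {
		struct cgroup_subsys_state *updated_children;
		struct cgroup_subsys_state *updated_next;
	};

	/* only the base stats pay for the cgroup_base_stat fields */
	struct cgroup_rstat_base_cpu {
		struct css_rstat_cpu rstat;
		struct u64_stats_sync bsync;
		struct cgroup_base_stat bstat;
		struct cgroup_base_stat last_bstat;
		struct cgroup_base_stat subtree_bstat;
		struct cgroup_base_stat last_subtree_bstat;
	};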

>>
>> As I'm writing this, I realize I might need to include the bpf cgroups
>> as a fourth client and include this in my testing.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9 RFC] cgroup: separate rstat trees
  2025-01-16 15:19     ` Michal Koutný
  2025-01-16 15:35       ` Yosry Ahmed
@ 2025-01-16 19:03       ` Shakeel Butt
  1 sibling, 0 replies; 21+ messages in thread
From: Shakeel Butt @ 2025-01-16 19:03 UTC (permalink / raw)
  To: Michal Koutný
  Cc: JP Kobryn, hannes, yosryahmed, akpm, linux-mm, cgroups, Tejun Heo

Hi Michal,

On Thu, Jan 16, 2025 at 04:19:07PM +0100, Michal Koutný wrote:
> Hello.
> 
> On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > and flushing efffectiveness depends on how individual readers are
> > > correlated, 
> > 
> > Sorry I am confused by the above statement, can you please expand on
> > what you meant by it?
> > 
> > > OTOH writer correlation affects
> > > updaters when extending the update tree.
> > 
> > Here I am confused about the difference between writer and updater.
> 
> reader -- a call site that'd need to call cgroup_rstat_flush() to
> 	calculate aggregated stats
> writer (or updater) -- a call site that calls cgroup_rstat_updated()
> 	when it modifies whatever datum
>

Ah, so writer and updater are the same.

> By correlated readers I meant that stats for multiple controllers are
> read close to each other (time-wise). First such a reader does the heavy
> lifting, consequent readers enjoy quick access.
> (With per-controller flushing, each reader would need to do the flush
> and I'm suspecting the total time non-linear wrt parts.)

This is a good point. In prod we are observing that on machines with
very active workloads, readers of different stats end up flushing all
the subsystems even when the reads are very close to each other
time-wise. In addition, there are users who are only reading non-memory
stats and still paying the cost of memory flushing. Please note that
memory stats flushing is the most expensive one at the moment, and it
does not make sense to flush memory stats in the above two cases.

> 
> Similarly for writers, if multiple controller's data change in short
> window, only the first one has to construct the rstat tree from top down
> to self, the other are updating the same tree.

Another good point. From my observation, rstat tree insertion is very
cheap and its cost gets amortized, i.e. there are a lot of updates
between flushes, such that the insertion cost is not noticeable at all,
at least in the perf traces.

> 
> > In-kernel memcg stats readers will be unaffected most of the time with
> > this change. The only difference will be when they flush, they will only
> > flush memcg stats.
> 
> That "most of the time" is what depends on how other controller's
> readers are active.
> 
> > Here I am assuming you meant measurements in terms of cpu cost or do you
> > have something else in mind?
> 
> I have in mind something like Tejun's point 2:
> | 2. It has noticeable benefits in the targeted use cases.
> 
> The cover letter mentions some old problems (which may not be problems
> nowadays with memcg flushing reworks) and it's not clear how the
> separation into per-controller trees impacts (today's) problems.
> 
> (I can imagine if the problem is stated like: io.stat readers are
> unnecessarily waiting for memory.stat flushing, the benefit can be shown
> (unless io.stat readers could benefit from flushing triggered by e.g.
> memory).  But I didn't get if _that_ is the problem.)
> 

The cover letter of v2 has more information on the motivation. The
main motivation is the same as I described above, i.e. many applications
(stat monitors) have to flush all the subsystems even when they are
reading different subsystems' stats close together, and then there are
applications that are reading just cpu or io stats and still have to
pay for memcg stat flushing. This series is targeting these two very
common scenarios.

Thanks Michal for your time and comments.

Shakeel



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-01-16 19:03 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-24  1:13 [PATCH 0/9 RFC] cgroup: separate rstat trees JP Kobryn
2024-12-24  1:13 ` [PATCH 1/9 RFC] change cgroup to css in rstat updated and flush api JP Kobryn
2024-12-24  1:13 ` [PATCH 2/9 RFC] cgroup: change cgroup to css in rstat internal flush and lock funcs JP Kobryn
2024-12-24  1:13 ` [PATCH 3/9 RFC] cgroup: change cgroup to css in rstat init and exit api JP Kobryn
2024-12-24  1:13 ` [PATCH 4/9 RFC] cgroup: split rstat from cgroup into separate css JP Kobryn
2024-12-24  1:13 ` [PATCH 5/9 RFC] cgroup: separate locking between base css and others JP Kobryn
2024-12-24  1:13 ` [PATCH 6/9 RFC] cgroup: isolate base stat flush JP Kobryn
2024-12-24  1:14 ` [PATCH 7/9 RFC] cgroup: remove unneeded rcu list JP Kobryn
2024-12-24  1:14 ` [PATCH 8/9 RFC] cgroup: remove bpf rstat flush from css generic flush JP Kobryn
2024-12-24  1:14 ` [PATCH 9/9 RFC] cgroup: avoid allocating rstat when flush func not present JP Kobryn
2024-12-24  4:57 ` [PATCH 0/9 RFC] cgroup: separate rstat trees Shakeel Butt
2025-01-08 18:16 ` Michal Koutný
2025-01-13 18:25   ` Shakeel Butt
2025-01-15  1:33     ` JP Kobryn
2025-01-15  1:39       ` Yosry Ahmed
2025-01-15 19:38         ` JP Kobryn
2025-01-15 21:36           ` Yosry Ahmed
2025-01-16 18:20             ` JP Kobryn
2025-01-16 15:19     ` Michal Koutný
2025-01-16 15:35       ` Yosry Ahmed
2025-01-16 19:03       ` Shakeel Butt
