* [PATCH 0/4 v2] cgroup: separate rstat trees
@ 2025-02-27 21:55 inwardvessel
  2025-02-27 21:55 ` [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state inwardvessel
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: inwardvessel @ 2025-02-27 21:55 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

From: JP Kobryn <inwardvessel@gmail.com>

The current design of rstat takes the approach that if one subsystem is
to be flushed, all other subsystems with pending updates should also be
flushed. It seems that over time, the stat-keeping of some subsystems
has grown in size to the extent that they are noticeably slowing down
others. This has been most observable in situations where the memory
controller is enabled. One big area where the issue comes up is system
telemetry, where programs periodically sample cpu stats. Programs like
this would benefit if the overhead of having to flush memory stats (and
others) could be eliminated. It would save cpu cycles for existing
cpu-based telemetry programs and improve scalability in terms of
sampling frequency and the number of hosts covered.

This series changes the approach of "flush all subsystems" to "flush
only the requested subsystem". The core design change is moving from a
single unified rstat tree of cgroups to having separate trees made up of
cgroup_subsys_state's. There will be one (per-cpu) tree for the base
stats (cgroup::self) and one for each enabled subsystem (if it
implements css_rstat_flush()). In order to do this, the rstat list
pointers were moved off of the cgroup and onto the css. As part of the
transition, the list pointer types were changed from cgroup to
cgroup_subsys_state. This allows rstat trees to be made up of css
nodes, where a given tree contains only css nodes associated with a
specific subsystem. The rstat APIs were changed to accept a reference
to a cgroup_subsys_state instead of a cgroup, which allows callers to
be specific about which stats are being updated/flushed.
Since separate trees will be in use, the locking scheme was adjusted.
The global locks were split up in such a way that there are separate
locks for the base stats (cgroup::self) and each subsystem (memory, io,
etc). This allows different subsystems (including base stats) to use
rstat in parallel with no contention.
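
As a rough illustration of the caller-visible change (mirroring the
memcontrol.c and blk-cgroup.c hunks in patch 1/4), a flush is now
requested against a specific css rather than its cgroup:

	/* before: flushing the cgroup walks every subsystem with pending stats */
	cgroup_rstat_flush(memcg->css.cgroup);

	/* after: only the memory controller's rstat tree is walked */
	cgroup_rstat_flush(&memcg->css);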

Breaking up the unified tree into separate trees eliminates the overhead
and scalability issue explained in the first section, but comes at the
expense of using additional memory. In an effort to minimize this
overhead, a conditional allocation is performed. The cgroup_rstat_cpu
originally contained the rstat list pointers and the base stat entities.
This struct was renamed to cgroup_rstat_base_cpu and is only allocated
when the associated css is cgroup::self. A new compact struct was added
that only contains the rstat list pointers. When the css is associated
with an actual subsystem, this compact struct is allocated. With this
conditional allocation, the change in memory overhead on a per-cpu basis
before/after is shown below.

before:
sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */

nr_cgroups * sizeof(struct cgroup_rstat_cpu)
nr_cgroups * 176 bytes

after:
sizeof(struct cgroup_rstat_cpu) == 16 bytes
sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes

nr_cgroups * (
	sizeof(struct cgroup_rstat_base_cpu) +
		sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
	)

nr_cgroups * (176 + 16 * nr_rstat_controllers)

... where nr_rstat_controllers is the number of enabled cgroup
controllers that implement css_rstat_flush(). On a host where both
memory and io are enabled:

nr_cgroups * (176 + 16 * 2)
nr_cgroups * 208 bytes
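
For reference, a minimal sketch of the conditional allocation described
above (the actual change is in patch 4/4, where is_base_css() checks
whether the css is cgroup::self):

	if (is_base_css(css))	/* cgroup::self needs the base stat entities */
		css->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
	else			/* formal subsystems only need the list pointers */
		css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);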

With regard to validation, there is a measurable benefit when reading
stats with this series. A test program was written that loops 1M times,
reading all four of the files cgroup.stat, cpu.stat, io.stat, and
memory.stat of a given parent cgroup on each iteration. This test
program was run in the experiments that follow.
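
A minimal sketch of such a test program is shown below. The parent
cgroup path is a placeholder and error handling is kept to a minimum.

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *files[] = { "cgroup.stat", "cpu.stat", "io.stat", "memory.stat" };
		char path[256], buf[64 << 10];
		int i, f, fd;

		for (i = 0; i < 1000000; i++) {
			for (f = 0; f < 4; f++) {
				/* /sys/fs/cgroup/parent is a placeholder path */
				snprintf(path, sizeof(path), "/sys/fs/cgroup/parent/%s", files[f]);
				fd = open(path, O_RDONLY);
				if (fd < 0)
					return 1;
				/* read the whole file each iteration */
				while (read(fd, buf, sizeof(buf)) > 0)
					;
				close(fd);
			}
		}
		return 0;
	}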

The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
and within each child cgroup a process was spawned to frequently update
the memory cgroup stats by creating and then reading a file of size 1T
(encouraging reclaim). The test program was run alongside these 26 tasks
in parallel. The results showed a benefit for the test program in both
elapsed time and its perf profile.

time before:
real    0m44.612s
user    0m0.567s
sys     0m43.887s

perf before:
27.02% mem_cgroup_css_rstat_flush
 6.35% __blkcg_rstat_flush
 0.06% cgroup_base_stat_cputime_show

time after:
real    0m27.125s
user    0m0.544s
sys     0m26.491s

perf after:
6.03% mem_cgroup_css_rstat_flush
0.37% blkcg_print_stat
0.11% cgroup_base_stat_cputime_show

Another experiment was set up on the same host using a parent cgroup
with two child cgroups. The same swap and memory limits were used as in
the previous experiment. In the two child cgroups, kernel builds were done
in parallel, each using "-j 20". The perf comparison of the test program
was very similar to the values in the previous experiment. The time
comparison is shown below.

before:
real    1m2.077s
user    0m0.784s
sys     1m0.895s

after:
real    0m32.216s
user    0m0.709s
sys     0m31.256s

Note that the above two experiments were also done with a modified test
program that only reads the cpu.stat file and none of the other three
files previously mentioned. The results were similar to what was seen in
the v1 email for this series. See changelog for link to v1 if needed.

For the final experiment, perf events were recorded during a kernel
build with the same host and cgroup setup. The builds took place in the
child cgroups. The control and experimental kernels spent a similar
number of cycles in cgroup_rstat_updated(), and in both cases this
appeared insignificant compared to the other events recorded during the
workload.

changelog
v2:
	drop the patch creating a new cgroup_rstat struct and related code
	drop bpf-specific patches. instead just use cgroup::self in bpf progs
	drop the cpu lock patches. instead select cpu lock in updated_list func
	relocate the cgroup_rstat_init() call to inside css_create()
	relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
		to css_free_rwork_fn()
v1:
	https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/

JP Kobryn (4):
  cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  cgroup: rstat lock indirection
  cgroup: separate rstat locks for subsystems
  cgroup: separate rstat list pointers from base stats

 block/blk-cgroup.c                            |   4 +-
 include/linux/cgroup-defs.h                   |  67 ++--
 include/linux/cgroup.h                        |   8 +-
 kernel/cgroup/cgroup-internal.h               |   4 +-
 kernel/cgroup/cgroup.c                        |  53 +--
 kernel/cgroup/rstat.c                         | 318 +++++++++++-------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 9 files changed, 276 insertions(+), 195 deletions(-)

-- 
2.43.5



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
@ 2025-02-27 21:55 ` inwardvessel
  2025-02-27 22:43   ` Shakeel Butt
                     ` (2 more replies)
  2025-02-27 21:55 ` [PATCH 2/4 v2] cgroup: rstat lock indirection inwardvessel
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 39+ messages in thread
From: inwardvessel @ 2025-02-27 21:55 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

From: JP Kobryn <inwardvessel@gmail.com>

Each cgroup owns rstat pointers. This means that a tree of pending rstat
updates can contain changes from different subsystems. Because of this
arrangement, when one subsystem is flushed via the public API
cgroup_rstat_flush(), all other subsystems with pending updates will
also be flushed. Remove the rstat pointers from the cgroup and instead
give them to each cgroup_subsys_state. Separate rstat trees will now
exist for each unique subsystem. This separation allows subsystems to
update and flush their stats without the side effects of other
subsystems, i.e. flushing the cpu stats does not cause the memory stats
to be flushed and vice versa. The change in pointer ownership from
cgroup to cgroup_subsys_state allows for direct flushing of the css, so
the rcu list management entities and operations previously tied to the
cgroup (used for managing a list of subsystem states with pending
flushes) are removed. In terms of client code, public API calls were
changed to accept a reference to the cgroup_subsys_state so that when
flushing or updating, a specific subsystem is associated with the call.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 block/blk-cgroup.c                            |   4 +-
 include/linux/cgroup-defs.h                   |  36 ++--
 include/linux/cgroup.h                        |   8 +-
 kernel/cgroup/cgroup-internal.h               |   4 +-
 kernel/cgroup/cgroup.c                        |  53 +++---
 kernel/cgroup/rstat.c                         | 159 +++++++++---------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 9 files changed, 140 insertions(+), 141 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9ed93d91d754..6a0680d8ce6a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1201,7 +1201,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
-		cgroup_rstat_flush(blkcg->css.cgroup);
+		cgroup_rstat_flush(&blkcg->css);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -2186,7 +2186,7 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
-	cgroup_rstat_updated(blkcg->css.cgroup, cpu);
+	cgroup_rstat_updated(&blkcg->css, cpu);
 	put_cpu();
 }
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 17960a1e858d..1598e1389615 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -169,6 +169,9 @@ struct cgroup_subsys_state {
 	/* reference count - access via css_[try]get() and css_put() */
 	struct percpu_ref refcnt;
 
+	/* per-cpu recursive resource statistics */
+	struct cgroup_rstat_cpu __percpu *rstat_cpu;
+
 	/*
 	 * siblings list anchored at the parent's ->children
 	 *
@@ -177,9 +180,6 @@ struct cgroup_subsys_state {
 	struct list_head sibling;
 	struct list_head children;
 
-	/* flush target list anchored at cgrp->rstat_css_list */
-	struct list_head rstat_css_node;
-
 	/*
 	 * PI: Subsys-unique ID.  0 is unused and root is always 1.  The
 	 * matching css can be looked up using css_from_id().
@@ -219,6 +219,14 @@ struct cgroup_subsys_state {
 	 * Protected by cgroup_mutex.
 	 */
 	int nr_descendants;
+
+	/*
+	 * A singly-linked list of css structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup_subsys_state *rstat_flush_next;
+
 };
 
 /*
@@ -386,8 +394,8 @@ struct cgroup_rstat_cpu {
 	 *
 	 * Protected by per-cpu cgroup_rstat_cpu_lock.
 	 */
-	struct cgroup *updated_children;	/* terminated by self cgroup */
-	struct cgroup *updated_next;		/* NULL iff not on the list */
+	struct cgroup_subsys_state *updated_children;	/* terminated by self */
+	struct cgroup_subsys_state *updated_next;		/* NULL if not on list */
 };
 
 struct cgroup_freezer_state {
@@ -516,24 +524,6 @@ struct cgroup {
 	struct cgroup *dom_cgrp;
 	struct cgroup *old_dom_cgrp;		/* used while enabling threaded */
 
-	/* per-cpu recursive resource statistics */
-	struct cgroup_rstat_cpu __percpu *rstat_cpu;
-	struct list_head rstat_css_list;
-
-	/*
-	 * Add padding to separate the read mostly rstat_cpu and
-	 * rstat_css_list into a different cacheline from the following
-	 * rstat_flush_next and *bstat fields which can have frequent updates.
-	 */
-	CACHELINE_PADDING(_pad_);
-
-	/*
-	 * A singly-linked list of cgroup structures to be rstat flushed.
-	 * This is a scratch field to be used exclusively by
-	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
-	 */
-	struct cgroup	*rstat_flush_next;
-
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f8ef47f8a634..eec970622419 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -687,10 +687,10 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 /*
  * cgroup scalable recursive statistics.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
-void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(struct cgroup *cgrp);
+void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu);
+void cgroup_rstat_flush(struct cgroup_subsys_state *css);
+void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css);
+void cgroup_rstat_flush_release(struct cgroup_subsys_state *css);
 
 /*
  * Basic resource stats.
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..87d062baff90 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -269,8 +269,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
 /*
  * rstat.c
  */
-int cgroup_rstat_init(struct cgroup *cgrp);
-void cgroup_rstat_exit(struct cgroup *cgrp);
+int cgroup_rstat_init(struct cgroup_subsys_state *css);
+void cgroup_rstat_exit(struct cgroup_subsys_state *css);
 void cgroup_rstat_boot(void);
 void cgroup_base_stat_cputime_show(struct seq_file *seq);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index afc665b7b1fe..31b3bfebf7ba 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -164,7 +164,9 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
 static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
 
 /* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root = {
+	.cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu
+};
 EXPORT_SYMBOL_GPL(cgrp_dfl_root);
 
 /*
@@ -1358,7 +1360,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	cgroup_unlock();
 
-	cgroup_rstat_exit(cgrp);
+	cgroup_rstat_exit(&cgrp->self);
 	kernfs_destroy_root(root->kf_root);
 	cgroup_free_root(root);
 }
@@ -1863,13 +1865,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 		}
 		spin_unlock_irq(&css_set_lock);
 
-		if (ss->css_rstat_flush) {
-			list_del_rcu(&css->rstat_css_node);
-			synchronize_rcu();
-			list_add_rcu(&css->rstat_css_node,
-				     &dcgrp->rstat_css_list);
-		}
-
 		/* default hierarchy doesn't enable controllers by default */
 		dst_root->subsys_mask |= 1 << ssid;
 		if (dst_root == &cgrp_dfl_root) {
@@ -2052,7 +2047,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	cgrp->dom_cgrp = cgrp;
 	cgrp->max_descendants = INT_MAX;
 	cgrp->max_depth = INT_MAX;
-	INIT_LIST_HEAD(&cgrp->rstat_css_list);
 	prev_cputime_init(&cgrp->prev_cputime);
 
 	for_each_subsys(ss, ssid)
@@ -2132,7 +2126,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	if (ret)
 		goto destroy_root;
 
-	ret = cgroup_rstat_init(root_cgrp);
+	ret = cgroup_rstat_init(&root_cgrp->self);
 	if (ret)
 		goto destroy_root;
 
@@ -2174,7 +2168,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	goto out;
 
 exit_stats:
-	cgroup_rstat_exit(root_cgrp);
+	cgroup_rstat_exit(&root_cgrp->self);
 destroy_root:
 	kernfs_destroy_root(root->kf_root);
 	root->kf_root = NULL;
@@ -5407,7 +5401,11 @@ static void css_free_rwork_fn(struct work_struct *work)
 		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;
 
+		if (css->ss->css_rstat_flush)
+			cgroup_rstat_exit(css);
+
 		ss->css_free(css);
+
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
 
@@ -5431,7 +5429,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
 			psi_cgroup_free(cgrp);
-			cgroup_rstat_exit(cgrp);
+			cgroup_rstat_exit(&cgrp->self);
 			kfree(cgrp);
 		} else {
 			/*
@@ -5459,11 +5457,7 @@ static void css_release_work_fn(struct work_struct *work)
 	if (ss) {
 		struct cgroup *parent_cgrp;
 
-		/* css release path */
-		if (!list_empty(&css->rstat_css_node)) {
-			cgroup_rstat_flush(cgrp);
-			list_del_rcu(&css->rstat_css_node);
-		}
+		cgroup_rstat_flush(css);
 
 		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
 		if (ss->css_released)
@@ -5489,7 +5483,7 @@ static void css_release_work_fn(struct work_struct *work)
 		/* cgroup release path */
 		TRACE_CGROUP_PATH(release, cgrp);
 
-		cgroup_rstat_flush(cgrp);
+		cgroup_rstat_flush(&cgrp->self);
 
 		spin_lock_irq(&css_set_lock);
 		for (tcgrp = cgroup_parent(cgrp); tcgrp;
@@ -5537,7 +5531,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	css->id = -1;
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
-	INIT_LIST_HEAD(&css->rstat_css_node);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
 
@@ -5546,9 +5539,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 		css_get(css->parent);
 	}
 
-	if (ss->css_rstat_flush)
-		list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
-
 	BUG_ON(cgroup_css(cgrp, ss));
 }
 
@@ -5641,6 +5631,12 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 		goto err_free_css;
 	css->id = err;
 
+	if (css->ss->css_rstat_flush) {
+		err = cgroup_rstat_init(css);
+		if (err)
+			goto err_free_css;
+	}
+
 	/* @css is ready to be brought online now, make it visible */
 	list_add_tail_rcu(&css->sibling, &parent_css->children);
 	cgroup_idr_replace(&ss->css_idr, css, css->id);
@@ -5654,7 +5650,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_list_del:
 	list_del_rcu(&css->sibling);
 err_free_css:
-	list_del_rcu(&css->rstat_css_node);
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
 	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
@@ -5682,7 +5677,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (ret)
 		goto out_free_cgrp;
 
-	ret = cgroup_rstat_init(cgrp);
+	ret = cgroup_rstat_init(&cgrp->self);
 	if (ret)
 		goto out_cancel_ref;
 
@@ -5775,7 +5770,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 out_kernfs_remove:
 	kernfs_remove(cgrp->kn);
 out_stat_exit:
-	cgroup_rstat_exit(cgrp);
+	cgroup_rstat_exit(&cgrp->self);
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
@@ -6087,6 +6082,9 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	} else {
 		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
 		BUG_ON(css->id < 0);
+
+		if (css->ss && css->ss->css_rstat_flush)
+			BUG_ON(cgroup_rstat_init(css));
 	}
 
 	/* Update the init_css_set to contain a subsys
@@ -6188,6 +6186,9 @@ int __init cgroup_init(void)
 			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
 						   GFP_KERNEL);
 			BUG_ON(css->id < 0);
+
+			if (css->ss && css->ss->css_rstat_flush)
+				BUG_ON(cgroup_rstat_init(css));
 		} else {
 			cgroup_init_subsys(ss, false);
 		}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index aac91466279f..9976f9acd62b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -14,9 +14,10 @@ static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
-static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
+static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
+		struct cgroup_subsys_state *css, int cpu)
 {
-	return per_cpu_ptr(cgrp->rstat_cpu, cpu);
+	return per_cpu_ptr(css->rstat_cpu, cpu);
 }
 
 /*
@@ -75,15 +76,17 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
 
 /**
  * cgroup_rstat_updated - keep track of updated rstat_cpu
- * @cgrp: target cgroup
+ * @css: target cgroup subsystem state
  * @cpu: cpu on which rstat_cpu was updated
  *
- * @cgrp's rstat_cpu on @cpu was updated.  Put it on the parent's matching
+ * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
  * rstat_cpu->updated_children list.  See the comment on top of
  * cgroup_rstat_cpu definition for details.
  */
-__bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+__bpf_kfunc void cgroup_rstat_updated(
+		struct cgroup_subsys_state *css, int cpu)
 {
+	struct cgroup *cgrp = css->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
 
@@ -92,18 +95,18 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 	 * temporary inaccuracies, which is fine.
 	 *
 	 * Because @parent's updated_children is terminated with @parent
-	 * instead of NULL, we can tell whether @cgrp is on the list by
+	 * instead of NULL, we can tell whether @css is on the list by
 	 * testing the next pointer for NULL.
 	 */
-	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
+	if (data_race(cgroup_rstat_cpu(css, cpu)->updated_next))
 		return;
 
 	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
 
-	/* put @cgrp and all ancestors on the corresponding updated lists */
+	/* put @css and all ancestors on the corresponding updated lists */
 	while (true) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
-		struct cgroup *parent = cgroup_parent(cgrp);
+		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(css, cpu);
+		struct cgroup_subsys_state *parent = css->parent;
 		struct cgroup_rstat_cpu *prstatc;
 
 		/*
@@ -115,15 +118,15 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 
 		/* Root has no parent to link it to, but mark it busy */
 		if (!parent) {
-			rstatc->updated_next = cgrp;
+			rstatc->updated_next = css;
 			break;
 		}
 
 		prstatc = cgroup_rstat_cpu(parent, cpu);
 		rstatc->updated_next = prstatc->updated_children;
-		prstatc->updated_children = cgrp;
+		prstatc->updated_children = css;
 
-		cgrp = parent;
+		css = parent;
 	}
 
 	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
@@ -141,12 +144,13 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
  * into a singly linked list built from the tail backward like "pushing"
  * cgroups into a stack. The root is pushed by the caller.
  */
-static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
-						 struct cgroup *child, int cpu)
+static struct cgroup_subsys_state *cgroup_rstat_push_children(
+		struct cgroup_subsys_state *head,
+		struct cgroup_subsys_state *child, int cpu)
 {
-	struct cgroup *chead = child;	/* Head of child cgroup level */
-	struct cgroup *ghead = NULL;	/* Head of grandchild cgroup level */
-	struct cgroup *parent, *grandchild;
+	struct cgroup_subsys_state *chead = child;	/* Head of child css level */
+	struct cgroup_subsys_state *ghead = NULL;	/* Head of grandchild css level */
+	struct cgroup_subsys_state *parent, *grandchild;
 	struct cgroup_rstat_cpu *crstatc;
 
 	child->rstat_flush_next = NULL;
@@ -155,7 +159,7 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
 	while (chead) {
 		child = chead;
 		chead = child->rstat_flush_next;
-		parent = cgroup_parent(child);
+		parent = child->parent;
 
 		/* updated_next is parent cgroup terminated */
 		while (child != parent) {
@@ -184,30 +188,32 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
 
 /**
  * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
- * @root: root of the cgroup subtree to traverse
+ * @root: root of the css subtree to traverse
  * @cpu: target cpu
  * Return: A singly linked list of cgroups to be flushed
  *
  * Walks the updated rstat_cpu tree on @cpu from @root.  During traversal,
- * each returned cgroup is unlinked from the updated tree.
+ * each returned css is unlinked from the updated tree.
  *
  * The only ordering guarantee is that, for a parent and a child pair
  * covered by a given traversal, the child is before its parent in
  * the list.
  *
  * Note that updated_children is self terminated and points to a list of
- * child cgroups if not empty. Whereas updated_next is like a sibling link
- * within the children list and terminated by the parent cgroup. An exception
+ * child css's if not empty. Whereas updated_next is like a sibling link
+ * within the children list and terminated by the parent css. An exception
  * here is the cgroup root whose updated_next can be self terminated.
  */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+static struct cgroup_subsys_state *cgroup_rstat_updated_list(
+		struct cgroup_subsys_state *root, int cpu)
 {
+	struct cgroup *cgrp = root->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
-	struct cgroup *head = NULL, *parent, *child;
+	struct cgroup_subsys_state *head = NULL, *parent, *child;
 	unsigned long flags;
 
-	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
+	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -217,10 +223,10 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	 * Unlink @root from its parent. As the updated_children list is
 	 * singly linked, we have to walk it to find the removal point.
 	 */
-	parent = cgroup_parent(root);
+	parent = root->parent;
 	if (parent) {
 		struct cgroup_rstat_cpu *prstatc;
-		struct cgroup **nextp;
+		struct cgroup_subsys_state **nextp;
 
 		prstatc = cgroup_rstat_cpu(parent, cpu);
 		nextp = &prstatc->updated_children;
@@ -244,7 +250,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	if (child != root)
 		head = cgroup_rstat_push_children(head, child, cpu);
 unlock_ret:
-	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
+	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
 	return head;
 }
 
@@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 }
 
 /* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
+static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
 {
+	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
+		struct cgroup_subsys_state *pos;
 
+		pos = cgroup_rstat_updated_list(css, cpu);
 		for (; pos; pos = pos->rstat_flush_next) {
-			struct cgroup_subsys_state *css;
+			if (!pos->ss)
+				cgroup_base_stat_flush(pos->cgroup, cpu);
+			else
+				pos->ss->css_rstat_flush(pos, cpu);
 
-			cgroup_base_stat_flush(pos, cpu);
-			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
-
-			rcu_read_lock();
-			list_for_each_entry_rcu(css, &pos->rstat_css_list,
-						rstat_css_node)
-				css->ss->css_rstat_flush(css, cpu);
-			rcu_read_unlock();
+			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
 		}
 
 		/* play nice and yield if necessary */
@@ -334,93 +338,96 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 }
 
 /**
- * cgroup_rstat_flush - flush stats in @cgrp's subtree
- * @cgrp: target cgroup
+ * cgroup_rstat_flush - flush stats in @css's rstat subtree
+ * @css: target cgroup subsystem state
  *
- * Collect all per-cpu stats in @cgrp's subtree into the global counters
- * and propagate them upwards.  After this function returns, all cgroups in
- * the subtree have up-to-date ->stat.
+ * Collect all per-cpu stats in @css's subtree into the global counters
+ * and propagate them upwards. After this function returns, all rstat
+ * nodes in the subtree have up-to-date ->stat.
  *
- * This also gets all cgroups in the subtree including @cgrp off the
+ * This also gets all rstat nodes in the subtree including @css off the
  * ->updated_children lists.
  *
  * This function may block.
  */
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
+__bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	might_sleep();
 
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+	cgroup_rstat_flush_locked(css);
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
 /**
- * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
- * @cgrp: target cgroup
+ * cgroup_rstat_flush_hold - flush stats in @css's rstat subtree and hold
+ * @css: target subsystem state
  *
- * Flush stats in @cgrp's subtree and prevent further flushes.  Must be
+ * Flush stats in @css's rstat subtree and prevent further flushes. Must be
  * paired with cgroup_rstat_flush_release().
  *
  * This function may block.
  */
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
-	__acquires(&cgroup_rstat_lock)
+void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	might_sleep();
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+	cgroup_rstat_flush_locked(css);
 }
 
 /**
  * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
- * @cgrp: cgroup used by tracepoint
+ * @css: css that was previously used for the call to flush hold
  */
-void cgroup_rstat_flush_release(struct cgroup *cgrp)
-	__releases(&cgroup_rstat_lock)
+void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
-int cgroup_rstat_init(struct cgroup *cgrp)
+int cgroup_rstat_init(struct cgroup_subsys_state *css)
 {
 	int cpu;
 
-	/* the root cgrp has rstat_cpu preallocated */
-	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
-		if (!cgrp->rstat_cpu)
+	/* the root cgrp's self css has rstat_cpu preallocated */
+	if (!css->rstat_cpu) {
+		css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		if (!css->rstat_cpu)
 			return -ENOMEM;
 	}
 
 	/* ->updated_children list is self terminated */
 	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(css, cpu);
 
-		rstatc->updated_children = cgrp;
+		rstatc->updated_children = css;
 		u64_stats_init(&rstatc->bsync);
 	}
 
 	return 0;
 }
 
-void cgroup_rstat_exit(struct cgroup *cgrp)
+void cgroup_rstat_exit(struct cgroup_subsys_state *css)
 {
 	int cpu;
 
-	cgroup_rstat_flush(cgrp);
+	cgroup_rstat_flush(css);
 
 	/* sanity check */
 	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(css, cpu);
 
-		if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
+		if (WARN_ON_ONCE(rstatc->updated_children != css) ||
 		    WARN_ON_ONCE(rstatc->updated_next))
 			return;
 	}
 
-	free_percpu(cgrp->rstat_cpu);
-	cgrp->rstat_cpu = NULL;
+	free_percpu(css->rstat_cpu);
+	css->rstat_cpu = NULL;
 }
 
 void __init cgroup_rstat_boot(void)
@@ -461,7 +468,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
-	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(&cgrp->self, cpu);
 	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_rstat_cpu *prstatc;
 	struct cgroup_base_stat delta;
@@ -491,7 +498,7 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 
 		delta = rstatc->subtree_bstat;
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = cgroup_rstat_cpu(&parent->self, cpu);
 		cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
 		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
 		cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
@@ -503,7 +510,7 @@ cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags
 {
 	struct cgroup_rstat_cpu *rstatc;
 
-	rstatc = get_cpu_ptr(cgrp->rstat_cpu);
+	rstatc = get_cpu_ptr(cgrp->self.rstat_cpu);
 	*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
 	return rstatc;
 }
@@ -513,7 +520,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
 						 unsigned long flags)
 {
 	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
-	cgroup_rstat_updated(cgrp, smp_processor_id());
+	cgroup_rstat_updated(&cgrp->self, smp_processor_id());
 	put_cpu_ptr(rstatc);
 }
 
@@ -615,12 +622,12 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 	u64 usage, utime, stime, ntime;
 
 	if (cgroup_parent(cgrp)) {
-		cgroup_rstat_flush_hold(cgrp);
+		cgroup_rstat_flush_hold(&cgrp->self);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
 		ntime = cgrp->bstat.ntime;
-		cgroup_rstat_flush_release(cgrp);
+		cgroup_rstat_flush_release(&cgrp->self);
 	} else {
 		/* cgrp->bstat of root is not actually used, reuse it */
 		root_cgroup_cputime(&cgrp->bstat);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..88c2c8e610b1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -579,7 +579,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+	cgroup_rstat_updated(&memcg->css, cpu);
 	statc = this_cpu_ptr(memcg->vmstats_percpu);
 	for (; statc; statc = statc->parent) {
 		stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
@@ -611,7 +611,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
-	cgroup_rstat_flush(memcg->css.cgroup);
+	cgroup_rstat_flush(&memcg->css);
 }
 
 /*
diff --git a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
index 38f78d9345de..f362f7d41b9e 100644
--- a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
+++ b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
@@ -45,7 +45,7 @@ int BPF_PROG(test_percpu2, struct bpf_testmod_btf_type_tag_2 *arg)
 SEC("tp_btf/cgroup_mkdir")
 int BPF_PROG(test_percpu_load, struct cgroup *cgrp, const char *path)
 {
-	g = (__u64)cgrp->rstat_cpu->updated_children;
+	g = (__u64)cgrp->self.rstat_cpu->updated_children;
 	return 0;
 }
 
@@ -56,7 +56,8 @@ int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
 	__u32 cpu;
 
 	cpu = bpf_get_smp_processor_id();
-	rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat_cpu, cpu);
+	rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(
+			cgrp->self.rstat_cpu, cpu);
 	if (rstat) {
 		/* READ_ONCE */
 		*(volatile int *)rstat;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
index c74362854948..10c803c8dc70 100644
--- a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -37,8 +37,8 @@ struct {
 	__type(value, struct attach_counter);
 } attach_counters SEC(".maps");
 
-extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
-extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+extern void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu) __ksym;
+extern void cgroup_rstat_flush(struct cgroup_subsys_state *css) __ksym;
 
 static uint64_t cgroup_id(struct cgroup *cgrp)
 {
@@ -75,7 +75,7 @@ int BPF_PROG(counter, struct cgroup *dst_cgrp, struct task_struct *leader,
 	else if (create_percpu_attach_counter(cg_id, 1))
 		return 0;
 
-	cgroup_rstat_updated(dst_cgrp, bpf_get_smp_processor_id());
+	cgroup_rstat_updated(&dst_cgrp->self, bpf_get_smp_processor_id());
 	return 0;
 }
 
@@ -141,7 +141,7 @@ int BPF_PROG(dumper, struct bpf_iter_meta *meta, struct cgroup *cgrp)
 		return 1;
 
 	/* Flush the stats to make sure we get the most updated numbers */
-	cgroup_rstat_flush(cgrp);
+	cgroup_rstat_flush(&cgrp->self);
 
 	total_counter = bpf_map_lookup_elem(&attach_counters, &cg_id);
 	if (!total_counter) {
-- 
2.43.5



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 2/4 v2] cgroup: rstat lock indirection
  2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
  2025-02-27 21:55 ` [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state inwardvessel
@ 2025-02-27 21:55 ` inwardvessel
  2025-03-03 15:21   ` Michal Koutný
  2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: inwardvessel @ 2025-02-27 21:55 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

From: JP Kobryn <inwardvessel@gmail.com>

Instead of accessing the target lock directly via a global variable,
access it indirectly through a new function parameter. Also change the
ordering of the parameters to be consistent with the related per-cpu
locking function _cgroup_rstat_cpu_lock().

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 kernel/cgroup/rstat.c | 41 ++++++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 9976f9acd62b..88908ef9212d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -277,7 +277,7 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 __bpf_hook_end();
 
 /*
- * Helper functions for locking cgroup_rstat_lock.
+ * Helper functions for locking.
  *
  * This makes it easier to diagnose locking issues and contention in
  * production environments.  The parameter @cpu_in_loop indicate lock
@@ -285,29 +285,32 @@ __bpf_hook_end();
  * value -1 is used when obtaining the main lock else this is the CPU
  * number processed last.
  */
-static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
-	__acquires(&cgroup_rstat_lock)
+static inline void __cgroup_rstat_lock(spinlock_t *lock,
+		struct cgroup *cgrp, int cpu_in_loop)
+	__acquires(lock)
 {
 	bool contended;
 
-	contended = !spin_trylock_irq(&cgroup_rstat_lock);
+	contended = !spin_trylock_irq(lock);
 	if (contended) {
 		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
-		spin_lock_irq(&cgroup_rstat_lock);
+		spin_lock_irq(lock);
 	}
 	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
 
-static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
-	__releases(&cgroup_rstat_lock)
+static inline void __cgroup_rstat_unlock(spinlock_t *lock,
+		struct cgroup *cgrp, int cpu_in_loop)
+	__releases(lock)
 {
 	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	spin_unlock_irq(lock);
 }
 
 /* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
-	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
+static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
+		spinlock_t *lock)
+	__releases(lock) __acquires(lock)
 {
 	struct cgroup *cgrp = css->cgroup;
 	int cpu;
@@ -328,11 +331,11 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
 		}
 
 		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			__cgroup_rstat_unlock(cgrp, cpu);
+		if (need_resched() || spin_needbreak(lock)) {
+			__cgroup_rstat_unlock(lock, cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
-			__cgroup_rstat_lock(cgrp, cpu);
+			__cgroup_rstat_lock(lock, cgrp, cpu);
 		}
 	}
 }
@@ -356,9 +359,9 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
 
 	might_sleep();
 
-	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(css);
-	__cgroup_rstat_unlock(cgrp, -1);
+	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
+	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
+	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
 }
 
 /**
@@ -375,8 +378,8 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
 	struct cgroup *cgrp = css->cgroup;
 
 	might_sleep();
-	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(css);
+	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
+	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
 }
 
 /**
@@ -386,7 +389,7 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
 void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
 {
 	struct cgroup *cgrp = css->cgroup;
-	__cgroup_rstat_unlock(cgrp, -1);
+	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
 }
 
 int cgroup_rstat_init(struct cgroup_subsys_state *css)
-- 
2.43.5



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
  2025-02-27 21:55 ` [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state inwardvessel
  2025-02-27 21:55 ` [PATCH 2/4 v2] cgroup: rstat lock indirection inwardvessel
@ 2025-02-27 21:55 ` inwardvessel
  2025-02-27 22:52   ` Shakeel Butt
                     ` (5 more replies)
  2025-02-27 21:55 ` [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats inwardvessel
                   ` (2 subsequent siblings)
  5 siblings, 6 replies; 39+ messages in thread
From: inwardvessel @ 2025-02-27 21:55 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

From: JP Kobryn <inwardvessel@gmail.com>

Let the existing locks be dedicated to the base stats and rename them as
such. Also add new rstat locks for each enabled subsystem. When handling
cgroup subsystem states, distinguish between formal subsystems (memory,
io, etc) and the base stats subsystem state (represented by
cgroup::self) to decide on which locks to take. This change is made to
prevent contention between subsystems when updating/flushing stats.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 kernel/cgroup/rstat.c | 93 +++++++++++++++++++++++++++++++++----------
 1 file changed, 72 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 88908ef9212d..b3eaefc1fd07 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,8 +9,12 @@
 
 #include <trace/events/cgroup.h>
 
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
-static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
+
+static spinlock_t cgroup_rstat_subsys_lock[CGROUP_SUBSYS_COUNT];
+static DEFINE_PER_CPU(raw_spinlock_t,
+		cgroup_rstat_subsys_cpu_lock[CGROUP_SUBSYS_COUNT]);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -20,8 +24,13 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
 	return per_cpu_ptr(css->rstat_cpu, cpu);
 }
 
+static inline bool is_base_css(struct cgroup_subsys_state *css)
+{
+	return css->ss == NULL;
+}
+
 /*
- * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
+ * Helper functions for rstat per CPU locks.
  *
  * This makes it easier to diagnose locking issues and contention in
  * production environments. The parameter @fast_path determine the
@@ -36,12 +45,12 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
 	bool contended;
 
 	/*
-	 * The _irqsave() is needed because cgroup_rstat_lock is
-	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-	 * this lock with the _irq() suffix only disables interrupts on
-	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-	 * interrupts on both configurations. The _irqsave() ensures
-	 * that interrupts are always disabled and later restored.
+	 * The _irqsave() is needed because the locks used for flushing are
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
+	 * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
+	 * kernel. The raw_spinlock_t below disables interrupts on both
+	 * configurations. The _irqsave() ensures that interrupts are always
+	 * disabled and later restored.
 	 */
 	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
 	if (contended) {
@@ -87,7 +96,7 @@ __bpf_kfunc void cgroup_rstat_updated(
 		struct cgroup_subsys_state *css, int cpu)
 {
 	struct cgroup *cgrp = css->cgroup;
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+	raw_spinlock_t *cpu_lock;
 	unsigned long flags;
 
 	/*
@@ -101,6 +110,12 @@ __bpf_kfunc void cgroup_rstat_updated(
 	if (data_race(cgroup_rstat_cpu(css, cpu)->updated_next))
 		return;
 
+	if (is_base_css(css))
+		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
+	else
+		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
+			css->ss->id;
+
 	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
 
 	/* put @css and all ancestors on the corresponding updated lists */
@@ -208,11 +223,17 @@ static struct cgroup_subsys_state *cgroup_rstat_updated_list(
 		struct cgroup_subsys_state *root, int cpu)
 {
 	struct cgroup *cgrp = root->cgroup;
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
 	struct cgroup_subsys_state *head = NULL, *parent, *child;
+	raw_spinlock_t *cpu_lock;
 	unsigned long flags;
 
+	if (is_base_css(root))
+		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
+	else
+		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
+			root->ss->id;
+
 	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
 
 	/* Return NULL if this subtree is not on-list */
@@ -315,7 +336,7 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
 	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
-	lockdep_assert_held(&cgroup_rstat_lock);
+	lockdep_assert_held(&lock);
 
 	for_each_possible_cpu(cpu) {
 		struct cgroup_subsys_state *pos;
@@ -356,12 +377,18 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
 __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
 {
 	struct cgroup *cgrp = css->cgroup;
+	spinlock_t *lock;
+
+	if (is_base_css(css))
+		lock = &cgroup_rstat_base_lock;
+	else
+		lock = &cgroup_rstat_subsys_lock[css->ss->id];
 
 	might_sleep();
 
-	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
-	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
-	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
+	__cgroup_rstat_lock(lock, cgrp, -1);
+	cgroup_rstat_flush_locked(css, lock);
+	__cgroup_rstat_unlock(lock, cgrp, -1);
 }
 
 /**
@@ -376,10 +403,16 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
 void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
 {
 	struct cgroup *cgrp = css->cgroup;
+	spinlock_t *lock;
+
+	if (is_base_css(css))
+		lock = &cgroup_rstat_base_lock;
+	else
+		lock = &cgroup_rstat_subsys_lock[css->ss->id];
 
 	might_sleep();
-	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
-	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
+	__cgroup_rstat_lock(lock, cgrp, -1);
+	cgroup_rstat_flush_locked(css, lock);
 }
 
 /**
@@ -389,7 +422,14 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
 void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
 {
 	struct cgroup *cgrp = css->cgroup;
-	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
+	spinlock_t *lock;
+
+	if (is_base_css(css))
+		lock = &cgroup_rstat_base_lock;
+	else
+		lock = &cgroup_rstat_subsys_lock[css->ss->id];
+
+	__cgroup_rstat_unlock(lock, cgrp, -1);
 }
 
 int cgroup_rstat_init(struct cgroup_subsys_state *css)
@@ -435,10 +475,21 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
 
 void __init cgroup_rstat_boot(void)
 {
-	int cpu;
+	struct cgroup_subsys *ss;
+	int cpu, ssid;
 
-	for_each_possible_cpu(cpu)
-		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
+	for_each_subsys(ss, ssid) {
+		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
+	}
+
+	for_each_possible_cpu(cpu) {
+		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
+
+		for_each_subsys(ss, ssid) {
+			raw_spin_lock_init(
+					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);
+		}
+	}
 }
 
 /*
-- 
2.43.5



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats
  2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
                   ` (2 preceding siblings ...)
  2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
@ 2025-02-27 21:55 ` inwardvessel
  2025-02-27 23:01   ` Shakeel Butt
  2025-02-28 20:33   ` Yosry Ahmed
  2025-02-28 18:22 ` [PATCH 0/4 v2] cgroup: separate rstat trees Yosry Ahmed
  2025-03-03 15:19 ` Michal Koutný
  5 siblings, 2 replies; 39+ messages in thread
From: inwardvessel @ 2025-02-27 21:55 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

From: JP Kobryn <inwardvessel@gmail.com>

A majority of the cgroup_rstat_cpu struct size is made up of the base
stat entities. Since only the "self" subsystem state makes use of these,
move them into a struct of their own. This allows for a new compact
cgroup_rstat_cpu struct that the formal subsystems can make use of.
Where applicable, decide whether to allocate the compact struct or the
full struct that includes the base stats.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup-defs.h | 37 ++++++++++++++----------
 kernel/cgroup/rstat.c       | 57 +++++++++++++++++++++++++------------
 2 files changed, 61 insertions(+), 33 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1598e1389615..b0a07c63fd46 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -170,7 +170,10 @@ struct cgroup_subsys_state {
 	struct percpu_ref refcnt;
 
 	/* per-cpu recursive resource statistics */
-	struct cgroup_rstat_cpu __percpu *rstat_cpu;
+	union {
+		struct cgroup_rstat_cpu __percpu *rstat_cpu;
+		struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
+	};
 
 	/*
 	 * siblings list anchored at the parent's ->children
@@ -356,6 +359,24 @@ struct cgroup_base_stat {
  * resource statistics on top of it - bsync, bstat and last_bstat.
  */
 struct cgroup_rstat_cpu {
+	/*
+	 * Child cgroups with stat updates on this cpu since the last read
+	 * are linked on the parent's ->updated_children through
+	 * ->updated_next.
+	 *
+	 * In addition to being more compact, singly-linked list pointing
+	 * to the cgroup makes it unnecessary for each per-cpu struct to
+	 * point back to the associated cgroup.
+	 *
+	 * Protected by per-cpu cgroup_rstat_cpu_lock.
+	 */
+	struct cgroup_subsys_state *updated_children;	/* terminated by self */
+	struct cgroup_subsys_state *updated_next;		/* NULL if not on list */
+};
+
+struct cgroup_rstat_base_cpu {
+	struct cgroup_rstat_cpu self;
+
 	/*
 	 * ->bsync protects ->bstat.  These are the only fields which get
 	 * updated in the hot path.
@@ -382,20 +403,6 @@ struct cgroup_rstat_cpu {
 	 * deltas to propagate to the per-cpu subtree_bstat.
 	 */
 	struct cgroup_base_stat last_subtree_bstat;
-
-	/*
-	 * Child cgroups with stat updates on this cpu since the last read
-	 * are linked on the parent's ->updated_children through
-	 * ->updated_next.
-	 *
-	 * In addition to being more compact, singly-linked list pointing
-	 * to the cgroup makes it unnecessary for each per-cpu struct to
-	 * point back to the associated cgroup.
-	 *
-	 * Protected by per-cpu cgroup_rstat_cpu_lock.
-	 */
-	struct cgroup_subsys_state *updated_children;	/* terminated by self */
-	struct cgroup_subsys_state *updated_next;		/* NULL if not on list */
 };
 
 struct cgroup_freezer_state {
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index b3eaefc1fd07..c08ebe2f9568 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -24,6 +24,12 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
 	return per_cpu_ptr(css->rstat_cpu, cpu);
 }
 
+static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
+		struct cgroup_subsys_state *css, int cpu)
+{
+	return per_cpu_ptr(css->rstat_base_cpu, cpu);
+}
+
 static inline bool is_base_css(struct cgroup_subsys_state *css)
 {
 	return css->ss == NULL;
@@ -438,17 +444,31 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
 
 	/* the root cgrp's self css has rstat_cpu preallocated */
 	if (!css->rstat_cpu) {
-		css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
-		if (!css->rstat_cpu)
-			return -ENOMEM;
-	}
+		if (is_base_css(css)) {
+			css->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
+			if (!css->rstat_base_cpu)
+				return -ENOMEM;
 
-	/* ->updated_children list is self terminated */
-	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(css, cpu);
+			for_each_possible_cpu(cpu) {
+				struct cgroup_rstat_base_cpu *rstatc;
+
+				rstatc = cgroup_rstat_base_cpu(css, cpu);
+				rstatc->self.updated_children = css;
+				u64_stats_init(&rstatc->bsync);
+			}
+		} else {
+			css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+			if (!css->rstat_cpu)
+				return -ENOMEM;
+
+			for_each_possible_cpu(cpu) {
+				struct cgroup_rstat_cpu *rstatc;
+
+				rstatc = cgroup_rstat_cpu(css, cpu);
+				rstatc->updated_children = css;
+			}
+		}
 
-		rstatc->updated_children = css;
-		u64_stats_init(&rstatc->bsync);
 	}
 
 	return 0;
@@ -522,9 +542,10 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
-	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(&cgrp->self, cpu);
+	struct cgroup_rstat_base_cpu *rstatc = cgroup_rstat_base_cpu(
+			&cgrp->self, cpu);
 	struct cgroup *parent = cgroup_parent(cgrp);
-	struct cgroup_rstat_cpu *prstatc;
+	struct cgroup_rstat_base_cpu *prstatc;
 	struct cgroup_base_stat delta;
 	unsigned seq;
 
@@ -552,25 +573,25 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 
 		delta = rstatc->subtree_bstat;
-		prstatc = cgroup_rstat_cpu(&parent->self, cpu);
+		prstatc = cgroup_rstat_base_cpu(&parent->self, cpu);
 		cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
 		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
 		cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
 	}
 }
 
-static struct cgroup_rstat_cpu *
+static struct cgroup_rstat_base_cpu *
 cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct cgroup_rstat_base_cpu *rstatc;
 
-	rstatc = get_cpu_ptr(cgrp->self.rstat_cpu);
+	rstatc = get_cpu_ptr(cgrp->self.rstat_base_cpu);
 	*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
 	return rstatc;
 }
 
 static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
-						 struct cgroup_rstat_cpu *rstatc,
+						 struct cgroup_rstat_base_cpu *rstatc,
 						 unsigned long flags)
 {
 	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
@@ -580,7 +601,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
 
 void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct cgroup_rstat_base_cpu *rstatc;
 	unsigned long flags;
 
 	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
@@ -591,7 +612,7 @@ void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
 void __cgroup_account_cputime_field(struct cgroup *cgrp,
 				    enum cpu_usage_stat index, u64 delta_exec)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct cgroup_rstat_base_cpu *rstatc;
 	unsigned long flags;
 
 	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
-- 
2.43.5



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-02-27 21:55 ` [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state inwardvessel
@ 2025-02-27 22:43   ` Shakeel Butt
  2025-02-28 19:04   ` Yosry Ahmed
  2025-03-03 15:20   ` Michal Koutný
  2 siblings, 0 replies; 39+ messages in thread
From: Shakeel Butt @ 2025-02-27 22:43 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, yosryahmed, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> Each cgroup owns rstat pointers. This means that a tree of pending rstat
> updates can contain changes from different subsystems. Because of this
> arrangement, when one subsystem is flushed via the public api
> cgroup_rstat_flush(), all other subsystems with pending updates will
> also be flushed. Remove the rstat pointers from the cgroup and instead
> give them to each cgroup_subsys_state. Separate rstat trees will now
> exist for each unique subsystem. This separation allows for subsystems
> to make updates and flushes without the side effects of other
> subsystems. i.e. flushing the cpu stats does not cause the memory stats
> to be flushed and vice versa. The change in pointer ownership from
> cgroup to cgroup_subsys_state allows for direct flushing of the css, so
> the rcu list management entities and operations previously tied to the
> cgroup which were used for managing a list of subsystem states with
> pending flushes are removed. In terms of client code, public api calls
> were changed to now accept a reference to the cgroup_subsys_state so
> that when flushing or updating, a specific subsystem is associated with
> the call.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

One nit: add a couple of lines to the commit message on why removing the
padding from struct cgroup is fine. Most probably the original reason for
adding the padding is no longer valid.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
@ 2025-02-27 22:52   ` Shakeel Butt
  2025-02-28 16:07     ` JP Kobryn
  2025-02-28 17:37   ` JP Kobryn
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Shakeel Butt @ 2025-02-27 22:52 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, yosryahmed, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> Let the existing locks be dedicated to the base stats and rename them as
> such. Also add new rstat locks for each enabled subsystem. When handling
> cgroup subsystem states, distinguish between formal subsystems (memory,
> io, etc) and the base stats subsystem state (represented by
> cgroup::self) to decide on which locks to take. This change is made to
> prevent contention between subsystems when updating/flushing stats.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

Couple of nits below otherwise:

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

> ---
>  kernel/cgroup/rstat.c | 93 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 72 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 88908ef9212d..b3eaefc1fd07 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,8 +9,12 @@
>  
>  #include <trace/events/cgroup.h>
>  
> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
> -static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> +static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
> +
> +static spinlock_t cgroup_rstat_subsys_lock[CGROUP_SUBSYS_COUNT];
> +static DEFINE_PER_CPU(raw_spinlock_t,
> +		cgroup_rstat_subsys_cpu_lock[CGROUP_SUBSYS_COUNT]);
>  

The names of these locks are too long and you had to go multi-line.
Please just shorten the names. These are local to this file, so maybe
you can drop cgroup_rstat_ from them, or drop cgroup_ and keep the rstat.
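
For illustration, shorter names could look something like this (the exact
names here are just a sketch, not taken from the series):

static DEFINE_SPINLOCK(rstat_base_lock);
static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);

static spinlock_t rstat_subsys_lock[CGROUP_SUBSYS_COUNT];
static DEFINE_PER_CPU(raw_spinlock_t, rstat_subsys_cpu_lock[CGROUP_SUBSYS_COUNT]);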

>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>  
> @@ -20,8 +24,13 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
>  	return per_cpu_ptr(css->rstat_cpu, cpu);
>  }
>  
> +static inline bool is_base_css(struct cgroup_subsys_state *css)
> +{
> +	return css->ss == NULL;
> +}
> +
>  /*
> - * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
> + * Helper functions for rstat per CPU locks.
>   *
>   * This makes it easier to diagnose locking issues and contention in
>   * production environments. The parameter @fast_path determine the
> @@ -36,12 +45,12 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>  	bool contended;
>  
>  	/*
> -	 * The _irqsave() is needed because cgroup_rstat_lock is
> -	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> -	 * this lock with the _irq() suffix only disables interrupts on
> -	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> -	 * interrupts on both configurations. The _irqsave() ensures
> -	 * that interrupts are always disabled and later restored.
> +	 * The _irqsave() is needed because the locks used for flushing are
> +	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
> +	 * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
> +	 * kernel. The raw_spinlock_t below disables interrupts on both
> +	 * configurations. The _irqsave() ensures that interrupts are always
> +	 * disabled and later restored.
>  	 */
>  	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>  	if (contended) {
> @@ -87,7 +96,7 @@ __bpf_kfunc void cgroup_rstat_updated(
>  		struct cgroup_subsys_state *css, int cpu)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> +	raw_spinlock_t *cpu_lock;
>  	unsigned long flags;
>  
>  	/*
> @@ -101,6 +110,12 @@ __bpf_kfunc void cgroup_rstat_updated(
>  	if (data_race(cgroup_rstat_cpu(css, cpu)->updated_next))
>  		return;
>  
> +	if (is_base_css(css))
> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
> +	else
> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
> +			css->ss->id;

Use the array index here like cgroup_rstat_subsys_cpu_lock[css->ss->id].
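
That is, assuming the per-cpu array stays statically sized, the lookup
could read (sketch only):

	if (is_base_css(css))
		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
	else
		cpu_lock = per_cpu_ptr(&cgroup_rstat_subsys_cpu_lock[css->ss->id], cpu);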

> +
>  	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
>  
>  	/* put @css and all ancestors on the corresponding updated lists */
> @@ -208,11 +223,17 @@ static struct cgroup_subsys_state *cgroup_rstat_updated_list(
>  		struct cgroup_subsys_state *root, int cpu)
>  {
>  	struct cgroup *cgrp = root->cgroup;
> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
>  	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
>  	struct cgroup_subsys_state *head = NULL, *parent, *child;
> +	raw_spinlock_t *cpu_lock;
>  	unsigned long flags;
>  
> +	if (is_base_css(root))
> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
> +	else
> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
> +			root->ss->id;

Same here.

> +
>  	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
>  
>  	/* Return NULL if this subtree is not on-list */
> @@ -315,7 +336,7 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>  	struct cgroup *cgrp = css->cgroup;
>  	int cpu;
>  
> -	lockdep_assert_held(&cgroup_rstat_lock);
> +	lockdep_assert_held(&lock);
>  
>  	for_each_possible_cpu(cpu) {
>  		struct cgroup_subsys_state *pos;
> @@ -356,12 +377,18 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>  __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>  
>  	might_sleep();
>  
> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
> +	__cgroup_rstat_lock(lock, cgrp, -1);
> +	cgroup_rstat_flush_locked(css, lock);
> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>  }
>  
>  /**
> @@ -376,10 +403,16 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>  void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>  
>  	might_sleep();
> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
> +	__cgroup_rstat_lock(lock, cgrp, -1);
> +	cgroup_rstat_flush_locked(css, lock);
>  }
>  
>  /**
> @@ -389,7 +422,14 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>  void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
> +
> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>  }
>  
>  int cgroup_rstat_init(struct cgroup_subsys_state *css)
> @@ -435,10 +475,21 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
>  
>  void __init cgroup_rstat_boot(void)
>  {
> -	int cpu;
> +	struct cgroup_subsys *ss;
> +	int cpu, ssid;
>  
> -	for_each_possible_cpu(cpu)
> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> +	for_each_subsys(ss, ssid) {
> +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> +	}
> +
> +	for_each_possible_cpu(cpu) {
> +		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
> +
> +		for_each_subsys(ss, ssid) {
> +			raw_spin_lock_init(
> +					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);

Same here.

> +		}
> +	}
>  }
>  
>  /*
> -- 
> 2.43.5
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats
  2025-02-27 21:55 ` [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats inwardvessel
@ 2025-02-27 23:01   ` Shakeel Butt
  2025-02-28 20:33   ` Yosry Ahmed
  1 sibling, 0 replies; 39+ messages in thread
From: Shakeel Butt @ 2025-02-27 23:01 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, yosryahmed, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Thu, Feb 27, 2025 at 01:55:43PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> A majority of the cgroup_rstat_cpu struct size is made up of the base
> stat entities. Since only the "self" subsystem state makes use of these,
> move them into a struct of their own. This allows for a new compact
> cgroup_rstat_cpu struct that the formal subsystems can make use of.
> Where applicable, decide on whether to allocate the compact or full
> struct including the base stats.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

One nit below otherwise:

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

[...]
> @@ -438,17 +444,31 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
>  
>  	/* the root cgrp's self css has rstat_cpu preallocated */
>  	if (!css->rstat_cpu) {

Early return here for root to eliminate one indent in this function.
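
For example (just a sketch of the early return):

	/* the root cgrp's self css has rstat_cpu preallocated */
	if (css->rstat_cpu)
		return 0;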

> -		css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
> -		if (!css->rstat_cpu)
> -			return -ENOMEM;
> -	}
> +		if (is_base_css(css)) {
> +			css->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
> +			if (!css->rstat_base_cpu)
> +				return -ENOMEM;
>  
> -	/* ->updated_children list is self terminated */
> -	for_each_possible_cpu(cpu) {
> -		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(css, cpu);
> +			for_each_possible_cpu(cpu) {
> +				struct cgroup_rstat_base_cpu *rstatc;
> +
> +				rstatc = cgroup_rstat_base_cpu(css, cpu);
> +				rstatc->self.updated_children = css;
> +				u64_stats_init(&rstatc->bsync);
> +			}
> +		} else {
> +			css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
> +			if (!css->rstat_cpu)
> +				return -ENOMEM;
> +
> +			for_each_possible_cpu(cpu) {
> +				struct cgroup_rstat_cpu *rstatc;
> +
> +				rstatc = cgroup_rstat_cpu(css, cpu);
> +				rstatc->updated_children = css;
> +			}
> +		}
>  
> -		rstatc->updated_children = css;
> -		u64_stats_init(&rstatc->bsync);
>  	}


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-27 22:52   ` Shakeel Butt
@ 2025-02-28 16:07     ` JP Kobryn
  0 siblings, 0 replies; 39+ messages in thread
From: JP Kobryn @ 2025-02-28 16:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: tj, yosryahmed, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On 2/27/25 2:52 PM, Shakeel Butt wrote:
> On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel wrote:
>> From: JP Kobryn <inwardvessel@gmail.com>
>>
>> Let the existing locks be dedicated to the base stats and rename them as
>> such. Also add new rstat locks for each enabled subsystem. When handling
>> cgroup subsystem states, distinguish between formal subsystems (memory,
>> io, etc) and the base stats subsystem state (represented by
>> cgroup::self) to decide on which locks to take. This change is made to
>> prevent contention between subsystems when updating/flushing stats.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> 
> Couple of nits below otherwise:
> 
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> 
>> ---
>>   kernel/cgroup/rstat.c | 93 +++++++++++++++++++++++++++++++++----------
>>   1 file changed, 72 insertions(+), 21 deletions(-)
>>
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index 88908ef9212d..b3eaefc1fd07 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -9,8 +9,12 @@
>>   
>>   #include <trace/events/cgroup.h>
>>   
>> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
>> -static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>> +static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
>> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
>> +
>> +static spinlock_t cgroup_rstat_subsys_lock[CGROUP_SUBSYS_COUNT];
>> +static DEFINE_PER_CPU(raw_spinlock_t,
>> +		cgroup_rstat_subsys_cpu_lock[CGROUP_SUBSYS_COUNT]);
>>   
> 
> The names of these locks are too long and you had to go multi-line.
> Please just shorten the names. These are local to this file, so maybe
> you can drop cgroup_rstat_ from them, or drop cgroup_ and keep the rstat.

Agreed. Will do in next rev.

> 
>>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>>   
>> @@ -20,8 +24,13 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
>>   	return per_cpu_ptr(css->rstat_cpu, cpu);
>>   }
>>   
>> +static inline bool is_base_css(struct cgroup_subsys_state *css)
>> +{
>> +	return css->ss == NULL;
>> +}
>> +
>>   /*
>> - * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
>> + * Helper functions for rstat per CPU locks.
>>    *
>>    * This makes it easier to diagnose locking issues and contention in
>>    * production environments. The parameter @fast_path determine the
>> @@ -36,12 +45,12 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>>   	bool contended;
>>   
>>   	/*
>> -	 * The _irqsave() is needed because cgroup_rstat_lock is
>> -	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
>> -	 * this lock with the _irq() suffix only disables interrupts on
>> -	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
>> -	 * interrupts on both configurations. The _irqsave() ensures
>> -	 * that interrupts are always disabled and later restored.
>> +	 * The _irqsave() is needed because the locks used for flushing are
>> +	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
>> +	 * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
>> +	 * kernel. The raw_spinlock_t below disables interrupts on both
>> +	 * configurations. The _irqsave() ensures that interrupts are always
>> +	 * disabled and later restored.
>>   	 */
>>   	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>>   	if (contended) {
>> @@ -87,7 +96,7 @@ __bpf_kfunc void cgroup_rstat_updated(
>>   		struct cgroup_subsys_state *css, int cpu)
>>   {
>>   	struct cgroup *cgrp = css->cgroup;
>> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
>> +	raw_spinlock_t *cpu_lock;
>>   	unsigned long flags;
>>   
>>   	/*
>> @@ -101,6 +110,12 @@ __bpf_kfunc void cgroup_rstat_updated(
>>   	if (data_race(cgroup_rstat_cpu(css, cpu)->updated_next))
>>   		return;
>>   
>> +	if (is_base_css(css))
>> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
>> +	else
>> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
>> +			css->ss->id;
> 
> Use the array index here like cgroup_rstat_subsys_cpu_lock[css->ss->id].

Good call. I see the arithmetic may be more appropriate in cases where
the per-cpu array/buffer is dynamic. I'll use the index notation in v3.

> 
>> +
>>   	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
>>   
>>   	/* put @css and all ancestors on the corresponding updated lists */
>> @@ -208,11 +223,17 @@ static struct cgroup_subsys_state *cgroup_rstat_updated_list(
>>   		struct cgroup_subsys_state *root, int cpu)
>>   {
>>   	struct cgroup *cgrp = root->cgroup;
>> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
>>   	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
>>   	struct cgroup_subsys_state *head = NULL, *parent, *child;
>> +	raw_spinlock_t *cpu_lock;
>>   	unsigned long flags;
>>   
>> +	if (is_base_css(root))
>> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
>> +	else
>> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
>> +			root->ss->id;
> 
> Same here.
> 
>> +
>>   	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
>>   
>>   	/* Return NULL if this subtree is not on-list */
>> @@ -315,7 +336,7 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>>   	struct cgroup *cgrp = css->cgroup;
>>   	int cpu;
>>   
>> -	lockdep_assert_held(&cgroup_rstat_lock);
>> +	lockdep_assert_held(&lock);
>>   
>>   	for_each_possible_cpu(cpu) {
>>   		struct cgroup_subsys_state *pos;
>> @@ -356,12 +377,18 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>>   __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>>   {
>>   	struct cgroup *cgrp = css->cgroup;
>> +	spinlock_t *lock;
>> +
>> +	if (is_base_css(css))
>> +		lock = &cgroup_rstat_base_lock;
>> +	else
>> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>>   
>>   	might_sleep();
>>   
>> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
>> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
>> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
>> +	__cgroup_rstat_lock(lock, cgrp, -1);
>> +	cgroup_rstat_flush_locked(css, lock);
>> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>>   }
>>   
>>   /**
>> @@ -376,10 +403,16 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>>   void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>>   {
>>   	struct cgroup *cgrp = css->cgroup;
>> +	spinlock_t *lock;
>> +
>> +	if (is_base_css(css))
>> +		lock = &cgroup_rstat_base_lock;
>> +	else
>> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>>   
>>   	might_sleep();
>> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
>> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
>> +	__cgroup_rstat_lock(lock, cgrp, -1);
>> +	cgroup_rstat_flush_locked(css, lock);
>>   }
>>   
>>   /**
>> @@ -389,7 +422,14 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>>   void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
>>   {
>>   	struct cgroup *cgrp = css->cgroup;
>> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
>> +	spinlock_t *lock;
>> +
>> +	if (is_base_css(css))
>> +		lock = &cgroup_rstat_base_lock;
>> +	else
>> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>> +
>> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>>   }
>>   
>>   int cgroup_rstat_init(struct cgroup_subsys_state *css)
>> @@ -435,10 +475,21 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
>>   
>>   void __init cgroup_rstat_boot(void)
>>   {
>> -	int cpu;
>> +	struct cgroup_subsys *ss;
>> +	int cpu, ssid;
>>   
>> -	for_each_possible_cpu(cpu)
>> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
>> +	for_each_subsys(ss, ssid) {
>> +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
>> +	}
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
>> +
>> +		for_each_subsys(ss, ssid) {
>> +			raw_spin_lock_init(
>> +					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);
> 
> Same here.
> 
>> +		}
>> +	}
>>   }
>>   
>>   /*
>> -- 
>> 2.43.5
>>



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
  2025-02-27 22:52   ` Shakeel Butt
@ 2025-02-28 17:37   ` JP Kobryn
  2025-02-28 19:20   ` Yosry Ahmed
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: JP Kobryn @ 2025-02-28 17:37 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

On 2/27/25 1:55 PM, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> Let the existing locks be dedicated to the base stats and rename them as
> such. Also add new rstat locks for each enabled subsystem. When handling
> cgroup subsystem states, distinguish between formal subsystems (memory,
> io, etc) and the base stats subsystem state (represented by
> cgroup::self) to decide on which locks to take. This change is made to
> prevent contention between subsystems when updating/flushing stats.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>   kernel/cgroup/rstat.c | 93 +++++++++++++++++++++++++++++++++----------
>   1 file changed, 72 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 88908ef9212d..b3eaefc1fd07 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,8 +9,12 @@
>   
>   #include <trace/events/cgroup.h>
>   
> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
> -static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> +static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
> +
> +static spinlock_t cgroup_rstat_subsys_lock[CGROUP_SUBSYS_COUNT];
> +static DEFINE_PER_CPU(raw_spinlock_t,
> +		cgroup_rstat_subsys_cpu_lock[CGROUP_SUBSYS_COUNT]);
>   
>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>   
> @@ -20,8 +24,13 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
>   	return per_cpu_ptr(css->rstat_cpu, cpu);
>   }
>   
> +static inline bool is_base_css(struct cgroup_subsys_state *css)
> +{
> +	return css->ss == NULL;
> +}
> +
>   /*
> - * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
> + * Helper functions for rstat per CPU locks.
>    *
>    * This makes it easier to diagnose locking issues and contention in
>    * production environments. The parameter @fast_path determine the
> @@ -36,12 +45,12 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>   	bool contended;
>   
>   	/*
> -	 * The _irqsave() is needed because cgroup_rstat_lock is
> -	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> -	 * this lock with the _irq() suffix only disables interrupts on
> -	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> -	 * interrupts on both configurations. The _irqsave() ensures
> -	 * that interrupts are always disabled and later restored.
> +	 * The _irqsave() is needed because the locks used for flushing are
> +	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
> +	 * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
> +	 * kernel. The raw_spinlock_t below disables interrupts on both
> +	 * configurations. The _irqsave() ensures that interrupts are always
> +	 * disabled and later restored.
>   	 */
>   	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>   	if (contended) {
> @@ -87,7 +96,7 @@ __bpf_kfunc void cgroup_rstat_updated(
>   		struct cgroup_subsys_state *css, int cpu)
>   {
>   	struct cgroup *cgrp = css->cgroup;
> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> +	raw_spinlock_t *cpu_lock;
>   	unsigned long flags;
>   
>   	/*
> @@ -101,6 +110,12 @@ __bpf_kfunc void cgroup_rstat_updated(
>   	if (data_race(cgroup_rstat_cpu(css, cpu)->updated_next))
>   		return;
>   
> +	if (is_base_css(css))
> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
> +	else
> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
> +			css->ss->id;
> +
>   	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
>   
>   	/* put @css and all ancestors on the corresponding updated lists */
> @@ -208,11 +223,17 @@ static struct cgroup_subsys_state *cgroup_rstat_updated_list(
>   		struct cgroup_subsys_state *root, int cpu)
>   {
>   	struct cgroup *cgrp = root->cgroup;
> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
>   	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
>   	struct cgroup_subsys_state *head = NULL, *parent, *child;
> +	raw_spinlock_t *cpu_lock;
>   	unsigned long flags;
>   
> +	if (is_base_css(root))
> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
> +	else
> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
> +			root->ss->id;
> +
>   	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
>   
>   	/* Return NULL if this subtree is not on-list */
> @@ -315,7 +336,7 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>   	struct cgroup *cgrp = css->cgroup;
>   	int cpu;
>   
> -	lockdep_assert_held(&cgroup_rstat_lock);
> +	lockdep_assert_held(&lock);

I need to remove the ampersand since the variable is already a pointer.
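
i.e. the assertion becomes:

	lockdep_assert_held(lock);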

>   
>   	for_each_possible_cpu(cpu) {
>   		struct cgroup_subsys_state *pos;
> @@ -356,12 +377,18 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>   __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>   {
>   	struct cgroup *cgrp = css->cgroup;
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>   
>   	might_sleep();
>   
> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
> +	__cgroup_rstat_lock(lock, cgrp, -1);
> +	cgroup_rstat_flush_locked(css, lock);
> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>   }
>   
>   /**
> @@ -376,10 +403,16 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>   void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>   {
>   	struct cgroup *cgrp = css->cgroup;
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>   
>   	might_sleep();
> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
> +	__cgroup_rstat_lock(lock, cgrp, -1);
> +	cgroup_rstat_flush_locked(css, lock);
>   }
>   
>   /**
> @@ -389,7 +422,14 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>   void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
>   {
>   	struct cgroup *cgrp = css->cgroup;
> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
> +
> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>   }
>   
>   int cgroup_rstat_init(struct cgroup_subsys_state *css)
> @@ -435,10 +475,21 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
>   
>   void __init cgroup_rstat_boot(void)
>   {
> -	int cpu;
> +	struct cgroup_subsys *ss;
> +	int cpu, ssid;
>   
> -	for_each_possible_cpu(cpu)
> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> +	for_each_subsys(ss, ssid) {
> +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> +	}
> +
> +	for_each_possible_cpu(cpu) {
> +		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
> +
> +		for_each_subsys(ss, ssid) {
> +			raw_spin_lock_init(
> +					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);
> +		}
> +	}
>   }
>   
>   /*



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4 v2] cgroup: separate rstat trees
  2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
                   ` (3 preceding siblings ...)
  2025-02-27 21:55 ` [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats inwardvessel
@ 2025-02-28 18:22 ` Yosry Ahmed
  2025-03-03 15:19 ` Michal Koutný
  5 siblings, 0 replies; 39+ messages in thread
From: Yosry Ahmed @ 2025-02-28 18:22 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed. It seems that over time, the stat-keeping of some subsystems
> has grown in size to the extent that they are noticeably slowing down
> others. This has been most observable in situations where the memory
> controller is enabled. One big area where the issue comes up is system
> telemetry, where programs periodically sample cpu stats. It would be a
> benefit for programs like this if the overhead of having to flush memory
> stats (and others) could be eliminated. It would save cpu cycles for
> existing cpu-based telemetry programs and improve scalability in terms
> of sampling frequency and volume of hosts.
> 
> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> single unified rstat tree of cgroups to having separate trees made up of
> cgroup_subsys_state's. There will be one (per-cpu) tree for the base
> stats (cgroup::self) and one for each enabled subsystem (if it
> implements css_rstat_flush()). In order to do this, the rstat list
> pointers were moved off of the cgroup and onto the css. In the
> transition, these list pointer types were changed to
> cgroup_subsys_state. This allows for rstat trees to now be made up of
> css nodes, where a given tree will only contains css nodes associated
> with a specific subsystem. The rstat api's were changed to accept a
> reference to a cgroup_subsys_state instead of a cgroup. This allows for
> callers to be specific about which stats are being updated/flushed.
> Since separate trees will be in use, the locking scheme was adjusted.
> The global locks were split up in such a way that there are separate
> locks for the base stats (cgroup::self) and each subsystem (memory, io,
> etc). This allows different subsystems (including base stats) to use
> rstat in parallel with no contention.
> 
> Breaking up the unified tree into separate trees eliminates the overhead
> and scalability issue explained in the first section, but comes at the
> expense of using additional memory. In an effort to minimize this
> overhead, a conditional allocation is performed. The cgroup_rstat_cpu
> originally contained the rstat list pointers and the base stat entities.
> This struct was renamed to cgroup_rstat_base_cpu and is only allocated
> when the associated css is cgroup::self. A new compact struct was added
> that only contains the rstat list pointers. When the css is associated
> with an actual subsystem, this compact struct is allocated. With this
> conditional allocation, the change in memory overhead on a per-cpu basis
> before/after is shown below.
> 
> before:
> sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
> 
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 176 bytes
> 
> after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes
> 
> nr_cgroups * (
> 	sizeof(struct cgroup_rstat_base_cpu) +
> 		sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> 	)
> 
> nr_cgroups * (176 + 16 * nr_rstat_controllers)
> 
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
> 
> nr_cgroups * (176 + 16 * 2)
> nr_cgroups * 208 bytes
> 
> With regard to validation, there is a measurable benefit when reading
> stats with this series. A test program was made to loop 1M times while
> reading all four of the files cgroup.stat, cpu.stat, io.stat,
> memory.stat of a given parent cgroup each iteration. This test program
> has been run in the experiments that follow.
> 
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
> and within each child cgroup a process was spawned to frequently update
> the memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). The test program was run alongside these 26 tasks
> in parallel. The results showed a benefit in both time elapsed and perf
> data of the test program.
> 
> time before:
> real    0m44.612s
> user    0m0.567s
> sys     0m43.887s
> 
> perf before:
> 27.02% mem_cgroup_css_rstat_flush
>  6.35% __blkcg_rstat_flush
>  0.06% cgroup_base_stat_cputime_show
> 
> time after:
> real    0m27.125s
> user    0m0.544s
> sys     0m26.491s
> 
> perf after:
> 6.03% mem_cgroup_css_rstat_flush
> 0.37% blkcg_print_stat
> 0.11% cgroup_base_stat_cputime_show
> 
> Another experiment was setup on the same host using a parent cgroup with
> two child cgroups. The same swap and memory max were used as the
> previous experiment. In the two child cgroups, kernel builds were done
> in parallel, each using "-j 20". The perf comparison of the test program
> was very similar to the values in the previous experiment. The time
> comparison is shown below.
> 
> before:
> real    1m2.077s
> user    0m0.784s
> sys     1m0.895s
> 
> after:
> real    0m32.216s
> user    0m0.709s
> sys     0m31.256s

Great results, and I am glad that the series went down from 11 patches
to 4 once we simplified the BPF handling. The added memory overhead
doesn't seem to be concerning (~320KB on a system with 100 cgroups and
100 CPUs).

Nice work.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-02-27 21:55 ` [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state inwardvessel
  2025-02-27 22:43   ` Shakeel Butt
@ 2025-02-28 19:04   ` Yosry Ahmed
  2025-03-01  1:06     ` JP Kobryn
  2025-03-03 15:20   ` Michal Koutný
  2 siblings, 1 reply; 39+ messages in thread
From: Yosry Ahmed @ 2025-02-28 19:04 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> Each cgroup owns rstat pointers. This means that a tree of pending rstat
> updates can contain changes from different subsystems. Because of this
> arrangement, when one subsystem is flushed via the public api
> cgroup_rstat_flush(), all other subsystems with pending updates will
> also be flushed. Remove the rstat pointers from the cgroup and instead
> give them to each cgroup_subsys_state. Separate rstat trees will now
> exist for each unique subsystem. This separation allows for subsystems
> to make updates and flushes without the side effects of other
> subsystems. i.e. flushing the cpu stats does not cause the memory stats
> to be flushed and vice versa. The change in pointer ownership from
> cgroup to cgroup_subsys_state allows for direct flushing of the css, so
> the rcu list management entities and operations previously tied to the
> cgroup which were used for managing a list of subsystem states with
> pending flushes are removed. In terms of client code, public api calls
> were changed to now accept a reference to the cgroup_subsys_state so
> that when flushing or updating, a specific subsystem is associated with
> the call.

I think the subject is misleading. It makes it seem like this is a
refactoring patch that is only moving a member from one struct to
another, but this is actually the core of the series.

Maybe something like "cgroup: use separate rstat trees for different
subsystems"?

Also, breaking down the commit message into paragraphs helps with
readability.

[..]
> @@ -386,8 +394,8 @@ struct cgroup_rstat_cpu {
>  	 *
>  	 * Protected by per-cpu cgroup_rstat_cpu_lock.
>  	 */
> -	struct cgroup *updated_children;	/* terminated by self cgroup */
> -	struct cgroup *updated_next;		/* NULL iff not on the list */
> +	struct cgroup_subsys_state *updated_children;	/* terminated by self */
> +	struct cgroup_subsys_state *updated_next;		/* NULL if not on list */

nit: comment indentation needs fixing here

>  };
>  
>  struct cgroup_freezer_state {
[..]  
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index afc665b7b1fe..31b3bfebf7ba 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -164,7 +164,9 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
>  static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);

Should we rename cgrp_dfl_root_rstat_cpu to indicate that it's specific
to self css?

>  
>  /* the default hierarchy */
> -struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
> +struct cgroup_root cgrp_dfl_root = {
> +	.cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu
> +};
>  EXPORT_SYMBOL_GPL(cgrp_dfl_root);
>  
>  /*
[..]
> @@ -5407,7 +5401,11 @@ static void css_free_rwork_fn(struct work_struct *work)
>  		struct cgroup_subsys_state *parent = css->parent;
>  		int id = css->id;
>  
> +		if (css->ss->css_rstat_flush)
> +			cgroup_rstat_exit(css);
> +
>  		ss->css_free(css);
> +

nit: extra blank line here

>  		cgroup_idr_remove(&ss->css_idr, id);
>  		cgroup_put(cgrp);
>  
> @@ -5431,7 +5429,7 @@ static void css_free_rwork_fn(struct work_struct *work)
>  			cgroup_put(cgroup_parent(cgrp));
>  			kernfs_put(cgrp->kn);
>  			psi_cgroup_free(cgrp);
> -			cgroup_rstat_exit(cgrp);
> +			cgroup_rstat_exit(&cgrp->self);
>  			kfree(cgrp);
>  		} else {
>  			/*
> @@ -5459,11 +5457,7 @@ static void css_release_work_fn(struct work_struct *work)
>  	if (ss) {
>  		struct cgroup *parent_cgrp;
>  
> -		/* css release path */
> -		if (!list_empty(&css->rstat_css_node)) {
> -			cgroup_rstat_flush(cgrp);
> -			list_del_rcu(&css->rstat_css_node);
> -		}
> +		cgroup_rstat_flush(css);

Here we used to call cgroup_rstat_flush() only if there was a
css_rstat_flush() callback registered; now we call it unconditionally.

Could this cause a NULL dereference when we try to call
css->ss->css_rstat_flush() for controllers that did not register a
callback?
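
A minimal way to keep the old behavior would be to guard the call, e.g.
(sketch):

		if (css->ss->css_rstat_flush)
			cgroup_rstat_flush(css);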

>  
>  		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
>  		if (ss->css_released)
[..]
> @@ -6188,6 +6186,9 @@ int __init cgroup_init(void)
>  			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
>  						   GFP_KERNEL);
>  			BUG_ON(css->id < 0);
> +
> +			if (css->ss && css->ss->css_rstat_flush)
> +				BUG_ON(cgroup_rstat_init(css));

Why do we need this call here? We already call cgroup_rstat_init() in
cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will
have already called cgroup_init_subsys() in cgroup_init_early().

Did I miss something?

>  		} else {
>  			cgroup_init_subsys(ss, false);
>  		}
[..]
> @@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>  }
>  
>  /* see cgroup_rstat_flush() */
> -static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> +static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
>  	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
>  {
> +	struct cgroup *cgrp = css->cgroup;
>  	int cpu;
>  
>  	lockdep_assert_held(&cgroup_rstat_lock);
>  
>  	for_each_possible_cpu(cpu) {
> -		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
> +		struct cgroup_subsys_state *pos;
>  
> +		pos = cgroup_rstat_updated_list(css, cpu);
>  		for (; pos; pos = pos->rstat_flush_next) {
> -			struct cgroup_subsys_state *css;
> +			if (!pos->ss)
> +				cgroup_base_stat_flush(pos->cgroup, cpu);
> +			else
> +				pos->ss->css_rstat_flush(pos, cpu);
>  
> -			cgroup_base_stat_flush(pos, cpu);
> -			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> -
> -			rcu_read_lock();
> -			list_for_each_entry_rcu(css, &pos->rstat_css_list,
> -						rstat_css_node)
> -				css->ss->css_rstat_flush(css, cpu);
> -			rcu_read_unlock();
> +			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);

We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
Otherwise we will call BPF rstat flush whenever any subsystem is
flushed.
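
In other words, something along these lines (sketch based on the hunk
above):

			if (!pos->ss) {
				cgroup_base_stat_flush(pos->cgroup, cpu);
				bpf_rstat_flush(pos->cgroup,
						cgroup_parent(pos->cgroup), cpu);
			} else {
				pos->ss->css_rstat_flush(pos, cpu);
			}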

I guess it's because BPF can now pass any subsystem to
cgroup_rstat_flush(), and we don't keep track. I think it would be
better if we do not allow BPF programs to select a css and always make
them flush the self css.

We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().
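
E.g. a wrapper kfunc along these lines (the name and the kfunc
registration plumbing are assumptions, just a sketch):

__bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
{
	cgroup_rstat_flush(&cgrp->self);
}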

But if the plan is to remove the bpf_rstat_flush() call here soon then
it's probably not worth the hassle.

Shakeel (and others), WDYT?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
  2025-02-27 22:52   ` Shakeel Butt
  2025-02-28 17:37   ` JP Kobryn
@ 2025-02-28 19:20   ` Yosry Ahmed
  2025-03-06 21:47     ` JP Kobryn
  2025-03-01 23:00   ` kernel test robot
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Yosry Ahmed @ 2025-02-28 19:20 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> Let the existing locks be dedicated to the base stats and rename them as
> such. Also add new rstat locks for each enabled subsystem. When handling
> cgroup subsystem states, distinguish between formal subsystems (memory,
> io, etc) and the base stats subsystem state (represented by
> cgroup::self) to decide on which locks to take. This change is made to
> prevent contention between subsystems when updating/flushing stats.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>  kernel/cgroup/rstat.c | 93 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 72 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 88908ef9212d..b3eaefc1fd07 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,8 +9,12 @@
>  
>  #include <trace/events/cgroup.h>
>  
> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
> -static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> +static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
> +
> +static spinlock_t cgroup_rstat_subsys_lock[CGROUP_SUBSYS_COUNT];
> +static DEFINE_PER_CPU(raw_spinlock_t,
> +		cgroup_rstat_subsys_cpu_lock[CGROUP_SUBSYS_COUNT]);
>  
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>  
> @@ -20,8 +24,13 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
>  	return per_cpu_ptr(css->rstat_cpu, cpu);
>  }
>  
> +static inline bool is_base_css(struct cgroup_subsys_state *css)
> +{
> +	return css->ss == NULL;
> +}
> +
>  /*
> - * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
> + * Helper functions for rstat per CPU locks.
>   *
>   * This makes it easier to diagnose locking issues and contention in
>   * production environments. The parameter @fast_path determine the
> @@ -36,12 +45,12 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>  	bool contended;
>  
>  	/*
> -	 * The _irqsave() is needed because cgroup_rstat_lock is
> -	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> -	 * this lock with the _irq() suffix only disables interrupts on
> -	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> -	 * interrupts on both configurations. The _irqsave() ensures
> -	 * that interrupts are always disabled and later restored.
> +	 * The _irqsave() is needed because the locks used for flushing are
> +	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
> +	 * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
> +	 * kernel. The raw_spinlock_t below disables interrupts on both
> +	 * configurations. The _irqsave() ensures that interrupts are always
> +	 * disabled and later restored.
>  	 */
>  	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>  	if (contended) {
> @@ -87,7 +96,7 @@ __bpf_kfunc void cgroup_rstat_updated(
>  		struct cgroup_subsys_state *css, int cpu)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> +	raw_spinlock_t *cpu_lock;
>  	unsigned long flags;
>  
>  	/*
> @@ -101,6 +110,12 @@ __bpf_kfunc void cgroup_rstat_updated(
>  	if (data_race(cgroup_rstat_cpu(css, cpu)->updated_next))
>  		return;
>  
> +	if (is_base_css(css))
> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
> +	else
> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
> +			css->ss->id;
> +

Maybe wrap this in a macro or function since it's used more than once.
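
For example, a small helper could centralize the per-cpu lock lookup
(the helper name is illustrative only):

static raw_spinlock_t *cgroup_rstat_cpu_lock_for(
		struct cgroup_subsys_state *css, int cpu)
{
	if (is_base_css(css))
		return per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);

	return per_cpu_ptr(&cgroup_rstat_subsys_cpu_lock[css->ss->id], cpu);
}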

>  	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
>  
>  	/* put @css and all ancestors on the corresponding updated lists */
> @@ -208,11 +223,17 @@ static struct cgroup_subsys_state *cgroup_rstat_updated_list(
>  		struct cgroup_subsys_state *root, int cpu)
>  {
>  	struct cgroup *cgrp = root->cgroup;
> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
>  	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
>  	struct cgroup_subsys_state *head = NULL, *parent, *child;
> +	raw_spinlock_t *cpu_lock;
>  	unsigned long flags;
>  
> +	if (is_base_css(root))
> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
> +	else
> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
> +			root->ss->id;
> +
>  	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
>  
>  	/* Return NULL if this subtree is not on-list */
> @@ -315,7 +336,7 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>  	struct cgroup *cgrp = css->cgroup;
>  	int cpu;
>  
> -	lockdep_assert_held(&cgroup_rstat_lock);
> +	lockdep_assert_held(&lock);
>  
>  	for_each_possible_cpu(cpu) {
>  		struct cgroup_subsys_state *pos;
> @@ -356,12 +377,18 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>  __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];

Same here. Also, instead of passing locks around, can we just pass
the css to __cgroup_rstat_lock/unlock() and have them find the correct
lock and use it directly?
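
E.g. the global lock selection could also live in one place (sketch;
the helper name is made up):

static spinlock_t *cgroup_rstat_lock_for(struct cgroup_subsys_state *css)
{
	if (is_base_css(css))
		return &cgroup_rstat_base_lock;

	return &cgroup_rstat_subsys_lock[css->ss->id];
}

__cgroup_rstat_lock()/__cgroup_rstat_unlock() could then take the css
and look up the right lock internally.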

>  
>  	might_sleep();
>  
> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
> +	__cgroup_rstat_lock(lock, cgrp, -1);
> +	cgroup_rstat_flush_locked(css, lock);
> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>  }
>  
>  /**
> @@ -376,10 +403,16 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>  void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>  
>  	might_sleep();
> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
> +	__cgroup_rstat_lock(lock, cgrp, -1);
> +	cgroup_rstat_flush_locked(css, lock);
>  }
>  
>  /**
> @@ -389,7 +422,14 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>  void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
> +	spinlock_t *lock;
> +
> +	if (is_base_css(css))
> +		lock = &cgroup_rstat_base_lock;
> +	else
> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
> +
> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>  }
>  
>  int cgroup_rstat_init(struct cgroup_subsys_state *css)
> @@ -435,10 +475,21 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
>  
>  void __init cgroup_rstat_boot(void)
>  {
> -	int cpu;
> +	struct cgroup_subsys *ss;
> +	int cpu, ssid;
>  
> -	for_each_possible_cpu(cpu)
> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> +	for_each_subsys(ss, ssid) {
> +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> +	}
> +
> +	for_each_possible_cpu(cpu) {
> +		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
> +
> +		for_each_subsys(ss, ssid) {
> +			raw_spin_lock_init(
> +					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);
> +		}
> +	}
>  }
>  
>  /*
> -- 
> 2.43.5
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats
  2025-02-27 21:55 ` [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats inwardvessel
  2025-02-27 23:01   ` Shakeel Butt
@ 2025-02-28 20:33   ` Yosry Ahmed
  1 sibling, 0 replies; 39+ messages in thread
From: Yosry Ahmed @ 2025-02-28 20:33 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Thu, Feb 27, 2025 at 01:55:43PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> A majority of the cgroup_rstat_cpu struct size is made up of the base
> stat entities. Since only the "self" subsystem state makes use of these,
> move them into a struct of their own. This allows for a new compact
> cgroup_rstat_cpu struct that the formal subsystems can make use of.
> Where applicable, decide on whether to allocate the compact or full
> struct including the base stats.

Mentioning the memory savings in this patch's log would be helpful.

> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>  include/linux/cgroup-defs.h | 37 ++++++++++++++----------
>  kernel/cgroup/rstat.c       | 57 +++++++++++++++++++++++++------------
>  2 files changed, 61 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 1598e1389615..b0a07c63fd46 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -170,7 +170,10 @@ struct cgroup_subsys_state {
>  	struct percpu_ref refcnt;
>  
>  	/* per-cpu recursive resource statistics */
> -	struct cgroup_rstat_cpu __percpu *rstat_cpu;
> +	union {
> +		struct cgroup_rstat_cpu __percpu *rstat_cpu;
> +		struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
> +	};
>  
>  	/*
>  	 * siblings list anchored at the parent's ->children
> @@ -356,6 +359,24 @@ struct cgroup_base_stat {
>   * resource statistics on top of it - bsync, bstat and last_bstat.
>   */
>  struct cgroup_rstat_cpu {
> +	/*
> +	 * Child cgroups with stat updates on this cpu since the last read
> +	 * are linked on the parent's ->updated_children through
> +	 * ->updated_next.
> +	 *
> +	 * In addition to being more compact, singly-linked list pointing
> +	 * to the cgroup makes it unnecessary for each per-cpu struct to
> +	 * point back to the associated cgroup.
> +	 *
> +	 * Protected by per-cpu cgroup_rstat_cpu_lock.

I just noticed that the patch that split the lock should have updated
this comment.

> +	 */
> +	struct cgroup_subsys_state *updated_children;	/* terminated by self */
> +	struct cgroup_subsys_state *updated_next;		/* NULL if not on list */
> +};
> +
> +struct cgroup_rstat_base_cpu {
> +	struct cgroup_rstat_cpu self;

Why 'self'? Why not 'rstat_cpu' like it's named in struct
cgroup_subsys_state?

> +
>  	/*
>  	 * ->bsync protects ->bstat.  These are the only fields which get
>  	 * updated in the hot path.
> @@ -382,20 +403,6 @@ struct cgroup_rstat_cpu {
>  	 * deltas to propagate to the per-cpu subtree_bstat.
>  	 */
>  	struct cgroup_base_stat last_subtree_bstat;
> -
> -	/*
> -	 * Child cgroups with stat updates on this cpu since the last read
> -	 * are linked on the parent's ->updated_children through
> -	 * ->updated_next.
> -	 *
> -	 * In addition to being more compact, singly-linked list pointing
> -	 * to the cgroup makes it unnecessary for each per-cpu struct to
> -	 * point back to the associated cgroup.
> -	 *
> -	 * Protected by per-cpu cgroup_rstat_cpu_lock.
> -	 */
> -	struct cgroup_subsys_state *updated_children;	/* terminated by self */
> -	struct cgroup_subsys_state *updated_next;		/* NULL if not on list */
>  };
>  
>  struct cgroup_freezer_state {
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index b3eaefc1fd07..c08ebe2f9568 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -24,6 +24,12 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
>  	return per_cpu_ptr(css->rstat_cpu, cpu);
>  }
>  
> +static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
> +		struct cgroup_subsys_state *css, int cpu)
> +{
> +	return per_cpu_ptr(css->rstat_base_cpu, cpu);
> +}
> +
>  static inline bool is_base_css(struct cgroup_subsys_state *css)
>  {
>  	return css->ss == NULL;
> @@ -438,17 +444,31 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
>  
>  	/* the root cgrp's self css has rstat_cpu preallocated */
>  	if (!css->rstat_cpu) {
> -		css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
> -		if (!css->rstat_cpu)
> -			return -ENOMEM;
> -	}
> +		if (is_base_css(css)) {
> +			css->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
> +			if (!css->rstat_base_cpu)
> +				return -ENOMEM;
>  
> -	/* ->updated_children list is self terminated */
> -	for_each_possible_cpu(cpu) {
> -		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(css, cpu);
> +			for_each_possible_cpu(cpu) {
> +				struct cgroup_rstat_base_cpu *rstatc;

We should use different variable names for cgroup_rstat_base_cpu and
cgroup_rstat_cpu throughout. Maybe 'brstatc' or 'rstatbc' for the
former?

> +
> +				rstatc = cgroup_rstat_base_cpu(css, cpu);
> +				rstatc->self.updated_children = css;
> +				u64_stats_init(&rstatc->bsync);
> +			}
> +		} else {
> +			css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
> +			if (!css->rstat_cpu)
> +				return -ENOMEM;
> +
> +			for_each_possible_cpu(cpu) {
> +				struct cgroup_rstat_cpu *rstatc;
> +
> +				rstatc = cgroup_rstat_cpu(css, cpu);
> +				rstatc->updated_children = css;
> +			}
> +		}
>  
> -		rstatc->updated_children = css;
> -		u64_stats_init(&rstatc->bsync);

I think there's too much replication here. We can probably do something
like this (untested):

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index de057c992f824..1750a69887a2e 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -443,34 +443,30 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
 	int cpu;
 
 	/* the root cgrp's self css has rstat_cpu preallocated */
-	if (!css->rstat_cpu) {
-		if (is_base_css(css)) {
-			css->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
-			if (!css->rstat_base_cpu)
-				return -ENOMEM;
-
-			for_each_possible_cpu(cpu) {
-				struct cgroup_rstat_base_cpu *rstatc;
+	if (css->rstat_cpu)
+		return 0;
 
-				rstatc = cgroup_rstat_base_cpu(css, cpu);
-				rstatc->self.updated_children = css;
-				u64_stats_init(&rstatc->bsync);
-			}
-		} else {
-			css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
-			if (!css->rstat_cpu)
-				return -ENOMEM;
+	if (is_base_css(css)) {
+		css->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
+		if (!css->rstat_base_cpu)
+			return -ENOMEM;
+	} else {
+		css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		if (!css->rstat_cpu)
+			return -ENOMEM;
+	}
 
-			for_each_possible_cpu(cpu) {
-				struct cgroup_rstat_cpu *rstatc;
+	for_each_possible_cpu(cpu) {
+		struct cgroup_rstat_base_cpu *brstatc = NULL;
+		struct cgroup_rstat_cpu *rstatc;
 
-				rstatc = cgroup_rstat_cpu(css, cpu);
-				rstatc->updated_children = css;
-			}
+		if (is_base_css(css)) {
+			brstatc = cgroup_rstat_base_cpu(css, cpu);
+			u64_stats_init(&brstatc->bsync);
+			rstatc = &brstatc->self;
+		} else {
+			rstatc = cgroup_rstat_cpu(css, cpu);
 		}
-
+		rstatc->updated_children = css;
 	}
-
 	return 0;
 }
 
>  	}
>  
>  	return 0;
> @@ -522,9 +542,10 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
>  
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
>  {
> -	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(&cgrp->self, cpu);
> +	struct cgroup_rstat_base_cpu *rstatc = cgroup_rstat_base_cpu(
> +			&cgrp->self, cpu);
>  	struct cgroup *parent = cgroup_parent(cgrp);
> -	struct cgroup_rstat_cpu *prstatc;
> +	struct cgroup_rstat_base_cpu *prstatc;

Same here, we should use different names than rstatc and prstatc. Same
applies for the rest of the diff.

>  	struct cgroup_base_stat delta;
>  	unsigned seq;
>  
> @@ -552,25 +573,25 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
>  		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
>  
>  		delta = rstatc->subtree_bstat;
> -		prstatc = cgroup_rstat_cpu(&parent->self, cpu);
> +		prstatc = cgroup_rstat_base_cpu(&parent->self, cpu);
>  		cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
>  		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
>  		cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
>  	}
>  }
>  
> -static struct cgroup_rstat_cpu *
> +static struct cgroup_rstat_base_cpu *
>  cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
>  {
> -	struct cgroup_rstat_cpu *rstatc;
> +	struct cgroup_rstat_base_cpu *rstatc;
>  
> -	rstatc = get_cpu_ptr(cgrp->self.rstat_cpu);
> +	rstatc = get_cpu_ptr(cgrp->self.rstat_base_cpu);
>  	*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
>  	return rstatc;
>  }
>  
>  static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
> -						 struct cgroup_rstat_cpu *rstatc,
> +						 struct cgroup_rstat_base_cpu *rstatc,
>  						 unsigned long flags)
>  {
>  	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
> @@ -580,7 +601,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
>  
>  void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
>  {
> -	struct cgroup_rstat_cpu *rstatc;
> +	struct cgroup_rstat_base_cpu *rstatc;
>  	unsigned long flags;
>  
>  	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
> @@ -591,7 +612,7 @@ void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
>  void __cgroup_account_cputime_field(struct cgroup *cgrp,
>  				    enum cpu_usage_stat index, u64 delta_exec)
>  {
> -	struct cgroup_rstat_cpu *rstatc;
> +	struct cgroup_rstat_base_cpu *rstatc;
>  	unsigned long flags;
>  
>  	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
> -- 
> 2.43.5
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-02-28 19:04   ` Yosry Ahmed
@ 2025-03-01  1:06     ` JP Kobryn
  2025-03-01  1:25       ` Yosry Ahmed
  0 siblings, 1 reply; 39+ messages in thread
From: JP Kobryn @ 2025-03-01  1:06 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On 2/28/25 11:04 AM, Yosry Ahmed wrote:
> On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel wrote:
>> From: JP Kobryn <inwardvessel@gmail.com>
>>
>> Each cgroup owns rstat pointers. This means that a tree of pending rstat
>> updates can contain changes from different subsystems. Because of this
>> arrangement, when one subsystem is flushed via the public api
>> cgroup_rstat_flush(), all other subsystems with pending updates will
>> also be flushed. Remove the rstat pointers from the cgroup and instead
>> give them to each cgroup_subsys_state. Separate rstat trees will now
>> exist for each unique subsystem. This separation allows for subsystems
>> to make updates and flushes without the side effects of other
>> subsystems. i.e. flushing the cpu stats does not cause the memory stats
>> to be flushed and vice versa. The change in pointer ownership from
>> cgroup to cgroup_subsys_state allows for direct flushing of the css, so
>> the rcu list management entities and operations previously tied to the
>> cgroup which were used for managing a list of subsystem states with
>> pending flushes are removed. In terms of client code, public api calls
>> were changed to now accept a reference to the cgroup_subsys_state so
>> that when flushing or updating, a specific subsystem is associated with
>> the call.
> 
> I think the subject is misleading. It makes it seem like this is a
> refactoring patch that is only moving a member from one struct to
> another, but this is actually the core of the series.
> 
> Maybe something like "cgroup: use separate rstat trees for different
> subsystems"?
> 
> Also, breaking down the commit message into paragraphs helps with
> readability.

Makes sense. Will adjust in next rev.
> 
> [..]
>> @@ -386,8 +394,8 @@ struct cgroup_rstat_cpu {
>>   	 *
>>   	 * Protected by per-cpu cgroup_rstat_cpu_lock.
>>   	 */
>> -	struct cgroup *updated_children;	/* terminated by self cgroup */
>> -	struct cgroup *updated_next;		/* NULL iff not on the list */
>> +	struct cgroup_subsys_state *updated_children;	/* terminated by self */
>> +	struct cgroup_subsys_state *updated_next;		/* NULL if not on list */
> 
> nit: comment indentation needs fixing here
> 
>>   };
>>   
>>   struct cgroup_freezer_state {
> [..]
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index afc665b7b1fe..31b3bfebf7ba 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -164,7 +164,9 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
>>   static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
> 
> Should we rename cgrp_dfl_root_rstat_cpu to indicate that it's specific
> to self css?

Sure.

> 
>>   
>>   /* the default hierarchy */
>> -struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
>> +struct cgroup_root cgrp_dfl_root = {
>> +	.cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu
>> +};
>>   EXPORT_SYMBOL_GPL(cgrp_dfl_root);
>>   
>>   /*
> [..]
>> @@ -5407,7 +5401,11 @@ static void css_free_rwork_fn(struct work_struct *work)
>>   		struct cgroup_subsys_state *parent = css->parent;
>>   		int id = css->id;
>>   
>> +		if (css->ss->css_rstat_flush)
>> +			cgroup_rstat_exit(css);
>> +
>>   		ss->css_free(css);
>> +
> 
> nit: extra blank line here
> 
>>   		cgroup_idr_remove(&ss->css_idr, id);
>>   		cgroup_put(cgrp);
>>   
>> @@ -5431,7 +5429,7 @@ static void css_free_rwork_fn(struct work_struct *work)
>>   			cgroup_put(cgroup_parent(cgrp));
>>   			kernfs_put(cgrp->kn);
>>   			psi_cgroup_free(cgrp);
>> -			cgroup_rstat_exit(cgrp);
>> +			cgroup_rstat_exit(&cgrp->self);
>>   			kfree(cgrp);
>>   		} else {
>>   			/*
>> @@ -5459,11 +5457,7 @@ static void css_release_work_fn(struct work_struct *work)
>>   	if (ss) {
>>   		struct cgroup *parent_cgrp;
>>   
>> -		/* css release path */
>> -		if (!list_empty(&css->rstat_css_node)) {
>> -			cgroup_rstat_flush(cgrp);
>> -			list_del_rcu(&css->rstat_css_node);
>> -		}
>> +		cgroup_rstat_flush(css);
> 
> Here we used to call cgroup_rstat_flush() only if there was a
> css_rstat_flush() callback registered, now we call it unconditionally.
> 
> Could this cause a NULL dereference when we try to call
> css->ss->css_rstat_flush() for controllers that did not register a
> callback?

Good point. Misuse here can lead to a bad access. Will cover in v3.
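
For illustration, a minimal guard along the lines of the old
rstat_css_node check could be (just a sketch; the exact fix for v3 is
still open):

	/* css release path: only flush subsystems that participate in rstat */
	if (css->ss->css_rstat_flush)
		cgroup_rstat_flush(css);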

> 
>>   
>>   		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
>>   		if (ss->css_released)
> [..]
>> @@ -6188,6 +6186,9 @@ int __init cgroup_init(void)
>>   			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
>>   						   GFP_KERNEL);
>>   			BUG_ON(css->id < 0);
>> +
>> +			if (css->ss && css->ss->css_rstat_flush)
>> +				BUG_ON(cgroup_rstat_init(css));
> 
> Why do we need this call here? We already call cgroup_rstat_init() in
> cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will
> have already called cgroup_init_subsys() in cgroup_init_early().
> 
> Did I miss something?

Hmmm it's a good question. cgroup_rstat_init() is deferred in the same
way that cgroup_idr_alloc() is. So for ss with early_init == true,
cgroup_rstat_init() is not called during cgroup_init_early().

Is it safe to call alloc_percpu() during early boot? If so, the
deferral can be removed and cgroup_rstat_init() can be called in one
place.

> 
>>   		} else {
>>   			cgroup_init_subsys(ss, false);
>>   		}
> [..]
>> @@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>>   }
>>   
>>   /* see cgroup_rstat_flush() */
>> -static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>> +static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
>>   	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
>>   {
>> +	struct cgroup *cgrp = css->cgroup;
>>   	int cpu;
>>   
>>   	lockdep_assert_held(&cgroup_rstat_lock);
>>   
>>   	for_each_possible_cpu(cpu) {
>> -		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
>> +		struct cgroup_subsys_state *pos;
>>   
>> +		pos = cgroup_rstat_updated_list(css, cpu);
>>   		for (; pos; pos = pos->rstat_flush_next) {
>> -			struct cgroup_subsys_state *css;
>> +			if (!pos->ss)
>> +				cgroup_base_stat_flush(pos->cgroup, cpu);
>> +			else
>> +				pos->ss->css_rstat_flush(pos, cpu);
>>   
>> -			cgroup_base_stat_flush(pos, cpu);
>> -			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
>> -
>> -			rcu_read_lock();
>> -			list_for_each_entry_rcu(css, &pos->rstat_css_list,
>> -						rstat_css_node)
>> -				css->ss->css_rstat_flush(css, cpu);
>> -			rcu_read_unlock();
>> +			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
> 
> We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
> Otherwise we will call BPF rstat flush whenever any subsystem is
> flushed.
> 
> I guess it's because BPF can now pass any subsystem to
> cgroup_rstat_flush(), and we don't keep track. I think it would be
> better if we do not allow BPF programs to select a css and always make
> them flush the self css.
> 
> We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
> in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().

I'm fine with this if others are in agreement. A similar concept was
done in v1.
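
For reference, a minimal sketch of such a wrapper (using the
bpf_cgroup_rstat_flush() name suggested above) would be something like:

	/* BPF-facing wrapper: always flush the cgroup's self css */
	void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
	{
		cgroup_rstat_flush(&cgrp->self);
	}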

> 
> But if the plan is to remove the bpf_rstat_flush() call here soon then
> it's probably not worth the hassle.
> 
> Shakeel (and others), WDYT?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-03-01  1:06     ` JP Kobryn
@ 2025-03-01  1:25       ` Yosry Ahmed
  2025-03-01  1:30         ` JP Kobryn
  2025-03-03 18:18         ` Shakeel Butt
  0 siblings, 2 replies; 39+ messages in thread
From: Yosry Ahmed @ 2025-03-01  1:25 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Fri, Feb 28, 2025 at 05:06:23PM -0800, JP Kobryn wrote:
[..] 
> > 
> > >   		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
> > >   		if (ss->css_released)
> > [..]
> > > @@ -6188,6 +6186,9 @@ int __init cgroup_init(void)
> > >   			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
> > >   						   GFP_KERNEL);
> > >   			BUG_ON(css->id < 0);
> > > +
> > > +			if (css->ss && css->ss->css_rstat_flush)
> > > +				BUG_ON(cgroup_rstat_init(css));
> > 
> > Why do we need this call here? We already call cgroup_rstat_init() in
> > cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will
> > have already called cgroup_init_subsys() in cgroup_init_early().
> > 
> > Did I miss something?
> 
> Hmmm it's a good question. cgroup_rstat_init() is deferred in the same
> way that cgroup_idr_alloc() is. So for ss with early_init == true,
> cgroup_rstat_init() is not called during cgroup_init_early().

Oh I didn't realize that the call here is only when early_init == false.
I think we need a comment to clarify that cgroup_idr_alloc() and
cgroup_rstat_init() are not called in cgroup_init_subsys() when
early_init == true, and hence need to be called in cgroup_init().

Or maybe organize the code in a way to make this more obvious (put them
in a helper with a descriptive name? idk).
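
Something like this, as a rough sketch (helper name and exact placement
hypothetical):

	/*
	 * cgroup_idr_alloc() and cgroup_rstat_init() are skipped in
	 * cgroup_init_subsys() when ss->early_init is set, so finish
	 * them here from cgroup_init().
	 */
	static void __init cgroup_late_init_subsys(struct cgroup_subsys *ss)
	{
		struct cgroup_subsys_state *css = init_css_set.subsys[ss->id];

		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
		BUG_ON(css->id < 0);

		if (ss->css_rstat_flush)
			BUG_ON(cgroup_rstat_init(css));
	}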

> 
> Is it safe to call alloc_percpu() during early boot? If so, the
> deferral can be removed and cgroup_rstat_init() can be called in one
> place.

I don't think so. cgroup_init_early() is called before
setup_per_cpu_areas().

> 
> > 
> > >   		} else {
> > >   			cgroup_init_subsys(ss, false);
> > >   		}
> > [..]
> > > @@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> > >   }
> > >   /* see cgroup_rstat_flush() */
> > > -static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> > > +static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
> > >   	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
> > >   {
> > > +	struct cgroup *cgrp = css->cgroup;
> > >   	int cpu;
> > >   	lockdep_assert_held(&cgroup_rstat_lock);
> > >   	for_each_possible_cpu(cpu) {
> > > -		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
> > > +		struct cgroup_subsys_state *pos;
> > > +		pos = cgroup_rstat_updated_list(css, cpu);
> > >   		for (; pos; pos = pos->rstat_flush_next) {
> > > -			struct cgroup_subsys_state *css;
> > > +			if (!pos->ss)
> > > +				cgroup_base_stat_flush(pos->cgroup, cpu);
> > > +			else
> > > +				pos->ss->css_rstat_flush(pos, cpu);
> > > -			cgroup_base_stat_flush(pos, cpu);
> > > -			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> > > -
> > > -			rcu_read_lock();
> > > -			list_for_each_entry_rcu(css, &pos->rstat_css_list,
> > > -						rstat_css_node)
> > > -				css->ss->css_rstat_flush(css, cpu);
> > > -			rcu_read_unlock();
> > > +			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
> > 
> > We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
> > Otherwise we will call BPF rstat flush whenever any subsystem is
> > flushed.
> > 
> > I guess it's because BPF can now pass any subsystem to
> > cgroup_rstat_flush(), and we don't keep track. I think it would be
> > better if we do not allow BPF programs to select a css and always make
> > them flush the self css.
> > 
> > We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
> > in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().
> 
> I'm fine with this if others are in agreement. A similar concept was
> done in v1.

Let's wait for Shakeel to chime in here since he suggested removing this
hook, but I am not sure if he intended to actually do it or not. Better
not to waste effort if this will be gone soon anyway.

> 
> > 
> > But if the plan is to remove the bpf_rstat_flush() call here soon then
> > it's probably not worth the hassle.
> > 
> > Shakeel (and others), WDYT?
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-03-01  1:25       ` Yosry Ahmed
@ 2025-03-01  1:30         ` JP Kobryn
  2025-03-03 18:18         ` Shakeel Butt
  1 sibling, 0 replies; 39+ messages in thread
From: JP Kobryn @ 2025-03-01  1:30 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On 2/28/25 5:25 PM, Yosry Ahmed wrote:
> On Fri, Feb 28, 2025 at 05:06:23PM -0800, JP Kobryn wrote:
> [..]
>>>
>>>>    		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
>>>>    		if (ss->css_released)
>>> [..]
>>>> @@ -6188,6 +6186,9 @@ int __init cgroup_init(void)
>>>>    			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
>>>>    						   GFP_KERNEL);
>>>>    			BUG_ON(css->id < 0);
>>>> +
>>>> +			if (css->ss && css->ss->css_rstat_flush)
>>>> +				BUG_ON(cgroup_rstat_init(css));
>>>
>>> Why do we need this call here? We already call cgroup_rstat_init() in
>>> cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will
>>> have already called cgroup_init_subsys() in cgroup_init_early().
>>>
>>> Did I miss something?
>>
>> Hmmm it's a good question. cgroup_rstat_init() is deferred in the same
>> way that cgroup_idr_alloc() is. So for ss with early_init == true,
>> cgroup_rstat_init() is not called during cgroup_init_early().
> 
> Oh I didn't realize that the call here is only when early_init == false.
> I think we need a comment to clarify that cgroup_idr_alloc() and
> cgroup_rstat_init() are not called in cgroup_init_subsys() when
> early_init == true, and hence need to be called in cgroup_init().
> 
> Or maybe organize the code in a way to make this more obvious (put them
> in a helper with a descriptive name? idk).

I see what you're getting at. Let me think of something for v3.

> 
>>
>> Is it safe to call alloc_percpu() during early boot? If so, the
>> deferral can be removed and cgroup_rstat_init() can be called in one
>> place.
> 
> I don't think so. cgroup_init_early() is called before
> setup_per_cpu_areas().

Cool. Thanks for pointing that out.

> 
>>
>>>
>>>>    		} else {
>>>>    			cgroup_init_subsys(ss, false);
>>>>    		}
>>> [..]
>>>> @@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>>>>    }
>>>>    /* see cgroup_rstat_flush() */
>>>> -static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>>>> +static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
>>>>    	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
>>>>    {
>>>> +	struct cgroup *cgrp = css->cgroup;
>>>>    	int cpu;
>>>>    	lockdep_assert_held(&cgroup_rstat_lock);
>>>>    	for_each_possible_cpu(cpu) {
>>>> -		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
>>>> +		struct cgroup_subsys_state *pos;
>>>> +		pos = cgroup_rstat_updated_list(css, cpu);
>>>>    		for (; pos; pos = pos->rstat_flush_next) {
>>>> -			struct cgroup_subsys_state *css;
>>>> +			if (!pos->ss)
>>>> +				cgroup_base_stat_flush(pos->cgroup, cpu);
>>>> +			else
>>>> +				pos->ss->css_rstat_flush(pos, cpu);
>>>> -			cgroup_base_stat_flush(pos, cpu);
>>>> -			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
>>>> -
>>>> -			rcu_read_lock();
>>>> -			list_for_each_entry_rcu(css, &pos->rstat_css_list,
>>>> -						rstat_css_node)
>>>> -				css->ss->css_rstat_flush(css, cpu);
>>>> -			rcu_read_unlock();
>>>> +			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
>>>
>>> We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
>>> Otherwise we will call BPF rstat flush whenever any subsystem is
>>> flushed.
>>>
>>> I guess it's because BPF can now pass any subsystem to
>>> cgroup_rstat_flush(), and we don't keep track. I think it would be
>>> better if we do not allow BPF programs to select a css and always make
>>> them flush the self css.
>>>
>>> We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
>>> in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().
>>
>> I'm fine with this if others are in agreement. A similar concept was
>> done in v1.
> 
> Let's wait for Shakeel to chime in here since he suggested removing this
> hook, but I am not sure if he intended to actually do it or not. Better
> not to waste effort if this will be gone soon anyway.
> 
>>
>>>
>>> But if the plan is to remove the bpf_rstat_flush() call here soon then
>>> it's probably not worth the hassle.
>>>
>>> Shakeel (and others), WDYT?
>>



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
                     ` (2 preceding siblings ...)
  2025-02-28 19:20   ` Yosry Ahmed
@ 2025-03-01 23:00   ` kernel test robot
  2025-03-03 15:22   ` Michal Koutný
  2025-03-03 23:49   ` kernel test robot
  5 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2025-03-01 23:00 UTC (permalink / raw)
  To: inwardvessel, tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm
  Cc: llvm, oe-kbuild-all, linux-mm, cgroups, kernel-team

Hi inwardvessel,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/net]
[also build test ERROR on bpf-next/master bpf/master linus/master v6.14-rc4]
[cannot apply to tj-cgroup/for-next next-20250228]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/inwardvessel/cgroup-move-cgroup_rstat-from-cgroup-to-cgroup_subsys_state/20250228-055819
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net
patch link:    https://lore.kernel.org/r/20250227215543.49928-4-inwardvessel%40gmail.com
patch subject: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
config: hexagon-randconfig-001-20250302 (https://download.01.org/0day-ci/archive/20250302/202503020616.SJVwhuOV-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 14170b16028c087ca154878f5ed93d3089a965c6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250302/202503020616.SJVwhuOV-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503020616.SJVwhuOV-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/cgroup/rstat.c:339:2: error: member reference base type 'spinlock_t *' (aka 'struct spinlock *') is not a structure or union
     339 |         lockdep_assert_held(&lock);
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/lockdep.h:285:17: note: expanded from macro 'lockdep_assert_held'
     285 |         lockdep_assert(lockdep_is_held(l) != LOCK_STATE_NOT_HELD)
         |         ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/lockdep.h:252:52: note: expanded from macro 'lockdep_is_held'
     252 | #define lockdep_is_held(lock)           lock_is_held(&(lock)->dep_map)
         |                                                             ^ ~~~~~~~
   include/linux/lockdep.h:279:32: note: expanded from macro 'lockdep_assert'
     279 |         do { WARN_ON(debug_locks && !(cond)); } while (0)
         |              ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
   include/asm-generic/bug.h:123:25: note: expanded from macro 'WARN_ON'
     123 |         int __ret_warn_on = !!(condition);                              \
         |                                ^~~~~~~~~
   1 error generated.


vim +339 kernel/cgroup/rstat.c

   330	
   331	/* see cgroup_rstat_flush() */
   332	static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
   333			spinlock_t *lock)
   334		__releases(lock) __acquires(lock)
   335	{
   336		struct cgroup *cgrp = css->cgroup;
   337		int cpu;
   338	
 > 339		lockdep_assert_held(&lock);
   340	
   341		for_each_possible_cpu(cpu) {
   342			struct cgroup_subsys_state *pos;
   343	
   344			pos = cgroup_rstat_updated_list(css, cpu);
   345			for (; pos; pos = pos->rstat_flush_next) {
   346				if (!pos->ss)
   347					cgroup_base_stat_flush(pos->cgroup, cpu);
   348				else
   349					pos->ss->css_rstat_flush(pos, cpu);
   350	
   351				bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
   352			}
   353	
   354			/* play nice and yield if necessary */
   355			if (need_resched() || spin_needbreak(lock)) {
   356				__cgroup_rstat_unlock(lock, cgrp, cpu);
   357				if (!cond_resched())
   358					cpu_relax();
   359				__cgroup_rstat_lock(lock, cgrp, cpu);
   360			}
   361		}
   362	}
   363	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4 v2] cgroup: separate rstat trees
  2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
                   ` (4 preceding siblings ...)
  2025-02-28 18:22 ` [PATCH 0/4 v2] cgroup: separate rstat trees Yosry Ahmed
@ 2025-03-03 15:19 ` Michal Koutný
  2025-03-06  1:07   ` JP Kobryn
  5 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2025-03-03 15:19 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team

[-- Attachment #1: Type: text/plain, Size: 4597 bytes --]

Hello JP.

On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed. It seems that over time, the stat-keeping of some subsystems
> has grown in size to the extent that they are noticeably slowing down
> others. This has been most observable in situations where the memory
> controller is enabled. One big area where the issue comes up is system
> telemetry, where programs periodically sample cpu stats. It would be a
> benefit for programs like this if the overhead of having to flush memory
> stats (and others) could be eliminated. It would save cpu cycles for
> existing cpu-based telemetry programs and improve scalability in terms
> of sampling frequency and volume of hosts.
 
> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem".
...

> before:
> sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
> 
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 176 bytes
> 
> after:
...
> nr_cgroups * (176 + 16 * 2)
> nr_cgroups * 208 bytes
 
~ 32B/cgroup/cpu
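
(On the 52-cpu machine used in the experiments below, those extra
32 B/cgroup/cpu come to roughly 32 * 52 ≈ 1.7 kB per cgroup across all
CPUs, i.e. about 1.7 MB for a thousand cgroups.)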

> With regard to validation, there is a measurable benefit when reading
> stats with this series. A test program was made to loop 1M times while
> reading all four of the files cgroup.stat, cpu.stat, io.stat,
> memory.stat of a given parent cgroup each iteration. This test program
> has been run in the experiments that follow.

Thanks for looking into this and running experiments on the behavior of
split rstat trees.

> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
> and within each child cgroup a process was spawned to frequently update
> the memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). The test program was run alongside these 26 tasks
> in parallel. The results showed a benefit in both time elapsed and perf
> data of the test program.
> 
> time before:
> real    0m44.612s
> user    0m0.567s
> sys     0m43.887s
> 
> perf before:
> 27.02% mem_cgroup_css_rstat_flush
>  6.35% __blkcg_rstat_flush
>  0.06% cgroup_base_stat_cputime_show
> 
> time after:
> real    0m27.125s
> user    0m0.544s
> sys     0m26.491s

So this shows that flushing rstat trees one by one (as the test program
reads *.stat) is quicker than flushing all at once (+idle reads of
*.stat).
Interesting, I would not have bet on that at first, but it is convincing
enough to favor the separate trees approach.

> perf after:
> 6.03% mem_cgroup_css_rstat_flush
> 0.37% blkcg_print_stat
> 0.11% cgroup_base_stat_cputime_show

I can understand why the series reduces time spent in
mem_cgroup_flush_stats(), but what does the lower proportion of
mem_cgroup_css_rstat_flush() show?


> Another experiment was setup on the same host using a parent cgroup with
> two child cgroups. The same swap and memory max were used as the
> previous experiment. In the two child cgroups, kernel builds were done
> in parallel, each using "-j 20". The perf comparison of the test program
> was very similar to the values in the previous experiment. The time
> comparison is shown below.
> 
> before:
> real    1m2.077s
> user    0m0.784s
> sys     1m0.895s

Is this 1M loops of the stats-reading program like before? I.e. if this
should be analogous to the 0m44.612s above, why isn't it the same? (I'm
thinking of more frequent updates in the latter test.)

> after:
> real    0m32.216s
> user    0m0.709s
> sys     0m31.256s

What was the impact on the kernel build workloads (cgroup_rstat_updated)?

(Perhaps the saved 30s of CPU work (if it merely moved from readers to
writers) would be spread too thin across the two 20-way parallel kernel
builds, right?)

...
> For the final experiment, perf events were recorded during a kernel
> build with the same host and cgroup setup. The builds took place in the
> child node. Control and experimental sides both showed similar in cycles
> spent on cgroup_rstat_updated() and appeard insignificant compared among
> the events recorded with the workload.

What's the change between control vs experiment? Runnning in root cg vs
nested? Or running without *.stat readers vs with them against the
kernel build?
(This clarification would likely answer my question above.)


Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-02-27 21:55 ` [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state inwardvessel
  2025-02-27 22:43   ` Shakeel Butt
  2025-02-28 19:04   ` Yosry Ahmed
@ 2025-03-03 15:20   ` Michal Koutný
  2025-03-03 19:31     ` JP Kobryn
  2 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2025-03-03 15:20 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team

[-- Attachment #1: Type: text/plain, Size: 692 bytes --]

On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
 ...
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -269,8 +269,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
>  /*
>   * rstat.c
>   */
> -int cgroup_rstat_init(struct cgroup *cgrp);
> -void cgroup_rstat_exit(struct cgroup *cgrp);
> +int cgroup_rstat_init(struct cgroup_subsys_state *css);
> +void cgroup_rstat_exit(struct cgroup_subsys_state *css);

Given the functions have little in common with struct cgroup now, I'd
not be afraid of renaming them to css_rstat_* so that names match the
processed structure.
(Similarly for _updated, _flush and others.)



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/4 v2] cgroup: rstat lock indirection
  2025-02-27 21:55 ` [PATCH 2/4 v2] cgroup: rstat lock indirection inwardvessel
@ 2025-03-03 15:21   ` Michal Koutný
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Koutný @ 2025-03-03 15:21 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team

[-- Attachment #1: Type: text/plain, Size: 497 bytes --]

On Thu, Feb 27, 2025 at 01:55:41PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> Instead of accessing the target lock directly via global var, access it
> indirectly in the form of a new parameter. Also change the ordering of
> the parameters to be consistent with the related per-cpu locking
> function _cgroup_rstat_cpu_lock().

This is a non-functional change serving as preparation for the further
lock split.

(commit message fixup)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
                     ` (3 preceding siblings ...)
  2025-03-01 23:00   ` kernel test robot
@ 2025-03-03 15:22   ` Michal Koutný
  2025-03-03 18:29     ` Yosry Ahmed
  2025-03-03 23:49   ` kernel test robot
  5 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2025-03-03 15:22 UTC (permalink / raw)
  To: inwardvessel
  Cc: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1320 bytes --]

On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
...
> +static inline bool is_base_css(struct cgroup_subsys_state *css)
> +{
> +	return css->ss == NULL;
> +}

A similar predicate is also used in cgroup.c (various cgroup vs subsys
lifecycle functions, e.g. css_free_rwork_fn()). I think it'd be better
unified, i.e. either open code the predicate here or use a helper in
both places (css_is_cgroup() or similar).
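
E.g. (using the name floated above, purely illustrative):

	/* a css with no subsystem attached is the cgroup's own (self) css */
	static inline bool css_is_cgroup(struct cgroup_subsys_state *css)
	{
		return css->ss == NULL;
	}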

>  void __init cgroup_rstat_boot(void)
>  {
> -	int cpu;
> +	struct cgroup_subsys *ss;
> +	int cpu, ssid;
>  
> -	for_each_possible_cpu(cpu)
> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> +	for_each_subsys(ss, ssid) {
> +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> +	}

Hm, with this loop I realize it may be worth putting this lock into
struct cgroup_subsys_state and initializing them in
cgroup_init_subsys() to keep all per-subsys data in one pack.

> +
> +	for_each_possible_cpu(cpu) {
> +		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
> +
> +		for_each_subsys(ss, ssid) {
> +			raw_spin_lock_init(
> +					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);
> +		}

Similar here, and keep cgroup_rstat_boot() for the base locks only.


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-03-01  1:25       ` Yosry Ahmed
  2025-03-01  1:30         ` JP Kobryn
@ 2025-03-03 18:18         ` Shakeel Butt
  2025-03-03 18:21           ` Yosry Ahmed
  1 sibling, 1 reply; 39+ messages in thread
From: Shakeel Butt @ 2025-03-03 18:18 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: JP Kobryn, tj, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Sat, Mar 01, 2025 at 01:25:03AM +0000, Yosry Ahmed wrote:
> On Fri, Feb 28, 2025 at 05:06:23PM -0800, JP Kobryn wrote:
[...]
> > > 
> > > We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
> > > Otherwise we will call BPF rstat flush whenever any subsystem is
> > > flushed.
> > > 
> > > I guess it's because BPF can now pass any subsystem to
> > > cgroup_rstat_flush(), and we don't keep track. I think it would be
> > > better if we do not allow BPF programs to select a css and always make
> > > them flush the self css.
> > > 
> > > We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
> > > in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().
> > 
> > I'm fine with this if others are in agreement. A similar concept was
> > done in v1.
> 
> Let's wait for Shakeel to chime in here since he suggested removing this
> hook, but I am not sure if he intended to actually do it or not. Better
> not to waste effort if this will be gone soon anyway.
> 

Yes, let's remove this unused hook. JP, can you please follow up after
this series with the removal/deprecation of this? One thing we might
want to be careful about is that if someone wants to add this
functionality again in the future, it would not be a clean revert but
rather what Yosry is asking for.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-03-03 18:18         ` Shakeel Butt
@ 2025-03-03 18:21           ` Yosry Ahmed
  0 siblings, 0 replies; 39+ messages in thread
From: Yosry Ahmed @ 2025-03-03 18:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: JP Kobryn, tj, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Mon, Mar 03, 2025 at 10:18:33AM -0800, Shakeel Butt wrote:
> On Sat, Mar 01, 2025 at 01:25:03AM +0000, Yosry Ahmed wrote:
> > On Fri, Feb 28, 2025 at 05:06:23PM -0800, JP Kobryn wrote:
> [...]
> > > > 
> > > > We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
> > > > Otherwise we will call BPF rstat flush whenever any subsystem is
> > > > flushed.
> > > > 
> > > > I guess it's because BPF can now pass any subsystem to
> > > > cgroup_rstat_flush(), and we don't keep track. I think it would be
> > > > better if we do not allow BPF programs to select a css and always make
> > > > them flush the self css.
> > > > 
> > > > We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
> > > > in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().
> > > 
> > > I'm fine with this if others are in agreement. A similar concept was
> > > done in v1.
> > 
> > Let's wait for Shakeel to chime in here since he suggested removing this
> > hook, but I am not sure if he intended to actually do it or not. Better
> > not to waste effort if this will be gone soon anyway.
> > 
> 
> Yes, let's remove this unused hook. JP, can you please followup after
> this series with the removal/deprecation of this?

In this case I think it's fine for the purpose of this series to keep
bpf_rstat_flush() called if any css is being flushed to keep things
simple.

> One thing we might want to careful about is if in future someone to
> add this functionality again, that would not be a clean revert but
> what Yosry is asking for.

Not sure what you mean here, I am not asking for anything :P



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 15:22   ` Michal Koutný
@ 2025-03-03 18:29     ` Yosry Ahmed
  2025-03-03 18:40       ` Shakeel Butt
                         ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Yosry Ahmed @ 2025-03-03 18:29 UTC (permalink / raw)
  To: Michal Koutný
  Cc: inwardvessel, tj, shakeel.butt, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Mon, Mar 03, 2025 at 04:22:42PM +0100, Michal Koutný wrote:
> On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> > From: JP Kobryn <inwardvessel@gmail.com>
> ...
> > +static inline bool is_base_css(struct cgroup_subsys_state *css)
> > +{
> > +	return css->ss == NULL;
> > +}
> 
> Similar predicate is also used in cgroup.c (various cgroup vs subsys
> lifecycle functions, e.g. css_free_rwork_fn()). I think it'd better
> unified, i.e. open code the predicate here or use the helper in both
> cases (css_is_cgroup() or similar).
> 
> >  void __init cgroup_rstat_boot(void)
> >  {
> > -	int cpu;
> > +	struct cgroup_subsys *ss;
> > +	int cpu, ssid;
> >  
> > -	for_each_possible_cpu(cpu)
> > -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> > +	for_each_subsys(ss, ssid) {
> > +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> > +	}
> 
> Hm, with this loop I realize it may be worth putting this lock into
> struct cgroup_subsys_state and initializing them in
> cgroup_init_subsys() to keep all per-subsys data in one pack.

I thought about this, but this would have unnecessary memory overhead as
we only need one lock per-subsystem. So having a lock in every single
css is wasteful.

Maybe we can put the lock in struct cgroup_subsys? Then we can still
initialize them in cgroup_init_subsys().

> 
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
> > +
> > +		for_each_subsys(ss, ssid) {
> > +			raw_spin_lock_init(
> > +					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);
> > +		}
> 
> Similar here, and keep cgroup_rstat_boot() for the base locks only.

I think it will be confusing to have cgroup_rstat_boot() only initialize
some of the locks.

I think if we initialize the subsys locks in cgroup_init_subsys(), then
we should open code initializing the base locks in cgroup_init(), and
remove cgroup_rstat_boot().

Alternatively, we can make cgroup_rstat_boot() take in a subsys and
initialize its lock. If we pass NULL, then it initializes the base locks.
In this case we can call cgroup_rstat_boot() for each subsystem that has
an rstat callback in cgroup_init() (or cgroup_init_subsys()), and then
once for the base locks.

WDYT?
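
As a rough illustration of that alternative (the lock fields on struct
cgroup_subsys and the base lock names are assumptions):

	void __init cgroup_rstat_boot(struct cgroup_subsys *ss)
	{
		int cpu;

		if (ss) {
			spin_lock_init(&ss->rstat_lock);	/* assumed new field */
			for_each_possible_cpu(cpu)
				raw_spin_lock_init(per_cpu_ptr(ss->rstat_cpu_lock, cpu));
		} else {
			spin_lock_init(&cgroup_rstat_base_lock);	/* name assumed */
			for_each_possible_cpu(cpu)
				raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
		}
	}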


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 18:29     ` Yosry Ahmed
@ 2025-03-03 18:40       ` Shakeel Butt
  2025-03-03 19:23         ` JP Kobryn
  2025-03-03 19:50         ` Yosry Ahmed
  2025-03-03 18:49       ` Michal Koutný
  2025-03-06 21:36       ` JP Kobryn
  2 siblings, 2 replies; 39+ messages in thread
From: Shakeel Butt @ 2025-03-03 18:40 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Koutný,
	inwardvessel, tj, mhocko, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Mon, Mar 03, 2025 at 06:29:53PM +0000, Yosry Ahmed wrote:
> On Mon, Mar 03, 2025 at 04:22:42PM +0100, Michal Koutný wrote:
> > On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> > > From: JP Kobryn <inwardvessel@gmail.com>
> > ...
> > > +static inline bool is_base_css(struct cgroup_subsys_state *css)
> > > +{
> > > +	return css->ss == NULL;
> > > +}
> > 
> > Similar predicate is also used in cgroup.c (various cgroup vs subsys
> > lifecycle functions, e.g. css_free_rwork_fn()). I think it'd better
> > unified, i.e. open code the predicate here or use the helper in both
> > cases (css_is_cgroup() or similar).
> > 
> > >  void __init cgroup_rstat_boot(void)
> > >  {
> > > -	int cpu;
> > > +	struct cgroup_subsys *ss;
> > > +	int cpu, ssid;
> > >  
> > > -	for_each_possible_cpu(cpu)
> > > -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> > > +	for_each_subsys(ss, ssid) {
> > > +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> > > +	}
> > 
> > Hm, with this loop I realize it may be worth putting this lock into
> > struct cgroup_subsys_state and initializing them in
> > cgroup_init_subsys() to keep all per-subsys data in one pack.
> 
> I thought about this, but this would have unnecessary memory overhead as
> we only need one lock per-subsystem. So having a lock in every single
> css is wasteful.
> 
> Maybe we can put the lock in struct cgroup_subsys? Then we can still
> initialize them in cgroup_init_subsys().
> 

Actually, one of the things I was thinking about is whether we can just
not have a per-subsystem lock at all. At the moment it is protecting the
rstat_flush_next field (today in cgroup, and in JP's series it is in the
css). What if we make it per-cpu? Then we don't need the per-subsystem
lock at all. Let me know if I missed something which is being protected
by this lock.

This would help the case where there are multiple stat flushers of the
same subsystem, possibly for different parts of the cgroup tree. They
will still compete on the per-cpu lock, but that would still be better
than a subsystem-level lock.
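
If it helps to picture it, a rough sketch of that layout (not tested)
would be to move the flush chain next to the per-cpu update chain:

	struct cgroup_rstat_cpu {
		struct cgroup_subsys_state *updated_children;	/* terminated by self */
		struct cgroup_subsys_state *updated_next;	/* NULL if not on list */
		struct cgroup_subsys_state *rstat_flush_next;	/* per-cpu flush chain */
	};

so that cgroup_rstat_updated_list() could build the flush list for a
given cpu entirely under that cpu's lock.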


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 18:29     ` Yosry Ahmed
  2025-03-03 18:40       ` Shakeel Butt
@ 2025-03-03 18:49       ` Michal Koutný
  2025-03-10 17:59         ` JP Kobryn
  2025-03-06 21:36       ` JP Kobryn
  2 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2025-03-03 18:49 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: inwardvessel, tj, shakeel.butt, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1502 bytes --]

On Mon, Mar 03, 2025 at 06:29:53PM +0000, Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> I thought about this, but this would have unnecessary memory overhead as
> we only need one lock per-subsystem. So having a lock in every single
> css is wasteful.
> 
> Maybe we can put the lock in struct cgroup_subsys? Then we can still
> initialize them in cgroup_init_subsys().

Ah, yes, muscle memory; of course I had struct cgroup_subsys (not
cgroup_subsys_state) in mind.

> I think it will be confusing to have cgroup_rstat_boot() only initialize
> some of the locks.
> 
> I think if we initialize the subsys locks in cgroup_init_subsys(), then
> we should open code initializing the base locks in cgroup_init(), and
> remove cgroup_rstat_boot().

Can this work?

DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock) =
	__RAW_SPIN_LOCK_INITIALIZER(cgroup_rstat_base_cpu_lock);

(I see other places in the kernel that assign into the per-cpu
definition, but I have no idea whether that expands and links to what's
expected. Neglecting the fact that the lock initializer is apparently
not for external use.)

> Alternatively, we can make cgroup_rstat_boot() take in a subsys and
> initialize its lock. If we pass NULL, then it initialize the base locks.
> In this case we can call cgroup_rstat_boot() for each subsystem that has
> an rstat callback in cgroup_init() (or cgroup_init_subsys()), and then
> once for the base locks.
> 
> WDYT?

Such calls from cgroup_init_subsys() are fine too.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 18:40       ` Shakeel Butt
@ 2025-03-03 19:23         ` JP Kobryn
  2025-03-03 19:39           ` Shakeel Butt
  2025-03-03 19:50         ` Yosry Ahmed
  1 sibling, 1 reply; 39+ messages in thread
From: JP Kobryn @ 2025-03-03 19:23 UTC (permalink / raw)
  To: Shakeel Butt, Yosry Ahmed
  Cc: Michal Koutný,
	tj, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team



On 3/3/25 10:40 AM, Shakeel Butt wrote:
> On Mon, Mar 03, 2025 at 06:29:53PM +0000, Yosry Ahmed wrote:
>> On Mon, Mar 03, 2025 at 04:22:42PM +0100, Michal Koutný wrote:
>>> On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
>>>> From: JP Kobryn <inwardvessel@gmail.com>
>>> ...
>>>> +static inline bool is_base_css(struct cgroup_subsys_state *css)
>>>> +{
>>>> +	return css->ss == NULL;
>>>> +}
>>>
>>> Similar predicate is also used in cgroup.c (various cgroup vs subsys
>>> lifecycle functions, e.g. css_free_rwork_fn()). I think it'd better
>>> unified, i.e. open code the predicate here or use the helper in both
>>> cases (css_is_cgroup() or similar).
>>>
>>>>   void __init cgroup_rstat_boot(void)
>>>>   {
>>>> -	int cpu;
>>>> +	struct cgroup_subsys *ss;
>>>> +	int cpu, ssid;
>>>>   
>>>> -	for_each_possible_cpu(cpu)
>>>> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
>>>> +	for_each_subsys(ss, ssid) {
>>>> +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
>>>> +	}
>>>
>>> Hm, with this loop I realize it may be worth putting this lock into
>>> struct cgroup_subsys_state and initializing them in
>>> cgroup_init_subsys() to keep all per-subsys data in one pack.
>>
>> I thought about this, but this would have unnecessary memory overhead as
>> we only need one lock per-subsystem. So having a lock in every single
>> css is wasteful.
>>
>> Maybe we can put the lock in struct cgroup_subsys? Then we can still
>> initialize them in cgroup_init_subsys().
>>
> 
> Actually one of things I was thinking about if we can just not have
> per-subsystem lock at all. At the moment, it is protecting
> rstat_flush_next field (today in cgroup and JP's series it is in css).
> What if we make it a per-cpu then we don't need the per-subsystem lock
> all? Let me know if I missed something which is being protected by this
> lock.
> 
> This is help the case where there are multiple same subsystem stat
> flushers, possibly of differnt part of cgroup tree. Though they will
> still compete on per-cpu lock but still would be better than a
> sub-system level lock.

Right, the trade-off would mean one subsystem flushing could contend for
a cpu where a different subsystem is updating and vice versa.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  2025-03-03 15:20   ` Michal Koutný
@ 2025-03-03 19:31     ` JP Kobryn
  0 siblings, 0 replies; 39+ messages in thread
From: JP Kobryn @ 2025-03-03 19:31 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 3/3/25 7:20 AM, Michal Koutný wrote:
> On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
>   ...
>> --- a/kernel/cgroup/cgroup-internal.h
>> +++ b/kernel/cgroup/cgroup-internal.h
>> @@ -269,8 +269,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
>>   /*
>>    * rstat.c
>>    */
>> -int cgroup_rstat_init(struct cgroup *cgrp);
>> -void cgroup_rstat_exit(struct cgroup *cgrp);
>> +int cgroup_rstat_init(struct cgroup_subsys_state *css);
>> +void cgroup_rstat_exit(struct cgroup_subsys_state *css);
> 
> Given the functions have little in common with struct cgroup now, I'd
> not be afraid of renaming them to css_rstat_* so that names match the
> processed structure.
> (Similar, _updated and _flush and others.)

Makes sense. I can rename these (and others that now accept css) in the
next rev.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 19:23         ` JP Kobryn
@ 2025-03-03 19:39           ` Shakeel Butt
  0 siblings, 0 replies; 39+ messages in thread
From: Shakeel Butt @ 2025-03-03 19:39 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Yosry Ahmed, Michal Koutný,
	tj, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On Mon, Mar 03, 2025 at 11:23:49AM -0800, JP Kobryn wrote:
> 
> 
> On 3/3/25 10:40 AM, Shakeel Butt wrote:
> > On Mon, Mar 03, 2025 at 06:29:53PM +0000, Yosry Ahmed wrote:
> > > On Mon, Mar 03, 2025 at 04:22:42PM +0100, Michal Koutný wrote:
> > > > On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> > > > > From: JP Kobryn <inwardvessel@gmail.com>
> > > > ...
> > > > > +static inline bool is_base_css(struct cgroup_subsys_state *css)
> > > > > +{
> > > > > +	return css->ss == NULL;
> > > > > +}
> > > > 
> > > > Similar predicate is also used in cgroup.c (various cgroup vs subsys
> > > > lifecycle functions, e.g. css_free_rwork_fn()). I think it'd better
> > > > unified, i.e. open code the predicate here or use the helper in both
> > > > cases (css_is_cgroup() or similar).
> > > > 
> > > > >   void __init cgroup_rstat_boot(void)
> > > > >   {
> > > > > -	int cpu;
> > > > > +	struct cgroup_subsys *ss;
> > > > > +	int cpu, ssid;
> > > > > -	for_each_possible_cpu(cpu)
> > > > > -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> > > > > +	for_each_subsys(ss, ssid) {
> > > > > +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> > > > > +	}
> > > > 
> > > > Hm, with this loop I realize it may be worth putting this lock into
> > > > struct cgroup_subsys_state and initializing them in
> > > > cgroup_init_subsys() to keep all per-subsys data in one pack.
> > > 
> > > I thought about this, but this would have unnecessary memory overhead as
> > > we only need one lock per-subsystem. So having a lock in every single
> > > css is wasteful.
> > > 
> > > Maybe we can put the lock in struct cgroup_subsys? Then we can still
> > > initialize them in cgroup_init_subsys().
> > > 
> > 
> > Actually one of things I was thinking about if we can just not have
> > per-subsystem lock at all. At the moment, it is protecting
> > rstat_flush_next field (today in cgroup and JP's series it is in css).
> > What if we make it a per-cpu then we don't need the per-subsystem lock
> > all? Let me know if I missed something which is being protected by this
> > lock.
> > 
> > This is help the case where there are multiple same subsystem stat
> > flushers, possibly of differnt part of cgroup tree. Though they will
> > still compete on per-cpu lock but still would be better than a
> > sub-system level lock.
> 
> Right, the trade-off would mean one subsystem flushing could contend for
> a cpu where a different subsystem is updating and vice versa.
> 

I meant we keep the per-subsystem per-cpu locks but remove the
per-subsystem lock (i.e. cgroup_rstat_subsys_lock) if we make
rstat_flush_next per-cpu. In that case two memory stats readers will not
compete on the memory subsystem lock, but they will compete on the
per-cpu memory locks. Though we will have to experiment to see if this
really helps or not.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 18:40       ` Shakeel Butt
  2025-03-03 19:23         ` JP Kobryn
@ 2025-03-03 19:50         ` Yosry Ahmed
  2025-03-03 20:09           ` Shakeel Butt
  1 sibling, 1 reply; 39+ messages in thread
From: Yosry Ahmed @ 2025-03-03 19:50 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	inwardvessel, tj, mhocko, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Mon, Mar 03, 2025 at 10:40:45AM -0800, Shakeel Butt wrote:
> On Mon, Mar 03, 2025 at 06:29:53PM +0000, Yosry Ahmed wrote:
> > On Mon, Mar 03, 2025 at 04:22:42PM +0100, Michal Koutný wrote:
> > > On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> > > > From: JP Kobryn <inwardvessel@gmail.com>
> > > ...
> > > > +static inline bool is_base_css(struct cgroup_subsys_state *css)
> > > > +{
> > > > +	return css->ss == NULL;
> > > > +}
> > > 
> > > Similar predicate is also used in cgroup.c (various cgroup vs subsys
> > > lifecycle functions, e.g. css_free_rwork_fn()). I think it'd better
> > > unified, i.e. open code the predicate here or use the helper in both
> > > cases (css_is_cgroup() or similar).
> > > 
> > > >  void __init cgroup_rstat_boot(void)
> > > >  {
> > > > -	int cpu;
> > > > +	struct cgroup_subsys *ss;
> > > > +	int cpu, ssid;
> > > >  
> > > > -	for_each_possible_cpu(cpu)
> > > > -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> > > > +	for_each_subsys(ss, ssid) {
> > > > +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> > > > +	}
> > > 
> > > Hm, with this loop I realize it may be worth putting this lock into
> > > struct cgroup_subsys_state and initializing them in
> > > cgroup_init_subsys() to keep all per-subsys data in one pack.
> > 
> > I thought about this, but this would have unnecessary memory overhead as
> > we only need one lock per-subsystem. So having a lock in every single
> > css is wasteful.
> > 
> > Maybe we can put the lock in struct cgroup_subsys? Then we can still
> > initialize them in cgroup_init_subsys().
> > 
> 
> Actually one of things I was thinking about if we can just not have
> per-subsystem lock at all. At the moment, it is protecting
> rstat_flush_next field (today in cgroup and JP's series it is in css).
> What if we make it a per-cpu then we don't need the per-subsystem lock
> all? Let me know if I missed something which is being protected by this
> lock.

I think it protects more than that. I remember looking into this before,
and the thing I remember is the stats themselves. Looking at
mem_cgroup_stat_aggregate(), we aggregate the stats into the per-cgroup
counters non-atomically. This is only protected by the rstat lock
(currently global, per-subsystem with the series) AFAICT.

Not sure if only the memory subsystem has this dependency or if others
do as well. I remember looking into switching these counters to atomics
to remove the global lock, but it performed worse IIRC.

I remember also looking into partitioning the lock into a per-cgroup (or
per-css now?) lock, and only holding locks of the parent and child
cgroups as we flush each cgroup. I don't remember if I actually
implemented this, but it introduces complexity.

Perhaps we can defer the locking into the subsystem, if only the memory
controller requires it. In this case mem_cgroup_css_rstat_flush() (or
mem_cgroup_stat_aggregate()) can hold a memcg-specific spinlock only
while aggregating the stats.
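
A minimal sketch of that idea, assuming a hypothetical memcg-internal
lock (the lock name and placement are made up here and are not part of
the posted series):

	/* hypothetical lock; serializes only the non-atomic aggregation */
	static DEFINE_SPINLOCK(memcg_stats_aggregate_lock);

	void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
	{
		struct mem_cgroup *memcg = mem_cgroup_from_css(css);

		/*
		 * The per-cgroup counters are updated with plain
		 * read-modify-write, so hold a memcg-local lock just for
		 * that window instead of relying on the rstat flush lock.
		 */
		spin_lock(&memcg_stats_aggregate_lock);
		/* ... aggregate the per-CPU deltas for @cpu into @memcg ... */
		spin_unlock(&memcg_stats_aggregate_lock);
	}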

There could be other things protected by the lock, but that's what I
remember.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 19:50         ` Yosry Ahmed
@ 2025-03-03 20:09           ` Shakeel Butt
  0 siblings, 0 replies; 39+ messages in thread
From: Shakeel Butt @ 2025-03-03 20:09 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Koutný,
	inwardvessel, tj, mhocko, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Mon, Mar 03, 2025 at 07:50:20PM +0000, Yosry Ahmed wrote:
> On Mon, Mar 03, 2025 at 10:40:45AM -0800, Shakeel Butt wrote:
> > On Mon, Mar 03, 2025 at 06:29:53PM +0000, Yosry Ahmed wrote:
> > > On Mon, Mar 03, 2025 at 04:22:42PM +0100, Michal Koutný wrote:
> > > > On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> > > > > From: JP Kobryn <inwardvessel@gmail.com>
> > > > ...
> > > > > +static inline bool is_base_css(struct cgroup_subsys_state *css)
> > > > > +{
> > > > > +	return css->ss == NULL;
> > > > > +}
> > > > 
> > > > A similar predicate is also used in cgroup.c (various cgroup vs subsys
> > > > lifecycle functions, e.g. css_free_rwork_fn()). I think it'd be better
> > > > unified, i.e. either open code the predicate here or use a helper in
> > > > both cases (css_is_cgroup() or similar).
> > > > 
> > > > >  void __init cgroup_rstat_boot(void)
> > > > >  {
> > > > > -	int cpu;
> > > > > +	struct cgroup_subsys *ss;
> > > > > +	int cpu, ssid;
> > > > >  
> > > > > -	for_each_possible_cpu(cpu)
> > > > > -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> > > > > +	for_each_subsys(ss, ssid) {
> > > > > +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
> > > > > +	}
> > > > 
> > > > Hm, with this loop I realize it may be worth putting this lock into
> > > > struct cgroup_subsys_state and initializing them in
> > > > cgroup_init_subsys() to keep all per-subsys data in one pack.
> > > 
> > > I thought about this, but this would have unnecessary memory overhead as
> > > we only need one lock per-subsystem. So having a lock in every single
> > > css is wasteful.
> > > 
> > > Maybe we can put the lock in struct cgroup_subsys? Then we can still
> > > initialize them in cgroup_init_subsys().
> > > 
> > 
> > Actually, one of the things I was thinking about is whether we can just
> > not have a per-subsystem lock at all. At the moment, it is protecting the
> > rstat_flush_next field (today in cgroup; in JP's series it is in css).
> > What if we make it per-cpu? Then we don't need the per-subsystem lock at
> > all. Let me know if I missed something else that is being protected by
> > this lock.
> 
> I think it protects more than that. I remember looking into this before,
> and the thing I remember is the stats themselves. Looking at
> mem_cgroup_stat_aggregate(), we aggregate the stats into the per-cgroup
> counters non-atomically. This is only protected by the rstat lock
> (currently global, per-subsystem with the series) AFAICT.
> 
> Not sure if only the memory subsystem has this dependency or if others
> do as well. I remember looking into switching these counters to atomics
> to remove the global lock, but it performed worse IIRC.
> 
> I also remember looking into partitioning the lock into a per-cgroup (or
> per-css now?) lock, and only holding locks of the parent and child
> cgroups as we flush each cgroup. I don't remember if I actually
> implemented this, but it introduces complexity.
> 
> Perhaps we can defer the locking into the subsystem, if only the memory
> controller requires it. In this case mem_cgroup_css_rstat_flush() (or
> mem_cgroup_stat_aggregate()) can hold a memcg-specific spinlock only
> while aggregating the stats.
> 
> There could be other things protected by the lock, but that's what I
> remember.

Thanks for reminding me. It does not seem straightforward or simple.
Lower priority then.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
                     ` (4 preceding siblings ...)
  2025-03-03 15:22   ` Michal Koutný
@ 2025-03-03 23:49   ` kernel test robot
  5 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2025-03-03 23:49 UTC (permalink / raw)
  To: inwardvessel, tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm
  Cc: oe-kbuild-all, linux-mm, cgroups, kernel-team

Hi inwardvessel,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/net]
[also build test ERROR on bpf-next/master bpf/master linus/master v6.14-rc5]
[cannot apply to tj-cgroup/for-next next-20250303]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/inwardvessel/cgroup-move-cgroup_rstat-from-cgroup-to-cgroup_subsys_state/20250228-055819
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net
patch link:    https://lore.kernel.org/r/20250227215543.49928-4-inwardvessel%40gmail.com
patch subject: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20250304/202503040736.kEeH1wt6-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250304/202503040736.kEeH1wt6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503040736.kEeH1wt6-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/loongarch/include/asm/bug.h:61,
                    from include/linux/bug.h:5,
                    from include/linux/thread_info.h:13,
                    from include/asm-generic/current.h:6,
                    from ./arch/loongarch/include/generated/asm/current.h:1,
                    from include/linux/sched.h:12,
                    from include/linux/cgroup.h:12,
                    from kernel/cgroup/cgroup-internal.h:5,
                    from kernel/cgroup/rstat.c:2:
   kernel/cgroup/rstat.c: In function 'cgroup_rstat_flush_locked':
>> include/linux/lockdep.h:252:61: error: 'lock' is a pointer; did you mean to use '->'?
     252 | #define lockdep_is_held(lock)           lock_is_held(&(lock)->dep_map)
         |                                                             ^~
   include/asm-generic/bug.h:123:32: note: in definition of macro 'WARN_ON'
     123 |         int __ret_warn_on = !!(condition);                              \
         |                                ^~~~~~~~~
   include/linux/lockdep.h:285:9: note: in expansion of macro 'lockdep_assert'
     285 |         lockdep_assert(lockdep_is_held(l) != LOCK_STATE_NOT_HELD)
         |         ^~~~~~~~~~~~~~
   include/linux/lockdep.h:285:24: note: in expansion of macro 'lockdep_is_held'
     285 |         lockdep_assert(lockdep_is_held(l) != LOCK_STATE_NOT_HELD)
         |                        ^~~~~~~~~~~~~~~
   kernel/cgroup/rstat.c:339:9: note: in expansion of macro 'lockdep_assert_held'
     339 |         lockdep_assert_held(&lock);
         |         ^~~~~~~~~~~~~~~~~~~


vim +252 include/linux/lockdep.h

f607c668577481 Peter Zijlstra 2009-07-20  251  
f8319483f57f1c Peter Zijlstra 2016-11-30 @252  #define lockdep_is_held(lock)		lock_is_held(&(lock)->dep_map)
f8319483f57f1c Peter Zijlstra 2016-11-30  253  #define lockdep_is_held_type(lock, r)	lock_is_held_type(&(lock)->dep_map, (r))
f607c668577481 Peter Zijlstra 2009-07-20  254  
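
The diagnostic points at kernel/cgroup/rstat.c:339, where @lock is
already a spinlock_t pointer, so taking its address hands
lockdep_is_held() a spinlock_t **. A likely one-line fix (inferred from
the error above, not a posted patch):

	lockdep_assert_held(lock);	/* @lock is already a pointer */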

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4 v2] cgroup: separate rstat trees
  2025-03-03 15:19 ` Michal Koutný
@ 2025-03-06  1:07   ` JP Kobryn
  2025-03-11 13:49     ` Michal Koutný
  0 siblings, 1 reply; 39+ messages in thread
From: JP Kobryn @ 2025-03-06  1:07 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 3/3/25 7:19 AM, Michal Koutný wrote:
> Hello JP.
> 
> On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
>> From: JP Kobryn <inwardvessel@gmail.com>
>>
>> The current design of rstat takes the approach that if one subsystem is
>> to be flushed, all other subsystems with pending updates should also be
>> flushed. It seems that over time, the stat-keeping of some subsystems
>> has grown in size to the extent that they are noticeably slowing down
>> others. This has been most observable in situations where the memory
>> controller is enabled. One big area where the issue comes up is system
>> telemetry, where programs periodically sample cpu stats. It would be a
>> benefit for programs like this if the overhead of having to flush memory
>> stats (and others) could be eliminated. It would save cpu cycles for
>> existing cpu-based telemetry programs and improve scalability in terms
>> of sampling frequency and volume of hosts.
>   
>> This series changes the approach of "flush all subsystems" to "flush
>> only the requested subsystem".
> ...
> 
>> before:
>> sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
>>
>> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
>> nr_cgroups * 176 bytes
>>
>> after:
> ...
>> nr_cgroups * (176 + 16 * 2)
>> nr_cgroups * 208 bytes
>   
> ~ 32B/cgroup/cpu

Thanks. I'll make this clear in the cover letter next rev.

> 
>> With regard to validation, there is a measurable benefit when reading
>> stats with this series. A test program was made to loop 1M times while
>> reading all four of the files cgroup.stat, cpu.stat, io.stat,
>> memory.stat of a given parent cgroup each iteration. This test program
>> has been run in the experiments that follow.
> 
> Thanks for looking into this and running experiments on the behavior of
> split rstat trees.

And thank you for reviewing along with the good questions.

> 
>> The first experiment consisted of a parent cgroup with memory.swap.max=0
>> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
>> and within each child cgroup a process was spawned to frequently update
>> the memory cgroup stats by creating and then reading a file of size 1T
>> (encouraging reclaim). The test program was run alongside these 26 tasks
>> in parallel. The results showed a benefit in both time elapsed and perf
>> data of the test program.
>>
>> time before:
>> real    0m44.612s
>> user    0m0.567s
>> sys     0m43.887s
>>
>> perf before:
>> 27.02% mem_cgroup_css_rstat_flush
>>   6.35% __blkcg_rstat_flush
>>   0.06% cgroup_base_stat_cputime_show
>>
>> time after:
>> real    0m27.125s
>> user    0m0.544s
>> sys     0m26.491s
> 
> So this shows that flushing rstat trees one by one (as the test program
> reads *.stat) is quicker than flushing all at once (+idle reads of
> *.stat).
> Interesting, I'd not bet on that at first but that is convincing to
> favor the separate trees approach.
> 
>> perf after:
>> 6.03% mem_cgroup_css_rstat_flush
>> 0.37% blkcg_print_stat
>> 0.11% cgroup_base_stat_cputime_show
> 
> I'd understand why the series reduces time spent in
> mem_cgroup_flush_stats() but what does the lower proportion of
> mem_cgroup_css_rstat_flush() show?

When the entry point for flushing is reading the file memory.stat,
memory_stat_show() is called, which leads to __mem_cgroup_flush_stats().
In this function, there is an early return when (!force && !needs_flush)
is true. This opportunity to "skip" a flush is not reached when another
subsystem has initiated the flush and the entry point for flushing memory
is css->css_rstat_flush().
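
For reference, the short-circuit in __mem_cgroup_flush_stats() looks
roughly like this (paraphrased for illustration, not the verbatim
upstream source):

	static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
	{
		bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);

		/* readers of memory.stat skip the flush when updates are small */
		if (!force && !needs_flush)
			return;

		cgroup_rstat_flush(memcg->css.cgroup);
	}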

To verify the above, I made use of a previously added tracepoint [0] to
get info on the number of memcg flushes performed vs skipped. In a
comparison between reading only the memory.stat file vs reading the
{memory,io,cpu}.stat files under the same test, the flush count increased
by about the same amount that the skip count decreased.

Reading memory.stat
non-forced flushes: 5781
flushes skipped: 995826

Reading {memory,io,cpu}.stat
non-forced flushes: 12047
flushes skipped: 990857

If the flushes were not skipped, I think we would see a similar
proportion of mem_cgroup_css_rstat_flush() when reading memory.stat.

[0] https://lore.kernel.org/all/20241029021106.25587-1-inwardvessel@gmail.com/

> 
> 
>> Another experiment was setup on the same host using a parent cgroup with
>> two child cgroups. The same swap and memory max were used as the
>> previous experiment. In the two child cgroups, kernel builds were done
>> in parallel, each using "-j 20". The perf comparison of the test program
>> was very similar to the values in the previous experiment. The time
>> comparison is shown below.
>>
>> before:
>> real    1m2.077s
>> user    0m0.784s
>> sys     1m0.895s
> 
> This is 1M loops of the stats reading program like before? I.e. if this
> should be analogous to the 0m44.612s above, why isn't it the same? (I'm
> thinking of more frequent updates in the latter test.)

Yes. One notable difference in this test is that there are more threads
in the workload (40 vs 26) doing the updates.

> 
>> after:
>> real    0m32.216s
>> user    0m0.709s
>> sys     0m31.256s
> 
> What was impact on the kernel build workloads (cgroup_rstat_updated)?

You can now find some workload timing results further down. If you're 
asking specifically about time spent in cgroup_rstat_updated(), perf 
reports show fractional values on both sides.

> 
> (Perhaps the saved 30s of CPU work (if potentially moved from readers to
> writers) would be spread too thin across the two 20-parallel kernel
> builds, right?)

Are you suggesting a workload with fewer threads?

> 
> ...
>> For the final experiment, perf events were recorded during a kernel
>> build with the same host and cgroup setup. The builds took place in the
>> child node. Control and experimental sides both showed similar cycles
>> spent on cgroup_rstat_updated(), which appeared insignificant compared
>> to the other events recorded with the workload.
> 
> What's the change between control vs experiment? Running in root cg vs
> nested? Or running without *.stat readers vs with them against the
> kernel build?
> (This clarification would likely answer my question above.)
> 

workload control with no readers:
real    6m54.818s
user    117m3.122s
sys     5m4.996s

workload experiment with no readers:
real    6m54.862s
user    117m12.812s
sys     5m0.943s

workload control with constant readers {memory,io,cpu,cgroup}.stat:
real    6m59.468s
user    118m26.981s
sys     5m20.163s

workload experiment with constant readers {memory,io,cpu,cgroup}.stat:
real    6m57.031s
user    118m13.833s
sys     5m3.454s

These tests were done in a child (nested) cgroup. Were you also asking
for a root vs nested experiment, or did you just need clarification on
the test details?

> 
> Michal



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 18:29     ` Yosry Ahmed
  2025-03-03 18:40       ` Shakeel Butt
  2025-03-03 18:49       ` Michal Koutný
@ 2025-03-06 21:36       ` JP Kobryn
  2 siblings, 0 replies; 39+ messages in thread
From: JP Kobryn @ 2025-03-06 21:36 UTC (permalink / raw)
  To: Yosry Ahmed, Michal Koutný
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On 3/3/25 10:29 AM, Yosry Ahmed wrote:
> On Mon, Mar 03, 2025 at 04:22:42PM +0100, Michal Koutný wrote:
>> On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
>>> From: JP Kobryn <inwardvessel@gmail.com>
>> ...
>>> +static inline bool is_base_css(struct cgroup_subsys_state *css)
>>> +{
>>> +	return css->ss == NULL;
>>> +}
>>
>> A similar predicate is also used in cgroup.c (various cgroup vs subsys
>> lifecycle functions, e.g. css_free_rwork_fn()). I think it'd be better
>> unified, i.e. either open code the predicate here or use a helper in
>> both cases (css_is_cgroup() or similar).

Sure. Thanks for pointing that one out.

>>
>>>   void __init cgroup_rstat_boot(void)
>>>   {
>>> -	int cpu;
>>> +	struct cgroup_subsys *ss;
>>> +	int cpu, ssid;
>>>   
>>> -	for_each_possible_cpu(cpu)
>>> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
>>> +	for_each_subsys(ss, ssid) {
>>> +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
>>> +	}
>>
>> Hm, with this loop I realize it may be worth putting this lock into
>> struct cgroup_subsys_state and initializing them in
>> cgroup_init_subsys() to keep all per-subsys data in one pack.

Will give this a go in next rev.

> 
> I thought about this, but this would have unnecessary memory overhead as
> we only need one lock per-subsystem. So having a lock in every single
> css is wasteful.
> 
> Maybe we can put the lock in struct cgroup_subsys? Then we can still
> initialize them in cgroup_init_subsys().
> 
>>
>>> +
>>> +	for_each_possible_cpu(cpu) {
>>> +		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
>>> +
>>> +		for_each_subsys(ss, ssid) {
>>> +			raw_spin_lock_init(
>>> +					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);
>>> +		}
>>
>> Similar here, and keep cgroup_rstat_boot() for the base locks only.
> 
> I think it will be confusing to have cgroup_rstat_boot() only initialize
> some of the locks.
> 
> I think if we initialize the subsys locks in cgroup_init_subsys(), then
> we should open code initializing the base locks in cgroup_init(), and
> remove cgroup_rstat_boot().
> 
> Alternatively, we can make cgroup_rstat_boot() take in a subsys and
> initialize its lock. If we pass NULL, then it initializes the base locks.
> In this case we can call cgroup_rstat_boot() for each subsystem that has
> an rstat callback in cgroup_init() (or cgroup_init_subsys()), and then
> once for the base locks.
> 
> WDYT?

I like this alternative idea of adjusting cgroup_rstat_boot() so it can
accept a subsys ref. Let's see how it looks when v3 goes out.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-02-28 19:20   ` Yosry Ahmed
@ 2025-03-06 21:47     ` JP Kobryn
  0 siblings, 0 replies; 39+ messages in thread
From: JP Kobryn @ 2025-03-06 21:47 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On 2/28/25 11:20 AM, Yosry Ahmed wrote:
> On Thu, Feb 27, 2025 at 01:55:42PM -0800, inwardvessel wrote:
>> From: JP Kobryn <inwardvessel@gmail.com>
>>
>> Let the existing locks be dedicated to the base stats and rename them as
>> such. Also add new rstat locks for each enabled subsystem. When handling
>> cgroup subsystem states, distinguish between formal subsystems (memory,
>> io, etc) and the base stats subsystem state (represented by
>> cgroup::self) to decide on which locks to take. This change is made to
>> prevent contention between subsystems when updating/flushing stats.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>> ---
>>   kernel/cgroup/rstat.c | 93 +++++++++++++++++++++++++++++++++----------
>>   1 file changed, 72 insertions(+), 21 deletions(-)
>>
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index 88908ef9212d..b3eaefc1fd07 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -9,8 +9,12 @@
>>   
>>   #include <trace/events/cgroup.h>
>>   
>> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
>> -static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>> +static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
>> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
>> +
>> +static spinlock_t cgroup_rstat_subsys_lock[CGROUP_SUBSYS_COUNT];
>> +static DEFINE_PER_CPU(raw_spinlock_t,
>> +		cgroup_rstat_subsys_cpu_lock[CGROUP_SUBSYS_COUNT]);
>>   
>>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>>   
>> @@ -20,8 +24,13 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(
>>   	return per_cpu_ptr(css->rstat_cpu, cpu);
>>   }
>>   
>> +static inline bool is_base_css(struct cgroup_subsys_state *css)
>> +{
>> +	return css->ss == NULL;
>> +}
>> +
>>   /*
>> - * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
>> + * Helper functions for rstat per CPU locks.
>>    *
>>    * This makes it easier to diagnose locking issues and contention in
>>    * production environments. The parameter @fast_path determine the
>> @@ -36,12 +45,12 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>>   	bool contended;
>>   
>>   	/*
>> -	 * The _irqsave() is needed because cgroup_rstat_lock is
>> -	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
>> -	 * this lock with the _irq() suffix only disables interrupts on
>> -	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
>> -	 * interrupts on both configurations. The _irqsave() ensures
>> -	 * that interrupts are always disabled and later restored.
>> +	 * The _irqsave() is needed because the locks used for flushing are
>> +	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
>> +	 * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
>> +	 * kernel. The raw_spinlock_t below disables interrupts on both
>> +	 * configurations. The _irqsave() ensures that interrupts are always
>> +	 * disabled and later restored.
>>   	 */
>>   	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>>   	if (contended) {
>> @@ -87,7 +96,7 @@ __bpf_kfunc void cgroup_rstat_updated(
>>   		struct cgroup_subsys_state *css, int cpu)
>>   {
>>   	struct cgroup *cgrp = css->cgroup;
>> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
>> +	raw_spinlock_t *cpu_lock;
>>   	unsigned long flags;
>>   
>>   	/*
>> @@ -101,6 +110,12 @@ __bpf_kfunc void cgroup_rstat_updated(
>>   	if (data_race(cgroup_rstat_cpu(css, cpu)->updated_next))
>>   		return;
>>   
>> +	if (is_base_css(css))
>> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
>> +	else
>> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
>> +			css->ss->id;
>> +
> 
> Maybe wrap this in a macro or function since it's used more than once.
> 
>>   	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
>>   
>>   	/* put @css and all ancestors on the corresponding updated lists */
>> @@ -208,11 +223,17 @@ static struct cgroup_subsys_state *cgroup_rstat_updated_list(
>>   		struct cgroup_subsys_state *root, int cpu)
>>   {
>>   	struct cgroup *cgrp = root->cgroup;
>> -	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
>>   	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
>>   	struct cgroup_subsys_state *head = NULL, *parent, *child;
>> +	raw_spinlock_t *cpu_lock;
>>   	unsigned long flags;
>>   
>> +	if (is_base_css(root))
>> +		cpu_lock = per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);
>> +	else
>> +		cpu_lock = per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) +
>> +			root->ss->id;
>> +
>>   	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
>>   
>>   	/* Return NULL if this subtree is not on-list */
>> @@ -315,7 +336,7 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>>   	struct cgroup *cgrp = css->cgroup;
>>   	int cpu;
>>   
>> -	lockdep_assert_held(&cgroup_rstat_lock);
>> +	lockdep_assert_held(&lock);
>>   
>>   	for_each_possible_cpu(cpu) {
>>   		struct cgroup_subsys_state *pos;
>> @@ -356,12 +377,18 @@ static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css,
>>   __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>>   {
>>   	struct cgroup *cgrp = css->cgroup;
>> +	spinlock_t *lock;
>> +
>> +	if (is_base_css(css))
>> +		lock = &cgroup_rstat_base_lock;
>> +	else
>> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
> 
> Same here. Also, instead of passing locks around, can we just pass
> the css to __cgroup_rstat_lock/unlock() and have them find the correct
> lock and use it directly?

I like this and will take that approach in v3. It will allow the lock
selection code you previously commented on to exist in just one place.
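
For illustration, a lock-selection pair of helpers along those lines
could look like this (a sketch only; the helper names are assumptions
ahead of v3):

	static spinlock_t *ss_rstat_lock(struct cgroup_subsys_state *css)
	{
		if (is_base_css(css))
			return &cgroup_rstat_base_lock;

		return &cgroup_rstat_subsys_lock[css->ss->id];
	}

	static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys_state *css,
			int cpu)
	{
		if (is_base_css(css))
			return per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu);

		return per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + css->ss->id;
	}

Then __cgroup_rstat_lock/unlock() and the update/flush paths could call
these instead of open coding the selection.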

> 
>>   
>>   	might_sleep();
>>   
>> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
>> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
>> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
>> +	__cgroup_rstat_lock(lock, cgrp, -1);
>> +	cgroup_rstat_flush_locked(css, lock);
>> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>>   }
>>   
>>   /**
>> @@ -376,10 +403,16 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
>>   void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>>   {
>>   	struct cgroup *cgrp = css->cgroup;
>> +	spinlock_t *lock;
>> +
>> +	if (is_base_css(css))
>> +		lock = &cgroup_rstat_base_lock;
>> +	else
>> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>>   
>>   	might_sleep();
>> -	__cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
>> -	cgroup_rstat_flush_locked(css, &cgroup_rstat_lock);
>> +	__cgroup_rstat_lock(lock, cgrp, -1);
>> +	cgroup_rstat_flush_locked(css, lock);
>>   }
>>   
>>   /**
>> @@ -389,7 +422,14 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
>>   void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
>>   {
>>   	struct cgroup *cgrp = css->cgroup;
>> -	__cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
>> +	spinlock_t *lock;
>> +
>> +	if (is_base_css(css))
>> +		lock = &cgroup_rstat_base_lock;
>> +	else
>> +		lock = &cgroup_rstat_subsys_lock[css->ss->id];
>> +
>> +	__cgroup_rstat_unlock(lock, cgrp, -1);
>>   }
>>   
>>   int cgroup_rstat_init(struct cgroup_subsys_state *css)
>> @@ -435,10 +475,21 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
>>   
>>   void __init cgroup_rstat_boot(void)
>>   {
>> -	int cpu;
>> +	struct cgroup_subsys *ss;
>> +	int cpu, ssid;
>>   
>> -	for_each_possible_cpu(cpu)
>> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
>> +	for_each_subsys(ss, ssid) {
>> +		spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
>> +	}
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
>> +
>> +		for_each_subsys(ss, ssid) {
>> +			raw_spin_lock_init(
>> +					per_cpu_ptr(cgroup_rstat_subsys_cpu_lock, cpu) + ssid);
>> +		}
>> +	}
>>   }
>>   
>>   /*
>> -- 
>> 2.43.5
>>
>>



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-03 18:49       ` Michal Koutný
@ 2025-03-10 17:59         ` JP Kobryn
  2025-03-11 13:49           ` Michal Koutný
  0 siblings, 1 reply; 39+ messages in thread
From: JP Kobryn @ 2025-03-10 17:59 UTC (permalink / raw)
  To: Michal Koutný, Yosry Ahmed
  Cc: tj, shakeel.butt, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team

On 3/3/25 10:49 AM, Michal Koutný wrote:
> On Mon, Mar 03, 2025 at 06:29:53PM +0000, Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>> I thought about this, but this would have unnecessary memory overhead as
>> we only need one lock per-subsystem. So having a lock in every single
>> css is wasteful.
>>
>> Maybe we can put the lock in struct cgroup_subsys? Then we can still
>> initialize them in cgroup_init_subsys().
> 
> Ah, yes, muscle memory, of course I had struct cgroup_subsys in mind.
> 
>> I think it will be confusing to have cgroup_rstat_boot() only initialize
>> some of the locks.
>>
>> I think if we initialize the subsys locks in cgroup_init_subsys(), then
>> we should open code initializing the base locks in cgroup_init(), and
>> remove cgroup_rstat_boot().
> 
> Can this work?
> 
> DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock) =
> 	__RAW_SPIN_LOCK_INITIALIZER(cgroup_rstat_base_cpu_lock);
> 
> (I see other places in the kernel that assign into the per-cpu definition
> but I have no idea whether that expands and links to what's expected.
> Neglecting the fact that the lock initializer is apparently not for
> external use.)

I gave this a try. Using lockdep fields to verify, it expanded as
intended:
[    1.442498] [ss_rstat_init] cpu:0, lock.magic:dead4ead, lock.owner_cpu:-1, lock.owner:ffffffffffffffff
[    1.443027] [ss_rstat_init] cpu:1, lock.magic:dead4ead, lock.owner_cpu:-1, lock.owner:ffffffffffffffff
...

Unless anyone has objections to using the double-underscore macro, I will
use it in v3.

> 
>> Alternatively, we can make cgroup_rstat_boot() take in a subsys and
>> initialize its lock. If we pass NULL, then it initializes the base locks.
>> In this case we can call cgroup_rstat_boot() for each subsystem that has
>> an rstat callback in cgroup_init() (or cgroup_init_subsys()), and then
>> once for the base locks.
>>
>> WDYT?
> 
> Such calls from cgroup_init_subsys() are fine too.

Using the lock initializer macro above simplifies this situation. I'm
currently working with these changes that should appear in v3:

move global subsys locks to ss struct:
	struct cgroup_subsys {
		...
		spinlock_t lock;
		raw_spinlock_t __percpu *percpu_lock;
	};

change:
	cgroup_rstat_boot(void)
to:
	ss_rstat_init(struct cgroup_subsys *ss)

... and only use ss_rstat_init(ss) to initialize the new subsystem
lock fields, since the locks for base stats are now initialized at
definition.

Note that these changes also have the added benefit of not allocating
the per-cpu lock for subsystems that do not participate in rstat;
previously that was wasted static memory.
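
A rough sketch of how that could fit together (the allocation and error
handling details are assumptions until v3 is posted):

	static int __init ss_rstat_init(struct cgroup_subsys *ss)
	{
		int cpu;

		/* only subsystems that flush via rstat need the locks */
		if (!ss->css_rstat_flush)
			return 0;

		spin_lock_init(&ss->lock);

		ss->percpu_lock = alloc_percpu(raw_spinlock_t);
		if (!ss->percpu_lock)
			return -ENOMEM;

		for_each_possible_cpu(cpu)
			raw_spin_lock_init(per_cpu_ptr(ss->percpu_lock, cpu));

		return 0;
	}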


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems
  2025-03-10 17:59         ` JP Kobryn
@ 2025-03-11 13:49           ` Michal Koutný
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Koutný @ 2025-03-11 13:49 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Yosry Ahmed, tj, shakeel.butt, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team


On Mon, Mar 10, 2025 at 10:59:30AM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> > DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock) =
> > 	__RAW_SPIN_LOCK_INITIALIZER(cgroup_rstat_base_cpu_lock);
> > 
> > (I see other places in the kernel that assign into the per-cpu definition
> > but I have no idea whether that expands and links to what's expected.
> > Neglecting the fact that the lock initializer is apparently not for
> > external use.)
> 
> I gave this a try. Using lockdep fields to verify, it expanded as
> intended:
> [    1.442498] [ss_rstat_init] cpu:0, lock.magic:dead4ead,
> lock.owner_cpu:-1, lock.owner:ffffffffffffffff
> [    1.443027] [ss_rstat_init] cpu:1, lock.magic:dead4ead,
> lock.owner_cpu:-1, lock.owner:ffffffffffffffff
> ...
> 
> Unless anyone has objections on using the double under macro, I will use
> this in v3.

Actually, I have an objection (to the underscored macro).
It may work today, but it may break subtly in the future.

Maybe add a separate patch that introduces a proper (non-underscore)
initializer (or percpu-wrapped initializer) macro, and the people Cc'd on
that can evaluate the wisdom of it.
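
Purely as an illustration, such a wrapper might look like this (the
macro name and where it would live are assumptions):

	/* hypothetical public wrapper around the internal initializer */
	#define RAW_SPIN_LOCK_INIT(lockname)	__RAW_SPIN_LOCK_INITIALIZER(lockname)

	static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock) =
		RAW_SPIN_LOCK_INIT(cgroup_rstat_base_cpu_lock);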

Michal


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4 v2] cgroup: separate rstat trees
  2025-03-06  1:07   ` JP Kobryn
@ 2025-03-11 13:49     ` Michal Koutný
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Koutný @ 2025-03-11 13:49 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, mhocko, hannes, akpm, linux-mm,
	cgroups, kernel-team


On Wed, Mar 05, 2025 at 05:07:04PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> When the entry point for flushing is reading the file memory.stat,
> memory_stat_show() is called, which leads to __mem_cgroup_flush_stats().
> In this function, there is an early return when (!force && !needs_flush)
> is true. This opportunity to "skip" a flush is not reached when another
> subsystem has initiated the flush and the entry point for flushing memory
> is css->css_rstat_flush().

That sounds spot on; I'd say that explains the savings observed.
Could you add a note in the next version along these lines:

	memcg flushing uses heuristics to optimize flushing, but this is
	bypassed when memcg is flushed as a consequence of sharing the
	update tree with another controller.

IOW, other controllers did flushing work instead of memcg but it was
inefficient (effective though).


> Are you suggesting a workload with fewer threads?

No, no, I only roughly wondered where the work disappeared (but I've
understood it from the flushing heuristics above).

> > What's the change between control vs experiment? Runnning in root cg vs
> > nested? Or running without *.stat readers vs with them against the
> > kernel build?
> > (This clarification would likely answer my question above.)
> > 
> 

(reordered by me, hopefully we're on the same page)

before split:
> workload control with no readers:
> real    6m54.818s
> user    117m3.122s
> sys     5m4.996s
>
> workload control with constant readers {memory,io,cpu,cgroup}.stat:
> real    6m59.468s
> user    118m26.981s
> sys     5m20.163s

after split:
> workload experiment with no readers:
> real    6m54.862s
> user    117m12.812s
> sys     5m0.943s
> 
> workload experiment with constant readers {memory,io,cpu,cgroup}.stat:
> real    6m57.031s
> user    118m13.833s
> sys     5m3.454s

I reckon this is a positive effect* of the utilized heuristics (no
unnecessary flushes, therefore no unnecessary tree updates on the writer
side either).

*) Not statistically significant, but it doesn't look worse.

> These tests were done in a child (nested) cgroup. Were you also asking for a
> root vs nested experiment or were you just needing clarification on the test
> details?

No, I don't think the root vs nested comparison would be that
interesting in this case.

Thanks,
Michal


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-03-11 13:49 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
2025-02-27 21:55 ` [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state inwardvessel
2025-02-27 22:43   ` Shakeel Butt
2025-02-28 19:04   ` Yosry Ahmed
2025-03-01  1:06     ` JP Kobryn
2025-03-01  1:25       ` Yosry Ahmed
2025-03-01  1:30         ` JP Kobryn
2025-03-03 18:18         ` Shakeel Butt
2025-03-03 18:21           ` Yosry Ahmed
2025-03-03 15:20   ` Michal Koutný
2025-03-03 19:31     ` JP Kobryn
2025-02-27 21:55 ` [PATCH 2/4 v2] cgroup: rstat lock indirection inwardvessel
2025-03-03 15:21   ` Michal Koutný
2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
2025-02-27 22:52   ` Shakeel Butt
2025-02-28 16:07     ` JP Kobryn
2025-02-28 17:37   ` JP Kobryn
2025-02-28 19:20   ` Yosry Ahmed
2025-03-06 21:47     ` JP Kobryn
2025-03-01 23:00   ` kernel test robot
2025-03-03 15:22   ` Michal Koutný
2025-03-03 18:29     ` Yosry Ahmed
2025-03-03 18:40       ` Shakeel Butt
2025-03-03 19:23         ` JP Kobryn
2025-03-03 19:39           ` Shakeel Butt
2025-03-03 19:50         ` Yosry Ahmed
2025-03-03 20:09           ` Shakeel Butt
2025-03-03 18:49       ` Michal Koutný
2025-03-10 17:59         ` JP Kobryn
2025-03-11 13:49           ` Michal Koutný
2025-03-06 21:36       ` JP Kobryn
2025-03-03 23:49   ` kernel test robot
2025-02-27 21:55 ` [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats inwardvessel
2025-02-27 23:01   ` Shakeel Butt
2025-02-28 20:33   ` Yosry Ahmed
2025-02-28 18:22 ` [PATCH 0/4 v2] cgroup: separate rstat trees Yosry Ahmed
2025-03-03 15:19 ` Michal Koutný
2025-03-06  1:07   ` JP Kobryn
2025-03-11 13:49     ` Michal Koutný

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox