From: JP Kobryn <inwardvessel@gmail.com>
To: tj@kernel.org, shakeel.butt@linux.dev, yosryahmed@google.com,
mkoutny@suse.com, hannes@cmpxchg.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/4 v2] cgroup: separate rstat trees
Date: Wed, 19 Mar 2025 15:19:52 -0700 [thread overview]
Message-ID: <11e6f814-af32-4dc3-a8a9-3d1e3fdf7f6d@gmail.com> (raw)
In-Reply-To: <20250319221634.71128-1-inwardvessel@gmail.com>
Please disregard this thread. It was sent with an incorrect version. A
revised series labeled "v3" will be sent shortly.
On 3/19/25 3:16 PM, JP Kobryn wrote:
> The current design of rstat takes the approach that if one subsystem is to
> be flushed, all other subsystems with pending updates should also be
> flushed. A flush may be initiated by reading specific stats (like cpu.stat)
> and other subsystems will be flushed alongside. The complexity of flushing
> some subsystems has grown to the extent that the overhead of side flushes
> is unnecessarily delaying the fetching of desired stats.
>
> One big area where the issue comes up is system telemetry, where programs
> periodically sample cpu stats while the memory controller is enabled. It
> would be a benefit for programs sampling cpu.stat if the overhead of having
> to flush memory (and also io) stats was eliminated. It would save cpu
> cycles for existing stat reader programs and improve scalability in terms
> of sampling frequency and host volume.
>
> This series changes the approach of "flush all subsystems" to "flush only
> the requested subsystem". The core design change is moving from a unified
> model where rstat trees are shared by subsystems to having separate trees
> for each subsystem. On a per-cpu basis, there will be separate trees for
> each enabled subsystem if it implements css_rstat_flush and one tree for
> the base stats. In order to do this, the rstat list pointers were moved off
> of the cgroup and onto the css. In the transition, these pointer types were
> changed to cgroup_subsys_state. Finally the API for updated/flush was
> changed to accept a reference to a css instead of a cgroup. This allows for
> a specific subsystem to be associated with a given update or flush. The
> result is that rstat trees will now be made up of css nodes, and a given
> tree will only contain nodes associated with a specific subsystem.
>
> Since separate trees will now be in use, the locking scheme was adjusted.
> The global locks were split up in such a way that there are separate locks
> for the base stats (cgroup::self) and also for each subsystem (memory, io,
> etc). This allows different subsystems (and base stats) to use rstat in
> parallel with no contention.
>
> Breaking up the unified tree into separate trees eliminates the overhead
> and scalability issue explained in the first section, but comes at the
> expense of additional memory. In an effort to minimize this overhead, new
> rstat structs are introduced and a conditional allocation is performed.
> The cgroup_rstat_cpu which originally contained the rstat list pointers and
> the base stat entities was renamed cgroup_rstat_base_cpu. It is only
> allocated when the associated css is cgroup::self. As for non-self css's, a
> new compact struct was added that only contains the rstat list pointers.
> During initialization, when the given css is associated with an actual
> subsystem (not cgroup::self), this compact struct is allocated. With this
> conditional allocation, the change in memory overhead on a per-cpu basis
> before/after is shown below.
>
> memory overhead before:
> sizeof(struct cgroup_rstat_cpu) =~ 144 bytes /* can vary based on config */
>
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 144 bytes
>
> memory overhead after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 144 bytes
>
> nr_cgroups * (
> sizeof(struct cgroup_rstat_base_cpu) +
> sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> )
>
> nr_cgroups * (144 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup controllers
> that implement css_rstat_flush(). On a host where both memory and io are
> enabled:
>
> nr_cgroups * (144 + 16 * 2)
> nr_cgroups * 176 bytes
>
> This leaves us with an increase in memory overhead of:
> 32 bytes per cgroup per cpu
>
> Validation was performed by reading some *.stat files of a target parent
> cgroup while the system was under different workloads. A test program was
> made to loop 1M times, reading the files cgroup.stat, cpu.stat, io.stat,
> memory.stat of the parent cgroup each iteration. Using a non-patched kernel
> as control and this series as experimental, the findings show perf gains
> when reading stats with this series.
>
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created and
> within each child cgroup a process was spawned to frequently update the
> memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). The test program was run alongside these 26 tasks in
> parallel. The results showed time and perf gains for the reader test
> program.
>
> test program elapsed time
> control:
> real 1m13.663s
> user 0m0.948s
> sys 1m12.356s
>
> experiment:
> real 0m42.498s
> user 0m0.764s
> sys 0m41.546s
>
> test program perf
> control:
> 31.75% mem_cgroup_css_rstat_flush
> 5.49% __blkcg_rstat_flush
> 0.10% cpu_stat_show
> 0.05% cgroup_base_stat_cputime_show
>
> experiment:
> 8.60% mem_cgroup_css_rstat_flush
> 0.15% blkcg_print_stat
> 0.12% cgroup_base_stat_cputime_show
>
> It's worth noting that memcg uses heuristics to optimize flushing.
> Depending on the state of updated stats at a given time, a memcg flush may
> be considered unnecessary and skipped as a result. This opportunity to skip
> a flush is bypassed when memcg is flushed as a consequence of sharing the
> tree with another controller.
>
> A second experiment was setup on the same host using a parent cgroup with
> two child cgroups. The same swap and memory max were used as in the
> previous experiment. In the two child cgroups, kernel builds were done in
> parallel, each using "-j 20". The perf comparison is shown below.
>
> test program elapsed time
> control:
> real 1m22.809s
> user 0m1.142s
> sys 1m21.138s
>
> experiment:
> real 0m42.504s
> user 0m1.000s
> sys 0m41.220s
>
> test program perf
> control:
> 37.16% mem_cgroup_css_rstat_flush
> 3.68% __blkcg_rstat_flush
> 0.09% cpu_stat_show
> 0.06% cgroup_base_stat_cputime_show
>
> experiment:
> 2.02% mem_cgroup_css_rstat_flush
> 0.20% blkcg_print_stat
> 0.14% cpu_stat_show
> 0.08% cgroup_base_stat_cputime_show
>
> The final experiment differs from the previous two in that it measures
> performance from the stat updater perspective. A kernel build was run in a
> child node with -j 20 on the same host and cgroup setup. A baseline was
> established by having the build run while no stats were read. The builds
> were then repeated while stats were constantly being read. In all cases,
> perf appeared similar in cycles spent on cgroup_rstat_updated()
> (insignificant compared to the other recorded events). As for the elapsed
> build times, the results of the different scenarios are shown below,
> indicating no significant drawbacks of the split tree approach.
>
> control with no readers
> real 5m12.307s
> user 84m52.037s
> sys 3m54.000s
>
> control with constant readers of {memory,io,cpu,cgroup}.stat
> real 5m13.209s
> user 84m47.949s
> sys 4m9.260s
>
> experiment with no readers
> real 5m11.961s
> user 84m41.750s
> sys 3m54.058s
>
> experiment with constant readers of {memory,io,cpu,cgroup}.stat
> real 5m12.626s
> user 85m0.323s
> sys 3m56.167s
>
> changelog
> v3:
> new bpf kfunc api for updated/flush
> rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
> check for ss->css_rstat_flush existence where applicable
> rename locks for base stats
> move subsystem locks to cgroup_subsys struct
> change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
> change lock helpers to accept css and perform lock selection within
> fix comments that had outdated lock names
> add open css_is_cgroup() helper
> rename rstatc to rstatbc to reflect base stats in use
> rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
> add comments in early init code to explain deferred allocation
> misc formatting fixes
>
> v2:
> drop the patch creating a new cgroup_rstat struct and related code
> drop bpf-specific patches. instead just use cgroup::self in bpf progs
> drop the cpu lock patches. instead select cpu lock in updated_list func
> relocate the cgroup_rstat_init() call to inside css_create()
> relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
> to css_free_rwork_fn()
> v1:
> https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/
>
> JP Kobryn (4):
> cgroup: separate rstat api for bpf programs
> cgroup: use separate rstat trees for each subsystem
> cgroup: use subsystem-specific rstat locks to avoid contention
> cgroup: split up cgroup_rstat_cpu into base stat and non base stat
> versions
>
> block/blk-cgroup.c | 6 +-
> include/linux/cgroup-defs.h | 80 ++--
> include/linux/cgroup.h | 16 +-
> include/trace/events/cgroup.h | 10 +-
> kernel/cgroup/cgroup-internal.h | 6 +-
> kernel/cgroup/cgroup.c | 69 +--
> kernel/cgroup/rstat.c | 412 +++++++++++-------
> mm/memcontrol.c | 4 +-
> .../selftests/bpf/progs/btf_type_tag_percpu.c | 5 +-
> .../bpf/progs/cgroup_hierarchical_stats.c | 8 +-
> 10 files changed, 363 insertions(+), 253 deletions(-)
>
next prev parent reply other threads:[~2025-03-19 22:19 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-19 22:16 JP Kobryn
2025-03-19 22:16 ` [PATCH 1/4] cgroup: use separate rstat api for bpf programs JP Kobryn
2025-03-19 22:16 ` [PATCH 2/4] cgroup: use separate rstat trees for each subsystem JP Kobryn
2025-03-19 22:16 ` [PATCH 3/4] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
2025-03-19 22:16 ` [PATCH 4/4] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions JP Kobryn
2025-03-19 22:19 ` JP Kobryn [this message]
-- strict thread matches above, loose matches on Subject: below --
2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
2025-02-28 18:22 ` Yosry Ahmed
2025-03-03 15:19 ` Michal Koutný
2025-03-06 1:07 ` JP Kobryn
2025-03-11 13:49 ` Michal Koutný
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=11e6f814-af32-4dc3-a8a9-3d1e3fdf7f6d@gmail.com \
--to=inwardvessel@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@meta.com \
--cc=linux-mm@kvack.org \
--cc=mkoutny@suse.com \
--cc=shakeel.butt@linux.dev \
--cc=tj@kernel.org \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox