Re: [PATCH 0/4 v2] cgroup: separate rstat trees

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: JP Kobryn <inwardvessel@gmail.com>
To: tj@kernel.org, shakeel.butt@linux.dev, yosryahmed@google.com,
	mkoutny@suse.com, hannes@cmpxchg.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/4 v2] cgroup: separate rstat trees
Date: Wed, 19 Mar 2025 15:19:52 -0700	[thread overview]
Message-ID: <11e6f814-af32-4dc3-a8a9-3d1e3fdf7f6d@gmail.com> (raw)
In-Reply-To: <20250319221634.71128-1-inwardvessel@gmail.com>

Please disregard this thread. It was sent with an incorrect version. A 
revised series labeled "v3" will be sent shortly.

On 3/19/25 3:16 PM, JP Kobryn wrote:
> The current design of rstat takes the approach that if one subsystem is to
> be flushed, all other subsystems with pending updates should also be
> flushed. A flush may be initiated by reading specific stats (like cpu.stat)
> and other subsystems will be flushed alongside. The complexity of flushing
> some subsystems has grown to the extent that the overhead of side flushes
> is unnecessarily delaying the fetching of desired stats.
> 
> One big area where the issue comes up is system telemetry, where programs
> periodically sample cpu stats while the memory controller is enabled. It
> would be a benefit for programs sampling cpu.stat if the overhead of having
> to flush memory (and also io) stats was eliminated. It would save cpu
> cycles for existing stat reader programs and improve scalability in terms
> of sampling frequency and host volume.
> 
> This series changes the approach of "flush all subsystems" to "flush only
> the requested subsystem". The core design change is moving from a unified
> model where rstat trees are shared by subsystems to having separate trees
> for each subsystem. On a per-cpu basis, there will be separate trees for
> each enabled subsystem if it implements css_rstat_flush and one tree for
> the base stats. In order to do this, the rstat list pointers were moved off
> of the cgroup and onto the css. In the transition, these pointer types were
> changed to cgroup_subsys_state. Finally the API for updated/flush was
> changed to accept a reference to a css instead of a cgroup. This allows for
> a specific subsystem to be associated with a given update or flush. The
> result is that rstat trees will now be made up of css nodes, and a given
> tree will only contain nodes associated with a specific subsystem.
> 
> Since separate trees will now be in use, the locking scheme was adjusted.
> The global locks were split up in such a way that there are separate locks
> for the base stats (cgroup::self) and also for each subsystem (memory, io,
> etc). This allows different subsystems (and base stats) to use rstat in
> parallel with no contention.
> 
> Breaking up the unified tree into separate trees eliminates the overhead
> and scalability issue explained in the first section, but comes at the
> expense of additional memory. In an effort to minimize this overhead, new
> rstat structs are introduced and a conditional allocation is performed.
> The cgroup_rstat_cpu which originally contained the rstat list pointers and
> the base stat entities was renamed cgroup_rstat_base_cpu. It is only
> allocated when the associated css is cgroup::self. As for non-self css's, a
> new compact struct was added that only contains the rstat list pointers.
> During initialization, when the given css is associated with an actual
> subsystem (not cgroup::self), this compact struct is allocated. With this
> conditional allocation, the change in memory overhead on a per-cpu basis
> before/after is shown below.
> 
> memory overhead before:
> sizeof(struct cgroup_rstat_cpu) =~ 144 bytes /* can vary based on config */
> 
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 144 bytes
> 
> memory overhead after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 144 bytes
> 
> nr_cgroups * (
> 	sizeof(struct cgroup_rstat_base_cpu) +
> 		sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> 	)
> 
> nr_cgroups * (144 + 16 * nr_rstat_controllers)
> 
> ... where nr_rstat_controllers is the number of enabled cgroup controllers
> that implement css_rstat_flush(). On a host where both memory and io are
> enabled:
> 
> nr_cgroups * (144 + 16 * 2)
> nr_cgroups * 176 bytes
> 
> This leaves us with an increase in memory overhead of:
> 	32 bytes per cgroup per cpu
> 
> Validation was performed by reading some *.stat files of a target parent
> cgroup while the system was under different workloads. A test program was
> made to loop 1M times, reading the files cgroup.stat, cpu.stat, io.stat,
> memory.stat of the parent cgroup each iteration. Using a non-patched kernel
> as control and this series as experimental, the findings show perf gains
> when reading stats with this series.
> 
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created and
> within each child cgroup a process was spawned to frequently update the
> memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). The test program was run alongside these 26 tasks in
> parallel. The results showed time and perf gains for the reader test
> program.
> 
> test program elapsed time
> control:
> real    1m13.663s
> user    0m0.948s
> sys     1m12.356s
> 
> experiment:
> real    0m42.498s
> user    0m0.764s
> sys     0m41.546s
> 
> test program perf
> control:
> 31.75% mem_cgroup_css_rstat_flush
>   5.49% __blkcg_rstat_flush
>   0.10% cpu_stat_show
>   0.05% cgroup_base_stat_cputime_show
> 
> experiment:
> 8.60% mem_cgroup_css_rstat_flush
> 0.15% blkcg_print_stat
> 0.12% cgroup_base_stat_cputime_show
> 
> It's worth noting that memcg uses heuristics to optimize flushing.
> Depending on the state of updated stats at a given time, a memcg flush may
> be considered unnecessary and skipped as a result. This opportunity to skip
> a flush is bypassed when memcg is flushed as a consequence of sharing the
> tree with another controller.
> 
> A second experiment was setup on the same host using a parent cgroup with
> two child cgroups. The same swap and memory max were used as in the
> previous experiment. In the two child cgroups, kernel builds were done in
> parallel, each using "-j 20". The perf comparison is shown below.
> 
> test program elapsed time
> control:
> real    1m22.809s
> user    0m1.142s
> sys     1m21.138s
> 
> experiment:
> real    0m42.504s
> user    0m1.000s
> sys     0m41.220s
> 
> test program perf
> control:
> 37.16% mem_cgroup_css_rstat_flush
>   3.68% __blkcg_rstat_flush
>   0.09% cpu_stat_show
>   0.06% cgroup_base_stat_cputime_show
> 
> experiment:
> 2.02% mem_cgroup_css_rstat_flush
> 0.20% blkcg_print_stat
> 0.14% cpu_stat_show
> 0.08% cgroup_base_stat_cputime_show
> 
> The final experiment differs from the previous two in that it measures
> performance from the stat updater perspective. A kernel build was run in a
> child node with -j 20 on the same host and cgroup setup. A baseline was
> established by having the build run while no stats were read. The builds
> were then repeated while stats were constantly being read. In all cases,
> perf appeared similar in cycles spent on cgroup_rstat_updated()
> (insignificant compared to the other recorded events). As for the elapsed
> build times, the results of the different scenarios are shown below,
> indicating no significant drawbacks of the split tree approach.
> 
> control with no readers
> real    5m12.307s
> user    84m52.037s
> sys     3m54.000s
> 
> control with constant readers of {memory,io,cpu,cgroup}.stat
> real    5m13.209s
> user    84m47.949s
> sys     4m9.260s
> 
> experiment with no readers
> real    5m11.961s
> user    84m41.750s
> sys     3m54.058s
> 
> experiment with constant readers of {memory,io,cpu,cgroup}.stat
> real    5m12.626s
> user    85m0.323s
> sys     3m56.167s
> 
> changelog
> v3:
> 	new bpf kfunc api for updated/flush
> 	rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
> 	check for ss->css_rstat_flush existence where applicable
> 	rename locks for base stats
> 	move subsystem locks to cgroup_subsys struct
> 	change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
> 	change lock helpers to accept css and perform lock selection within
> 	fix comments that had outdated lock names
> 	add open css_is_cgroup() helper
> 	rename rstatc to rstatbc to reflect base stats in use
> 	rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
> 	add comments in early init code to explain deferred allocation
> 	misc formatting fixes
> 
> v2:
> 	drop the patch creating a new cgroup_rstat struct and related code
> 	drop bpf-specific patches. instead just use cgroup::self in bpf progs
> 	drop the cpu lock patches. instead select cpu lock in updated_list func
> 	relocate the cgroup_rstat_init() call to inside css_create()
> 	relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
> 		to css_free_rwork_fn()
> v1:
> 	https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/
> 
> JP Kobryn (4):
>    cgroup: separate rstat api for bpf programs
>    cgroup: use separate rstat trees for each subsystem
>    cgroup: use subsystem-specific rstat locks to avoid contention
>    cgroup: split up cgroup_rstat_cpu into base stat and non base stat
>      versions
> 
>   block/blk-cgroup.c                            |   6 +-
>   include/linux/cgroup-defs.h                   |  80 ++--
>   include/linux/cgroup.h                        |  16 +-
>   include/trace/events/cgroup.h                 |  10 +-
>   kernel/cgroup/cgroup-internal.h               |   6 +-
>   kernel/cgroup/cgroup.c                        |  69 +--
>   kernel/cgroup/rstat.c                         | 412 +++++++++++-------
>   mm/memcontrol.c                               |   4 +-
>   .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
>   .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
>   10 files changed, 363 insertions(+), 253 deletions(-)
>

next prev parent reply	other threads:[~2025-03-19 22:19 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-19 22:16 JP Kobryn
2025-03-19 22:16 ` [PATCH 1/4] cgroup: use separate rstat api for bpf programs JP Kobryn
2025-03-19 22:16 ` [PATCH 2/4] cgroup: use separate rstat trees for each subsystem JP Kobryn
2025-03-19 22:16 ` [PATCH 3/4] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
2025-03-19 22:16 ` [PATCH 4/4] cgroup: save memory by splitting cgroup_rstat_cpu into compact and full versions JP Kobryn
2025-03-19 22:19 ` JP Kobryn [this message]
  -- strict thread matches above, loose matches on Subject: below --
2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
2025-02-28 18:22 ` Yosry Ahmed
2025-03-03 15:19 ` Michal Koutný
2025-03-06  1:07   ` JP Kobryn
2025-03-11 13:49     ` Michal Koutný

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=11e6f814-af32-4dc3-a8a9-3d1e3fdf7f6d@gmail.com \
    --to=inwardvessel@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=linux-mm@kvack.org \
    --cc=mkoutny@suse.com \
    --cc=shakeel.butt@linux.dev \
    --cc=tj@kernel.org \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox