From mboxrd@z Thu Jan  1 00:00:00 1970
From: JP Kobryn <inwardvessel@gmail.com>
To: tj@kernel.org,
	shakeel.butt@linux.dev,
	yosryahmed@google.com,
	mkoutny@suse.com,
	hannes@cmpxchg.org,
	akpm@linux-foundation.org
Cc: linux-mm@kvack.org,
	cgroups@vger.kernel.org,
	kernel-team@meta.com
Subject: [PATCH v4 0/5] cgroup: separate rstat trees
Date: Thu,  3 Apr 2025 18:10:45 -0700
Message-ID: <20250404011050.121777-1-inwardvessel@gmail.com>
X-Mailer: git-send-email 2.49.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

The current design of rstat takes the approach that if one subsystem is to
be flushed, all other subsystems with pending updates should also be
flushed. A flush may be initiated by reading specific stats (like cpu.stat)
and other subsystems will be flushed alongside. The complexity of flushing
some subsystems has grown to the extent that the overhead of side flushes
is causing noticeable delays in reading the desired stats.

One big area where the issue comes up is system telemetry, where programs
periodically sample cpu stats while the memory controller is enabled.
Programs sampling cpu.stat would benefit if they no longer had to flush
memory (and io) stats as a side effect. That would save cpu cycles for
existing stat reader programs and improve scalability in terms of sampling
frequency and host volume.

This series changes the approach of "flush all subsystems" to "flush only
the requested subsystem". The core design change is moving from a unified
model where rstat trees are shared by subsystems to having separate trees
for each subsystem. On a per-cpu basis, there will be separate trees for
each enabled subsystem that implements css_rstat_flush plus one tree
dedicated to the base stats. In order to do this, the rstat list pointers
were moved off of the cgroup and onto the css. In the transition, these
pointer types were changed to cgroup_subsys_state. Finally the API for
updated/flush was changed to accept a reference to a css instead of a
cgroup. This allows for a specific subsystem to be associated with a given
update or flush. The result is that rstat trees will now be made up of css
nodes, and a given tree will only contain nodes associated with a specific
subsystem.
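
To illustrate the idea only (this is not code from the series), here is a
minimal userspace sketch of "flush only the requested subsystem". All
toy_* names are hypothetical. A flat, single-cpu updated list per
subsystem stands in for the real per-cpu trees, and hierarchy and locking
are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subsystem ids, used by this toy model only. */
enum toy_ss { TOY_BASE, TOY_MEMORY, TOY_IO, TOY_NR_SS };

/* Toy css: one node per (cgroup, subsystem) pair. */
struct toy_css {
	enum toy_ss ss;			/* subsystem that owns this node */
	long pending;			/* updates not yet flushed */
	long flushed;			/* aggregated value after flushing */
	struct toy_css *updated_next;	/* link in the owner's updated list */
	int on_list;			/* already queued for flushing? */
};

/* One updated list per subsystem instead of one shared list. */
struct toy_rstat {
	struct toy_css *updated[TOY_NR_SS];
};

/* Record an update and queue the css on its own subsystem's list. */
void toy_css_rstat_updated(struct toy_rstat *r, struct toy_css *css,
			   long delta)
{
	assert(r != NULL && css != NULL);
	css->pending += delta;
	if (!css->on_list) {
		css->updated_next = r->updated[css->ss];
		r->updated[css->ss] = css;
		css->on_list = 1;
	}
}

/*
 * Flush only the list belonging to @ss and return the number of nodes
 * visited. Nodes queued by other subsystems are left untouched.
 */
int toy_css_rstat_flush(struct toy_rstat *r, enum toy_ss ss)
{
	struct toy_css *pos = r->updated[ss];
	int visited = 0;

	r->updated[ss] = NULL;
	while (pos) {
		struct toy_css *next = pos->updated_next;

		pos->flushed += pos->pending;
		pos->pending = 0;
		pos->on_list = 0;
		pos->updated_next = NULL;
		visited++;
		pos = next;
	}
	return visited;
}
```

In this model, flushing TOY_BASE after both a base and a memory update
visits one node and leaves the memory node pending, which is the
behavior a cpu.stat reader would want.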

Since separate trees will now be in use, the locking scheme was adjusted.
The global locks were split up in such a way that there are separate locks
for the base stats and also for each subsystem (memory, io, etc). This
allows different subsystems (and base stats) to use rstat in parallel with
no contention.
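
As a rough userspace sketch of the lock split (again not code from the
series), the model below keeps one lock for the base stats and one per
subsystem, and selects the lock from an id. The toy_* names and the
C11 atomic_flag spinlock are assumptions made for illustration; the
kernel uses its own spinlock primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock ids: one for base stats, one per subsystem. */
enum toy_lock_id { TOY_LOCK_BASE, TOY_LOCK_MEMORY, TOY_LOCK_IO, TOY_NR_LOCKS };

/* Minimal test-and-set lock standing in for a kernel spinlock. */
struct toy_spinlock {
	atomic_flag held;
};

static struct toy_spinlock toy_rstat_locks[TOY_NR_LOCKS] = {
	{ ATOMIC_FLAG_INIT }, { ATOMIC_FLAG_INIT }, { ATOMIC_FLAG_INIT },
};

/* Lock selection: each id maps to its own lock. */
struct toy_spinlock *toy_rstat_lock_for(enum toy_lock_id id)
{
	assert((unsigned int)id < TOY_NR_LOCKS);
	return &toy_rstat_locks[id];
}

/* Returns 1 if the lock was acquired, 0 if it is already held. */
int toy_trylock(struct toy_spinlock *lock)
{
	return !atomic_flag_test_and_set_explicit(&lock->held,
						  memory_order_acquire);
}

void toy_unlock(struct toy_spinlock *lock)
{
	atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

With this layout, holding the memory lock does not prevent taking the io
lock, while a second attempt on the memory lock fails until it is
released.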

Breaking up the unified tree into separate trees eliminates the overhead
and scalability issues explained in the first section, but comes at the
cost of additional memory. Originally, each cgroup contained an instance of
the cgroup_rstat_cpu. The design change of moving to css-based trees calls
for each css having the rstat per-cpu objects instead. Moving these objects
to every css is where this overhead is created. In an effort to minimize
this, the cgroup_rstat_cpu struct was split into two separate structs. One
is the cgroup_rstat_base_cpu struct which only contains the per-cpu base
stat objects used in rstat. The other is the css_rstat_cpu struct which
contains the minimum number of pointers needed for a css to participate in
rstat. Since only the cgroup::self css is associated with the base stats,
an instance of the cgroup_rstat_base_cpu struct is placed on the cgroup.
Meanwhile an instance of the css_rstat_cpu is placed on the
cgroup_subsys_state. This allows for all css's to participate in rstat
while avoiding the unnecessary inclusion of the base stats. The base stat
objects will only exist once per cgroup regardless of how many
subsystems are enabled. With this division of rstat list pointers and base
stats, the change in memory overhead on a per-cpu basis before/after is
shown below.

memory overhead before:
	nr_cgroups * sizeof(struct cgroup_rstat_cpu)
where
	sizeof(struct cgroup_rstat_cpu) = 144 bytes /* config-dependent */
resulting in
	nr_cgroups * 144 bytes

memory overhead after:
	nr_cgroups * (
		sizeof(struct cgroup_rstat_base_cpu) +
			sizeof(struct css_rstat_cpu) * (1 + nr_rstat_controllers)
		)
where
	sizeof(struct cgroup_rstat_base_cpu) = 128 bytes
	sizeof(struct css_rstat_cpu) = 16 bytes
	the constant "1" accounts for the cgroup::self css
	nr_rstat_controllers = number of controllers defining css_rstat_flush
when both memory and io are enabled
	nr_rstat_controllers = 2
resulting in
	nr_cgroups * (128 + 16 * (1 + 2))
	nr_cgroups * 176 bytes

This leaves us with an increase in memory overhead of:
	32 bytes per cgroup per cpu
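
The arithmetic above can be checked with a small standalone program.
This is a sketch under the sizes quoted in this letter (144, 128 and 16
bytes, all config-dependent). The function names and the two-pointer
toy struct are illustrative and are not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Per-cpu sizes quoted above; they depend on the kernel config. */
#define TOY_RSTAT_CPU_SIZE	144	/* struct cgroup_rstat_cpu (before) */
#define TOY_RSTAT_BASE_CPU_SIZE	128	/* struct cgroup_rstat_base_cpu */
#define TOY_CSS_RSTAT_CPU_SIZE	16	/* struct css_rstat_cpu */

/*
 * Illustrative stand-in for a struct holding only list pointers:
 * two pointers occupy 16 bytes on a typical 64-bit build.
 */
struct toy_css_rstat_cpu {
	void *updated_children;
	void *updated_next;
};

/* Bytes per cpu before the series: one combined struct per cgroup. */
size_t rstat_bytes_before(size_t nr_cgroups)
{
	return nr_cgroups * TOY_RSTAT_CPU_SIZE;
}

/*
 * Bytes per cpu after the series: one base struct per cgroup plus one
 * small struct for cgroup::self and for each rstat controller.
 */
size_t rstat_bytes_after(size_t nr_cgroups, size_t nr_rstat_controllers)
{
	return nr_cgroups * (TOY_RSTAT_BASE_CPU_SIZE +
			     TOY_CSS_RSTAT_CPU_SIZE * (1 + nr_rstat_controllers));
}

/* Additional bytes per cpu introduced by the split. */
size_t rstat_bytes_delta(size_t nr_cgroups, size_t nr_rstat_controllers)
{
	size_t before = rstat_bytes_before(nr_cgroups);
	size_t after = rstat_bytes_after(nr_cgroups, nr_rstat_controllers);

	assert(after >= before);
	return after - before;
}
```

With memory and io enabled, rstat_bytes_delta(1, 2) gives the 32 bytes
stated above. With no rstat controllers enabled, the before and after
totals are equal at 144 bytes per cgroup per cpu.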

Validation was performed by reading some *.stat files of a target parent
cgroup while the system was under different workloads. A test program was
made to loop 1M times, reading the files cgroup.stat, cpu.stat, io.stat,
memory.stat of the parent cgroup each iteration. Using a non-patched kernel
as control and this series as experimental, the findings show perf gains
when reading stats with this series.
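
The actual test program is not included in this letter. A reader loop
along the lines described could look like the sketch below. The function
name, the buffer sizes and the choice to reopen each file on every
iteration are assumptions.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* The four files named in the description above. */
static const char *const stat_files[] = {
	"cgroup.stat", "cpu.stat", "io.stat", "memory.stat",
};

/*
 * Read every stat file under @cgroup_dir once per iteration and discard
 * the contents. Returns the number of files read to EOF successfully.
 */
long read_stat_files(const char *cgroup_dir, long iterations)
{
	char path[4096];
	char buf[65536];
	long ok = 0;

	assert(cgroup_dir != NULL);

	for (long i = 0; i < iterations; i++) {
		for (size_t f = 0;
		     f < sizeof(stat_files) / sizeof(stat_files[0]); f++) {
			ssize_t n;
			int fd;

			snprintf(path, sizeof(path), "%s/%s",
				 cgroup_dir, stat_files[f]);
			fd = open(path, O_RDONLY);
			if (fd < 0)
				continue;	/* file missing or unreadable */

			/* Each read of a stat file triggers a flush. */
			while ((n = read(fd, buf, sizeof(buf))) > 0)
				;
			if (n == 0)
				ok++;
			close(fd);
		}
	}
	return ok;
}
```

A caller would pass the parent cgroup directory and the iteration count,
for example read_stat_files("/sys/fs/cgroup/parent", 1000000), where the
path is a placeholder, and run the binary under time(1) to collect
real/user/sys figures like the ones reported below.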

The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created and
within each child cgroup a process was spawned to frequently update the
memory cgroup stats by creating and then reading a file of size 1T
(encouraging reclaim). The test program was run alongside these 26 tasks in
parallel. The results showed time and perf gains for the reader test
program.

test program elapsed time
control:
real    1m0.956s
user    0m0.569s
sys     1m0.195s

experiment:
real    0m37.660s
user    0m0.463s
sys     0m37.078s

test program perf
control:
24.62% mem_cgroup_css_rstat_flush
 4.97% __blkcg_rstat_flush
 0.09% cpu_stat_show
 0.05% cgroup_base_stat_cputime_show

experiment:
2.68% mem_cgroup_css_rstat_flush
0.04% blkcg_print_stat
0.07% cpu_stat_show
0.06% cgroup_base_stat_cputime_show

It's worth noting that memcg uses heuristics to optimize flushing.
Depending on the state of updated stats at a given time, a memcg flush may
be considered unnecessary and skipped as a result. This opportunity to skip
a flush is bypassed when memcg is flushed as a consequence of sharing the
tree with another controller.

A second experiment was set up on the same host using a parent cgroup with
two child cgroups. The same swap and memory max were used as in the
previous experiment. In the two child cgroups, kernel builds were done in
parallel, each using "-j 20". The perf comparison is shown below.

test program elapsed time
control:
real    1m27.620s
user    0m0.779s
sys     1m26.258s

experiment:
real    0m45.805s
user    0m0.723s
sys     0m44.757s

test program perf
control:
30.84% mem_cgroup_css_rstat_flush
 6.75% __blkcg_rstat_flush
 0.08% cpu_stat_show
 0.04% cgroup_base_stat_cputime_show

experiment:
1.55% mem_cgroup_css_rstat_flush
0.15% blkcg_print_stat
0.10% cpu_stat_show
0.09% cgroup_base_stat_cputime_show
0.00% __blkcg_rstat_flush

The final experiment differs from the previous two in that it measures
performance from the stat updater perspective. A kernel build was run in a
child node with -j 20 on the same host and cgroup setup. A baseline was
established by having the build run while no stats were read. The builds
were then repeated while stats were constantly being read. In all cases,
perf appeared similar in cycles spent on cgroup_rstat_updated()
(insignificant compared to the other recorded events). As for the elapsed
build times, the results of the different scenarios are shown below,
showing no significant drawbacks of the split tree approach.

control with no readers
real    3m21.003s
user    55m52.133s
sys     2m40.728s

control with constant readers of {memory,io,cpu,cgroup}.stat
real    3m26.164s
user    56m49.474s
sys     2m56.389s

experiment with no readers
real    3m22.740s
user    56m18.972s
sys     2m45.041s

experiment with constant readers of {memory,io,cpu,cgroup}.stat
real    3m26.971s
user    57m11.540s
sys     2m49.735s

changelog
v4:
	drop bpf api patch
	drop cgroup_rstat_cpu split and union patch,
		replace with patch for moving base stats into new struct
	new patch for renaming rstat api's from cgroup_* to css_*
	new patch for adding css_is_cgroup() helper
	rename ss->lock and ss->cpu_lock to ss->rstat_ss_lock and
		ss->rstat_ss_cpu_lock respectively
	rename root_self_stat_cpu to root_base_rstat_cpu
	rename cgroup_rstat_push_children to css_rstat_push_children
	format comments for consistency in wings and capitalization
	update comments in bpf selftests

v3:
	new bpf kfunc api for updated/flush
	rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
	check for ss->css_rstat_flush existence where applicable
	rename locks for base stats
	move subsystem locks to cgroup_subsys struct
	change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
	change lock helpers to accept css and perform lock selection within
	fix comments that had outdated lock names
	add open css_is_cgroup() helper
	rename rstatc to rstatbc to reflect base stats in use
	rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
	add comments in early init code to explain deferred allocation
	misc formatting fixes

v2:
	drop the patch creating a new cgroup_rstat struct and related code
	drop bpf-specific patches. instead just use cgroup::self in bpf progs
	drop the cpu lock patches. instead select cpu lock in updated_list func
	relocate the cgroup_rstat_init() call to inside css_create()
	relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
		to css_free_rwork_fn()
v1:
	https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/

JP Kobryn (4):
  cgroup: separate rstat api for bpf programs
  cgroup: use separate rstat trees for each subsystem
  cgroup: use subsystem-specific rstat locks to avoid contention
  cgroup: split up cgroup_rstat_cpu into base stat and non base stat
    versions

 block/blk-cgroup.c                            |   6 +-
 include/linux/cgroup-defs.h                   |  80 ++--
 include/linux/cgroup.h                        |  16 +-
 include/trace/events/cgroup.h                 |  10 +-
 kernel/cgroup/cgroup-internal.h               |   6 +-
 kernel/cgroup/cgroup.c                        |  69 +--
 kernel/cgroup/rstat.c                         | 412 +++++++++++-------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 10 files changed, 363 insertions(+), 253 deletions(-)

-- 
2.47.1