From mboxrd@z Thu Jan 1 00:00:00 1970
From: JP Kobryn <inwardvessel@gmail.com>
To: tj@kernel.org, shakeel.butt@linux.dev, yosryahmed@google.com, mkoutny@suse.com, hannes@cmpxchg.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com
Subject: [PATCH v5 0/5] cgroup: separate rstat trees
Date: Fri, 2 May 2025 17:12:17 -0700
Message-ID: <20250503001222.146355-1-inwardvessel@gmail.com>
X-Mailer: git-send-email 2.49.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

The current design of rstat takes the approach that if one subsystem is to be
flushed, all other subsystems with pending updates should also be flushed. A
flush may be initiated by reading specific stats (like cpu.stat), and the
other subsystems are flushed alongside. The complexity of flushing some
subsystems has grown to the extent that the overhead of these side flushes
causes noticeable delays in reading the desired stats.

One big area where the issue comes up is system telemetry, where programs
periodically sample cpu stats while the memory controller is enabled.
Programs sampling cpu.stat would benefit if the overhead of also flushing
memory (and io) stats were eliminated. It would save cpu cycles for existing
stat reader programs and improve scalability in terms of sampling frequency
and host volume.

This series changes the approach of "flush all subsystems" to "flush only the
requested subsystem". The core design change is moving from a unified model
where rstat trees are shared by subsystems to separate trees for each
subsystem. On a per-cpu basis, there will be one tree for each enabled
subsystem that implements css_rstat_flush, plus one tree dedicated to the
base stats. In order to do this, the rstat list pointers were moved off of
the cgroup and onto the css; in the transition, they were changed to point at
cgroup_subsys_state instead of cgroup. Finally, the updated/flush API was
changed to accept a reference to a css instead of a cgroup, which allows a
specific subsystem to be associated with a given update or flush. The result
is that rstat trees are now made up of css nodes, and a given tree only
contains nodes associated with a specific subsystem.

Since separate trees are now in use, the locking scheme was adjusted. The
global locks were split up in such a way that there are separate locks for
the base stats and for each subsystem (memory, io, etc). This allows
different subsystems (and the base stats) to use rstat in parallel with no
contention.

Breaking up the unified tree into separate trees eliminates the overhead and
scalability issues explained above, but comes at the cost of additional
memory. Originally, each cgroup contained an instance of cgroup_rstat_cpu.
The design change of moving to css-based trees calls for each css to carry
the rstat per-cpu objects instead, and moving these objects to every css is
where the overhead is created. In an effort to minimize this, the
cgroup_rstat_cpu struct was split into two separate structs:
cgroup_rstat_base_cpu, which only contains the per-cpu base stat objects used
in rstat, and css_rstat_cpu, which contains the minimum set of pointers
needed for a css to participate in rstat. Since only the cgroup::self css is
associated with the base stats, an instance of cgroup_rstat_base_cpu is
placed on the cgroup, while an instance of css_rstat_cpu is placed on the
cgroup_subsys_state. This allows all css's to participate in rstat while
avoiding the unnecessary inclusion of the base stats; the base stat objects
exist only once per cgroup, regardless of how many subsystems are enabled.
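To make the split concrete, a rough sketch of the resulting per-cpu
structures and the css-based entry points is shown below. It is illustrative
only: field names follow the description above, and the exact definitions in
the patches may differ.

/*
 * Illustrative sketch, not the exact definitions from the patches.
 *
 * Tree linkage only; one instance per css, per cpu (16 bytes).
 */
struct css_rstat_cpu {
        struct cgroup_subsys_state *updated_children;   /* terminated by self */
        struct cgroup_subsys_state *updated_next;       /* NULL when not on a tree */
};

/*
 * Base stat objects; one instance per cgroup (for cgroup::self), per cpu.
 */
struct cgroup_rstat_base_cpu {
        struct u64_stats_sync bsync;
        struct cgroup_base_stat bstat;
        /* ... additional base stat snapshots used during flush ... */
};

/* updated/flush now take a css, selecting the per-subsystem tree and lock */
void css_rstat_updated(struct cgroup_subsys_state *css, int cpu);
void css_rstat_flush(struct cgroup_subsys_state *css);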
With this division of rstat list pointers and base stats, the change in
memory overhead on a per-cpu basis before/after is shown below.

memory overhead before:

  nr_cgroups * sizeof(struct cgroup_rstat_cpu)

  where sizeof(struct cgroup_rstat_cpu) = 144 bytes /* config-dependent */

  resulting in nr_cgroups * 144 bytes

memory overhead after:

  nr_cgroups * (
          sizeof(struct cgroup_rstat_base_cpu) +
          sizeof(struct css_rstat_cpu) * (1 + nr_rstat_controllers)
          )

  where
          sizeof(struct cgroup_rstat_base_cpu) = 128 bytes
          sizeof(struct css_rstat_cpu) = 16 bytes
          the constant "1" accounts for the cgroup::self css
          nr_rstat_controllers = number of controllers defining css_rstat_flush
          (when both memory and io are enabled, nr_rstat_controllers = 2)

  resulting in
          nr_cgroups * (128 + 16 * (1 + 2))
          nr_cgroups * 176 bytes

This leaves us with an increase in memory overhead of:

  32 bytes per cgroup per cpu
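As a rough illustration of scale (these figures are just arithmetic on the
sizes above, not measurements): with 1000 cgroups on a 52-cpu machine like
the one used for validation below, and both memory and io enabled, the
footprint grows from 1000 * 144 = 144 KB to 1000 * 176 = 176 KB per cpu,
or from roughly 7.5 MB to 9.2 MB across all cpus, an increase of about
1.7 MB total.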
Validation was performed by reading some *.stat files of a target parent
cgroup while the system was under different workloads. A test program was
made to loop 1M times, reading the files cgroup.stat, cpu.stat, io.stat, and
memory.stat of the parent cgroup on each iteration (a rough sketch of such a
reader loop is included after the results below). Using a non-patched kernel
as control and this series as the experiment, the findings show perf gains
when reading stats with this series.

The first experiment consisted of a parent cgroup with memory.swap.max=0 and
memory.max=1G. On a 52-cpu machine, 26 child cgroups were created, and within
each child cgroup a process was spawned to frequently update the memory
cgroup stats by creating and then reading a file of size 1T (encouraging
reclaim). The test program was run alongside these 26 tasks in parallel. The
results showed time and perf gains for the reader test program.

test program elapsed time
control:
        real    1m29.929s
        user    0m0.933s
        sys     1m28.525s

experiment:
        real    1m3.604s
        user    0m0.828s
        sys     1m2.497s

test program perf
control:
        29.47% mem_cgroup_css_rstat_flush
         5.09% __blkcg_rstat_flush
         0.07% cpu_stat_show

experiment:
         6.89% mem_cgroup_css_rstat_flush
         0.31% blkcg_print_stat
         0.07% cpu_stat_show

It's worth noting that memcg uses heuristics to optimize flushing. Depending
on the state of updated stats at a given time, a memcg flush may be
considered unnecessary and skipped as a result. This opportunity to skip a
flush is bypassed when memcg is flushed as a consequence of sharing the tree
with another controller.

A second experiment was set up on the same host using a parent cgroup with
two child cgroups. In the two child cgroups, kernel builds were done in
parallel, each using "-j 20". The elapsed time and perf comparison is shown
below.

test program elapsed time
control:
        real    1m59.647s
        user    0m1.263s
        sys     1m57.511s

experiment:
        real    1m0.328s
        user    0m1.077s
        sys     0m58.834s

test program perf
control:
        35.69% mem_cgroup_css_rstat_flush
         4.49% __blkcg_rstat_flush
         0.07% cpu_stat_show
         0.05% cgroup_base_stat_cputime_show

experiment:
         2.04% mem_cgroup_css_rstat_flush
         0.18% blkcg_print_stat
         0.09% cpu_stat_show
         0.09% cgroup_base_stat_cputime_show

The final experiment differs from the previous two in that it measures
performance from the stat updater perspective. A kernel build was run with
-j 20 in a child node on the same host and cgroup setup. A baseline was
established by having the build run while no stats were read. The builds were
then repeated while stats were constantly being read. In all cases, perf
showed similar cycles spent in cgroup_rstat_updated() (insignificant compared
to the other recorded events). As for the elapsed build times, the results of
the different scenarios are shown below, showing no significant drawbacks of
the split tree approach.

control with no readers
        real    5m11.548s
        user    84m45.072s
        sys     3m52.069s

control with constant readers of {memory,io,cpu,cgroup}.stat
        real    5m13.619s
        user    85m1.847s
        sys     4m5.379s

experiment with no readers
        real    5m12.557s
        user    84m54.966s
        sys     3m53.383s

experiment with constant readers of {memory,io,cpu,cgroup}.stat
        real    5m12.548s
        user    84m56.313s
        sys     3m54.955s
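For reference, a minimal reader loop along the lines of the test program
described above could look like the sketch below. The cgroup path, buffer
size, and iteration count are illustrative assumptions, not the actual test
code.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* read one stat file fully, discarding the contents */
static void read_file(const char *path)
{
        char buf[4096];
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return;
        while (read(fd, buf, sizeof(buf)) > 0)
                ;
        close(fd);
}

int main(void)
{
        /* hypothetical parent cgroup path */
        const char *base = "/sys/fs/cgroup/test";
        const char *files[] = { "cgroup.stat", "cpu.stat", "io.stat", "memory.stat" };
        char path[256];

        /* loop 1M times, reading each stat file of the parent cgroup */
        for (long i = 0; i < 1000000; i++) {
                for (unsigned long j = 0; j < sizeof(files) / sizeof(files[0]); j++) {
                        snprintf(path, sizeof(path), "%s/%s", base, files[j]);
                        read_file(path);
                }
        }
        return 0;
}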
changelog
v5:
  new patch for using css_is_cgroup() in more places
  new patch adding is_css_rstat() helper
  new patch documenting circumstances behind where css_rstat_init occurs
  check if css is cgroup early in css_rstat_flush()
  remove ss->css_rstat_flush check in flush loop
  fix css_rstat_flush where "pos" should be used instead of "css"
  change lockdep text in __css_rstat_lock/unlock()
  remove unnecessary base lock init in ss_rstat_init()
  guard against invalid css in css_rstat_updated/flush()
  guard against invalid css in css_rstat_init/exit()
  call css_rstat_updated/flush and css_rstat_init/exit unconditionally
  consolidate calls to css_rstat_exit() into one (aside from error cases)
  eliminate call to css_rstat_init() in cgroup_init() for ss->early_init
  move comment changes to matching commits where applicable
  fix comment with mention of stale function css_rstat_flush_locked()
  fix comment referring to "cgroup" where "css" should be used

v4:
  drop bpf api patch
  drop cgroup_rstat_cpu split and union patch, replace with patch for moving
    base stats into new struct
  new patch for renaming rstat api's from cgroup_* to css_*
  new patch for adding css_is_cgroup() helper
  rename ss->lock and ss->cpu_lock to ss->rstat_ss_lock and
    ss->rstat_ss_cpu_lock respectively
  rename root_self_stat_cpu to root_base_rstat_cpu
  rename cgroup_rstat_push_children to css_rstat_push_children
  format comments for consistency in wings and capitalization
  update comments in bpf selftests

v3:
  new bpf kfunc api for updated/flush
  rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
  check for ss->css_rstat_flush existence where applicable
  rename locks for base stats
  move subsystem locks to cgroup_subsys struct
  change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
  change lock helpers to accept css and perform lock selection within
  fix comments that had outdated lock names
  add open css_is_cgroup() helper
  rename rstatc to rstatbc to reflect base stats in use
  rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
  add comments in early init code to explain deferred allocation
  misc formatting fixes

v2:
  drop the patch creating a new cgroup_rstat struct and related code
  drop bpf-specific patches; instead just use cgroup::self in bpf progs
  drop the cpu lock patches; instead select cpu lock in updated_list func
  relocate the cgroup_rstat_init() call to inside css_create()
  relocate the cgroup_rstat_exit() cleanup from apply_control_enable() to
    css_free_rwork_fn()

v1:
  https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/

JP Kobryn (5):
  cgroup: use helper for distinguishing css in callbacks
  cgroup: use separate rstat trees for each subsystem
  cgroup: use subsystem-specific rstat locks to avoid contention
  cgroup: helper for checking rstat participation of css
  cgroup: document the rstat per-cpu initialization

 block/blk-cgroup.c                            |   2 +-
 include/linux/cgroup-defs.h                   |  78 +++--
 include/trace/events/cgroup.h                 |  12 +-
 kernel/cgroup/cgroup-internal.h               |   2 +-
 kernel/cgroup/cgroup.c                        |  41 +--
 kernel/cgroup/rstat.c                         | 310 +++++++++++-------
 .../selftests/bpf/progs/btf_type_tag_percpu.c |  18 +-
 7 files changed, 289 insertions(+), 174 deletions(-)

--
2.47.1