Message-ID: <11e6f814-af32-4dc3-a8a9-3d1e3fdf7f6d@gmail.com>
Date: Wed, 19 Mar 2025 15:19:52 -0700
Subject: Re: [PATCH 0/4 v2] cgroup: separate rstat trees
From: JP Kobryn <inwardvessel@gmail.com>
To: tj@kernel.org, shakeel.butt@linux.dev, yosryahmed@google.com,
 mkoutny@suse.com, hannes@cmpxchg.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com
References: <20250319221634.71128-1-inwardvessel@gmail.com>
In-Reply-To: <20250319221634.71128-1-inwardvessel@gmail.com>
Please disregard this thread. It was sent with an incorrect version. A
revised series labeled "v3" will be sent shortly.

On 3/19/25 3:16 PM, JP Kobryn wrote:
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed.
> A flush may be initiated by reading specific stats (like cpu.stat) and
> other subsystems will be flushed alongside. The complexity of flushing
> some subsystems has grown to the extent that the overhead of side
> flushes is unnecessarily delaying the fetching of desired stats.
>
> One big area where the issue comes up is system telemetry, where
> programs periodically sample cpu stats while the memory controller is
> enabled. It would be a benefit for programs sampling cpu.stat if the
> overhead of having to flush memory (and also io) stats was eliminated.
> It would save cpu cycles for existing stat reader programs and improve
> scalability in terms of sampling frequency and host volume.
>
> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> unified model where rstat trees are shared by subsystems to having
> separate trees for each subsystem. On a per-cpu basis, there will be
> separate trees for each enabled subsystem if it implements
> css_rstat_flush, plus one tree for the base stats. In order to do
> this, the rstat list pointers were moved off of the cgroup and onto
> the css. In the transition, these pointer types were changed to
> cgroup_subsys_state. Finally, the API for updated/flush was changed to
> accept a reference to a css instead of a cgroup. This allows a
> specific subsystem to be associated with a given update or flush. The
> result is that rstat trees will now be made up of css nodes, and a
> given tree will only contain nodes associated with a specific
> subsystem.
>
> Since separate trees will now be in use, the locking scheme was
> adjusted. The global locks were split up in such a way that there are
> separate locks for the base stats (cgroup::self) and for each
> subsystem (memory, io, etc). This allows different subsystems (and
> base stats) to use rstat in parallel with no contention.
>
> Breaking up the unified tree into separate trees eliminates the
> overhead and scalability issue explained in the first section, but
> comes at the expense of additional memory. In an effort to minimize
> this overhead, new rstat structs are introduced and a conditional
> allocation is performed. The cgroup_rstat_cpu struct, which originally
> contained the rstat list pointers and the base stat entities, was
> renamed cgroup_rstat_base_cpu. It is only allocated when the
> associated css is cgroup::self. For non-self css's, a new compact
> struct was added that only contains the rstat list pointers. During
> initialization, when the given css is associated with an actual
> subsystem (not cgroup::self), this compact struct is allocated. With
> this conditional allocation, the change in per-cpu memory overhead
> before/after is shown below.
>
> memory overhead before:
> sizeof(struct cgroup_rstat_cpu) =~ 144 bytes /* can vary based on config */
>
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 144 bytes
>
> memory overhead after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 144 bytes
>
> nr_cgroups * (
>     sizeof(struct cgroup_rstat_base_cpu) +
>     sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> )
>
> nr_cgroups * (144 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
>
> nr_cgroups * (144 + 16 * 2)
> nr_cgroups * 176 bytes
>
> This leaves us with an increase in memory overhead of:
> 32 bytes per cgroup per cpu
>
> Validation was performed by reading some *.stat files of a target
> parent cgroup while the system was under different workloads. A test
> program was made to loop 1M times, reading the files cgroup.stat,
> cpu.stat, io.stat, and memory.stat of the parent cgroup each
> iteration.
> Using a non-patched kernel as control and this series as experimental,
> the findings show perf gains when reading stats with this series.
>
> The first experiment consisted of a parent cgroup with
> memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child
> cgroups were created, and within each child cgroup a process was
> spawned to frequently update the memory cgroup stats by creating and
> then reading a file of size 1T (encouraging reclaim). The test program
> was run alongside these 26 tasks in parallel. The results showed time
> and perf gains for the reader test program.
>
> test program elapsed time
> control:
> real    1m13.663s
> user    0m0.948s
> sys     1m12.356s
>
> experiment:
> real    0m42.498s
> user    0m0.764s
> sys     0m41.546s
>
> test program perf
> control:
> 31.75% mem_cgroup_css_rstat_flush
>  5.49% __blkcg_rstat_flush
>  0.10% cpu_stat_show
>  0.05% cgroup_base_stat_cputime_show
>
> experiment:
> 8.60% mem_cgroup_css_rstat_flush
> 0.15% blkcg_print_stat
> 0.12% cgroup_base_stat_cputime_show
>
> It's worth noting that memcg uses heuristics to optimize flushing.
> Depending on the state of updated stats at a given time, a memcg flush
> may be considered unnecessary and skipped as a result. This
> opportunity to skip a flush is bypassed when memcg is flushed as a
> consequence of sharing the tree with another controller.
>
> A second experiment was set up on the same host using a parent cgroup
> with two child cgroups. The same swap and memory max settings were
> used as in the previous experiment. In the two child cgroups, kernel
> builds were done in parallel, each using "-j 20". The perf comparison
> is shown below.
>
> test program elapsed time
> control:
> real    1m22.809s
> user    0m1.142s
> sys     1m21.138s
>
> experiment:
> real    0m42.504s
> user    0m1.000s
> sys     0m41.220s
>
> test program perf
> control:
> 37.16% mem_cgroup_css_rstat_flush
>  3.68% __blkcg_rstat_flush
>  0.09% cpu_stat_show
>  0.06% cgroup_base_stat_cputime_show
>
> experiment:
> 2.02% mem_cgroup_css_rstat_flush
> 0.20% blkcg_print_stat
> 0.14% cpu_stat_show
> 0.08% cgroup_base_stat_cputime_show
>
> The final experiment differs from the previous two in that it measures
> performance from the stat updater's perspective. A kernel build was
> run in a child node with -j 20 on the same host and cgroup setup. A
> baseline was established by having the build run while no stats were
> read. The builds were then repeated while stats were constantly being
> read. In all cases, perf appeared similar in cycles spent on
> cgroup_rstat_updated() (insignificant compared to the other recorded
> events). As for the elapsed build times, the results of the different
> scenarios are shown below, indicating no significant drawbacks of the
> split-tree approach.
>
> control with no readers
> real    5m12.307s
> user    84m52.037s
> sys     3m54.000s
>
> control with constant readers of {memory,io,cpu,cgroup}.stat
> real    5m13.209s
> user    84m47.949s
> sys     4m9.260s
>
> experiment with no readers
> real    5m11.961s
> user    84m41.750s
> sys     3m54.058s
>
> experiment with constant readers of {memory,io,cpu,cgroup}.stat
> real    5m12.626s
> user    85m0.323s
> sys     3m56.167s
>
> changelog
> v3:
> new bpf kfunc api for updated/flush
> rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
> check for ss->css_rstat_flush existence where applicable
> rename locks for base stats
> move subsystem locks to cgroup_subsys struct
> change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
> change lock helpers to accept css and perform lock selection within
> fix comments that had outdated lock names
> add open css_is_cgroup() helper
> rename rstatc to rstatbc to reflect base stats in use
> rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
> add comments in early init code to explain deferred allocation
> misc formatting fixes
>
> v2:
> drop the patch creating a new cgroup_rstat struct and related code
> drop bpf-specific patches. instead just use cgroup::self in bpf progs
> drop the cpu lock patches. instead select cpu lock in updated_list func
> relocate the cgroup_rstat_init() call to inside css_create()
> relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
>   to css_free_rwork_fn()
>
> v1:
> https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/
>
> JP Kobryn (4):
>   cgroup: separate rstat api for bpf programs
>   cgroup: use separate rstat trees for each subsystem
>   cgroup: use subsystem-specific rstat locks to avoid contention
>   cgroup: split up cgroup_rstat_cpu into base stat and non base stat
>     versions
>
>  block/blk-cgroup.c                            |   6 +-
>  include/linux/cgroup-defs.h                   |  80 ++--
>  include/linux/cgroup.h                        |  16 +-
>  include/trace/events/cgroup.h                 |  10 +-
>  kernel/cgroup/cgroup-internal.h               |   6 +-
>  kernel/cgroup/cgroup.c                        |  69 +--
>  kernel/cgroup/rstat.c                         | 412 +++++++++++-------
>  mm/memcontrol.c                               |   4 +-
>  .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
>  .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
>  10 files changed, 363 insertions(+), 253 deletions(-)