Message-ID: <11e6f814-af32-4dc3-a8a9-3d1e3fdf7f6d@gmail.com>
Date: Wed, 19 Mar 2025 15:19:52 -0700
Subject: Re: [PATCH 0/4 v2] cgroup: separate rstat trees
From: JP Kobryn <inwardvessel@gmail.com>
To: tj@kernel.org, shakeel.butt@linux.dev, yosryahmed@google.com,
 mkoutny@suse.com, hannes@cmpxchg.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com
References: <20250319221634.71128-1-inwardvessel@gmail.com>
In-Reply-To: <20250319221634.71128-1-inwardvessel@gmail.com>
Please disregard this thread. It was sent with an incorrect version. A
revised series labeled "v3" will be sent shortly.

On 3/19/25 3:16 PM, JP Kobryn wrote:
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed.
> A flush may be initiated by reading specific stats (like cpu.stat) and
> other subsystems will be flushed alongside. The complexity of flushing
> some subsystems has grown to the extent that the overhead of side
> flushes is unnecessarily delaying the fetching of desired stats.
>
> One big area where the issue comes up is system telemetry, where
> programs periodically sample cpu stats while the memory controller is
> enabled. It would be a benefit for programs sampling cpu.stat if the
> overhead of having to flush memory (and also io) stats was eliminated.
> It would save cpu cycles for existing stat reader programs and improve
> scalability in terms of sampling frequency and host volume.
>
> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> unified model where rstat trees are shared by subsystems to having
> separate trees for each subsystem. On a per-cpu basis, there will be
> separate trees for each enabled subsystem if it implements
> css_rstat_flush, plus one tree for the base stats. In order to do
> this, the rstat list pointers were moved off of the cgroup and onto
> the css. In the transition, these pointer types were changed to
> cgroup_subsys_state. Finally, the API for updated/flush was changed to
> accept a reference to a css instead of a cgroup. This allows a
> specific subsystem to be associated with a given update or flush. The
> result is that rstat trees will now be made up of css nodes, and a
> given tree will only contain nodes associated with a specific
> subsystem.
>
> Since separate trees will now be in use, the locking scheme was
> adjusted. The global locks were split up in such a way that there are
> separate locks for the base stats (cgroup::self) and for each
> subsystem (memory, io, etc). This allows different subsystems (and
> base stats) to use rstat in parallel with no contention.
>
> Breaking up the unified tree into separate trees eliminates the
> overhead and scalability issue explained in the first section, but
> comes at the expense of additional memory. In an effort to minimize
> this overhead, new rstat structs are introduced and a conditional
> allocation is performed. The cgroup_rstat_cpu struct, which originally
> contained the rstat list pointers and the base stat entities, was
> renamed cgroup_rstat_base_cpu. It is only allocated when the
> associated css is cgroup::self. For non-self css's, a new compact
> struct was added that only contains the rstat list pointers. During
> initialization, when the given css is associated with an actual
> subsystem (not cgroup::self), this compact struct is allocated. With
> this conditional allocation, the change in per-cpu memory overhead
> before/after is shown below.
>
> memory overhead before:
> sizeof(struct cgroup_rstat_cpu) =~ 144 bytes /* can vary based on config */
>
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 144 bytes
>
> memory overhead after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 144 bytes
>
> nr_cgroups * (
>     sizeof(struct cgroup_rstat_base_cpu) +
>     sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> )
>
> nr_cgroups * (144 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
>
> nr_cgroups * (144 + 16 * 2)
> nr_cgroups * 176 bytes
>
> This leaves us with an increase in memory overhead of:
> 32 bytes per cgroup per cpu
>
> Validation was performed by reading some *.stat files of a target
> parent cgroup while the system was under different workloads. A test
> program was made to loop 1M times, reading the files cgroup.stat,
> cpu.stat, io.stat, and memory.stat of the parent cgroup each
> iteration.
> Using a non-patched kernel as control and this series as experimental,
> the findings show perf gains when reading stats with this series.
>
> The first experiment consisted of a parent cgroup with
> memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child
> cgroups were created, and within each child cgroup a process was
> spawned to frequently update the memory cgroup stats by creating and
> then reading a file of size 1T (encouraging reclaim). The test program
> was run alongside these 26 tasks in parallel. The results showed time
> and perf gains for the reader test program.
>
> test program elapsed time
> control:
> real    1m13.663s
> user    0m0.948s
> sys     1m12.356s
>
> experiment:
> real    0m42.498s
> user    0m0.764s
> sys     0m41.546s
>
> test program perf
> control:
> 31.75% mem_cgroup_css_rstat_flush
>  5.49% __blkcg_rstat_flush
>  0.10% cpu_stat_show
>  0.05% cgroup_base_stat_cputime_show
>
> experiment:
> 8.60% mem_cgroup_css_rstat_flush
> 0.15% blkcg_print_stat
> 0.12% cgroup_base_stat_cputime_show
>
> It's worth noting that memcg uses heuristics to optimize flushing.
> Depending on the state of updated stats at a given time, a memcg flush
> may be considered unnecessary and skipped as a result. This
> opportunity to skip a flush is bypassed when memcg is flushed as a
> consequence of sharing the tree with another controller.
>
> A second experiment was set up on the same host using a parent cgroup
> with two child cgroups. The same swap and memory max settings were
> used as in the previous experiment. In the two child cgroups, kernel
> builds were done in parallel, each using "-j 20". The perf comparison
> is shown below.
>
> test program elapsed time
> control:
> real    1m22.809s
> user    0m1.142s
> sys     1m21.138s
>
> experiment:
> real    0m42.504s
> user    0m1.000s
> sys     0m41.220s
>
> test program perf
> control:
> 37.16% mem_cgroup_css_rstat_flush
>  3.68% __blkcg_rstat_flush
>  0.09% cpu_stat_show
>  0.06% cgroup_base_stat_cputime_show
>
> experiment:
> 2.02% mem_cgroup_css_rstat_flush
> 0.20% blkcg_print_stat
> 0.14% cpu_stat_show
> 0.08% cgroup_base_stat_cputime_show
>
> The final experiment differs from the previous two in that it measures
> performance from the stat updater's perspective. A kernel build was
> run in a child node with -j 20 on the same host and cgroup setup. A
> baseline was established by having the build run while no stats were
> read. The builds were then repeated while stats were constantly being
> read. In all cases, perf appeared similar in cycles spent on
> cgroup_rstat_updated() (insignificant compared to the other recorded
> events). As for the elapsed build times, the results of the different
> scenarios are shown below, indicating no significant drawbacks of the
> split-tree approach.
>
> control with no readers
> real    5m12.307s
> user    84m52.037s
> sys     3m54.000s
>
> control with constant readers of {memory,io,cpu,cgroup}.stat
> real    5m13.209s
> user    84m47.949s
> sys     4m9.260s
>
> experiment with no readers
> real    5m11.961s
> user    84m41.750s
> sys     3m54.058s
>
> experiment with constant readers of {memory,io,cpu,cgroup}.stat
> real    5m12.626s
> user    85m0.323s
> sys     3m56.167s
>
> changelog
> v3:
> new bpf kfunc api for updated/flush
> rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
> check for ss->css_rstat_flush existence where applicable
> rename locks for base stats
> move subsystem locks to cgroup_subsys struct
> change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
> change lock helpers to accept css and perform lock selection within
> fix comments that had outdated lock names
> add open css_is_cgroup() helper
> rename rstatc to rstatbc to reflect base stats in use
> rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
> add comments in early init code to explain deferred allocation
> misc formatting fixes
>
> v2:
> drop the patch creating a new cgroup_rstat struct and related code
> drop bpf-specific patches. instead just use cgroup::self in bpf progs
> drop the cpu lock patches. instead select cpu lock in updated_list func
> relocate the cgroup_rstat_init() call to inside css_create()
> relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
>   to css_free_rwork_fn()
>
> v1:
> https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/
>
> JP Kobryn (4):
>   cgroup: separate rstat api for bpf programs
>   cgroup: use separate rstat trees for each subsystem
>   cgroup: use subsystem-specific rstat locks to avoid contention
>   cgroup: split up cgroup_rstat_cpu into base stat and non base stat
>     versions
>
>  block/blk-cgroup.c                            |   6 +-
>  include/linux/cgroup-defs.h                   |  80 ++--
>  include/linux/cgroup.h                        |  16 +-
>  include/trace/events/cgroup.h                 |  10 +-
>  kernel/cgroup/cgroup-internal.h               |   6 +-
>  kernel/cgroup/cgroup.c                        |  69 +--
>  kernel/cgroup/rstat.c                         | 412 +++++++++++-------
>  mm/memcontrol.c                               |   4 +-
>  .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
>  .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
>  10 files changed, 363 insertions(+), 253 deletions(-)