Date: Fri, 28 Feb 2025 18:22:57 +0000
From: Yosry Ahmed
To: inwardvessel
Cc: tj@kernel.org, shakeel.butt@linux.dev, mhocko@kernel.org,
	hannes@cmpxchg.org, akpm@linux-foundation.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/4 v2] cgroup: separate rstat trees
References: <20250227215543.49928-1-inwardvessel@gmail.com>
In-Reply-To: <20250227215543.49928-1-inwardvessel@gmail.com>

On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel wrote:
> From: JP Kobryn
>
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed. Over time, the stat-keeping of some subsystems has grown to
> the point where it noticeably slows down the others. This has been most
> observable when the memory controller is enabled. One big area where
> the issue comes up is system telemetry, where programs periodically
> sample cpu stats. Programs like this would benefit if the overhead of
> flushing memory stats (and others) could be eliminated. It would save
> cpu cycles for existing cpu-based telemetry programs and improve
> scalability in terms of sampling frequency and number of hosts.
>
> This series changes the approach from "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> single unified rstat tree of cgroups to separate trees made up of
> cgroup_subsys_state's. There will be one (per-cpu) tree for the base
> stats (cgroup::self) and one for each enabled subsystem (if it
> implements css_rstat_flush()). To do this, the rstat list pointers were
> moved off of the cgroup and onto the css, and their types were changed
> to cgroup_subsys_state. This allows rstat trees to be made up of css
> nodes, where a given tree only contains css nodes associated with a
> specific subsystem. The rstat APIs were changed to accept a
> cgroup_subsys_state instead of a cgroup, so callers can be specific
> about which stats are being updated/flushed. Since separate trees are
> in use, the locking scheme was adjusted as well: the global locks were
> split so that there are separate locks for the base stats
> (cgroup::self) and for each subsystem (memory, io, etc.). This allows
> different subsystems (including the base stats) to use rstat in
> parallel with no contention.
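
Just to restate the interface change in code form as I read it; the
prototypes and the example helper below are my own sketch of the
description, not necessarily the exact names or signatures the patches
use:

	/*
	 * rstat entry points take a css rather than a cgroup, so a caller
	 * updating or flushing e.g. memory stats only walks the memory
	 * subsystem's per-cpu tree and never touches the io tree.
	 */
	void css_rstat_updated(struct cgroup_subsys_state *css, int cpu);
	void css_rstat_flush(struct cgroup_subsys_state *css);

	/* e.g. from the memory controller side, something like: */
	static void flush_memcg_stats(struct mem_cgroup *memcg)
	{
		css_rstat_flush(&memcg->css);
	}

Combined with the per-subsystem locks, a cpu.stat reader no longer
serializes behind a memory flush, which is where the telemetry win
described above comes from.
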
> Breaking up the unified tree into separate trees eliminates the
> overhead and scalability issue explained in the first section, but
> comes at the expense of additional memory. In an effort to minimize
> this overhead, a conditional allocation is performed. The
> cgroup_rstat_cpu struct originally contained the rstat list pointers
> and the base stat entities. This struct was renamed to
> cgroup_rstat_base_cpu and is only allocated when the associated css is
> cgroup::self. A new compact struct was added that contains only the
> rstat list pointers; it is allocated when the css is associated with an
> actual subsystem. With this conditional allocation, the change in
> per-cpu memory overhead is shown below.
>
> before:
> sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
>
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 176 bytes
>
> after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes
>
> nr_cgroups * (
>     sizeof(struct cgroup_rstat_base_cpu) +
>     sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> )
>
> nr_cgroups * (176 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
>
> nr_cgroups * (176 + 16 * 2)
> nr_cgroups * 208 bytes
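
And for the conditional allocation, a minimal sketch of what I picture,
assuming 64-bit pointers; css_is_self() and the rstat_cpu field name are
placeholders of mine, and the base struct layout is abbreviated:

	/* compact node: two pointers, the 16 bytes quoted above */
	struct cgroup_rstat_cpu {
		struct cgroup_subsys_state *updated_children;
		struct cgroup_subsys_state *updated_next;
	};

	/* full node, only for cgroup::self: linkage plus base stat state */
	struct cgroup_rstat_base_cpu {
		struct cgroup_rstat_cpu rstat;
		struct cgroup_base_stat bstat;
		struct cgroup_base_stat last_bstat;
		/* ... remaining base stat fields, ~176 bytes total */
	};

	/* only cgroup::self pays for the base stat fields */
	static int css_rstat_init(struct cgroup_subsys_state *css)
	{
		size_t size = css_is_self(css) ?
				sizeof(struct cgroup_rstat_base_cpu) :
				sizeof(struct cgroup_rstat_cpu);

		css->rstat_cpu = __alloc_percpu(size,
				__alignof__(struct cgroup_rstat_base_cpu));
		return css->rstat_cpu ? 0 : -ENOMEM;
	}
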
> With regard to validation, there is a measurable benefit when reading
> stats with this series. A test program was written that loops 1M times,
> reading all four of the files cgroup.stat, cpu.stat, io.stat, and
> memory.stat of a given parent cgroup on each iteration. This test
> program was used in the experiments that follow.
>
> The first experiment consisted of a parent cgroup with
> memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child
> cgroups were created, and within each child cgroup a process was
> spawned to frequently update the memory cgroup stats by creating and
> then reading a file of size 1T (encouraging reclaim). The test program
> was run alongside these 26 tasks in parallel. The results showed a
> benefit in both the elapsed time and the perf profile of the test
> program.
>
> time before:
> real	0m44.612s
> user	0m0.567s
> sys	0m43.887s
>
> perf before:
> 27.02% mem_cgroup_css_rstat_flush
>  6.35% __blkcg_rstat_flush
>  0.06% cgroup_base_stat_cputime_show
>
> time after:
> real	0m27.125s
> user	0m0.544s
> sys	0m26.491s
>
> perf after:
>  6.03% mem_cgroup_css_rstat_flush
>  0.37% blkcg_print_stat
>  0.11% cgroup_base_stat_cputime_show
>
> Another experiment was set up on the same host using a parent cgroup
> with two child cgroups. The same swap and memory limits were used as in
> the previous experiment. In the two child cgroups, kernel builds were
> run in parallel, each using "-j 20". The perf profile of the test
> program was very similar to the previous experiment; the time
> comparison is shown below.
>
> before:
> real	1m2.077s
> user	0m0.784s
> sys	1m0.895s
>
> after:
> real	0m32.216s
> user	0m0.709s
> sys	0m31.256s

Great results, and I am glad that the series went down from 11 patches
to 4 once we simplified the BPF handling. The added memory overhead
doesn't seem concerning (~320KB on a system with 100 cgroups and 100
CPUs). Nice work.
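
(For reference, that estimate is just the delta from the per-cpu numbers
above: with memory and io enabled, each cgroup gains two 16-byte compact
structs per CPU, so 100 cgroups * 100 CPUs * 32 bytes = ~320KB.)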