Date: Fri, 28 Feb 2025 18:22:57 +0000
From: Yosry Ahmed
To: inwardvessel
Cc: tj@kernel.org, shakeel.butt@linux.dev, mhocko@kernel.org,
	hannes@cmpxchg.org, akpm@linux-foundation.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/4 v2] cgroup: separate rstat trees
References: <20250227215543.49928-1-inwardvessel@gmail.com>
In-Reply-To: <20250227215543.49928-1-inwardvessel@gmail.com>

On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel wrote:
> From: JP Kobryn
>
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed. Over time, the stat-keeping of some subsystems has grown to
> the point where it noticeably slows down the others. This has been most
> observable when the memory controller is enabled. One big area where
> the issue comes up is system telemetry, where programs periodically
> sample cpu stats. Programs like this would benefit if the overhead of
> flushing memory stats (and others) could be eliminated. It would save
> cpu cycles for existing cpu-based telemetry programs and improve
> scalability in terms of sampling frequency and number of hosts.
>
> This series changes the approach from "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> single unified rstat tree of cgroups to separate trees made up of
> cgroup_subsys_state's. There will be one (per-cpu) tree for the base
> stats (cgroup::self) and one for each enabled subsystem (if it
> implements css_rstat_flush()). To do this, the rstat list pointers were
> moved off of the cgroup and onto the css, and their types were changed
> to cgroup_subsys_state. This allows rstat trees to be made up of css
> nodes, where a given tree only contains css nodes associated with a
> specific subsystem. The rstat APIs were changed to accept a
> cgroup_subsys_state instead of a cgroup, so callers can be specific
> about which stats are being updated/flushed. Since separate trees are
> in use, the locking scheme was adjusted as well: the global locks were
> split so that there are separate locks for the base stats
> (cgroup::self) and for each subsystem (memory, io, etc.). This allows
> different subsystems (including the base stats) to use rstat in
> parallel with no contention.
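
Just to restate the interface change in code form as I read it; the
prototypes and the example helper below are my own sketch of the
description, not necessarily the exact names or signatures the patches
use:

	/*
	 * rstat entry points take a css rather than a cgroup, so a caller
	 * updating or flushing e.g. memory stats only walks the memory
	 * subsystem's per-cpu tree and never touches the io tree.
	 */
	void css_rstat_updated(struct cgroup_subsys_state *css, int cpu);
	void css_rstat_flush(struct cgroup_subsys_state *css);

	/* e.g. from the memory controller side, something like: */
	static void flush_memcg_stats(struct mem_cgroup *memcg)
	{
		css_rstat_flush(&memcg->css);
	}

Combined with the per-subsystem locks, a cpu.stat reader no longer
serializes behind a memory flush, which is where the telemetry win
described above comes from.
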
> Breaking up the unified tree into separate trees eliminates the
> overhead and scalability issue explained in the first section, but
> comes at the expense of additional memory. In an effort to minimize
> this overhead, a conditional allocation is performed. The
> cgroup_rstat_cpu struct originally contained the rstat list pointers
> and the base stat entities. This struct was renamed to
> cgroup_rstat_base_cpu and is only allocated when the associated css is
> cgroup::self. A new compact struct was added that contains only the
> rstat list pointers; it is allocated when the css is associated with an
> actual subsystem. With this conditional allocation, the change in
> per-cpu memory overhead is shown below.
>
> before:
> sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
>
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 176 bytes
>
> after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes
>
> nr_cgroups * (
>     sizeof(struct cgroup_rstat_base_cpu) +
>     sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> )
>
> nr_cgroups * (176 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
>
> nr_cgroups * (176 + 16 * 2)
> nr_cgroups * 208 bytes
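
And for the conditional allocation, a minimal sketch of what I picture,
assuming 64-bit pointers; css_is_self() and the rstat_cpu field name are
placeholders of mine, and the base struct layout is abbreviated:

	/* compact node: two pointers, the 16 bytes quoted above */
	struct cgroup_rstat_cpu {
		struct cgroup_subsys_state *updated_children;
		struct cgroup_subsys_state *updated_next;
	};

	/* full node, only for cgroup::self: linkage plus base stat state */
	struct cgroup_rstat_base_cpu {
		struct cgroup_rstat_cpu rstat;
		struct cgroup_base_stat bstat;
		struct cgroup_base_stat last_bstat;
		/* ... remaining base stat fields, ~176 bytes total */
	};

	/* only cgroup::self pays for the base stat fields */
	static int css_rstat_init(struct cgroup_subsys_state *css)
	{
		size_t size = css_is_self(css) ?
				sizeof(struct cgroup_rstat_base_cpu) :
				sizeof(struct cgroup_rstat_cpu);

		css->rstat_cpu = __alloc_percpu(size,
				__alignof__(struct cgroup_rstat_base_cpu));
		return css->rstat_cpu ? 0 : -ENOMEM;
	}
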
> With regard to validation, there is a measurable benefit when reading
> stats with this series. A test program was written that loops 1M times,
> reading all four of the files cgroup.stat, cpu.stat, io.stat, and
> memory.stat of a given parent cgroup on each iteration. This test
> program was used in the experiments that follow.
>
> The first experiment consisted of a parent cgroup with
> memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child
> cgroups were created, and within each child cgroup a process was
> spawned to frequently update the memory cgroup stats by creating and
> then reading a file of size 1T (encouraging reclaim). The test program
> was run alongside these 26 tasks in parallel. The results showed a
> benefit in both the elapsed time and the perf profile of the test
> program.
>
> time before:
> real	0m44.612s
> user	0m0.567s
> sys	0m43.887s
>
> perf before:
> 27.02% mem_cgroup_css_rstat_flush
>  6.35% __blkcg_rstat_flush
>  0.06% cgroup_base_stat_cputime_show
>
> time after:
> real	0m27.125s
> user	0m0.544s
> sys	0m26.491s
>
> perf after:
>  6.03% mem_cgroup_css_rstat_flush
>  0.37% blkcg_print_stat
>  0.11% cgroup_base_stat_cputime_show
>
> Another experiment was set up on the same host using a parent cgroup
> with two child cgroups. The same swap and memory limits were used as in
> the previous experiment. In the two child cgroups, kernel builds were
> run in parallel, each using "-j 20". The perf profile of the test
> program was very similar to the previous experiment; the time
> comparison is shown below.
>
> before:
> real	1m2.077s
> user	0m0.784s
> sys	1m0.895s
>
> after:
> real	0m32.216s
> user	0m0.709s
> sys	0m31.256s

Great results, and I am glad that the series went down from 11 patches
to 4 once we simplified the BPF handling. The added memory overhead
doesn't seem concerning (~320KB on a system with 100 cgroups and 100
CPUs). Nice work.
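
(For reference, that estimate is just the delta from the per-cpu numbers
above: with memory and io enabled, each cgroup gains two 16-byte compact
structs per CPU, so 100 cgroups * 100 CPUs * 32 bytes = ~320KB.)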