From: Yosry Ahmed
Date: Mon, 6 Jan 2025 15:13:07 -0800
Subject: Re: [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees
To: JP Kobryn
Cc: shakeel.butt@linux.dev, tj@kernel.org, mhocko@kernel.org, hannes@cmpxchg.org, akpm@linux-foundation.org, linux-mm@kvack.org, cgroups@vger.kernel.org, Michal Koutný
In-Reply-To: <20250103015020.78547-1-inwardvessel@gmail.com>
References: <20250103015020.78547-1-inwardvessel@gmail.com>

On Thu, Jan 2, 2025 at 5:50 PM JP Kobryn wrote:
>
> The current rstat model is set up to keep track of cgroup stats on a per-cpu
> basis. When a stat (of any subsystem) is updated, the updater notes this change
> using the cgroup_rstat_updated() API call. This change is propagated to the
> cpu-specific rstat tree, by appending the updated cgroup to the tree (unless
> it's already on the tree).
> So for each cpu, an rstat tree will consist of the cgroups that reported one
> or more updated stats. Later on when a flush is requested via
> cgroup_rstat_flush(), each per-cpu rstat tree is traversed starting at the
> requested cgroup and the subsystem-specific flush callbacks (via
> css_rstat_flush) are invoked along the way. During the flush, the section of
> the tree starting at the requested cgroup through its descendants is removed.
>
> Using the cgroup struct to represent nodes of change means that the changes
> represented by a given tree are heterogeneous - the tree can consist of nodes
> that have changes from different subsystems; i.e. changes in stats from the
> memory subsystem and the io subsystem can coexist in the same tree. The
> implication is that when a flush is requested, usually in the context of a
> single subsystem, all other subsystems need to be flushed along with it. This
> seems to have become a drawback due to how expensive the flushing of the
> memory-specific stats has become [0][1]. Another implication is that when
> updates are performed, subsystems may contend with each other over the locks
> involved.
>
> I've been experimenting with an idea that allows for isolating the updating and
> flushing of cgroup stats on a per-subsystem basis. The idea was instead of
> having a per-cpu rstat tree for managing stats across all subsystems, we could
> split up the per-cpu trees into separate trees for each subsystem. So each cpu
> would have separate trees for each subsystem. It would allow subsystems to
> update and flush their stats without any contention or extra overhead from
> other subsystems. The core change is moving ownership of the rstat entities
> from the cgroup struct onto the cgroup_subsystem_state struct.
>
> To complement the ownership change, the locking scheme was adjusted.
> The global cgroup_rstat_lock for synchronizing updates and flushes was replaced
> with subsystem-specific locks (in the cgroup_subsystem struct). An additional
> global lock was added to allow the base stats pseudo-subsystem to be
> synchronized in a similar way. The per-cpu locks called cgroup_rstat_cpu_lock
> have changed to a per-cpu array of locks which is indexed by subsystem id.
> Following suit, there is also a per-cpu array of locks dedicated to the base
> subsystem. The dedicated locks for the base stats were added since the base
> stats have a NULL subsystem, so they did not fit the subsystem id index
> approach.
>
> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

I remember discussing this with Shakeel and Michal Koutný at LPC two years
ago. I suggested it multiple times over the last few years, most recently in:
https://lore.kernel.org/lkml/CAJD7tkbpFu8z1HaUgkaE6bup_fsD39QLPmgNyOnaTrm+hZ_9hA@mail.gmail.com/

I think it conceptually makes sense, and I took a stab at it when I was
working on fixing the hard lockups due to atomic flushing, but the system I
was working on was using cgroup v1, so different subsystems had different
hierarchies (and hence different trees) anyway, so it wouldn't have helped.

This is especially true for the MM subsystem, which apparently flushes most
often and has the most expensive flushes, so other subsystems are probably
being unnecessarily taxed.

>
> [0] https://lore.kernel.org/all/CAOm-9arwY3VLUx5189JAR9J7B=Miad9nQjjet_VNdT3i+J+5FA@mail.gmail.com/
> [1] https://github.blog/engineering/debugging-network-stalls-on-kubernetes/
>
> Changelog
> v2: updated cover letter and some patch text. no code changes.
>
> JP Kobryn (8):
>   change cgroup to css in rstat updated and flush api
>   change cgroup to css in rstat internal flush and lock funcs
>   change cgroup to css in rstat init and exit api
>   split rstat from cgroup into separate css
>   separate locking between base css and others
>   isolate base stat flush
>   remove unneeded rcu list
>   remove bpf rstat flush from css generic flush
>
>  block/blk-cgroup.c              |   4 +-
>  include/linux/cgroup-defs.h     |  35 ++---
>  include/linux/cgroup.h          |   8 +-
>  kernel/cgroup/cgroup-internal.h |   4 +-
>  kernel/cgroup/cgroup.c          |  79 ++++++-----
>  kernel/cgroup/rstat.c           | 225 +++++++++++++++++++-------------
>  mm/memcontrol.c                 |   4 +-
>  7 files changed, 203 insertions(+), 156 deletions(-)
>
> --
> 2.47.1
>
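
For illustration only, the per-subsystem split described in the cover letter
can be modelled in a few lines of userspace C. This is a rough sketch, not the
kernel code from the series: it assumes a flat "updated" set in place of the
real rstat tree, ignores the cgroup hierarchy, and omits all locking. Every
name in it (NR_CPUS_SKETCH, struct updated_set, rstat_updated_sketch(),
rstat_flush_sketch(), rstat_pending_sketch()) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All sizes and names below are hypothetical; none come from the kernel. */
#define NR_CPUS_SKETCH   2
#define NR_SUBSYS_SKETCH 2
#define MAX_PENDING      8

enum subsys_id { SUBSYS_MEMORY, SUBSYS_IO };

/*
 * Stand-in for an rstat tree: the set of cgroups with pending updates.
 * The real structure is a tree rooted at the flushed cgroup; a flat
 * array is used here only to keep the sketch short.
 */
struct updated_set {
	int cgroup_ids[MAX_PENDING];
	size_t count;
};

/* One set per (cpu, subsystem) pair instead of one set per cpu. */
static struct updated_set updated[NR_CPUS_SKETCH][NR_SUBSYS_SKETCH];

/*
 * Note that cgroup_id has pending stats for subsystem ss on this cpu.
 * A cgroup already in the set is not added twice. Returns false only
 * if the set is full.
 */
static bool rstat_updated_sketch(int cpu, enum subsys_id ss, int cgroup_id)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	struct updated_set *set = &updated[cpu][ss];

	for (size_t i = 0; i < set->count; i++)
		if (set->cgroup_ids[i] == cgroup_id)
			return true;	/* already queued */

	if (set->count == MAX_PENDING)
		return false;

	set->cgroup_ids[set->count++] = cgroup_id;
	return true;
}

/* Count pending entries for one subsystem across all cpus. */
static size_t rstat_pending_sketch(enum subsys_id ss)
{
	size_t pending = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		pending += updated[cpu][ss].count;
	return pending;
}

/*
 * Flush one subsystem: walk only that subsystem's per-cpu sets and
 * empty them. Other subsystems' sets are left untouched. A real
 * implementation would take a subsystem-specific lock here and call
 * the subsystem's flush callback for each entry.
 */
static size_t rstat_flush_sketch(enum subsys_id ss)
{
	size_t flushed = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		flushed += updated[cpu][ss].count;
		updated[cpu][ss].count = 0;
	}
	return flushed;
}
```

In this model, flushing SUBSYS_MEMORY leaves any pending SUBSYS_IO entries in
place, which is the isolation the series is after; with a single set per cpu,
the same flush would have had to process both.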