From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B768BC282C6 for ; Fri, 28 Feb 2025 19:04:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2CDD6280002; Fri, 28 Feb 2025 14:04:47 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 253C5280001; Fri, 28 Feb 2025 14:04:47 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0F5ED280002; Fri, 28 Feb 2025 14:04:47 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id E024C280001 for ; Fri, 28 Feb 2025 14:04:46 -0500 (EST) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 665BF1C7B80 for ; Fri, 28 Feb 2025 19:04:46 +0000 (UTC) X-FDA: 83170280172.22.DE99F8D Received: from out-177.mta0.migadu.com (out-177.mta0.migadu.com [91.218.175.177]) by imf03.hostedemail.com (Postfix) with ESMTP id 9938E20016 for ; Fri, 28 Feb 2025 19:04:44 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=UMLnaGOu; spf=pass (imf03.hostedemail.com: domain of yosry.ahmed@linux.dev designates 91.218.175.177 as permitted sender) smtp.mailfrom=yosry.ahmed@linux.dev; dmarc=pass (policy=none) header.from=linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1740769484; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=EUoDXTNpMhUllcwII7ASYWMxp1W5rvEw1o/euNXhFvk=; b=SOX5YZDuNBAecJFccey+VkxlSJSAVTPTJt6B72O6ozjdij1YPqMH6hzi3iDTVoW3ALxSLp /RLhRbPE0hAeXzKjhinPU1N1bGKryYBxtCLsuEx7puTqqBuHBRkeST2/eebtc4JK8tVyM1 yr52XuqyFtIXWIRYZfaohO4c53ZigqE= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1740769484; a=rsa-sha256; cv=none; b=6B5Ku0TWQOPRP/ls4AaSEW9ETE474BztiHmcru9NoHJZUV2pu+vTWnG8bLDU0JyFx2dApf HRBWDdFgGSJuDDRYLT+AGOlM0v7Gdge5Wfzz3L1zwNO1XiLQp2UR4YjT1UPfSzQVcC8xnj HNhnXNU3c5DMweAt0WWpnQ9TQ2RvW2o= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=UMLnaGOu; spf=pass (imf03.hostedemail.com: domain of yosry.ahmed@linux.dev designates 91.218.175.177 as permitted sender) smtp.mailfrom=yosry.ahmed@linux.dev; dmarc=pass (policy=none) header.from=linux.dev Date: Fri, 28 Feb 2025 19:04:37 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1740769482; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=EUoDXTNpMhUllcwII7ASYWMxp1W5rvEw1o/euNXhFvk=; b=UMLnaGOufHW2t9TCLQ7Uujtem/w0EipDTVwUlF6w8zUz34cLvZLa3FyJ7oQf21Fy75YZEd gBEOUDBqj8votVKeejBpflSyHbUF8fw/sNQcDFsFb7xQ+FLe1Dr8uweLmfTVLxiEQN+/g1 BiP+swD13BBQEaenUpa8u5wStxxP2HU= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yosry Ahmed To: inwardvessel Cc: tj@kernel.org, shakeel.butt@linux.dev, mhocko@kernel.org, hannes@cmpxchg.org, akpm@linux-foundation.org, linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com Subject: Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state Message-ID: References: <20250227215543.49928-1-inwardvessel@gmail.com> <20250227215543.49928-2-inwardvessel@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250227215543.49928-2-inwardvessel@gmail.com> X-Migadu-Flow: FLOW_OUT X-Rspamd-Server: rspam02 X-Stat-Signature: shyk6wsjdgiatwbawkdjmkwcn466kyhk X-Rspamd-Queue-Id: 9938E20016 X-Rspam-User: X-HE-Tag: 1740769484-184573 X-HE-Meta: U2FsdGVkX1928uuaqsEF58zWfe+sSYL+G7GECYw5UQccuTOcqrYtFzHrX84Q2RHTa4jAsRiOWTw++nrJ7O+mzxU7ZDH75h4ldbNS7HHcMcuLIAcf25NZZPxnbswOrKOhxg0HjUZNEQNBKC5p8K9FZnJZoAXebGpHSxVdDD4QLNy6XKEYZku8GOEJXOeJ+9jR6LQ1BqtdX4Dcb+enoQrr+z2m3MJofTnVWzyWzupTmYcJJTXwNM+mDd8q0fMw4hox/oDkfdX7SbUCmIV+F0pdRIMrZPQyu9+Ead3j8k726DgJNgGtex/SyWDCBdXkwGVAhn+MnHfgLz0mVlLaTpRYduohhAXQJ5D5BmYpEXAiLF5dTzcRxobylgZUhQHkEmi3DFJG2bNkc3bsy+mEz8qto4Vx/kyuCHAP7bDJw3dBSWZlu2NMRpdBDsSH0rP+C1BDdRNJS6DL+W66uAzHLxdBohqr7i+46quH/uPvzKKaBtldenh3Rc5G9gVD2v6DHEzCUFDm6fh2rEOSnx58h/YE1DwGzfi48dhdg0vaDY7oMjAy7ie9OOtxWZAOYPMXGmkdGvaJMtgxXSGn8U9DbENZwli6ItG0grlATPz+dwETa74Cksf23eGlB4AHyQmGCiVZ1jkzzEpvICQCovLynqQaDqqEVts6zlSgFySJjb1ulrX6RKCKUFTnBLv/O/3oDs67R3TaFfODxF8/A3gq3PgsUTP0xg/urHpLpWdYEl7Y0ZEfhVKG+HyPY8kMCEMYTRXWtTKY7RKuA6vQKIMVLi5/7flvWtBKcSVsfOm66rJcLUhTYVz1tBdLCtMhbfKlPBmO/G71fqgyfcsKy6u129G7Q5O79QhrmSVEgpnpmGVTBDiRYstkm26JWbcz9RMDKDSuQSbUPxpkQjgeYghpFzKiULstyhV/8B7Ulz3LoARfqX8V+1JKo6rIiaMnJVFCYQ/u9VRwZDR85wUzMu1suQ5 OMmrTWBq 8PZC/ebxG+xMljLuLfF5cL52MTC7WC8hPtUSeveX0KxF5iCZ/9UV85zpkdJORCBgvpwsD9P9KN6jWHU/oH80je5jc2hJCh0N2qfHS3JFSFU3n8gDh2qPHSoEKsj0jHk+qV4sAZgXOlaOInQ4qyDMrqn8gsVKW0NNCgOczdnGOfcQNaEBW9n002g5mazTyVZQ4y6bZlKyoltl2hNymAVy5cSQaE/7kf3lg9KnkNO/Z/O1aShnFoIk1EKL3gyDMNuOs7tD59rtwy/8//YxlAmqq2jja0QB6yBF4JYBL X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel wrote: > From: JP Kobryn > > Each cgroup owns rstat pointers. This means that a tree of pending rstat > updates can contain changes from different subsystems. Because of this > arrangement, when one subsystem is flushed via the public api > cgroup_rstat_flushed(), all other subsystems with pending updates will > also be flushed. Remove the rstat pointers from the cgroup and instead > give them to each cgroup_subsys_state. Separate rstat trees will now > exist for each unique subsystem. This separation allows for subsystems > to make updates and flushes without the side effects of other > subsystems. i.e. flushing the cpu stats does not cause the memory stats > to be flushed and vice versa. The change in pointer ownership from > cgroup to cgroup_subsys_state allows for direct flushing of the css, so > the rcu list management entities and operations previously tied to the > cgroup which were used for managing a list of subsystem states with > pending flushes are removed. In terms of client code, public api calls > were changed to now accept a reference to the cgroup_subsys_state so > that when flushing or updating, a specific subsystem is associated with > the call. I think the subject is misleading. It makes it seem like this is a refactoring patch that is only moving a member from one struct to another, but this is actually the core of the series. Maybe something lik "cgroup: use separate rstat trees for diffrent subsystems"? Also, breaking down the commit message into paragraphs helps with readability. [..] > @@ -386,8 +394,8 @@ struct cgroup_rstat_cpu { > * > * Protected by per-cpu cgroup_rstat_cpu_lock. > */ > - struct cgroup *updated_children; /* terminated by self cgroup */ > - struct cgroup *updated_next; /* NULL iff not on the list */ > + struct cgroup_subsys_state *updated_children; /* terminated by self */ > + struct cgroup_subsys_state *updated_next; /* NULL if not on list */ nit: comment indentation needs fixing here > }; > > struct cgroup_freezer_state { [..] > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index afc665b7b1fe..31b3bfebf7ba 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -164,7 +164,9 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = { > static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu); Should we rename cgrp_dfl_root_rstat_cpu to indicate that it's specific to self css? > > /* the default hierarchy */ > -struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu }; > +struct cgroup_root cgrp_dfl_root = { > + .cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu > +}; > EXPORT_SYMBOL_GPL(cgrp_dfl_root); > > /* [..] > @@ -5407,7 +5401,11 @@ static void css_free_rwork_fn(struct work_struct *work) > struct cgroup_subsys_state *parent = css->parent; > int id = css->id; > > + if (css->ss->css_rstat_flush) > + cgroup_rstat_exit(css); > + > ss->css_free(css); > + nit: extra blank line here > cgroup_idr_remove(&ss->css_idr, id); > cgroup_put(cgrp); > > @@ -5431,7 +5429,7 @@ static void css_free_rwork_fn(struct work_struct *work) > cgroup_put(cgroup_parent(cgrp)); > kernfs_put(cgrp->kn); > psi_cgroup_free(cgrp); > - cgroup_rstat_exit(cgrp); > + cgroup_rstat_exit(&cgrp->self); > kfree(cgrp); > } else { > /* > @@ -5459,11 +5457,7 @@ static void css_release_work_fn(struct work_struct *work) > if (ss) { > struct cgroup *parent_cgrp; > > - /* css release path */ > - if (!list_empty(&css->rstat_css_node)) { > - cgroup_rstat_flush(cgrp); > - list_del_rcu(&css->rstat_css_node); > - } > + cgroup_rstat_flush(css); Here we used to call cgroup_rstat_flush() only if there was a css_rstat_flush() callback registered, now we call it unconditionally. Could this cause a NULL dereference when we try to call css->ss->css_rstat_flush() for controllers that did not register a callback? > > cgroup_idr_replace(&ss->css_idr, NULL, css->id); > if (ss->css_released) [..] > @@ -6188,6 +6186,9 @@ int __init cgroup_init(void) > css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, > GFP_KERNEL); > BUG_ON(css->id < 0); > + > + if (css->ss && css->ss->css_rstat_flush) > + BUG_ON(cgroup_rstat_init(css)); Why do we need this call here? We already call cgroup_rstat_init() in cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will have already called cgroup_init_subsys() in cgroup_init_early(). Did I miss something? > } else { > cgroup_init_subsys(ss, false); > } [..] > @@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) > } > > /* see cgroup_rstat_flush() */ > -static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > +static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css) > __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock) > { > + struct cgroup *cgrp = css->cgroup; > int cpu; > > lockdep_assert_held(&cgroup_rstat_lock); > > for_each_possible_cpu(cpu) { > - struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu); > + struct cgroup_subsys_state *pos; > > + pos = cgroup_rstat_updated_list(css, cpu); > for (; pos; pos = pos->rstat_flush_next) { > - struct cgroup_subsys_state *css; > + if (!pos->ss) > + cgroup_base_stat_flush(pos->cgroup, cpu); > + else > + pos->ss->css_rstat_flush(pos, cpu); > > - cgroup_base_stat_flush(pos, cpu); > - bpf_rstat_flush(pos, cgroup_parent(pos), cpu); > - > - rcu_read_lock(); > - list_for_each_entry_rcu(css, &pos->rstat_css_list, > - rstat_css_node) > - css->ss->css_rstat_flush(css, cpu); > - rcu_read_unlock(); > + bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu); We should call bpf_rstat_flush() only if (!pos->ss) as well, right? Otherwise we will call BPF rstat flush whenever any subsystem is flushed. I guess it's because BPF can now pass any subsystem to cgroup_rstat_flush(), and we don't keep track. I think it would be better if we do not allow BPF programs to select a css and always make them flush the self css. We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes in a cgroup and passes cgroup->self internally to cgroup_rstat_flush(). But if the plan is to remove the bpf_rstat_flush() call here soon then it's probably not worth the hassle. Shakeel (and others), WDYT?