Subject: Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
From: JP Kobryn <inwardvessel@gmail.com>
To: Yosry Ahmed
Cc: tj@kernel.org, shakeel.butt@linux.dev, mhocko@kernel.org, hannes@cmpxchg.org,
 akpm@linux-foundation.org, linux-mm@kvack.org, cgroups@vger.kernel.org,
 kernel-team@meta.com
Date: Fri, 28 Feb 2025 17:06:23 -0800
References: <20250227215543.49928-1-inwardvessel@gmail.com>
 <20250227215543.49928-2-inwardvessel@gmail.com>
On 2/28/25 11:04 AM, Yosry Ahmed wrote:
> On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel wrote:
>> From: JP Kobryn
>>
>> Each cgroup owns rstat pointers. This means that a tree of pending rstat
>> updates can contain changes from different subsystems. Because of this
>> arrangement, when one subsystem is flushed via the public api
>> cgroup_rstat_flushed(), all other subsystems with pending updates will
>> also be flushed. Remove the rstat pointers from the cgroup and instead
>> give them to each cgroup_subsys_state. Separate rstat trees will now
>> exist for each unique subsystem. This separation allows for subsystems
>> to make updates and flushes without the side effects of other
>> subsystems. i.e. flushing the cpu stats does not cause the memory stats
>> to be flushed and vice versa. The change in pointer ownership from
>> cgroup to cgroup_subsys_state allows for direct flushing of the css, so
>> the rcu list management entities and operations previously tied to the
>> cgroup which were used for managing a list of subsystem states with
>> pending flushes are removed. In terms of client code, public api calls
>> were changed to now accept a reference to the cgroup_subsys_state so
>> that when flushing or updating, a specific subsystem is associated with
>> the call.
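To make the separation concrete: the intent of the ownership change can be
modeled with a stripped-down userspace sketch like the one below. Every name
here is illustrative only (the real kernel code uses per-cpu updated lists,
locking, and parent propagation); the point is just that each subsystem owns
its own pending-update tree, so flushing one subsystem no longer touches
another's pending stats.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a cgroup_subsys_state with its own rstat bookkeeping. */
struct css_rstat {
	struct css_rstat *updated_next;	/* NULL iff not on the updated list */
	int pending;			/* stand-in for real stat deltas */
};

/* One independent updated list per subsystem (index = toy subsys id). */
static struct css_rstat *updated_head[2];

static void css_rstat_updated(struct css_rstat *css, int ssid)
{
	if (!css->updated_next) {
		/* Queue once; the list is self-terminated like the real one. */
		css->updated_next = updated_head[ssid] ?
			updated_head[ssid] : css;
		updated_head[ssid] = css;
	}
	css->pending++;
}

/* Flush only the given subsystem's tree; others stay pending. */
static int css_rstat_flush(int ssid)
{
	struct css_rstat *pos = updated_head[ssid];
	int flushed = 0;

	while (pos) {
		struct css_rstat *next =
			(pos->updated_next == pos) ? NULL : pos->updated_next;

		flushed += pos->pending;
		pos->pending = 0;
		pos->updated_next = NULL;
		pos = next;
	}
	updated_head[ssid] = NULL;
	return flushed;
}
```

With this model, `css_rstat_flush(0)` drains only subsystem 0's tree while a
css queued under subsystem 1 keeps its pending delta, which is the behavior
the commit message describes.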
>
> I think the subject is misleading. It makes it seem like this is a
> refactoring patch that is only moving a member from one struct to
> another, but this is actually the core of the series.
>
> Maybe something like "cgroup: use separate rstat trees for different
> subsystems"?
>
> Also, breaking down the commit message into paragraphs helps with
> readability.

Makes sense. Will adjust in next rev.

> [..]
>> @@ -386,8 +394,8 @@ struct cgroup_rstat_cpu {
>>  	 *
>>  	 * Protected by per-cpu cgroup_rstat_cpu_lock.
>>  	 */
>> -	struct cgroup *updated_children;	/* terminated by self cgroup */
>> -	struct cgroup *updated_next;		/* NULL iff not on the list */
>> +	struct cgroup_subsys_state *updated_children;	/* terminated by self */
>> +	struct cgroup_subsys_state *updated_next;	/* NULL if not on list */
>
> nit: comment indentation needs fixing here
>
>>  };
>>
>>  struct cgroup_freezer_state {
> [..]
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index afc665b7b1fe..31b3bfebf7ba 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -164,7 +164,9 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
>>  static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
>
> Should we rename cgrp_dfl_root_rstat_cpu to indicate that it's specific
> to self css?

Sure.

>>
>>  /* the default hierarchy */
>> -struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
>> +struct cgroup_root cgrp_dfl_root = {
>> +	.cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu
>> +};
>>  EXPORT_SYMBOL_GPL(cgrp_dfl_root);
>>
>>  /*
> [..]
>> @@ -5407,7 +5401,11 @@ static void css_free_rwork_fn(struct work_struct *work)
>>  		struct cgroup_subsys_state *parent = css->parent;
>>  		int id = css->id;
>>
>> +		if (css->ss->css_rstat_flush)
>> +			cgroup_rstat_exit(css);
>> +
>>  		ss->css_free(css);
>> +
>
> nit: extra blank line here
>
>>  		cgroup_idr_remove(&ss->css_idr, id);
>>  		cgroup_put(cgrp);
>>
>> @@ -5431,7 +5429,7 @@ static void css_free_rwork_fn(struct work_struct *work)
>>  		cgroup_put(cgroup_parent(cgrp));
>>  		kernfs_put(cgrp->kn);
>>  		psi_cgroup_free(cgrp);
>> -		cgroup_rstat_exit(cgrp);
>> +		cgroup_rstat_exit(&cgrp->self);
>>  		kfree(cgrp);
>>  	} else {
>>  		/*
>> @@ -5459,11 +5457,7 @@ static void css_release_work_fn(struct work_struct *work)
>>  	if (ss) {
>>  		struct cgroup *parent_cgrp;
>>
>> -		/* css release path */
>> -		if (!list_empty(&css->rstat_css_node)) {
>> -			cgroup_rstat_flush(cgrp);
>> -			list_del_rcu(&css->rstat_css_node);
>> -		}
>> +		cgroup_rstat_flush(css);
>
> Here we used to call cgroup_rstat_flush() only if there was a
> css_rstat_flush() callback registered, now we call it unconditionally.
>
> Could this cause a NULL dereference when we try to call
> css->ss->css_rstat_flush() for controllers that did not register a
> callback?

Good point. Misuse here can lead to a bad access. Will cover in v3.

>>
>>  		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
>>  		if (ss->css_released)
> [..]
>> @@ -6188,6 +6186,9 @@ int __init cgroup_init(void)
>>  			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
>>  						   GFP_KERNEL);
>>  			BUG_ON(css->id < 0);
>> +
>> +			if (css->ss && css->ss->css_rstat_flush)
>> +				BUG_ON(cgroup_rstat_init(css));
>
> Why do we need this call here? We already call cgroup_rstat_init() in
> cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will
> have already called cgroup_init_subsys() in cgroup_init_early().
>
> Did I miss something?

Hmmm, it's a good question. cgroup_rstat_init() is deferred in the same
way that cgroup_idr_alloc() is.
So for ss with early_init == true, cgroup_rstat_init() is not called
during cgroup_init_early(). Is it safe to call alloc_percpu() during
early boot? If so, the deferral can be removed and cgroup_rstat_init()
can be called in one place.

>>  		} else {
>>  			cgroup_init_subsys(ss, false);
>>  		}
> [..]
>> @@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>>  }
>>
>>  /* see cgroup_rstat_flush() */
>> -static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>> +static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
>>  	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
>>  {
>> +	struct cgroup *cgrp = css->cgroup;
>>  	int cpu;
>>
>>  	lockdep_assert_held(&cgroup_rstat_lock);
>>
>>  	for_each_possible_cpu(cpu) {
>> -		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
>> +		struct cgroup_subsys_state *pos;
>>
>> +		pos = cgroup_rstat_updated_list(css, cpu);
>>  		for (; pos; pos = pos->rstat_flush_next) {
>> -			struct cgroup_subsys_state *css;
>> +			if (!pos->ss)
>> +				cgroup_base_stat_flush(pos->cgroup, cpu);
>> +			else
>> +				pos->ss->css_rstat_flush(pos, cpu);
>>
>> -			cgroup_base_stat_flush(pos, cpu);
>> -			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
>> -
>> -			rcu_read_lock();
>> -			list_for_each_entry_rcu(css, &pos->rstat_css_list,
>> -						rstat_css_node)
>> -				css->ss->css_rstat_flush(css, cpu);
>> -			rcu_read_unlock();
>> +			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
>
> We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
> Otherwise we will call BPF rstat flush whenever any subsystem is
> flushed.
>
> I guess it's because BPF can now pass any subsystem to
> cgroup_rstat_flush(), and we don't keep track. I think it would be
> better if we do not allow BPF programs to select a css and always make
> them flush the self css.
>
> We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
> in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().
I'm fine with this if others are in agreement. A similar concept was
done in v1.

> But if the plan is to remove the bpf_rstat_flush() call here soon then
> it's probably not worth the hassle.
>
> Shakeel (and others), WDYT?
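For reference, the wrapper being suggested could be as small as the sketch
below. The struct definitions are userspace stand-ins and the function bodies
are placeholders, not the actual kernel implementation; the only point is the
shape of the BPF-facing entry point, which pins the flush to the cgroup's
self css so BPF callers never select a css themselves.

```c
#include <assert.h>

/* Toy stand-ins for the kernel types involved (illustrative only). */
struct cgroup_subsys_state {
	int flushed;			/* counts flushes for this sketch */
};

struct cgroup {
	struct cgroup_subsys_state self;	/* the cgroup's "self" css */
};

/* Stand-in for the css-based internal flush API from this series. */
static void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
	css->flushed++;			/* real code walks the rstat tree */
}

/*
 * Proposed BPF-facing wrapper: takes a cgroup, always flushes the self
 * css, so BPF programs cannot pick an arbitrary subsystem tree.
 */
static void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
{
	cgroup_rstat_flush(&cgrp->self);
}
```

This keeps the existing BPF contract (pass a cgroup) intact while the
internal API moves to css granularity.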