Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe


From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: "Tejun Heo" <tj@kernel.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
	"JP Kobryn" <inwardvessel@gmail.com>,
	bpf@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	"Meta kernel team" <kernel-team@meta.com>
Subject: Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
Date: Wed, 30 Apr 2025 06:14:28 -0700
Message-ID: <aBIiNMXIl6vyaNQ6@Asmaa.>
In-Reply-To: <20250429061211.1295443-4-shakeel.butt@linux.dev>

On Mon, Apr 28, 2025 at 11:12:09PM -0700, Shakeel Butt wrote:
> To make css_rstat_updated() able to safely run in nmi context, it cannot
> spin on locks and instead has to trylock the per-cpu per-ss raw
> spinlock. This patch implements a backlog mechanism to handle failure
> to acquire the per-cpu per-ss raw spinlock.
> 
> Each subsystem provides a per-cpu lockless list on which the kernel
> stores the css given to css_rstat_updated() on trylock failure. These
> lockless lists serve as backlog. On cgroup stats flushing code path, the
> kernel first processes all the per-cpu lockless backlog lists of the
> given ss and then proceeds to flush the update stat trees.
> 
> With css_rstat_updated() being nmi safe, the memcg stats can and will be
> converted to be nmi safe to enable nmi safe mem charging.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  kernel/cgroup/rstat.c | 99 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 76 insertions(+), 23 deletions(-)
> 
[..]
> @@ -153,6 +160,51 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
>  
>  		css = parent;
>  	}
> +}
> +
> +static void css_process_backlog(struct cgroup_subsys *ss, int cpu)
> +{
> +	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
> +	struct llist_node *lnode;
> +
> +	while ((lnode = llist_del_first_init(lhead))) {
> +		struct css_rstat_cpu *rstatc;
> +
> +		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
> +		__css_rstat_updated(rstatc->owner, cpu);
> +	}
> +}
> +
> +/**
> + * css_rstat_updated - keep track of updated rstat_cpu
> + * @css: target cgroup subsystem state
> + * @cpu: cpu on which rstat_cpu was updated
> + *
> + * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
> + * rstat_cpu->updated_children list. See the comment on top of
> + * css_rstat_cpu definition for details.
> + */
> +__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> +{
> +	unsigned long flags;
> +
> +	/*
> +	 * Speculative already-on-list test. This may race leading to
> +	 * temporary inaccuracies, which is fine.
> +	 *
> +	 * Because @parent's updated_children is terminated with @parent
> +	 * instead of NULL, we can tell whether @css is on the list by
> +	 * testing the next pointer for NULL.
> +	 */
> +	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
> +		return;
> +
> +	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {


IIUC this trylock will only fail if a BPF program runs in NMI context
and tries to update cgroup stats, interrupting a context that is already
holding the lock (i.e. updating or flushing stats).
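
To make that failure mode concrete, here is a minimal userspace sketch of
the trylock-plus-backlog pattern, written with C11 atomics. Everything in
it is an assumption made for illustration: struct node, tree_lock, backlog,
updated() and flush() are hypothetical stand-ins, not the kernel's rstat
code, and a single global lock and list replace the per-cpu per-ss ones.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
struct node {
	struct node *_Atomic next;	/* link on the lockless backlog */
	atomic_bool queued;		/* true while the node sits on the backlog */
	bool on_tree;			/* stands in for "linked into the updated tree" */
};

static atomic_flag tree_lock = ATOMIC_FLAG_INIT;	/* stand-in for the raw spinlock */
static struct node *_Atomic backlog;			/* lockless LIFO of deferred nodes */

/* Caller must hold tree_lock. */
static void tree_insert_locked(struct node *n)
{
	n->on_tree = true;
}

/* Push @n onto the backlog unless it is already queued. Never blocks. */
static void backlog_push(struct node *n)
{
	if (atomic_exchange(&n->queued, true))
		return;

	struct node *head = atomic_load(&backlog);
	do {
		atomic_store(&n->next, head);
	} while (!atomic_compare_exchange_weak(&backlog, &head, n));
}

/* Update side: returns true if inserted directly, false if deferred. */
static bool updated(struct node *n)
{
	if (atomic_flag_test_and_set(&tree_lock)) {
		/* Lock is held by the context we interrupted: defer, do not spin. */
		backlog_push(n);
		return false;
	}
	tree_insert_locked(n);
	atomic_flag_clear(&tree_lock);
	return true;
}

/* Flush side: take the lock, then drain the backlog. Returns nodes drained. */
static int flush(void)
{
	int drained = 0;

	while (atomic_flag_test_and_set(&tree_lock))
		;	/* spinning is fine here, the flusher is not in NMI context */

	struct node *n = atomic_exchange(&backlog, NULL);
	while (n) {
		struct node *next = atomic_load(&n->next);

		atomic_store(&n->queued, false);
		tree_insert_locked(n);
		drained++;
		n = next;
	}
	atomic_flag_clear(&tree_lock);
	return drained;
}
```

The only way updated() takes the deferred path is when it runs while the
lock is already held. In a single thread that can be emulated by setting
tree_lock by hand before calling updated().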

How often does this happen in practice though? Is it worth the complexity?

I wonder if it's better if we make css_rstat_updated() inherently
lockless instead.

What if css_rstat_updated() always just adds to a lockless tree, and we
defer constructing the proper tree to the flushing side? This should
make updates generally faster and avoid locking or disabling interrupts
in the fast path. We essentially push more work to the flushing side.

We may be able to consolidate some of the code too if all the logic
manipulating the tree is on the flushing side.

WDYT? Am I missing something here?
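
A rough userspace sketch of the idea, just to show the shape of it. It
uses a lockless list rather than a lockless tree as the staging structure,
which is a simplification on my part, and every name in it (struct cg,
pending_head, cg_updated(), cg_flush_build_tree()) is hypothetical. The
real thing would presumably be per-cpu and per-ss and could build on
llist_add() and llist_del_all().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a css; not the kernel structure. */
struct cg {
	struct cg *parent;
	struct cg *_Atomic pending_next;	/* link on the lockless pending list */
	atomic_bool pending;			/* already queued for the flusher? */
	bool in_updated_tree;			/* only ever touched by the flusher */
};

static struct cg *_Atomic pending_head;

/* Update side: no lock and no irq disabling, just an xchg and a CAS loop. */
static void cg_updated(struct cg *cg)
{
	if (atomic_exchange(&cg->pending, true))
		return;		/* already queued, nothing to do */

	struct cg *head = atomic_load(&pending_head);
	do {
		atomic_store(&cg->pending_next, head);
	} while (!atomic_compare_exchange_weak(&pending_head, &head, cg));
}

/*
 * Flush side: detach the whole pending list at once, then build the
 * updated tree by walking each node's ancestors until one is already
 * linked. Returns the number of nodes newly linked. A real flusher would
 * hold its own lock here; that is omitted from this sketch.
 */
static int cg_flush_build_tree(void)
{
	int linked = 0;
	struct cg *cg = atomic_exchange(&pending_head, NULL);

	while (cg) {
		struct cg *next = atomic_load(&cg->pending_next);

		atomic_store(&cg->pending, false);
		for (struct cg *pos = cg; pos && !pos->in_updated_tree;
		     pos = pos->parent) {
			pos->in_updated_tree = true;
			linked++;
		}
		cg = next;
	}
	return linked;
}
```

All of the tree manipulation ends up in cg_flush_build_tree(), so the
update path never needs the lock that protects the tree.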

> +		css_add_to_backlog(css, cpu);
> +		return;
> +	}
> +
> +	__css_rstat_updated(css, cpu);
>  
>  	_css_rstat_cpu_unlock(css, cpu, flags, true);
>  }
> @@ -255,6 +307,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
>  
>  	flags = _css_rstat_cpu_lock(root, cpu, false);
>  
> +	css_process_backlog(root->ss, cpu);
>  	/* Return NULL if this subtree is not on-list */
>  	if (!rstatc->updated_next)
>  		goto unlock_ret;
> -- 
> 2.47.1
>


Thread overview: 22+ messages
2025-04-29  6:12 [RFC PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-04-29  6:12 ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list() Shakeel Butt
2025-04-30 12:44   ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list()g Yosry Ahmed
2025-04-29  6:12 ` [RFC PATCH 2/3] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
2025-04-29  6:12 ` [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
2025-04-30 13:14   ` Yosry Ahmed [this message]
2025-05-01 22:10     ` Shakeel Butt
2025-05-06  9:41       ` Yosry Ahmed
2025-05-06 19:30         ` Shakeel Butt
2025-05-07  6:52           ` Yosry Ahmed
2025-04-29  6:12 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
2025-04-29  6:12   ` [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention Shakeel Butt
2025-04-29  6:15     ` Shakeel Butt
2025-05-21 22:23       ` Klara Modin
2025-05-21 22:29         ` Tejun Heo
2025-05-21 23:23         ` Shakeel Butt
2025-05-21 23:33           ` Shakeel Butt
2025-05-21 23:47             ` JP Kobryn
2025-05-21 23:50               ` Shakeel Butt
2025-05-21 23:52                 ` JP Kobryn
2025-05-21 23:47             ` Shakeel Butt
2025-04-29  6:15   ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
