linux-mm.kvack.org archive mirror
* [PATCH v2] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
@ 2025-12-05 20:01 Shakeel Butt
  2025-12-08 18:11 ` Michal Koutný
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Shakeel Butt @ 2025-12-05 20:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Michal Koutný,
	Paul E . McKenney, JP Kobryn, Yosry Ahmed, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On x86-64, this_cpu_cmpxchg() uses CMPXCHG without the LOCK prefix,
which means it is only atomic with respect to the local CPU and not
against other CPUs. Recently, commit 36df6e3dbd7e ("cgroup: make
css_rstat_updated nmi safe") made css_rstat_updated() lockless and
switched it to a lockless list to allow reentrancy. Since
css_rstat_updated() can be invoked from process context, IRQ and NMI,
it uses this_cpu_cmpxchg() to select the winner which will insert the
lockless lnode into the global per-cpu lockless list.
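
For reference, the winner-selection idea looks roughly like the
following standalone sketch (simplified, not the verbatim kernel code;
the sketch_*() names are made up, and it shows the pre-fix
this_cpu_cmpxchg() variant):

#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/types.h>

/* per-cpu node; its ->next points to itself while it is off-list */
struct sketch_pcpu {
	struct llist_node lnode;
};

/*
 * Returns true only for the single context that wins the right to
 * llist_add() the node.  Reentrant irq/nmi callers on the same CPU see
 * the cmpxchg fail and rely on the winner to publish the node.
 */
static bool sketch_claim(struct sketch_pcpu __percpu *pcpu,
			 struct llist_head *lhead)
{
	struct llist_node *self = this_cpu_ptr(&pcpu->lnode);

	if (llist_on_list(self))
		return false;		/* already queued for flushing */

	/* on x86-64 this is CMPXCHG without LOCK: atomic only vs this CPU */
	if (this_cpu_cmpxchg(pcpu->lnode.next, self, NULL) != self)
		return false;		/* lost to a stacked irq/nmi */

	llist_add(self, lhead);		/* winner publishes the node */
	return true;
}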

However, the commit missed one case where the lockless node of a cgroup
can be accessed and modified by another CPU doing the flushing, namely
llist_del_first_init() in css_process_update_tree().

At first glance it may seem impossible for css_process_update_tree() to
see a lockless node in the global lockless list while the updater is
still between the this_cpu_cmpxchg() and the llist_add() call in
css_rstat_updated(). This can indeed happen in the presence of
IRQs/NMIs.

Consider this scenario: an updater for cgroup C's stats on CPU A in
process context is past the llist_on_list() check but before the
this_cpu_cmpxchg() in css_rstat_updated() when it gets interrupted by
an IRQ/NMI. In the IRQ/NMI context, a new updater calls
css_rstat_updated() for the same cgroup C and successfully inserts
rstatc_pcpu->lnode.

Concurrently, CPU B is running the flusher; it calls
llist_del_first_init() for CPU A and gets cgroup C's rstatc_pcpu->lnode
which was added by the IRQ/NMI updater.

Now imagine CPU B calling init_llist_node() on CPU A's
rstatc_pcpu->lnode of cgroup C while, on CPU A, the resumed
process-context updater concurrently calls this_cpu_cmpxchg() on the
same rstatc_pcpu->lnode.

The CMPXCHG without LOCK on CPU A is not atomic against CPU B's store,
so the LOCK prefix is needed.

In Meta's fleet running kernels with commit 36df6e3dbd7e, we are
observing on some machines that the memcg stats get skewed by more than
the actual memory on the system. On closer inspection, we noticed that
the lockless node of a workload's cgroup for a specific CPU was in a
bad state and thus all the updates on that CPU for that cgroup were
being lost.

To confirm that this skew was indeed due to the CMPXCHG without LOCK in
css_rstat_updated(), we created a repro (using AI) at [1] which shows
that CMPXCHG without LOCK creates almost the same lnode corruption as
seen in Meta's fleet, while with LOCK CMPXCHG the issue does not
reproduce.
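
(For illustration only, and not the repro linked at [1]: a hypothetical
userspace sketch of the instruction-level difference. One thread claims
a word with a plain CMPXCHG, roughly what this_cpu_cmpxchg() emits on
x86-64, the other with a LOCK CMPXCHG. With real cross-CPU atomicity
exactly one claim per round can succeed, so any "double win"
demonstrates the lost atomicity; whether and how quickly it fires
depends on the machine.)

#include <pthread.h>
#include <stdio.h>

#define ROUNDS 10000000L

static unsigned long word;
static pthread_barrier_t barrier;
static long wins[2];

/* plain CMPXCHG, no LOCK prefix: roughly what this_cpu_cmpxchg() emits
 * on x86-64, i.e. atomic only against the local CPU */
static int cmpxchg_nolock(unsigned long *ptr, unsigned long old,
			  unsigned long new)
{
	unsigned long prev = old;

	asm volatile("cmpxchgq %2, %1"
		     : "+a"(prev), "+m"(*ptr)
		     : "r"(new)
		     : "memory", "cc");
	return prev == old;
}

/* LOCK CMPXCHG via the compiler's atomic builtin */
static int cmpxchg_locked(unsigned long *ptr, unsigned long old,
			  unsigned long new)
{
	return __atomic_compare_exchange_n(ptr, &old, new, 0,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

static void *claimer(void *arg)
{
	long id = (long)arg;
	long i;

	for (i = 0; i < ROUNDS; i++) {
		pthread_barrier_wait(&barrier);
		if (id == 0)
			word = 1;		/* arm the round */
		pthread_barrier_wait(&barrier);

		/* thread 0 claims without LOCK, thread 1 with LOCK */
		if (id == 0)
			wins[0] += cmpxchg_nolock(&word, 1, 0);
		else
			wins[1] += cmpxchg_locked(&word, 1, 0);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[2];

	pthread_barrier_init(&barrier, NULL, 2);
	pthread_create(&t[0], NULL, claimer, (void *)0);
	pthread_create(&t[1], NULL, claimer, (void *)1);
	pthread_join(t[0], NULL);
	pthread_join(t[1], NULL);

	/* exactly one thread should win each round; any excess over
	 * ROUNDS means both "won" the same round */
	printf("nolock wins %ld, locked wins %ld, double wins %ld\n",
	       wins[0], wins[1], wins[0] + wins[1] - ROUNDS);
	return 0;
}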

Link: http://lore.kernel.org/efiagdwmzfwpdzps74fvcwq3n4cs36q33ij7eebcpssactv3zu@se4hqiwxcfxq [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Fixes: 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe")
---
Changes since v1:
- Updated the comments based on Tejun's feedback
- Added link to the reproducer
- Updated commit message
v1: http://lore.kernel.org/20251205022437.1743547-1-shakeel.butt@linux.dev

 kernel/cgroup/rstat.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 91b34ebd5370..3b4d71eec8b9 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -71,7 +71,6 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
 	struct llist_head *lhead;
 	struct css_rstat_cpu *rstatc;
-	struct css_rstat_cpu __percpu *rstatc_pcpu;
 	struct llist_node *self;
 
 	/*
@@ -104,18 +103,22 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 	/*
 	 * This function can be renentered by irqs and nmis for the same cgroup
 	 * and may try to insert the same per-cpu lnode into the llist. Note
-	 * that llist_add() does not protect against such scenarios.
+	 * that llist_add() does not protect against such scenarios. In addition
+	 * this same per-cpu lnode can be modified through init_llist_node()
+	 * from css_rstat_flush() running on a different CPU.
 	 *
 	 * To protect against such stacked contexts of irqs/nmis, we use the
 	 * fact that lnode points to itself when not on a list and then use
-	 * this_cpu_cmpxchg() to atomically set to NULL to select the winner
+	 * try_cmpxchg() to atomically set to NULL to select the winner
 	 * which will call llist_add(). The losers can assume the insertion is
 	 * successful and the winner will eventually add the per-cpu lnode to
 	 * the llist.
+	 *
+	 * Please note that we can not use this_cpu_cmpxchg() here as on some
+	 * archs it is not safe against modifications from multiple CPUs.
 	 */
 	self = &rstatc->lnode;
-	rstatc_pcpu = css->rstat_cpu;
-	if (this_cpu_cmpxchg(rstatc_pcpu->lnode.next, self, NULL) != self)
+	if (!try_cmpxchg(&rstatc->lnode.next, &self, NULL))
 		return;
 
 	lhead = ss_lhead_cpu(css->ss, cpu);
-- 
2.47.3




* Re: [PATCH v2] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
  2025-12-05 20:01 [PATCH v2] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated Shakeel Butt
@ 2025-12-08 18:11 ` Michal Koutný
  2025-12-15 18:18   ` Shakeel Butt
  2025-12-08 18:15 ` Michal Koutný
  2025-12-08 18:32 ` Tejun Heo
  2 siblings, 1 reply; 5+ messages in thread
From: Michal Koutný @ 2025-12-08 18:11 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Johannes Weiner, Paul E . McKenney, JP Kobryn,
	Yosry Ahmed, linux-mm, cgroups, linux-kernel, Meta kernel team

On Fri, Dec 05, 2025 at 12:01:06PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which
> means it is only safe for the local CPU and not for multiple CPUs.
...
> The CMPXCNG without LOCK on CPU A is not safe and thus we need LOCK
> prefix.

Does it mean that this_cpu_cmpxchg() is generally useless? (It appears
so from your analysis.)

> Now concurrently CPU B is running the flusher and it calls
> llist_del_first_init() for CPU A and got rstatc_pcpu->lnode of cgroup C
> which was added by the IRQ/NMI updater.

Or it's rather the case that the rstat code combines both this_cpu_*
and remote access from the flusher.

Documentation/core-api/this_cpu_ops.rst washes its hands with:
| Please note that accesses by remote processors to a per cpu area are
| exceptional situations and may impact performance and/or correctness
| (remote write operations) of local RMW operations via this_cpu_*.

I see there's currently only one other user of that in kernel/scs.c
(__scs_alloc() vs scs_cleanup() without even WRITE_ONCE, but the race
would involve CPU hotplug, so its impact may be limited(?)).

I think your learnt-the-hard-way discovery should not only be captured
in the cgroup code but also in the this_cpu_ops.rst document, so that
readers are wary especially of this_cpu_cmpxchg (when dealing with
pointers and not more tolerable counters).


> Consider this scenario: Updater for cgroup stat C on CPU A in process
> context is after llist_on_list() check and before this_cpu_cmpxchg() in
> css_rstat_updated() where it get interrupted by IRQ/NMI. In the IRQ/NMI
> context, a new updater calls css_rstat_updated() for same cgroup C and
> successfully inserts rstatc_pcpu->lnode.
> 
> Now imagine CPU B calling init_llist_node() on cgroup C's
> rstatc_pcpu->lnode of CPU A and on CPU A, the process context updater
> calling this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently.

Sounds feasible to me.

Thanks,
Michal


* Re: [PATCH v2] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
  2025-12-05 20:01 [PATCH v2] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated Shakeel Butt
  2025-12-08 18:11 ` Michal Koutný
@ 2025-12-08 18:15 ` Michal Koutný
  2025-12-08 18:32 ` Tejun Heo
  2 siblings, 0 replies; 5+ messages in thread
From: Michal Koutný @ 2025-12-08 18:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Johannes Weiner, Paul E . McKenney, JP Kobryn,
	Yosry Ahmed, linux-mm, cgroups, linux-kernel, Meta kernel team

On Fri, Dec 05, 2025 at 12:01:06PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which
> means it is only safe for the local CPU and not for multiple CPUs.
> Recently the commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi
> safe") make css_rstat_updated lockless and uses lockless list to allow
> reentrancy. Since css_rstat_updated can invoked from process context,
					be

> IRQ and NMI, it uses this_cpu_cmpxchg() to select the winner which will
> inset the lockless lnode into the global per-cpu lockless list.
   insert

> 
> However the commit missed one case where lockless node of a cgroup can
> be accessed and modified by another CPU doing the flushing. Basically
> llist_del_first_init() in css_process_update_tree().
> 
> On a cursory look, it can be questioned how css_process_update_tree()
> can see a lockless node in global lockless list where the updater is at
> this_cpu_cmpxchg() and before llist_add() call in css_rstat_updated().
> This can indeed happen in the presence of IRQs/NMI.
> 
> Consider this scenario: Updater for cgroup stat C on CPU A in process
> context is after llist_on_list() check and before this_cpu_cmpxchg() in
> css_rstat_updated() where it get interrupted by IRQ/NMI. In the IRQ/NMI
				gets


(sorry for another mail; while reading the patch I noted these in a
different buffer, they may be applied if you decide on a v2+)



* Re: [PATCH v2] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
  2025-12-05 20:01 [PATCH v2] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated Shakeel Butt
  2025-12-08 18:11 ` Michal Koutný
  2025-12-08 18:15 ` Michal Koutný
@ 2025-12-08 18:32 ` Tejun Heo
  2 siblings, 0 replies; 5+ messages in thread
From: Tejun Heo @ 2025-12-08 18:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Koutný,
	Paul E. McKenney, JP Kobryn, Yosry Ahmed, linux-mm, cgroups,
	linux-kernel, Meta kernel team

Applied to cgroup/for-6.19-fixes w/ stable cc added.

Thanks.
-- 
tejun



* Re: [PATCH v2] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
  2025-12-08 18:11 ` Michal Koutný
@ 2025-12-15 18:18   ` Shakeel Butt
  0 siblings, 0 replies; 5+ messages in thread
From: Shakeel Butt @ 2025-12-15 18:18 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Johannes Weiner, Paul E . McKenney, JP Kobryn,
	Yosry Ahmed, linux-mm, cgroups, linux-kernel, Meta kernel team

Hi Michal,

Sorry for the late response as I was travelling.

On Mon, Dec 08, 2025 at 07:11:31PM +0100, Michal Koutný wrote:
> On Fri, Dec 05, 2025 at 12:01:06PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which
> > means it is only safe for the local CPU and not for multiple CPUs.
> ...
> > The CMPXCNG without LOCK on CPU A is not safe and thus we need LOCK
> > prefix.
> 
> Does it mean that this_cpu_cmpxchg() is generally useless? (It appears
> so from your analysis.)

No, it is still useful for single-CPU atomicity, i.e. process context
vs IRQs and NMIs.
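
For example, something like the following is still a valid use (a
minimal hypothetical sketch, not code from the tree):

#include <linux/percpu.h>
#include <linux/types.h>

/* hypothetical flag only ever touched from its owning CPU */
static DEFINE_PER_CPU(unsigned long, claim_flag);

/*
 * Atomic against task vs irq reentry on the local CPU (and vs nmi on
 * archs where this_cpu_cmpxchg() is a single instruction) without
 * needing the LOCK prefix, because no other CPU ever writes claim_flag.
 */
static bool try_claim_on_this_cpu(void)
{
	return this_cpu_cmpxchg(claim_flag, 0, 1) == 0;
}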

> 
> > Now concurrently CPU B is running the flusher and it calls
> > llist_del_first_init() for CPU A and got rstatc_pcpu->lnode of cgroup C
> > which was added by the IRQ/NMI updater.
> 
> Or it's rather the case where rstat code combines both this_cpu_* and
> remote access from the flusher.

Yes.

> 
> Documentation/core-api/this_cpu_ops.rst washes its hands with:
> | Please note that accesses by remote processors to a per cpu area are
> | exceptional situations and may impact performance and/or correctness
> | (remote write operations) of local RMW operations via this_cpu_*.
> 
> I see there's currently only one other user of that in kernel/scs.c
> (__scs_alloc() vs scs_cleanup() without even WRITE_ONCE, but the race
> would involve CPU hotplug, so its impact may be limited(?)).

No, I don't think there is a race, as the hotplug callback happens in
the PREPARE state where the target CPU is already offline and thus
nothing is running on it. BTW, cached_stacks for the VMAP kernel stack
is similar.

> 
> I think your learnt-the-hard-way discovery should not only be in
> cgroup.c but also in this this_cpu_ops.rst document to be wary
> especially with this_cpu_cmpxchg (when dealing with pointers and not
> more tolerable counters).

Yes, this makes sense. I will follow up on that.
> 
> 
> > Consider this scenario: Updater for cgroup stat C on CPU A in process
> > context is after llist_on_list() check and before this_cpu_cmpxchg() in
> > css_rstat_updated() where it get interrupted by IRQ/NMI. In the IRQ/NMI
> > context, a new updater calls css_rstat_updated() for same cgroup C and
> > successfully inserts rstatc_pcpu->lnode.
> > 
> > Now imagine CPU B calling init_llist_node() on cgroup C's
> > rstatc_pcpu->lnode of CPU A and on CPU A, the process context updater
> > calling this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently.
> 
> Sounds feasible to me.

Thanks for taking a look.

