linux-mm.kvack.org archive mirror
* [BUG] Memory ordering between kmalloc() and kfree()? it's confusing!
@ 2026-02-26  6:35 Harry Yoo
  2026-02-26 15:45 ` Alan Stern
  0 siblings, 1 reply; 7+ messages in thread
From: Harry Yoo @ 2026-02-26  6:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dmitry Vyukov, lkmm, linux-arch, linux-kernel, Joel Fernandes,
	Daniel Lustig, Akira Yokosawa, Paul E. McKenney, Luc Maranget,
	Jade Alglave, David Howells, Nicholas Piggin, Boqun Feng,
	Peter Zijlstra, Will Deacon, Andrea Parri, Alan Stern,
	Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hao Li, Shakeel Butt,
	Venkat Rao Bagalkote, Mateusz Guzik, Suren Baghdasaryan,
	Marco Elver

Hello, SLAB, LKMM, and KCSAN folks!

I'd like to discuss the slab allocator's assumptions about its users
regarding memory ordering.

Recently, I've been investigating an interesting slab memory ordering
issue [3] [4] in v7.0-rc1, which made me think about memory ordering
for slab objects.

But without answering "What does slab expect users to do for correct
operation?", I kept getting puzzled, and my brain hurt too much :/
I'm writing things down to stop getting confused :)

Since I have never thought about this before, my reasoning could be
partially or entirely incorrect. If so, please kindly let me know.

# Slab's assumption: Stores to object, its metadata, or struct slab
# must be visible to the CPU that frees the object, when it is
# passed to kfree(). It's users' responsibility to guarantee that.

When the slab allocator allocates an object, it updates the object's
metadata and struct slab fields. After allocation, the slab user updates
the object's content. As long as the object is freed on the same CPU on
which it was allocated, kfree() can see those stores (a CPU can always
see the contents of its own store buffer), so no problem!

However, when, for example, the pointer to the object is stored in a
shared variable and the object is then freed on a different CPU, things
become trickier.

In this case, I think it's fair for the slab allocator to assume that:

  1) Such stores must involve _at least_ a release barrier
     (for example, via {cmp,}xchg{,_release}, or smp_store_release())
     to ensure preceding stores are visible to other CPUs before
     the pointer store becomes visible, and

  2) The CPU that frees an object must invoke at least an acquire
     barrier to ensure that stores to object content / metadata, etc.,
     are visible to the freeing CPU when it calls kfree().

Because the slab allocator itself doesn't guarantee that such
barriers are invoked within the allocator, it relies on users to
do this when needed.

... and that's quite a reasonable assumption, isn't it?

Actually, I'm not the first person to question this.

[1] https://lore.kernel.org/linux-mm/CACT4Y+Yfz3XvT+w6a3WjcZuATb1b9JdQHHf637zdT=6QZ-hjKg@mail.gmail.com
[2] https://lore.kernel.org/linux-mm/20140102203320.GA27615@linux.vnet.ibm.com

# Now, let's take a look at the bug I've been investigating

There were two bugs [3] [4] reported, with symptoms that appear to be
caused by slab returning wrong metadata (the symptoms: incorrect
reference counting of obj_cgroup, integer overflow as more memory is
uncharged than charged).

[3] https://lore.kernel.org/lkml/ca241daa-e7e7-4604-a48d-de91ec9184a5@linux.ibm.com
[4] https://lore.kernel.org/all/ddff7c7d-c0c3-4780-808f-9a83268bbf0c@linux.ibm.com

Hmm, if it's returning wrong metadata, how could that happen?

Well, perhaps either 1) the calculation of the metadata address is
incorrect, or 2) reading the metadata itself is racy.

Shakeel Butt pointed out [9] a potential memory ordering issue: with no
enforced ordering between slab->obj_exts and slab->stride, the metadata
address calculation can go wrong.

[9] https://lore.kernel.org/lkml/aZu9G9mVIVzSm6Ft@hyeyoo

Let's say CPU X and Y are allocating/freeing slab objects from/to
the same slab. They need to access metadata for the objects:

CPU X				CPU Y

// CPU X allocates metadata array
- slab->obj_exts = <the address of the metadata array>
- slab->stride = 16 (sizeof struct slab)

- stride = plain load slab->stride
- obj_exts = READ_ONCE(slab->obj_exts)
- if (obj_exts)
    - metadata_addr =
      stride * index + obj_exts
				- stride = plain load slab->stride
				- obj_exts = READ_ONCE(slab->obj_exts)
				- if (obj_exts)
				  - metadata_addr = stride * index +
						    obj_exts

				// Wait, obj_exts is non-NULL,
				// but slab->stride is stale!

				// Now, metadata_addr is wrong.

Hmm, this could definitely happen when two CPUs allocate/free objects
from/to the same slab. We need to make sure that a CPU cannot see a
stale slab->stride as long as it sees a non-NULL slab->obj_exts.

# How I tried to fix it

An expensive solution would be to do:

CPU X:					CPU Y:
- slab->stride = 16			- obj_exts = READ_ONCE(slab->obj_exts)
- smp_wmb()				- if (obj_exts)
- slab->obj_exts = <something>		  - smp_rmb()
					  - stride = plain load slab->stride

Then, CPU Y should see either (obj_exts == 0), or
(obj_exts != 0 and a valid stride). (obj_exts != 0) && (invalid stride)
is impossible.

This fix [5] seems to resolve the bug [6], yay!

Before testing this fix, I wasn't fully convinced that it was a memory
ordering issue. But after testing it, it seems reasonable to assume that
it's indeed a memory ordering issue.

[5] https://lore.kernel.org/linux-mm/aZ2Gwie5dpXotxWc@hyeyoo
[6] https://lore.kernel.org/linux-mm/84492f08-04c2-485c-9a18-cdafd5a9c3e5@linux.ibm.com 

# How I tried to optimize the fix

Hmm, but it's not great to have memory barriers in the slab alloc/free
paths, right? So I tried to optimize the fix while maintaining
correctness.

Previously, slab->stride could be set during slab's initialization
by alloc_slab_obj_exts_early(), or later by alloc_slab_obj_exts().

Within the slab allocator, for a slab to become accessible to other
CPUs, it needs to go through the per-node partial slab list
(struct kmem_cache_node.partial), which is protected by a spinlock.

Hmm, if we make sure that slab->stride is set early, before the slab
becomes accessible to other CPUs, the smp_wmb()/smp_rmb() pair is not
necessary. So I made that change [7].

But something strange happened when I tried this optimization!
The optimized fix didn't resolve the bug [8].

[7] https://lore.kernel.org/linux-mm/20260223075809.19265-1-harry.yoo@oracle.com 
[8] https://lore.kernel.org/linux-mm/2d106583-4ec6-4da0-87ea-4ecad893b24f@linux.ibm.com

Hmm... even when slabs don't go through the list protected by the
spinlock, perhaps an object was allocated on CPU X, and then freed
on CPU Y?

But as long as "Slab's assumption" described above is satisfied,
I can't explain why stores to slab->stride or the metadata wouldn't be
visible to the freeing CPU :/

That makes me wonder "is somebody breaking that assumption?"

If so, the smp_rmb() in the previous fix [5] might have unintentionally
acted as a band-aid, ensuring that stores to slab->stride and the
metadata are visible to the freeing CPU, when in fact the barrier
should have been invoked by the user.

Looking at commit 9e6b7cd7e77d ("tty: fix data race in
tty_buffer_flush") and commit f57e515a1b56 ("kernel/pid.c: convert
struct pid count to refcount_t"), it doesn't seem too crazy to suspect
that somebody is breaking the assumption.

Does this sound reasonable, or am I missing something?

p.s., Many thanks to Pedro Falcato and Vlastimil Babka, who actively
discussed this off-list with me. That helped develop my understanding
a lot!

-- 
Cheers,
Harry / Hyeonggon



* Re: [BUG] Memory ordering between kmalloc() and kfree()? it's confusing!
  2026-02-26  6:35 [BUG] Memory ordering between kmalloc() and kfree()? it's confusing! Harry Yoo
@ 2026-02-26 15:45 ` Alan Stern
  2026-02-26 16:17   ` Harry Yoo
  0 siblings, 1 reply; 7+ messages in thread
From: Alan Stern @ 2026-02-26 15:45 UTC (permalink / raw)
  To: Harry Yoo
  Cc: linux-mm, Dmitry Vyukov, lkmm, linux-arch, linux-kernel,
	Joel Fernandes, Daniel Lustig, Akira Yokosawa, Paul E. McKenney,
	Luc Maranget, Jade Alglave, David Howells, Nicholas Piggin,
	Boqun Feng, Peter Zijlstra, Will Deacon, Andrea Parri,
	Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hao Li, Shakeel Butt,
	Venkat Rao Bagalkote, Mateusz Guzik, Suren Baghdasaryan,
	Marco Elver

On Thu, Feb 26, 2026 at 03:35:08PM +0900, Harry Yoo wrote:
> Hello, SLAB, LKMM, and KCSAN folks!
> 
> I'd like to discuss slab's assumption on users regarding memory ordering.
> 
> Recently, I've been investigating an interesting slab memory ordering
> issue [3] [4] in v7.0-rc1, which made me think about memory ordering
> for slab objects.
> 
> But without answering "What does slab expect users to do for correct
> operation?", I kept getting puzzled, and my brain hurt too much :/
> I'm writing things down to stop getting confused :)
> 
> Since I have never thought about this before, my reasoning could be
> partially or entirely incorrect. If so, please kindly let me know.
> 
> # Slab's assumption: Stores to object, its metadata, or struct slab
> # must be visible to the CPU that frees the object, when it is
> # passed to kfree(). It's users' responsibility to guarantee that.
> 
> When the slab allocator allocates an object, it updates its metadata and
> struct slab fields. After allocation, the user of slab updates object's
> content. As long as the object is freed on the same CPU that it was
> allocated, kfree() can see those stores (A CPU must be able to see
> what's in its store buffer), so no problem!
> 
> However, when e.g.) the pointer to object is stored in a shared variable
> and then freed on a different CPU, things become trickier.
> 
> In this case, I think it's fair for the slab allocator to assume that:
> 
>   1) Such stores must involve _at least_ a release barrier
>      (for example, via {cmp,}xchg{,_release}, or smp_store_release())
>      to ensure preceding stores are visible to other CPUs before
>      the pointer store becomes visible, and
> 
>   2) The CPU that frees an object must invoke at least an acquire
>      barrier to ensure that stores to object content / metadata, etc.,
>      are visible to the freeing CPU when it calls kfree().
> 
> Because the slab allocator itself doesn't guarantee that such
> barriers are invoked within the allocator, it relies on users to
> do this when needed.

It doesn't?  Then how does the slab allocator guarantee that two 
different CPUs won't try to perform allocations or deallocations from 
the same slab at the same time, messing everything up?

Can you explain how this is meant to work, for those of us who don't 
know anything about the slab allocator's internal mechanism?

Alan Stern



* Re: [BUG] Memory ordering between kmalloc() and kfree()? it's confusing!
  2026-02-26 15:45 ` Alan Stern
@ 2026-02-26 16:17   ` Harry Yoo
  2026-02-26 16:42     ` Alan Stern
  2026-02-26 17:59     ` Christoph Lameter (Ampere)
  0 siblings, 2 replies; 7+ messages in thread
From: Harry Yoo @ 2026-02-26 16:17 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-mm, Dmitry Vyukov, lkmm, linux-arch, linux-kernel,
	Joel Fernandes, Daniel Lustig, Akira Yokosawa, Paul E. McKenney,
	Luc Maranget, Jade Alglave, David Howells, Nicholas Piggin,
	Boqun Feng, Peter Zijlstra, Will Deacon, Andrea Parri,
	Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hao Li, Shakeel Butt,
	Venkat Rao Bagalkote, Mateusz Guzik, Suren Baghdasaryan,
	Marco Elver

On Thu, Feb 26, 2026 at 10:45:55AM -0500, Alan Stern wrote:
> On Thu, Feb 26, 2026 at 03:35:08PM +0900, Harry Yoo wrote:
> > Hello, SLAB, LKMM, and KCSAN folks!
> > 
> > I'd like to discuss slab's assumption on users regarding memory ordering.
> > 
> > Recently, I've been investigating an interesting slab memory ordering
> > issue [3] [4] in v7.0-rc1, which made me think about memory ordering
> > for slab objects.
> > 
> > But without answering "What does slab expect users to do for correct
> > operation?", I kept getting puzzled, and my brain hurt too much :/
> > I'm writing things down to stop getting confused :)
> > 
> > Since I have never thought about this before, my reasoning could be
> > partially or entirely incorrect. If so, please kindly let me know.
> > 
> > # Slab's assumption: Stores to object, its metadata, or struct slab
> > # must be visible to the CPU that frees the object, when it is
> > # passed to kfree(). It's users' responsibility to guarantee that.
> > 
> > When the slab allocator allocates an object, it updates its metadata and
> > struct slab fields. After allocation, the user of slab updates object's
> > content. As long as the object is freed on the same CPU that it was
> > allocated, kfree() can see those stores (A CPU must be able to see
> > what's in its store buffer), so no problem!
> > 
> > However, when e.g.) the pointer to object is stored in a shared variable
> > and then freed on a different CPU, things become trickier.
> > 
> > In this case, I think it's fair for the slab allocator to assume that:
> > 
> >   1) Such stores must involve _at least_ a release barrier
> >      (for example, via {cmp,}xchg{,_release}, or smp_store_release())
> >      to ensure preceding stores are visible to other CPUs before
> >      the pointer store becomes visible, and
> > 
> >   2) The CPU that frees an object must invoke at least an acquire
> >      barrier to ensure that stores to object content / metadata, etc.,
> >      are visible to the freeing CPU when it calls kfree().
> > 
> > Because the slab allocator itself doesn't guarantee that such
> > barriers are invoked within the allocator, it relies on users to
> > do this when needed.
> 
> It doesn't?  Then how does the slab allocator guarantee that two 
> different CPUs won't try to perform allocations or deallocations from 
> the same slab at the same time, messing everything up?

Ah, the alloc/free slowpaths do use cmpxchg128 or a spinlock, and
don't mess things up.

But fastpath allocs/frees are served from a percpu array that is
protected by a local_lock. local_lock includes a compiler barrier,
but that's not enough.

> Can you explain how this is meant to work, for those of us who don't 
> know anything about the slab allocator's internal mechanism?

-- 
Cheers,
Harry / Hyeonggon



* Re: [BUG] Memory ordering between kmalloc() and kfree()? it's confusing!
  2026-02-26 16:17   ` Harry Yoo
@ 2026-02-26 16:42     ` Alan Stern
  2026-02-26 17:11       ` Harry Yoo
  2026-02-26 17:59     ` Christoph Lameter (Ampere)
  1 sibling, 1 reply; 7+ messages in thread
From: Alan Stern @ 2026-02-26 16:42 UTC (permalink / raw)
  To: Harry Yoo
  Cc: linux-mm, Dmitry Vyukov, lkmm, linux-arch, linux-kernel,
	Joel Fernandes, Daniel Lustig, Akira Yokosawa, Paul E. McKenney,
	Luc Maranget, Jade Alglave, David Howells, Nicholas Piggin,
	Boqun Feng, Peter Zijlstra, Will Deacon, Andrea Parri,
	Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hao Li, Shakeel Butt,
	Venkat Rao Bagalkote, Mateusz Guzik, Suren Baghdasaryan,
	Marco Elver

On Fri, Feb 27, 2026 at 01:17:52AM +0900, Harry Yoo wrote:
> On Thu, Feb 26, 2026 at 10:45:55AM -0500, Alan Stern wrote:
> > On Thu, Feb 26, 2026 at 03:35:08PM +0900, Harry Yoo wrote:
> > > Because the slab allocator itself doesn't guarantee that such
> > > barriers are invoked within the allocator, it relies on users to
> > > do this when needed.
> > 
> > It doesn't?  Then how does the slab allocator guarantee that two 
> > different CPUs won't try to perform allocations or deallocations from 
> > the same slab at the same time, messing everything up?
> 
> Ah, alloc/free slowpaths do use cmpxchg128 or spinlock and
> don't mess things up.
> 
> But fastpath allocs/frees are served from percpu array that is protected
> by a local_lock. local_lock has a compiler barrier in it, but that's
> not enough.

If those things rely on a percpu array, how can one CPU possibly 
manipulate a resource (slab or something else) that was changed by a 
different CPU?  The whole point of percpu data structures is that each 
CPU gets its own copy.

Alan Stern



* Re: [BUG] Memory ordering between kmalloc() and kfree()? it's confusing!
  2026-02-26 16:42     ` Alan Stern
@ 2026-02-26 17:11       ` Harry Yoo
  2026-02-26 18:06         ` Alan Stern
  0 siblings, 1 reply; 7+ messages in thread
From: Harry Yoo @ 2026-02-26 17:11 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-mm, Dmitry Vyukov, lkmm, linux-arch, linux-kernel,
	Joel Fernandes, Daniel Lustig, Akira Yokosawa, Paul E. McKenney,
	Luc Maranget, Jade Alglave, David Howells, Nicholas Piggin,
	Boqun Feng, Peter Zijlstra, Will Deacon, Andrea Parri,
	Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hao Li, Shakeel Butt,
	Venkat Rao Bagalkote, Mateusz Guzik, Suren Baghdasaryan,
	Marco Elver

On Thu, Feb 26, 2026 at 11:42:02AM -0500, Alan Stern wrote:
> On Fri, Feb 27, 2026 at 01:17:52AM +0900, Harry Yoo wrote:
> > On Thu, Feb 26, 2026 at 10:45:55AM -0500, Alan Stern wrote:
> > > On Thu, Feb 26, 2026 at 03:35:08PM +0900, Harry Yoo wrote:
> > > > Because the slab allocator itself doesn't guarantee that such
> > > > barriers are invoked within the allocator, it relies on users to
> > > > do this when needed.
> > > 
> > > It doesn't?  Then how does the slab allocator guarantee that two 
> > > different CPUs won't try to perform allocations or deallocations from 
> > > the same slab at the same time, messing everything up?
> > 
> > Ah, alloc/free slowpaths do use cmpxchg128 or spinlock and
> > don't mess things up.
> > 
> > But fastpath allocs/frees are served from percpu array that is protected
> > by a local_lock. local_lock has a compiler barrier in it, but that's
> > not enough.
> 
> If those things rely on a percpu array, how can one CPU possibly 
> manipulate a resource (slab or something else) that was changed by a 
> different CPU?

AFAICT that shouldn't happen within the slab allocator.

> The whole point of percpu data structures is that each 
> CPU gets its own copy.

Exactly.

But I'm not talking about what happens within the allocator,
but rather, about what slab expects to happen outside the allocator.

Something like this:

CPU X				CPU Y
ptr = kmalloc();
WRITE_ONCE(gp, ptr);
				if (p = READ_ONCE(gp))
					kfree(p);

Yes, it's a crazy thing to do. CPU Y isn't guaranteed to see the
up-to-date version of the object content or metadata.

Instead, the code should do:

CPU X				CPU Y
ptr = kmalloc();
smp_store_release(&gp, ptr);
				if (p = smp_load_acquire(&gp))
					kfree(p);

One reason I started this discussion was to argue that we should have
a well-defined contract between the slab allocator and its users.

-- 
Cheers,
Harry / Hyeonggon



* Re: [BUG] Memory ordering between kmalloc() and kfree()? it's confusing!
  2026-02-26 16:17   ` Harry Yoo
  2026-02-26 16:42     ` Alan Stern
@ 2026-02-26 17:59     ` Christoph Lameter (Ampere)
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-26 17:59 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Alan Stern, linux-mm, Dmitry Vyukov, lkmm, linux-arch,
	linux-kernel, Joel Fernandes, Daniel Lustig, Akira Yokosawa,
	Paul E. McKenney, Luc Maranget, Jade Alglave, David Howells,
	Nicholas Piggin, Boqun Feng, Peter Zijlstra, Will Deacon,
	Andrea Parri, Pedro Falcato, Vlastimil Babka, David Rientjes,
	Roman Gushchin, Hao Li, Shakeel Butt, Venkat Rao Bagalkote,
	Mateusz Guzik, Suren Baghdasaryan, Marco Elver

On Fri, 27 Feb 2026, Harry Yoo wrote:

> Ah, alloc/free slowpaths do use cmpxchg128 or spinlock and
> don't mess things up.
>
> But fastpath allocs/frees are served from percpu array that is protected
> by a local_lock. local_lock has a compiler barrier in it, but that's
> not enough.

Well, if objects are coming from different folios, then that is an issue.

The prior slub approach had no per-cpu linked lists and restricted
allocations to the objects of a single page that was only used by a
specific CPU. Locks were used when that page changed. There was no need
for further synchronization, since accesses were known to refer only to
a single page frame, and only one CPU was accessing it.




* Re: [BUG] Memory ordering between kmalloc() and kfree()? it's confusing!
  2026-02-26 17:11       ` Harry Yoo
@ 2026-02-26 18:06         ` Alan Stern
  0 siblings, 0 replies; 7+ messages in thread
From: Alan Stern @ 2026-02-26 18:06 UTC (permalink / raw)
  To: Harry Yoo
  Cc: linux-mm, Dmitry Vyukov, lkmm, linux-arch, linux-kernel,
	Joel Fernandes, Daniel Lustig, Akira Yokosawa, Paul E. McKenney,
	Luc Maranget, Jade Alglave, David Howells, Nicholas Piggin,
	Boqun Feng, Peter Zijlstra, Will Deacon, Andrea Parri,
	Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hao Li, Shakeel Butt,
	Venkat Rao Bagalkote, Mateusz Guzik, Suren Baghdasaryan,
	Marco Elver

On Fri, Feb 27, 2026 at 02:11:49AM +0900, Harry Yoo wrote:
> On Thu, Feb 26, 2026 at 11:42:02AM -0500, Alan Stern wrote:
> > On Fri, Feb 27, 2026 at 01:17:52AM +0900, Harry Yoo wrote:
> > > On Thu, Feb 26, 2026 at 10:45:55AM -0500, Alan Stern wrote:
> > > > On Thu, Feb 26, 2026 at 03:35:08PM +0900, Harry Yoo wrote:
> > > > > Because the slab allocator itself doesn't guarantee that such
> > > > > barriers are invoked within the allocator, it relies on users to
> > > > > do this when needed.
> > > > 
> > > > It doesn't?  Then how does the slab allocator guarantee that two 
> > > > different CPUs won't try to perform allocations or deallocations from 
> > > > the same slab at the same time, messing everything up?
> > > 
> > > Ah, alloc/free slowpaths do use cmpxchg128 or spinlock and
> > > don't mess things up.
> > > 
> > > But fastpath allocs/frees are served from percpu array that is protected
> > > by a local_lock. local_lock has a compiler barrier in it, but that's
> > > not enough.
> > 
> > If those things rely on a percpu array, how can one CPU possibly 
> > manipulate a resource (slab or something else) that was changed by a 
> > different CPU?
> 
> AFAICT that shouldn't happen within the slab allocator.
> 
> > The whole point of percpu data structures is that each 
> > CPU gets its own copy.
> 
> Exactly.
> 
> But I'm not talking about what happens within the allocator,
> but rather, about what slab expects to happen outside the allocator.

I understand.

> Something like this:
> 
> CPU X				CPU Y
> ptr = kmalloc();
> WRITE_ONCE(gp, ptr);
> 				if (p = READ_ONCE(gp))
> 					kfree(p);
> 
> Yes, it's a crazy thing to do. CPU Y isn't guaranteed to see the
> up-to-date version of the object content or metadata.
> 
> Instead, the code should do:
> 
> CPU X				CPU Y
> ptr = kmalloc();
> smp_store_release(&gp, ptr);
> 				if (p = smp_load_acquire(&gp))
> 					kfree(p);
> 
> One reason I started this discussion was to argue that we should have
> a well-defined contract between the slab allocator and its users.

Yes, you have made that quite clear.  But you're missing _my_ point.

Which is: The same mechanism that the slab allocator uses to ensure that 
CPU X and CPU Y won't step on each other's toes if they both run 
kmalloc/kfree at the same time should also be able to guarantee that the 
metadata changes made by CPU X will be visible to CPU Y if Y manipulates 
a slab that X just finished with.

To put it another way, ensuring non-interference during simultaneous 
accesses isn't all that different from ensuring coherence during 
sequential accesses.  Doing the first should easily allow doing the 
second.

And if it doesn't then something questionable is going on.

Alan Stern



end of thread, other threads:[~2026-02-26 18:22 UTC | newest]

Thread overview: 7+ messages
2026-02-26  6:35 [BUG] Memory ordering between kmalloc() and kfree()? it's confusing! Harry Yoo
2026-02-26 15:45 ` Alan Stern
2026-02-26 16:17   ` Harry Yoo
2026-02-26 16:42     ` Alan Stern
2026-02-26 17:11       ` Harry Yoo
2026-02-26 18:06         ` Alan Stern
2026-02-26 17:59     ` Christoph Lameter (Ampere)
