linux-mm.kvack.org archive mirror
* [BUG] Memory ordering between kmalloc() and kfree()? it's confusing!
@ 2026-02-26  6:35 Harry Yoo
From: Harry Yoo @ 2026-02-26  6:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dmitry Vyukov, lkmm, linux-arch, linux-kernel, Joel Fernandes,
	Daniel Lustig, Akira Yokosawa, Paul E. McKenney, Luc Maranget,
	Jade Alglave, David Howells, Nicholas Piggin, Boqun Feng,
	Peter Zijlstra, Will Deacon, Andrea Parri, Alan Stern,
	Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hao Li, Shakeel Butt,
	Venkat Rao Bagalkote, Mateusz Guzik, Suren Baghdasaryan,
	Marco Elver

Hello, SLAB, LKMM, and KCSAN folks!

I'd like to discuss the slab allocator's assumptions about its users
regarding memory ordering.

Recently, I've been investigating an interesting slab memory ordering
issue [3] [4] in v7.0-rc1, which made me think about memory ordering
for slab objects.

But without answering "What does slab expect users to do for correct
operation?", I kept getting puzzled, and my brain hurt too much :/
I'm writing things down to stop getting confused :)

Since I have never thought about this before, my reasoning could be
partially or entirely incorrect. If so, please kindly let me know.

# Slab's assumption: Stores to an object, its metadata, or its struct
# slab must be visible to the CPU that frees the object when it is
# passed to kfree(). It's the users' responsibility to guarantee that.

When the slab allocator allocates an object, it updates the object's
metadata and struct slab fields. After allocation, the user of slab
updates the object's content. As long as the object is freed on the same
CPU that allocated it, kfree() can see those stores (a CPU can always
see what's in its own store buffer), so no problem!

However, when, for example, the pointer to the object is stored in a
shared variable and the object is then freed on a different CPU, things
become trickier.

In this case, I think it's fair for the slab allocator to assume that:

  1) Such stores must involve _at least_ a release barrier
     (for example, via {cmp,}xchg{,_release}, or smp_store_release())
     to ensure preceding stores are visible to other CPUs before
     the pointer store becomes visible, and

  2) The CPU that frees an object must invoke at least an acquire
     barrier to ensure that stores to object content / metadata, etc.,
     are visible to the freeing CPU when it calls kfree().

Because the slab allocator itself doesn't guarantee that such
barriers are invoked within the allocator, it relies on users to
do this when needed.

... and that's quite a reasonable assumption, isn't it?

Actually, I'm not the first person to question this.

[1] https://lore.kernel.org/linux-mm/CACT4Y+Yfz3XvT+w6a3WjcZuATb1b9JdQHHf637zdT=6QZ-hjKg@mail.gmail.com
[2] https://lore.kernel.org/linux-mm/20140102203320.GA27615@linux.vnet.ibm.com

# Now, let's take a look at the bug I've been investigating

There were two bugs [3] [4] reported, with symptoms that appear to be
caused by slab returning wrong metadata (the symptoms: incorrect
reference counting of obj_cgroup, integer overflow as more memory is
uncharged than charged).

[3] https://lore.kernel.org/lkml/ca241daa-e7e7-4604-a48d-de91ec9184a5@linux.ibm.com
[4] https://lore.kernel.org/all/ddff7c7d-c0c3-4780-808f-9a83268bbf0c@linux.ibm.com

Hmm, if it's returning wrong metadata, how could that happen?

Well, perhaps either 1) the calculation of the metadata address is
incorrect, or 2) reading the metadata itself is racy.

Shakeel Butt pointed out [9] a potential memory ordering issue: with no
enforced ordering between slab->obj_exts and slab->stride, the metadata
address calculation can be incorrect.

[9] https://lore.kernel.org/lkml/aZu9G9mVIVzSm6Ft@hyeyoo

Let's say CPU X and Y are allocating/freeing slab objects from/to
the same slab. They need to access metadata for the objects:

CPU X				CPU Y

// CPU X allocates metadata array
- slab->obj_exts = <the address of the metadata array>
- slab->stride = 16 (sizeof struct slab)

- stride = plain load slab->stride
- obj_exts = READ_ONCE(slab->obj_exts)
- if (obj_exts)
    - metadata_addr =
      stride * index + obj_exts
				- stride = plain load slab->stride
				- obj_exts = READ_ONCE(slab->obj_exts)
				- if (obj_exts)
				  - metadata_addr = stride * index +
						    obj_exts

				// Wait, obj_exts is non-NULL,
				// but slab->stride is stale!

				// Now, metadata_addr is wrong.

Hmm, this could definitely happen when two CPUs try to allocate/free
objects from/to the same slab. We need to make sure that CPUs cannot
see a stale slab->stride as long as slab->obj_exts is not NULL.

# How I tried to fix it

An expensive solution would be to do:

CPU X:					CPU Y:
- slab->stride = 16			- obj_exts = READ_ONCE(slab->obj_exts)
- smp_wmb()				- if (obj_exts)
- slab->obj_exts = <something>		  - smp_rmb()
					  - stride = plain load slab->stride

Then, CPU Y should see either (obj_exts == 0), or (obj_exts != 0 and
a valid stride); (obj_exts != 0) && (stale stride) is impossible.

This fix [5] seems to resolve the bug [6], yay!

Before testing this fix, I wasn't fully convinced that it was a memory
ordering issue, but after testing it, it seems reasonable to conclude
that it is indeed one.

[5] https://lore.kernel.org/linux-mm/aZ2Gwie5dpXotxWc@hyeyoo
[6] https://lore.kernel.org/linux-mm/84492f08-04c2-485c-9a18-cdafd5a9c3e5@linux.ibm.com 

# How I tried to optimize the fix

Hmm, but it's not great to have memory barriers in the slab alloc/free
paths, right? So I tried to optimize it while maintaining correctness.

Previously, slab->stride could be set during slab's initialization
by alloc_slab_obj_exts_early(), or later by alloc_slab_obj_exts().

Within the slab allocator, for a slab to become accessible to other
CPUs, it needs to go through the per-node partial slab list
(struct kmem_cache_node.partial), which is protected by a spinlock.

Hmm, if we make sure that slab->stride is set early, before the slab
becomes accessible to other CPUs, the smp_wmb()/smp_rmb() pair is not
necessary. So I made that change [7].

But something strange happened when I tried to optimize it!
The optimized fix didn't resolve the bug [8].

[7] https://lore.kernel.org/linux-mm/20260223075809.19265-1-harry.yoo@oracle.com 
[8] https://lore.kernel.org/linux-mm/2d106583-4ec6-4da0-87ea-4ecad893b24f@linux.ibm.com

Hmm... even when slabs don't go through the list protected by the
spinlock, perhaps an object was allocated on CPU X, and then freed
on CPU Y?

But as long as the "Slab's assumption" described above is satisfied,
I can't explain why stores to slab->stride or the metadata wouldn't be
visible to the freeing CPU :/

That makes me wonder "is somebody breaking that assumption?"

If so, the smp_rmb() in the previous fix [5] might have unintentionally
acted as a band-aid that makes stores to slab->stride and the metadata
visible to the freeing CPU, when in fact a barrier should have been
invoked by the user.

Looking at commit 9e6b7cd7e77d ("tty: fix data race in
tty_buffer_flush") and commit f57e515a1b56 ("kernel/pid.c: convert
struct pid count to refcount_t"), it doesn't seem too crazy to suspect
that somebody is breaking the assumption.

Does this sound reasonable, or am I missing something?

p.s., Many thanks to Pedro Falcato and Vlastimil Babka, who actively
discussed this off-list with me. That helped develop my understanding
a lot!

-- 
Cheers,
Harry / Hyeonggon

