* [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock()
@ 2026-02-06 17:13 Harry Yoo
2026-02-06 17:13 ` [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin Harry Yoo
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Harry Yoo @ 2026-02-06 17:13 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Alexei Starovoitov, Hao Li, linux-mm
Hi, I've observed two lockdep warnings while testing
kmalloc_nolock() in NMI:
1. Accessing current->mems_allowed_seq seqlock in NMI isn't safe
and lockdep complains.
2. w/ CONFIG_SLAB_FREELIST_RANDOM, get_random_u32() acquires
a local_lock, which isn't safe in NMI and could cause a deadlock.
Let's fix them.
Harry Yoo (2):
mm/slab: skip get_from_any_partial() if !allow_spin
mm/slab: use prandom if !allow_spin
mm/slub.c | 36 ++++++++++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)
--
2.43.0
* [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin
2026-02-06 17:13 [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock() Harry Yoo
@ 2026-02-06 17:13 ` Harry Yoo
2026-02-06 18:10 ` Vlastimil Babka
2026-02-06 17:13 ` [PATCH 2/2] mm/slab: use prandom " Harry Yoo
2026-02-06 17:37 ` [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock() Harry Yoo
2 siblings, 1 reply; 12+ messages in thread
From: Harry Yoo @ 2026-02-06 17:13 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Alexei Starovoitov, Hao Li, linux-mm, stable
Lockdep complains when get_from_any_partial() is called in an NMI
context, because current->mems_allowed_seq is seqcount_spinlock_t and
not NMI-safe:
================================
WARNING: inconsistent lock state
6.19.0-rc5-kfree-rcu+ #315 Tainted: G N
--------------------------------
inconsistent {INITIAL USE} -> {IN-NMI} usage.
kunit_try_catch/9989 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffff889085799820 (&____s->seqcount#3){.-.-}-{0:0}, at: ___slab_alloc+0x58f/0xc00
{INITIAL USE} state was registered at:
lock_acquire+0x185/0x320
kernel_init_freeable+0x391/0x1150
kernel_init+0x1f/0x220
ret_from_fork+0x736/0x8f0
ret_from_fork_asm+0x1a/0x30
irq event stamp: 56
hardirqs last enabled at (55): [<ffffffff850a68d7>] _raw_spin_unlock_irq+0x27/0x70
hardirqs last disabled at (56): [<ffffffff850858ca>] __schedule+0x2a8a/0x6630
softirqs last enabled at (0): [<ffffffff81536711>] copy_process+0x1dc1/0x6a10
softirqs last disabled at (0): [<0000000000000000>] 0x0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&____s->seqcount#3);
<Interrupt>
lock(&____s->seqcount#3);
*** DEADLOCK ***
According to Documentation/locking/seqlock.rst, seqcount_t is not
NMI-safe and seqcount_latch_t should be used when the read path can interrupt
the write-side critical section. In this case, return NULL and fall back
to slab allocation if !allow_spin.
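For reference, the read side is roughly the following (paraphrased from
include/linux/cpuset.h, not verbatim):

	static inline unsigned int read_mems_allowed_begin(void)
	{
		if (!static_branch_unlikely(&cpusets_pre_enable_key))
			return 0;
		return read_seqcount_begin(&current->mems_allowed_seq);
	}

read_seqcount_begin() spins while the sequence count is odd, so an NMI
that interrupts the write-side critical section on the same CPU would
spin forever.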
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
mm/slub.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/slub.c b/mm/slub.c
index 102fb47ae013..d46464654c15 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3789,6 +3789,14 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
enum zone_type highest_zoneidx = gfp_zone(pc->flags);
unsigned int cpuset_mems_cookie;
+ /*
+ * read_mems_allowed_begin() accesses current->mems_allowed_seq,
+ * a seqcount_spinlock_t that is not NMI-safe. Skip allocation
+ * when GFP flags indicate spinning is not allowed.
+ */
+ if (!gfpflags_allow_spinning(pc->flags))
+ return NULL;
+
/*
* The defrag ratio allows a configuration of the tradeoffs between
* inter node defragmentation and node local allocations. A lower
--
2.43.0
* [PATCH 2/2] mm/slab: use prandom if !allow_spin
2026-02-06 17:13 [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock() Harry Yoo
2026-02-06 17:13 ` [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin Harry Yoo
@ 2026-02-06 17:13 ` Harry Yoo
2026-02-06 18:27 ` Vlastimil Babka
2026-02-06 17:37 ` [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock() Harry Yoo
2 siblings, 1 reply; 12+ messages in thread
From: Harry Yoo @ 2026-02-06 17:13 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Alexei Starovoitov, Hao Li, linux-mm
When CONFIG_SLAB_FREELIST_RANDOM is enabled and get_random_u32()
is called in an NMI context, lockdep complains because it acquires
a local_lock:
================================
WARNING: inconsistent lock state
6.19.0-rc5-slab-for-next+ #325 Tainted: G N
--------------------------------
inconsistent {INITIAL USE} -> {IN-NMI} usage.
kunit_try_catch/8312 [HC2[2]:SC0[0]:HE0:SE1] takes:
ffff88a02ec49cc0 (batched_entropy_u32.lock){-.-.}-{3:3}, at: get_random_u32+0x7f/0x2e0
{INITIAL USE} state was registered at:
lock_acquire+0xd9/0x2f0
get_random_u32+0x93/0x2e0
__get_random_u32_below+0x17/0x70
cache_random_seq_create+0x121/0x1c0
init_cache_random_seq+0x5d/0x110
do_kmem_cache_create+0x1e0/0xa30
__kmem_cache_create_args+0x4ec/0x830
create_kmalloc_caches+0xe6/0x130
kmem_cache_init+0x1b1/0x660
mm_core_init+0x1d8/0x4b0
start_kernel+0x620/0xcd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xf3/0x140
common_startup_64+0x13e/0x148
irq event stamp: 76
hardirqs last enabled at (75): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240
hardirqs last disabled at (76): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110
softirqs last enabled at (0): [<ffffffff813b2dda>] copy_process+0xc7a/0x2350
softirqs last disabled at (0): [<0000000000000000>] 0x0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(batched_entropy_u32.lock);
<Interrupt>
lock(batched_entropy_u32.lock);
*** DEADLOCK ***
Fix this by using a pseudo-random number generator if !allow_spin.
This means kmalloc_nolock() users won't get truly random numbers,
but there is not much we can do about it.
Note that an NMI handler might interrupt prandom_u32_state() and
change the random state, but that's safe.
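For reference, prandom_u32_state() is roughly the following (paraphrased
from lib/random32.c, tausworthe parameters elided):

	u32 prandom_u32_state(struct rnd_state *state)
	{
		state->s1 = TAUSWORTHE(state->s1, ...);
		state->s2 = TAUSWORTHE(state->s2, ...);
		state->s3 = TAUSWORTHE(state->s3, ...);
		state->s4 = TAUSWORTHE(state->s4, ...);

		return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
	}

It only reads and writes the four state words, so an interleaved update
from an NMI merely perturbs the sequence and can never block.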
Link: https://lore.kernel.org/all/0c33bdee-6de8-4d9f-92ca-4f72c1b6fb9f@suse.cz
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
mm/slub.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index d46464654c15..4d76af84f018 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -43,6 +43,7 @@
#include <linux/prefetch.h>
#include <linux/memcontrol.h>
#include <linux/random.h>
+#include <linux/prandom.h>
#include <kunit/test.h>
#include <kunit/test-bug.h>
#include <linux/sort.h>
@@ -3308,8 +3309,11 @@ static void *next_freelist_entry(struct kmem_cache *s,
return (char *)start + idx;
}
+static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
+
/* Shuffle the single linked freelist based on a random pre-computed sequence */
-static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
+static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
+ bool allow_spin)
{
void *start;
void *cur;
@@ -3320,7 +3324,19 @@ static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
return false;
freelist_count = oo_objects(s->oo);
- pos = get_random_u32_below(freelist_count);
+ if (allow_spin) {
+ pos = get_random_u32_below(freelist_count);
+ } else {
+ struct rnd_state *state;
+
+ /*
+ * kmalloc_nolock() called in an NMI context might interrupt us
+ * and change the state in the middle.
+ */
+ state = &get_cpu_var(slab_rnd_state);
+ pos = prandom_u32_state(state) % freelist_count;
+ put_cpu_var(slab_rnd_state);
+ }
page_limit = slab->objects * s->size;
start = fixup_red_left(s, slab_address(slab));
@@ -3347,7 +3363,8 @@ static inline int init_cache_random_seq(struct kmem_cache *s)
return 0;
}
static inline void init_freelist_randomization(void) { }
-static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
+static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
+ bool allow_spin)
{
return false;
}
@@ -3438,7 +3455,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
alloc_slab_obj_exts_early(s, slab);
account_slab(slab, oo_order(oo), s, flags);
- shuffle = shuffle_freelist(s, slab);
+ shuffle = shuffle_freelist(s, slab, allow_spin);
if (!shuffle) {
start = fixup_red_left(s, start);
@@ -8337,6 +8354,9 @@ void __init kmem_cache_init_late(void)
{
flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM, 0);
WARN_ON(!flushwq);
+#ifdef CONFIG_SLAB_FREELIST_RANDOM
+ prandom_init_once(&slab_rnd_state);
+#endif
}
int do_kmem_cache_create(struct kmem_cache *s, const char *name,
--
2.43.0
* Re: [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock()
2026-02-06 17:13 [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock() Harry Yoo
2026-02-06 17:13 ` [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin Harry Yoo
2026-02-06 17:13 ` [PATCH 2/2] mm/slab: use prandom " Harry Yoo
@ 2026-02-06 17:37 ` Harry Yoo
2026-02-09 19:03 ` Vlastimil Babka
2 siblings, 1 reply; 12+ messages in thread
From: Harry Yoo @ 2026-02-06 17:37 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton
Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
Alexei Starovoitov, Hao Li, linux-mm
On Sat, Feb 07, 2026 at 02:13:46AM +0900, Harry Yoo wrote:
> Hi, I've observed two lockdep warnings while testing
> kmalloc_nolock() in NMI:
>
> 1. Accessing current->mems_allowed_seq seqlock in NMI isn't safe
> and lockdep complains.
>
> 2. w/ CONFIG_SLAB_FREELIST_RANDOM, get_random_u32() acquires
> a local_lock, which isn't safe in NMI and could cause a deadlock.
>
> Let's fix them.
I think we should probably add some sort of
kmalloc_nolock()/kfree_nolock() test cases in lib/tests/slub_kunit.c.
These haven't been discovered by bots because (I guess) it is very
unlikely for bots to somehow trigger those APIs in NMI.
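Something like this could be a starting point (untested sketch, task
context only; actually exercising the NMI path would need e.g. an
irq_work or perf NMI hook):

	static void test_kmalloc_nolock(struct kunit *test)
	{
		void *p = kmalloc_nolock(32, __GFP_ZERO, NUMA_NO_NODE);

		/* kmalloc_nolock() is allowed to fail opportunistically */
		if (p)
			kfree_nolock(p);
	}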
Also, I forgot to mention that this is based on slab/for-next:
commit bc33906024eb5955294e28128c3d0f492d2ded5e
Merge: ec15c383fcda 40fd0acc45d0
Author: Vlastimil Babka <vbabka@suse.cz>
Date: Thu Jan 29 10:10:50 2026 +0100
Merge branch 'slab/for-7.0/sheaves' into slab/for-next
> Harry Yoo (2):
> mm/slab: skip get_from_any_partial() if !allow_spin
> mm/slab: use prandom if !allow_spin
>
> mm/slub.c | 36 ++++++++++++++++++++++++++++++++----
> 1 file changed, 32 insertions(+), 4 deletions(-)
>
> --
> 2.43.0
>
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin
2026-02-06 17:13 ` [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin Harry Yoo
@ 2026-02-06 18:10 ` Vlastimil Babka
2026-02-06 19:19 ` Alexei Starovoitov
0 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2026-02-06 18:10 UTC (permalink / raw)
To: Harry Yoo, Andrew Morton
Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
Alexei Starovoitov, Hao Li, linux-mm, stable
On 2/6/26 18:13, Harry Yoo wrote:
> Lockdep complains when get_from_any_partial() is called in an NMI
> context, because current->mems_allowed_seq is seqcount_spinlock_t and
> not NMI-safe:
>
> ================================
> WARNING: inconsistent lock state
> 6.19.0-rc5-kfree-rcu+ #315 Tainted: G N
> --------------------------------
> inconsistent {INITIAL USE} -> {IN-NMI} usage.
> kunit_try_catch/9989 [HC1[1]:SC0[0]:HE0:SE1] takes:
> ffff889085799820 (&____s->seqcount#3){.-.-}-{0:0}, at: ___slab_alloc+0x58f/0xc00
> {INITIAL USE} state was registered at:
> lock_acquire+0x185/0x320
> kernel_init_freeable+0x391/0x1150
> kernel_init+0x1f/0x220
> ret_from_fork+0x736/0x8f0
> ret_from_fork_asm+0x1a/0x30
> irq event stamp: 56
> hardirqs last enabled at (55): [<ffffffff850a68d7>] _raw_spin_unlock_irq+0x27/0x70
> hardirqs last disabled at (56): [<ffffffff850858ca>] __schedule+0x2a8a/0x6630
> softirqs last enabled at (0): [<ffffffff81536711>] copy_process+0x1dc1/0x6a10
> softirqs last disabled at (0): [<0000000000000000>] 0x0
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&____s->seqcount#3);
> <Interrupt>
> lock(&____s->seqcount#3);
>
> *** DEADLOCK ***
>
> According to Documentation/locking/seqlock.rst, seqcount_t is not
> NMI-safe and seqcount_latch_t should be used when the read path can interrupt
> the write-side critical section. In this case, return NULL and fall back
> to slab allocation if !allow_spin.
>
> Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
> Cc: stable@vger.kernel.org
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
> mm/slub.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 102fb47ae013..d46464654c15 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3789,6 +3789,14 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
> enum zone_type highest_zoneidx = gfp_zone(pc->flags);
> unsigned int cpuset_mems_cookie;
>
> + /*
> + * read_mems_allowed_begin() accesses current->mems_allowed_seq,
> + * a seqcount_spinlock_t that is not NMI-safe. Skip allocation
> + * when GFP flags indicate spinning is not allowed.
> + */
> + if (!gfpflags_allow_spinning(pc->flags))
> + return NULL;
I think it would be less restrictive to just continue, but skip the
read_mems_allowed_retry() part in the do-while loop, so just make it one
iteration for !allow_spin. If lockdep doesn't like even the
read_mems_allowed_begin() (not clear to me), skip it too?
> +
> /*
> * The defrag ratio allows a configuration of the tradeoffs between
> * inter node defragmentation and node local allocations. A lower
* Re: [PATCH 2/2] mm/slab: use prandom if !allow_spin
2026-02-06 17:13 ` [PATCH 2/2] mm/slab: use prandom " Harry Yoo
@ 2026-02-06 18:27 ` Vlastimil Babka
2026-02-06 19:22 ` Alexei Starovoitov
0 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2026-02-06 18:27 UTC (permalink / raw)
To: Harry Yoo, Andrew Morton
Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
Alexei Starovoitov, Hao Li, linux-mm
On 2/6/26 18:13, Harry Yoo wrote:
> When CONFIG_SLAB_FREELIST_RANDOM is enabled and get_random_u32()
> is called in an NMI context, lockdep complains because it acquires
> a local_lock:
>
> ================================
> WARNING: inconsistent lock state
> 6.19.0-rc5-slab-for-next+ #325 Tainted: G N
> --------------------------------
> inconsistent {INITIAL USE} -> {IN-NMI} usage.
> kunit_try_catch/8312 [HC2[2]:SC0[0]:HE0:SE1] takes:
> ffff88a02ec49cc0 (batched_entropy_u32.lock){-.-.}-{3:3}, at: get_random_u32+0x7f/0x2e0
> {INITIAL USE} state was registered at:
> lock_acquire+0xd9/0x2f0
> get_random_u32+0x93/0x2e0
> __get_random_u32_below+0x17/0x70
> cache_random_seq_create+0x121/0x1c0
> init_cache_random_seq+0x5d/0x110
> do_kmem_cache_create+0x1e0/0xa30
> __kmem_cache_create_args+0x4ec/0x830
> create_kmalloc_caches+0xe6/0x130
> kmem_cache_init+0x1b1/0x660
> mm_core_init+0x1d8/0x4b0
> start_kernel+0x620/0xcd0
> x86_64_start_reservations+0x18/0x30
> x86_64_start_kernel+0xf3/0x140
> common_startup_64+0x13e/0x148
> irq event stamp: 76
> hardirqs last enabled at (75): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240
> hardirqs last disabled at (76): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110
> softirqs last enabled at (0): [<ffffffff813b2dda>] copy_process+0xc7a/0x2350
> softirqs last disabled at (0): [<0000000000000000>] 0x0
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(batched_entropy_u32.lock);
> <Interrupt>
> lock(batched_entropy_u32.lock);
>
> *** DEADLOCK ***
>
> Fix this by using a pseudo-random number generator if !allow_spin.
> This means kmalloc_nolock() users won't get truly random numbers,
> but there is not much we can do about it.
>
> Note that an NMI handler might interrupt prandom_u32_state() and
> change the random state, but that's safe.
>
> Link: https://lore.kernel.org/all/0c33bdee-6de8-4d9f-92ca-4f72c1b6fb9f@suse.cz
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
> mm/slub.c | 28 ++++++++++++++++++++++++----
> 1 file changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index d46464654c15..4d76af84f018 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -43,6 +43,7 @@
> #include <linux/prefetch.h>
> #include <linux/memcontrol.h>
> #include <linux/random.h>
> +#include <linux/prandom.h>
> #include <kunit/test.h>
> #include <kunit/test-bug.h>
> #include <linux/sort.h>
> @@ -3308,8 +3309,11 @@ static void *next_freelist_entry(struct kmem_cache *s,
> return (char *)start + idx;
> }
>
> +static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
> +
> /* Shuffle the single linked freelist based on a random pre-computed sequence */
> -static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
> +static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
> + bool allow_spin)
> {
> void *start;
> void *cur;
> @@ -3320,7 +3324,19 @@ static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
> return false;
>
> freelist_count = oo_objects(s->oo);
> - pos = get_random_u32_below(freelist_count);
> + if (allow_spin) {
> + pos = get_random_u32_below(freelist_count);
> + } else {
> + struct rnd_state *state;
> +
> + /*
> + * kmalloc_nolock() called in an NMI context might interrupt us
> + * and change the state in the middle.
> + */
> + state = &get_cpu_var(slab_rnd_state);
> + pos = prandom_u32_state(state) % freelist_count;
> + put_cpu_var(slab_rnd_state);
I don't think this prevents the changing in the middle? We just stored the
pointer in a local variable state, but the prandom call will still access
the percpu variable through that?
So we might need to disable irq here, and have another percpu state that's
used when in_nmi()?
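Something like this, perhaps (untested sketch, slab_rnd_state_nmi being
a new percpu variable):

	static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state_nmi);
	...
	unsigned long flags;

	local_irq_save(flags);
	state = in_nmi() ? this_cpu_ptr(&slab_rnd_state_nmi)
			 : this_cpu_ptr(&slab_rnd_state);
	pos = prandom_u32_state(state) % freelist_count;
	local_irq_restore(flags);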
> + }
>
> page_limit = slab->objects * s->size;
> start = fixup_red_left(s, slab_address(slab));
> @@ -3347,7 +3363,8 @@ static inline int init_cache_random_seq(struct kmem_cache *s)
> return 0;
> }
> static inline void init_freelist_randomization(void) { }
> -static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
> +static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
> + bool allow_spin)
> {
> return false;
> }
> @@ -3438,7 +3455,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> alloc_slab_obj_exts_early(s, slab);
> account_slab(slab, oo_order(oo), s, flags);
>
> - shuffle = shuffle_freelist(s, slab);
> + shuffle = shuffle_freelist(s, slab, allow_spin);
>
> if (!shuffle) {
> start = fixup_red_left(s, start);
> @@ -8337,6 +8354,9 @@ void __init kmem_cache_init_late(void)
> {
> flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM, 0);
> WARN_ON(!flushwq);
> +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> + prandom_init_once(&slab_rnd_state);
> +#endif
> }
>
> int do_kmem_cache_create(struct kmem_cache *s, const char *name,
* Re: [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin
2026-02-06 18:10 ` Vlastimil Babka
@ 2026-02-06 19:19 ` Alexei Starovoitov
2026-02-09 3:18 ` Harry Yoo
0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2026-02-06 19:19 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes,
Roman Gushchin, Alexei Starovoitov, Hao Li, linux-mm, stable
On Fri, Feb 6, 2026 at 10:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/6/26 18:13, Harry Yoo wrote:
> > Lockdep complains when get_from_any_partial() is called in an NMI
> > context, because current->mems_allowed_seq is seqcount_spinlock_t and
> > not NMI-safe:
> >
> > ================================
> > WARNING: inconsistent lock state
> > 6.19.0-rc5-kfree-rcu+ #315 Tainted: G N
> > --------------------------------
> > inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > kunit_try_catch/9989 [HC1[1]:SC0[0]:HE0:SE1] takes:
> > ffff889085799820 (&____s->seqcount#3){.-.-}-{0:0}, at: ___slab_alloc+0x58f/0xc00
> > {INITIAL USE} state was registered at:
> > lock_acquire+0x185/0x320
> > kernel_init_freeable+0x391/0x1150
> > kernel_init+0x1f/0x220
> > ret_from_fork+0x736/0x8f0
> > ret_from_fork_asm+0x1a/0x30
> > irq event stamp: 56
> > hardirqs last enabled at (55): [<ffffffff850a68d7>] _raw_spin_unlock_irq+0x27/0x70
> > hardirqs last disabled at (56): [<ffffffff850858ca>] __schedule+0x2a8a/0x6630
> > softirqs last enabled at (0): [<ffffffff81536711>] copy_process+0x1dc1/0x6a10
> > softirqs last disabled at (0): [<0000000000000000>] 0x0
> >
> > other info that might help us debug this:
> > Possible unsafe locking scenario:
> >
> > CPU0
> > ----
> > lock(&____s->seqcount#3);
> > <Interrupt>
> > lock(&____s->seqcount#3);
> >
> > *** DEADLOCK ***
> >
> > According to Documentation/locking/seqlock.rst, seqcount_t is not
> > NMI-safe and seqcount_latch_t should be used when the read path can interrupt
> > the write-side critical section. In this case, return NULL and fall back
> > to slab allocation if !allow_spin.
> >
> > Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> > mm/slub.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 102fb47ae013..d46464654c15 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3789,6 +3789,14 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
> > enum zone_type highest_zoneidx = gfp_zone(pc->flags);
> > unsigned int cpuset_mems_cookie;
> >
> > + /*
> > + * read_mems_allowed_begin() accesses current->mems_allowed_seq,
> > + * a seqcount_spinlock_t that is not NMI-safe. Skip allocation
> > + * when GFP flags indicate spinning is not allowed.
> > + */
> > + if (!gfpflags_allow_spinning(pc->flags))
> > + return NULL;
>
> I think it would be less restrictive to just continue, but skip the
> read_mems_allowed_retry() part in the do-while loop, so just make it one
> iteration for !allow_spin. If lockdep doesn't like even the
> read_mems_allowed_begin() (not clear to me), skip it too?
+1
Just an unconditional return NULL seems too restrictive.
* Re: [PATCH 2/2] mm/slab: use prandom if !allow_spin
2026-02-06 18:27 ` Vlastimil Babka
@ 2026-02-06 19:22 ` Alexei Starovoitov
2026-02-07 1:25 ` Harry Yoo
0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2026-02-06 19:22 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes,
Roman Gushchin, Alexei Starovoitov, Hao Li, linux-mm
On Fri, Feb 6, 2026 at 10:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/6/26 18:13, Harry Yoo wrote:
> > When CONFIG_SLAB_FREELIST_RANDOM is enabled and get_random_u32()
> > is called in an NMI context, lockdep complains because it acquires
> > a local_lock:
> >
> > ================================
> > WARNING: inconsistent lock state
> > 6.19.0-rc5-slab-for-next+ #325 Tainted: G N
> > --------------------------------
> > inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > kunit_try_catch/8312 [HC2[2]:SC0[0]:HE0:SE1] takes:
> > ffff88a02ec49cc0 (batched_entropy_u32.lock){-.-.}-{3:3}, at: get_random_u32+0x7f/0x2e0
> > {INITIAL USE} state was registered at:
> > lock_acquire+0xd9/0x2f0
> > get_random_u32+0x93/0x2e0
> > __get_random_u32_below+0x17/0x70
> > cache_random_seq_create+0x121/0x1c0
> > init_cache_random_seq+0x5d/0x110
> > do_kmem_cache_create+0x1e0/0xa30
> > __kmem_cache_create_args+0x4ec/0x830
> > create_kmalloc_caches+0xe6/0x130
> > kmem_cache_init+0x1b1/0x660
> > mm_core_init+0x1d8/0x4b0
> > start_kernel+0x620/0xcd0
> > x86_64_start_reservations+0x18/0x30
> > x86_64_start_kernel+0xf3/0x140
> > common_startup_64+0x13e/0x148
> > irq event stamp: 76
> > hardirqs last enabled at (75): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240
> > hardirqs last disabled at (76): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110
> > softirqs last enabled at (0): [<ffffffff813b2dda>] copy_process+0xc7a/0x2350
> > softirqs last disabled at (0): [<0000000000000000>] 0x0
> >
> > other info that might help us debug this:
> > Possible unsafe locking scenario:
> >
> > CPU0
> > ----
> > lock(batched_entropy_u32.lock);
> > <Interrupt>
> > lock(batched_entropy_u32.lock);
> >
> > *** DEADLOCK ***
> >
> > Fix this by using a pseudo-random number generator if !allow_spin.
> > This means kmalloc_nolock() users won't get truly random numbers,
> > but there is not much we can do about it.
> >
> > Note that an NMI handler might interrupt prandom_u32_state() and
> > change the random state, but that's safe.
> >
> > Link: https://lore.kernel.org/all/0c33bdee-6de8-4d9f-92ca-4f72c1b6fb9f@suse.cz
> > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> > mm/slub.c | 28 ++++++++++++++++++++++++----
> > 1 file changed, 24 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index d46464654c15..4d76af84f018 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -43,6 +43,7 @@
> > #include <linux/prefetch.h>
> > #include <linux/memcontrol.h>
> > #include <linux/random.h>
> > +#include <linux/prandom.h>
> > #include <kunit/test.h>
> > #include <kunit/test-bug.h>
> > #include <linux/sort.h>
> > @@ -3308,8 +3309,11 @@ static void *next_freelist_entry(struct kmem_cache *s,
> > return (char *)start + idx;
> > }
> >
> > +static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
> > +
> > /* Shuffle the single linked freelist based on a random pre-computed sequence */
> > -static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
> > +static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
> > + bool allow_spin)
> > {
> > void *start;
> > void *cur;
> > @@ -3320,7 +3324,19 @@ static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
> > return false;
> >
> > freelist_count = oo_objects(s->oo);
> > - pos = get_random_u32_below(freelist_count);
> > + if (allow_spin) {
> > + pos = get_random_u32_below(freelist_count);
> > + } else {
> > + struct rnd_state *state;
> > +
> > + /*
> > + * kmalloc_nolock() called in an NMI context might interrupt us
> > + * and change the state in the middle.
> > + */
> > + state = &get_cpu_var(slab_rnd_state);
> > + pos = prandom_u32_state(state) % freelist_count;
> > + put_cpu_var(slab_rnd_state);
>
> I don't think this prevents the changing in the middle? We just stored the
> pointer in a local variable state, but the prandom call will still access
> the percpu variable through that?
>
> So we might need to disable irq here, and have another percpu state that's
> used when in_nmi()?
imo this is all overkill.
Just prandom_u32_state() without any protection is fine.
Even if it reenters there is no harm. Just more randomness.
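I.e. just something like:

	pos = prandom_u32_state(raw_cpu_ptr(&slab_rnd_state)) % freelist_count;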
* Re: [PATCH 2/2] mm/slab: use prandom if !allow_spin
2026-02-06 19:22 ` Alexei Starovoitov
@ 2026-02-07 1:25 ` Harry Yoo
0 siblings, 0 replies; 12+ messages in thread
From: Harry Yoo @ 2026-02-07 1:25 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter,
David Rientjes, Roman Gushchin, Alexei Starovoitov, Hao Li,
linux-mm
On Fri, Feb 06, 2026 at 11:22:27AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 6, 2026 at 10:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 2/6/26 18:13, Harry Yoo wrote:
> > > When CONFIG_SLAB_FREELIST_RANDOM is enabled and get_random_u32()
> > > is called in an NMI context, lockdep complains because it acquires
> > > a local_lock:
> > >
> > > ================================
> > > WARNING: inconsistent lock state
> > > 6.19.0-rc5-slab-for-next+ #325 Tainted: G N
> > > --------------------------------
> > > inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > > kunit_try_catch/8312 [HC2[2]:SC0[0]:HE0:SE1] takes:
> > > ffff88a02ec49cc0 (batched_entropy_u32.lock){-.-.}-{3:3}, at: get_random_u32+0x7f/0x2e0
> > > {INITIAL USE} state was registered at:
> > > lock_acquire+0xd9/0x2f0
> > > get_random_u32+0x93/0x2e0
> > > __get_random_u32_below+0x17/0x70
> > > cache_random_seq_create+0x121/0x1c0
> > > init_cache_random_seq+0x5d/0x110
> > > do_kmem_cache_create+0x1e0/0xa30
> > > __kmem_cache_create_args+0x4ec/0x830
> > > create_kmalloc_caches+0xe6/0x130
> > > kmem_cache_init+0x1b1/0x660
> > > mm_core_init+0x1d8/0x4b0
> > > start_kernel+0x620/0xcd0
> > > x86_64_start_reservations+0x18/0x30
> > > x86_64_start_kernel+0xf3/0x140
> > > common_startup_64+0x13e/0x148
> > > irq event stamp: 76
> > > hardirqs last enabled at (75): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240
> > > hardirqs last disabled at (76): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110
> > > softirqs last enabled at (0): [<ffffffff813b2dda>] copy_process+0xc7a/0x2350
> > > softirqs last disabled at (0): [<0000000000000000>] 0x0
> > >
> > > other info that might help us debug this:
> > > Possible unsafe locking scenario:
> > >
> > > CPU0
> > > ----
> > > lock(batched_entropy_u32.lock);
> > > <Interrupt>
> > > lock(batched_entropy_u32.lock);
> > >
> > > *** DEADLOCK ***
> > >
> > > Fix this by using a pseudo-random number generator if !allow_spin.
> > > This means kmalloc_nolock() users won't get truly random numbers,
> > > but there is not much we can do about it.
> > >
> > > Note that an NMI handler might interrupt prandom_u32_state() and
> > > change the random state, but that's safe.
> > >
> > > Link: https://lore.kernel.org/all/0c33bdee-6de8-4d9f-92ca-4f72c1b6fb9f@suse.cz
> > > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > > ---
> > > mm/slub.c | 28 ++++++++++++++++++++++++----
> > > 1 file changed, 24 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index d46464654c15..4d76af84f018 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -43,6 +43,7 @@
> > > #include <linux/prefetch.h>
> > > #include <linux/memcontrol.h>
> > > #include <linux/random.h>
> > > +#include <linux/prandom.h>
> > > #include <kunit/test.h>
> > > #include <kunit/test-bug.h>
> > > #include <linux/sort.h>
> > > @@ -3308,8 +3309,11 @@ static void *next_freelist_entry(struct kmem_cache *s,
> > > return (char *)start + idx;
> > > }
> > >
> > > +static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
> > > +
> > > /* Shuffle the single linked freelist based on a random pre-computed sequence */
> > > -static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
> > > +static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
> > > + bool allow_spin)
> > > {
> > > void *start;
> > > void *cur;
> > > @@ -3320,7 +3324,19 @@ static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
> > > return false;
> > >
> > > freelist_count = oo_objects(s->oo);
> > > - pos = get_random_u32_below(freelist_count);
> > > + if (allow_spin) {
> > > + pos = get_random_u32_below(freelist_count);
> > > + } else {
> > > + struct rnd_state *state;
> > > +
> > > + /*
> > > + * kmalloc_nolock() called in an NMI context might interrupt us
> > > + * and change the state in the middle.
> > > + */
> > > + state = &get_cpu_var(slab_rnd_state);
> > > + pos = prandom_u32_state(state) % freelist_count;
> > > + put_cpu_var(slab_rnd_state);
> >
> > I don't think this prevents the changing in the middle? We just stored the
> > pointer in a local variable state, but the prandom call will still access
> > the percpu variable through that?
> >
> > So we might need to disable irq here, and have another percpu state that's
> > used when in_nmi()?
Oh, my intention was not to prevent state changes in the middle.
I was thinking "Hmm, if we can't disable NMI, do we even need to disable
IRQ? Just add some comment saying it might be interrupted
in the middle".
I was even thinking of using raw_cpu_ptr() instead without disabling
preemption through get/put_cpu_var()...
> imo this is all overkill.
> Just prandom_u32_state() without any protection is fine.
> Even if it reenters there is no harm. Just more randomness.
Yeah.
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin
2026-02-06 19:19 ` Alexei Starovoitov
@ 2026-02-09 3:18 ` Harry Yoo
2026-02-09 19:03 ` Vlastimil Babka
0 siblings, 1 reply; 12+ messages in thread
From: Harry Yoo @ 2026-02-09 3:18 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter,
David Rientjes, Roman Gushchin, Alexei Starovoitov, Hao Li,
linux-mm, stable
On Fri, Feb 06, 2026 at 11:19:01AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 6, 2026 at 10:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 2/6/26 18:13, Harry Yoo wrote:
> > > Lockdep complains when get_from_any_partial() is called in an NMI
> > > context, because current->mems_allowed_seq is seqcount_spinlock_t and
> > > not NMI-safe:
> > >
> > > ================================
> > > WARNING: inconsistent lock state
> > > 6.19.0-rc5-kfree-rcu+ #315 Tainted: G N
> > > --------------------------------
> > > inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > > kunit_try_catch/9989 [HC1[1]:SC0[0]:HE0:SE1] takes:
> > > ffff889085799820 (&____s->seqcount#3){.-.-}-{0:0}, at: ___slab_alloc+0x58f/0xc00
> > > {INITIAL USE} state was registered at:
> > > lock_acquire+0x185/0x320
> > > kernel_init_freeable+0x391/0x1150
> > > kernel_init+0x1f/0x220
> > > ret_from_fork+0x736/0x8f0
> > > ret_from_fork_asm+0x1a/0x30
> > > irq event stamp: 56
> > > hardirqs last enabled at (55): [<ffffffff850a68d7>] _raw_spin_unlock_irq+0x27/0x70
> > > hardirqs last disabled at (56): [<ffffffff850858ca>] __schedule+0x2a8a/0x6630
> > > softirqs last enabled at (0): [<ffffffff81536711>] copy_process+0x1dc1/0x6a10
> > > softirqs last disabled at (0): [<0000000000000000>] 0x0
> > >
> > > other info that might help us debug this:
> > > Possible unsafe locking scenario:
> > >
> > > CPU0
> > > ----
> > > lock(&____s->seqcount#3);
> > > <Interrupt>
> > > lock(&____s->seqcount#3);
> > >
> > > *** DEADLOCK ***
> > >
> > > According to Documentation/locking/seqlock.rst, seqcount_t is not
> > > NMI-safe and seqcount_latch_t should be used when the read path can interrupt
> > > the write-side critical section. In this case, return NULL and fall back
> > > to slab allocation if !allow_spin.
> > >
> > > Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > > ---
> > > mm/slub.c | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 102fb47ae013..d46464654c15 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -3789,6 +3789,14 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
> > > enum zone_type highest_zoneidx = gfp_zone(pc->flags);
> > > unsigned int cpuset_mems_cookie;
> > >
> > > + /*
> > > + * read_mems_allowed_begin() accesses current->mems_allowed_seq,
> > > + * a seqcount_spinlock_t that is not NMI-safe. Skip allocation
> > > + * when GFP flags indicate spinning is not allowed.
> > > + */
> > > + if (!gfpflags_allow_spinning(pc->flags))
> > > + return NULL;
> >
> > I think it would be less restrictive to just continue,
Ack.
> > but skip the
> > read_mems_allowed_retry() part in the do-while loop, so just make it one
> > iteration for !allow_spin.
Makes sense.
> > If lockdep doesn't like even the
> > read_mems_allowed_begin() (not clear to me), skip it too?
Yes, lockdep doesn't like read_mems_allowed_begin(), and thus
we should skip both.
>
> +1
> Just an unconditional return NULL seems too restrictive.
Ack.
I'll do something like this:
diff --git a/mm/slub.c b/mm/slub.c
index 102fb47ae013..cc686ab929fe 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3788,6 +3788,7 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
struct zone *zone;
enum zone_type highest_zoneidx = gfp_zone(pc->flags);
unsigned int cpuset_mems_cookie;
+ bool allow_spin = gfpflags_allow_spinning(pc->flags);
/*
* The defrag ratio allows a configuration of the tradeoffs between
@@ -3812,7 +3813,15 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
return NULL;
do {
- cpuset_mems_cookie = read_mems_allowed_begin();
+ /*
+ * read_mems_allowed_begin() accesses current->mems_allowed_seq,
+ * a seqcount_spinlock_t that is not NMI-safe. Do not access
+ * current->mems_allowed_seq and avoid retry when GFP flags
+ * indicate spinning is not allowed.
+ */
+ if (allow_spin)
+ cpuset_mems_cookie = read_mems_allowed_begin();
+
zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
struct kmem_cache_node *n;
@@ -3836,7 +3845,7 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
}
}
}
- } while (read_mems_allowed_retry(cpuset_mems_cookie));
+ } while (allow_spin && read_mems_allowed_retry(cpuset_mems_cookie));
#endif /* CONFIG_NUMA */
return NULL;
}
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock()
2026-02-06 17:37 ` [PATCH 0/2] mm/slab: fix lockdep warnings with kmalloc_nolock() Harry Yoo
@ 2026-02-09 19:03 ` Vlastimil Babka
0 siblings, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2026-02-09 19:03 UTC (permalink / raw)
To: Harry Yoo, Andrew Morton
Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
Alexei Starovoitov, Hao Li, linux-mm
On 2/6/26 18:37, Harry Yoo wrote:
> On Sat, Feb 07, 2026 at 02:13:46AM +0900, Harry Yoo wrote:
>> Hi, I've observed two lockdep warnings while testing
>> kmalloc_nolock() in NMI:
>>
>> 1. Accessing current->mems_allowed_seq seqlock in NMI isn't safe
>> and lockdep complains.
>>
>> 2. w/ CONFIG_SLAB_FREELIST_RANDOM, get_random_u32() acquires
>> a local_lock, which isn't safe in NMI and could cause a deadlock.
>>
>> Let's fix them.
>
> I think we should probably add some sort of
> kmalloc_nolock()/kfree_nolock() test cases in lib/tests/slub_kunit.c.
That would be useful, yes!
> These haven't been discovered by bots because (I guess) it is very
> unlikely for bots to somehow trigger those APIs in NMI.
>
> Also, I forgot to mention that this is based on slab/for-next:
>
> commit bc33906024eb5955294e28128c3d0f492d2ded5e
> Merge: ec15c383fcda 40fd0acc45d0
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date: Thu Jan 29 10:10:50 2026 +0100
>
> Merge branch 'slab/for-7.0/sheaves' into slab/for-next
>
>> Harry Yoo (2):
>> mm/slab: skip get_from_any_partial() if !allow_spin
>> mm/slab: use prandom if !allow_spin
>>
>> mm/slub.c | 36 ++++++++++++++++++++++++++++++++----
>> 1 file changed, 32 insertions(+), 4 deletions(-)
>>
>> --
>> 2.43.0
>>
>
* Re: [PATCH 1/2] mm/slab: skip get_from_any_partial() if !allow_spin
2026-02-09 3:18 ` Harry Yoo
@ 2026-02-09 19:03 ` Vlastimil Babka
0 siblings, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2026-02-09 19:03 UTC (permalink / raw)
To: Harry Yoo, Alexei Starovoitov
Cc: Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin,
Alexei Starovoitov, Hao Li, linux-mm, stable
On 2/9/26 04:18, Harry Yoo wrote:
> On Fri, Feb 06, 2026 at 11:19:01AM -0800, Alexei Starovoitov wrote:
>> On Fri, Feb 6, 2026 at 10:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >
>> > On 2/6/26 18:13, Harry Yoo wrote:
>> > > Lockdep complains when get_from_any_partial() is called in an NMI
>> > > context, because current->mems_allowed_seq is seqcount_spinlock_t and
>> > > not NMI-safe:
>> > >
>> > > ================================
>> > > WARNING: inconsistent lock state
>> > > 6.19.0-rc5-kfree-rcu+ #315 Tainted: G N
>> > > --------------------------------
>> > > inconsistent {INITIAL USE} -> {IN-NMI} usage.
>> > > kunit_try_catch/9989 [HC1[1]:SC0[0]:HE0:SE1] takes:
>> > > ffff889085799820 (&____s->seqcount#3){.-.-}-{0:0}, at: ___slab_alloc+0x58f/0xc00
>> > > {INITIAL USE} state was registered at:
>> > > lock_acquire+0x185/0x320
>> > > kernel_init_freeable+0x391/0x1150
>> > > kernel_init+0x1f/0x220
>> > > ret_from_fork+0x736/0x8f0
>> > > ret_from_fork_asm+0x1a/0x30
>> > > irq event stamp: 56
>> > > hardirqs last enabled at (55): [<ffffffff850a68d7>] _raw_spin_unlock_irq+0x27/0x70
>> > > hardirqs last disabled at (56): [<ffffffff850858ca>] __schedule+0x2a8a/0x6630
>> > > softirqs last enabled at (0): [<ffffffff81536711>] copy_process+0x1dc1/0x6a10
>> > > softirqs last disabled at (0): [<0000000000000000>] 0x0
>> > >
>> > > other info that might help us debug this:
>> > > Possible unsafe locking scenario:
>> > >
>> > > CPU0
>> > > ----
>> > > lock(&____s->seqcount#3);
>> > > <Interrupt>
>> > > lock(&____s->seqcount#3);
>> > >
>> > > *** DEADLOCK ***
>> > >
>> > > According to Documentation/locking/seqlock.rst, seqcount_t is not
>> > > NMI-safe and seqcount_latch_t should be used when the read path can interrupt
>> > > the write-side critical section. In this case, return NULL and fall back
>> > > to slab allocation if !allow_spin.
>> > >
>> > > Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
>> > > Cc: stable@vger.kernel.org
>> > > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
>> > > ---
>> > > mm/slub.c | 8 ++++++++
>> > > 1 file changed, 8 insertions(+)
>> > >
>> > > diff --git a/mm/slub.c b/mm/slub.c
>> > > index 102fb47ae013..d46464654c15 100644
>> > > --- a/mm/slub.c
>> > > +++ b/mm/slub.c
>> > > @@ -3789,6 +3789,14 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
>> > > enum zone_type highest_zoneidx = gfp_zone(pc->flags);
>> > > unsigned int cpuset_mems_cookie;
>> > >
>> > > + /*
>> > > + * read_mems_allowed_begin() accesses current->mems_allowed_seq,
>> > > + * a seqcount_spinlock_t that is not NMI-safe. Skip allocation
>> > > + * when GFP flags indicate spinning is not allowed.
>> > > + */
>> > > + if (!gfpflags_allow_spinning(pc->flags))
>> > > + return NULL;
>> >
>> > I think it would be less restrictive to just continue,
>
> Ack.
>
>> > but skip the
>> > read_mems_allowed_retry() part in the do-while loop, so just make it one
>> > iteration for !allow_spin.
>
> Makes sense.
>
>> > If lockdep doesn't like even the
>> > read_mems_allowed_begin() (not clear to me), skip it too?
>
> Yes, lockdep doesn't like read_mems_allowed_begin(), and thus
> we should skip both.
>
>>
>> +1
>> Just unconditional return NULL seems too restrictive.
>
> Ack.
>
> I'll do something like this:
Looks good!
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 102fb47ae013..cc686ab929fe 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3788,6 +3788,7 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
> struct zone *zone;
> enum zone_type highest_zoneidx = gfp_zone(pc->flags);
> unsigned int cpuset_mems_cookie;
> + bool allow_spin = gfpflags_allow_spinning(pc->flags);
>
> /*
> * The defrag ratio allows a configuration of the tradeoffs between
> @@ -3812,7 +3813,15 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
> return NULL;
>
> do {
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + /*
> + * read_mems_allowed_begin() accesses current->mems_allowed_seq,
> + * a seqcount_spinlock_t that is not NMI-safe. Do not access
> + * current->mems_allowed_seq and avoid retry when GFP flags
> + * indicate spinning is not allowed.
> + */
> + if (allow_spin)
> + cpuset_mems_cookie = read_mems_allowed_begin();
> +
> zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
> for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
> struct kmem_cache_node *n;
> @@ -3836,7 +3845,7 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
> }
> }
> }
> - } while (read_mems_allowed_retry(cpuset_mems_cookie));
> + } while (allow_spin && read_mems_allowed_retry(cpuset_mems_cookie));
> #endif /* CONFIG_NUMA */
> return NULL;
> }
>
>