linux-mm.kvack.org archive mirror
* [BUG] -next lockdep invalid wait context
@ 2024-10-30 21:05 Paul E. McKenney
  2024-10-30 21:48 ` Vlastimil Babka
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2024-10-30 21:05 UTC (permalink / raw)
  To: linux-next, linux-kernel, kasan-dev, linux-mm
  Cc: sfr, bigeasy, longman, boqun.feng, elver, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm, vbabka

Hello!

The next-20241030 release gets the splat shown below when running
scftorture in a preemptible kernel.  This bisects to this commit:

560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")

Except that all this is doing is enabling lockdep to find the problem.

The obvious way to fix this is to make the kmem_cache structure's
cpu_slab field's ->lock be a raw spinlock, but this might not be what
we want for real-time response.

This can be reproduced deterministically as follows:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"

I doubt that the number of CPUs or amount of memory makes any difference,
but that is what I used.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

[   35.659746] =============================
[   35.659746] [ BUG: Invalid wait context ]
[   35.659746] 6.12.0-rc5-next-20241029 #57233 Not tainted
[   35.659746] -----------------------------
[   35.659746] swapper/37/0 is trying to lock:
[   35.659746] ffff8881ff4bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x49/0x1b0
[   35.659746] other info that might help us debug this:
[   35.659746] context-{2:2}
[   35.659746] no locks held by swapper/37/0.
[   35.659746] stack backtrace:
[   35.659746] CPU: 37 UID: 0 PID: 0 Comm: swapper/37 Not tainted 6.12.0-rc5-next-20241029 #57233
[   35.659746] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   35.659746] Call Trace:
[   35.659746]  <IRQ>
[   35.659746]  dump_stack_lvl+0x68/0xa0
[   35.659746]  __lock_acquire+0x8fd/0x3b90
[   35.659746]  ? start_secondary+0x113/0x210
[   35.659746]  ? __pfx___lock_acquire+0x10/0x10
[   35.659746]  ? __pfx___lock_acquire+0x10/0x10
[   35.659746]  ? __pfx___lock_acquire+0x10/0x10
[   35.659746]  ? __pfx___lock_acquire+0x10/0x10
[   35.659746]  lock_acquire+0x19b/0x520
[   35.659746]  ? put_cpu_partial+0x49/0x1b0
[   35.659746]  ? __pfx_lock_acquire+0x10/0x10
[   35.659746]  ? __pfx_lock_release+0x10/0x10
[   35.659746]  ? lock_release+0x20f/0x6f0
[   35.659746]  ? __pfx_lock_release+0x10/0x10
[   35.659746]  ? lock_release+0x20f/0x6f0
[   35.659746]  ? kasan_save_track+0x14/0x30
[   35.659746]  put_cpu_partial+0x52/0x1b0
[   35.659746]  ? put_cpu_partial+0x49/0x1b0
[   35.659746]  ? __pfx_scf_handler_1+0x10/0x10
[   35.659746]  __flush_smp_call_function_queue+0x2d2/0x600
[   35.659746]  __sysvec_call_function_single+0x50/0x280
[   35.659746]  sysvec_call_function_single+0x6b/0x80
[   35.659746]  </IRQ>
[   35.659746]  <TASK>
[   35.659746]  asm_sysvec_call_function_single+0x1a/0x20
[   35.659746] RIP: 0010:default_idle+0xf/0x20
[   35.659746] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90
 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 33 80 3e 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[   35.659746] RSP: 0018:ffff888100a9fe68 EFLAGS: 00000202
[   35.659746] RAX: 0000000000040d75 RBX: 0000000000000025 RCX: ffffffffab83df45
[   35.659746] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa8a5f7ba
[   35.659746] RBP: dffffc0000000000 R08: 0000000000000001 R09: ffffed103fe96c3c
[   35.659746] R10: ffff8881ff4b61e3 R11: 0000000000000000 R12: ffffffffad13f1d0
[   35.659746] R13: 1ffff11020153fd2 R14: 0000000000000000 R15: 0000000000000000
[   35.659746]  ? ct_kernel_exit.constprop.0+0xc5/0xf0
[   35.659746]  ? do_idle+0x2fa/0x3b0
[   35.659746]  default_idle_call+0x6d/0xb0
[   35.659746]  do_idle+0x2fa/0x3b0
[   35.659746]  ? __pfx_do_idle+0x10/0x10
[   35.659746]  cpu_startup_entry+0x4f/0x60
[   35.659746]  start_secondary+0x1bc/0x210
[   35.659746]  common_startup_64+0x12c/0x138
[   35.659746]  </TASK>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] -next lockdep invalid wait context
  2024-10-30 21:05 [BUG] -next lockdep invalid wait context Paul E. McKenney
@ 2024-10-30 21:48 ` Vlastimil Babka
  2024-10-30 22:34   ` Marco Elver
  0 siblings, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2024-10-30 21:48 UTC (permalink / raw)
  To: paulmck, linux-next, linux-kernel, kasan-dev, linux-mm
  Cc: sfr, bigeasy, longman, boqun.feng, elver, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm

On 10/30/24 22:05, Paul E. McKenney wrote:
> Hello!

Hi!

> The next-20241030 release gets the splat shown below when running
> scftorture in a preemptible kernel.  This bisects to this commit:
> 
> 560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
> 
> Except that all this is doing is enabling lockdep to find the problem.
> 
> The obvious way to fix this is to make the kmem_cache structure's
> cpu_slab field's ->lock be a raw spinlock, but this might not be what
> we want for real-time response.

But it's a local_lock, not a spinlock, and it's doing local_lock_irqsave().
I'm confused about what's happening here; the code has been like this for
years now.

> This can be reproduced deterministically as follows:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
> 
> I doubt that the number of CPUs or amount of memory makes any difference,
> but that is what I used.
> 
> Thoughts?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> [ ... lockdep splat trimmed, quoted in full above; key frames follow ... ]
> [   35.659746]  put_cpu_partial+0x52/0x1b0
> [   35.659746]  ? put_cpu_partial+0x49/0x1b0
> [   35.659746]  ? __pfx_scf_handler_1+0x10/0x10
> [   35.659746]  __flush_smp_call_function_queue+0x2d2/0x600

How did we even get to put_cpu_partial() directly from flushing smp calls?
SLUB doesn't use them; it uses queue_work_on() for flushing, and that
flushing doesn't involve put_cpu_partial() AFAIK.

I think only slab allocation or free can lead to put_cpu_partial(), so that
would mean the backtrace is missing something, and that somebody does a slab
alloc/free from an smp callback, which I'd then assume isn't allowed?

> [ ... remainder of splat (return to idle loop) trimmed, quoted in full above ... ]




* Re: [BUG] -next lockdep invalid wait context
  2024-10-30 21:48 ` Vlastimil Babka
@ 2024-10-30 22:34   ` Marco Elver
  2024-10-30 23:04     ` Boqun Feng
  2024-10-30 23:10     ` Paul E. McKenney
  0 siblings, 2 replies; 23+ messages in thread
From: Marco Elver @ 2024-10-30 22:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: paulmck, linux-next, linux-kernel, kasan-dev, linux-mm, sfr,
	bigeasy, longman, boqun.feng, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm

On Wed, Oct 30, 2024 at 10:48PM +0100, Vlastimil Babka wrote:
> On 10/30/24 22:05, Paul E. McKenney wrote:
> > Hello!
> 
> Hi!
> 
> > The next-20241030 release gets the splat shown below when running
> > scftorture in a preemptible kernel.  This bisects to this commit:
> > 
> > 560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
> > 
> > Except that all this is doing is enabling lockdep to find the problem.
> > 
> > The obvious way to fix this is to make the kmem_cache structure's
> > cpu_slab field's ->lock be a raw spinlock, but this might not be what
> > we want for real-time response.
> 
> But it's a local_lock, not a spinlock, and it's doing local_lock_irqsave().
> I'm confused about what's happening here; the code has been like this for
> years now.
> 
> > This can be reproduced deterministically as follows:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
> > 
> > I doubt that the number of CPUs or amount of memory makes any difference,
> > but that is what I used.
> > 
> > Thoughts?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > [ ... lockdep splat trimmed, quoted in full above; key frames follow ... ]
> > [   35.659746]  put_cpu_partial+0x52/0x1b0
> > [   35.659746]  ? put_cpu_partial+0x49/0x1b0
> > [   35.659746]  ? __pfx_scf_handler_1+0x10/0x10
> > [   35.659746]  __flush_smp_call_function_queue+0x2d2/0x600
> 
> How did we even get to put_cpu_partial() directly from flushing smp calls?
> SLUB doesn't use them; it uses queue_work_on() for flushing, and that
> flushing doesn't involve put_cpu_partial() AFAIK.
> 
> I think only slab allocation or free can lead to put_cpu_partial(), so that
> would mean the backtrace is missing something, and that somebody does a slab
> alloc/free from an smp callback, which I'd then assume isn't allowed?

Tail-call optimization is hiding the caller. Compiling with
-fno-optimize-sibling-calls exposes the caller. This gives the full
picture:

[   40.321505] =============================
[   40.322711] [ BUG: Invalid wait context ]
[   40.323927] 6.12.0-rc5-next-20241030-dirty #4 Not tainted
[   40.325502] -----------------------------
[   40.326653] cpuhp/47/253 is trying to lock:
[   40.327869] ffff8881ff9bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x48/0x1a0
[   40.330081] other info that might help us debug this:
[   40.331540] context-{2:2}
[   40.332305] 3 locks held by cpuhp/47/253:
[   40.333468]  #0: ffffffffae6e6910 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x590
[   40.336048]  #1: ffffffffae6e9060 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0xe0/0x590
[   40.338607]  #2: ffff8881002a6948 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x78/0x100
[   40.341454] stack backtrace:
[   40.342291] CPU: 47 UID: 0 PID: 253 Comm: cpuhp/47 Not tainted 6.12.0-rc5-next-20241030-dirty #4
[   40.344807] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   40.347482] Call Trace:
[   40.348199]  <IRQ>
[   40.348827]  dump_stack_lvl+0x6b/0xa0
[   40.349899]  dump_stack+0x10/0x20
[   40.350850]  __lock_acquire+0x900/0x4010
[   40.360290]  lock_acquire+0x191/0x4f0
[   40.364850]  put_cpu_partial+0x51/0x1a0
[   40.368341]  scf_handler+0x1bd/0x290
[   40.370590]  scf_handler_1+0x4e/0xb0
[   40.371630]  __flush_smp_call_function_queue+0x2dd/0x600
[   40.373142]  generic_smp_call_function_single_interrupt+0xe/0x20
[   40.374801]  __sysvec_call_function_single+0x50/0x280
[   40.376214]  sysvec_call_function_single+0x6c/0x80
[   40.377543]  </IRQ>
[   40.378142]  <TASK>

And scf_handler does indeed tail-call kfree:

	static void scf_handler(void *scfc_in)
	{
	[...]
		} else {
			kfree(scfcp);
		}
	}



* Re: [BUG] -next lockdep invalid wait context
  2024-10-30 22:34   ` Marco Elver
@ 2024-10-30 23:04     ` Boqun Feng
  2024-10-30 23:10     ` Paul E. McKenney
  1 sibling, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2024-10-30 23:04 UTC (permalink / raw)
  To: Marco Elver
  Cc: Vlastimil Babka, paulmck, linux-next, linux-kernel, kasan-dev,
	linux-mm, sfr, bigeasy, longman, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm

On Wed, Oct 30, 2024 at 11:34:08PM +0100, Marco Elver wrote:
> On Wed, Oct 30, 2024 at 10:48PM +0100, Vlastimil Babka wrote:
> > On 10/30/24 22:05, Paul E. McKenney wrote:
> > > Hello!
> > 
> > Hi!
> > 
> > > The next-20241030 release gets the splat shown below when running
> > > scftorture in a preemptible kernel.  This bisects to this commit:
> > > 
> > > 560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
> > > 
> > > Except that all this is doing is enabling lockdep to find the problem.
> > > 
> > > The obvious way to fix this is to make the kmem_cache structure's
> > > cpu_slab field's ->lock be a raw spinlock, but this might not be what
> > > we want for real-time response.
> > 
> > But it's a local_lock, not a spinlock, and it's doing local_lock_irqsave().
> > I'm confused about what's happening here; the code has been like this for
> > years now.
> > 
> > > This can be reproduced deterministically as follows:
> > > 
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
> > > 
> > > I doubt that the number of CPUs or amount of memory makes any difference,
> > > but that is what I used.
> > > 
> > > Thoughts?
> > > 
> > > 							Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > [ ... lockdep splat trimmed, quoted in full above; key frames follow ... ]
> > > [   35.659746]  put_cpu_partial+0x52/0x1b0
> > > [   35.659746]  ? put_cpu_partial+0x49/0x1b0
> > > [   35.659746]  ? __pfx_scf_handler_1+0x10/0x10
> > > [   35.659746]  __flush_smp_call_function_queue+0x2d2/0x600
> > 
> > How did we even get to put_cpu_partial() directly from flushing smp calls?
> > SLUB doesn't use them; it uses queue_work_on() for flushing, and that
> > flushing doesn't involve put_cpu_partial() AFAIK.
> > 
> > I think only slab allocation or free can lead to put_cpu_partial(), so that
> > would mean the backtrace is missing something, and that somebody does a slab
> > alloc/free from an smp callback, which I'd then assume isn't allowed?
> 

I think in this particular case, it is queuing a callback for
smp_call_function_single(), which doesn't have an interrupt handler thread
AFAICT. That means the callback will be executed in non-threaded hardirq
context, which in turn means any locks it takes must be valid with real
interrupts disabled.

Using irq_work might be fine, because it has a handler thread (but the
torture is for s(mp) c(all) f(unction), so replacing with irq_work is
not really fixing it ;-)).
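Lockdep's "invalid wait context" check that fired here boils down to an
ordering rule over wait types: a context (or an already-held lock) of a given
type may only acquire locks of an equal or lower type. In the splat, hardirq
context is type 2 ("context-{2:2}") while the local_lock is type 3
("{3:3}"), because PROVE_RAW_LOCK_NESTING treats spinlock_t/local_lock_t as
sleeping locks as they are on PREEMPT_RT. The sketch below is a deliberately
simplified model of that rule, not lockdep's actual implementation; the
constants only loosely mirror the kernel's enum lockdep_wait_type:

```c
#include <stdbool.h>

/*
 * Simplified wait types: a higher value means "more able to sleep".
 * The numeric values match the {n:n} annotations in the splat.
 */
enum wait_type {
	WAIT_SPIN   = 2,	/* raw_spinlock_t; also non-threaded hardirq context */
	WAIT_CONFIG = 3,	/* spinlock_t / local_lock_t: sleep on PREEMPT_RT */
	WAIT_SLEEP  = 4,	/* mutexes and other always-sleeping locks */
};

/*
 * The core rule: the lock being acquired may not have a "more sleepy"
 * wait type than the strictest constraint currently in effect -- the
 * context itself, or any lock already held.
 */
static bool acquire_is_valid(enum wait_type context, enum wait_type lock)
{
	return lock <= context;
}
```

In this model the splat corresponds to acquire_is_valid(WAIT_SPIN,
WAIT_CONFIG) being false: a level-3 local_lock taken from level-2 hardirq
context.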

Regards,
Boqun

> Tail-call optimization is hiding the caller. Compiling with
> -fno-optimize-sibling-calls exposes the caller. This gives the full
> picture:
> 
> [ ... second splat trimmed; quoted in full in Marco's message above ... ]
> 
> And scf_handler does indeed tail-call kfree:
> 
> 	static void scf_handler(void *scfc_in)
> 	{
> 	[...]
> 		} else {
> 			kfree(scfcp);
> 		}
> 	}



* Re: [BUG] -next lockdep invalid wait context
  2024-10-30 22:34   ` Marco Elver
  2024-10-30 23:04     ` Boqun Feng
@ 2024-10-30 23:10     ` Paul E. McKenney
  2024-10-31  7:21       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2024-10-30 23:10 UTC (permalink / raw)
  To: Marco Elver
  Cc: Vlastimil Babka, linux-next, linux-kernel, kasan-dev, linux-mm,
	sfr, bigeasy, longman, boqun.feng, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm

On Wed, Oct 30, 2024 at 11:34:08PM +0100, Marco Elver wrote:
> On Wed, Oct 30, 2024 at 10:48PM +0100, Vlastimil Babka wrote:
> > On 10/30/24 22:05, Paul E. McKenney wrote:
> > > Hello!
> > 
> > Hi!
> > 
> > > The next-20241030 release gets the splat shown below when running
> > > scftorture in a preemptible kernel.  This bisects to this commit:
> > > 
> > > 560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
> > > 
> > > Except that all this is doing is enabling lockdep to find the problem.
> > > 
> > > The obvious way to fix this is to make the kmem_cache structure's
> > > cpu_slab field's ->lock be a raw spinlock, but this might not be what
> > > we want for real-time response.
> > 
> > But it's a local_lock, not a spinlock, and it's doing local_lock_irqsave().
> > I'm confused about what's happening here; the code has been like this for
> > years now.
> > 
> > > This can be reproduced deterministically as follows:
> > > 
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
> > > 
> > > I doubt that the number of CPUs or amount of memory makes any difference,
> > > but that is what I used.
> > > 
> > > Thoughts?
> > > 
> > > 							Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > [ ... lockdep splat trimmed, quoted in full above; key frames follow ... ]
> > > [   35.659746]  put_cpu_partial+0x52/0x1b0
> > > [   35.659746]  ? put_cpu_partial+0x49/0x1b0
> > > [   35.659746]  ? __pfx_scf_handler_1+0x10/0x10
> > > [   35.659746]  __flush_smp_call_function_queue+0x2d2/0x600
> > 
> > How did we even get to put_cpu_partial() directly from flushing smp calls?
> > SLUB doesn't use them; it uses queue_work_on() for flushing, and that
> > flushing doesn't involve put_cpu_partial() AFAIK.
> > 
> > I think only slab allocation or free can lead to put_cpu_partial(), so that
> > would mean the backtrace is missing something, and that somebody does a slab
> > alloc/free from an smp callback, which I'd then assume isn't allowed?
> 
> Tail-call optimization is hiding the caller. Compiling with
> -fno-optimize-sibling-calls exposes the caller. This gives the full
> picture:
> 
> [ ... second splat trimmed, quoted in full above; key frames: ... ]
> [   40.364850]  put_cpu_partial+0x51/0x1a0
> [   40.368341]  scf_handler+0x1bd/0x290
> [   40.370590]  scf_handler_1+0x4e/0xb0
> [   40.371630]  __flush_smp_call_function_queue+0x2dd/0x600
> 
> And scf_handler does indeed tail-call kfree:
> 
> 	static void scf_handler(void *scfc_in)
> 	{
> 	[...]
> 		} else {
> 			kfree(scfcp);
> 		}
> 	}

So I need to avoid calling kfree() within an smp_call_function() handler?

							Thanx, Paul



* Re: [BUG] -next lockdep invalid wait context
  2024-10-30 23:10     ` Paul E. McKenney
@ 2024-10-31  7:21       ` Sebastian Andrzej Siewior
  2024-10-31  7:35         ` Vlastimil Babka
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-31  7:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Marco Elver, Vlastimil Babka, linux-next, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, boqun.feng, cl, penberg,
	rientjes, iamjoonsoo.kim, akpm

On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> 
> So I need to avoid calling kfree() within an smp_call_function() handler?

Yes. No kmalloc()/kfree() in IRQ context.
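One general pattern for respecting that rule is to defer the free: the
restricted context only pushes the object onto a lock-free list, and a
later, unrestricted context drains the list and does the real freeing. (In
the kernel this might be built from llist plus irq_work or a workqueue; the
sketch below is a userspace analogue with invented names, not the fix
actually adopted here.)

```c
#include <stdlib.h>
#include <stdatomic.h>

/* Link embedded at the start of each object whose free is deferred. */
struct deferred {
	struct deferred *next;
};

static _Atomic(struct deferred *) deferred_head;

/*
 * Callable from a restricted (e.g. handler) context: a lock-free push.
 * Takes no locks and calls no allocator function.
 */
static void free_deferred(struct deferred *d)
{
	struct deferred *old = atomic_load(&deferred_head);

	do {
		d->next = old;
	} while (!atomic_compare_exchange_weak(&deferred_head, &old, d));
}

/*
 * Called later from an unrestricted context: detach the whole list in
 * one atomic exchange, then really free each entry.  Returns the number
 * of objects freed.
 */
static int drain_deferred(void)
{
	struct deferred *d = atomic_exchange(&deferred_head, NULL);
	int n = 0;

	while (d) {
		struct deferred *next = d->next;

		free(d);
		d = next;
		n++;
	}
	return n;
}
```

All allocator work happens in drain_deferred(), so only the cheap,
lock-free push ever runs in the context where sleeping locks are forbidden.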

> 							Thanx, Paul

Sebastian



* Re: [BUG] -next lockdep invalid wait context
  2024-10-31  7:21       ` Sebastian Andrzej Siewior
@ 2024-10-31  7:35         ` Vlastimil Babka
  2024-10-31  7:55           ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2024-10-31  7:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Paul E. McKenney
  Cc: Marco Elver, linux-next, linux-kernel, kasan-dev, linux-mm, sfr,
	longman, boqun.feng, cl, penberg, rientjes, iamjoonsoo.kim, akpm

On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
>> 
>> So I need to avoid calling kfree() within an smp_call_function() handler?
> 
> Yes. No kmalloc()/ kfree() in IRQ context.

However, isn't it the case that the rule is actually about hardirq context
on RT, and that most of these operations that are in IRQ context on !RT run
in threaded interrupt context on RT, so they are actually fine? Or does an
smp call callback run in hardirq context even on RT, so it really can't do
those operations?

Vlastimil

>> 							Thanx, Paul
> 
> Sebastian




* Re: [BUG] -next lockdep invalid wait context
  2024-10-31  7:35         ` Vlastimil Babka
@ 2024-10-31  7:55           ` Sebastian Andrzej Siewior
  2024-10-31  8:18             ` Vlastimil Babka
  2024-10-31 17:50             ` Paul E. McKenney
  0 siblings, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-31  7:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Paul E. McKenney, Marco Elver, linux-next, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, boqun.feng, cl, penberg,
	rientjes, iamjoonsoo.kim, akpm

On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
> On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> >> 
> >> So I need to avoid calling kfree() within an smp_call_function() handler?
> > 
> > Yes. No kmalloc()/ kfree() in IRQ context.
> 
> However, isn't it the case that the rule is actually about hardirq context
> on RT, and that most of these operations that are in IRQ context on !RT run
> in threaded interrupt context on RT, so they are actually fine? Or does an
> smp call callback run in hardirq context even on RT, so it really can't do
> those operations?

Interrupt handlers registered via request_irq() are force-threaded on RT, so
you can do kmalloc()/kfree() there. smp_call_function*() handlers, on the
other hand, are not threaded and are invoked directly in hardirq context.

> Vlastimil
> 
Sebastian



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] -next lockdep invalid wait context
  2024-10-31  7:55           ` Sebastian Andrzej Siewior
@ 2024-10-31  8:18             ` Vlastimil Babka
  2024-11-01 17:14               ` Paul E. McKenney
  2024-10-31 17:50             ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2024-10-31  8:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Paul E. McKenney
  Cc: Marco Elver, linux-next, linux-kernel, kasan-dev, linux-mm, sfr,
	longman, boqun.feng, cl, penberg, rientjes, iamjoonsoo.kim, akpm

On 10/31/24 08:55, Sebastian Andrzej Siewior wrote:
> On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
>> On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
>> > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
>> >> 
>> >> So I need to avoid calling kfree() within an smp_call_function() handler?
>> > 
>> > Yes. No kmalloc()/ kfree() in IRQ context.
>> 
>> However, isn't it the case that the rule is actually about hardirq context
>> on RT, and that most of these operations that are in IRQ context on !RT
>> become threaded interrupt context on RT, so they are actually fine? Or is
>> an smp call callback hardirq context on RT, and thus it really can't do
>> those operations?
> 
> Interrupt handlers registered via request_irq() are force-threaded on RT,
> so you can do kmalloc()/kfree() there. smp_call_function.*(), on the other
> hand, is not threaded and is invoked directly in IRQ context.

Makes sense, thanks.

So how come rcutorture wasn't already deadlocking on RT? Is it (or RCU
itself) doing anything differently there that avoids the kfree() from an
smp_call_function() handler?

>> Vlastimil
>> 
> Sebastian
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] -next lockdep invalid wait context
  2024-10-31  7:55           ` Sebastian Andrzej Siewior
  2024-10-31  8:18             ` Vlastimil Babka
@ 2024-10-31 17:50             ` Paul E. McKenney
  2024-11-01 19:50               ` Boqun Feng
  1 sibling, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2024-10-31 17:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Vlastimil Babka, Marco Elver, linux-next, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, boqun.feng, cl, penberg,
	rientjes, iamjoonsoo.kim, akpm

On Thu, Oct 31, 2024 at 08:55:09AM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
> > On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> > > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> > >> 
> > >> So I need to avoid calling kfree() within an smp_call_function() handler?
> > > 
> > > Yes. No kmalloc()/ kfree() in IRQ context.
> > 
> > However, isn't it the case that the rule is actually about hardirq context
> > on RT, and that most of these operations that are in IRQ context on !RT
> > become threaded interrupt context on RT, so they are actually fine? Or is
> > an smp call callback hardirq context on RT, and thus it really can't do
> > those operations?
> 
> Interrupt handlers registered via request_irq() are force-threaded on RT,
> so you can do kmalloc()/kfree() there. smp_call_function.*(), on the other
> hand, is not threaded and is invoked directly in IRQ context.

OK, thank you all for the explanation!  I will fix this using Boqun's
suggestion of irq work, while avoiding the issue Boqun raises by invoking
the irq-work handler from the smp_call_function() handler.

It will be a few days before I get to this, so if there is a better way,
please do not keep it a secret!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] -next lockdep invalid wait context
  2024-10-31  8:18             ` Vlastimil Babka
@ 2024-11-01 17:14               ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-11-01 17:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sebastian Andrzej Siewior, Marco Elver, linux-next, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, boqun.feng, cl, penberg,
	rientjes, iamjoonsoo.kim, akpm

On Thu, Oct 31, 2024 at 09:18:52AM +0100, Vlastimil Babka wrote:
> On 10/31/24 08:55, Sebastian Andrzej Siewior wrote:
> > On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
> >> On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> >> > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> >> >> 
> >> >> So I need to avoid calling kfree() within an smp_call_function() handler?
> >> > 
> >> > Yes. No kmalloc()/ kfree() in IRQ context.
> >> 
> >> However, isn't it the case that the rule is actually about hardirq context
> >> on RT, and that most of these operations that are in IRQ context on !RT
> >> become threaded interrupt context on RT, so they are actually fine? Or is
> >> an smp call callback hardirq context on RT, and thus it really can't do
> >> those operations?
> > 
> > Interrupt handlers registered via request_irq() are force-threaded on RT,
> > so you can do kmalloc()/kfree() there. smp_call_function.*(), on the other
> > hand, is not threaded and is invoked directly in IRQ context.
> 
> Makes sense, thanks.
> 
> So how come rcutorture wasn't already deadlocking on RT? Is it (or RCU
> itself) doing anything differently there that avoids the kfree() from an
> smp_call_function() handler?

This was scftorture rather than rcutorture.  While I know of others who
regularly run rcutorture, to the best of my knowledge I am the only one
who ever runs scftorture, which tests smp_call_function(), its friends,
and its diagnostics.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] -next lockdep invalid wait context
  2024-10-31 17:50             ` Paul E. McKenney
@ 2024-11-01 19:50               ` Boqun Feng
  2024-11-01 19:54                 ` [PATCH] scftorture: Use workqueue to free scf_check Boqun Feng
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2024-11-01 19:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
	linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
	penberg, rientjes, iamjoonsoo.kim, akpm

On Thu, Oct 31, 2024 at 10:50:29AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2024 at 08:55:09AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
> > > On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> > > > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> > > >> 
> > > >> So I need to avoid calling kfree() within an smp_call_function() handler?
> > > > 
> > > > Yes. No kmalloc()/ kfree() in IRQ context.
> > > 
> > > However, isn't it the case that the rule is actually about hardirq context
> > > on RT, and that most of these operations that are in IRQ context on !RT
> > > become threaded interrupt context on RT, so they are actually fine? Or is
> > > an smp call callback hardirq context on RT, and thus it really can't do
> > > those operations?
> > 
> > Interrupt handlers registered via request_irq() are force-threaded on RT,
> > so you can do kmalloc()/kfree() there. smp_call_function.*(), on the other
> > hand, is not threaded and is invoked directly in IRQ context.
> 
> OK, thank you all for the explanation!  I will fix this using Boqun's
> suggestion of irq work, while avoiding the issue Boqun raises by invoking

I've tried fixing this with irq_work; however, unlike a normal
work_struct, an irq_work item is still touched by the core code after the
work function has executed (see irq_work_single()). So it needs more work
to build a "one-off free" facility on top of it.

I think we can just use a normal workqueue, because queue_work() uses
local_irq_save() + raw_spin_lock(), so it's irq-safe even for
non-threaded interrupts.

Sending a patch soon.

Regards,
Boqun

> the irq-work handler from the smp_call_function() handler.
> 
> It will be a few days before I get to this, so if there is a better way,
> please do not keep it a secret!
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] scftorture: Use workqueue to free scf_check
  2024-11-01 19:50               ` Boqun Feng
@ 2024-11-01 19:54                 ` Boqun Feng
  2024-11-01 23:35                   ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2024-11-01 19:54 UTC (permalink / raw)
  To: paulmck
  Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
	linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
	penberg, rientjes, iamjoonsoo.kim, akpm, Thomas Gleixner,
	Peter Zijlstra, Boqun Feng

Paul reported an invalid wait context issue in scftorture caught by
lockdep; the cause is that scf_handler() may call
kfree() to free the struct scf_check:

	static void scf_handler(void *scfc_in)
        {
        [...]
                } else {
                        kfree(scfcp);
                }
        }

(call chain analysis from Marco Elver)

This is problematic because smp_call_function() uses a non-threaded
interrupt, and kfree() may acquire a local_lock, which is a sleeping lock
on RT.

The general rule is: do not allocate or free memory in non-threaded
interrupt contexts.

A quick fix is to use a workqueue to defer the kfree(). However, this is
OK only because scftorture is test code. In general, users of
interrupts should avoid giving interrupt handlers ownership of
objects; that is, users should manage object lifetimes outside, and
interrupt handlers should only hold references to objects.

Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Link: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/scftorture.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index 44e83a646264..ab6dcc7c0116 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -127,6 +127,7 @@ static unsigned long scf_sel_totweight;
 
 // Communicate between caller and handler.
 struct scf_check {
+	struct work_struct work;
 	bool scfc_in;
 	bool scfc_out;
 	int scfc_cpu; // -1 for not _single().
@@ -252,6 +253,13 @@ static struct scf_selector *scf_sel_rand(struct torture_random_state *trsp)
 	return &scf_sel_array[0];
 }
 
+static void kfree_scf_check_work(struct work_struct *w)
+{
+	struct scf_check *scfcp = container_of(w, struct scf_check, work);
+
+	kfree(scfcp);
+}
+
 // Update statistics and occasionally burn up mass quantities of CPU time,
 // if told to do so via scftorture.longwait.  Otherwise, occasionally burn
 // a little bit.
@@ -296,7 +304,10 @@ static void scf_handler(void *scfc_in)
 		if (scfcp->scfc_rpc)
 			complete(&scfcp->scfc_completion);
 	} else {
-		kfree(scfcp);
+		// Cannot call kfree() directly, pass it to workqueue. It's OK
+		// only because this is test code, avoid this in real world
+		// usage.
+		queue_work(system_wq, &scfcp->work);
 	}
 }
 
@@ -335,6 +346,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 			scfcp->scfc_wait = scfsp->scfs_wait;
 			scfcp->scfc_out = false;
 			scfcp->scfc_rpc = false;
+			INIT_WORK(&scfcp->work, kfree_scf_check_work);
 		}
 	}
 	switch (scfsp->scfs_prim) {
-- 
2.45.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] scftorture: Use workqueue to free scf_check
  2024-11-01 19:54                 ` [PATCH] scftorture: Use workqueue to free scf_check Boqun Feng
@ 2024-11-01 23:35                   ` Paul E. McKenney
  2024-11-03  3:35                     ` Boqun Feng
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2024-11-01 23:35 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
	linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
	penberg, rientjes, iamjoonsoo.kim, akpm, Thomas Gleixner,
	Peter Zijlstra

On Fri, Nov 01, 2024 at 12:54:38PM -0700, Boqun Feng wrote:
> Paul reported an invalid wait context issue in scftorture caught by
> lockdep; the cause is that scf_handler() may call
> kfree() to free the struct scf_check:
> 
> 	static void scf_handler(void *scfc_in)
>         {
>         [...]
>                 } else {
>                         kfree(scfcp);
>                 }
>         }
> 
> (call chain analysis from Marco Elver)
> 
> This is problematic because smp_call_function() uses a non-threaded
> interrupt, and kfree() may acquire a local_lock, which is a sleeping lock
> on RT.
> 
> The general rule is: do not allocate or free memory in non-threaded
> interrupt contexts.
> 
> A quick fix is to use a workqueue to defer the kfree(). However, this is
> OK only because scftorture is test code. In general, users of
> interrupts should avoid giving interrupt handlers ownership of
> objects; that is, users should manage object lifetimes outside, and
> interrupt handlers should only hold references to objects.
> 
> Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> Link: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Thank you!

I was worried that putting each kfree() into a separate workqueue handler
would result in freeing not keeping up with allocation for asynchronous
testing (for example, scftorture.weight_single=1), but it seems to be
doing fine in early testing.

So I have queued this in my -rcu tree for review and further testing.

							Thanx, Paul

> ---
>  kernel/scftorture.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> index 44e83a646264..ab6dcc7c0116 100644
> --- a/kernel/scftorture.c
> +++ b/kernel/scftorture.c
> @@ -127,6 +127,7 @@ static unsigned long scf_sel_totweight;
>  
>  // Communicate between caller and handler.
>  struct scf_check {
> +	struct work_struct work;
>  	bool scfc_in;
>  	bool scfc_out;
>  	int scfc_cpu; // -1 for not _single().
> @@ -252,6 +253,13 @@ static struct scf_selector *scf_sel_rand(struct torture_random_state *trsp)
>  	return &scf_sel_array[0];
>  }
>  
> +static void kfree_scf_check_work(struct work_struct *w)
> +{
> +	struct scf_check *scfcp = container_of(w, struct scf_check, work);
> +
> +	kfree(scfcp);
> +}
> +
>  // Update statistics and occasionally burn up mass quantities of CPU time,
>  // if told to do so via scftorture.longwait.  Otherwise, occasionally burn
>  // a little bit.
> @@ -296,7 +304,10 @@ static void scf_handler(void *scfc_in)
>  		if (scfcp->scfc_rpc)
>  			complete(&scfcp->scfc_completion);
>  	} else {
> -		kfree(scfcp);
> +		// Cannot call kfree() directly, pass it to workqueue. It's OK
> +		// only because this is test code, avoid this in real world
> +		// usage.
> +		queue_work(system_wq, &scfcp->work);
>  	}
>  }
>  
> @@ -335,6 +346,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
>  			scfcp->scfc_wait = scfsp->scfs_wait;
>  			scfcp->scfc_out = false;
>  			scfcp->scfc_rpc = false;
> +			INIT_WORK(&scfcp->work, kfree_scf_check_work);
>  		}
>  	}
>  	switch (scfsp->scfs_prim) {
> -- 
> 2.45.2
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] scftorture: Use workqueue to free scf_check
  2024-11-01 23:35                   ` Paul E. McKenney
@ 2024-11-03  3:35                     ` Boqun Feng
  2024-11-03 15:03                       ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2024-11-03  3:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
	linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
	penberg, rientjes, iamjoonsoo.kim, akpm, Thomas Gleixner,
	Peter Zijlstra

On Fri, Nov 01, 2024 at 04:35:28PM -0700, Paul E. McKenney wrote:
> On Fri, Nov 01, 2024 at 12:54:38PM -0700, Boqun Feng wrote:
> > Paul reported an invalid wait context issue in scftorture caught by
> > lockdep; the cause is that scf_handler() may call
> > kfree() to free the struct scf_check:
> > 
> > 	static void scf_handler(void *scfc_in)
> >         {
> >         [...]
> >                 } else {
> >                         kfree(scfcp);
> >                 }
> >         }
> > 
> > (call chain analysis from Marco Elver)
> > 
> > This is problematic because smp_call_function() uses a non-threaded
> > interrupt, and kfree() may acquire a local_lock, which is a sleeping lock
> > on RT.
> > 
> > The general rule is: do not allocate or free memory in non-threaded
> > interrupt contexts.
> > 
> > A quick fix is to use a workqueue to defer the kfree(). However, this is
> > OK only because scftorture is test code. In general, users of
> > interrupts should avoid giving interrupt handlers ownership of
> > objects; that is, users should manage object lifetimes outside, and
> > interrupt handlers should only hold references to objects.
> > 
> > Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> > Link: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> 
> Thank you!
> 
> I was worried that putting each kfree() into a separate workqueue handler
> would result in freeing not keeping up with allocation for asynchronous
> testing (for example, scftorture.weight_single=1), but it seems to be
> doing fine in early testing.
> 

I shared the same worry, which is why I added the comment before
queue_work() saying it's OK only because this is test code; it's certainly
not something recommended for general use.

But glad it turns out OK so far for scftorture ;-)

Regards,
Boqun

> So I have queued this in my -rcu tree for review and further testing.
> 
> 							Thanx, Paul
> 
> > ---
> >  kernel/scftorture.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> > index 44e83a646264..ab6dcc7c0116 100644
> > --- a/kernel/scftorture.c
> > +++ b/kernel/scftorture.c
> > @@ -127,6 +127,7 @@ static unsigned long scf_sel_totweight;
> >  
> >  // Communicate between caller and handler.
> >  struct scf_check {
> > +	struct work_struct work;
> >  	bool scfc_in;
> >  	bool scfc_out;
> >  	int scfc_cpu; // -1 for not _single().
> > @@ -252,6 +253,13 @@ static struct scf_selector *scf_sel_rand(struct torture_random_state *trsp)
> >  	return &scf_sel_array[0];
> >  }
> >  
> > +static void kfree_scf_check_work(struct work_struct *w)
> > +{
> > +	struct scf_check *scfcp = container_of(w, struct scf_check, work);
> > +
> > +	kfree(scfcp);
> > +}
> > +
> >  // Update statistics and occasionally burn up mass quantities of CPU time,
> >  // if told to do so via scftorture.longwait.  Otherwise, occasionally burn
> >  // a little bit.
> > @@ -296,7 +304,10 @@ static void scf_handler(void *scfc_in)
> >  		if (scfcp->scfc_rpc)
> >  			complete(&scfcp->scfc_completion);
> >  	} else {
> > -		kfree(scfcp);
> > +		// Cannot call kfree() directly, pass it to workqueue. It's OK
> > +		// only because this is test code, avoid this in real world
> > +		// usage.
> > +		queue_work(system_wq, &scfcp->work);
> >  	}
> >  }
> >  
> > @@ -335,6 +346,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
> >  			scfcp->scfc_wait = scfsp->scfs_wait;
> >  			scfcp->scfc_out = false;
> >  			scfcp->scfc_rpc = false;
> > +			INIT_WORK(&scfcp->work, kfree_scf_check_work);
> >  		}
> >  	}
> >  	switch (scfsp->scfs_prim) {
> > -- 
> > 2.45.2
> > 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] scftorture: Use workqueue to free scf_check
  2024-11-03  3:35                     ` Boqun Feng
@ 2024-11-03 15:03                       ` Paul E. McKenney
  2024-11-04 10:50                         ` [PATCH 1/2] scftorture: Move memory allocation outside of preempt_disable region Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2024-11-03 15:03 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
	linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
	penberg, rientjes, iamjoonsoo.kim, akpm, Thomas Gleixner,
	Peter Zijlstra

On Sat, Nov 02, 2024 at 08:35:36PM -0700, Boqun Feng wrote:
> On Fri, Nov 01, 2024 at 04:35:28PM -0700, Paul E. McKenney wrote:
> > On Fri, Nov 01, 2024 at 12:54:38PM -0700, Boqun Feng wrote:
> > > Paul reported an invalid wait context issue in scftorture caught by
> > > lockdep; the cause is that scf_handler() may call
> > > kfree() to free the struct scf_check:
> > > 
> > > 	static void scf_handler(void *scfc_in)
> > >         {
> > >         [...]
> > >                 } else {
> > >                         kfree(scfcp);
> > >                 }
> > >         }
> > > 
> > > (call chain analysis from Marco Elver)
> > > 
> > > This is problematic because smp_call_function() uses a non-threaded
> > > interrupt, and kfree() may acquire a local_lock, which is a sleeping lock
> > > on RT.
> > > 
> > > The general rule is: do not allocate or free memory in non-threaded
> > > interrupt contexts.
> > > 
> > > A quick fix is to use a workqueue to defer the kfree(). However, this is
> > > OK only because scftorture is test code. In general, users of
> > > interrupts should avoid giving interrupt handlers ownership of
> > > objects; that is, users should manage object lifetimes outside, and
> > > interrupt handlers should only hold references to objects.
> > > 
> > > Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> > > Link: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
> > > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > 
> > Thank you!
> > 
> > I was worried that putting each kfree() into a separate workqueue handler
> > would result in freeing not keeping up with allocation for asynchronous
> > testing (for example, scftorture.weight_single=1), but it seems to be
> > doing fine in early testing.
> 
> I shared the same worry, which is why I added the comment before
> queue_work() saying it's OK only because this is test code; it's certainly
> not something recommended for general use.
> 
> But glad it turns out OK so far for scftorture ;-)

That said, I have only tried a couple of memory sizes at 64 CPUs: the
default (512M), which OOMs both with and without this fix, and 7G, which
is selected by torture.sh and avoids OOMing either way.  It would be
interesting to vary the memory provided between those limits and see if
there is any difference in behavior.

It avoids OOM at the default 512M at 16 CPUs.

Ah, and I did not check throughput, which might have changed.  A quick
test on my laptop says that it dropped by almost a factor of two,
from not quite 1M invocations/s to a bit more than 500K invocations/s.
So something more efficient does seem in order.  ;-)

tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --configs PREEMPT --duration 30 --bootargs "scftorture.weight_single=1" --trust-make

							Thanx, Paul

> Regards,
> Boqun
> 
> > So I have queued this in my -rcu tree for review and further testing.
> > 
> > 							Thanx, Paul
> > 
> > > ---
> > >  kernel/scftorture.c | 14 +++++++++++++-
> > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> > > index 44e83a646264..ab6dcc7c0116 100644
> > > --- a/kernel/scftorture.c
> > > +++ b/kernel/scftorture.c
> > > @@ -127,6 +127,7 @@ static unsigned long scf_sel_totweight;
> > >  
> > >  // Communicate between caller and handler.
> > >  struct scf_check {
> > > +	struct work_struct work;
> > >  	bool scfc_in;
> > >  	bool scfc_out;
> > >  	int scfc_cpu; // -1 for not _single().
> > > @@ -252,6 +253,13 @@ static struct scf_selector *scf_sel_rand(struct torture_random_state *trsp)
> > >  	return &scf_sel_array[0];
> > >  }
> > >  
> > > +static void kfree_scf_check_work(struct work_struct *w)
> > > +{
> > > +	struct scf_check *scfcp = container_of(w, struct scf_check, work);
> > > +
> > > +	kfree(scfcp);
> > > +}
> > > +
> > >  // Update statistics and occasionally burn up mass quantities of CPU time,
> > >  // if told to do so via scftorture.longwait.  Otherwise, occasionally burn
> > >  // a little bit.
> > > @@ -296,7 +304,10 @@ static void scf_handler(void *scfc_in)
> > >  		if (scfcp->scfc_rpc)
> > >  			complete(&scfcp->scfc_completion);
> > >  	} else {
> > > -		kfree(scfcp);
> > > +		// Cannot call kfree() directly, pass it to workqueue. It's OK
> > > +		// only because this is test code, avoid this in real world
> > > +		// usage.
> > > +		queue_work(system_wq, &scfcp->work);
> > >  	}
> > >  }
> > >  
> > > @@ -335,6 +346,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
> > >  			scfcp->scfc_wait = scfsp->scfs_wait;
> > >  			scfcp->scfc_out = false;
> > >  			scfcp->scfc_rpc = false;
> > > +			INIT_WORK(&scfcp->work, kfree_scf_check_work);
> > >  		}
> > >  	}
> > >  	switch (scfsp->scfs_prim) {
> > > -- 
> > > 2.45.2
> > > 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] scftorture: Move memory allocation outside of preempt_disable region.
  2024-11-03 15:03                       ` Paul E. McKenney
@ 2024-11-04 10:50                         ` Sebastian Andrzej Siewior
  2024-11-04 10:50                           ` [PATCH 2/2] scftorture: Use a lock-less list to free memory Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-04 10:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior

On PREEMPT_RT, memory allocations cannot happen within regions with
explicitly disabled preemption. The problem is that the locking structures
underneath are sleeping locks.

Move the memory allocation outside of the preempt-disabled section. Keep
the GFP_ATOMIC for the allocation so it behaves like an "emergency
allocation".

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/scftorture.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index 44e83a6462647..e5546fe256329 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -320,10 +320,6 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 	struct scf_check *scfcp = NULL;
 	struct scf_selector *scfsp = scf_sel_rand(trsp);
 
-	if (use_cpus_read_lock)
-		cpus_read_lock();
-	else
-		preempt_disable();
 	if (scfsp->scfs_prim == SCF_PRIM_SINGLE || scfsp->scfs_wait) {
 		scfcp = kmalloc(sizeof(*scfcp), GFP_ATOMIC);
 		if (!scfcp) {
@@ -337,6 +333,10 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 			scfcp->scfc_rpc = false;
 		}
 	}
+	if (use_cpus_read_lock)
+		cpus_read_lock();
+	else
+		preempt_disable();
 	switch (scfsp->scfs_prim) {
 	case SCF_PRIM_RESCHED:
 		if (IS_BUILTIN(CONFIG_SCF_TORTURE_TEST)) {
-- 
2.45.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 2/2] scftorture: Use a lock-less list to free memory.
  2024-11-04 10:50                         ` [PATCH 1/2] scftorture: Move memory allocation outside of preempt_disable region Sebastian Andrzej Siewior
@ 2024-11-04 10:50                           ` Sebastian Andrzej Siewior
  2024-11-05  1:00                             ` Boqun Feng
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-04 10:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior

scf_handler() is used as an SMP function call handler. It is always
invoked in IRQ context, even with forced interrupt threading enabled, and
it frees memory, which is not allowed on PREEMPT_RT because the locking
underneath uses sleeping locks.

Add a per-CPU scf_free_pool where each SMP function call adds its memory
to be freed. This memory is then freed by scftorture_invoker() on each
iteration. On the majority of invocations the number of items is less
than five. If the thread sleeps or gets delayed, the number exceeded 350
but did not reach 400 in testing; these were the spikes. The bulk free of
64 pointers at once should improve the give-back if the list grows. The
list size averages ~1.3 items per invocation.

Having one global scf_free_pool with one cleaning thread lets the list
grow to over 10,000 items with 32 CPUs (again, spikes, not the average),
especially if the CPU went to sleep. The per-CPU approach looks like a
good compromise.

Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Closes: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/scftorture.c | 47 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index e5546fe256329..ba9f1125821b8 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -97,6 +97,7 @@ struct scf_statistics {
 static struct scf_statistics *scf_stats_p;
 static struct task_struct *scf_torture_stats_task;
 static DEFINE_PER_CPU(long long, scf_invoked_count);
+static DEFINE_PER_CPU(struct llist_head, scf_free_pool);
 
 // Data for random primitive selection
 #define SCF_PRIM_RESCHED	0
@@ -133,6 +134,7 @@ struct scf_check {
 	bool scfc_wait;
 	bool scfc_rpc;
 	struct completion scfc_completion;
+	struct llist_node scf_node;
 };
 
 // Use to wait for all threads to start.
@@ -148,6 +150,40 @@ static DEFINE_TORTURE_RANDOM_PERCPU(scf_torture_rand);
 
 extern void resched_cpu(int cpu); // An alternative IPI vector.
 
+static void scf_add_to_free_list(struct scf_check *scfcp)
+{
+	struct llist_head *pool;
+	unsigned int cpu;
+
+	cpu = raw_smp_processor_id() % nthreads;
+	pool = &per_cpu(scf_free_pool, cpu);
+	llist_add(&scfcp->scf_node, pool);
+}
+
+static void scf_cleanup_free_list(unsigned int cpu)
+{
+	struct llist_head *pool;
+	struct llist_node *node;
+	struct scf_check *scfcp;
+	unsigned int slot = 0;
+	void *free_pool[64];
+
+	pool = &per_cpu(scf_free_pool, cpu);
+	node = llist_del_all(pool);
+	while (node) {
+		scfcp = llist_entry(node, struct scf_check, scf_node);
+		node = node->next;
+		free_pool[slot] = scfcp;
+		slot++;
+		if (slot == ARRAY_SIZE(free_pool)) {
+			kfree_bulk(slot, free_pool);
+			slot = 0;
+		}
+	}
+	if (slot)
+		kfree_bulk(slot, free_pool);
+}
+
 // Print torture statistics.  Caller must ensure serialization.
 static void scf_torture_stats_print(void)
 {
@@ -296,7 +332,7 @@ static void scf_handler(void *scfc_in)
 		if (scfcp->scfc_rpc)
 			complete(&scfcp->scfc_completion);
 	} else {
-		kfree(scfcp);
+		scf_add_to_free_list(scfcp);
 	}
 }
 
@@ -363,7 +399,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 				scfp->n_single_wait_ofl++;
 			else
 				scfp->n_single_ofl++;
-			kfree(scfcp);
+			scf_add_to_free_list(scfcp);
 			scfcp = NULL;
 		}
 		break;
@@ -391,7 +427,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 				preempt_disable();
 		} else {
 			scfp->n_single_rpc_ofl++;
-			kfree(scfcp);
+			scf_add_to_free_list(scfcp);
 			scfcp = NULL;
 		}
 		break;
@@ -428,7 +464,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 			pr_warn("%s: Memory-ordering failure, scfs_prim: %d.\n", __func__, scfsp->scfs_prim);
 			atomic_inc(&n_mb_out_errs); // Leak rather than trash!
 		} else {
-			kfree(scfcp);
+			scf_add_to_free_list(scfcp);
 		}
 		barrier(); // Prevent race-reduction compiler optimizations.
 	}
@@ -442,6 +478,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 		schedule_timeout_uninterruptible(1);
 }
 
+
 // SCF test kthread.  Repeatedly does calls to members of the
 // smp_call_function() family of functions.
 static int scftorture_invoker(void *arg)
@@ -479,6 +516,8 @@ static int scftorture_invoker(void *arg)
 	VERBOSE_SCFTORTOUT("scftorture_invoker %d started", scfp->cpu);
 
 	do {
+		scf_cleanup_free_list(scfp->cpu);
+
 		scftorture_invoke_one(scfp, &rand);
 		while (cpu_is_offline(cpu) && !torture_must_stop()) {
 			schedule_timeout_interruptible(HZ / 5);
-- 
2.45.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
  2024-11-04 10:50                           ` [PATCH 2/2] scftorture: Use a lock-less list to free memory Sebastian Andrzej Siewior
@ 2024-11-05  1:00                             ` Boqun Feng
  2024-11-07 11:21                               ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2024-11-05  1:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Paul E. McKenney, Vlastimil Babka, Marco Elver, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm, Thomas Gleixner, Peter Zijlstra

Hi Sebastian,

I love this approach; I think it's better than my workqueue workaround.
A few comments below:

On Mon, Nov 04, 2024 at 11:50:53AM +0100, Sebastian Andrzej Siewior wrote:
> scf_handler() is used as an SMP function call. This function is always
> invoked in IRQ context, even with forced threading enabled. It frees
> memory, which is not allowed on PREEMPT_RT because the locking
> underneath uses sleeping locks.
> 
> Add a per-CPU scf_free_pool where each SMP function call adds its memory
> to be freed. This memory is then freed by scftorture_invoker() on each
> iteration. On the majority of invocations the number of items is less
> than five. If the thread sleeps or gets delayed, the number exceeded 350
> but did not reach 400 in testing; these were the spikes during testing.
> The bulk free of 64 pointers at once should improve the give-back if the
> list grows. The average list size is ~1.3 items per invocation.
> 
> Having one global scf_free_pool with one cleaning thread lets the list
> grow to over 10,000 items with 32 CPUs (again, spikes, not the average),
> especially if the CPU went to sleep. The per-CPU approach looks like a
> good compromise.
> 
> Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> Closes: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  kernel/scftorture.c | 47 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 43 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> index e5546fe256329..ba9f1125821b8 100644
> --- a/kernel/scftorture.c
> +++ b/kernel/scftorture.c
> @@ -97,6 +97,7 @@ struct scf_statistics {
>  static struct scf_statistics *scf_stats_p;
>  static struct task_struct *scf_torture_stats_task;
>  static DEFINE_PER_CPU(long long, scf_invoked_count);
> +static DEFINE_PER_CPU(struct llist_head, scf_free_pool);
>  
>  // Data for random primitive selection
>  #define SCF_PRIM_RESCHED	0
> @@ -133,6 +134,7 @@ struct scf_check {
>  	bool scfc_wait;
>  	bool scfc_rpc;
>  	struct completion scfc_completion;
> +	struct llist_node scf_node;
>  };
>  
>  // Use to wait for all threads to start.
> @@ -148,6 +150,40 @@ static DEFINE_TORTURE_RANDOM_PERCPU(scf_torture_rand);
>  
>  extern void resched_cpu(int cpu); // An alternative IPI vector.
>  
> +static void scf_add_to_free_list(struct scf_check *scfcp)
> +{
> +	struct llist_head *pool;
> +	unsigned int cpu;
> +
> +	cpu = raw_smp_processor_id() % nthreads;
> +	pool = &per_cpu(scf_free_pool, cpu);
> +	llist_add(&scfcp->scf_node, pool);
> +}
> +
> +static void scf_cleanup_free_list(unsigned int cpu)
> +{
> +	struct llist_head *pool;
> +	struct llist_node *node;
> +	struct scf_check *scfcp;
> +	unsigned int slot = 0;
> +	void *free_pool[64];
> +
> +	pool = &per_cpu(scf_free_pool, cpu);
> +	node = llist_del_all(pool);
> +	while (node) {
> +		scfcp = llist_entry(node, struct scf_check, scf_node);
> +		node = node->next;
> +		free_pool[slot] = scfcp;
> +		slot++;
> +		if (slot == ARRAY_SIZE(free_pool)) {
> +			kfree_bulk(slot, free_pool);
> +			slot = 0;
> +		}
> +	}
> +	if (slot)
> +		kfree_bulk(slot, free_pool);
> +}
> +
>  // Print torture statistics.  Caller must ensure serialization.
>  static void scf_torture_stats_print(void)
>  {
> @@ -296,7 +332,7 @@ static void scf_handler(void *scfc_in)
>  		if (scfcp->scfc_rpc)
>  			complete(&scfcp->scfc_completion);
>  	} else {
> -		kfree(scfcp);
> +		scf_add_to_free_list(scfcp);
>  	}
>  }
>  
> @@ -363,7 +399,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
>  				scfp->n_single_wait_ofl++;
>  			else
>  				scfp->n_single_ofl++;
> -			kfree(scfcp);
> +			scf_add_to_free_list(scfcp);
>  			scfcp = NULL;
>  		}
>  		break;
> @@ -391,7 +427,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
>  				preempt_disable();
>  		} else {
>  			scfp->n_single_rpc_ofl++;
> -			kfree(scfcp);
> +			scf_add_to_free_list(scfcp);
>  			scfcp = NULL;
>  		}
>  		break;
> @@ -428,7 +464,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
>  			pr_warn("%s: Memory-ordering failure, scfs_prim: %d.\n", __func__, scfsp->scfs_prim);
>  			atomic_inc(&n_mb_out_errs); // Leak rather than trash!
>  		} else {
> -			kfree(scfcp);
> +			scf_add_to_free_list(scfcp);
>  		}
>  		barrier(); // Prevent race-reduction compiler optimizations.
>  	}
> @@ -442,6 +478,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
>  		schedule_timeout_uninterruptible(1);
>  }
>  
> +
>  // SCF test kthread.  Repeatedly does calls to members of the
>  // smp_call_function() family of functions.
>  static int scftorture_invoker(void *arg)
> @@ -479,6 +516,8 @@ static int scftorture_invoker(void *arg)
>  	VERBOSE_SCFTORTOUT("scftorture_invoker %d started", scfp->cpu);
>  
>  	do {
> +		scf_cleanup_free_list(scfp->cpu);
> +

I think this needs to be:

		scf_cleanup_free_list(cpu);

or

		scf_cleanup_free_list(curcpu);

because scfp->cpu is actually the thread number, and I got a NULL
dereference:

[   14.219225] BUG: unable to handle page fault for address: ffffffffb2ff7210

while running Paul's reproduce command:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"

on my 48-core VM (I think having only 48 cores for 64 threads may be key
to reproducing the NULL dereference).


Another thing is, how do we guarantee that we don't exit the loop
earlier (i.e. while there are still callbacks on the list)? After the
following scftorture_invoke_one(), there could be an IPI pending somewhere,
and we may exit this loop if torture_must_stop() is true. And that IPI
might add its scf_check to the list, but no scf_cleanup_free_list() is
going to handle it, right?

Regards,
Boqun

>  		scftorture_invoke_one(scfp, &rand);
>  		while (cpu_is_offline(cpu) && !torture_must_stop()) {
>  			schedule_timeout_interruptible(HZ / 5);
> -- 
> 2.45.2
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
  2024-11-05  1:00                             ` Boqun Feng
@ 2024-11-07 11:21                               ` Sebastian Andrzej Siewior
  2024-11-07 14:08                                 ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-07 11:21 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, Vlastimil Babka, Marco Elver, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm, Thomas Gleixner, Peter Zijlstra

On 2024-11-04 17:00:19 [-0800], Boqun Feng wrote:
> Hi Sebastian,
Hi Boqun,

…
> I think this needs to be:
> 
> 		scf_cleanup_free_list(cpu);
> 
> or
> 
> 		scf_cleanup_free_list(curcpu);
> 
> because scfp->cpu is actually the thread number, and I got a NULL
> dereference:
> 
> [   14.219225] BUG: unable to handle page fault for address: ffffffffb2ff7210

Right. Replaced with cpu.
…
> 
> Another thing is, how do we guarantee that we don't exit the loop
> earlier (i.e. while there are still callbacks on the list)? After the
> following scftorture_invoke_one(), there could be an IPI pending somewhere,
> and we may exit this loop if torture_must_stop() is true. And that IPI
> might add its scf_check to the list, but no scf_cleanup_free_list() is
> going to handle it, right?

Okay. Assuming that IPIs are done by the time scf_torture_cleanup is
invoked, I added scf_cleanup_free_list() for all CPUs there.

Reposted at
	https://lore.kernel.org/20241107111821.3417762-1-bigeasy@linutronix.de

> Regards,
> Boqun

Sebastian


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
  2024-11-07 11:21                               ` Sebastian Andrzej Siewior
@ 2024-11-07 14:08                                 ` Paul E. McKenney
  2024-11-07 14:43                                   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2024-11-07 14:08 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm, Thomas Gleixner, Peter Zijlstra

On Thu, Nov 07, 2024 at 12:21:07PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-11-04 17:00:19 [-0800], Boqun Feng wrote:
> > Hi Sebastian,
> Hi Boqun,
> 
> …
> > I think this needs to be:
> > 
> > 		scf_cleanup_free_list(cpu);
> > 
> > or
> > 
> > 		scf_cleanup_free_list(curcpu);
> > 
> > because scfp->cpu is actually the thread number, and I got a NULL
> > dereference:
> > 
> > [   14.219225] BUG: unable to handle page fault for address: ffffffffb2ff7210
> 
> Right. Replaced with cpu.
> …
> > 
> > Another thing is, how do we guarantee that we don't exit the loop
> > earlier (i.e. while there are still callbacks on the list)? After the
> > following scftorture_invoke_one(), there could be an IPI pending somewhere,
> > and we may exit this loop if torture_must_stop() is true. And that IPI
> > might add its scf_check to the list, but no scf_cleanup_free_list() is
> > going to handle it, right?
> 
> Okay. Assuming that IPIs are done by the time scf_torture_cleanup is
> invoked, I added scf_cleanup_free_list() for all CPUs there.

This statement in scf_torture_cleanup() is supposed to wait for all
outstanding IPIs:

	smp_call_function(scf_cleanup_handler, NULL, 0);

And the scf_cleanup_handler() function is as follows:

	static void scf_cleanup_handler(void *unused)
	{
	}

Does that work, or am I yet again being overly naive?

> Reposted at
> 	https://lore.kernel.org/20241107111821.3417762-1-bigeasy@linutronix.de

Thank you!

I will do some testing on this later today.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
  2024-11-07 14:08                                 ` Paul E. McKenney
@ 2024-11-07 14:43                                   ` Sebastian Andrzej Siewior
  2024-11-07 14:59                                     ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-07 14:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm, Thomas Gleixner, Peter Zijlstra

On 2024-11-07 06:08:35 [-0800], Paul E. McKenney wrote:
…
> This statement in scf_torture_cleanup() is supposed to wait for all
> outstanding IPIs:
> 
> 	smp_call_function(scf_cleanup_handler, NULL, 0);

This should be
	smp_call_function(scf_cleanup_handler, NULL, 1);

so it queues the function call and waits for its completion. Otherwise
it is queued and might be invoked _later_.

> And the scf_cleanup_handler() function is as follows:
> 
> 	static void scf_cleanup_handler(void *unused)
> 	{
> 	}
> 
> Does that work, or am I yet again being overly naive?

See above. I can send a patch later on if you have no other complaints ;)

> 							Thanx, Paul

Sebastian


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
  2024-11-07 14:43                                   ` Sebastian Andrzej Siewior
@ 2024-11-07 14:59                                     ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-11-07 14:59 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel,
	kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
	iamjoonsoo.kim, akpm, Thomas Gleixner, Peter Zijlstra

On Thu, Nov 07, 2024 at 03:43:00PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-11-07 06:08:35 [-0800], Paul E. McKenney wrote:
> …
> > This statement in scf_torture_cleanup() is supposed to wait for all
> > outstanding IPIs:
> > 
> > 	smp_call_function(scf_cleanup_handler, NULL, 0);
> 
> This should be
> 	smp_call_function(scf_cleanup_handler, NULL, 1);
> 
> so it queues the function call and waits for its completion. Otherwise
> it is queued and might be invoked _later_.
> 
> > And the scf_cleanup_handler() function is as follows:
> > 
> > 	static void scf_cleanup_handler(void *unused)
> > 	{
> > 	}
> > 
> > Does that work, or am I yet again being overly naive?
> 
> See above. I can send a patch later on if you have no other complaints ;)

You got me on that one!  Thank you, and please do feel free to send
a patch.

Interestingly enough, this has never failed.  Perhaps because the usual
scftorture run injects enough idle time that all the IPIs have time
to finish.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-11-07 14:59 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-30 21:05 [BUG] -next lockdep invalid wait context Paul E. McKenney
2024-10-30 21:48 ` Vlastimil Babka
2024-10-30 22:34   ` Marco Elver
2024-10-30 23:04     ` Boqun Feng
2024-10-30 23:10     ` Paul E. McKenney
2024-10-31  7:21       ` Sebastian Andrzej Siewior
2024-10-31  7:35         ` Vlastimil Babka
2024-10-31  7:55           ` Sebastian Andrzej Siewior
2024-10-31  8:18             ` Vlastimil Babka
2024-11-01 17:14               ` Paul E. McKenney
2024-10-31 17:50             ` Paul E. McKenney
2024-11-01 19:50               ` Boqun Feng
2024-11-01 19:54                 ` [PATCH] scftorture: Use workqueue to free scf_check Boqun Feng
2024-11-01 23:35                   ` Paul E. McKenney
2024-11-03  3:35                     ` Boqun Feng
2024-11-03 15:03                       ` Paul E. McKenney
2024-11-04 10:50                         ` [PATCH 1/2] scftorture: Move memory allocation outside of preempt_disable region Sebastian Andrzej Siewior
2024-11-04 10:50                           ` [PATCH 2/2] scftorture: Use a lock-less list to free memory Sebastian Andrzej Siewior
2024-11-05  1:00                             ` Boqun Feng
2024-11-07 11:21                               ` Sebastian Andrzej Siewior
2024-11-07 14:08                                 ` Paul E. McKenney
2024-11-07 14:43                                   ` Sebastian Andrzej Siewior
2024-11-07 14:59                                     ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox