linux-mm.kvack.org archive mirror
* Possible slab-allocator issue
@ 2024-07-22 21:38 Paul E. McKenney
  2024-07-23  9:05 ` Vlastimil Babka
  0 siblings, 1 reply; 3+ messages in thread
From: Paul E. McKenney @ 2024-07-22 21:38 UTC (permalink / raw)
  To: linux-mm
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	roman.gushchin, 42.hyeyoo, hannes, willy, rcu

Hello!

I have been chasing a hang [1] in rcuscale, which is a kernel module
that does performance testing for RCU updaters.  I have a good long-term
workaround for this problem, namely an extremely stupid custom allocator,
so I don't need a fix, but I am nevertheless posting this just in case
this affects others.  Or in case I am doing something stupid that I
should avoid doing in the future.

I first encountered this on RCU's "dev" branch, but the problem also
exists in current mainline [2].  One high-probability reproducer for
this hang, tested on both AMD and Intel x86, is as follows:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 5 --configs TREE --bootargs "rcuscale.scale_type=tasks rcuscale.minruntime=5 rcuscale.nreaders=0 rcuscale.nwriters rcuscale.gp_async=1 rcuscale.gp_async_max=100000" --trust-make

This script builds a kernel, requires that virtualization be enabled,
and uses the qemu-system-x86_64 command.  I have not run this on a system
with fewer than 16 CPUs.

When it fails, it does so a bit less than ten minutes into the run,
which results in a non-zero exit code from kvm.sh.  Rare successes
normally consume kernel-build time plus less than a minute of run time.

The failure mode is CPU 0 (and always CPU 0) failing to get out of this
(admittedly ugly, since reworked) "retry" loop in the rcu_scale_writer()
function in kernel/rcu/rcuscale.c:

 1 retry:
 2   if (!rhp)
 3     rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
 4   if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) {
 5     atomic_inc(this_cpu_ptr(&n_async_inflight));
 6     cur_ops->async(rhp, rcu_scale_async_cb);
 7     rhp = NULL;
 8   } else if (!kthread_should_stop()) {
 9     cur_ops->gp_barrier();
10     goto retry;
11   } else {
12     kfree(rhp); /* Because we are stopping. */
13   }

The loop begins with the "retry" label on line 1, finds "rhp"
equal to NULL on line 2, gets a NULL from kmalloc(..., GFP_KERNEL)
on line 3, proceeds to the "else if" clause on lines 8-11,
invokes rcu_barrier_tasks() on line 9 (because that is what the
rcuscale.scale_type=tasks kernel boot parameter tells it to do), then
line 10 jumps back to the retry label.  This repeats thousands of times,
consuming some minutes.  There is no drama in the console log, as in no
mention of OOM killer, allocation failures, or much of anything other
than Tasks RCU switching to per-CPU callback queueing due to the heavy
load and the expected rcuscale output.

Note that all 16 CPUs are looping allocating and invoking either
call_rcu_tasks() when allocation succeeds or rcu_barrier_tasks() when
allocation fails.  So CPU 0 could simply be quite unlucky or have other
work so that when the rcu_barrier_tasks() waits for memory to be released,
the other CPUs allocate it all before CPU 0's rcu_scale_writer() task
is awakened.  But again, there is no sign of memory-system drama on
the console log.

I am good, as I implemented a stupid per-CPU-array-based allocator,
which avoids this issue, admittedly quite crudely.
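
For concreteness, here is a sketch of the sort of thing I mean.  The names
and the array size are made up for illustration (this is not the actual
code), and the sketch ignores preemption and cross-CPU frees that a real
version has to deal with:

#include <linux/percpu.h>
#include <linux/slab.h>

#define SCALE_POOL_SIZE 128		/* Made-up size. */

struct scale_pool {
	struct rcu_head *objs[SCALE_POOL_SIZE];
	int nobjs;
};
static DEFINE_PER_CPU(struct scale_pool, scale_pool);

/* Hand out one pre-allocated rcu_head from this CPU's array, or NULL. */
static struct rcu_head *scale_alloc(void)
{
	struct scale_pool *sp = this_cpu_ptr(&scale_pool);

	return sp->nobjs ? sp->objs[--sp->nobjs] : NULL;
}

/* Put an rcu_head back into this CPU's array, freeing any overflow. */
static void scale_free(struct rcu_head *rhp)
{
	struct scale_pool *sp = this_cpu_ptr(&scale_pool);

	if (sp->nobjs < SCALE_POOL_SIZE)
		sp->objs[sp->nobjs++] = rhp;
	else
		kfree(rhp);
}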

Johannes suggested turning on tracing for the kmem:kmalloc and
kmem:mm_page_alloc trace events, which I did using this command:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 5 --configs TREE --kconfig "CONFIG_DEBUG_KERNEL=y CONFIG_RCU_TRACE=y CONFIG_FTRACE=y" --bootargs "rcuscale.scale_type=tasks rcuscale.minruntime=5 rcuscale.nreaders=0 rcuscale.nwriters rcuscale.gp_async=1 rcuscale.gp_async_max=100000 trace_event=kmem:kmalloc,kmem:mm_page_alloc trace_buf_size=3k" --trust-make

I collected the .config file [3] and console output [4].  The console
output features more than 1500 kmalloc trace events and not quite 10
mm_page_alloc trace events.

Please let me know if more information would be helpful.

							Thanx, Paul

[1] https://docs.google.com/document/d/1M_SJ_hVUe-jtkH6BjHzzRFOnAiu1sqThlFnx8Jfp-fg/edit?usp=sharing
[2] bbb3556c014d ("Merge tag 'keys-next-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd"), for example.
[3] https://drive.google.com/file/d/1lZmIvatLyVYzdRK_rv2xNNWDBiGwyQvk/view?usp=sharing
[4] https://drive.google.com/file/d/1jh4hJG877j0Rp65153wtmzKm6jErXZ9n/view?usp=sharing



* Re: Possible slab-allocator issue
  2024-07-22 21:38 Possible slab-allocator issue Paul E. McKenney
@ 2024-07-23  9:05 ` Vlastimil Babka
  2024-07-23 16:51   ` Paul E. McKenney
  0 siblings, 1 reply; 3+ messages in thread
From: Vlastimil Babka @ 2024-07-23  9:05 UTC (permalink / raw)
  To: paulmck, linux-mm
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, hannes, willy, rcu

On 7/22/24 11:38 PM, Paul E. McKenney wrote:
> Hello!

Hi!

> I have been chasing a hang [1] in rcuscale, which is a kernel module
> that does performance testing for RCU updaters.  I have a good long-term
> workaround for this problem, namely an extremely stupid custom allocator,
> so I don't need a fix, but I am nevertheless posting this just in case
> this affects others.  Or in case I am doing something stupid that I
> should avoid doing in the future.
> 
> I first encountered this on RCU's "dev" branch, but the problem also
> exists in current mainline [2].  One high-probability reproducer for
> this hang, tested on both AMD and Intel x86, is as follows:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 5 --configs TREE --bootargs "rcuscale.scale_type=tasks rcuscale.minruntime=5 rcuscale.nreaders=0 rcuscale.nwriters rcuscale.gp_async=1 rcuscale.gp_async_max=100000" --trust-make
> 
> This script builds a kernel, requires that virtualization be enabled,
> and uses the qemu-system-x86_64 command.  I have not run this on a system
> with fewer than 16 CPUs.
> 
> When it fails, it does so a bit less than ten minutes into the run,
> which results in a non-zero exit code from kvm.sh.  Rare successes
> normally consume kernel-build time plus less than a minute of run time.
> 
> The failure mode is CPU 0 (and always CPU 0) failing to get out of this
> (admittedly ugly, since reworked) "retry" loop in the rcu_scale_writer()
> function in kernel/rcu/rcuscale.c:
> 
>  1 retry:
>  2   if (!rhp)
>  3     rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
>  4   if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) {
>  5     atomic_inc(this_cpu_ptr(&n_async_inflight));
>  6     cur_ops->async(rhp, rcu_scale_async_cb);
>  7     rhp = NULL;
>  8   } else if (!kthread_should_stop()) {
>  9     cur_ops->gp_barrier();
> 10     goto retry;
> 11   } else {
> 12     kfree(rhp); /* Because we are stopping. */

Noted that there is no "rhp = NULL;" here, but we probably really shouldn't
reach this point again, because torture_must_stop() includes
kthread_should_stop(), so it should be fine.

> 13   }
> 
> The loop begins with the "retry" label on line 1, finds "rhp"
> equal to NULL on line 2, gets a NULL from kmalloc(..., GFP_KERNEL)
> on line 3, proceeds to the "else if" clause on lines 8-11,

So it's been verified that rhp is indeed null after kmalloc() and it's not
going to the "else if" because of the rest of the condition?

I wonder how this loop works: it increments a percpu counter,
n_async_inflight, and rcu_scale_async_cb() decrements it, but can't the
callback be invoked on a different CPU, leaving these counters imbalanced?

> invokes rcu_barrier_tasks() on line 9 (because that is what the
> rcuscale.scale_type=tasks kernel boot parameter tells it to do), then
> line 10 jumps back to the retry label.  This repeats thousands of times,
> consuming some minutes.  There is no drama in the console log, as in no
> mention of OOM killer, allocation failures, or much of anything other
> than Tasks RCU switching to per-CPU callback queueing due to the heavy
> load and the expected rcuscale output.

A GFP_KERNEL failure should definitely produce an OOM or allocation-failure
warning (and an allocation this small is "too small to fail" to start with :)).
Silently returning NULL is, uh, inconceivable.  That would really require a bug
in the implementation, or perhaps some slab-structure corruption due to a
double free or a use-after-free (although I've seen those always manifest in a
different, much more noisy way).
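
To illustrate (a made-up snippet, nothing from rcuscale): by default a
failing small GFP_KERNEL allocation warns loudly, and you would have to opt
out explicitly to get a silent NULL back:

/* Illustration only, not rcuscale code. */
#include <linux/slab.h>

static void *alloc_noisy(size_t size)
{
	/* A failure here would log an allocation-failure warning. */
	return kmalloc(size, GFP_KERNEL);
}

static void *alloc_silent(size_t size)
{
	/* Only with __GFP_NOWARN would a failure return NULL silently. */
	return kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
}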

> Note that all 16 CPUs are looping allocating and invoking either
> call_rcu_tasks() when allocation succeeds or rcu_barrier_tasks() when
> allocation fails.  So CPU 0 could simply be quite unlucky or have other
> work so that when the rcu_barrier_tasks() waits for memory to be released,
> the other CPUs allocate it all before CPU 0's rcu_scale_writer() task
> is awakened.  But again, there is no sign of memory-system drama on
> the console log.
> 
> I am good, as I implemented a stupid per-CPU-array-based allocator,
> which avoids this issue, admittedly quite crudely.

That shouldn't help if it were a problem with the n_async_inflight imbalance
I suggested above, but who knows: maybe it's being busy in kmalloc() that
causes some RCU callbacks to be executed elsewhere than on the CPU that
submitted them, and using your allocator avoids that?

> Johannes suggested turning on tracing for the kmem:kmalloc and
> kmem:mm_page_alloc trace events, which I did using this command:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 5 --configs TREE --kconfig "CONFIG_DEBUG_KERNEL=y CONFIG_RCU_TRACE=y CONFIG_FTRACE=y" --bootargs "rcuscale.scale_type=tasks rcuscale.minruntime=5 rcuscale.nreaders=0 rcuscale.nwriters rcuscale.gp_async=1 rcuscale.gp_async_max=100000 trace_event=kmem:kmalloc,kmem:mm_page_alloc trace_buf_size=3k" --trust-make
> 
> I collected the .config file [3] and console output [4].  The console
> output features more than 1500 kmalloc trace events and not quite 10
> mm_page_alloc trace events.

Hm but tracing doesn't output to console so I don't see any trace events there.

> Please let me know if more information would be helpful.
> 
> 							Thanx, Paul
> 
> [1] https://docs.google.com/document/d/1M_SJ_hVUe-jtkH6BjHzzRFOnAiu1sqThlFnx8Jfp-fg/edit?usp=sharing
> [2] bbb3556c014d ("Merge tag 'keys-next-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd"), for example.
> [3] https://drive.google.com/file/d/1lZmIvatLyVYzdRK_rv2xNNWDBiGwyQvk/view?usp=sharing
> [4] https://drive.google.com/file/d/1jh4hJG877j0Rp65153wtmzKm6jErXZ9n/view?usp=sharing




* Re: Possible slab-allocator issue
  2024-07-23  9:05 ` Vlastimil Babka
@ 2024-07-23 16:51   ` Paul E. McKenney
  0 siblings, 0 replies; 3+ messages in thread
From: Paul E. McKenney @ 2024-07-23 16:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, 42.hyeyoo, hannes, willy, rcu

On Tue, Jul 23, 2024 at 11:05:16AM +0200, Vlastimil Babka wrote:
> On 7/22/24 11:38 PM, Paul E. McKenney wrote:
> > Hello!
> 
> Hi!

TL;DR:  My mistakes in testing, nothing to see here, thank you for
bearing with me, and apologies for wasting your time.

> > I have been chasing a hang [1] in rcuscale, which is a kernel module
> > that does performance testing for RCU updaters.  I have a good long-term
> > workaround for this problem, namely an extremely stupid custom allocator,
> > so I don't need a fix, but I am nevertheless posting this just in case
> > this affects others.  Or in case I am doing something stupid that I
> > should avoid doing in the future.
> > 
> > I first encountered this on RCU's "dev" branch, but the problem also
> > exists in current mainline [2].  One high-probability reproducer for
> > this hang, tested on both AMD and Intel x86, is as follows:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 5 --configs TREE --bootargs "rcuscale.scale_type=tasks rcuscale.minruntime=5 rcuscale.nreaders=0 rcuscale.nwriters rcuscale.gp_async=1 rcuscale.gp_async_max=100000" --trust-make
> > 
> > This script builds a kernel, requires that virtualization be enabled,
> > and uses the qemu-system-x86_64 command.  I have not run this on a system
> > with fewer than 16 CPUs.
> > 
> > When it fails, it does so a bit less than ten minutes into the run,
> > which results in a non-zero exit code from kvm.sh.  Rare successes
> > normally consume kernel-build time plus less than a minute of run time.
> > 
> > The failure mode is CPU 0 (and always CPU 0) failing to get out of this
> > (admittedly ugly, since reworked) "retry" loop in the rcu_scale_writer()
> > function in kernel/rcu/rcuscale.c:
> > 
> >  1 retry:
> >  2   if (!rhp)
> >  3     rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
> >  4   if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) {
> >  5     atomic_inc(this_cpu_ptr(&n_async_inflight));
> >  6     cur_ops->async(rhp, rcu_scale_async_cb);
> >  7     rhp = NULL;
> >  8   } else if (!kthread_should_stop()) {
> >  9     cur_ops->gp_barrier();
> > 10     goto retry;
> > 11   } else {
> > 12     kfree(rhp); /* Because we are stopping. */
> 
> Noted that there is no "rhp = NULL;" here, but we probably really shouldn't
> reach this point again, because torture_must_stop() includes
> kthread_should_stop(), so it should be fine.
> 
> > 13   }
> > 
> > The loop begins with the "retry" label on line 1, finds "rhp"
> > equal to NULL on line 2, gets a NULL from kmalloc(..., GFP_KERNEL)
> > on line 3, proceeds to the "else if" clause on lines 8-11,
> 
> So it's been verified that rhp is indeed null after kmalloc() and it's not
> going to the "else if" because of the rest of the condition?

Yes, but as it turns out, incorrectly.  :-(

> I wonder how this loop works: it increments a percpu counter,
> n_async_inflight, and rcu_scale_async_cb() decrements it, but can't the
> callback be invoked on a different CPU, leaving these counters imbalanced?

The theory was that, given the configuration, callbacks must be invoked
on the CPU on which they were queued.  Which is the case.  Except that
there is no per-thread sched_setaffinity(), so there is no guarantee that
the callback will be queued on the same CPU whose counter was incremented.

I now have a patch that associates the counter with the task instead
of the CPU, which will take care of things even on configurations on
which callbacks can be invoked on different CPUs (CPU hotplug enabled
or callback offloading enabled).
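
Something like the following sketch (made-up names, not the actual patch)
shows the idea: the callback decrements a counter owned by the writer task
that queued it, instead of a per-CPU counter:

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Sketch only: one asynchronously waited-on grace-period request. */
struct async_unit {
	struct rcu_head au_rh;		/* Passed to cur_ops->async(). */
	atomic_t *au_inflight;		/* Counter owned by the writer task. */
};

static void rcu_scale_async_cb_sketch(struct rcu_head *rhp)
{
	struct async_unit *aup = container_of(rhp, struct async_unit, au_rh);

	/* Decrement the queueing task's counter, whichever CPU runs this. */
	atomic_dec(aup->au_inflight);
	kfree(aup);
}

/*
 * The writer side then does the following, with "atomic_t my_inflight"
 * local to the writer task rather than per-CPU:
 *
 *	aup->au_inflight = &my_inflight;
 *	atomic_inc(&my_inflight);
 *	cur_ops->async(&aup->au_rh, rcu_scale_async_cb_sketch);
 */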

So it was a bug in my test code.

On the other hand, my test code has found a great many bugs, so I guess
in the long run I have been winning?

> > invokes rcu_barrier_tasks() on line 9 (because that is what the
> > rcuscale.scale_type=tasks kernel boot parameter tells it to do), then
> > line 10 jumps back to the retry label.  This repeats thousands of times,
> > consuming some minutes.  There is no drama in the console log, as in no
> > mention of OOM killer, allocation failures, or much of anything other
> > than Tasks RCU switching to per-CPU callback queueing due to the heavy
> > load and the expected rcuscale output.
> 
> A GFP_KERNEL failure should definitely produce an OOM or allocation-failure
> warning (and an allocation this small is "too small to fail" to start with :)).
> Silently returning NULL is, uh, inconceivable.  That would really require a bug
> in the implementation, or perhaps some slab-structure corruption due to a
> double free or a use-after-free (although I've seen those always manifest in a
> different, much more noisy way).

Good to know that my initial belief was not wrong.

> > Note that all 16 CPUs are looping allocating and invoking either
> > call_rcu_tasks() when allocation succeeds or rcu_barrier_tasks() when
> > allocation fails.  So CPU 0 could simply be quite unlucky or have other
> > work so that when the rcu_barrier_tasks() waits for memory to be released,
> > the other CPUs allocate it all before CPU 0's rcu_scale_writer() task
> > is awakened.  But again, there is no sign of memory-system drama on
> > the console log.
> > 
> > I am good, as I implemented a stupid per-CPU-array-based allocator,
> > which avoids this issue, admittedly quite crudely.
> 
> That shouldn't help if it were a problem with the n_async_inflight imbalance
> I suggested above, but who knows: maybe it's being busy in kmalloc() that
> causes some RCU callbacks to be executed elsewhere than on the CPU that
> submitted them, and using your allocator avoids that?

I lost interest in completing the bisection, but given that it was
between v6.4 and v6.6, my guess is that the aforementioned migration
became more probable.

> > Johannes suggested turning on tracing for the kmem:kmalloc and
> > kmem:mm_page_alloc trace events, which I did using this command:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 5 --configs TREE --kconfig "CONFIG_DEBUG_KERNEL=y CONFIG_RCU_TRACE=y CONFIG_FTRACE=y" --bootargs "rcuscale.scale_type=tasks rcuscale.minruntime=5 rcuscale.nreaders=0 rcuscale.nwriters rcuscale.gp_async=1 rcuscale.gp_async_max=100000 trace_event=kmem:kmalloc,kmem:mm_page_alloc trace_buf_size=3k" --trust-make
> > 
> > I collected the .config file [3] and console output [4].  The console
> > output features more than 1500 kmalloc trace events and not quite 10
> > mm_page_alloc trace events.
> 
> Hm but tracing doesn't output to console so I don't see any trace events there.

It does when you invoke ftrace_dump(), but I have to give you the
right console output (something about doing three different bisections
yesterday).  Not that it matters, but here you go:

https://drive.google.com/file/d/1ufjKYGtgf8s8BVS9DHbEB0cftA7cKbsY/view?usp=sharing

Search for "Dumping ftrace buffer:".

Again, please accept my apologies for wasting your time!

							Thanx, Paul

> > Please let me know if more information would be helpful.
> > 
> > 							Thanx, Paul
> > 
> > [1] https://docs.google.com/document/d/1M_SJ_hVUe-jtkH6BjHzzRFOnAiu1sqThlFnx8Jfp-fg/edit?usp=sharing
> > [2] bbb3556c014d ("Merge tag 'keys-next-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd"), for example.
> > [3] https://drive.google.com/file/d/1lZmIvatLyVYzdRK_rv2xNNWDBiGwyQvk/view?usp=sharing
> > [4] https://drive.google.com/file/d/1jh4hJG877j0Rp65153wtmzKm6jErXZ9n/view?usp=sharing
> 

