On Wed, 27 April 2011 Bruno Prémont wrote:
> On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
> > On Tue, 26 Apr 2011, Linus Torvalds wrote:
> > > On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont wrote:
> > > >
> > > > Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
> > > > context switches and exactly one non-voluntary one.
> > > >
> > > > In addition when rcu_kthread has stopped doing its work
> > > > `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
> > > > blocks on disabling swap device).

Apparently it's not swapoff but `umount -a -t tmpfs` that's getting
stuck here. Manual swapoff worked.

The stuck umount:
[ 1714.960735] umount          D 5a000040  5668 20331  20324 0x00000000
[ 1714.960735]  c3c99e5c 00000086 dd407900 5a000040 dd25a1a8 dd407900 dd25a120 c3c99e0c
[ 1714.960735]  c3c99e24 c10c1be2 c14d9f20 c3c99e5c c3c8c680 c3c8c680 000000bb c3c99e24
[ 1714.960735]  c10c0b88 dd25a120 dd407900 ddfd4b40 c3c99e4c ddfc9d20 dd402380 5a000010
[ 1714.960735] Call Trace:
[ 1714.960735]  [<c10c1be2>] ? check_object+0x92/0x210
[ 1714.960735]  [<c10c0b88>] ? init_object+0x38/0x70
[ 1714.960735]  [<c10c1be2>] ? check_object+0x92/0x210
[ 1714.960735]  [<c13cb37d>] schedule_timeout+0x16d/0x280
[ 1714.960735]  [<c10c0b88>] ? init_object+0x38/0x70
[ 1714.960735]  [<c10c2122>] ? free_debug_processing+0x112/0x1f0
[ 1714.960735]  [<c10a3791>] ? shmem_put_super+0x11/0x20
[ 1714.960735]  [<c13cae9c>] wait_for_common+0x9c/0x150
[ 1714.960735]  [<c102c890>] ? try_to_wake_up+0x170/0x170
[ 1714.960735]  [<c13caff2>] wait_for_completion+0x12/0x20
[ 1714.960735]  [<c1075ad7>] rcu_barrier_sched+0x47/0x50
[ 1714.960735]  [<c104d3c0>] ? alloc_pid+0x370/0x370
[ 1714.960735]  [<c10ce74a>] deactivate_locked_super+0x3a/0x60
[ 1714.960735]  [<c10ce948>] deactivate_super+0x48/0x70
[ 1714.960735]  [<c10e7427>] mntput_no_expire+0x87/0xe0
[ 1714.960735]  [<c10e7800>] sys_umount+0x60/0x320
[ 1714.960735]  [<c10b231a>] ? remove_vma+0x3a/0x50
[ 1714.960735]  [<c10b3b22>] ? do_munmap+0x212/0x2f0
[ 1714.960735]  [<c10e7ad9>] sys_oldumount+0x19/0x20
[ 1714.960735]  [<c13cce10>] sysenter_do_call+0x12/0x26

which looks like lock conflict with RCU:

[ 1714.960735] rcu_kthread     R running   6924     6      2 0x00000000
[ 1714.960735]  dd473f28 00000046 5a000240 dbd6ba7c dd407360 ddfaf840 dbd6b740 dd473ed8
[ 1714.960735]  ddfaee00 dd407a20 5a000000 dd473f28 dd445040 dd445040 0000009c dd473f0c
[ 1714.960735]  c10c1be2 c14d9f20 dbf7057c 0000005a 000000bb 000000bb dd473f0c c10c0b88
[ 1714.960735] Call Trace:
[ 1714.960735]  [<c10c1be2>] ? check_object+0x92/0x210
[ 1714.960735]  [<c10c0b88>] ? init_object+0x38/0x70
[ 1714.960735]  [<c103fd8d>] ? lock_timer_base+0x2d/0x70
[ 1714.960735]  [<c13cb347>] schedule_timeout+0x137/0x280
[ 1714.960735]  [<c103fef0>] ? sys_gettid+0x20/0x20
[ 1714.960735]  [<c13cb4e4>] schedule_timeout_interruptible+0x14/0x20
[ 1714.960735]  [<c1075a70>] rcu_kthread+0xa0/0xc0
[ 1714.960735]  [<c1050190>] ? wake_up_bit+0x70/0x70
[ 1714.960735]  [<c10759d0>] ? rcu_process_callbacks+0x60/0x60
[ 1714.960735]  [<c104fc04>] kthread+0x74/0x80
[ 1714.960735]  [<c104fb90>] ? flush_kthread_worker+0x90/0x90
[ 1714.960735]  [<c13cd336>] kernel_thread_helper+0x6/0xd

(I have rest of sysreq+t output available in case someone wants it)

> > > > If I get to do it when I get back home I will manually try to swapoff
> > > > and take process traces with sysrq-t.
> > > 
> > > That "exactly one non-voluntary one" sounds like the smoking gun.

It's not the gun we're looking for as it's already smoking long before
any RCU-managed slabs start piling up (e.g. already when I get at a
shell after boot sequence).

Voluntary context switches stay constant from the time on SLABs pile up.
(which makes sense as it doesn't run get CPU slices anymore)

> > Can you please enable CONFIG_SCHED_DEBUG and provide the output of
> > /proc/sched_stat when the problem surfaces and a minute after the
> > first snapshot?

hm, did you mean CONFIG_SCHEDSTAT or /proc/sched_debug?

I did use CONFIG_SCHED_DEBUG (and there is no /proc/sched_stat) so I took
/proc/sched_debug which exists... (attached, taken about 7min and +1min
after SLABs started piling up), though build processes were SIGSTOPped
during first minute.

printk wrote (in case its timestamp is useful, more below):
[  518.480103] sched: RT throttling activated

If my choice was the wrong one, please tell so I can generate the other
ones.

> > Also please apply the patch below and check, whether the printk shows
> > up in your dmesg.
> 
> > Index: linux-2.6-tip/kernel/sched_rt.c
> > ===================================================================
> > --- linux-2.6-tip.orig/kernel/sched_rt.c
> > +++ linux-2.6-tip/kernel/sched_rt.c
> > @@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str
> >  
> >  	if (rt_rq->rt_time > runtime) {
> >  		rt_rq->rt_throttled = 1;
> > +		printk_once(KERN_WARNING "sched: RT throttling activated\n");

This gun is triggering right before RCU-managed slabs start piling up as
visible under slabtop so chances are it's at least a related!

Bruno


> >  		if (rt_rq_throttled(rt_rq)) {
> >  			sched_rt_rq_dequeue(rt_rq);
> >  			return 1;