On Wed, 27 April 2011 Bruno Prémont wrote: > On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote: > > On Tue, 26 Apr 2011, Linus Torvalds wrote: > > > On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont wrote: > > > > > > > > Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary > > > > context switches and exactly one non-voluntary one. > > > > > > > > In addition when rcu_kthread has stopped doing its work > > > > `swapoff $(swapdevice)` seems to block forever (at least normal shutdown > > > > blocks on disabling swap device). Apparently it's not swapoff but `umount -a -t tmpfs` that's getting stuck here. Manual swapoff worked. The stuck umount: [ 1714.960735] umount D 5a000040 5668 20331 20324 0x00000000 [ 1714.960735] c3c99e5c 00000086 dd407900 5a000040 dd25a1a8 dd407900 dd25a120 c3c99e0c [ 1714.960735] c3c99e24 c10c1be2 c14d9f20 c3c99e5c c3c8c680 c3c8c680 000000bb c3c99e24 [ 1714.960735] c10c0b88 dd25a120 dd407900 ddfd4b40 c3c99e4c ddfc9d20 dd402380 5a000010 [ 1714.960735] Call Trace: [ 1714.960735] [] ? check_object+0x92/0x210 [ 1714.960735] [] ? init_object+0x38/0x70 [ 1714.960735] [] ? check_object+0x92/0x210 [ 1714.960735] [] schedule_timeout+0x16d/0x280 [ 1714.960735] [] ? init_object+0x38/0x70 [ 1714.960735] [] ? free_debug_processing+0x112/0x1f0 [ 1714.960735] [] ? shmem_put_super+0x11/0x20 [ 1714.960735] [] wait_for_common+0x9c/0x150 [ 1714.960735] [] ? try_to_wake_up+0x170/0x170 [ 1714.960735] [] wait_for_completion+0x12/0x20 [ 1714.960735] [] rcu_barrier_sched+0x47/0x50 [ 1714.960735] [] ? alloc_pid+0x370/0x370 [ 1714.960735] [] deactivate_locked_super+0x3a/0x60 [ 1714.960735] [] deactivate_super+0x48/0x70 [ 1714.960735] [] mntput_no_expire+0x87/0xe0 [ 1714.960735] [] sys_umount+0x60/0x320 [ 1714.960735] [] ? remove_vma+0x3a/0x50 [ 1714.960735] [] ? do_munmap+0x212/0x2f0 [ 1714.960735] [] sys_oldumount+0x19/0x20 [ 1714.960735] [] sysenter_do_call+0x12/0x26 which looks like lock conflict with RCU: [ 1714.960735] rcu_kthread R running 6924 6 2 0x00000000 [ 1714.960735] dd473f28 00000046 5a000240 dbd6ba7c dd407360 ddfaf840 dbd6b740 dd473ed8 [ 1714.960735] ddfaee00 dd407a20 5a000000 dd473f28 dd445040 dd445040 0000009c dd473f0c [ 1714.960735] c10c1be2 c14d9f20 dbf7057c 0000005a 000000bb 000000bb dd473f0c c10c0b88 [ 1714.960735] Call Trace: [ 1714.960735] [] ? check_object+0x92/0x210 [ 1714.960735] [] ? init_object+0x38/0x70 [ 1714.960735] [] ? lock_timer_base+0x2d/0x70 [ 1714.960735] [] schedule_timeout+0x137/0x280 [ 1714.960735] [] ? sys_gettid+0x20/0x20 [ 1714.960735] [] schedule_timeout_interruptible+0x14/0x20 [ 1714.960735] [] rcu_kthread+0xa0/0xc0 [ 1714.960735] [] ? wake_up_bit+0x70/0x70 [ 1714.960735] [] ? rcu_process_callbacks+0x60/0x60 [ 1714.960735] [] kthread+0x74/0x80 [ 1714.960735] [] ? flush_kthread_worker+0x90/0x90 [ 1714.960735] [] kernel_thread_helper+0x6/0xd (I have rest of sysreq+t output available in case someone wants it) > > > > If I get to do it when I get back home I will manually try to swapoff > > > > and take process traces with sysrq-t. > > > > > > That "exactly one non-voluntary one" sounds like the smoking gun. It's not the gun we're looking for as it's already smoking long before any RCU-managed slabs start piling up (e.g. already when I get at a shell after boot sequence). Voluntary context switches stay constant from the time on SLABs pile up. (which makes sense as it doesn't run get CPU slices anymore) > > Can you please enable CONFIG_SCHED_DEBUG and provide the output of > > /proc/sched_stat when the problem surfaces and a minute after the > > first snapshot? hm, did you mean CONFIG_SCHEDSTAT or /proc/sched_debug? I did use CONFIG_SCHED_DEBUG (and there is no /proc/sched_stat) so I took /proc/sched_debug which exists... (attached, taken about 7min and +1min after SLABs started piling up), though build processes were SIGSTOPped during first minute. printk wrote (in case its timestamp is useful, more below): [ 518.480103] sched: RT throttling activated If my choice was the wrong one, please tell so I can generate the other ones. > > Also please apply the patch below and check, whether the printk shows > > up in your dmesg. > > > Index: linux-2.6-tip/kernel/sched_rt.c > > =================================================================== > > --- linux-2.6-tip.orig/kernel/sched_rt.c > > +++ linux-2.6-tip/kernel/sched_rt.c > > @@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str > > > > if (rt_rq->rt_time > runtime) { > > rt_rq->rt_throttled = 1; > > + printk_once(KERN_WARNING "sched: RT throttling activated\n"); This gun is triggering right before RCU-managed slabs start piling up as visible under slabtop so chances are it's at least a related! Bruno > > if (rt_rq_throttled(rt_rq)) { > > sched_rt_rq_dequeue(rt_rq); > > return 1;