On Mon, 25 April 2011 "Paul E. McKenney" wrote:
> On Mon, Apr 25, 2011 at 08:36:06PM +0200, Bruno Prémont wrote:
> > On Mon, 25 April 2011 Linus Torvalds wrote:
> > > On Mon, Apr 25, 2011 at 10:00 AM, Bruno Prémont wrote:
> > > >
> > > > I hope tiny-rcu is not that broken... as it would mean driving any
> > > > PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> > > > packages (and probably also just unpacking larger tarballs or running
> > > > things like du).
> > >
> > > I'm sure that TINYRCU can be fixed if it really is the problem.
> > >
> > > So I just want to make sure that we know what the root cause of your
> > > problem is. It's quite possible that it _is_ a real leak of filp or
> > > something, but before possibly wasting time trying to figure that out,
> > > let's see if your config is to blame.
> >
> > With changed config (PREEMPT=y, TREE_PREEMPT_RCU=y) I haven't reproduced
> > yet.
> >
> > When I was reproducing with TINYRCU things went normally for some time
> > until suddenly slabs stopped being freed.
>
> Hmmm... If the system is responsive during this time, could you please
> do the following after the slabs stop being freed?
>
> ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cpu_time,cmd | grep '\[rcu'

Looks like tinyrcu is not innocent (or at least it makes the bug appear
much more easily).

With PREEMPT + TREE_PREEMPT_RCU the system was stable while compiling for
over 2 hours; after switching to TINY_RCU, the filp count started increasing
pretty early after beginning to compile.

All the relevant information is attached (PREEMPT+TINY_RCU):
  config.gz
  ps auxf  |
  slabinfo | twice, once early (1-*), the second 30 minutes later (2-*)
  meminfo  |

ls -l proc/*/fd produces 658 lines for the 1-* series of numbers, 300 for 2-*.

In both cases
  ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cputime,cmd | grep '\[rcu'
returns the same information:
  6 FF 1 1 R R 0 00:00:00 [rcu_kthread]

According to slabtop the filp count is increasing permanently (about +1000
every 3 seconds), probably because of top (1s refresh rate) and collectd
(10s rate) scanning /proc (without top, it increases by about 300 every 10s).

Running something like
  `for ((X=0; X < 200; X++)); do /bin/true; done`
causes the pid, task_struct and signal_cache slab counts to increase by
about 200, but no zombies are left behind.

1-*  Taken a few minutes after starting the compile process, but after
     having SIGSTOPed the compiling process tree
2-*  About 30 minutes later, after having killed the compile process tree,
     run the above for loop multiple times and closed most terminal
     sessions (including top)

Between 1-slabinfo and 2-slabinfo some values increased (a lot) while a few
decreased. I don't know which ones are RCU-affected and which ones are not.

Bruno
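
PS: For anyone who wants to compare the two slabinfo snapshots or watch a
single cache grow, here is a minimal sketch of how that could be done. It is
not what produced the numbers above (those came from slabtop). The helper
names slab_active and slab_diff are made up for illustration, and the sketch
assumes the usual /proc/slabinfo layout where the first column is the cache
name and the second is the number of active objects.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.
# Assumption: slabinfo lines look like "<name> <active_objs> <num_objs> ...".

# Print the active object count of one cache.
# $1 = cache name (e.g. filp), $2 = slabinfo file (default: /proc/slabinfo)
slab_active() {
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/proc/slabinfo}"
}

# Print "<name> <delta>" for every cache whose active object count differs
# between two snapshots. $1 = earlier snapshot, $2 = later snapshot
slab_diff() {
    awk '
        /^slabinfo/ || /^#/ { next }               # skip the header lines
        NR == FNR { before[$1] = $2; next }        # first file: remember counts
        ($1 in before) && ($2 != before[$1]) {     # second file: report changes
            printf "%s %d\n", $1, $2 - before[$1]
        }
    ' "$1" "$2"
}

# Example use (reading /proc/slabinfo usually requires root):
#   slab_diff 1-slabinfo 2-slabinfo | sort -k2 -n
#   while sleep 3; do slab_active filp; done
```

Sorting the slab_diff output by the second column puts the caches that grew
the most at the bottom, which should make the filp growth easy to spot next
to whatever else is piling up.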