On Mon, 25 April 2011 "Paul E. McKenney" wrote:
> On Mon, Apr 25, 2011 at 08:36:06PM +0200, Bruno Prémont wrote:
> > On Mon, 25 April 2011 Linus Torvalds wrote:
> > > On Mon, Apr 25, 2011 at 10:00 AM, Bruno Prémont wrote:
> > > >
> > > > I hope tiny-rcu is not that broken... as it would mean driving any
> > > > PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> > > > packages (and probably also just unpacking larger tarballs or running
> > > > things like du).
> > > 
> > > I'm sure that TINYRCU can be fixed if it really is the problem.
> > > 
> > > So I just want to make sure that we know what the root cause of your
> > > problem is. It's quite possible that it _is_ a real leak of filp or
> > > something, but before possibly wasting time trying to figure that out,
> > > let's see if your config is to blame.
> > 
> > With changed config (PREEMPT=y, TREE_PREEMPT_RCU=y) I haven't reproduced
> > yet.
> > 
> > When I was reproducing with TINYRCU things went normally for some time
> > until suddenly slabs stopped being freed.
> 
> Hmmm... If the system is responsive during this time, could you please
> do the following after the slabs stop being freed?
> 
> ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cpu_time,cmd | grep '\[rcu'

Looks like tinyrcu is not innocent (or at least it makes the bug appear much
more easily).

With PREEMPT + TREE_PREEMPT_RCU the system was stable while compiling for over
2 hours; after switching to TINY_RCU, the filp count started increasing pretty
early after the compile began.

All the relevant information is attached (PREEMPT+TINY_RCU):
  config.gz
  ps auxf     |
  slabinfo    |  twice, once early (1-*), the second 30 minutes later (2-*)
  meminfo     |

ls -l /proc/*/fd produces 658 lines for the 1-* series and 300 for the 2-* series.

In both cases 
   ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cputime,cmd | grep '\[rcu'
returns the same information:
      6 FF    1      1 R    R 0 00:00:00 [rcu_kthread]


According to slabtop, the filp count is increasing permanently (about +1000
every 3 seconds), probably because of top (1s refresh rate) and collectd (10s
rate) scanning /proc (without top, it increases by about 300 every 10s).
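
For anyone wanting to watch the same growth without slabtop, here is a rough
sketch that samples one cache from a slabinfo-format file and prints the change
between samples. The helper names slab_active and watch_slab are made up for
this mail, and it assumes the usual slabinfo layout where the second column is
<active_objs>.

```shell
# Sketch only: slab_active and watch_slab are made-up helper names.

# Print the <active_objs> column for one cache from a slabinfo-format file.
# Usage: slab_active NAME [FILE]      (FILE defaults to /proc/slabinfo)
slab_active() {
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/proc/slabinfo}"
}

# Sample one cache COUNT times, INTERVAL seconds apart, printing the
# current count and the change since the previous sample.
# Usage: watch_slab NAME INTERVAL COUNT [FILE]
# The body runs in a subshell so its variables do not leak to the caller.
watch_slab() (
    name=$1 interval=$2 count=$3 file=${4:-/proc/slabinfo}
    prev='' i=0
    while [ "$i" -lt "$count" ]; do
        cur=$(slab_active "$name" "$file")
        if [ -n "$prev" ] && [ -n "$cur" ]; then
            echo "$name $cur (delta $((cur - prev)))"
        else
            echo "$name $cur"
        fi
        prev=$cur
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
)
```

Something like `watch_slab filp 3 10` should then show whether the count keeps
climbing; reading /proc/slabinfo may require root on some systems.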

Running something like `for ((X=0; X < 200; X++)); do /bin/true; done` causes
the pid, task_struct and signal_cache slab counts to increase by about 200
each, but no zombies are left behind.

1-*  Taken a few minutes after starting the compile process, but after having
     SIGSTOPed the compile process tree
2-*  Taken about 30 minutes later, after having killed the compile process
     tree, run the above for loop multiple times and closed most terminal
     sessions (including top)

Between 1-slabinfo and 2-slabinfo some values increased (a lot) while a few
decreased. I don't know which ones are RCU-affected and which ones are
not.
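
To compare the two snapshots, something along these lines could list the caches
whose <active_objs> changed, largest growth first. It is only a sketch:
slab_diff is a made-up name, it assumes both files are in the /proc/slabinfo
format, and it cannot tell which caches are freed via RCU, only which ones
grew or shrank.

```shell
# Sketch only: slab_diff is a made-up helper name.
# Print "delta name old new" for every cache whose <active_objs> differs
# between two slabinfo snapshots, sorted by delta (largest increase first).
# Usage: slab_diff OLD NEW
slab_diff() {
    awk '
        /^slabinfo/ || /^#/ { next }        # skip the header lines
        NR == FNR { old[$1] = $2; next }    # first file: remember active_objs
        ($1 in old) && $2 != old[$1] {      # second file: report changes
            printf "%d %s %d %d\n", $2 - old[$1], $1, old[$1], $2
        }
    ' "$1" "$2" | sort -rn
}
```

With the attached files this would be run as `slab_diff 1-slabinfo 2-slabinfo`.
Caches present in only one of the two files are not reported.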

Bruno