* Avoid excessive time spent on concurrent slab shrinking
@ 2006-03-31 22:44 Christoph Lameter
[not found] ` <20060331150120.21fad488.akpm@osdl.org>
2006-03-31 23:45 ` Andrew Morton
0 siblings, 2 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-03-31 22:44 UTC (permalink / raw)
To: akpm, nickpiggin; +Cc: linux-mm
We experienced that concurrent slab shrinking on 2.6.16 can slow down a
system excessively due to lock contention. Slab shrinking is a global
operation so it does not make sense for multiple slab shrink operations
to be ongoing at the same time. The single shrinking task can perform the
shrinking for all nodes and processors in the system. Introduce an atomic
counter that works in the same way as in shrink_zone to limit concurrent
shrinking.
Also calculate the time it took to do the shrinking and wait at least twice
that time before doing it again. If we are spending excessive time
on slab shrinking then we need to pause for some time to ensure that the
system is capable of accomplishing other tasks.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.16/mm/vmscan.c
===================================================================
--- linux-2.6.16.orig/mm/vmscan.c 2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/mm/vmscan.c 2006-03-31 14:38:18.000000000 -0800
@@ -130,6 +130,8 @@ static long total_memory;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
+static atomic_t active_shrinkers;
+static unsigned long next_slab_shrink;
/*
* Add a shrinker callback to be called from the vm
@@ -187,12 +189,18 @@ int shrink_slab(unsigned long scanned, g
{
struct shrinker *shrinker;
int ret = 0;
+ unsigned long shrinkstart;
if (scanned == 0)
scanned = SWAP_CLUSTER_MAX;
- if (!down_read_trylock(&shrinker_rwsem))
- return 1; /* Assume we'll be able to shrink next time */
+ if (atomic_read(&active_shrinkers) ||
+ time_before(jiffies, next_slab_shrink) ||
+ !down_read_trylock(&shrinker_rwsem))
+ /* Assume we'll be able to shrink next time */
+ return 1;
+ atomic_inc(&active_shrinkers);
+ shrinkstart = jiffies;
list_for_each_entry(shrinker, &shrinker_list, list) {
unsigned long long delta;
@@ -239,6 +247,12 @@ int shrink_slab(unsigned long scanned, g
shrinker->nr += total_scan;
}
+ /*
+ * If slab shrinking took a long time then lets at least wait
+ * twice as long as it took before we do it again.
+ */
+ next_slab_shrink = jiffies + 2 * (jiffies - shrinkstart);
+ atomic_dec(&active_shrinkers);
up_read(&shrinker_rwsem);
return ret;
}
* Re: Avoid excessive time spent on concurrent slab shrinking
[not found] ` <20060331150120.21fad488.akpm@osdl.org>
@ 2006-03-31 23:17 ` Christoph Lameter
2006-03-31 23:46 ` Andrew Morton
[not found] ` <20060331153235.754deb0c.akpm@osdl.org>
0 siblings, 2 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-03-31 23:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: nickpiggin, linux-mm
On Fri, 31 Mar 2006, Andrew Morton wrote:
> Christoph Lameter <clameter@sgi.com> wrote:
> >
> > We experienced that concurrent slab shrinking on 2.6.16 can slow down a
> > system excessively due to lock contention.
>
> How much?
System sluggish in general. cscope takes 20 minutes to start etc. Dropping
the caches restored performance.
> Which lock(s)?
Seems to be mainly iprune_sem. So it's inode reclaim.
> > Slab shrinking is a global
> > operation so it does not make sense for multiple slab shrink operations
> > to be ongoing at the same time.
>
> That's how it used to be - it was a semaphore and we baled out if
> down_trylock() failed. If we're going to revert that change then I'd
> prefer to just go back to doing it that way (only with a mutex).
No problem with that. Seems that the behavior <2.6.9 was okay. This showed
up during beta testing of a new major distribution release.
> The reason we made that change in 2.6.9:
>
> Use an rwsem to protect the shrinker list instead of a regular
> semaphore. Modifications to the list are now done under the write lock,
> shrink_slab takes the read lock, and access to shrinker->nr becomes racy
(which is now concurrent).
>
> Previously, having the slab scanner get preempted or scheduling while
> holding the semaphore would cause other tasks to skip putting pressure on
> the slab.
>
> Also, make shrink_icache_memory return -1 if it can't do anything in
> order to hold pressure on this cache and prevent useless looping in
> shrink_slab.
shrink_icache_memory() never returns -1.
> Note the lack of performance numbers? How are we to judge whether the
> regression which your proposal introduces is outweighed by the (unmeasured)
> gain it provides?
We just noticed general sluggishness and took some stackdumps to see what
the system was up to. Do we have a benchmark for slab shrinking?
> We need a *lot* of testing results with varied workloads and varying
> machine types before we can say that changes like this are of aggregate
> benefit and do not introduce bad corner-case regressions.
The slowdown of the system running concurrent slab reclaim is pretty
severe. Machine is basically unusable until you manually trigger the
dropping of the caches.
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-03-31 22:44 Avoid excessive time spent on concurrent slab shrinking Christoph Lameter
[not found] ` <20060331150120.21fad488.akpm@osdl.org>
@ 2006-03-31 23:45 ` Andrew Morton
1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2006-03-31 23:45 UTC (permalink / raw)
To: Christoph Lameter; +Cc: nickpiggin, linux-mm
(Resent to correct linux-mm address)
Christoph Lameter <clameter@sgi.com> wrote:
>
> We experienced that concurrent slab shrinking on 2.6.16 can slow down a
> system excessively due to lock contention.
How much?
Which lock(s)?
> Slab shrinking is a global
> operation so it does not make sense for multiple slab shrink operations
> to be ongoing at the same time.
That's how it used to be - it was a semaphore and we baled out if
down_trylock() failed. If we're going to revert that change then I'd
prefer to just go back to doing it that way (only with a mutex).
The reason we made that change in 2.6.9:
Use an rwsem to protect the shrinker list instead of a regular
semaphore. Modifications to the list are now done under the write lock,
shrink_slab takes the read lock, and access to shrinker->nr becomes racy
(which is now concurrent).
Previously, having the slab scanner get preempted or scheduling while
holding the semaphore would cause other tasks to skip putting pressure on
the slab.
Also, make shrink_icache_memory return -1 if it can't do anything in
order to hold pressure on this cache and prevent useless looping in
shrink_slab.
Note the lack of performance numbers? How are we to judge whether the
regression which your proposal introduces is outweighed by the (unmeasured)
gain it provides?
> The single shrinking task can perform the
> shrinking for all nodes and processors in the system.
Probably. But we _can_ sometimes do disk I/O while holding that lock, down
in the inode-releasing code, iirc. Could get bad with a `-o sync' mounted
filesystem.
> Introduce an atomic
> counter that works in the same way as in shrink_zone to limit concurrent
> shrinking.
No, a simple mutex_trylock() should suffice.
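An illustrative sketch of that approach (shrinker_lock is a hypothetical name, the 2.6.16 shrink_slab() signature is assumed, and the per-shrinker scan loop is elided):

static DEFINE_MUTEX(shrinker_lock);     /* hypothetical; would stand in for shrinker_rwsem */

int shrink_slab(unsigned long scanned, gfp_t gfp_mask, unsigned long lru_pages)
{
        struct shrinker *shrinker;
        int ret = 0;

        if (scanned == 0)
                scanned = SWAP_CLUSTER_MAX;

        /* Only one task shrinks slab at a time; everyone else assumes
         * they will be able to shrink next time. */
        if (!mutex_trylock(&shrinker_lock))
                return 1;

        list_for_each_entry(shrinker, &shrinker_list, list) {
                /* ... existing per-shrinker scanning logic, unchanged ... */
        }

        mutex_unlock(&shrinker_lock);
        return ret;
}

set_shrinker() and remove_shrinker() would then take the same mutex, blocking, when modifying the list.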
> Also calculate the time it took to do the shrinking and wait at least twice
> that time before doing it again. If we are spending excessive time
> on slab shrinking then we need to pause for some time to ensure that the
> system is capable of accomplishing other tasks.
No way, sorry. I've had it with "gee let's do this, it might be better"
"optimisations" in that code.
We need a *lot* of testing results with varied workloads and varying
machine types before we can say that changes like this are of aggregate
benefit and do not introduce bad corner-case regressions.
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-03-31 23:17 ` Christoph Lameter
@ 2006-03-31 23:46 ` Andrew Morton
[not found] ` <20060331153235.754deb0c.akpm@osdl.org>
1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2006-03-31 23:46 UTC (permalink / raw)
To: Christoph Lameter; +Cc: nickpiggin, linux-mm
Christoph Lameter <clameter@sgi.com> wrote:
>
> On Fri, 31 Mar 2006, Andrew Morton wrote:
>
> > Christoph Lameter <clameter@sgi.com> wrote:
> > >
> > > We experienced that concurrent slab shrinking on 2.6.16 can slow down a
> > > system excessively due to lock contention.
> >
> > How much?
>
> System sluggish in general. cscope takes 20 minutes to start etc. Dropping
> the caches restored performance.
OK. What sort of system was it, and what was the workload? Filesystem types?
What sort of overhead was it? sleeping-in-D-state?
pingponging-cachelines-around?
It's been like that for an awful long time. Can you think why this has
only just now been noticed?
> > Which lock(s)?
>
> Seems to be mainly iprune_sem. So it's inode reclaim.
But why on earth would iprune_mutex make such a difference? The kernel can
throw away inodes at a great old rate, and it takes quite some time to
restore them.
I fear that something new is happening, and that prune_icache() is now
doing lots of work without achieving anything.
We have fiddled with various things in fs/inode.c which could affect this
over the past year. I wonder if one of those changes has caused the inode
scan to now scan lots of unreclaimable inodes.
> > > Slab shrinking is a global
> > > operation so it does not make sense for multiple slab shrink operations
> > > to be ongoing at the same time.
> >
> > That's how it used to be - it was a semaphore and we baled out if
> > down_trylock() failed. If we're going to revert that change then I'd
> > prefer to just go back to doing it that way (only with a mutex).
>
> No problem with that. Seems that the behavior <2.6.9 was okay. This showed
> up during beta testing of a new major distribution release.
OK.
> > The reason we made that change in 2.6.9:
> >
> > Use an rwsem to protect the shrinker list instead of a regular
> > semaphore. Modifications to the list are now done under the write lock,
> > shrink_slab takes the read lock, and access to shrinker->nr becomes racy
> > (which is now concurrent).
> >
> > Previously, having the slab scanner get preempted or scheduling while
> > holding the semaphore would cause other tasks to skip putting pressure on
> > the slab.
> >
> > Also, make shrink_icache_memory return -1 if it can't do anything in
> > order to hold pressure on this cache and prevent useless looping in
> > shrink_slab.
>
> shrink_icache_memory() never returns -1.
if (!(gfp_mask & __GFP_FS))
return -1;
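For context, those lines sit at the top of shrink_icache_memory(); in the 2.6.16 era the whole function is roughly the following (condensed sketch, not verbatim source):

static int shrink_icache_memory(int nr, gfp_t gfp_mask)
{
        if (nr) {
                /*
                 * We may hold various FS locks, so don't recurse into
                 * filesystems via clear_inode() and friends.
                 */
                if (!(gfp_mask & __GFP_FS))
                        return -1;      /* keeps pressure on this cache */
                prune_icache(nr);
        }
        return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
}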
> > Note the lack of performance numbers? How are we to judge whether the
> > regression which your proposal introduces is outweighed by the (unmeasured)
> > gain it provides?
>
> We just noticed general sluggishness and took some stackdumps to see what
> the system was up to.
OK. But was it D-state sleep (semaphore lock contention) or what?
> Do we have a benchmark for slab shrinking?
Nope. In general reclaim shouldn't be a performance problem because the
things which we reclaim take so much work to reestablish. It only causes
problems when we're repeatedly scanning lots of things which aren't
actually reclaimable. Hence my suspicions are aroused...
> > We need a *lot* of testing results with varied workloads and varying
> > machine types before we can say that changes like this are of aggregate
> > benefit and do not introduce bad corner-case regressions.
>
> The slowdown of the system running concurrent slab reclaim is pretty
> severe. Machine is basically unusable until you manually trigger the
> dropping of the caches.
bad.
* Re: Avoid excessive time spent on concurrent slab shrinking
[not found] ` <20060331153235.754deb0c.akpm@osdl.org>
@ 2006-03-31 23:48 ` Christoph Lameter
2006-04-01 0:00 ` Andrew Morton
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2006-03-31 23:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: nickpiggin, linux-mm
On Fri, 31 Mar 2006, Andrew Morton wrote:
> > System sluggish in general. cscope takes 20 minutes to start etc. Dropping
> > the caches restored performance.
>
> OK. What sort of system was it, and what was the workload? Filesystem types?
A build server. Lots of scripts running, compilers etc etc.
> It's been like that for an awful long time. Can you think why this has
> only just now been noticed?
Testing has reached a new level of thoroughness because of the new releases
that are due soon...
> > We just noticed general sluggishness and took some stackdumps to see what
> > the system was up to.
>
> OK. But was it D-state sleep (semaphore lock contention) or what?
Yes, lots of processes waiting on semaphores in
shrink_slab->shrink_icache_memory. Need to look at this in more detail it
seems.
I looked at the old release that worked. Seems that it did the same thing
in terms of slab shrinking. Concurrent slab shrinking was no problem. So
you may be right. It's something unrelated to the code in vmscan.c. Maybe
Nick knows something about this?
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-03-31 23:48 ` Christoph Lameter
@ 2006-04-01 0:00 ` Andrew Morton
2006-04-01 0:14 ` Andrew Morton
2006-04-01 0:22 ` Christoph Lameter
0 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2006-04-01 0:00 UTC (permalink / raw)
To: Christoph Lameter; +Cc: nickpiggin, linux-mm
Christoph Lameter <clameter@sgi.com> wrote:
>
> On Fri, 31 Mar 2006, Andrew Morton wrote:
>
> > > System sluggish in general. cscope takes 20 minutes to start etc. Dropping
> > > the caches restored performance.
> >
> > OK. What sort of system was it, and what was the workload? Filesystem types?
>
> A build server. Lots of scripts running, compilers etc etc.
Interesting. Many CPUs?
> > It's been like that for an awful long time. Can you think why this has
> > only just now been noticed?
>
> Testing has reached a new level of thoroughness because of the new releases
> that are due soon...
>
> > > We just noticed general sluggishness and took some stackdumps to see what
> > > the system was up to.
> >
> > OK. But was it D-state sleep (semaphore lock contention) or what?
>
> Yes, lots of processes waiting on semaphores in
> shrink_slab->shrink_icache_memory. Need to look at this in more detail it
> seems.
Please. Or at least suggest a means-of-reproducing.
A plain old sysrq-T would be great. That'll tell us who owns iprune_sem,
and what he's up to while holding it. Actually five-odd sysrq-T's would be
better.
If the lock holder is stuck on disk I/O or a congested queue or something
then that's very different from the lock holder being in a
pointlessly-burn-CPU-scanning-stuff condition.
> I looked at the old release that worked. Seems that it did the same thing
> in terms of slab shrinking. Concurrent slab shrinking was no problem. So
> you may be right. It's something unrelated to the code in vmscan.c. Maybe
> Nick knows something about this?
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-04-01 0:00 ` Andrew Morton
@ 2006-04-01 0:14 ` Andrew Morton
2006-04-01 0:22 ` Christoph Lameter
1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2006-04-01 0:14 UTC (permalink / raw)
To: clameter, nickpiggin, linux-mm
Andrew Morton <akpm@osdl.org> wrote:
>
> A plain old sysrq-T would be great.
>
Really great.
We do potentially-vast gobs of waiting for I/O in
prune_icache->dispose_list->truncate_inode_pages().
But then, why would dispose_list() run truncate_inode_pages()? Reclaiming
an inode which has no links to it, perhaps - it's been a while since I was
in there <wishes he added more comments last time he understood that stuff>
clear_inode() does wait_on_inode()...
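For reference, the path being described looks roughly like this in the 2.6.16-era fs/inode.c (condensed sketch, not verbatim; dispose_list() is called from prune_icache() with iprune_mutex held):

static void dispose_list(struct list_head *head)
{
        while (!list_empty(head)) {
                struct inode *inode;

                inode = list_entry(head->next, struct inode, i_list);
                list_del(&inode->i_list);

                if (inode->i_data.nrpages)
                        truncate_inode_pages(&inode->i_data, 0);  /* can wait on I/O */
                clear_inode(inode);     /* wait_on_inode(), then fs ->clear_inode() */
                destroy_inode(inode);
                /* ... inode hash/list bookkeeping under inode_lock elided ... */
        }
}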
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-04-01 0:00 ` Andrew Morton
2006-04-01 0:14 ` Andrew Morton
@ 2006-04-01 0:22 ` Christoph Lameter
2006-04-01 1:25 ` Andrew Morton
2006-04-01 18:24 ` David Chinner
1 sibling, 2 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-01 0:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: nickpiggin, linux-mm
On Fri, 31 Mar 2006, Andrew Morton wrote:
> > A build server. Lots of scripts running, compilers etc etc.
>
> Interesting. Many CPUs?
12 processors. 6 nodes.
> A plain old sysrq-T would be great. That'll tell us who owns iprune_sem,
> and what he's up to while holding it. Actually five-odd sysrq-T's would be
> better.
Some traces:
Stack traceback for pid 16836
0xe00000380bc68000 16836 1 1 6 R
0xa00000020b8e6050 [xfs]xfs_iextract+0x190
0xa00000020b8e63a0 [xfs]xfs_ireclaim+0x80
0xa00000020b921c70 [xfs]xfs_finish_reclaim+0x330
0xa00000020b921fa0 [xfs]xfs_reclaim+0x140
0xa00000020b93f820 [xfs]linvfs_clear_inode+0x260
0xa0000001001855f0 clear_inode+0x310
0xa000000100185f70 dispose_list+0x90
0xa000000100186c40 shrink_icache_memory+0x480
0xa000000100105bb0 shrink_slab+0x290
0xa000000100107cc0 try_to_free_pages+0x380
0xa0000001000f9f70 __alloc_pages+0x330
0xa0000001000ed940 page_cache_alloc_cold+0x160
0xa0000001000fe3a0 __do_page_cache_readahead+0x120
0xa0000001000fe820 blockable_page_cache_readahead+0xe0
0xa0000001000fea50 make_ahead_window+0x150
0xa0000001000fee30 page_cache_readahead+0x390
0xa0000001000ee730 do_generic_mapping_read+0x190
0xa0000001000efd80 __generic_file_aio_read+0x2c0
0xa00000020b93c190 [xfs]xfs_read+0x3b0
0xa00000020b934170 [xfs]linvfs_aio_read+0x130
Stack traceback for pid 19357
0xe000003815dc0000 19357 19108 0 10 D
0xa000000100524b80 schedule+0x2940
0xa000000100521ee0 __down+0x260
0xa000000100186880 shrink_icache_memory+0xc0
0xa000000100105bb0 shrink_slab+0x290
0xa000000100107cc0 try_to_free_pages+0x380
0xa0000001000f9f70 __alloc_pages+0x330
0xa00000010012e7d0 alloc_page_vma+0x150
0xa0000001001108d0 __handle_mm_fault+0x390
0xa00000010052c520 ia64_do_page_fault+0x280
0xa00000010000caa0 ia64_leave_kernel
0xa0000001000f29b0 file_read_actor+0xb0
0xa0000001000ee830 do_generic_mapping_read+0x290
0xa0000001000efd80 __generic_file_aio_read+0x2c0
0xa00000020b93c190 [xfs]xfs_read+0x3b0
0xa00000020b934170 [xfs]linvfs_aio_read+0x130
0xa000000100146430 do_sync_read+0x170
0xa000000100147f20 vfs_read+0x200
0xa000000100148790 sys_read+0x70
0xa00000010000c830 ia64_trace_syscall+0xd0
Stack traceback for pid 12033
0xe000003414cd0000 12033 18401 0 10 D
0xa000000100524b80 schedule+0x2940
0xa000000100521ee0 __down+0x260
0xa000000100186880 shrink_icache_memory+0xc0
0xa000000100105bb0 shrink_slab+0x290
0xa000000100107cc0 try_to_free_pages+0x380
0xa0000001000f9f70 __alloc_pages+0x330
0xa00000010012e7d0 alloc_page_vma+0x150
0xa0000001001108d0 __handle_mm_fault+0x390
0xa00000010052c520 ia64_do_page_fault+0x280
0xa00000010000caa0 ia64_leave_kernel
0xa0000001000f29b0 file_read_actor+0xb0
0xa0000001000ee830 do_generic_mapping_read+0x290
0xa0000001000efd80 __generic_file_aio_read+0x2c0
0xa00000020b93c190 [xfs]xfs_read+0x3b0
0xa00000020b934170 [xfs]linvfs_aio_read+0x130
0xa000000100146430 do_sync_read+0x170
0xa000000100147f20 vfs_read+0x200
0xa000000100148790 sys_read+0x70
0xa00000010000c830 ia64_trace_syscall+0xd0
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-04-01 0:22 ` Christoph Lameter
@ 2006-04-01 1:25 ` Andrew Morton
2006-04-01 2:34 ` Nick Piggin
2006-04-01 5:59 ` Nathan Scott
2006-04-01 18:24 ` David Chinner
1 sibling, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2006-04-01 1:25 UTC (permalink / raw)
To: Christoph Lameter; +Cc: nickpiggin, linux-mm, Nathan Scott
Christoph Lameter <clameter@sgi.com> wrote:
>
> Some traces:
>
> Stack traceback for pid 16836
> 0xe00000380bc68000 16836 1 1 6 R
> 0xa00000020b8e6050 [xfs]xfs_iextract+0x190
> 0xa00000020b8e63a0 [xfs]xfs_ireclaim+0x80
> 0xa00000020b921c70 [xfs]xfs_finish_reclaim+0x330
> 0xa00000020b921fa0 [xfs]xfs_reclaim+0x140
> 0xa00000020b93f820 [xfs]linvfs_clear_inode+0x260
> 0xa0000001001855f0 clear_inode+0x310
> 0xa000000100185f70 dispose_list+0x90
> 0xa000000100186c40 shrink_icache_memory+0x480
> 0xa000000100105bb0 shrink_slab+0x290
> 0xa000000100107cc0 try_to_free_pages+0x380
> 0xa0000001000f9f70 __alloc_pages+0x330
> 0xa0000001000ed940 page_cache_alloc_cold+0x160
> 0xa0000001000fe3a0 __do_page_cache_readahead+0x120
> 0xa0000001000fe820 blockable_page_cache_readahead+0xe0
> 0xa0000001000fea50 make_ahead_window+0x150
> 0xa0000001000fee30 page_cache_readahead+0x390
> 0xa0000001000ee730 do_generic_mapping_read+0x190
> 0xa0000001000efd80 __generic_file_aio_read+0x2c0
> 0xa00000020b93c190 [xfs]xfs_read+0x3b0
> 0xa00000020b934170 [xfs]linvfs_aio_read+0x130
OK, thanks. Is that a typical trace?
It appears that we're being busy in xfs_iextract(), but it would be sad if
the problem was really lock contention in xfs_iextract(), and we just
happened to catch it when it was running.
Or maybe xfs_iextract is just slow. So this is one thing we need to get to
the bottom of (profiles might tell us).
Assuming that there's nothing we can do to improve the XFS situation, our
options appear to be, in order of preference:
a) move some/all of dispose_list() outside iprune_mutex.
b) make iprune_mutex an rwlock, take it for reading around
dispose_list(), for writing elsewhere.
c) go back to single-threading shrink_slab (or just shrink_icache_memory())
For this one we'd need to understand which observations prompted Nick
to make shrinker_rwsem an rwsem?
We also need to understand why this has become worse. Perhaps xfs_iextract
got slower (cc's Nathan). Do you have any idea whereabouts in kernel history
this started happening?
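To make option (a) concrete, a rough sketch against the 2.6.16-era prune_icache() (illustrative only; whether dropping the lock before dispose_list() is actually safe against concurrent umount/invalidate_inodes() is exactly what would need auditing):

static void prune_icache(int nr_to_scan)
{
        LIST_HEAD(freeable);
        int nr_pruned = 0;

        mutex_lock(&iprune_mutex);
        spin_lock(&inode_lock);
        /* ... walk inode_unused, moving up to nr_to_scan reclaimable
         *     inodes onto the local "freeable" list, counting them
         *     in nr_pruned ... */
        inodes_stat.nr_unused -= nr_pruned;
        spin_unlock(&inode_lock);

        /* Option (a): drop iprune_mutex before the part that can block
         * on I/O, so concurrent callers of shrink_icache_memory() do
         * not pile up behind it. */
        mutex_unlock(&iprune_mutex);

        dispose_list(&freeable);        /* truncate_inode_pages()/clear_inode() happen here */
}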
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-04-01 1:25 ` Andrew Morton
@ 2006-04-01 2:34 ` Nick Piggin
2006-04-01 5:59 ` Nathan Scott
1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-04-01 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, Nathan Scott
Andrew Morton wrote:
> c) go back to single-threading shrink_slab (or just shrink_icache_memory())
>
> For this one we'd need to understand which observations prompted Nick
> to make shrinker_rwsem an rwsem?
>
This was when I was looking for reasons why inode and dentry caches
would sometimes apparently explode on people and consume most of their
memory. One of the reasons was here, when slab caches did build up, and
multiple processes would start reclaim, scanning would skew away from
slab.
Considering the actual slab shrinkers are single threaded, I agree this
could cause more semaphore contention.
One thing we could do is ensure shrinker->nr gets incremented, but not
actually have more than one thread enter slab reclaim at once.
Or have the trylock&abort behaviour pushed down into the actual
shrinkers themselves, then at least we can get concurrent icache and
dcache scanning happening.
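A sketch of the second idea, using the icache shrinker as the example (illustrative only; __prune_icache() is a hypothetical variant of prune_icache() that expects iprune_mutex to be held by the caller):

static int shrink_icache_memory(int nr, gfp_t gfp_mask)
{
        if (nr) {
                if (!(gfp_mask & __GFP_FS))
                        return -1;
                /* Bale out rather than sleeping if another task is already
                 * pruning the icache; returning -1 keeps pressure on the
                 * cache, and the caller can go on to scan the dcache
                 * concurrently. */
                if (!mutex_trylock(&iprune_mutex))
                        return -1;
                __prune_icache(nr);
                mutex_unlock(&iprune_mutex);
        }
        return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
}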
--
SUSE Labs, Novell Inc.
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-04-01 1:25 ` Andrew Morton
2006-04-01 2:34 ` Nick Piggin
@ 2006-04-01 5:59 ` Nathan Scott
2006-04-01 18:30 ` David Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Nathan Scott @ 2006-04-01 5:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Lameter, nickpiggin, linux-mm, dgc
On Fri, Mar 31, 2006 at 05:25:18PM -0800, Andrew Morton wrote:
> Christoph Lameter <clameter@sgi.com> wrote:
> ...
> It appears that we're being busy in xfs_iextract(), but it would be sad if
> the problem was really lock contention in xfs_iextract(), and we just
> happened to catch it when it was running.
>
> Or maybe xfs_iextract is just slow. So this is one thing we need to get to
> the bottom of (profiles might tell us).
I assume (profiles would be good to prove it) we are spending
time walking the hash bucket list there Christoph (while we're
holding the ch_lock spinlock on the hash bucket)? [CC'ing Dave
Chinner for any further comment, he's been looking at the chash
list for unrelated reasons recently..]
> Assuming that there's nothing we can do to improve the XFS situation, our
> options appear to be, in order of preference:
>
> a) move some/all of dispose_list() outside iprune_mutex.
>
> b) make iprune_mutex an rwlock, take it for reading around
> dispose_list(), for writing elsewhere.
>
> c) go back to single-threading shrink_slab (or just shrink_icache_memory())
>
> For this one we'd need to understand which observations prompted Nick
> to make shrinker_rwsem an rwsem?
>
> We also need to understand why this has become worse. Perhaps xfs_iextract
> got slower (cc's Nathan). Do you have any idea whereabouts in kernel history
> this started happening?
Nothing's changed in xfs_iextract for many years. It's quite possible
the simple hash with linked list buckets is no longer an effective
choice of algorithm here for the inode cluster hash... or perhaps the
hash table is too small... or... but anyway, I would not expect any
difference between kernel versions here (esp. the two vendor kernel
versions Christoph will be comparing - they'll be behaving exactly
the same way in this regard from XFS's POV as the code in question is
identical).
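As a back-of-the-envelope illustration (a userspace toy, not XFS code): with a fixed bucket count, the average chain length walked under each bucket lock grows linearly with the number of cached inodes. Using the ihashsize=32768 reported later in the thread:

#include <stdio.h>

int main(void)
{
        const double buckets = 32768.0;         /* cf. ihashsize=32768 */
        const unsigned long inodes[] = { 100000, 1000000, 10000000 };

        for (int i = 0; i < 3; i++)
                printf("%lu cached inodes -> average chain length %.1f\n",
                       inodes[i], inodes[i] / buckets);
        return 0;
}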
It's also quite possible some other performance bottleneck was moved
out of the way, and lock contention on the chashlist lock is now the
next biggest thing in line...
If its useful for experimenting, Christoph, you can easily tweak the
cluster hash size manually by dinking with xfs_iget.c::xfs_chash_init.
cheers.
--
Nathan
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-04-01 0:22 ` Christoph Lameter
2006-04-01 1:25 ` Andrew Morton
@ 2006-04-01 18:24 ` David Chinner
1 sibling, 0 replies; 14+ messages in thread
From: David Chinner @ 2006-04-01 18:24 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, nickpiggin, linux-mm
On Fri, Mar 31, 2006 at 04:22:29PM -0800, Christoph Lameter wrote:
> Some traces:
>
> Stack traceback for pid 16836
> 0xe00000380bc68000 16836 1 1 6 R
> 0xa00000020b8e6050 [xfs]xfs_iextract+0x190
> 0xa00000020b8e63a0 [xfs]xfs_ireclaim+0x80
> 0xa00000020b921c70 [xfs]xfs_finish_reclaim+0x330
> 0xa00000020b921fa0 [xfs]xfs_reclaim+0x140
> 0xa00000020b93f820 [xfs]linvfs_clear_inode+0x260
Christoph, what machine, what XFS mount options? Did the latest upgrade
lose the "ihashsize=xxxxx" mount option that used to be set on all
the large filesystems?
Cheers,
Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-04-01 5:59 ` Nathan Scott
@ 2006-04-01 18:30 ` David Chinner
2006-04-01 18:49 ` Christoph Lameter
0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2006-04-01 18:30 UTC (permalink / raw)
To: Nathan Scott; +Cc: Andrew Morton, Christoph Lameter, nickpiggin, linux-mm, dgc
On Sat, Apr 01, 2006 at 03:59:42PM +1000, Nathan Scott wrote:
> On Fri, Mar 31, 2006 at 05:25:18PM -0800, Andrew Morton wrote:
> > Christoph Lameter <clameter@sgi.com> wrote:
> > ...
> > It appears that we're being busy in xfs_iextract(), but it would be sad if
> > the problem was really lock contention in xfs_iextract(), and we just
> > happened to catch it when it was running.
> >
> > Or maybe xfs_iextract is just slow. So this is one thing we need to get to
> > the bottom of (profiles might tell us).
>
> I assume (profiles would be good to prove it) we are spending
> time walking the hash bucket list there Christoph (while we're
> holding the ch_lock spinlock on the hash bucket)? [CC'ing Dave
> Chinner for any further comment, he's been looking at the chash
> list for unrelated reasons recently..]
You'll only get contention if something else is trying to walk the
same hash chain, which tends to implicate not enough hash buckets.
> If its useful for experimenting, Christoph, you can easily tweak the
> cluster hash size manually by dinking with xfs_iget.c::xfs_chash_init.
Just use the ihashsize mount option - the cluster hash size is proportional
to the inode hash size which is changed by the ihashsize mount option.
Cheers,
Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
* Re: Avoid excessive time spent on concurrent slab shrinking
2006-04-01 18:30 ` David Chinner
@ 2006-04-01 18:49 ` Christoph Lameter
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-01 18:49 UTC (permalink / raw)
To: David Chinner; +Cc: Nathan Scott, Andrew Morton, nickpiggin, linux-mm, dgc
On Sun, 2 Apr 2006, David Chinner wrote:
> same hash chain, which tends to implicate not enough hash buckets.
>
> > If its useful for experimenting, Christoph, you can easily tweak the
> > cluster hash size manually by dinking with xfs_iget.c::xfs_chash_init.
>
> Just use the ihashsize mount option - the cluster hash size is proportional
> to the inode hash size which is changed by the ihashsize mount option.
>
> Cheers,
XFS settings visible via /proc/mounts are
rw,ihashsize=32768,sunit=32,swidth=25
Not enough hash buckets? This was the default selection by xfs.