Hi David,

Forgive my selective quoting...

David Chinner wrote:
> Take a large file - say Size = 5x RAM or so - and then start
> N threads running at offset (n / N) * Size, where n = the thread
> number. They each read (Size / N) and so typically don't overlap.
>
> Throughput with increasing numbers of threads on a 24p Altix
> on an XFS filesystem on 2.6.15-rc5 looks like:
>
> Loads    tput
> -----   -------
>   1      789.59
>   2     1191.56
>   4     1724.63
>   8     1213.63
>  16     1057.03
>  32      744.73
>
> Basically, we hit a scaling limitation between 4 and 8 threads. This
> was consistent across I/O sizes from 4KB to 4MB. I took a simple 30s
> PC sample profile:
>
> Percent  Routine
> -------  --------------------------
>   63.62  _write_lock_irqsave
>   15.66  _read_unlock_irq
>
> So _read_unlock_irq looks to be triggered by the mapping->tree_lock.
>
> I think that the write_lock_irqsave() contention is from memory
> reclaim (shrink_list() -> try_to_release_page() -> ->releasepage() ->
> xfs_vm_releasepage() -> try_to_free_buffers() -> clear_page_dirty() ->
> test_clear_page_dirty() -> write_lock_irqsave(&mapping->tree_lock, ...))
> because page cache memory was full of this one file and demand is
> causing its pages to be constantly recycled.

I'd say you're right. tree_lock contention will be coming from a number
of sources. Reclaim, as you say, will be a big one; mpage_readpages
(from readahead) will be another. Then the read lock in find_get_page
in generic_mapping_read will start contending heavily with the writers
and not get much concurrency.

I'm sure lockless (read-side) pagecache will help... not only will it
eliminate read_lock costs, but the reduced read contention should also
decrease write_lock contention and cacheline bouncing.

As well as lockless pagecache, I think we can batch tree_lock
operations in readahead. Would be interesting to see how much this
patch helps.

--
SUSE Labs, Novell Inc.