On Tue, Apr 14, 2009 at 03:29:52PM +0800, Andi Kleen wrote:
> Wu Fengguang writes:
> >
> > The context readahead algorithm guarantees to discover the
> > sequentialness no matter how the streams are interleaved. For the
> > above example, it will start sequential readahead since page 2 and
> > 1002.
> >
> > The trick is to poke for page @offset-1 in the page cache when it
> > has no other clues on the sequentialness of request @offset: if the
> > current request belongs to a sequential stream, that stream must
> > have accessed page @offset-1 recently,
>
> Really? The page could be just randomly cached from some previous IO,
> right? I don't think you can detect that right? [maybe you could play
> some games with the LRU list, but that would be likely not O(1)]
>
> I wonder if this could cause the prefetcher to "switch gears" suddenly
> when it first walks a long stretch of non cached file data and then
> suddenly hits a cached page.
>
> One possible way around this would be to add a concept of "page
> generation". E.g. use a few bits of page flags, and keep a global
> generation count that increases slowly (let's say every few seconds).
> Mark the page with the current generation count when the prefetcher
> takes it. When doing your algorithm, check the count first and only
> use pages with a recent count.
>
> Not sure if it's a real problem or not.

Good catch, Andi! I'll list the possible situations below. I guess you
are referring to (2.3)?

1) page at @offset-1 is present, and it may be a trace of:

1.1) (interleaved) sequential reads: we catch it, bingo!

1.2) clustered random reads: readahead will be mostly favorable

If page @offset-1 exists, it means there were two references that came
close in both space (distance < one readahead window) and time
(distance < one LRU scan cycle). So it's a good indication that some
clustered random reads are occurring in the area around @offset.

I have done many experiments on the performance impact of readahead on
clustered random reads - the results are very encouraging. The tests
covered different random read densities and random read sizes, as well
as different thrashing conditions (by varying the dataset:memory
ratio). There is hardly any performance loss (2% for the worst case),
but the gain can be as large as 300%! Some numbers, curves and analysis
can be found in the attached slides and paper:

- readahead performance slides:
  Page 23-26: Clustered random reads
- readahead framework paper:
  Page 7, Section 4.3: Random reads

The recent readahead addition for mmap reads provides another vivid
example of the stated "readahead is good for clustered random reads"
principle. The side effects of readahead on the random references made
by executables are very good:

- major faults reduced by 1/3
- mmap IO numbers reduced by 1/4
- with no obvious overheads

But as always, one can fake a workload to totally defeat the readahead
heuristics ;-)
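To make the probing trick concrete: it is no more than a plain page
cache lookup. A minimal mock-up could look like the sketch below (not
the actual patch code - probe_history_page is a made-up helper name;
the real logic would live in mm/readahead.c):

	#include <linux/pagemap.h>

	/*
	 * Probe for a history page just before @offset.  A hit
	 * suggests that some stream accessed page @offset-1 recently,
	 * so starting sequential readahead at @offset is probably
	 * worthwhile.
	 */
	static int probe_history_page(struct address_space *mapping,
				      pgoff_t offset)
	{
		struct page *page;

		if (!offset)
			return 0;	/* start of file: nothing to probe */

		page = find_get_page(mapping, offset - 1);
		if (!page)
			return 0;	/* no trace: treat as random read */

		page_cache_release(page); /* drop the reference we took */
		return 1;		/* history hit: likely sequential */
	}

Note that this is a single page cache lookup per decision - no LRU list
games are needed, which keeps the heuristic cheap.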
2) page at @offset-1 is not present (for a sequential stream)

2.1) aggressive user space drop-behind: fixable nuisance

The user space could be doing fadvise(offset-1, DONTNEED), which will
drop the history hint required to enable context readahead. But I guess
when the administrator/developer notices its impact - performance
dropped instead of increased - he can easily fix it up to do
fadvise(offset-2, DONTNEED), or manage its own readahead via
fadvise(WILLNEED). So this is a nuisance, but a fixable one (see the
sketch at the end of this mail).

2.2) readahead thrashing: not handled for now

We don't handle this for now. The current behavior is to stop readahead
and possibly restart the readahead window ramp-up process.

2.3) readahead cache hits: rare case, and the impact is temporary

The page at @offset-1 does get referenced by this stream, but it was
created by someone else some time ago. By the time we reference page
@offset, the page at @offset-1 may have been lifted to the active LRU
by this second reference, or that happened too late and it got
reclaimed.

Normally it's a range of cached pages, and we are

a) either walking inside the range and enjoying the cache hits,
b) or walking out of it and restarting readahead by ourselves,
c) or the range of cached pages gets reclaimed while we are walking on
   them, and hence we cannot find page @offset-1.

Obviously (c) is rare and temporary, and it is the main cause of (2.3).
As soon as we go on to the next page at @offset+1, we'll find its
'previous' page at @offset to be cached (it was created by us!). So the
context readahead starts working again - it's merely delayed by one
page :-)

Thanks,
Fengguang
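PS: for (2.1), here is the fixed-up drop-behind loop I have in mind, as
a minimal user space sketch (illustrative only - the names and the
one-page history margin are my own choices; it just uses plain
posix_fadvise(2)):

	#include <fcntl.h>
	#include <unistd.h>

	/*
	 * Drop-behind reader that deliberately keeps the last page or
	 * two cached, so another interleaved stream probing for its
	 * page @offset-1 can still find the history hint.
	 */
	int main(int argc, char **argv)
	{
		char buf[4096];
		long pagesz = sysconf(_SC_PAGESIZE);
		off_t off = 0;
		ssize_t n;
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDONLY);
		if (fd < 0)
			return 1;

		while ((n = read(fd, buf, sizeof(buf))) > 0) {
			off += n;
			/* drop [0, off - 2 pages), keep the rest cached */
			if (off > 2 * pagesz)
				posix_fadvise(fd, 0, off - 2 * pagesz,
					      POSIX_FADV_DONTNEED);
		}
		close(fd);
		return 0;
	}

The point is simply that the DONTNEED range ends one page earlier than
the aggressive fadvise(offset-1, DONTNEED) variant would - i.e. the
fadvise(offset-2, DONTNEED) fix mentioned in (2.1).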