* [RFC] start_aggressive_readahead
@ 2002-07-25 16:10 Christoph Hellwig
2002-07-25 16:44 ` Rik van Riel
2002-07-26 6:53 ` Daniel Phillips
0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2002-07-25 16:10 UTC (permalink / raw)
To: torvalds; +Cc: linux-mm
Here is another patch from the XFS tree; I'd be happy to get some
comments on this one as well.

This function (start_aggressive_readahead()) checks whether all zones
of the given gfp mask have lots of free pages. XFS needs this for its
own readahead code (used only deep in the directory code; normal file
readahead is handled by the generic pagecache code). We perform the
readahead only if it returns 1, i.e. there are enough free pages.

We could rip it out of XFS entirely without loss of functionality, but
it would cost directory handling performance.

I'm also open to a better name (I think the current one is very bad,
but I don't have a better idea :)). I'd also be interested in comments
on how to avoid the new function and use existing functionality
instead, but I've searched for a long time and didn't find anything
suitable.
--
The US Army issues lap-top computers now to squad-leaders on up. [...]
Believe me, there is nothing more lethal than a Power Point briefing
given by an Army person. -- Leon A. Goldstein
--- linux/include/linux/mm.h Wed, 29 May 2002 14:00:22
+++ linux/include/linux/mm.h Mon, 22 Jul 2002 12:06:09
@@ -460,6 +460,8 @@ extern void FASTCALL(free_pages(unsigned
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr),0)
+extern int start_aggressive_readahead(int);
+
extern void show_free_areas(void);
extern void show_free_areas_node(pg_data_t *pgdat);
--- linux/kernel/ksyms.c Wed, 17 Jul 2002 12:08:06
+++ linux/kernel/ksyms.c Mon, 22 Jul 2002 12:06:09
@@ -90,6 +90,7 @@ EXPORT_SYMBOL(exit_fs);
EXPORT_SYMBOL(exit_sighand);
/* internal kernel memory management */
+EXPORT_SYMBOL(start_aggressive_readahead);
EXPORT_SYMBOL(_alloc_pages);
EXPORT_SYMBOL(__alloc_pages);
EXPORT_SYMBOL(alloc_pages_node);
--- linux/mm/page_alloc.c Tue, 25 Jun 2002 10:15:12
+++ linux/mm/page_alloc.c Mon, 22 Jul 2002 12:06:09
@@ -512,6 +512,37 @@ unsigned int nr_free_highpages (void)
#define K(x) ((x) << (PAGE_SHIFT-10))
/*
+ * If it returns non-zero it means there's lots of RAM "free"
+ * (note: not in cache!), so any caller will know that
+ * it can allocate some memory to do some more aggressive
+ * (possibly wasteful) readahead. The state of the memory
+ * should be rechecked after every few pages are allocated
+ * for this aggressive readahead.
+ *
+ * NOTE: caller passes in gfp_mask of zones to check
+ */
+int start_aggressive_readahead(int gfp_mask)
+{
+ pg_data_t *pgdat = pgdat_list;
+ zonelist_t *zonelist;
+ zone_t **zonep, *zone;
+ int ret = 0;
+
+ do {
+ zonelist = pgdat->node_zonelists + (gfp_mask & GFP_ZONEMASK);
+ zonep = zonelist->zones;
+
+ for (zone = *zonep++; zone; zone = *zonep++)
+ if (zone->free_pages > zone->pages_high * 2)
+ ret = 1;
+
+ pgdat = pgdat->node_next;
+ } while (pgdat);
+
+ return ret;
+}
+
+/*
* Show free area list (used inside shift_scroll-lock stuff)
* We also calculate the percentage fragmentation. We do this by counting the
* memory on each free list with the exception of the first item on the list.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
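The per-zone test in the patch above can be modeled in a few lines of
userspace C. This is a simplified sketch (the struct and function names
are made up for illustration; the real function walks every node's
zonelist for the given gfp_mask, not a flat array):

```c
#include <assert.h>

/* Simplified model of the patch's per-zone test: a zone counts as
 * having "lots of RAM free" when its free page count exceeds twice
 * the pages_high watermark. */
struct zone_model {
    unsigned long free_pages;
    unsigned long pages_high;
};

/* Note: like the patch, this sets ret = 1 as soon as ANY zone
 * qualifies -- it does not require every zone to qualify. */
static int start_aggressive_readahead_model(const struct zone_model *zones, int n)
{
    int i, ret = 0;

    for (i = 0; i < n; i++)
        if (zones[i].free_pages > zones[i].pages_high * 2)
            ret = 1;
    return ret;
}
```

A caller doing speculative directory readahead would, as the patch
comment says, recheck this every few allocated pages rather than
trusting one snapshot.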
* Re: [RFC] start_aggressive_readahead
From: Rik van Riel @ 2002-07-25 16:44 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: torvalds, linux-mm

On Thu, 25 Jul 2002, Christoph Hellwig wrote:

> This function (start_aggressive_readahead()) checks whether all zones
> of the given gfp mask have lots of free pages.

Seems a bit silly since ideally we wouldn't reclaim cache memory
until we're low on physical memory.

regards,

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"
http://www.surriel.com/		http://distro.conectiva.com/
* Re: [RFC] start_aggressive_readahead
From: Andrew Morton @ 2002-07-25 19:40 UTC (permalink / raw)
To: Rik van Riel; +Cc: Christoph Hellwig, torvalds, linux-mm

Rik van Riel wrote:
>
> On Thu, 25 Jul 2002, Christoph Hellwig wrote:
>
> > This function (start_aggressive_readahead()) checks whether all zones
> > of the given gfp mask have lots of free pages.
>
> Seems a bit silly since ideally we wouldn't reclaim cache memory
> until we're low on physical memory.

Yes, I would question its worth also.

What it boils down to is: which pages are we, in the immediate future,
more likely to use?  Pages which are at the tail of the inactive list,
or pages which are in the file's readahead window?

I'd say the latter, so readahead should just go and do reclaim.
* Re: [RFC] start_aggressive_readahead
From: Scott Kaplan @ 2002-07-26 16:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, Christoph Hellwig, torvalds, linux-mm

On Thursday, July 25, 2002, at 03:40 PM, Andrew Morton wrote:

> What it boils down to is: which pages are we, in the immediate future,
> more likely to use?  Pages which are at the tail of the inactive list,
> or pages which are in the file's readahead window?

That is the right question to ask...

> I'd say the latter, so readahead should just go and do reclaim.

...but the answer's not that simple, I'm afraid.  You've got two groups
of logical pages competing for physical page frames.  Which is more
valuable depends entirely on the reference behavior of the workload.
I'll point you to a recent paper of mine on exactly this problem (in
two formats):

http://www.cs.amherst.edu/~sfkaplan/papers/prepaging.pdf
http://www.cs.amherst.edu/~sfkaplan/papers/prepaging.ps.bz2

The results presented are from uniprogrammed reference traces, but the
principle still applies: for some reference patterns, caching some
number of readahead pages is a great idea.  For other reference
patterns, the pages at the tail end of the inactive list are *still*
more valuable, and the readahead pages should be completely ignored.
There's also a lot of space in the middle: readahead pages should be
cached, but only for a limited time, lest they displace too many pages
at the tail end of the inactive list.

What you really want is some kind of adaptivity that allows you to
compare the rates at which these two pools of pages are referenced,
and then decides how many readahead pages (if any) to cache.

Scott
* Re: [RFC] start_aggressive_readahead
From: Andrew Morton @ 2002-07-26 19:38 UTC (permalink / raw)
To: Scott Kaplan; +Cc: Rik van Riel, Christoph Hellwig, torvalds, linux-mm

Scott Kaplan wrote:
>
> ..
> > What it boils down to is: which pages are we, in the immediate future,
> > more likely to use?  Pages which are at the tail of the inactive list,
> > or pages which are in the file's readahead window?
>
> That is the right question to ask...
>
> > I'd say the latter, so readahead should just go and do reclaim.
>
> ...but the answer's not that simple, I'm afraid.  You've got two groups
> of logical pages competing for physical page frames.  Which is more
> valuable depends entirely on the reference behavior of the workload.
> I'll point you to a recent paper of mine on exactly this problem:
>
> http://www.cs.amherst.edu/~sfkaplan/papers/prepaging.pdf

readahead was rewritten for 2.5.  I think it covers most of the things
you discuss there.

- It adaptively grows the window size in response to "hits": each time
  userspace requests a page, and that page is found to be inside the
  previously-requested readahead window, we grow the window by 2 pages
  (up to a configurable limit) because readahead is being beneficial.

- It shrinks the window size in response to "misses": if userspace
  requests a page which is *not* inside the previously-requested
  window, the future window size is shrunk by 25%.

- It detects eviction: if userspace requests a page which *should* have
  been inside the readahead window, but it's actually not there, then
  we know it was evicted prior to being used.  We shrink the window by
  3 pages.  (This almost never happens, in my testing.)

- It behaves differently for page faults: for read(2), readahead is
  strictly ahead of the requested page.  For mmap pagefaults, the
  readaround window is positioned 25% behind the requested page and
  75% ahead of it.

All these numbers were engineered by the time-honoured practice of
guess-and-giggle.

On IDE disks, you can fiddle extensively with readahead and make
virtually no difference at all, because the disk does it as well.  On
older SCSI disks, readahead makes a lot of difference.  Because,
presumably, the disk isn't being as smart.  To some extent, this
device-level caching makes the whole readahead thing of historical
interest only, I suspect.

- For CPU efficiency against an already-fully-cached file: if readahead
  finds that all pages inside a readahead request are already in core,
  it shrinks the readahead window by a page, and ultimately turns
  readahead off completely.  It is resumed when there is a miss.

- We no longer put readahead pages on the active list.  They are placed
  on the head of the inactive list.  If nobody subsequently uses the
  page, it proceeds to the tail of the inactive list and is evicted.

Sort of.  This code needs some checking.  When the readahead page is
accessed, we set PageReferenced and leave it on the inactive list.  It
will still be evicted when it reaches the tail of the inactive list.
It will only be moved to the active list if it is referenced (faulted
in or read() from) a second time.  I guess this is the "use-once"
feature, and it is designed to detect the common case of a once-off
streaming read.

I'd be interested in your assessment of the 2.5 readahead/readaround
implementation.

It still has one nasty problem, which is not VM-related.  It is to do
with the interaction with request merging.  When performing streaming
reads from two large files, we tend to seek between the two files at
the readahead window size granularity.  But we *should* be alternating
between the files at a coarser granularity: the request queue's read
latency.  2.4 does this - somehow it manages to get its new readahead
requests merged with its old ones, so this has the effect of
"capturing" the disk head until the request latency of a request from
the other file expires.

I still need to get down and fix this - it's a very subtle interaction
between readahead and request queueing and I suspect it'll need to be
formalised in some manner, rather than just fiddling the code so it
happens to work out right.
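The window adaptation Andrew describes (grow by 2 pages on a hit,
shrink by 25% on a miss, shrink by 3 pages on a detected eviction,
clamped to a minimum and a configurable maximum) can be sketched in a
few helper functions.  This is an illustrative standalone model, not
the actual 2.5 kernel code; the constants and names are assumptions
based only on the description above:

```c
#include <assert.h>

#define RA_MIN_PAGES  4
#define RA_MAX_PAGES 32   /* 128k default max with 4k pages */

/* Keep the window inside [RA_MIN_PAGES, RA_MAX_PAGES]. */
static long ra_clamp(long w)
{
    if (w < RA_MIN_PAGES)
        return RA_MIN_PAGES;
    if (w > RA_MAX_PAGES)
        return RA_MAX_PAGES;
    return w;
}

/* Requested page was inside the window: grow by 2 pages. */
static long ra_hit(long w)      { return ra_clamp(w + 2); }

/* Requested page was outside the window: shrink by 25%. */
static long ra_miss(long w)     { return ra_clamp(w - w / 4); }

/* Readahead page was evicted before use: shrink by 3 pages. */
static long ra_evicted(long w)  { return ra_clamp(w - 3); }
```

With these rules the window ratchets up under sequential access and
decays quickly once readahead stops paying off, which is the
stabilising behaviour Andrew is hoping for.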
* Re: [RFC] start_aggressive_readahead
From: Scott Kaplan @ 2002-07-28 23:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, Christoph Hellwig, torvalds, linux-mm

On Friday, July 26, 2002, at 03:38 PM, Andrew Morton wrote:

> readahead was rewritten for 2.5.

It is just darned difficult to keep up with all of the changes!

> I think it covers most of the things you discuss there.
>
> - It adaptively grows the window size in response to "hits"

Seems somewhat reasonable, although easy to be fooled.  If I reference
some of the most recently read-ahead blocks, I'll grow the read-ahead
window, keeping other unreferenced read-ahead blocks for longer, even
though there's no evidence that keeping them longer will result in more
hits.  In other words, it's not hits that should necessarily make you
grow the cache -- it's evidence that there will be an *increase* in
hits if you do.

> - It shrinks the window size in response to "misses": if userspace
>   requests a page which is *not* inside the previously-requested
>   window, the future window size is shrunk by 25%.

This one seems weird.  If I reference a page that could have been in a
larger read-ahead window, shouldn't I make the window *larger* so that
next time, it *will* be in the window?

> - It detects eviction: if userspace requests a page which *should*
>   have been inside the readahead window, but it's actually not there,
>   then we know it was evicted prior to being used.  We shrink the
>   window by 3 pages.  (This almost never happens, in my testing.)

Again, this seems backwards in the manner mentioned above.  It could
have been resident, but it was evicted, so if you want it to be a hit,
make the window *bigger*, no?

What should drive the reduction in the read-ahead window is the
observation that recent increases have not yielded higher hit rates --
more has not been better.

> - It behaves differently for page faults: for read(2), readahead is
>   strictly ahead of the requested page.  For mmap pagefaults, the
>   readaround window is positioned 25% behind the requested page and
>   75% ahead of it.

That seems sensible enough...

The entire adaptive mechanism you've described seems only to consider
one of the two competing pools, though, namely the read-ahead pool of
pages.  What about its competition -- the references to pages that are
near eviction at the end of the inactive list?  Adapting to one without
consideration of the other is working half-blind.  Why would you ever
want to shrink the read-ahead window if very, very few pages at the end
of the inactive list are being hit?  Similarly, you would want to be
very cautious about increasing the size of the read-ahead window if
many pages at the end of the inactive list are being re-used.

> To some extent, this device-level caching makes the whole readahead
> thing of historical interest only, I suspect.

To some extent, yes, but the scales are substantially different.  If
your disk has just a few MB of cache, but your RAM is hundreds of MB
(or larger), the VM system can choose to cache read-ahead pages for
much, much longer if it detects that it's of greater benefit than
caching very old, used pages.

> - We no longer put readahead pages on the active list.  They are
>   placed on the head of the inactive list.  If nobody subsequently
>   uses the page, it proceeds to the tail of the inactive list and is
>   evicted.

This seems a wise move, as placing them in the active list is only
going to be beneficial in some very unusual cases.  Still, the question
does remain as to *how long* a read-ahead page should be left unused
before it is prepared for eviction.

I'll admit that it's not necessarily clear how to do the cost/benefit
adaptivity that I'm describing.  I'm working on that right now, which
is why I'm suddenly so curious about the details of this VM and how to
play with it.

All in all, it sounds like you've made good changes, but perhaps you
can address the weaknesses that I've pointed out (or tell me why I'm
wrong about them).

Scott
* Re: [RFC] start_aggressive_readahead
From: Rik van Riel @ 2002-07-29 0:19 UTC (permalink / raw)
To: Scott Kaplan; +Cc: Andrew Morton, Christoph Hellwig, torvalds, linux-mm

On Sun, 28 Jul 2002, Scott Kaplan wrote:

> > - We no longer put readahead pages on the active list.  They are
> >   placed on the head of the inactive list.  If nobody subsequently
> >   uses the page, it proceeds to the tail of the inactive list and is
> >   evicted.
>
> This seems a wise move, as placing them in the active list is only
> going to be beneficial in some very unusual cases.

I'm not sure about that.  If we do linear IO we most likely want to
evict the pages we've already used as opposed to the pages we're about
to use.

This means that (1) we want to clear the accessed bit of the pages
we've already read, moving them to the inactive list if needed, and
(2) we'll want to keep the about-to-be-used pages separate from the
already-used pages.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/
* Re: [RFC] start_aggressive_readahead
From: Scott Kaplan @ 2002-07-29 2:12 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, Christoph Hellwig, torvalds, linux-mm

On Sunday, July 28, 2002, at 08:19 PM, Rik van Riel wrote:

> I'm not sure about that.  If we do linear IO we most likely
> want to evict the pages we've already used as opposed to the
> pages we're about to use.

The situation is more subtle than that.  I agree that in a linear I/O
case, the read-ahead pages are extremely likely to be used very soon.
However, that does *not* imply that they should be promoted to the
active list -- in fact, quite the opposite when considering the
read-ahead situation.

Consider exactly the case you have raised -- strict, linear referencing
of blocks, such as a sequential file read.  When block `i' is
referenced, it is an excellent prediction that block `i+1' will be
referenced soon.  If block `i+1' is not referenced soon, then the
prediction was incorrect, *and there's little reason to keep the block
around any longer*.

In other words, the better the prediction, the closer to the end of the
LRU ordering the blocks can be placed.  The ones that *are* used soon
will be referenced and promoted to the front of the LRU ordering before
they are evicted, exactly because the soonness of use is so strong.
The read-ahead blocks that are not used soon are evicted before long.

In other words, the shorter a time you think you need to keep a block,
the closer to the end of the list it should go.  If your guess is
wrong, you've displaced fewer other blocks.  If your prediction is a
good one, such as with linear file reading, you will not need to cache
a block as a read-ahead block for long before it is actually used.

It is when you predict that a read-ahead will not pay off for some time
-- that the read-ahead blocks will not be used so soon -- that such
blocks need to be placed closer to the front of the LRU ordering (that
is, in the active list).  That way, they will be cached much longer so
that they will still be resident when they finally are used.  Of
course, such caching displaces more of the other pages, possibly
causing faults on those.  It is when your read-ahead prediction
indicates a weak soonness of use that you must compare the benefits of
caching those pages against the cost of displacing other pages.  Only
if few pages near the end of the LRU ordering -- non-read-ahead pages
-- are being referenced might it be worth caching read-ahead pages for
so long.

So, in the case of linear I/O, placing the read-ahead pages at the
front of the inactive list is likely to provide more than enough time
for those pages to be used and promoted to the active list.  By placing
them in the inactive list, you reduce the damage done when read-ahead
pages are *not* used soon.

Scott
* Re: [RFC] start_aggressive_readahead
From: Rik van Riel @ 2002-07-29 3:05 UTC (permalink / raw)
To: Scott Kaplan; +Cc: Andrew Morton, Christoph Hellwig, torvalds, linux-mm

On Sun, 28 Jul 2002, Scott Kaplan wrote:
> On Sunday, July 28, 2002, at 08:19 PM, Rik van Riel wrote:
>
> > I'm not sure about that.  If we do linear IO we most likely
> > want to evict the pages we've already used as opposed to the
> > pages we're about to use.
>
> The situation is more subtle than that.

> Consider exactly the case you have raised -- strict, linear referencing
> of blocks, such as a sequential file read.  When block `i' is
> referenced, it is an excellent prediction that block `i+1' will be
> referenced soon.  If block `i+1' is not referenced soon, then the
> prediction was incorrect, *and there's little reason to keep the block
> around any longer*.

My experience with 300 ftp clients pulling a collective 40 Mbit/s
suggests otherwise.

About 70% of the clients were on modem speed and the other 30% of
the clients were on widely variable higher speeds.

Since a disk seek + read is about 10ms, the absolute maximum number of
seeks that can be done is 100 a second and the minimum amount of time
between disk seeks for one stream should be about 3 seconds.

In reality the situation is worse because of the large speed difference
between the disk seeks and the fact that we want a reasonably low
latency for disk IO for the other tasks in the system.

This would put the conservative minimum time we should keep readahead
data in RAM at something like 10 seconds, to account for the speed
differences of fast and slow data streams and to not completely bog
down the IO subsystem with requests.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/
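Rik's back-of-the-envelope numbers can be checked mechanically.  The
sketch below uses only the figures from his mail (300 streams, ~10ms
per seek + read); the function name is made up for illustration:

```c
#include <assert.h>

/* With seek + read costing ~10ms, a disk manages at most
 * 1000 / 10 = 100 seeks per second.  Spread across N concurrent
 * streams, each stream is serviced at most once every N / 100
 * seconds -- the minimum time its readahead data must survive in
 * RAM to be of any use. */
static int min_stream_interval_secs(int streams, int seek_ms)
{
    int seeks_per_sec = 1000 / seek_ms;
    return streams / seeks_per_sec;
}
```

For 300 streams at 10ms this gives the 3 seconds Rik quotes; his
conservative 10-second figure then adds headroom for slow clients and
for latency owed to other tasks.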
* Re: [RFC] start_aggressive_readahead
From: Scott Kaplan @ 2002-07-29 15:24 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, Christoph Hellwig, torvalds, linux-mm

On Sunday, July 28, 2002, at 11:05 PM, Rik van Riel wrote:

> My experience with 300 ftp clients pulling a collective 40 Mbit/s
> suggests otherwise.
>
> About 70% of the clients were on modem speed and the other 30% of
> the clients were on widely variable higher speeds.
>
> Since a disk seek + read is about 10ms, the absolute maximum
> number of seeks that can be done is 100 a second and the minimum
> amount of time between disk seeks for one stream should be about
> 3 seconds.

This is a very interesting example of some real (and important)
reference behavior that must be understood to be handled well.  In the
context of this thread of discussion, this case is substantially
different from your original comment on read-ahead for ``linear file
I/O''.

Just as a refresher for myself and anyone else that needs it: I claimed
that linear file I/O was a case in which read-ahead blocks should not
be cached for long before they would either be used or evicted from
lack of use.  (That is, they should be placed nearer to the end of the
LRU ordering.)  The claim was based on the observation that sequential
file traversal is a very good case for read-ahead, where the read-ahead
blocks are very likely to be used very soon.

What's important about this example is that, due to the whole system
workload and the disparate connection speeds of the ftp clients, it is
*NOT* a typical case of linear file I/O.  In fact, what's odd about it
is that block `i' of a file will be read, and for slower connections,
block `i+1' will *not* be used for some time, since reading block `i'
will take a while.

In other words, the interleaved reference behavior from all of these
ftp downloads makes the prediction that block `i+1' will be used soon a
weaker prediction.  It is very likely to be used, yes, but not so soon
in many cases due to the other files being read and referenced.
Because the soonness of use is weak, we do indeed want to cache the
read-ahead pages for longer.  (That is, I agree that for this example,
read-ahead pages should go into the active list.)

Caching read-ahead pages for longer, though, displaces more used pages,
forcing them to be evicted sooner than they would have been without the
aggressive read-ahead caching.  Critically, for *this* workload, that's
probably just fine.  Assuming that different files are being downloaded
by different ftp clients, after reading and referencing a block, it's
probably worth little to cache it for very long in case of re-use.  In
other words, among the referenced pages, those near the end of the LRU
ordering are referenced rarely.  The competition between read-ahead
pages and less recently used referenced pages is lopsided in favor of
the read-ahead pages.  But that is only a consequence of the reference
pattern for *this specific workload* -- it may not be true for other
workloads.

Incidentally, this is all just mental masturbation until someone
actually records and measures the reference behavior from this kind of
workload.  It all sounds about right, but that's neither good science
nor good engineering.

In short, I agree that for this case, inserting read-ahead pages into
the inactive list may not be aggressive enough.  I disagree that the
reason is ``linear file I/O'', as the reference pattern here is more
complex than that.

This is also a wonderful case for getting read-ahead caching adaptivity
right: a system that can weigh read-ahead caching allocations against
less recently used referenced-page allocations will detect and adjust
to this case quickly, while avoiding such aggressive read-ahead caching
for other workloads.

Scott
* Re: [RFC] start_aggressive_readahead 2002-07-28 23:32 ` Scott Kaplan 2002-07-29 0:19 ` Rik van Riel @ 2002-07-29 7:34 ` Andrew Morton 2002-07-29 7:37 ` Vladimir Dergachev ` (2 more replies) 1 sibling, 3 replies; 23+ messages in thread From: Andrew Morton @ 2002-07-29 7:34 UTC (permalink / raw) To: Scott Kaplan; +Cc: Rik van Riel, Christoph Hellwig, linux-mm [ snipped poor old Linus. he doesn't read 'em anyway ] Scott Kaplan wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On Friday, July 26, 2002, at 03:38 PM, Andrew Morton wrote: > > > readahead was rewritten for 2.5. > > It is just darned difficult to keep up with all of the changes! > > > I think it covers most of the things you discuss there. > > > > - It adaptively grows the window size in response to "hits" > > Seems somewhat reasonable, although easy to be fooled. If I reference > some of the most recently read-ahead blocks, I'll grow the read-ahead > window, keeping other unreference, read-ahead blocks for longer, even > though there's no evidence that keeping them longer will result in more > hits. In other words, it's not hits that should necessarily make you grow > the cache -- it's the evidence that there will be an *increase* in hits if > you do. Ah, but if we're not getting hits in the readahead window then we're getting misses. And misses shrink the window. Add two pages for a hit, remove 25% for a miss. The window size should stabilise at a size which is larger if readahead is being useful. I hope. > > - It shrinks the window size in response to "misses" - if > > userspace requests a page which is *not* inside the previously-requested > > window, the future window size is shrunk by 25% > > This one seems wierd. If I reference a page that could have been in a > larger read-ahead window, shouldn't I make the window *larger* so that > next time, it *will* be in the window? That's true. 
If the application is walking across a file touching every fifth page, readahead will stabilise at its minimum window size, which is less than five pages and we lose bigtime. I'm not sure how to fix that while retaining some sanity in the code. > > - It detects eviction: if userspace requests a page which *should* > > have been inside the readahead window, but it's actually not there, > > then we know it was evicted prior to being used. We shrink the > > window by 3 pages. (This almost never happens, in my testing). > > Again, this seems backwards in the manner mentioned above. It could have > been resident, but it was evicted, so if you want it to be a hit, make the > window *bigger*, no? What should drive the reduction in the read-ahead > window is the observation that recent increases have not yielding higher > hit rates -- more has not been better. That's the thrashing situation which Rik mentioned. The application must be reading the file very slowly. We try to reduce the window size to a point at which all the slow readers in the system stabilise and stop thrashing each other's readahead. This works up to a point - I had a little artificial test - just a process which opens a great number of files and reads a page from each one, cycling around. The current code reduces the onset of thrashing in that test, and reduces its severity. It's significantly better than the old code. But there is still a dramatic dropoff in throughput once it happens. > > - It behaves differently for page faults: for read(2), readahead is > > strictly ahead of the requested page. For mmap pagefaults, > > the readaround window is positioned 25% behind the requested page and > > 75% ahead of it. > > That seems sensible enough... > > The entire adaptive mechanism you've described seems only to consider one > of the two competing pools, though, namely the read-ahead pool of pages. 
> What about its competition -- the references to pages that are near
> eviction at the end of the inactive list?  Adapting to one without
> consideration of the other is working half-blind.  Why would you ever
> want to shrink the read-ahead window if very, very few pages at the
> end of the inactive list are being hit?

hmm.  The default max window size is 128 kbytes at present.  For some
but not many tests, increasing it does help.  But mainly because of the
merging artifact which I mentioned earlier.

> Similarly, you would want to be very
> cautious about increasing the size of the read-ahead window if many
> pages at the end of the inactive list are being re-used.

I tend to think that if pages at the tail of the LRU are being
referenced with any frequency we've goofed anyway.  There are many
things apart from readahead which will allocate pages, yes?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread
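[Editor's note: the controller Andrew describes (add two pages on a hit,
shrink 25% on a miss, drop three pages on a detected eviction) amounts to a
small additive-increase/multiplicative-decrease loop.  A userspace sketch
follows; the constants and names (RA_MIN_PAGES, RA_MAX_PAGES, ra_adjust)
are invented for illustration and are not the actual 2.5 mm/readahead.c
code.]

```c
#include <assert.h>

/* Illustrative limits; invented names, not the real 2.5 constants. */
#define RA_MIN_PAGES	4
#define RA_MAX_PAGES	32	/* 128k with 4k pages */

/*
 * One adjustment step of the controller described above: additive
 * increase on a hit, 25% multiplicative decrease on a miss, and a
 * flat 3-page penalty when a readahead page was evicted before use.
 */
static int ra_adjust(int window, int hit, int evicted)
{
	if (evicted)
		window -= 3;		/* read ahead, then reclaimed unused */
	else if (hit)
		window += 2;		/* request fell inside the window */
	else
		window -= window / 4;	/* miss: shrink by 25% */

	if (window < RA_MIN_PAGES)
		window = RA_MIN_PAGES;
	if (window > RA_MAX_PAGES)
		window = RA_MAX_PAGES;
	return window;
}
```

Under this rule a steady sequential reader climbs to the maximum window,
while a random reader collapses to the minimum, which is exactly the
stride-of-five failure mode discussed in this thread.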
* Re: [RFC] start_aggressive_readahead
  2002-07-29  7:34 ` Andrew Morton
@ 2002-07-29  7:37   ` Vladimir Dergachev
  2002-07-29  7:53     ` Andrew Morton
  2002-07-29  8:04   ` Rik van Riel
  2002-07-30 16:11   ` Scott Kaplan
  2 siblings, 1 reply; 23+ messages in thread
From: Vladimir Dergachev @ 2002-07-29 7:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm

> > > - It shrinks the window size in response to "misses" - if userspace
> > >   requests a page which is *not* inside the previously-requested
> > >   window, the future window size is shrunk by 25%
> >
> > This one seems weird.  If I reference a page that could have been in
> > a larger read-ahead window, shouldn't I make the window *larger* so
> > that next time, it *will* be in the window?
>
> That's true.  If the application is walking across a file
> touching every fifth page, readahead will stabilise at
> its minimum window size, which is less than five pages and
> we lose bigtime.  I'm not sure how to fix that while retaining
> some sanity in the code.

I am curious: which applications do you know of that actually do this ?

What about growing the window even if there is a miss, as long as
misses are sequential and not further than a fixed amount from the
window ?

                         Vladimir Dergachev
* Re: [RFC] start_aggressive_readahead
  2002-07-29  7:37 ` Vladimir Dergachev
@ 2002-07-29  7:53   ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-07-29 7:53 UTC (permalink / raw)
To: Vladimir Dergachev; +Cc: linux-mm

Vladimir Dergachev wrote:
>
> > > > - It shrinks the window size in response to "misses" - if
> > > >   userspace requests a page which is *not* inside the
> > > >   previously-requested window, the future window size is shrunk
> > > >   by 25%
> > >
> > > This one seems weird.  If I reference a page that could have been
> > > in a larger read-ahead window, shouldn't I make the window *larger*
> > > so that next time, it *will* be in the window?
> >
> > That's true.  If the application is walking across a file
> > touching every fifth page, readahead will stabilise at
> > its minimum window size, which is less than five pages and
> > we lose bigtime.  I'm not sure how to fix that while retaining
> > some sanity in the code.
>
> I am curious: which applications do you know of that actually do this ?

None.  Just a test program which I used for testing readahead!

> What about growing the window even if there is a miss as long as misses
> are sequential and not further than a fixed amount from the window ?

That would work.  If the window size is less than max, and the miss
occurred inside the max, increase the window to a size which would have
caught that page.  Or to the max.
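[Editor's note: the "near miss" rule Andrew agrees to here can be sketched
as a small helper.  The function name, signature, and the way the offsets
are measured are all invented for illustration.]

```c
#include <assert.h>

/*
 * Hypothetical helper for the idea above: if a miss at `offset` lands
 * beyond the current window but still within the maximum window
 * (measured from the window start), the access is nearly sequential,
 * so grow the window to the size that would have caught that page.
 * Otherwise leave the window alone.
 */
static long ra_window_for_near_miss(long window, long offset,
				    long window_start, long max)
{
	long distance = offset - window_start;

	if (window < max && distance >= window && distance < max)
		return distance + 1;	/* would have caught the page */
	return window;			/* hit, or a genuinely random miss */
}
```

With a 4-page window starting at page 100, a miss at page 110 would grow
the window to 11 pages, while a miss at page 200 (outside any plausible
maximum) would leave it untouched.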
* Re: [RFC] start_aggressive_readahead
  2002-07-29  7:34 ` Andrew Morton
  2002-07-29  7:37   ` Vladimir Dergachev
@ 2002-07-29  8:04 ` Rik van Riel
  2002-07-30 16:11 ` Scott Kaplan
  2 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2002-07-29 8:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: Scott Kaplan, Christoph Hellwig, linux-mm

On Mon, 29 Jul 2002, Andrew Morton wrote:

> > Similarly, you would want to be very
> > cautious about increasing the size of the read-ahead window if many
> > pages at the end of the inactive list are being re-used.
>
> I tend to think that if pages at the tail of the LRU are being
> referenced with any frequency we've goofed anyway.  There are
> many things apart from readahead which will allocate pages, yes?

It would be a useful thing to measure, though.  We can use this
information to decide to:

1) reduce readahead and, if the situation continues
2) do load control

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/
* Re: [RFC] start_aggressive_readahead
  2002-07-29  7:34 ` Andrew Morton
  2002-07-29  7:37   ` Vladimir Dergachev
  2002-07-29  8:04   ` Rik van Riel
@ 2002-07-30 16:11 ` Scott Kaplan
  2002-07-30 16:21   ` Martin J. Bligh
  2 siblings, 1 reply; 23+ messages in thread
From: Scott Kaplan @ 2002-07-30 16:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, Christoph Hellwig, linux-mm

On Monday, July 29, 2002, at 03:34 AM, Andrew Morton wrote:

> Scott Kaplan wrote:
>> In other words, it's not hits that should necessarily make you grow
>> the cache -- it's the evidence that there will be an *increase* in
>> hits if you do.
>
> Ah, but if we're not getting hits in the readahead window
> then we're getting misses.  And misses shrink the window.

Yes, and that's the wrong thing to do.  If you are getting hits, you
should try *shrinking* the window to see if there is a reduction in
hits.  If there is no reduction, you can capture just as many hits with
a smaller window -- the extra space was superfluous.  If you're getting
misses, you should try to *grow* the window (to commit an awful case of
verbing) in an attempt to turn such misses into hits.  If growing the
window doesn't decrease the misses, then you may need too large of an
increase to cache those pages successfully.  If growing the window does
decrease the misses, then keep growing until you don't see a decrease.

What I'm describing here has its own major pitfalls:

1) It considers only the read-ahead pool.  Shrinking or growing the
   window could also have an effect on the hits and misses to the used
   pool of pages.

2) You can get trapped in local minima.  Part of what makes memory
   allocation hard under any realistic on-line replacement policy is
   that changes in hits/misses are non-monotonic.
For example, if we are observing misses to evicted read-ahead pages and
try to grow the cache in response, we may not see any improvement unless
we grow the cache sufficiently, and then get diminishing returns if we
grow it beyond that point.  To avoid this kind of problem, you need more
than just hit and miss counts -- you need reference distributions.

> I tend to think that if pages at the tail of the LRU are being
> referenced with any frequency we've goofed anyway.

I disagree.  Referencing things at the tail of the LRU is the sign of
having done something *right*.  It means that for a workload with
substantial memory needs, the VM system is holding onto pages *just
long enough*, and no longer, to ensure that they are cached before
reuse.  It means that the workload is leaving some pages unused for
some time, but consistently revisiting those pages as part of a phase
change that is near the scale of the memory size.  It's a case where
LRU and its approximations perform about as well as possible.

Remember that the ordering of resident pages doesn't need to be very
exact.  A policy can have a completely goofed notion of which pages
will be used soon; if they're all resident, it doesn't matter that the
ordering among the resident pages was poor.  What counts is that they
were resident.  When you evict pages poorly, that's when the
mis-ordering is trouble: referencing pages that have just been
reclaimed is when we've really goofed.  Otherwise, it's fine.

This comment serves to highlight a point: memory pressure is not merely
defined by the amount of paging or the number of new page allocations.
It should also be defined by the number of references to pages that
*nearly* got evicted.  Those references represent behavior that is on
the scale of the memory size, where good and bad decisions make a
difference.  Therefore, those are events relevant to the VM and the
physical memory it is managing, and should contribute to the perception
that there is pressure on the memory resources.
Scott
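[Editor's note: the probe-and-revert search Scott sketches (shrink while
hits hold up, revert and grow when they drop) is a feedback loop rather
than a fixed rule.  A toy interpretation in C follows; the structure, the
one-page step size, and all names are invented, and a real implementation
would damp the oscillation and bound the window.]

```c
#include <assert.h>

struct ra_probe {
	int window;		/* current window size, in pages */
	int last_window;	/* size before the most recent change */
	double last_hit_rate;	/* hit rate observed at that size */
};

/*
 * One probing epoch: if the hit rate survived the last perturbation,
 * keep probing downward (a smaller window that captures the same hits
 * is pure savings); if the hit rate dropped, step back past the
 * previous size and probe upward instead.
 */
static void ra_probe_step(struct ra_probe *p, double hit_rate)
{
	int next;

	if (hit_rate >= p->last_hit_rate)
		next = p->window - 1;		/* shrinking was safe */
	else
		next = p->last_window + 1;	/* revert and grow */

	if (next < 0)
		next = 0;
	p->last_window = p->window;
	p->last_hit_rate = hit_rate;
	p->window = next;
}
```

Starting at 8 pages with a 50% hit rate, an epoch with an unchanged hit
rate shrinks the window to 7; if the following epoch then shows the hit
rate falling, the probe jumps back up past the old size to 9.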
* Re: [RFC] start_aggressive_readahead
  2002-07-30 16:11 ` Scott Kaplan
@ 2002-07-30 16:21   ` Martin J. Bligh
  2002-07-30 16:38     ` Scott Kaplan
  2002-07-30 17:13     ` William Lee Irwin III
  0 siblings, 2 replies; 23+ messages in thread
From: Martin J. Bligh @ 2002-07-30 16:21 UTC (permalink / raw)
To: Scott Kaplan, Andrew Morton; +Cc: Rik van Riel, Christoph Hellwig, linux-mm

>> Ah, but if we're not getting hits in the readahead window
>> then we're getting misses.  And misses shrink the window.
>
> Yes, and that's the wrong thing to do.  If you are getting hits,
> you should try *shrinking* the window to see if there is a
> reduction in hits.  If there is no reduction, you can capture
> just as many hits with a smaller window -- the extra space was
> superfluous.  If you're getting misses, you should try to *grow*
> the window (to commit an awful case of verbing) in an attempt to
> turn such misses into hits.  If growing the window doesn't decrease
> the misses, then you may need too large of an increase to cache
> those pages successfully.  If growing the window does decrease
> the misses, then keep growing until you don't see a decrease.

Would it not be easier to actually calculate (statistically) the
read-ahead window, rather than actually tweaking it empirically?
If we're getting misses, there could be at least two causes -

1. We're doing random, not sequential IO.  Shrinking the window
   would be most sensible.
2. We're reading ahead really fast, or skip-reading ahead.  Growing
   the window would probably be most sensible.

Thus I'd contend that either growing or shrinking in straight
response to just a hit/miss rate is not correct.  We need to actually
look at the access pattern of the application, surely?  Perhaps I'm
being naive, but I would have thought it would be possible to
calculate what the hit/miss rate with a given readahead window would
be without actually going to the pain of shrinking it up and down.

M.
* Re: [RFC] start_aggressive_readahead
  2002-07-30 16:21 ` Martin J. Bligh
@ 2002-07-30 16:38   ` Scott Kaplan
  2002-07-30 16:52     ` Martin J. Bligh
  2002-07-30 17:13   ` William Lee Irwin III
  1 sibling, 1 reply; 23+ messages in thread
From: Scott Kaplan @ 2002-07-30 16:38 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, Rik van Riel, Christoph Hellwig, linux-mm

On Tuesday, July 30, 2002, at 12:21 PM, Martin J. Bligh wrote:

> Thus I'd contend that either growing or shrinking in straight
> response to just a hit/miss rate is not correct.  We need to actually
> look at the access pattern of the application, surely?

I agree.  I probably should have made it clear that what I was
suggesting wasn't the right way to go about it, but rather an argument
against the heuristics that seemed backwards to me.

The causes for misses aren't necessarily as clear cut as you mentioned,
as there are a lot of behaviors that are neither fully random nor fully
sequential.  So, while it is ideal to have some foresight before
resizing the window -- some calculation that determines whether or not
growth will help or shrinkage will hurt -- it will require the VM
system to gather hit distributions.  I'm trying to make that happen
right now, although for all VM pages, and not for the specific purpose
of read-ahead calculations.

However, the paper for which I gave a pointer (in a shameless act of
self promotion) proposes exactly that: keeping reference distributions
for read-ahead and non-read-ahead pages, and then balancing the two
against each other in an attempt to determine what the best read-ahead
window size would be given recent reference behavior.  There may be
simpler, kruftier, and/or more effective versions of what I proposed,
but what you said above is, I think, the right idea.
Scott
* Re: [RFC] start_aggressive_readahead
  2002-07-30 16:38 ` Scott Kaplan
@ 2002-07-30 16:52   ` Martin J. Bligh
  2002-08-05 18:54     ` Scott Kaplan
  0 siblings, 1 reply; 23+ messages in thread
From: Martin J. Bligh @ 2002-07-30 16:52 UTC (permalink / raw)
To: Scott Kaplan; +Cc: Andrew Morton, Rik van Riel, Christoph Hellwig, linux-mm

>> Thus I'd contend that either growing or shrinking in straight
>> response to just a hit/miss rate is not correct.  We need to actually
>> look at the access pattern of the application, surely?
>
> I agree.  I probably should have made it clear that what I was
> suggesting wasn't the right way to go about it, but rather an
> argument against the heuristics that seemed backwards to me.

Both sets of heuristics seem backwards to me, depending on the
circumstances ;-)

> The causes for misses aren't necessarily as clear cut as you
> mentioned, as there are a lot of behaviors that are neither
> fully random nor fully sequential.

Indeed.  Sorry - all I was trying to point out was that if there exist
two identical sets of input data that can lead to two different correct
sets of output data, the calculation you're doing is insufficient.  Of
course, there are many more than two circumstances.

> So, while it is ideal to have some foresight before resizing the
> window -- some calculation that determines whether or not growth
> will help or shrinkage will hurt -- it will require the VM system
> to gather hit distributions.

Yup, but I think it's almost certainly worth that expense.

> However, the paper for which I gave a pointer (in a shameless act
> of self promotion) proposes exactly that: keeping reference

I should read that ;-)  We seem to be mostly in violent agreement ...
How you actually calculate the window is a matter for debate and
experimentation, but just growing and shrinking based on purely the
hit rate seems like a bad idea.

M.
* Re: [RFC] start_aggressive_readahead
  2002-07-30 16:52 ` Martin J. Bligh
@ 2002-08-05 18:54   ` Scott Kaplan
  0 siblings, 0 replies; 23+ messages in thread
From: Scott Kaplan @ 2002-08-05 18:54 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, Rik van Riel, Christoph Hellwig, linux-mm

Martin,

Sorry for the slowness of the response, but just a thought or two...

> Both sets of heuristics seem backwards to me, depending on the
> circumstances ;-)

I don't agree, but more on that in a moment.  First, I'd like to point
out a minor difference between what I meant by my suggestion and your
interpretation of it.  The heuristic that I was suggesting -- grow in
response to read-ahead misses, shrink in response to hits -- was not
intended as a mere replacement.  It was meant as a ``blind'' approach
to discovering the reference distribution for read-ahead pages.  So,
the heuristic wouldn't be used simply as stated; instead, it would be
a first approach to changing the read-ahead window size until evidence
was gathered to make higher-level decisions.

For example, the VM system could shrink the window in response to
hits, but if that shrinking decreased the hit count ``significantly'',
it would return to the smallest window size that did not cause a hit
decrease.  Similarly, the VM system could increase the window size in
response to misses, but after reaching some limit of increase where
the misses do not decrease ``sufficiently'', it could return the
window to the smallest size at which miss decrease was observed.

Now back to my claim that the heuristic that I suggested is not just
the flip side of the original heuristics, where both are roughly
equivalent, and the success of one or the other is just a matter of
the reference behavior.
Assuming that an LRU-like replacement strategy is in place -- and I
believe that page aging is LRU-like in the vast majority of situations
-- the only way to turn a miss into a hit is to increase the window
size.  Thus, the original heuristic's approach of shrinking the window
in response to misses is a guarantee that future references that are
part of the same reference behavior will remain misses.

Put differently, the *only* case in which it makes sense to shrink the
read-ahead window in response to misses is one in which the misses are
the result of un-cache-able references -- ones that would have required
an absurdly large window, and so no window would be the best choice.
However, the heuristic that I described above will reach the same
conclusion, although more slowly.  After growing the cache in response
to the misses and observing no miss decrease, it would revert to a
zero-sized window.

Granted, this discussion is based only on the read-ahead references,
and not on the references to other, used pages.  However, even with
that consideration, there's almost no situation in which you want to
respond to read-ahead misses by shrinking the window -- and in those
cases where you do, it's because of other factors, such as the need
for a hopelessly large window, or a heavy demand on used pages that
are near eviction.  Read-ahead misses may not motivate larger
read-ahead windows, but alone they *never* motivate smaller read-ahead
windows.

>> So, while it is ideal to have some foresight before resizing the
>> window -- some calculation that determines whether or not growth
>> will help or shrinkage will hurt -- it will require the VM system
>> to gather hit distributions.
>
> Yup, but I think it's almost certainly worth that expense.

I'm happy that you think so, because I'm trying to do that now, and
it's going to create some overhead.
Much like current rmap implementations, it's going to be the most
intrusive for those cases where no paging is involved, and so the
gains of tracking such information cannot be realized.

> How you actually calculate the window is a matter for debate and
> experimentation, but just growing and shrinking based on purely the
> hit rate seems like a bad idea.

Here I do agree.  Rather than finding the hit distribution by blindly
setting allocations and observing the outcome, we can gather data to
indicate what the outcome *would* be for that allocation.  Note,
however, that VM systems have a long, long history of doing things
like just responding to blind data gathering, much like increasing or
decreasing allocation due to hit rate.  It's a matter of convincing
people that gathering data that shows you the search space on-line is
worth the complexity and the overhead.

Scott
* Re: [RFC] start_aggressive_readahead
  2002-07-30 16:21 ` Martin J. Bligh
  2002-07-30 16:38   ` Scott Kaplan
@ 2002-07-30 17:13   ` William Lee Irwin III
  1 sibling, 0 replies; 23+ messages in thread
From: William Lee Irwin III @ 2002-07-30 17:13 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Scott Kaplan, Andrew Morton, Rik van Riel, Christoph Hellwig, linux-mm

On Tue, Jul 30, 2002 at 09:21:57AM -0700, Martin J. Bligh wrote:
> Would it not be easier to actually calculate (statistically) the
> read-ahead window, rather than actually tweaking it empirically?
> If we're getting misses, there could be at least two causes -

I wonder where these stats should really be kept.  They seem to be in
the vma, which probably doesn't fly too well when 20K threads are
pounding on different chunks of the same thing.  Each could do locally
sequential reads and look random from the perspective of per-vma stats.
This probably gets worse if different threads are stomping in different
patterns, e.g. one sequential, one random.  They also seem to lack any
way to cooperate since the hints are kept per-vma.  It's also probably
easier to predict the behavior of a single task.

Cheers,
Bill
* Re: [RFC] start_aggressive_readahead
  2002-07-25 19:40 ` Andrew Morton
  2002-07-26 16:50   ` Scott Kaplan
@ 2002-07-26 20:14 ` Stephen Lord
  2002-07-26 20:29   ` Andrew Morton
  1 sibling, 1 reply; 23+ messages in thread
From: Stephen Lord @ 2002-07-26 20:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, Christoph Hellwig, torvalds, linux-mm

On Thu, 2002-07-25 at 14:40, Andrew Morton wrote:
> Rik van Riel wrote:
> >
> > On Thu, 25 Jul 2002, Christoph Hellwig wrote:
> >
> > > This function (start_aggressive_readahead()) checks whether all zones
> > > of the given gfp mask have lots of free pages.
> >
> > Seems a bit silly since ideally we wouldn't reclaim cache memory
> > until we're low on physical memory.
>
> Yes, I would question its worth also.
>
> What it boils down to is: which pages are we, in the immediate future,
> more likely to use?  Pages which are at the tail of the inactive list,
> or pages which are in the file's readahead window?
>
> I'd say the latter, so readahead should just go and do reclaim.

The interesting thing is that tuning metadata readahead using this
function does indeed improve performance under heavy memory load.  It
seems we end up pushing more useful things out of memory than the
metadata we read in.

Andrew, you talked about a GFP flag which would mean only return memory
if there was some available which was already free and clean.  The best
approach might be to use that flag in this scenario and skip the
readahead if no memory is returned.

For the record, this is not just used for directory readahead, but for
any btree structured metadata in xfs.

Steve
* Re: [RFC] start_aggressive_readahead
  2002-07-26 20:14 ` Stephen Lord
@ 2002-07-26 20:29   ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-07-26 20:29 UTC (permalink / raw)
To: Stephen Lord; +Cc: Rik van Riel, Christoph Hellwig, torvalds, linux-mm

Stephen Lord wrote:
>
> On Thu, 2002-07-25 at 14:40, Andrew Morton wrote:
> > Rik van Riel wrote:
> > >
> > > On Thu, 25 Jul 2002, Christoph Hellwig wrote:
> > >
> > > > This function (start_aggressive_readahead()) checks whether all zones
> > > > of the given gfp mask have lots of free pages.
> > >
> > > Seems a bit silly since ideally we wouldn't reclaim cache memory
> > > until we're low on physical memory.
> >
> > Yes, I would question its worth also.
> >
> > What it boils down to is: which pages are we, in the immediate future,
> > more likely to use?  Pages which are at the tail of the inactive list,
> > or pages which are in the file's readahead window?
> >
> > I'd say the latter, so readahead should just go and do reclaim.
>
> The interesting thing is that tuning metadata readahead using
> this function does indeed improve performance under heavy memory
> load.  It seems we end up pushing more useful things out of
> memory than the metadata we read in.

I'm surprised.  Could be that even when there is no memory pressure,
you're simply reading stuff which you're never using?

Ah.  Could be that the improvements which you saw are nothing to do
with leaving memory free, and everything to do with the extreme latency
which occurs in page reclaim when the system is under load.  (I'm
whining again).

> Andrew, you talked about
> a GFP flag which would mean only return memory if there was
> some available which was already free and clean.

Yes, you can do that now.  Just use GFP_ATOMIC & ~__GFP_HIGH and the
allocation will fail if it could not be satisfied from a zone which
has (free_pages > zone->pages_min).
Which will dip further into the page reserves than the
start_aggressive_readahead() approach would have, but it'll certainly
get around the page reclaim latency.

(You'll need to set PF_NOWARN around the call, else the page allocator
will spam you to death.  Sorry)
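[Editor's note: in userspace terms, the policy Andrew suggests is
"allocate only if no reclaim would be needed, otherwise skip the
speculative readahead".  A toy model of that decision follows; the
`zone_model` structure and both function names are invented, standing in
for the real allocator.  In the kernel this would be
alloc_page(GFP_ATOMIC & ~__GFP_HIGH) with PF_NOWARN set, as described
above.]

```c
#include <assert.h>

/* Invented stand-in for a memory zone and its low watermark. */
struct zone_model {
	long free_pages;
	long pages_min;
};

/*
 * Returns 1 if `pages_wanted` pages could be taken without dropping
 * the zone below its pages_min watermark (i.e. without forcing any
 * page reclaim), and 0 if the caller should skip the speculative
 * metadata readahead instead.
 */
static int can_readahead(const struct zone_model *z, long pages_wanted)
{
	return z->free_pages - pages_wanted > z->pages_min;
}
```

The point of the design is that a failed probe costs nothing: the caller
simply issues no readahead, rather than evicting possibly-useful pages to
make room for speculative metadata.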
* Re: [RFC] start_aggressive_readahead
  2002-07-25 16:10 [RFC] start_aggressive_readahead Christoph Hellwig
  2002-07-25 16:44 ` Rik van Riel
@ 2002-07-26  6:53 ` Daniel Phillips
  1 sibling, 0 replies; 23+ messages in thread
From: Daniel Phillips @ 2002-07-26 6:53 UTC (permalink / raw)
To: Christoph Hellwig, torvalds; +Cc: linux-mm

On Thursday 25 July 2002 18:10, Christoph Hellwig wrote:
> I'm also open for a better name (I think the current one is very bad,
> but don't have a better idea :)).  I'd also be interested in comments
> on how to avoid the new function and use existing functionality for
> it, but I've tried to find it for a long time and didn't find
> something suitable.

That's the right attitude imho.  Redoing readahead needs to be a
project all by itself, a fine thing to experiment with in the stable
series.  A bad idea that sort of works for now is better than what
we've got.

--
Daniel
end of thread, other threads:[~2002-08-05 18:54 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-07-25 16:10 [RFC] start_aggressive_readahead Christoph Hellwig
2002-07-25 16:44 ` Rik van Riel
2002-07-25 19:40   ` Andrew Morton
2002-07-26 16:50     ` Scott Kaplan
2002-07-26 19:38       ` Andrew Morton
2002-07-28 23:32         ` Scott Kaplan
2002-07-29  0:19           ` Rik van Riel
2002-07-29  2:12             ` Scott Kaplan
2002-07-29  3:05               ` Rik van Riel
2002-07-29 15:24                 ` Scott Kaplan
2002-07-29  7:34           ` Andrew Morton
2002-07-29  7:37             ` Vladimir Dergachev
2002-07-29  7:53               ` Andrew Morton
2002-07-29  8:04             ` Rik van Riel
2002-07-30 16:11             ` Scott Kaplan
2002-07-30 16:21               ` Martin J. Bligh
2002-07-30 16:38                 ` Scott Kaplan
2002-07-30 16:52                   ` Martin J. Bligh
2002-08-05 18:54                     ` Scott Kaplan
2002-07-30 17:13                 ` William Lee Irwin III
2002-07-26 20:14   ` Stephen Lord
2002-07-26 20:29     ` Andrew Morton
2002-07-26  6:53 ` Daniel Phillips