linux-mm.kvack.org archive mirror
* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
       [not found] <199807091442.PAA01020@dax.dcs.ed.ac.uk>
@ 1998-07-09 18:59 ` Rik van Riel
  1998-07-09 23:37   ` Stephen C. Tweedie
  1998-07-11 14:14 ` Rik van Riel
  1 sibling, 1 reply; 40+ messages in thread
From: Rik van Riel @ 1998-07-09 18:59 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Benjamin C.R. LaHaise, Andrea Arcangeli, Stephen Tweedie,
	Linux Kernel, Linux MM

On Thu, 9 Jul 1998, Stephen C. Tweedie wrote:
> On Tue, 7 Jul 1998 13:50:02 -0400 (EDT), "Benjamin C.R. LaHaise"
> <blah@kvack.org> said:
> 
> > Right.  I'd rather see a multi-level lru like policy (ie on each cache hit
> > it gets moved up one level in the cache, with the lru'd pages from a given
>
> There's a fundamentally nice property about the multi-level cache
> which we _cannot_ easily emulate with page aging, and that is the
> ability to avoid aging any hot pages at all while we are just
> consuming cold pages.  For example, a large "find|xargs grep" can be
> satisfied without staling any of the existing hot cached pages.

Then I'd better incorporate a design for this in the zone
allocator (we could add this to the page_struct, but in
the zone_struct we can make a nice bitmap of it).

OTOH, is it really _that_ much different from an aging
scheme with an initial age of 1?

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-09 18:59 ` cp file /dev/zero <-> cache [was Re: increasing page size] Rik van Riel
@ 1998-07-09 23:37   ` Stephen C. Tweedie
  1998-07-10  5:57     ` Rik van Riel
  0 siblings, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-09 23:37 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, Andrea Arcangeli

Hi,

On Thu, 9 Jul 1998 20:59:57 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> On Thu, 9 Jul 1998, Stephen C. Tweedie wrote:
>> 
>> There's a fundamentally nice property about the multi-level cache
>> which we _cannot_ easily emulate with page aging, and that is the
>> ability to avoid aging any hot pages at all while we are just
>> consuming cold pages.  

> Then I'd better incorporate a design for this in the zone
> allocator (we could add this to the page_struct, but in
> the zone_struct we can make a nice bitmap of it).

It's nothing to do with the allocator per se; it's really a different
solution to a different problem.  That helps, actually, as it means
we're not forced to stick with one allocator if we want to use such a
scheme.

> OTOH, is it really _that_ much different from an aging
> scheme with an initial age of 1?

Yes, it is: the aging scheme pretty much forces us to age all pages on
an equal basis, so a lot of transient pages hitting the cache has the
side effect of prematurely aging and evicting a lot of existing,
potentially far more valuable pages.  A multilevel cache is pretty much
essential if you're going to let any cached data survive a grep flood.
Whether you _want_ that, or whether you'd rather just let the cache
drain and repopulate it after the IO has calmed, is a different
question; there are situations where one or other decision might be
best, so it's not a guaranteed win.  But the multilevel cache does have
some nice properties which aren't so easy to get with page aging.  It
also tends to be faster at finding pages to evict, since we don't
require multiple passes to flush the transient page queue.
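As a rough user-space sketch of the behaviour described above (all names are hypothetical, nothing like the real kernel structures): new pages enter the transient level, a second reference promotes them, and eviction only ever touches the transient level, so a flood of once-used pages cannot push out the hot pages.

```c
/* Toy two-level cache: level 0 is the transient list, level 1 the
 * hot list.  A page referenced while transient is promoted; eviction
 * only ever replaces a transient page, so hot pages are never aged
 * by a stream of single-use pages.  (LRU order within a level is
 * omitted to keep the sketch short.) */
#define NPAGES 8

struct toy_page { int id; int level; };

static struct toy_page cache[NPAGES];
static int used;

/* returns 1 on a cache hit, 0 on a miss */
int cache_ref(int id)
{
    for (int i = 0; i < used; i++) {
        if (cache[i].id == id) {
            cache[i].level = 1;            /* hit: promote to hot */
            return 1;
        }
    }
    if (used < NPAGES) {
        cache[used++] = (struct toy_page){ id, 0 };
        return 0;
    }
    for (int i = 0; i < NPAGES; i++) {
        if (cache[i].level == 0) {         /* evict a transient page */
            cache[i] = (struct toy_page){ id, 0 };
            break;
        }
    }
    return 0;
}

/* simulate a "find|xargs grep" flood of n once-referenced pages */
void flood(int start, int n)
{
    for (int i = 0; i < n; i++)
        cache_ref(start + i);
}
```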

--Stephen.

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-09 23:37   ` Stephen C. Tweedie
@ 1998-07-10  5:57     ` Rik van Riel
  0 siblings, 0 replies; 40+ messages in thread
From: Rik van Riel @ 1998-07-10  5:57 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Benjamin C.R. LaHaise, Linux MM

On Fri, 10 Jul 1998, Stephen C. Tweedie wrote:

> potentially far more valuable pages.  A multilevel cache is pretty much
> essential if you're going to let any cached data survive a grep flood.
> Whether you _want_ that, or whether you'd rather just let the cache
> drain and repopulate it after the IO has calmed, is a different
> question; there are situations where one or other decision might be
> best, so it's not a guaranteed win.  But the multilevel cache does have
> some nice properties which aren't so easy to get with page aging.  It
> also tends to be faster at finding pages to evict, since we don't
> require multiple passes to flush the transient page queue.

Let's go with those nice properties. Especially the last
one (quicker at finding pages) is essential in preventing
memory fragmentation (a 'lazy' list can be used to prevent
pressure on the last few 'free' zones from building).

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
       [not found] <199807091442.PAA01020@dax.dcs.ed.ac.uk>
  1998-07-09 18:59 ` cp file /dev/zero <-> cache [was Re: increasing page size] Rik van Riel
@ 1998-07-11 14:14 ` Rik van Riel
  1998-07-11 21:23   ` Stephen C. Tweedie
  1 sibling, 1 reply; 40+ messages in thread
From: Rik van Riel @ 1998-07-11 14:14 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Benjamin C.R. LaHaise, Linux MM

On Thu, 9 Jul 1998, Stephen C. Tweedie wrote:

> There's a fundamentally nice property about the multi-level cache
> which we _cannot_ easily emulate with page aging, and that is the
> ability to avoid aging any hot pages at all while we are just
> consuming cold pages.  For example, a large "find|xargs grep" can be
> satisfied without staling any of the existing hot cached pages.

Thinking over this design, I wonder how many levels
we'll need for normal operation, and how many pages
are allowed in each level.

I'd think we'll want 4 levels, with each 'lower'
level having 30% to 70% more pages than the level
above. This should be enough to cater to the needs
of both rc5des-like programs and multi-megabyte
tiled image processing.
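Just to make that sizing concrete (purely illustrative numbers, not a proposal for actual constants): with a growth factor between 1.3 and 1.7, the four levels would be sized geometrically from the top level down.

```c
/* Hypothetical sizing for the scheme above: each 'lower' level holds
 * factor times the pages of the level above, with factor somewhere
 * between 1.3 and 1.7.  Level 0 is the topmost (hottest) level. */
long level_pages(long top_pages, int level, double factor)
{
    double n = top_pages;

    for (int i = 0; i < level; i++)
        n *= factor;               /* grow by the factor per level down */
    return (long)n;
}
```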

Then again, I could be completely wrong :) Anyone?

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-11 14:14 ` Rik van Riel
@ 1998-07-11 21:23   ` Stephen C. Tweedie
  1998-07-11 22:25     ` Rik van Riel
  1998-07-12  1:47     ` Benjamin C.R. LaHaise
  0 siblings, 2 replies; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-11 21:23 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, Linux MM

Hi,

On Sat, 11 Jul 1998 16:14:26 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> I'd think we'll want 4 levels, with each 'lower'
> level having 30% to 70% more pages than the level
> above. This should be enough to cater to the needs
> of both rc5des-like programs and multi-megabyte
> tiled image processing.

> Then again, I could be completely wrong :) Anyone?

Maybe, maybe not --- we'd have to try it.  However, I'm always a bit
dubious about being overly clever about this kind of stuff, and two
levels may well work fine.  At worst, we can do ageing on the resident
level and LRU on the transient, and let the ageing take care of it.

Personally, I think just a two-level LRU ought to be adequate.  Yes, I
know this implies getting rid of some of the page ageing from 2.1 again,
but frankly, that code seems to be more painful than it's worth.  The
"solution" of calling shrink_mmap multiple times just makes the
algorithm hideously expensive to execute.

--Stephen


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-11 21:23   ` Stephen C. Tweedie
@ 1998-07-11 22:25     ` Rik van Riel
  1998-07-13 13:23       ` Stephen C. Tweedie
  1998-07-12  1:47     ` Benjamin C.R. LaHaise
  1 sibling, 1 reply; 40+ messages in thread
From: Rik van Riel @ 1998-07-11 22:25 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Benjamin C.R. LaHaise, Linux MM

On Sat, 11 Jul 1998, Stephen C. Tweedie wrote:
> On Sat, 11 Jul 1998 16:14:26 +0200 (CEST), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
> 
> > I'd think we'll want 4 levels, with each 'lower'
> > level having 30% to 70% more pages than the level
> 
> Personally, I think just a two-level LRU ought to be adequate.  Yes, I
> know this implies getting rid of some of the page ageing from 2.1 again,
> but frankly, that code seems to be more painful than it's worth.  The
> "solution" of calling shrink_mmap multiple times just makes the
> algorithm hideously expensive to execute.

This could be adequate, but then we will want to maintain
an active:inactive ratio of 1:2, in order to get a somewhat
realistic aging effect on the LRU inactive pages.

Or maybe we want to do a 3-level thingy, inactive in LRU
order and active and hyperactive (wired?) with aging.
Then we only promote pages to the highest level when they've
reached the highest age in the active level.
(OK, this is probably _far_ too complex, but I'm just
exploring some wild ideas here in the hope of triggering
some ingenious idea)
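That promotion rule might look something like this as a toy state machine (hypothetical constants, just to make the idea concrete): inactive pages promote to active on first hit, and only a page that has reached the maximum age in the active level moves up to the hyperactive level.

```c
/* Three-level sketch: level 0 (inactive) is plain LRU, levels 1
 * (active) and 2 (hyperactive) use ageing, and promotion to level 2
 * happens only once a page reaches the maximum age in level 1. */
#define MAX_AGE 3

struct pg { int level; int age; };

void page_hit(struct pg *p)
{
    if (p->level == 0) {
        p->level = 1;              /* inactive hit: into active */
        p->age = 1;
    } else if (p->age < MAX_AGE) {
        p->age++;                  /* normal ageing within a level */
    } else if (p->level == 1) {
        p->level = 2;              /* max-aged: go hyperactive */
        p->age = 1;
    }
}

/* level a fresh page ends up on after n consecutive hits */
int level_after_hits(int n)
{
    struct pg p = { 0, 0 };

    while (n-- > 0)
        page_hit(&p);
    return p.level;
}
```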

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-11 21:23   ` Stephen C. Tweedie
  1998-07-11 22:25     ` Rik van Riel
@ 1998-07-12  1:47     ` Benjamin C.R. LaHaise
  1998-07-13 13:42       ` Stephen C. Tweedie
  1 sibling, 1 reply; 40+ messages in thread
From: Benjamin C.R. LaHaise @ 1998-07-12  1:47 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Rik van Riel, Linux MM

On Sat, 11 Jul 1998, Stephen C. Tweedie wrote:

> Personally, I think just a two-level LRU ought to be adequate.  Yes, I
> know this implies getting rid of some of the page ageing from 2.1 again,
> but frankly, that code seems to be more painful than it's worth.  The
> "solution" of calling shrink_mmap multiple times just makes the
> algorithm hideously expensive to execute.

Hmmm, is that a hint that I should sit down and work on the code tomorrow
whilst recovering? =)

		-ben


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-11 22:25     ` Rik van Riel
@ 1998-07-13 13:23       ` Stephen C. Tweedie
  0 siblings, 0 replies; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-13 13:23 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, Linux MM

Hi,

On Sun, 12 Jul 1998 00:25:20 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> On Sat, 11 Jul 1998, Stephen C. Tweedie wrote:
>> On Sat, 11 Jul 1998 16:14:26 +0200 (CEST), Rik van Riel
>> <H.H.vanRiel@phys.uu.nl> said:
>> 
>> > I'd think we'll want 4 levels, with each 'lower'
>> > level having 30% to 70% more pages than the level
>> 
>> Personally, I think just a two-level LRU ought to be adequate.  Yes, I
>> know this implies getting rid of some of the page ageing from 2.1 again,
>> but frankly, that code seems to be more painful than it's worth.  The
>> "solution" of calling shrink_mmap multiple times just makes the
>> algorithm hideously expensive to execute.

> This could be adequate, but then we will want to maintain
> an active:inactive ratio of 1:2, in order to get a somewhat
> realistic aging effect on the LRU inactive pages.

Ageing is not a good thing in the cache, in general.  We _want_ to be
able to empty the cache at short notice.  LRU works for that.  The
existing physical scan is definitely suboptimal without ageing, but that
doesn't mean that ageing is the right answer.  (I tried doing buffer
ageing in the original kswap.  It sucked.)

> Or maybe we want to do a 3-level thingy, inactive in LRU
> order and active and hyperactive (wired?) with aging.

If we have more than 2 levels, then we definitely don't want ageing:
just let migration of pages between the levels do the ageing for us.

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-12  1:47     ` Benjamin C.R. LaHaise
@ 1998-07-13 13:42       ` Stephen C. Tweedie
  1998-07-18 22:10         ` Rik van Riel
  0 siblings, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-13 13:42 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise; +Cc: Stephen C. Tweedie, Rik van Riel, Linux MM

Hi,

On Sat, 11 Jul 1998 21:47:44 -0400 (EDT), "Benjamin C.R. LaHaise"
<blah@kvack.org> said:

> On Sat, 11 Jul 1998, Stephen C. Tweedie wrote:
>> Personally, I think just a two-level LRU ought to be adequate.  Yes, I
>> know this implies getting rid of some of the page ageing from 2.1 again,
>> but frankly, that code seems to be more painful than it's worth.  The
>> "solution" of calling shrink_mmap multiple times just makes the
>> algorithm hideously expensive to execute.

> Hmmm, is that a hint that I should sit down and work on the code tomorrow
> whilst recovering? =)

I'm working on it right now.  Currently, the VM is so bad that it is
seriously getting in the way of my job.  Just trying to fix some odd
swapper bugs is impossible to test because I can't set up a ramdisk for
swap and do in-memory tests that way: things thrash incredibly.  The
algorithms for aggressive cache pruning rely on fractions of
nr_physpages, and that simply doesn't work if you have large numbers of
pages dedicated to non-swappable things such as ramdisk, bigphysarea DMA
buffers or network buffers.

Rik, unfortunately I think we're just going to have to back out your
cache page ageing.  I've just done that on my local test box and the
results are *incredible*: it is going much more than an order of
magnitude faster on many things.  Fragmentation also seems drastically
improved: I've been doing builds of defrag in a 6MB box which were
impossible beforehand due to NFS stalls.

I'm going to do a bit more experimenting to see if we can keep some of
the good ageing behaviour by doing proper LRU in the cache, but
otherwise I think the cache ageing has either got to go or to be
drastically altered.

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-13 13:42       ` Stephen C. Tweedie
@ 1998-07-18 22:10         ` Rik van Riel
  1998-07-20 16:04           ` Stephen C. Tweedie
  0 siblings, 1 reply; 40+ messages in thread
From: Rik van Riel @ 1998-07-18 22:10 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Benjamin C.R. LaHaise, Linux MM

On Mon, 13 Jul 1998, Stephen C. Tweedie wrote:

> I'm working on it right now.  Currently, the VM is so bad that it is
> seriously getting in the way of my job.  Just trying to fix some odd
> swapper bugs is impossible to test because I can't set up a ramdisk for
> swap and do in-memory tests that way: things thrash incredibly.  The
> algorithms for aggressive cache pruning rely on fractions of
> nr_physpages, and that simply doesn't work if you have large numbers of
> pages dedicated to non-swappable things such as ramdisk, bigphysarea DMA
> buffers or network buffers.

This means we'll have to subtract those pages before
determining the used percentage.
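A minimal sketch of that adjustment (the names are illustrative, not real kernel variables): take the cache percentage against the reclaimable pages only.

```c
/* Compute cache pressure against the pages the VM can actually
 * reclaim, instead of against all physical pages.  Pinned pages are
 * the ramdisk, DMA and network buffers mentioned above. */
int cache_percent(long nr_physpages, long nr_pinned, long nr_cache)
{
    long reclaimable = nr_physpages - nr_pinned;

    if (reclaimable <= 0)
        return 100;                /* everything pinned: full pressure */
    return (int)(nr_cache * 100 / reclaimable);
}
```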

> Rik, unfortunately I think we're just going to have to back out your
> cache page ageing.  I've just done that on my local test box and the
> results are *incredible*:

OK, I don't see many problems with that, except that the
aging helps a _lot_ with readahead. For the rest, it's
not much more than a kludge anyway ;(

We really ought to do better than that anyway. I'll give
you guys the URL of the Digital Unix manuals on this...
(they have some _very_ nice mechanisms for this)

> I'm going to do a bit more experimenting to see if we can keep some of
> the good ageing behaviour by doing proper LRU in the cache, but
> otherwise I think the cache ageing has either got to go or to be
> drastically altered.

A 2-level LRU on the page cache would be _very_ nice,
but probably just as disastrous wrt. fragmentation as
aging...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-18 22:10         ` Rik van Riel
@ 1998-07-20 16:04           ` Stephen C. Tweedie
  0 siblings, 0 replies; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-20 16:04 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, Linux MM

Hi,

On Sun, 19 Jul 1998 00:10:09 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> On Mon, 13 Jul 1998, Stephen C. Tweedie wrote:
>> I'm working on it right now.  Currently, the VM is so bad that it is
>> seriously getting in the way of my job.  Just trying to fix some odd
>> swapper bugs is impossible to test because I can't set up a ramdisk for
>> swap and do in-memory tests that way: things thrash incredibly.  The
>> algorithms for aggressive cache pruning rely on fractions of
>> nr_physpages, and that simply doesn't work if you have large numbers of
>> pages dedicated to non-swappable things such as ramdisk, bigphysarea DMA
>> buffers or network buffers.

> This means we'll have to subtract those pages before
> determining the used percentage.

Sure, but that's just admitting that the system is so inherently
incapable of balancing itself that we have to place fixed limits on
the cache size, and I'm not sure that's a good thing.

>> Rik, unfortunately I think we're just going to have to back out your
>> cache page ageing.  I've just done that on my local test box and the
>> results are *incredible*:

> OK, I don't see much problems with that, except that the
> aging helps a _lot_ with readahead. For the rest, it's
> not much more than a kludge anyway ;(

This is something we need to sort out.  From my benchmarks so far, the
one thing that's certain is that you were benchmarking something
different from me when you found the ageing speedups.  That's not
good, because it implies that neither mechanism is doing the Right
Thing.  What sort of circumstances were you seeing big performance
improvements in for your original page ageing code?  That might help
us to identify what the core improvement in the ageing is, so that we
don't lose too much if we start changing the scheme again.

> We really ought to do better than that anyway. I'll give
> you guys the URL of the Digital Unix manuals on this...
> (they have some _very_ nice mechanisms for this)

OK, thanks!

> A 2-level LRU on the page cache would be _very_ nice,
> but probably just as disastrous wrt. fragmentation as
> aging...

Actually, fragmentation is not the big issue wrt ageing.  The page
ageing code is simply keeping the cache too large; the time it takes
to age the cache means that far too much is getting swapped out, and
on a low-memory machine the cache grows too large altogether.

This means that there may be several ways forward.  A multi-level LRU
would not necessarily be any worse for fragmentation.  Keeping a (low)
ceiling on the page age in the cache might also be a way forward,
allowing us to give a priority boost to readahead pages, but letting
us then cap the age once the pages are read to prevent them from
staying too long in the cache.  
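That boost-then-cap idea could be sketched like this (the constants and names are made up for illustration):

```c
/* Readahead pages get a boosted starting age so they are not evicted
 * before they are used; once a page has actually been referenced its
 * age is clamped to a low ceiling so it cannot linger in the cache. */
#define READAHEAD_AGE 8
#define AGE_CEILING   3

int next_age(int age, int fresh_readahead)
{
    if (fresh_readahead)
        return READAHEAD_AGE;      /* priority boost for readahead */
    if (age >= AGE_CEILING)
        return AGE_CEILING;        /* cap once the page has been read */
    return age + 1;
}
```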

I'm also experimenting right now with a number of new zoning and
ageing mechanisms which may address the fragmentation issue.  As far
as page ageing is concerned, it's really just the overall cache size,
and the self-tuning of the cache size, which are my main concerns at
the moment.

--Stephen


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-09 20:39                         ` Rik van Riel
@ 1998-07-13 11:54                           ` Stephen C. Tweedie
  0 siblings, 0 replies; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-13 11:54 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Andrea Arcangeli, Linux MM

Hi,

On Thu, 9 Jul 1998 22:39:10 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> On Wed, 8 Jul 1998, Stephen C. Tweedie wrote:
>> <H.H.vanRiel@phys.uu.nl> said:
>> 
>> > When my zone allocator is finished, it'll be a piece of
>> > cake to implement lazy page reclamation.
>> 
>> I've already got a working implementation.  The issue of lazy
>> reclamation is pretty much independent of the allocator underneath; I

> We really should integrate this _now_, with the twist
> that pages which could form a larger buddy should be
> immediately deallocated.

Perhaps, but I don't think Linus will take it.  He's right, too, it's
too near 2.2 for that.

> This can give us a cheap way to:
> - create larger memory buddies
> - remove some of the pressure on the buddy allocator
>   (no need to grab that last 64 kB area when 25% of
>   user pages are lazy reclaim)

All it can do is to reduce the pain of doing swapping too aggressively.
It doesn't make it much easier to do true defragmentation; it just lets
you hang on to the defragmented pages a bit longer, which is a different
thing.  If you end up with non-pageable pages allocated to
kmalloc/slab/page tables all over memory, then lazy reclaim is powerless
to help defrag the memory.  We need something else for 2.2.

--Stephen


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-11 11:18                         ` Rik van Riel
@ 1998-07-11 21:11                           ` Stephen C. Tweedie
  0 siblings, 0 replies; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-11 21:11 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, Linux MM, Stephen Tweedie, Linux Kernel

Hi,

On Sat, 11 Jul 1998 13:18:35 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> This morning I have posted a patch to Linux MM which can
> drastically improve this situation.

> For the low-mem linux-kernel users, you can get the patch
> from my homepage too.

I can't see it...

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-08 21:19                       ` Andrea Arcangeli
@ 1998-07-11 11:18                         ` Rik van Riel
  1998-07-11 21:11                           ` Stephen C. Tweedie
  0 siblings, 1 reply; 40+ messages in thread
From: Rik van Riel @ 1998-07-11 11:18 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux MM, Stephen Tweedie, Linux Kernel

On Wed, 8 Jul 1998, Andrea Arcangeli wrote:
> On Wed, 8 Jul 1998, Stephen C. Tweedie wrote:
> 
> >I'm unconvinced.  It's pretty clear that the underlying problem is that
>the cache is far too aggressive when you are copying large amounts of
> >data around.  The fact that interactive performance is bad suggests not
> >that the swapping algorithm is making bad decisions, but that it is
> >being forced to work with far too little physical memory due to the
> >cache size.

This morning I have posted a patch to Linux MM which can
drastically improve this situation.

For the low-mem linux-kernel users, you can get the patch
from my homepage too.

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-08 22:11                       ` Stephen C. Tweedie
  1998-07-09  7:43                         ` Rik van Riel
@ 1998-07-09 20:39                         ` Rik van Riel
  1998-07-13 11:54                           ` Stephen C. Tweedie
  1 sibling, 1 reply; 40+ messages in thread
From: Rik van Riel @ 1998-07-09 20:39 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Linux MM

On Wed, 8 Jul 1998, Stephen C. Tweedie wrote:
> <H.H.vanRiel@phys.uu.nl> said:
> 
> > When my zone allocator is finished, it'll be a piece of
> > cake to implement lazy page reclamation.
> 
> I've already got a working implementation.  The issue of lazy
> reclamation is pretty much independent of the allocator underneath; I

We really should integrate this _now_, with the twist
that pages which could form a larger buddy should be
immediately deallocated.

This can give us a cheap way to:
- create larger memory buddies
- remove some of the pressure on the buddy allocator
  (no need to grab that last 64 kB area when 25% of
  user pages are lazy reclaim)
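The twist could be sketched as follows (a toy model, not the real buddy allocator): a freed page normally parks on the lazy list, but if its order-0 buddy (index ^ 1) has already been returned, it is released immediately so the pair can coalesce.

```c
/* Toy model of lazy reclamation with the twist above: freed pages
 * are normally parked on a lazy list so they can be revived cheaply,
 * but a page whose order-0 buddy (index ^ 1) is already free goes
 * straight back to the allocator so the buddies can coalesce. */
#define TOY_PAGES 16

static int returned[TOY_PAGES];    /* 1 = handed back to the allocator */
static int lazy_count;

void mark_returned(int idx)
{
    returned[idx] = 1;
}

/* returns 1 if reclaimed immediately, 0 if parked on the lazy list */
int lazy_free(int idx)
{
    if (returned[idx ^ 1]) {
        mark_returned(idx);        /* buddy free: coalesce right away */
        return 1;
    }
    lazy_count++;                  /* keep the page revivable instead */
    return 0;
}

/* demo: page 5 was already freed, so freeing 4 coalesces at once,
 * while freeing 2 (busy buddy 3) just parks it on the lazy list */
int demo(void)
{
    mark_returned(5);
    return lazy_free(4) * 10 + lazy_free(2);
}
```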

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
@ 1998-07-09 13:01 Zachary Amsden
  0 siblings, 0 replies; 40+ messages in thread
From: Zachary Amsden @ 1998-07-09 13:01 UTC (permalink / raw)
  To: Rik van Riel, Stephen C. Tweedie; +Cc: Andrea Arcangeli, Linux MM, Linux Kernel


-----Original Message-----
From: Rik van Riel <H.H.vanRiel@phys.uu.nl>
To: Stephen C. Tweedie <sct@redhat.com>
Cc: Andrea Arcangeli <arcangeli@mbox.queen.it>; Linux MM
<linux-mm@kvack.org>; Linux Kernel <linux-kernel@vger.rutgers.edu>
Date: Thursday, July 09, 1998 3:50 AM
Subject: Re: cp file /dev/zero <-> cache [was Re: increasing page size]


>On Wed, 8 Jul 1998, Stephen C. Tweedie wrote:
>> <H.H.vanRiel@phys.uu.nl> said:
>>
>> > When my zone allocator is finished, it'll be a piece of
>> > cake to implement lazy page reclamation.
>>
>> I've already got a working implementation.  The issue of lazy
>> reclamation is pretty much independent of the allocator underneath; I
>> don't see it being at all hard to run the lazy reclamation stuff on
>top
>> of any form of zoned allocation.
>
>The problem with the current allocator is that it stores
>the pointers to available blocks in the blocks themselves.
>This means we can't wait till the last moment with lazy
>reclamation.


Presumably to reduce memory use, but at what cost?  It prevents
lazy reclamation and makes locating available blocks a major
headache.  It only takes 4k of memory to store a bitmap of free
blocks in a 128 Meg system.  Storing the free list in free space is
an admirable hack, but maybe outdated.
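The arithmetic behind that 4k figure checks out with one bit per 4 kB page (a quick sanity check, not kernel code):

```c
/* One bit per page: a 128 MB machine with 4 kB pages has 32768
 * pages, so the whole free-page bitmap fits in 4096 bytes. */
long bitmap_bytes(long mem_bytes, long page_size)
{
    long pages = mem_bytes / page_size;

    return (pages + 7) / 8;        /* round up to whole bytes */
}
```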

Zach Amsden
amsden@andrew.cmu.edu

P.S. I'm new to this discussion, so please don't flay me if
everything I said is in gross violation of the truth.


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-08 22:11                       ` Stephen C. Tweedie
@ 1998-07-09  7:43                         ` Rik van Riel
  1998-07-09 20:39                         ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: Rik van Riel @ 1998-07-09  7:43 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Linux MM, Linux Kernel

On Wed, 8 Jul 1998, Stephen C. Tweedie wrote:
> <H.H.vanRiel@phys.uu.nl> said:
> 
> > When my zone allocator is finished, it'll be a piece of
> > cake to implement lazy page reclamation.
> 
> I've already got a working implementation.  The issue of lazy
> reclamation is pretty much independent of the allocator underneath; I
> don't see it being at all hard to run the lazy reclamation stuff on top
> of any form of zoned allocation.

The problem with the current allocator is that it stores
the pointers to available blocks in the blocks themselves.
This means we can't wait till the last moment with lazy
reclamation.

> is already present in 2.1 now.  The only thing missing is the
> maintenance of the LRU list of lazy pages for reuse.

That part will come for free with my zone allocator.

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-08 18:57                     ` Rik van Riel
@ 1998-07-08 22:11                       ` Stephen C. Tweedie
  1998-07-09  7:43                         ` Rik van Riel
  1998-07-09 20:39                         ` Rik van Riel
  0 siblings, 2 replies; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-08 22:11 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Andrea Arcangeli, Linux MM, Linux Kernel

Hi,

On Wed, 8 Jul 1998 20:57:27 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> When my zone allocator is finished, it'll be a piece of
> cake to implement lazy page reclamation.

I've already got a working implementation.  The issue of lazy
reclamation is pretty much independent of the allocator underneath; I
don't see it being at all hard to run the lazy reclamation stuff on top
of any form of zoned allocation.

> With lazy reclamation, we simply place an upper limit
> on the number of _active_ pages. A process that's really
> thrashing away will simply be moving its pages to/from
> the inactive list.

Exactly.  We _do_ want to be able to increase the RSS limit dynamically
to avoid moving too many pages in and out of the working set, but if the
process's working set is _that_ large, then performance will be
dominated so much by L2 cache thrashing and CPU TLB misses that the extra
minor page faults we'd get are unlikely to be a catastrophic performance
problem.  

In short, if there's no contention on memory, there's no need to impose
RSS limits at all: it's just an extra performance cost.  But as soon as
physical memory contention becomes important, the RSS management is an
obvious way of restricting the performance impact of the large processes
on the rest of the system.

> And when memory pressure increases, other processes will
> start taking pages away from the inactive pages collection
> of our memory hog.

Precisely. 

> That looks quite OK to me...

Yep.  That's one of the main motivations behind the swap cache work in
2.1: the way the swapper now works, we can unhook pages from the
process's page tables and send them to swap once the RSS limit is
exceeded, but keep a copy of those pages in the swap cache so that if
the process wants a page back before we've got around to reusing the
memory, it's just a minor fault to bring it back in.  All of this code
is already present in 2.1 now.  The only thing missing is the
maintenance of the LRU list of lazy pages for reuse.

--Stephen




* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-08 13:54                     ` Stephen C. Tweedie
@ 1998-07-08 21:19                       ` Andrea Arcangeli
  1998-07-11 11:18                         ` Rik van Riel
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 1998-07-08 21:19 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Benjamin C.R. LaHaise, Rik van Riel, Linux MM, Linux Kernel

On Wed, 8 Jul 1998, Stephen C. Tweedie wrote:

>I'm unconvinced.  It's pretty clear that the underlying problem is that
>the cache is far too aggressive when you are copying large amounts of
>data around.  The fact that interactive performance is bad suggests not
>that the swapping algorithm is making bad decisions, but that it is
>being forced to work with far too little physical memory due to the
>cache size.

Yes, this is exactly what I think too.

Andrea[s] Arcangeli


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-08 13:45                   ` Stephen C. Tweedie
@ 1998-07-08 18:57                     ` Rik van Riel
  1998-07-08 22:11                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 40+ messages in thread
From: Rik van Riel @ 1998-07-08 18:57 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Linux MM, Linux Kernel

On Wed, 8 Jul 1998, Stephen C. Tweedie wrote:
> On Tue, 7 Jul 1998 17:54:46 +0200 (CEST), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
> 
> > There's a good compromise between balancing per-page
> > and per-process. We can simply declare the last X
> > (say 8) pages of a process holy unless that process
> > has slept for more than Y (say 5) seconds.
> 
> Yep --- this is per-process RSS management, and there is a _lot_ we
> can do once we start following this route.  I've been talking with
> some folk about it already, and this is something we definitely want
> to look into for 2.3.
> 
> The hard part is the self-tuning --- making sure that we don't give a

When my zone allocator is finished, it'll be a piece of
cake to implement lazy page reclamation.
With lazy reclamation, we simply place an upper limit
on the number of _active_ pages. A process that's really
thrashing away will simply be moving its pages to/from
the inactive list.

And when memory pressure increases, other processes will
start taking pages away from the inactive pages collection
of our memory hog.

That looks quite OK to me...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-07 17:32                   ` Benjamin C.R. LaHaise
@ 1998-07-08 13:54                     ` Stephen C. Tweedie
  1998-07-08 21:19                       ` Andrea Arcangeli
  0 siblings, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-08 13:54 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: Rik van Riel, Stephen C. Tweedie, Andrea Arcangeli, Linux MM,
	Linux Kernel

Hi,

On Tue, 7 Jul 1998 13:32:34 -0400 (EDT), "Benjamin C.R. LaHaise"
<blah@kvack.org> said:

> This is the wrong fix for the case that Andrea is complaining about -
> tossing out chunks of processes piecemeal, resulting in a lengthy page-in
> time when the process becomes active again.  Two things that might help
> with this are: read-ahead on swapins, and *true* swapping.  

I'm unconvinced.  It's pretty clear that the underlying problem is that
the cache is far too aggressive when you are copying large amounts of
data around.  The fact that interactive performance is bad suggests not
that the swapping algorithm is making bad decisions, but that it is
being forced to work with far too little physical memory due to the
cache size.

There's no doubt that swap readahead and true full-process swapping can
give us performance benefits, but Andrea is quite clearly seeing
enormous resident cache sizes when copying large files to /dev/null, and
that's a problem which we need to tackle independently of the swapper's
own page selection algorithms.

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-07 15:54                 ` Rik van Riel
  1998-07-07 17:32                   ` Benjamin C.R. LaHaise
@ 1998-07-08 13:45                   ` Stephen C. Tweedie
  1998-07-08 18:57                     ` Rik van Riel
  1 sibling, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-08 13:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Andrea Arcangeli, Linux MM, Linux Kernel

Hi,

On Tue, 7 Jul 1998 17:54:46 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> There's a good compromise between balancing per-page
> and per-process. We can simply declare the last X
> (say 8) pages of a process holy unless that process
> has slept for more than Y (say 5) seconds.

Yep --- this is per-process RSS management, and there is a _lot_ we
can do once we start following this route.  I've been talking with
some folk about it already, and this is something we definitely want
to look into for 2.3.

For example, we can do both RSS limits (upper limits to RSS) plus RSS
quotas (a guaranteed lower limit which we allocate to the process).
Consider a machine where we have some very large processes thrashing
away; placing an RSS limit on those excessive processes will prevent
them from hogging all of physical memory, and giving interactive
processes a small guaranteed RSS quota will ensure that those
processes are allowed to make at least some progress even under severe
VM load.

The hard part is the self-tuning --- making sure that we don't give a
resident quota to idle processes, so that they can be fully swapped
out, and making sure that we don't overly trim back large processes
for which there is actually sufficient physical memory.  However, the
principle of RSS management is a powerful one and we should most
certainly be doing this for 2.3.

--Stephen


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-07 15:54                 ` Rik van Riel
@ 1998-07-07 17:32                   ` Benjamin C.R. LaHaise
  1998-07-08 13:54                     ` Stephen C. Tweedie
  1998-07-08 13:45                   ` Stephen C. Tweedie
  1 sibling, 1 reply; 40+ messages in thread
From: Benjamin C.R. LaHaise @ 1998-07-07 17:32 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Andrea Arcangeli, Linux MM, Linux Kernel

On Tue, 7 Jul 1998, Rik van Riel wrote:

> There's a good compromise between balancing per-page
> and per-process. We can simply declare the last X
> (say 8) pages of a process holy unless that process
> has slept for more than Y (say 5) seconds.

This is the wrong fix for the case that Andrea is complaining about -
tossing out chunks of processes piecemeal, resulting in a lengthy page-in
time when the process becomes active again.  Two things that might help
with this are: read-ahead on swapins, and *true* swapping.  If the system
has run out of ram for the tasks at hand, should it not swap out a process
that's inactive in one fell swoop?  Likewise, when said process resumes,
it's probably worth bringing that entire working set back into memory.
That way the user will only experience a brief pause on the first
keystroke issued to bash, not the 'pause on the first character typed, then
pause as the line-editing code faults back in...'

		-ben



* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-07 12:01               ` Stephen C. Tweedie
@ 1998-07-07 15:54                 ` Rik van Riel
  1998-07-07 17:32                   ` Benjamin C.R. LaHaise
  1998-07-08 13:45                   ` Stephen C. Tweedie
  0 siblings, 2 replies; 40+ messages in thread
From: Rik van Riel @ 1998-07-07 15:54 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Linux MM, Linux Kernel

On Tue, 7 Jul 1998, Stephen C. Tweedie wrote:
> On Mon, 6 Jul 1998 21:28:42 +0200 (CEST), Andrea Arcangeli
> <arcangeli@mbox.queen.it> said:
> 
> > It would be nice if only pages that have not been used in the past half
> > hour were swapped out. If kswapd ran that way I would thank
> > you a lot instead of being irritated ;-).
> 
> ?? Some people will want to keep anything used within the last half
> hour; in other cases, 5 minutes idle should qualify for a swapout.  On
> the compilation benchmarks I run on 6MB machines, any page not used
> within the past 10 seconds or so should be history!

There's a good compromise between balancing per-page
and per-process. We can simply declare the last X
(say 8) pages of a process holy unless that process
has slept for more than Y (say 5) seconds.

As a temporary measure, you can tune swapctl to
have an age_cluster_fract of 128 and an
age_cluster_min of 0; this will leave the last 8
pages of an app in memory, whatever happens...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-06 13:37       ` Eric W. Biederman
@ 1998-07-07 12:35         ` Stephen C. Tweedie
  0 siblings, 0 replies; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-07 12:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Rik van Riel, Andrea Arcangeli, Linux MM,
	Linux Kernel

Hi,

On 06 Jul 1998 08:37:02 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:

> The use of touch_page and age_page appears to be the most likely
> candidate for the page cache being more persistent than it used to
> be.

Yes, very much so.

> If I'm not mistaken shrink_mmap must be called more often now to
> remove a given page.

Indeed.  Three things I think we need to do are to lower the age
ceiling for the page cache pages; perform page allocations for the
page cache with a GFP_CACHE flag which forces us to look for other
cache pages first in try_to_free_page; and try to eliminate several
pages at a time from the page cache when we can.  (There's no point in
keeping only half the pages from a closed, sequentially accessed file
in cache.)

The first two of these are definitely small enough and clean enough
changes to be appropriate for 2.1.

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-06 19:28             ` Andrea Arcangeli
@ 1998-07-07 12:01               ` Stephen C. Tweedie
  1998-07-07 15:54                 ` Rik van Riel
  0 siblings, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-07 12:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Rik van Riel, Linux MM, Linux Kernel

Hi,

On Mon, 6 Jul 1998 21:28:42 +0200 (CEST), Andrea Arcangeli
<arcangeli@mbox.queen.it> said:

> On Mon, 6 Jul 1998, Stephen C. Tweedie wrote:
>> No --- that's the whole point.  We have per-page process page aging
>> which lets us differentiate between processes which are active and those
>> which are idle, and between the used and unused pages within the active
>> processes.

> Nice! The problem is that the kernel probably thinks that bash and
> everything that isn't a 100% CPU eater is an idle process...

Not at all. :) A process only has to touch a page once per sweep of
the vm scanner for that page to be marked in use.  A shell which
touches a few pages for every keystroke will get the same preservation
of those pages as a process which is touching the same number of pages
in a tight loop.

>> If you are short on memory, then you don't want to keep around any
>> process pages which belong to idle tasks.  The only way to do that is to

> This is again more true for low memory machines (where the current kswapd
> policy sucks). I 100% agree with this, but I don't agree with swapping
> out to make space for the cache.

I've just explained why we _do_ want to do this on low memory
machines, to a certain extent.  When memory is low, we don't want to
keep around anything which we don't need,  and so swapping out
completely unused pages is a good thing.  The thing we need to avoid
is swapping anything touched recently; switching off swapout completely,
even just to make room for the cache, is wrong.

>> invoke the swapper.  We need to make sure that we are just aggressive
>> enough to discard pages which are not in use, and not to discard pages
>> which have been touched recently.

> I think that we are too aggressive.

Sure, in 2.1.

> It would be nice if only pages that have not been used in the past half
> hour were swapped out. If kswapd ran that way I would thank
> you a lot instead of being irritated ;-).

?? Some people will want to keep anything used within the last half
hour; in other cases, 5 minutes idle should qualify for a swapout.  On
the compilation benchmarks I run on 6MB machines, any page not used
within the past 10 seconds or so should be history!

>> You also don't want lpd sitting around, either.

> NO. I want lpd sitting around if it's been used in the last 10 minutes,
> for example. I don't want to swap out a process to make space for the
> _cache_ if the process is not 100% idle.

Not if your memory is full.

You CANNOT say "I want this in memory, not that".  You will always be
able to find situations where it doesn't work.  You need a balance.
I'm quite sure that you don't want your kernel build to thrash simply
because the vm system is afraid of swapping out the sendmail and lpd
daemons you used 10 minutes ago.

> 2.0.34 destroys the cache completely (wooo, nice, I love it when I see
> the cache destroyed ;-) and runs great.

No it doesn't.  It balances the cache better; that's a very different
thing.  The only difference between 2.0 and 2.1 in this regard is the
tuning of that balance; the underlying code is more or less the same.

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-06 14:36           ` Stephen C. Tweedie
@ 1998-07-06 19:28             ` Andrea Arcangeli
  1998-07-07 12:01               ` Stephen C. Tweedie
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 1998-07-06 19:28 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Rik van Riel, Linux MM, Linux Kernel

On Mon, 6 Jul 1998, Stephen C. Tweedie wrote:

>No --- that's the whole point.  We have per-page process page aging
>which lets us differentiate between processes which are active and those
>which are idle, and between the used and unused pages within the active
>processes.

Nice! The problem is that the kernel probably thinks that bash and
everything that isn't a 100% CPU eater is an idle process...

>If you are short on memory, then you don't want to keep around any
>process pages which belong to idle tasks.  The only way to do that is to

This is again more true for low memory machines (where the current kswapd
policy sucks). I 100% agree with this, but I don't agree with swapping out
to make space for the cache. The cache is too dynamic, so the
swap-in/swap-out continues forever.

>invoke the swapper.  We need to make sure that we are just aggressive
>enough to discard pages which are not in use, and not to discard pages
>which have been touched recently.

I think that we are too aggressive. Also, my bash got swapped out. If
I run `cp file /dev/null' on 2.0.34 and launch `free' from the shell, I
don't see stalls. It seems that `free' remains in the cache, while on
2.1.108 I had to wait many seconds to see `free' executed (and
characters printed to the console).

>If we simply prune the cache to zero before doing any swapping, then we
>will be eliminating potentially useful data out of the cache instead of
>throwing away pages to swap which may not have been used in the past
>half an hour.

It would be nice if only pages that have not been used in the past half
hour were swapped out. If kswapd ran that way I would thank
you a lot instead of being irritated ;-).

>That's what the balancing issue is about: if there are swap pages which
>are not being touched at all and files such as header files which are
>being constantly accessed, then we need to do at least _some_ swapping
>to eliminate the idle process pages.

100% agree.

>> I _really_ don't want cache and readahead when the system needs
>> memory.
>
>You also don't want lpd sitting around, either.

NO. I want lpd sitting around if it's been used in the last 10 minutes,
for example. I don't want to swap out a process to make space for the
_cache_ if the process is not 100% idle.

>> The only important thing is to avoid the constant swap-in/out and provide
>> free memory to the process.
>
>It's just wishful thinking to assume you can do this simply by
>destroying the cache.  Oh, and you _do_ want readahead even with little

Yes, I think we can avoid it by destroying the cache, since it's the only
cause I can point to that gives me problems when nothing huge is
running (when I have 20 Mbyte of "not used by me" memory).  2.0.34 destroys
the cache completely (wooo, nice, I love it when I see the cache destroyed
;-) and runs great. I have a friend who runs 2.0.34 on his 8 Mbyte laptop
just to compile the kernel in 30 minutes instead of the N hours of 2.0.10x.

>memory, otherwise you are doing 10 disk IOs to read a file instead of
>one; and on a box which is starved of memory, that implies you'll
>probably see a disk seek between each IO.  That's just going to thrash
>your disk even harder. 

I really don't care about read-ahead. When the system swaps, the HD is so
busy that there is really no difference between going 1 km/h or 0.1 km/h ;-).
Readahead in that case is like running an optimized O(2^n) algorithm
(against running an unoptimized one (no readahead)).

>> You don't run on a 32 Mbyte box, I see ;-).
>
>I run in 64MB,  16MB and 6MB for testing purposes.

Maybe your tests are a bit light ;-). Also, maybe you are not running on a
single IDE0 (UDMA) HD with the swap partition on the same disk, as I am.

Please avoid swap every time you can. Swap is the end of the life of
every machine. Trash the cache instead.

Which functions do I have to touch to destroy the cache instead of
swapping out processes? I'm not asking for the nice page-aging feature
you are describing; I only need to avoid swap to run _fast_ (as 2.0.34
does).

BTW, I started this thread these days only because I booted 2.0.34 and I
noticed the big improvement.

Andrea[s] Arcangeli

PS. Thanks anyway to all the mm guys who contributed to 2.1.x,
    since I _guess_ that kswapd and the mm layer in general are OK for high
    memory machines. __Maybe__ we only need some tuning for low memory
    machines.

    BTW, how many people tune the vm layer using the sysctls?


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-06 12:34         ` Andrea Arcangeli
@ 1998-07-06 14:36           ` Stephen C. Tweedie
  1998-07-06 19:28             ` Andrea Arcangeli
  0 siblings, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-06 14:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Rik van Riel, Linux MM, Linux Kernel

Hi,

On Mon, 6 Jul 1998 14:34:02 +0200 (CEST), Andrea Arcangeli
<arcangeli@mbox.queen.it> said:

> On Mon, 6 Jul 1998, Stephen C. Tweedie wrote:
>> or 16MB box doing compilations, then you desperately want unused process
>> data pages --- idle bits of inetd, lpd, sendmail, init, the shell, the

> Now even the process that needs memory gets swapped out.

No --- that's the whole point.  We have per-page process page aging
which lets us differentiate between processes which are active and those
which are idle, and between the used and unused pages within the active
processes.

If you are short on memory, then you don't want to keep around any
process pages which belong to idle tasks.  The only way to do that is to
invoke the swapper.  We need to make sure that we are just aggressive
enough to discard pages which are not in use, and not to discard pages
which have been touched recently.

If we simply prune the cache to zero before doing any swapping, then we
will be eliminating potentially useful data out of the cache instead of
throwing away pages to swap which may not have been used in the past
half an hour.  

That's what the balancing issue is about: if there are swap pages which
are not being touched at all and files such as header files which are
being constantly accessed, then we need to do at least _some_ swapping
to eliminate the idle process pages.

> I _really_ don't want cache and readahead when the system needs
> memory. 

You also don't want lpd sitting around, either.

> The only important thing is to avoid the constant swap-in/out and provide
> free memory to the process. 

It's just wishful thinking to assume you can do this simply by
destroying the cache.  Oh, and you _do_ want readahead even with little
memory, otherwise you are doing 10 disk IOs to read a file instead of
one; and on a box which is starved of memory, that implies you'll
probably see a disk seek between each IO.  That's just going to thrash
your disk even harder.

> You don't run on a 32 Mbyte box, I see ;-).

I run in 64MB,  16MB and 6MB for testing purposes.

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-05 19:31       ` Rik van Riel
  1998-07-06 10:38         ` Stephen C. Tweedie
@ 1998-07-06 14:20         ` Andrea Arcangeli
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 1998-07-06 14:20 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linux MM, Linux Kernel, Linus Torvalds

On Sun, 5 Jul 1998, Rik van Riel wrote:

>A few months ago someone (who?) posted a patch that modified
>kswapd's internals to only unmap clean pages when told to.
>
>If I can find the patch, I'll integrate it and let kswapd
>only swap clean pages when:
>- page_cache_size * 100 > num_physpages * page_cache.borrow_percent

I don't agree with swapping out if there are enough freeable pages in the
cache (or at least the aging should be much more clever than it is now).

It seems that setting pagecache, buffers and freepages to 1 2 3, and
setting kswapd to 1 1 1 (so that kswapd can only swap one page at a time),
helps a lot to make the system _usable_ (when I press a key I see it on the
console) during `cp file /dev/null' (the cache gets reduced to 3 Mbyte
against the default 10 Mbyte if memtest 10000000 is running at the same
time).  Sometimes I run out of memory with these settings while `cp file
/dev/null' is running, since the cache is allocated and the lower priority
of kswapd can't free a lot of memory, I think.

Now I have a new question. What would happen if kswapd were stopped
while `cp file /dev/null' is running? Is the cache memory allocated by cp
reused, or is it always allocated from free memory?

And is it possible to know how much memory is unmappable (and thus
freeable) from the cache? If so, we should use the swap_out() method in
do_try_to_free_page() only if there isn't enough freeable memory in the
cache. If swap_out() is not used, will kswapd free memory from the cache or
buffers without swapping, or not?

Think about a 128 Mbyte system. I think it makes no sense to swap out 3-4
Mbyte of RAM while having 40-50 Mbyte of cache and a lot of buffers
allocated. If I buy memory, _I_ don't want to see swap used. I _hate_ swap.
I would run with swapoff -a if the machine would not deadlock (with kswapd
loading the CPU 100%) instead of returning out of memory.

And how is the aging of pages handled? i386 (and MIPS, if I remember
well; don't tell me "and every other modern CPU" because I can guess
that ;-) provides a flag in every page-table entry that should be usable to
distinguish recently read/written pages from unused ones. Is that flag
used for the aging, or is the aging done entirely in software, without
taking advantage of the CPU facilities? I ask this because it seems that
the aging doesn't work, since my bash is swapped out (or removed from
RAM) when read(2) allocates the cache, while in 2.0.34 all is perfect.

Now I am using this simple program to test kswapd:

#include <unistd.h>

int main(void)
{
	char buf[4096];

	/* Read stdin a page at a time until EOF or a short read,
	 * filling the page cache as we go. */
	while (read(0, buf, sizeof(buf)) == sizeof(buf))
		;
	return 0;
}

./a.out < /tmp/zero

Where zero is a big file. When there is no more free memory (because
it's all allocated in the cache), bash is no longer responsive to
keypresses and the swap-in/out starts.

Fortunately, at least the 2.0.34 mm algorithms seem to work _perfectly_
under all kinds of conditions, so in the worst case I'll try to port the
interesting linux/mm/* things from 2.0.34 to 108 for my machine, and I'll
start rejecting every other official kernel mm patch (you can see that I
am really irritated by too much swapping in the last month ;-). It will be
hard work, but at least I will be sure of a good result...  Somebody
has really _screwed up_ the _perfect_ 2.0.34 kswapd on the way to 2.1.x.

As far as I know, nobody except me is working to fix kswapd. I should also
say that I have never used Linux on a machine with > 32 Mbyte of RAM, so I
don't know whether 2.1.108 works as perfectly there as 2.0.34. So please
either tell me to buy another 32 Mbyte of memory, or help me fix kswapd
instead of developing new things such as memory defragmentation.

Andrea[s] Arcangeli

PS. Now I am running 2.0.34 and it's much more efficient than
    2.1.108. 108 is surely much faster at everything, but _here_ the "always
    swapping" behaviour removes all the other improvements and makes the
    system much less fluid :-(.


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-06 10:24     ` Stephen C. Tweedie
@ 1998-07-06 13:37       ` Eric W. Biederman
  1998-07-07 12:35         ` Stephen C. Tweedie
  0 siblings, 1 reply; 40+ messages in thread
From: Eric W. Biederman @ 1998-07-06 13:37 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Rik van Riel, Andrea Arcangeli, Linux MM, Linux Kernel

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> It does: the Duff's device in try_to_free_page does it, and seems to
ST> work well enough.  It was certainly tuned tightly enough: all of the
ST> hard part of getting the kswap stuff working well in try_to_swap_out()
ST> was to do with tuning the aggressiveness of swap relative to the buffer
ST> and cache reclaim mechanisms so that the try_to_free_page loop works
ST> well.  That's why the recent policies of adding little rules here and
ST> there all over the mm layer have disturbed the balance so much, I think.

The use of touch_page and age_page appears to be the most likely
candidate for the page cache being more persistent than it used to
be.

If I'm not mistaken shrink_mmap must be called more often now to
remove a given page.

Eric



* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-06 10:31       ` Stephen C. Tweedie
@ 1998-07-06 12:34         ` Andrea Arcangeli
  1998-07-06 14:36           ` Stephen C. Tweedie
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 1998-07-06 12:34 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Rik van Riel, Linux MM, Linux Kernel

On Mon, 6 Jul 1998, Stephen C. Tweedie wrote:

>On Sun, 5 Jul 1998 20:38:57 +0200 (CEST), Andrea Arcangeli
><arcangeli@mbox.queen.it> said:
>
> kswapd must swap _nothing_ while _freeable_ cache memory is allocated.
> kswapd _must_ treat freeable cache memory as _free_, not used, memory
> and so it must not start swapping out useful code and data to make
> space for allocating more cache.
>
>You just can't make blanket statements like that!  If you're on an 8MB

I'd prefer not to make statements like that; in that case the aging
would work ;-).

>or 16MB box doing compilations, then you desperately want unused process
>data pages --- idle bits of inetd, lpd, sendmail, init, the shell, the

But right now even the process that needs the memory gets swapped out.

>top-level make and so on --- to be swapped out to make room for a few
>more header files in cache.  Throwing away all cache pages will also
>destroy readahead and prevent you from caching pages of a binary between
>successive invocations.

I _really_ don't want cache and readahead when the system needs memory.
The only important thing is to avoid the constant swapin/swapout and to
give free memory to the process. You don't run on a 32 Mbyte box, I see ;-).

Andrea[s] Arcangeli


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-06 10:38         ` Stephen C. Tweedie
@ 1998-07-06 11:42           ` Rik van Riel
  0 siblings, 0 replies; 40+ messages in thread
From: Rik van Riel @ 1998-07-06 11:42 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Linux MM

On Mon, 6 Jul 1998, Stephen C. Tweedie wrote:
> <H.H.vanRiel@phys.uu.nl> said:
> 
> > A few months ago someone (who?) posted a patch that modified
> > kswapd's internals to only unmap clean pages when told to.
> 
> > If I can find the patch, I'll integrate it and let kswapd
> > only swap clean pages when:
> 
> I'm not sure what that is supposed to achieve, and I'm not sure how well
> we expect such tinkering to work uniformly on 8MB and 512MB machines.
> Unmapping is not an issue with respect to cache sizes.

When we use this, we can finally 'enforce' the borrow_percent
stuff. Yes, I know the borrow_percent isn't really a good thing,
but we'll need the framework anyway when your balancing code
is implemented.

The 'only unmap clean pages' flag is a good way of implementing
this framework; maybe we want to combine it with a flag to
shrink_mmap() not to unmap swap cache pages...
Or maybe we want to do swap cache LRU reclamation when
free_memory_available(4) returns true.

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-05 19:31       ` Rik van Riel
@ 1998-07-06 10:38         ` Stephen C. Tweedie
  1998-07-06 11:42           ` Rik van Riel
  1998-07-06 14:20         ` Andrea Arcangeli
  1 sibling, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-06 10:38 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, Linux MM

Hi Rik,

On Sun, 5 Jul 1998 21:31:56 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> A few months ago someone (who?) posted a patch that modified
> kswapd's internals to only unmap clean pages when told to.

> If I can find the patch, I'll integrate it and let kswapd
> only swap clean pages when:
> - page_cache_size * 100 > num_physpages * page_cache.borrow_percent
> or
> - (buffer_mem >> PAGE_SHIFT) * 100 > num_physpages * buffermem.borrow_percent

I'm not sure what that is supposed to achieve, and I'm not sure how well
we expect such tinkering to work uniformly on 8MB and 512MB machines.
Unmapping is not an issue with respect to cache sizes.

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-05 18:38     ` Andrea Arcangeli
  1998-07-05 19:31       ` Rik van Riel
@ 1998-07-06 10:31       ` Stephen C. Tweedie
  1998-07-06 12:34         ` Andrea Arcangeli
  1 sibling, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-06 10:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, Linux MM, Linux Kernel, Stephen Tweedie

Hi,

On Sun, 5 Jul 1998 20:38:57 +0200 (CEST), Andrea Arcangeli
<arcangeli@mbox.queen.it> said:

> kswapd must swap _nothing_ while _freeable_ cache memory is allocated.
> kswapd _must_ treat freeable cache memory as _free_, not used, memory
> and so it must not start swapping out useful code and data to make
> space for allocating more cache.

You just can't make blanket statements like that!  If you're on an 8MB
or 16MB box doing compilations, then you desperately want unused process
data pages --- idle bits of inetd, lpd, sendmail, init, the shell, the
top-level make and so on --- to be swapped out to make room for a few
more header files in cache.  Throwing away all cache pages will also
destroy readahead and prevent you from caching pages of a binary between
successive invocations.

That's the problem with all rules of the form "memory management MUST
prioritise X over Y".  There are always cases where it is not true.
What we need is a balance, not arbitrary rules like that.  

--Stephen

* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-05 17:00   ` Rik van Riel
  1998-07-05 18:38     ` Andrea Arcangeli
  1998-07-05 18:57     ` MOLNAR Ingo
@ 1998-07-06 10:24     ` Stephen C. Tweedie
  1998-07-06 13:37       ` Eric W. Biederman
  2 siblings, 1 reply; 40+ messages in thread
From: Stephen C. Tweedie @ 1998-07-06 10:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, Linux MM, Linux Kernel

Hi,

On Sun, 5 Jul 1998 19:00:04 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> On Sun, 5 Jul 1998, Andrea Arcangeli wrote:
>> Where is the cache allocated? Is it allocated in the inode? If so
>> kswapd should shrink the inode before it starts swapping out!

> The cache is also mapped into a process'es address space.
> Currently we would have to walk all pagetables to find a
> specific page ;(

Not in this case, where the file is just being copied.  For a copy, the
reads exist unmapped in the page cache; only mmap() creates mapped
pages.


> When Stephen and Ben have merged their PTE stuff, we can
> do the freeing much easier though...

In this case, it's not an issue, so we need to fix it for 2.2.

>> I had to ask "2.0.34 has balancing code implemented and
>> running?". The

> 2.0 has no balancing code at all. At least, not AFAIK...

It does: the Duff's device in try_to_free_page does it, and seems to
work well enough.  It was certainly tuned tightly enough: all of the
hard part of getting the kswap stuff working well in try_to_swap_out()
was to do with tuning the aggressiveness of swap relative to the buffer
and cache reclaim mechanisms so that the try_to_free_page loop works
well.  That's why the recent policies of adding little rules here and
there all over the mm layer have disturbed the balance so much, I think.

>> Is there a function call (such us shrink_mmap for mmap or
>> kmem_cache_reap() for slab or shrink_dcache_memory() for dcache) that
>> is able to shrink the cache allocated by cp file /dev/zero?

> shrink_mmap() can only shrink unlocked and clean buffer pages
> and unmapped cache pages. We need to go through either bdflush
> (for buffer) or try_to_swap_out() first, in order to make some
> easy victims for shrink_mmap()...

Only for mapped files, not files copied through the standard read/write
calls.

--Stephen


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-05 18:38     ` Andrea Arcangeli
@ 1998-07-05 19:31       ` Rik van Riel
  1998-07-06 10:38         ` Stephen C. Tweedie
  1998-07-06 14:20         ` Andrea Arcangeli
  1998-07-06 10:31       ` Stephen C. Tweedie
  1 sibling, 2 replies; 40+ messages in thread
From: Rik van Riel @ 1998-07-05 19:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux MM, Linux Kernel

On Sun, 5 Jul 1998, Andrea Arcangeli wrote:
> On Sun, 5 Jul 1998, Rik van Riel wrote:
> 
> >The cache is also mapped into a process'es address space.
> >Currently we would have to walk all pagetables to find a
> >specific page ;(
> 
> I am starting to think that the problem is kswapd. Running cp file
> /dev/null, the system stays fluid (when I press a key I see the
> character on the _console_) as long as there is free (wasted, because
> unused) memory. While there is free memory, swap usage is 0. When the
> free memory runs out, the system dies, and when I press a key I don't
> see the character on the screen immediately. I think it's kswapd that
> is irritating me. So now I am trying to beat kswapd into shape (I am
> starting to hate it, since I really hate swap ;-). kswapd must swap
> _nothing_ while _freeable_ cache memory is allocated. kswapd _must_
> treat freeable cache memory as _free_, not used, memory and so it
> must not start swapping out useful code and data to make space for
> allocating more cache. With 2.0.34, when the cache ate all the free
> memory, nothing got swapped out and everything performed better.

A few months ago someone (who?) posted a patch that modified
kswapd's internals to only unmap clean pages when told to.

If I can find the patch, I'll integrate it and let kswapd
only swap clean pages when:
- page_cache_size * 100 > num_physpages * page_cache.borrow_percent
or
- (buffer_mem >> PAGE_SHIFT) * 100 > num_physpages * buffermem.borrow_percent

> >shrink_mmap() can only shrink unlocked and clean buffer pages
> >and unmapped cache pages. We need to go through either bdflush
> ...unmapped cache pages. Good.

Not good, it means that kswapd needs to unmap the pages
first, using the try_to_swap_out() function. [which really
needs to be renamed to try_to_unmap()]

> >(for buffer) or try_to_swap_out() first, in order to make some
> So try_to_swap_out() should unmap the cache pages? And then I have to
> call shrink_mmap() again?

Shrink_mmap() frees the pages that are already unmapped
by try_to_swap_out(). This means that the pages need to
be handled by both functions (which is good, because it
gives us a second 'timeout' for page aging).

> Rik, reading vmscan.c I noticed that you are the one who worked on
> kswapd (for example, removing the hard page limits and checking
> free_memory_available(nr) instead). Could you tell me what you
> changed (or in which kernel patch I can find the kswapd patches)
> that makes kswapd swap so much?

Most of the patches are on my homepage; you can get
and read them there...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-05 17:00   ` Rik van Riel
  1998-07-05 18:38     ` Andrea Arcangeli
@ 1998-07-05 18:57     ` MOLNAR Ingo
  1998-07-06 10:24     ` Stephen C. Tweedie
  2 siblings, 0 replies; 40+ messages in thread
From: MOLNAR Ingo @ 1998-07-05 18:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, Linux MM, Linux Kernel


On Sun, 5 Jul 1998, Rik van Riel wrote:

> > I run hdparm -a0 /dev/hda and nothing change. Now the cache take 20Mbyte
> > of memory running cp file /dev/null while memtest 10000000 is running.
> 
> Hdparm only affects _hardware_ readahead and has nothing
> to do with software readahead.

nope, -a0 turns off software readahead. -A controls hardware readahead.
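For reference, the split Ingo describes looks like this on the command line (flags as documented in the hdparm manual page; /dev/hda, the first IDE disk, is just an example device):

```shell
hdparm -a0 /dev/hda   # -a: filesystem (software) read-ahead; 0 disables it
hdparm -A0 /dev/hda   # -A: the drive's own (hardware) read-lookahead feature
hdparm -a /dev/hda    # -a with no value reports the current setting
```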

-- mingo


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-05 17:00   ` Rik van Riel
@ 1998-07-05 18:38     ` Andrea Arcangeli
  1998-07-05 19:31       ` Rik van Riel
  1998-07-06 10:31       ` Stephen C. Tweedie
  1998-07-05 18:57     ` MOLNAR Ingo
  1998-07-06 10:24     ` Stephen C. Tweedie
  2 siblings, 2 replies; 40+ messages in thread
From: Andrea Arcangeli @ 1998-07-05 18:38 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linux MM, Linux Kernel

On Sun, 5 Jul 1998, Rik van Riel wrote:

>Hdparm only affects _hardware_ readahead and has nothing
>to do with software readahead.

Whoops.

>The cache is also mapped into a process'es address space.
>Currently we would have to walk all pagetables to find a
>specific page ;(
>When Stephen and Ben have merged their PTE stuff, we can
>do the freeing much easier though...

I am starting to think that the problem is kswapd. Running cp file
/dev/null, the system stays fluid (when I press a key I see the
character on the _console_) as long as there is free (wasted, because
unused) memory. While there is free memory, swap usage is 0. When the
free memory runs out, the system dies, and when I press a key I don't
see the character on the screen immediately. I think it's kswapd that
is irritating me. So now I am trying to beat kswapd into shape (I am
starting to hate it, since I really hate swap ;-). kswapd must swap
_nothing_ while _freeable_ cache memory is allocated.  kswapd _must_
treat freeable cache memory as _free_, not used, memory and so it must
not start swapping out useful code and data to make space for
allocating more cache.  With 2.0.34, when the cache ate all the free
memory, nothing got swapped out and everything performed better.

>> >Both can be avoided by using (not yet implemented)
>> >balancing code. It is on the priority list of the MM
>> I had to ask "2.0.34 has balancing code implemented and running?". The
>
>2.0 has no balancing code at all. At least, not AFAIK...

So 2.1.108 should be able to perform as well as 2.0.34.

>> current mm layer is not able to shrink the cache memory and I consider it
>> a bug that must be fixed without adding other code. 
>
>How do you propose we solve a bug without programming :)

;-). I meant "without adding new features or replacing most of the
code"...

>> Is there a function call (such as shrink_mmap for mmap or
>> kmem_cache_reap() for slab or shrink_dcache_memory() for dcache) that is
>> able to shrink the cache allocated by cp file /dev/zero?
>
>shrink_mmap() can only shrink unlocked and clean buffer pages
>and unmapped cache pages. We need to go through either bdflush

...unmapped cache pages. Good.

>(for buffer) or try_to_swap_out() first, in order to make some

So try_to_swap_out() should unmap the cache pages? And then I have to
call shrink_mmap() again?

>easy victims for shrink_mmap()...

Rik, reading vmscan.c I noticed that you are the one who worked on
kswapd (for example, removing the hard page limits and checking
free_memory_available(nr) instead). Could you tell me what you changed
(or in which kernel patch I can find the kswapd patches) that makes
kswapd swap so much?

Andrea[s] Arcangeli


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
  1998-07-05 11:32 ` Andrea Arcangeli
@ 1998-07-05 17:00   ` Rik van Riel
  1998-07-05 18:38     ` Andrea Arcangeli
                       ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Rik van Riel @ 1998-07-05 17:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux MM, Linux Kernel

On Sun, 5 Jul 1998, Andrea Arcangeli wrote:
> On Sun, 5 Jul 1998, Rik van Riel wrote:
> 
> >We can achieve this by switching off readahead when we
> >reach the maximum RSS of the inode. Then we should probably
> 
> I run hdparm -a0 /dev/hda and nothing change. Now the cache take 20Mbyte
> of memory running cp file /dev/null while memtest 10000000 is running.

Hdparm only affects _hardware_ readahead and has nothing
to do with software readahead.

> >instruct kswapd in some way to remove pages from that inode,
> >but I'm not completely sure how to do that...
> 
> Where is the cache allocated? Is it allocated in the inode? If so
> kswapd should shrink the inode before it starts swapping out!

The cache is also mapped into a process's address space.
Currently we would have to walk all the pagetables to find a
specific page ;(
When Stephen and Ben have merged their PTE stuff, we will be
able to do the freeing much more easily though...

> >For the buffer cache, we might be able to use the same
> >kind of algorithm, but I'm not completely sure of that.
> 
> The buffer memory seems to be reduced better than the cache memory though.

This is partly because buffer memory is not mapped in any
pagetable, and partly because buffer memory generally isn't
worth keeping around. Because of that we can, and do, just
throw it out at the next opportunity.

> >Both can be avoided by using (not yet implemented)
> >balancing code. It is on the priority list of the MM
> I had to ask "2.0.34 has balancing code implemented and running?". The

2.0 has no balancing code at all. At least, not AFAIK...

> current mm layer is not able to shrink the cache memory and I consider it
> a bug that must be fixed without adding other code. 

How do you propose we solve a bug without programming :)

> Is there a function call (such as shrink_mmap for mmap or
> kmem_cache_reap() for slab or shrink_dcache_memory() for dcache) that is
> able to shrink the cache allocated by cp file /dev/zero?

shrink_mmap() can only shrink unlocked and clean buffer pages
and unmapped cache pages. We need to go through either bdflush
(for buffer) or try_to_swap_out() first, in order to make some
easy victims for shrink_mmap()...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: cp file /dev/zero <-> cache [was Re: increasing page size]
       [not found] <Pine.LNX.3.96.980705072829.17879D-100000@mirkwood.dummy.home>
@ 1998-07-05 11:32 ` Andrea Arcangeli
  1998-07-05 17:00   ` Rik van Riel
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 1998-07-05 11:32 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, Linux Kernel

On Sun, 5 Jul 1998, Rik van Riel wrote:

>The current allocator is often unable to keep fragmentation
>from happening when too many allocations are done. When we

So I won't worry about fragmentation or about the zone allocator.

>I have a better idea. The RSS for an inode shouldn't be
>allowed to grow larger than 50% of the size of the page
>cache when:
>- we are tight on memory; and
>- the page cache takes more than 25% of memory
>
>We can achieve this by switching off readahead when we
>reach the maximum RSS of the inode. Then we should probably

I ran hdparm -a0 /dev/hda and nothing changed. Now the cache takes 20
Mbyte of memory running cp file /dev/null while memtest 10000000 is
running.

>instruct kswapd in some way to remove pages from that inode,
>but I'm not completely sure how to do that...

Where is the cache allocated? Is it allocated in the inode? If so,
kswapd should shrink the inode before it starts swapping out!

>For the buffer cache, we might be able to use the same
>kind of algorithm, but I'm not completely sure of that.

The buffer memory seems to shrink better than the cache memory, though.

>> I would ask to people to really run the kernel with mem=30Mbyte and then
>> run a `cp /dev/zero file' and then a `cp file /dev/null' to really see
>> what happens.
>
>In the first case, the buffer cache will grow without
>bounds and without it being needed. In the second case
>the page cache will grow a bit too much.

10 Mbyte on 2.1.108 against 1 Mbyte on 2.0.34 is not just "a bit" ;-).

>Both can be avoided by using (not yet implemented)
>balancing code. It is on the priority list of the MM

I have to ask: "does 2.0.34 have balancing code implemented and
running?". The current mm layer is not able to shrink the cache memory,
and I consider that a bug which must be fixed without adding other code.

Is there a function call (such as shrink_mmap() for mmap,
kmem_cache_reap() for the slab, or shrink_dcache_memory() for the
dcache) that is able to shrink the cache allocated by cp file
/dev/zero? I could also try applying the memleak detector to my kernel
to see where the cache is really allocated...

>team, so we will be working on it some day. There

Good!

>are some stability issues to be solved first, however.

I wasn't aware of these stability problems...

>Try the MM team first: linux-mm@kvack.org.
>Or read our TODO list: http://www.phys.uu.nl/~riel/mm-patch/todo.html

OK.

Andrea[s] Arcangeli


end of thread, other threads:[~1998-07-20 16:04 UTC | newest]

Thread overview: 40+ messages
     [not found] <199807091442.PAA01020@dax.dcs.ed.ac.uk>
1998-07-09 18:59 ` cp file /dev/zero <-> cache [was Re: increasing page size] Rik van Riel
1998-07-09 23:37   ` Stephen C. Tweedie
1998-07-10  5:57     ` Rik van Riel
1998-07-11 14:14 ` Rik van Riel
1998-07-11 21:23   ` Stephen C. Tweedie
1998-07-11 22:25     ` Rik van Riel
1998-07-13 13:23       ` Stephen C. Tweedie
1998-07-12  1:47     ` Benjamin C.R. LaHaise
1998-07-13 13:42       ` Stephen C. Tweedie
1998-07-18 22:10         ` Rik van Riel
1998-07-20 16:04           ` Stephen C. Tweedie
1998-07-09 13:01 Zachary Amsden
     [not found] <Pine.LNX.3.96.980705072829.17879D-100000@mirkwood.dummy.home>
1998-07-05 11:32 ` Andrea Arcangeli
1998-07-05 17:00   ` Rik van Riel
1998-07-05 18:38     ` Andrea Arcangeli
1998-07-05 19:31       ` Rik van Riel
1998-07-06 10:38         ` Stephen C. Tweedie
1998-07-06 11:42           ` Rik van Riel
1998-07-06 14:20         ` Andrea Arcangeli
1998-07-06 10:31       ` Stephen C. Tweedie
1998-07-06 12:34         ` Andrea Arcangeli
1998-07-06 14:36           ` Stephen C. Tweedie
1998-07-06 19:28             ` Andrea Arcangeli
1998-07-07 12:01               ` Stephen C. Tweedie
1998-07-07 15:54                 ` Rik van Riel
1998-07-07 17:32                   ` Benjamin C.R. LaHaise
1998-07-08 13:54                     ` Stephen C. Tweedie
1998-07-08 21:19                       ` Andrea Arcangeli
1998-07-11 11:18                         ` Rik van Riel
1998-07-11 21:11                           ` Stephen C. Tweedie
1998-07-08 13:45                   ` Stephen C. Tweedie
1998-07-08 18:57                     ` Rik van Riel
1998-07-08 22:11                       ` Stephen C. Tweedie
1998-07-09  7:43                         ` Rik van Riel
1998-07-09 20:39                         ` Rik van Riel
1998-07-13 11:54                           ` Stephen C. Tweedie
1998-07-05 18:57     ` MOLNAR Ingo
1998-07-06 10:24     ` Stephen C. Tweedie
1998-07-06 13:37       ` Eric W. Biederman
1998-07-07 12:35         ` Stephen C. Tweedie
