Re: Linux-2.1.129..

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: Linux-2.1.129..
       [not found] <Pine.LNX.3.95.981119002335.838A-100000@penguin.transmeta.com>
@ 1998-11-19 21:34 ` Dr. Werner Fink
  1998-11-19 21:58   ` Linux-2.1.129 Rik van Riel
  1998-11-19 22:33   ` Linux-2.1.129 Linus Torvalds
  0 siblings, 2 replies; 29+ messages in thread
From: Dr. Werner Fink @ 1998-11-19 21:34 UTC (permalink / raw)
  To: Linus Torvalds, Kernel Mailing List; +Cc: linux-mm

> 
> Have fun with it, and tell me if it breaks. But it won't. I'm finally
> getting the old "greased weasel" feeling back. In short, this is the much
> awaited perfect and bug-free release, and the only reason I don't call it
> 2.2 is that I'm chicken.
> 
> 	Kvaa, kvaa,
> 			Linus

Yes on a 512MB system it's a great win ... on a 64 system I see
something like a ``swapping weasel'' under high load.

It seems that page ageing or something *similar* would be nice
for a factor 512/64 >= 2  ... under high load and not enough
memory it's maybe better if we could get the processes in turn
into work instead of useless swapping (this was a side effect
of page ageing due to the implicit slow down).

          Werner

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-19 21:34 ` Linux-2.1.129 Dr. Werner Fink
@ 1998-11-19 21:58   ` Rik van Riel
  1998-11-20 12:09     ` Linux-2.1.129 Dr. Werner Fink
  1998-11-19 22:33   ` Linux-2.1.129 Linus Torvalds
  1 sibling, 1 reply; 29+ messages in thread
From: Rik van Riel @ 1998-11-19 21:58 UTC (permalink / raw)
  To: Dr. Werner Fink; +Cc: Linus Torvalds, Kernel Mailing List, linux-mm

On Thu, 19 Nov 1998, Dr. Werner Fink wrote:

> Yes on a 512MB system it's a great win ... on a 64 system I see
> something like a ``swapping weasel'' under high load.
> 
> It seems that page ageing or something *similar* would be nice
> for a factor 512/64 >= 2  ... under high load and not enough
> memory it's maybe better if we could get the processes in turn
> into work instead of useless swapping (this was a side effect
> of page ageing due to the implicit slow down).

It was certainly a huge win when page aging was implemented,
but we mainly felt that because there used to be an obscure
bug in vmscan.c, causing the kernel to always start scanning
at the start of the process' address space.

Now that bug is fixed, it might just be better to switch
to a multi-queue system. A full implementation of that
will have to wait until 2.3, but we can easily do an
el-cheapo simulation of it by simply not freeing swap
cached pages on the first pass of shrink_mmap().

This gives the process a chance of reclaiming the page
without incurring any I/O and it gives the kernel the
possibility of keeping a lot of easily-freeable pages
around.

Maybe we even want to keep a 3:1 ratio or something
like that for mapped:swap_cached pages and a semi-
FIFO reclamation of swap cached pages so we can
simulate a bit of (very cheap) page aging.

Digital Unix does things this way and it works pretty
well (they keep a 1:2 ratio though, but the overhead
in maintaining that seems a bit too high).

cheers,

Rik -- slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-19 21:58   ` Linux-2.1.129 Rik van Riel
@ 1998-11-20 12:09     ` Dr. Werner Fink
  0 siblings, 0 replies; 29+ messages in thread
From: Dr. Werner Fink @ 1998-11-20 12:09 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, Kernel Mailing List, linux-mm

On Thu, Nov 19, 1998 at 10:58:30PM +0100, Rik van Riel wrote:
> On Thu, 19 Nov 1998, Dr. Werner Fink wrote:
> 
> > Yes on a 512MB system it's a great win ... on a 64 system I see
> > something like a ``swapping weasel'' under high load.
> > 
> > It seems that page ageing or something *similar* would be nice
> > for a factor 512/64 >= 2  ... under high load and not enough
> > memory it's maybe better if we could get the processes in turn
> > into work instead of useless swapping (this was a side effect
> > of page ageing due to the implicit slow down).
> 
> It was certainly a huge win when page aging was implemented,
> but we mainly felt that because there used to be an obscure
> bug in vmscan.c, causing the kernel to always start scanning
> at the start of the process' address space.
> 
> Now that bug is fixed, it might just be better to switch
> to a multi-queue system. A full implementation of that
> will have to wait until 2.3, but we can easily do an
> el-cheapo simulation of it by simply not freeing swap
> cached pages on the first pass of shrink_mmap().

Hmmm ... we need something real for 2.2 ... so,
let's analyse the problem

     If the average time slice of the processes is eaten up by
     swapping page back *and* if these pages are spapped out
     during to the next time slice the system becomes unusable
     (freeing swap cached pages on the first pass of shrink_mmap()
      does force this behaviour at high stress).

Therefore we need something like a page ageing which does not
mean that the old scheme is required.

     Pages which are swapped in need a higher life time in
     physical memory.  If a page can be shared this life
     time could be a bigger one.
     If a process counts such pages up to a limit his pages
     should not get a higher life for the next few cycles.

This simple scheme should be implementable in a easy way,
shouldn't it?  The appropiate places are

      ipc/shm.c::shm_swap_in()
      mm/page_alloc.c::swap_in()

and the old place of the old age_page():

      mm/vmscan.c::try_to_swap_out()

together with some unused variables out of

      include/linux/sched.h::task_struct (e.g. dec_flt)
      include/linux/sched.h::struct page (e.g. unused :-)

nothing more is needed due to the better swap cache of
2.1.129 in comparision to 2.0.36.


          Werner

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-19 21:34 ` Linux-2.1.129 Dr. Werner Fink
  1998-11-19 21:58   ` Linux-2.1.129 Rik van Riel
@ 1998-11-19 22:33   ` Linus Torvalds
  1998-11-23 17:13     ` Linux-2.1.129 Stephen C. Tweedie
  1 sibling, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 1998-11-19 22:33 UTC (permalink / raw)
  To: Dr. Werner Fink; +Cc: Kernel Mailing List, linux-mm



On Thu, 19 Nov 1998, Dr. Werner Fink wrote:
> 
> Yes on a 512MB system it's a great win ... on a 64 system I see
> something like a ``swapping weasel'' under high load.

The reason the page aging was removed was that I had people who did
studies and told me that the page aging hurts on low-memory machines.

On something like the machine I have, page aging makes absolutely
no difference whatsoever, either positive or negative.

Stephen, you're the one who did the studies. Comments?

		Linus

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-19 22:33   ` Linux-2.1.129 Linus Torvalds
@ 1998-11-23 17:13     ` Stephen C. Tweedie
  1998-11-23 19:16       ` Linux-2.1.129 Eric W. Biederman
                         ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Stephen C. Tweedie @ 1998-11-23 17:13 UTC (permalink / raw)
  To: Linus Torvalds, Rik van Riel
  Cc: Dr. Werner Fink, Kernel Mailing List, linux-mm, Stephen Tweedie

Hi,

On Thu, 19 Nov 1998 14:33:59 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> On Thu, 19 Nov 1998, Dr. Werner Fink wrote:
>> 
>> Yes on a 512MB system it's a great win ... on a 64 system I see
>> something like a ``swapping weasel'' under high load.

> The reason the page aging was removed was that I had people who did
> studies and told me that the page aging hurts on low-memory machines.

> On something like the machine I have, page aging makes absolutely
> no difference whatsoever, either positive or negative.

> Stephen, you're the one who did the studies. Comments?

Hmm.  The vast majority of the old studies I did with page aging assumed
that the kswap side of things still did aging.  I was primarily
disabling the page cache aging in those experiments, and I had a number
of other people testing it too.

With the 2.1.129 prepatches, Linus removed the page aging from the swap
logic.  That makes it much easier to find free pages in swap.  Given
that the page cache still used aging, the try_to_free_pages() loop was
essentially being instructed to concentrate all of its effort on swap.
This looked "obviously" wrong, since in all the previous experiments, I
had disabled the cache aging but kept swap aging, and that improved
things.  Swinging the balance the other way would obviously cause us to
swap too much.

So, rather than back out Linus's removal of the swap aging, I just
removed the page cache aging to compensate and revalidated my original
tests, especially on a low memory machine.  That still showed that
performance with no page cache aging on 2.1.129-prewhatever was better
than performance with page cache aging.

So, I have still seen no cases where overall performance with no page
cache aging was better than performance with it.  However, with the swap
aging removed as well, we seem to have a page/swap balance which doesn't
work well on 64MB.  To be honest, I just haven't spent much time playing
with swap page aging since the early kswap work, and that was all done
before the page cache was added.

On Thu, 19 Nov 1998 22:58:30 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> It was certainly a huge win when page aging was implemented, but we
> mainly felt that because there used to be an obscure bug in vmscan.c,
> causing the kernel to always start scanning at the start of the
> process' address space.

Rik, you keep asserting this but I have never understood it.  I have
asked you several times for a precise description of what benchmarks
improved when page cache aging was added, but I've only ever seen
performance degradation with it in.  The only test you've given me where
page cache aging helped was a case of a readahead bug which had an
obvious fix elsewhere.

And the "obscure bug" you describe was never there: I've said to you
more than once that you were misreading the source, and that the field
you pointed to which was being reset to zero at the start of the swapout
loop was *guaranteed* to be overwritten with the last address scanned
before we exited that loop.  Look at 2.0's mm/vmscan.c: in
swap_out_pmd(), there is a line

		tsk->swap_address = address + PAGE_SIZE;

which is executed unconditionally as soon as we start the pmd scan.  It
is simply impossible for the "p->swap_address = 0" assignment you were
worried about to have any effect at all unless we never get as far as
swap_out_pmd(), and that can only happen if we never find a vma to swap
out.  So, the end result is that p->swap_address only gets left at zero
if we have nothing left beyond the current swap address to swap.  This
was correct in the first place.

> Now that bug is fixed, it might just be better to switch to a
> multi-queue system. A full implementation of that will have to wait
> until 2.3, but we can easily do an el-cheapo simulation of it by
> simply not freeing swap cached pages on the first pass of
> shrink_mmap().

Right now, that will achieve precisely nothing, since the
free_page_and_swap_cache() call in try_to_swap_out() already deletes
swap cache after we start swap IO.  (That's precisely why we check the
page_free_after bit on the page in is_page_shared(), so that we still
remove the swap cache when doing async swapout.)  Except for the recent
changes in the behaviour of shared COW pages, shrink_mmap() should never
ever see a swap cache page.

> This gives the process a chance of reclaiming the page without
> incurring any I/O and it gives the kernel the possibility of keeping a
> lot of easily-freeable pages around.

That would be true if we didn't do the free_page_and_swap_cache trick.
However, doing that would require two passes: once by the swapper, and
once by shrink_mmap(): before actually freeing a page.  This actually
sounds like a *very* good idea to explore, since it means that vmscan.c
will be concerned exclusively with returning mapped and anonymous pages
to the page cache.  As a result, all of the actual freeing of pages will
be done in shrink_mmap(), which is the closest we can get to a true
self-balancing system for freeing memory.

I'm going to check this out: I'll post preliminary benchmarks and a
patch for other people to test tomorrow.  Getting the balancing right
will then just be a matter of making sure that try_to_swap_out gets
called often enough under normal running conditions.  I'm open to
suggestions about that: we've never tried that sort of behaviour in the
vm to my knowledge.

> Maybe we even want to keep a 3:1 ratio or something like that for
> mapped:swap_cached pages and a semi- FIFO reclamation of swap cached
> pages so we can simulate a bit of (very cheap) page aging.

I will just restate my profound conviction that any VM balancing which
works by imposing precalculated limits on resources is fundamentally
wrong.

Cheers,
  Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 17:13     ` Linux-2.1.129 Stephen C. Tweedie
@ 1998-11-23 19:16       ` Eric W. Biederman
  1998-11-23 20:02         ` Linux-2.1.129 Linus Torvalds
  1998-11-23 19:46       ` Linux-2.1.129 Eric W. Biederman
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Eric W. Biederman @ 1998-11-23 19:16 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Rik van Riel, Dr. Werner Fink,
	Kernel Mailing List, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> That would be true if we didn't do the free_page_and_swap_cache trick.
ST> However, doing that would require two passes: once by the swapper, and
ST> once by shrink_mmap(): before actually freeing a page.  This actually
ST> sounds like a *very* good idea to explore, since it means that vmscan.c
ST> will be concerned exclusively with returning mapped and anonymous pages
ST> to the page cache.  As a result, all of the actual freeing of pages will
ST> be done in shrink_mmap(), which is the closest we can get to a true
ST> self-balancing system for freeing memory.

There are a few other reasons this would be useful as well.
1) It resembles a 2 handed clock algorithm.  So there would
  be some real page aging functionality.  And we could reclaim pages
  that we are currently writing to disk.

2) We could remove the swap lock map.

I have wanted to suggest this for a while but I haven't had the time
to carry it through. :(

ST> I'm going to check this out: I'll post preliminary benchmarks and a
ST> patch for other people to test tomorrow.  Getting the balancing right
ST> will then just be a matter of making sure that try_to_swap_out gets
ST> called often enough under normal running conditions.  I'm open to
ST> suggestions about that: we've never tried that sort of behaviour in the
ST> vm to my knowledge.

We might want to look at the balance between the buffer cache writing and
shrink_mmap.  Because that is how those two systems interact already.

Eric

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 19:16       ` Linux-2.1.129 Eric W. Biederman
@ 1998-11-23 20:02         ` Linus Torvalds
  1998-11-23 21:25           ` Linux-2.1.129 Rik van Riel
                             ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Linus Torvalds @ 1998-11-23 20:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Rik van Riel, Dr. Werner Fink,
	Kernel Mailing List, linux-mm



On 23 Nov 1998, Eric W. Biederman wrote:
> 
> ST> That would be true if we didn't do the free_page_and_swap_cache trick.
> ST> However, doing that would require two passes: once by the swapper, and
> ST> once by shrink_mmap(): before actually freeing a page. 

This is something I considered doing. It has various advantages, and it's
almost done already in a sense: the swap cache thing is what would act as
the buffer between the two passes. 

Then the page table scanning would never really page anything out: it
would just move things into the swap cache. That makes the table scanner
simpler, actually. The real page-out would be when the swap-cache is
flushed to disk and then freed.

I'd like to see this, although I think it's way too late for 2.2

		Linus

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 20:02         ` Linux-2.1.129 Linus Torvalds
@ 1998-11-23 21:25           ` Rik van Riel
  1998-11-23 22:19           ` Linux-2.1.129 Dr. Werner Fink
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Rik van Riel @ 1998-11-23 21:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Stephen C. Tweedie, Dr. Werner Fink,
	Kernel Mailing List, linux-mm

On Mon, 23 Nov 1998, Linus Torvalds wrote:
> On 23 Nov 1998, Eric W. Biederman wrote:
> > 
> > ST> That would be true if we didn't do the free_page_and_swap_cache trick.
> > ST> However, doing that would require two passes: once by the swapper, and
> > ST> once by shrink_mmap(): before actually freeing a page. 
> 
> This is something I considered doing. It has various advantages, and it's
> almost done already in a sense: the swap cache thing is what would act as
> the buffer between the two passes. 
> 
> Then the page table scanning would never really page anything out: it
> would just move things into the swap cache. That makes the table scanner
> simpler, actually. The real page-out would be when the swap-cache is
> flushed to disk and then freed.

For the buffer to properly act as an easy-freeable buffer
we will want to do the I/O based on the page table scanning
cycle, possibly with the addition of a special dirty list
we use for better I/O clustering.

> I'd like to see this, although I think it's way too late for 2.2

It is a bit late for implementing it wholesale, but this
system certainly looks like something we can implement
piece by piece, completing (well, sort of) the new VM
system at about 2.1.10...

Only having the dual pass freeing and some very basic
balancing can be implemented now, the more advanced
balancing (to gain more performance), a better dirty list,
swap block layout and other stuff can be implemented
gradually. It is a pretty modular system, from a coder's
point of view.

I am willing to maintain some sort of patch series
bringing the efforts from multiple people together
if you decide you really don't want it in 2.2 now.

regards,

Rik -- slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 20:02         ` Linux-2.1.129 Linus Torvalds
  1998-11-23 21:25           ` Linux-2.1.129 Rik van Riel
@ 1998-11-23 22:19           ` Dr. Werner Fink
  1998-11-24  3:37           ` Linux-2.1.129 Eric W. Biederman
  1998-11-24 15:25           ` Linux-2.1.129 Stephen C. Tweedie
  3 siblings, 0 replies; 29+ messages in thread
From: Dr. Werner Fink @ 1998-11-23 22:19 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: Stephen C. Tweedie, Rik van Riel, Kernel Mailing List, linux-mm


> > ST> That would be true if we didn't do the free_page_and_swap_cache trick.
> > ST> However, doing that would require two passes: once by the swapper, and
> > ST> once by shrink_mmap(): before actually freeing a page. 
> 
> This is something I considered doing. It has various advantages, and it's
> almost done already in a sense: the swap cache thing is what would act as
> the buffer between the two passes. 
> 
> Then the page table scanning would never really page anything out: it
> would just move things into the swap cache. That makes the table scanner
> simpler, actually. The real page-out would be when the swap-cache is
> flushed to disk and then freed.

Furthermore this would give a good getting in for a effective ageing
scheme for often needed pages. Pages frequently going in and out of the
swap cache are the best candidates to get an higher page age.

> 
> I'd like to see this, although I think it's way too late for 2.2
> 
> 		Linus

Better doing it know than within 2.2 ;^)

           Werner

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 20:02         ` Linux-2.1.129 Linus Torvalds
  1998-11-23 21:25           ` Linux-2.1.129 Rik van Riel
  1998-11-23 22:19           ` Linux-2.1.129 Dr. Werner Fink
@ 1998-11-24  3:37           ` Eric W. Biederman
  1998-11-24 15:25           ` Linux-2.1.129 Stephen C. Tweedie
  3 siblings, 0 replies; 29+ messages in thread
From: Eric W. Biederman @ 1998-11-24  3:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Stephen C. Tweedie, Rik van Riel,
	Dr. Werner Fink, Kernel Mailing List, linux-mm

>>>>> "LT" == Linus Torvalds <torvalds@transmeta.com> writes:

LT> On 23 Nov 1998, Eric W. Biederman wrote:
>> 
ST> That would be true if we didn't do the free_page_and_swap_cache trick.
ST> However, doing that would require two passes: once by the swapper, and
ST> once by shrink_mmap(): before actually freeing a page. 

LT> This is something I considered doing. It has various advantages, and it's
LT> almost done already in a sense: the swap cache thing is what would act as
LT> the buffer between the two passes. 

LT> Then the page table scanning would never really page anything out: it
LT> would just move things into the swap cache. That makes the table scanner
LT> simpler, actually. The real page-out would be when the swap-cache is
LT> flushed to disk and then freed.

LT> I'd like to see this, although I think it's way too late for 2.2

Agreed.

But something quite similiar is still possible.

Not removing pages from the swap cache while they are in flight.
Letting shrink-mmap remove all of the clean pages from memory.

This can be implemented by simply removing code.

And it provides a weak kind of aging, so heavily used pages will not
be removed from memory, just minor faults will occur.

For 2.2 we can either experiment with minor variations on no swap
aging, taking a little longer.  Or we can put it swap aging back in,
and run with a system people have confidence in already.

And now for my 2 cents.

For a policy more akin to what we have with the buffer cache I have
been working on generic dirty page handling for the whole page cache,
that I intend to send to submit for early 2.3.   

Eric

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 20:02         ` Linux-2.1.129 Linus Torvalds
                             ` (2 preceding siblings ...)
  1998-11-24  3:37           ` Linux-2.1.129 Eric W. Biederman
@ 1998-11-24 15:25           ` Stephen C. Tweedie
  1998-11-24 17:33             ` Linux-2.1.129 Linus Torvalds
  1998-11-25 20:33             ` Linux-2.1.129 Zlatko Calusic
  3 siblings, 2 replies; 29+ messages in thread
From: Stephen C. Tweedie @ 1998-11-24 15:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Stephen C. Tweedie, Rik van Riel,
	Dr. Werner Fink, Kernel Mailing List, linux-mm

Hi,

On Mon, 23 Nov 1998 12:02:41 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> On 23 Nov 1998, Eric W. Biederman wrote:
>> 
ST> That would be true if we didn't do the free_page_and_swap_cache trick.
ST> However, doing that would require two passes: once by the swapper, and
ST> once by shrink_mmap(): before actually freeing a page. 

> This is something I considered doing. It has various advantages, and it's
> almost done already in a sense: the swap cache thing is what would act as
> the buffer between the two passes. 

Yes.  

> Then the page table scanning would never really page anything out: it
> would just move things into the swap cache. That makes the table scanner
> simpler, actually. The real page-out would be when the swap-cache is
> flushed to disk and then freed.

Indeed.  However, I think it misses the real advantage, which is that
the mechanism would be inherently self-tuning (much more so than the
existing code).   The swapper would batch up pageouts from the page
tables, leaving a number of recyclable pages in the swap cache, and
those cached pages would be subject to fair removal from the cache:
we would not start to ignore cache completely once we start swapping
(which is important if we don't age the swap pages: the lack of aging
makes it far to easy to keep finding free swap pages so we never go
back to shrink_mmap() mode).

> I'd like to see this, although I think it's way too late for 2.2

The mechanism is all there, and we're just tuning policy.  Frankly,
the changes we've seen in vm policy since 2.1.125 are pretty major
already, and I think it's important to get it right before 2.2.0.

The patch below is a very simple implementation of this concept.
I have been running it on 2.1.130-pre2 on 8MB and on 64MB.  On 8, it
gives the expected performance, roughly similar to previous
incarnations of the page-cache-ageless kernels.  

On 64MB, however, it feels rather different: subjectively I think it
feels like the fastest kernel I've ever run on this box.  It happily
swaps out unused code while refusing to touch used ptes, and seems to
balance cache much better than before.  With a very large emacs
(a couple of thousand-message mailboxes loaded in VM), netscape and xv
running, switching between desktops is still zero-wait, and compiles
still go fast.

Unfortunately, 2.1.129 keeps hanging on me, so the testing on 64MB was
cut short after a couple of hours (I think it's either audio CDs or
Ingo's latest alpha-raid which causes the trouble).  No problems on
the 8MB box though.

Linus, is it really too late to look at adding this?

--Stephen

----------------------------------------------------------------
--- mm/vmscan.c~	Tue Nov 17 15:43:55 1998
+++ mm/vmscan.c	Mon Nov 23 17:05:33 1998
@@ -170,7 +170,7 @@
 			 * copy in memory, so we add it to the swap
 			 * cache. */
 			if (PageSwapCache(page_map)) {
-				free_page_and_swap_cache(page);
+				free_page(page);
 				return (atomic_read(&page_map->count) == 0);
 			}
 			add_to_swap_cache(page_map, entry);
@@ -188,7 +188,7 @@
 		 * asynchronously.  That's no problem, shrink_mmap() can
 		 * correctly clean up the occassional unshared page
 		 * which gets left behind in the swap cache. */
-		free_page_and_swap_cache(page);
+		free_page(page);
 		return 1;	/* we slept: the process may not exist any more */
 	}

@@ -202,7 +202,7 @@
 		set_pte(page_table, __pte(entry));
 		flush_tlb_page(vma, address);
 		swap_duplicate(entry);
-		free_page_and_swap_cache(page);
+		free_page(page);
 		return (atomic_read(&page_map->count) == 0);
 	} 
 	/* 
@@ -218,7 +218,11 @@
 	flush_cache_page(vma, address);
 	pte_clear(page_table);
 	flush_tlb_page(vma, address);
+#if 0
 	entry = page_unuse(page_map);
+#else
+	entry = (atomic_read(&page_map->count) == 1);
+#endif
 	__free_page(page_map);
 	return entry;
 }
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-24 15:25           ` Linux-2.1.129 Stephen C. Tweedie
@ 1998-11-24 17:33             ` Linus Torvalds
  1998-11-24 19:59               ` Linux-2.1.129 Rik van Riel
  1998-11-25 14:19               ` Linux-2.1.129 Stephen C. Tweedie
  1998-11-25 20:33             ` Linux-2.1.129 Zlatko Calusic
  1 sibling, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 1998-11-24 17:33 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Eric W. Biederman, Rik van Riel, Dr. Werner Fink,
	Kernel Mailing List, linux-mm

On Tue, 24 Nov 1998, Stephen C. Tweedie wrote:
> 
> Indeed.  However, I think it misses the real advantage, which is that
> the mechanism would be inherently self-tuning (much more so than the
> existing code).

Yes, that's one of the reasons I like it.

The other reason I like it is that right now it is extremely hard to share
swapped out pages unless you share them due to a fork(). The problem is
that the swap cache supports the notion of sharing, but out swap-out
routines do not - they swap things out on a per-virtual-page basis, and
that results in various nasty things - we page out the same page to
multiple places, and lose the sharing. 

> > I'd like to see this, although I think it's way too late for 2.2
> 
> The mechanism is all there, and we're just tuning policy.  Frankly,
> the changes we've seen in vm policy since 2.1.125 are pretty major
> already, and I think it's important to get it right before 2.2.0.

The VM policy changes weren't stability issues, they were only "timing". 
As such, if they broke something, it was really broken before too. 

And I agree that the mechanism is already there, however as it stands we
really populate the swap cache at page-in rather than page-out, and
changing that is fairly fundamental. It would be good, no question about
it, but it's still fairly fundamental. 

Note that if done right, this would also fix the damn stupid dirty page
write-back thing: right now if multiple processes share the same dirty
page and they all write to it, it will be written multiple times. But done
right, the dirty inode page write-out would be done the same way. 

> The patch below is a very simple implementation of this concept.

I will most probably apply the patch - it just looks fundamentally
correct. However, what I was thinking of was a bit more ambitious.

		Linus

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-24 17:33             ` Linux-2.1.129 Linus Torvalds
@ 1998-11-24 19:59               ` Rik van Riel
  1998-11-24 20:45                 ` Linux-2.1.129 Linus Torvalds
  1998-11-25 14:19               ` Linux-2.1.129 Stephen C. Tweedie
  1 sibling, 1 reply; 29+ messages in thread
From: Rik van Riel @ 1998-11-24 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Eric W. Biederman, Dr. Werner Fink,
	Kernel Mailing List, linux-mm

On Tue, 24 Nov 1998, Linus Torvalds wrote:
> On Tue, 24 Nov 1998, Stephen C. Tweedie wrote:

> > > I'd like to see this, although I think it's way too late for 2.2
> > 
> > The mechanism is all there, and we're just tuning policy.  Frankly,
> > the changes we've seen in vm policy since 2.1.125 are pretty major
> > already, and I think it's important to get it right before 2.2.0.
> 
> The VM policy changes weren't stability issues, they were only
> "timing". As such, if they broke something, it was really broken
> before too.

It was quite a bit more than just timing, it shoves the load
more to userland programs and decreases the priority of kswapd,
it removed page aging from swap_out() etc...

IMHO this is hardly any more fundamental than the change Stephen
just proposed.

> And I agree that the mechanism is already there, however as it
> stands we really populate the swap cache at page-in rather than
> page-out, and changing that is fairly fundamental. It would be good,
> no question about it, but it's still fairly fundamental.

But we can easily add some kind of new balancing code later,
wihout having any impact on stability.

2.2 is a _stable_ kernel, not a kernel with unchanging
performance... I think we can go with the new (more stable
because of a larger pool of clean pages around) VM scheme
without impacting stability whatsoever.

> > The patch below is a very simple implementation of this concept.
> 
> I will most probably apply the patch - it just looks fundamentally
> correct. However, what I was thinking of was a bit more ambitious.

>From the discussion we've been having yesterday, I get
the impression that the ambitious stuff can be added
little by little, during the lifetime of 2.2, without
impacting stability or hampering preformance.

It's not going to be as bad as during the 2.1.small days :)

cheers,

Rik -- slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-24 19:59               ` Linux-2.1.129 Rik van Riel
@ 1998-11-24 20:45                 ` Linus Torvalds
  0 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 1998-11-24 20:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Eric W. Biederman, Dr. Werner Fink,
	Kernel Mailing List, linux-mm

On Tue, 24 Nov 1998, Rik van Riel wrote:
> 
> From the discussion we've been having yesterday, I get
> the impression that the ambitious stuff can be added
> little by little, during the lifetime of 2.2, without
> impacting stability or hampering preformance.

I believe that may well be true. I _do_ believe that the MM code actually
has all the basic functionality there, and that the infrastructure is
stable and in place. That helps a lot.

But there may be some unforseen thing that makes it harder than expected
to add a page to the swap cache at write-out rather that page-in. The
patches may be trivial, in which case I will certainly apply them, because
I do believe that it's the RightThing(tm) to do, but if it turns out to be
nontrivial due to some unforseen circumstance.. 

		Linus

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-24 17:33             ` Linux-2.1.129 Linus Torvalds
  1998-11-24 19:59               ` Linux-2.1.129 Rik van Riel
@ 1998-11-25 14:19               ` Stephen C. Tweedie
  1998-11-25 21:07                 ` Linux-2.1.129 Eric W. Biederman
  1 sibling, 1 reply; 29+ messages in thread
From: Stephen C. Tweedie @ 1998-11-25 14:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Eric W. Biederman, Rik van Riel,
	Dr. Werner Fink, Kernel Mailing List, linux-mm

Hi,

On Tue, 24 Nov 1998 09:33:11 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> On Tue, 24 Nov 1998, Stephen C. Tweedie wrote:
>> 
>> Indeed.  However, I think it misses the real advantage, which is that
>> the mechanism would be inherently self-tuning (much more so than the
>> existing code).

> Yes, that's one of the reasons I like it.

> The other reason I like it is that right now it is extremely hard to share
> swapped out pages unless you share them due to a fork(). The problem is
> that the swap cache supports the notion of sharing, but out swap-out
> routines do not - they swap things out on a per-virtual-page basis, and
> that results in various nasty things - we page out the same page to
> multiple places, and lose the sharing. 

No, I fixed that in 2.1.89.  Shared anonymous pages _must_ be COW and
therefore readonly (this is why moving to MAP_SHARED anonymous regions
is so hard).  So, the first process which tries to swap such a shared
page will write it to disk and set up a swap cache entry.  Because the
page is necessarily readonly, we can safely assume it is OK to write it
at this point and not at the point of the last unmapping.

Subsequent processes which pageout the same page will find it in the
swap cache already and will just free the page.  I've tested this with a
program which sets up large anonymous region, forks, and then thrashes
the memory.  On prior kernels we lose the sharing, but on 2.1.89 and
later, that sharing is maintained perfectly even after fork and we never
grow the amount of swap which is used.

> The VM policy changes weren't stability issues, they were only "timing". 
> As such, if they broke something, it was really broken before too. 

Absolutely.

> And I agree that the mechanism is already there, however as it stands we
> really populate the swap cache at page-in rather than page-out, and
> changing that is fairly fundamental. It would be good, no question about
> it, but it's still fairly fundamental. 

We still have to populate the swap cache at page-in time.  The initial
reason for the early swap cache implementation was to prevent us from
having to re-write to disk pages which are still clean in memory.  For
that to work we need to cache the page-in.

However, for pages which become dirty in memory, we _do_ populate the
swap cache only at page-out time.  That's why the sharing still works.
I think that the real change we need is to cleanly support PG_dirty
flags per page.  Once we do that, not only do all of the dirty inode
pageouts get fixed, but we also automatically get MAP_SHARED |
MAP_ANONYMOUS.

While we're on that subject, Linus, do you still have Andrea's patch to
propogate page writes around all shared ptes?  I noticed that Zlatko
Calusic recently re-posted it, and it looks like the sort of short-term
fix we need for this issue in 2.2 (assuming we don't have time to do a
proper PG_dirty fix).

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-25 14:19               ` Linux-2.1.129 Stephen C. Tweedie
@ 1998-11-25 21:07                 ` Eric W. Biederman
  1998-11-26 12:57                   ` Linux-2.1.129 Stephen C. Tweedie
  0 siblings, 1 reply; 29+ messages in thread
From: Eric W. Biederman @ 1998-11-25 21:07 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> However, for pages which become dirty in memory, we _do_ populate the
ST> swap cache only at page-out time.  That's why the sharing still works.
ST> I think that the real change we need is to cleanly support PG_dirty
ST> flags per page.  Once we do that, not only do all of the dirty inode
ST> pageouts get fixed, but we also automatically get MAP_SHARED |
ST> MAP_ANONYMOUS.

ST> While we're on that subject, Linus, do you still have Andrea's patch to
ST> propogate page writes around all shared ptes?  I noticed that Zlatko
ST> Calusic recently re-posted it, and it looks like the sort of short-term
ST> fix we need for this issue in 2.2 (assuming we don't have time to do a
ST> proper PG_dirty fix).

What do you consider a proper PG_dirty fix?

I have been working on it (what I would call a PG_dirty fix) and have
most thing working but my code has a lot of policy questions still to
answer.

But as far as MAP_SHARED | MAP_ANONYMOUS to retain our current
swapping model (of never rewriting a swap page), and for swapoff
support we need the ability to change which swap page all of the pages
are associated with.

There are 2 ways to do this.  
1) Implement it like SYSV shared mem.
2) Just maintain vma structs for the memory, with vma_next_share used!
   Then when we allocate a new swap page we can walk the
   *vm_area_struct's to find the page_tables that need to be updated.

   The real tricky case to get right is simulaneous COW & SOW.
   SOW == Share On Write.

  The question right now is where do we anchor the vma_next_share
  linked list, as we don't have an inode.

Eric
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-25 21:07                 ` Linux-2.1.129 Eric W. Biederman
@ 1998-11-26 12:57                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 29+ messages in thread
From: Stephen C. Tweedie @ 1998-11-26 12:57 UTC (permalink / raw)
  To: Eric W. Biederman, Benjamin LaHaise; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On 25 Nov 1998 15:07:39 -0600, ebiederm+eric@ccr.net (Eric W. Biederman)
said:

> What do you consider a proper PG_dirty fix?

One which allows us to have dirty pages in the page cache without
worrying about who dirtied them.  In the first instance, we at least
need to allow the swap cache to hold dirty pages and allow
mmap(MAP_SHARED) to synchronise dirty page writeback between processes
(so that we don't get multiple accessors writing back the same page).  A
proper solution obviously requires a way to propogate a write to disk
back to all of the dirty bits in any ptes which reference the page,
which is next to impossible to do efficiently in the current VM (at
least for anonymous pages).

The right 2.2 fix is probably just to go with the existing patch which
propogates dirty bits at msync() time: that doesn't have to deal with
anonymous pages at all.

> But as far as MAP_SHARED | MAP_ANONYMOUS to retain our current
> swapping model (of never rewriting a swap page), and for swapoff
> support we need the ability to change which swap page all of the pages
> are associated with.

> There are 2 ways to do this.  
> 1) Implement it like SYSV shared mem.
> 2) Just maintain vma structs for the memory, with vma_next_share used!
>    Then when we allocate a new swap page we can walk the
>    *vm_area_struct's to find the page_tables that need to be updated.

Ben LaHaise and I discussed this extensively a while ago, and Ben has a
really nice solution to the problem of finding all ptes for a given
page.  I still think it's a 2.3 thing, but it should definitely be
possible.

>   The question right now is where do we anchor the vma_next_share
>   linked list, as we don't have an inode.

We have the swapper inode, but that alone is not good enough.

A vma for a file mapped MAP_PRIVATE needs to be on the inode vma list
for that file.  Any anonymous private pages created for that file need
to be kept in the swap cache, which has its own inode.  After fork, we
need to keep the COW pages shared (even over swap) and the clean pages
linked to the page cache.  As a result, we need to support one vma
holding pages both on the inode vma list _and_ the swap inode.  Ben's
solution deals very cleanly with this.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-24 15:25           ` Linux-2.1.129 Stephen C. Tweedie
  1998-11-24 17:33             ` Linux-2.1.129 Linus Torvalds
@ 1998-11-25 20:33             ` Zlatko Calusic
  1 sibling, 0 replies; 29+ messages in thread
From: Zlatko Calusic @ 1998-11-25 20:33 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linus Torvalds, Linux Kernel List, Linux-MM List

"Stephen C. Tweedie" <sct@redhat.com> writes:

> --- mm/vmscan.c~	Tue Nov 17 15:43:55 1998
> +++ mm/vmscan.c	Mon Nov 23 17:05:33 1998
> @@ -170,7 +170,7 @@
>  			 * copy in memory, so we add it to the swap
>  			 * cache. */
>  			if (PageSwapCache(page_map)) {
> -				free_page_and_swap_cache(page);
> +				free_page(page);
>  				return (atomic_read(&page_map->count) == 0);
>  			}
>  			add_to_swap_cache(page_map, entry);
> @@ -188,7 +188,7 @@
>  		 * asynchronously.  That's no problem, shrink_mmap() can
>  		 * correctly clean up the occassional unshared page
>  		 * which gets left behind in the swap cache. */
> -		free_page_and_swap_cache(page);
> +		free_page(page);
>  		return 1;	/* we slept: the process may not exist any more */
>  	}
>  
> @@ -202,7 +202,7 @@
>  		set_pte(page_table, __pte(entry));
>  		flush_tlb_page(vma, address);
>  		swap_duplicate(entry);
> -		free_page_and_swap_cache(page);
> +		free_page(page);
>  		return (atomic_read(&page_map->count) == 0);
>  	} 
>  	/* 
> @@ -218,7 +218,11 @@
>  	flush_cache_page(vma, address);
>  	pte_clear(page_table);
>  	flush_tlb_page(vma, address);
> +#if 0
>  	entry = page_unuse(page_map);
> +#else
> +	entry = (atomic_read(&page_map->count) == 1);
> +#endif
>  	__free_page(page_map);
>  	return entry;
>  }

I must admit that after some preliminary testing I can't believe how
GOOD these changes work!

Stephen, you've done a *really* good job.

I will still do some more testing, not to find bugs, but to enjoy
great performance. :)

Everybody, get pre-2.1.130-3 (which already includes above changes),
add #include <linux/interrupt.h> in kernel/itimer.c and enjoy the most
fair MM in Linux, EVER!

Stephen, thanks for such a good code!
-- 
Posted by Zlatko Calusic           E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
	   REALITY.SYS Corrupted: Re-boot universe? (Y/N/Q)
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 17:13     ` Linux-2.1.129 Stephen C. Tweedie
  1998-11-23 19:16       ` Linux-2.1.129 Eric W. Biederman
@ 1998-11-23 19:46       ` Eric W. Biederman
  1998-11-23 21:18         ` Linux-2.1.129 Rik van Riel
  1998-11-24 15:38         ` Linux-2.1.129 Stephen C. Tweedie
  1998-11-23 20:12       ` Linux-2.1.129 Rik van Riel
  1998-11-23 20:53       ` Running 2.1.129 at extrem load [patch] (Was: Linux-2.1.129..) Dr. Werner Fink
  3 siblings, 2 replies; 29+ messages in thread
From: Eric W. Biederman @ 1998-11-23 19:46 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Rik van Riel, Dr. Werner Fink,
	Kernel Mailing List, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> I'm going to check this out: I'll post preliminary benchmarks and a
ST> patch for other people to test tomorrow.  Getting the balancing right
ST> will then just be a matter of making sure that try_to_swap_out gets
ST> called often enough under normal running conditions.  I'm open to
ST> suggestions about that: we've never tried that sort of behaviour in the
ST> vm to my knowledge.

I just said the buffer cache is very similiar and it is.

The trick part with balancing is there is an tradition in linux of
using very little swap space.  And the linux kernel not using swap
space unless it needs it.

The simplest model (and what we use for disk writes) is after
something becomes dirty to wait a little bit (in case of more writes,
(so we don't flood the disk)) and write the data to disk.

Ideally/Theoretically I think that is what we should be doing for swap
as well, as it would spread out the swap writes across evenly across
time.  And should leave most of our pages clean.

To implement that model we would need some different swap statistics,
so our users wouldn't panic.  (i.e. swap used but in swap cache ...)

But that is obviously going a little far for 2.2.  We already have our
model of only try to clean pages when we need memory (ouch!)  Which
we must balance with an amount of reaping by shrink_mmap.  This I
agree is unprecedented.

The correct ratio (of pages to free from each source) (compuated
dynamically) would be:
(# of process pages)/(# of pages)

Basically for every page kswapd frees shrink_mmap must also free one
page.  Plus however many pages shrink_mmap used to return.

So I in practicall terms this would either be a call of shrink_mmap
for every call to swap_out.  Or we would need an extra case added to
the extra shrink_mmap call at the start of do_try_to_free_page.

Eric


--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 19:46       ` Linux-2.1.129 Eric W. Biederman
@ 1998-11-23 21:18         ` Rik van Riel
  1998-11-24  6:28           ` Linux-2.1.129 Eric W. Biederman
  1998-11-24 15:38         ` Linux-2.1.129 Stephen C. Tweedie
  1 sibling, 1 reply; 29+ messages in thread
From: Rik van Riel @ 1998-11-23 21:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Linus Torvalds, Dr. Werner Fink,
	Kernel Mailing List, linux-mm

On 23 Nov 1998, Eric W. Biederman wrote:

> The simplest model (and what we use for disk writes) is after
> something becomes dirty to wait a little bit (in case of more
> writes, (so we don't flood the disk)) and write the data to disk. 

This waiting is also a good thing if we want to do proper
I/O clustering. I believe DU has a switch to only write
dirty data when there's more than XX kB of contiguous data
at that place on the disk (or the data is old).

> Ideally/Theoretically I think that is what we should be doing for
> swap as well, as it would spread out the swap writes across evenly
> across time.  And should leave most of our pages clean. 

Something like this is easily accomplished by pushing the
non-accessed pages into swap cache and swap simultaneously,
remapping the page from swap cache when we want to access
it again.

In order to spread out the disk I/O evenly (why would we
want to do this? -- writing is cheap once the disk head
is in the right place) we might want to implement a BSIisd
'laundry' list...

> But that is obviously going a little far for 2.2.  We already have
> our model of only try to clean pages when we need memory (ouch!) 

This really hurts and can be bad for application stability
when we're under a lot of pressure but there's still swap
space left.

> The correct ratio (of pages to free from each source) (compuated
> dynamically) would be:  (# of process pages)/(# of pages) 
> 
> Basically for every page kswapd frees shrink_mmap must also free one
> page.  Plus however many pages shrink_mmap used to return. 

This is clearly wrong. We can remap the page (soft fault)
from the swap cache, thereby removing the page from the
'inactive list' but not freeing the memory -- after all,
this hidden aging is the main purpose for this system...

I propose we maintain somewhat of a ratio of active/inactive
pages to keep around, so that all unmapped pages get a healthy
amount of aging and we always have enough memory we can easily
free by means of shrink_mmap().

This would give a kswapd() somewhat like the following:

if (nr_free_pages < free_pages_high && inactive_pages)
	shrink_mmap(GFP_SOMETHING);
if (inactive_pages * ratio < active_pages)
	do_try_to_swapout_pages(GFP_SOMETHING);

With things like shrink_dcache_memory(), shm_swap() and
kmem_cache_reap() folded in in the right places and
swap_tick() adjusted to take the active/inactive ratio
into account (get_free_page() too?).

A system like this would have much smoother swapout I/O,
giving higher possible loads on the VM system. Combined
with things like swapin readahead (Stephen, Ingo where
is it?:=) and 'real swapping' it will give a truly
scalable VM system...

Only for multi-gigabyte boxes we might want to bound
the number of inactive pages to (say) 16 MBs in order
to avoid a very large number of soft page faults that
use up too much CPU (Digital Unix seems to be slowed
down a lot by this -- it's using a 1:2 active/inactive
ratio even on our local 1GB box :)...

regards,

Rik -- slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 21:18         ` Linux-2.1.129 Rik van Riel
@ 1998-11-24  6:28           ` Eric W. Biederman
  1998-11-24  7:56             ` Linux-2.1.129 Rik van Riel
  1998-11-24 15:48             ` Linux-2.1.129 Stephen C. Tweedie
  0 siblings, 2 replies; 29+ messages in thread
From: Eric W. Biederman @ 1998-11-24  6:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm

>>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:

RR> On 23 Nov 1998, Eric W. Biederman wrote:
>> The simplest model (and what we use for disk writes) is after
>> something becomes dirty to wait a little bit (in case of more
>> writes, (so we don't flood the disk)) and write the data to disk. 

RR> This waiting is also a good thing if we want to do proper
RR> I/O clustering. I believe DU has a switch to only write
RR> dirty data when there's more than XX kB of contiguous data
RR> at that place on the disk (or the data is old).

I can tell who has been reading Digital Unix literature latetly.

>> Ideally/Theoretically I think that is what we should be doing for
>> swap as well, as it would spread out the swap writes across evenly
>> across time.  And should leave most of our pages clean. 

RR> Something like this is easily accomplished by pushing the
RR> non-accessed pages into swap cache and swap simultaneously,
RR> remapping the page from swap cache when we want to access
RR> it again.

RR> In order to spread out the disk I/O evenly (why would we
RR> want to do this?

Imagine a machine with 1 Gigabyte of RAM and 8 Gigabyte of swap,
in heavy use.  Swapping but not thrashing.

You can't swap out several hundred megabytes all at once.
They need to be swapped out over time.   For pages that are not likely to change
you want them to hit the disk soon after they get set, so you have
more clean memory,  and don't need to write all of the data out when
you get busy.  

You can handle a suddne flurry of network traffic much better this way
for example.

>> The correct ratio (of pages to free from each source) (compuated
>> dynamically) would be:  (# of process pages)/(# of pages) 
>> 
>> Basically for every page kswapd frees shrink_mmap must also free one
>> page.  Plus however many pages shrink_mmap used to return. 

RR> This is clearly wrong.  

No. If for each page we schedule to be swapped, we reclaim a different
page with shrink_mmap immediately.... so we have free ram.

That should keep the balance between swapping and mm as it has
always been.  But I doubt we need to go even that far, to get a working balance.

As far as fixed percentages.  It's a loose every time, and I
won't drop a working feature for an older lesser design.   Having tuneable
fixed percentages is only a win on a 1 application, 1 load pattern box.

Eric
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-24  6:28           ` Linux-2.1.129 Eric W. Biederman
@ 1998-11-24  7:56             ` Rik van Riel
  1998-11-24 15:48             ` Linux-2.1.129 Stephen C. Tweedie
  1 sibling, 0 replies; 29+ messages in thread
From: Rik van Riel @ 1998-11-24  7:56 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-mm

On 24 Nov 1998, Eric W. Biederman wrote:
> >>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:
> RR> On 23 Nov 1998, Eric W. Biederman wrote:
> 
> RR> This waiting is also a good thing if we want to do proper
> RR> I/O clustering. I believe DU has a switch to only write
> RR> dirty data when there's more than XX kB of contiguous data
> RR> at that place on the disk (or the data is old).
> 
> I can tell who has been reading Digital Unix literature latetly.

DU and IRIX scale to much larger machines than Linux does,
so I've been reading the DU bookshelf for quite a while
now. Guess where some of the stuff in /proc/sys/vm comes
from :)

I'd be grateful if anyone can help me to IRIX documentation
(will be bugging our sysadmins later today -- I know they've
got an origin and several indys :).

> >> Ideally/Theoretically I think that is what we should be doing for
> >> swap as well, as it would spread out the swap writes across evenly
> >> across time.  And should leave most of our pages clean. 
> 
> RR> In order to spread out the disk I/O evenly (why would we
> RR> want to do this?
> 
> Imagine a machine with 1 Gigabyte of RAM and 8 Gigabyte of swap, in
> heavy use.  Swapping but not thrashing.  You can't swap out several
> hundred megabytes all at once. 

OK, I see your point now. In your original message I thought
to have read that you wanted to do swap I/O on an individual
basis as opposed to proper I/O clustering. Your second version
of the story is remarkably like what I had in mind :)

> You can handle a suddne flurry of network traffic much better this
> way for example. 

This is the main goal why we should push through the new
VM code ASAP. Gigabit ethernet will be in common use long
before 2.4 hits the street.

> >> The correct ratio (of pages to free from each source) (compuated
> >> dynamically) would be:  (# of process pages)/(# of pages) 
> >> 
> >> Basically for every page kswapd frees shrink_mmap must also free one
> >> page.  Plus however many pages shrink_mmap used to return. 
> 
> RR> This is clearly wrong.  
> 
> No. If for each page we schedule to be swapped, we reclaim a different
> page with shrink_mmap immediately.... so we have free ram.

We only need to have a very small amount of free ram, since
we can easily reclaim memory if we just make sure that we've
got enough unmapped swap cache and page cache laying around.

> As far as fixed percentages.  It's a loose every time, and I won't
> drop a working feature for an older lesser design.  Having tuneable
> fixed percentages is only a win on a 1 application, 1 load pattern
> box. 

The only reason for something like that is that we need to
have some control over the amount of memory that's in the
unmapped/cached state, since:
- we want the pages to undergo somewhat of an aging in order
  to avoid easy thrashing
- we need a large enough amount of unmapped memory which we
  can reclaim fast when we're under heavy (network) pressure
- having a lot of unmapped memory around will give minor page
  faults, decreasing the amount of unmapped memory and requiring
  us to keep scanning memory in a slow but steady pace, this:
  - spreads out swap I/O evenly over time
  - spreads out page aging evenly over space, giving us more
    performance and fair aging than we ever dreamt of

Maybe we want the system to auto-tune the mapped:unmapped
ratio depending on the amount of minor faults and actual
page reclaims going on, with a bottom value of 1/16th of
memory so we always have enough buffer to catch big things.

Rik -- slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-24  6:28           ` Linux-2.1.129 Eric W. Biederman
  1998-11-24  7:56             ` Linux-2.1.129 Rik van Riel
@ 1998-11-24 15:48             ` Stephen C. Tweedie
  1 sibling, 0 replies; 29+ messages in thread
From: Stephen C. Tweedie @ 1998-11-24 15:48 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Rik van Riel, linux-mm

Hi,

On 24 Nov 1998 00:28:16 -0600, ebiederm+eric@ccr.net (Eric
W. Biederman) said:

> Imagine a machine with 1 Gigabyte of RAM and 8 Gigabyte of swap,
> in heavy use.  Swapping but not thrashing.

> You can't swap out several hundred megabytes all at once.  

Sure you can!  It's a tradeoff between moment-to-moment predictability
and overall throughput.  Swapping loads at once gives unpredictable
short-term behaviour but great throughput.  Performance is nearly
always about trading of throughput for things like predictability or
fairness.

> You can handle a suddne flurry of network traffic much better this way
> for example.

As long as you have got enough clean pages around, you can deal with
this anyway: kswapd can find free memory very rapidly as long as it
doesn't have to spend time writing things to swap.

> As far as fixed percentages.  It's a loose every time, and I won't
> drop a working feature for an older lesser design.  Having tuneable
> fixed percentages is only a win on a 1 application, 1 load pattern
> box.

<Nods head vigorously.>  Try running with a big ramdisk on a 2.1.125
box, for example: the precomputed page cache limits no longer work and
performance falls apart.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 19:46       ` Linux-2.1.129 Eric W. Biederman
  1998-11-23 21:18         ` Linux-2.1.129 Rik van Riel
@ 1998-11-24 15:38         ` Stephen C. Tweedie
  1 sibling, 0 replies; 29+ messages in thread
From: Stephen C. Tweedie @ 1998-11-24 15:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel,
	Dr. Werner Fink, Kernel Mailing List, linux-mm

Hi,

On 23 Nov 1998 13:46:16 -0600, ebiederm+eric@ccr.net (Eric
W. Biederman) said:

> The simplest model (and what we use for disk writes) is after
> something becomes dirty to wait a little bit (in case of more writes,
> (so we don't flood the disk)) and write the data to disk.

The disk write model is not a good comparison, since (a) our current
write model is badly broken anyway (the only way to throttle writes is
to run out of memory), and (b) there are all sorts of fairness issues
involving the IO queues too in the write case.  But they have
similarities.

> Ideally/Theoretically I think that is what we should be doing for swap
> as well, as it would spread out the swap writes across evenly across
> time.  And should leave most of our pages clean.

Batching the writes improves our swap throughput enormously.  This is
well proven.  Sometimes we don't want to be too even. :)

> So I in practicall terms this would either be a call of shrink_mmap
> for every call to swap_out.  Or we would need an extra case added to
> the extra shrink_mmap call at the start of do_try_to_free_page.

The patch I just sent out essentially does this.  By making swap_out
unlikely to free real memory (it just unlinks things from ptes while
leaving them in the page cache), it batches out our swap writes and
causes regular aging of swap pages when memory gets short, but still
leaves all of the work of balancing the vm to shrink_mmap() where
those unlinked pages can be reused at will.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Linux-2.1.129..
  1998-11-23 17:13     ` Linux-2.1.129 Stephen C. Tweedie
  1998-11-23 19:16       ` Linux-2.1.129 Eric W. Biederman
  1998-11-23 19:46       ` Linux-2.1.129 Eric W. Biederman
@ 1998-11-23 20:12       ` Rik van Riel
  1998-11-23 20:53       ` Running 2.1.129 at extrem load [patch] (Was: Linux-2.1.129..) Dr. Werner Fink
  3 siblings, 0 replies; 29+ messages in thread
From: Rik van Riel @ 1998-11-23 20:12 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Dr. Werner Fink, Kernel Mailing List, linux-mm

On Mon, 23 Nov 1998, Stephen C. Tweedie wrote:

> So, I have still seen no cases where overall performance with no
> page cache aging was better than performance with it.  However, with
> the swap aging removed as well, we seem to have a page/swap balance
> which doesn't work well on 64MB.  To be honest, I just haven't spent
> much time playing with swap page aging since the early kswap work,
> and that was all done before the page cache was added. 

What way does the balance go? Too much cache/buffer memory
can be 'fixed' by adjusting the settings in /proc/sys/vm/*
(yes, I know it goes against your principles, but some folks
need special behaviour for special-purpose systems anyway)

> On Thu, 19 Nov 1998 22:58:30 +0100 (CET), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
> 
> > It was certainly a huge win when page aging was implemented, but we
> > mainly felt that because there used to be an obscure bug in vmscan.c,
> > causing the kernel to always start scanning at the start of the
> > process' address space.
> 
> Rik, you keep asserting this but I have never understood it.  I have
> asked you several times for a precise description of what benchmarks
> improved when page cache aging was added,

I mean the addition of page aging in kernel version 1.2.x.

Back then there certainly was a big improvement vs 1.1.x,
but unfortunately I was not really into kernel hacking
back then (I didn't even have a Net connection) so I
might have misunderstood things...

> And the "obscure bug" you describe was never there: I've said to you
> more than once that you were misreading the source, and that the
> field you pointed to which was being reset to zero at the start of
> the swapout loop was *guaranteed* to be overwritten with the last
> address scanned before we exited that loop. 

Nevertheless I observed a much more stable and less thash-
prone system with my small patch included.

> swap_out_pmd(), there is a line
> 
> 		tsk->swap_address = address + PAGE_SIZE;

Hmm, this means that it should work as you say. The
system seemed to be much more thash-prone however...(?)

> > This gives the process a chance of reclaiming the page without
> > incurring any I/O and it gives the kernel the possibility of keeping a
> > lot of easily-freeable pages around.
> 
> That would be true if we didn't do the free_page_and_swap_cache trick.
> However, doing that would require two passes: once by the swapper, and
> once by shrink_mmap(): before actually freeing a page.  This actually
> sounds like a *very* good idea to explore, since it means that vmscan.c
> will be concerned exclusively with returning mapped and anonymous pages
> to the page cache.

It is also what *BSD and OSF/1 seem to do. They have tuned
and balanced this system for the last 15 years so the system
should be rather well tuned...

> > Maybe we even want to keep a 3:1 ratio or something like that for
> > mapped:swap_cached pages and a semi- FIFO reclamation of swap cached
> > pages so we can simulate a bit of (very cheap) page aging.
> 
> I will just restate my profound conviction that any VM balancing which
> works by imposing precalculated limits on resources is fundamentally
> wrong.

The reason for a ratio like this is to ensure that:
- there are enough pages that can be free()d at any time,
  without us needing to scan the page tables, this also
  serves as a 'buffer' for high-pressure moments
- pages will spend enough time in 'unmapped' mode to have
  some serious aging imposed on them, not doing this might
  cancel out the effect we want (multi queue semantics)
- pages that are used semi-often will have some soft faults,
  always-used pages won't. keeping the soft-fault stats will
  enable us to make better pageout decisions cheaply
- when a page softfaults (is remapped in from the unmapped
  state) we can get below the wanted ratio and push out
  something else, this gives a nice, slow and uniform page
  aging system (especially when we observe a second chance FIFO
  algorithm for reclaiming the page-/swapcached and buffer
  pages, only breaking the FIFO style when memory is fragmented)
- keeping 25% of memory in unmapped state allows us to easily
  'fix' memory fragmentation, solving that problem as well --
  without having to give up the fast & cheap memory allocator
  we use now
- the easy-free buffer will allow us to keep less free memory,
  a few higher-order buffers should be all since we can free
  cached pages (shrink_mmap()) pages immediately,
- this in turn might slightly reduce swapping, especially on
  smaller machines

cheers,

Rik -- slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Running 2.1.129 at extrem load [patch] (Was: Linux-2.1.129..)
  1998-11-23 17:13     ` Linux-2.1.129 Stephen C. Tweedie
                         ` (2 preceding siblings ...)
  1998-11-23 20:12       ` Linux-2.1.129 Rik van Riel
@ 1998-11-23 20:53       ` Dr. Werner Fink
  1998-11-23 21:59         ` Rik van Riel
  3 siblings, 1 reply; 29+ messages in thread
From: Dr. Werner Fink @ 1998-11-23 20:53 UTC (permalink / raw)
  To: Stephen C. Tweedie, Linus Torvalds, Rik van Riel
  Cc: linux-mm, Kernel Mailing List

[...]

> > Maybe we even want to keep a 3:1 ratio or something like that for
> > mapped:swap_cached pages and a semi- FIFO reclamation of swap cached
> > pages so we can simulate a bit of (very cheap) page aging.
> 
> I will just restate my profound conviction that any VM balancing which
> works by imposing precalculated limits on resources is fundamentally
> wrong.
> 
> Cheers,
>   Stephen

I've done some simply test and worked out some changes (patch enclosed).
Starting with a plain 2.1.129 I've run a simple stress
situation:

       * 64MB ram + 128 MB swap
       * Under X11 (fvwm2)
       * xload
       * xosview
       * xterm running top
       * xterm running tail -f /var/log/warn /var/log/messages
       * xterm compiling 2.0.36 sources with:
             while true; do make clean; make -j || break ; done
       * xterm compiling 2.1.129 sources with:
             while true; do make clean; make MAKE='make -j5' || break ; done


.. clearly all together.  Load goes upto 30 and higher and random SIGBUS
to random processes occurs (in best case the X server was signaled which
makes the system usable again).

I've add some changes:

       * changed the position of deleting pages from
         swap cache in mm/filemap.c::shrink_one_page()
       * add a simple repeat case in
         mm/page_alloc.c::__get_free_pages() if we wait
         on low priority pages (aka GFP_USER).
       * don't let mm/vmscan.c::try_to_free_pages()
         scan to much.
       * add a simple age scheme for recently swapped in
         pages. (The condition, e.g. a bigger rss window
         is changeable).

The random SIGBUS disappears and the system seems more usable
which means only loads over 35 and higher makes the system
only temporarily unusable.


            Werner

--------------------------------------------------------------------
diff -urN linux-2.1.129/include/linux/mm.h linux/include/linux/mm.h
--- linux-2.1.129/include/linux/mm.h	Thu Nov 19 20:49:37 1998
+++ linux/include/linux/mm.h	Mon Nov 23 14:53:14 1998
@@ -117,7 +117,7 @@
 	unsigned long offset;
 	struct page *next_hash;
 	atomic_t count;
-	unsigned int unused;
+	unsigned int lifetime;
 	unsigned long flags;	/* atomic flags, some possibly updated asynchronously */
 	struct wait_queue *wait;
 	struct page **pprev_hash;
diff -urN linux-2.1.129/ipc/shm.c linux/ipc/shm.c
--- linux-2.1.129/ipc/shm.c	Sun Oct 18 00:52:18 1998
+++ linux/ipc/shm.c	Mon Nov 23 15:14:00 1998
@@ -15,6 +15,7 @@
 #include <linux/stat.h>
 #include <linux/malloc.h>
 #include <linux/swap.h>
+#include <linux/swapctl.h>
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/init.h>
@@ -656,6 +657,7 @@
 
 	pte = __pte(shp->shm_pages[idx]);
 	if (!pte_present(pte)) {
+		int old_rss = shm_rss;
 		unsigned long page = get_free_page(GFP_KERNEL);
 		if (!page) {
 			oom(current);
@@ -677,6 +679,16 @@
 			shm_swp--;
 		}
 		shm_rss++;
+
+		/* Increase life time of the page */
+		mem_map[MAP_NR(page)].lifetime = 0;
+		if (old_rss == 0) 
+			current->dec_flt++;
+		if (current->dec_flt > 3) {
+			mem_map[MAP_NR(page)].lifetime = 3 * PAGE_ADVANCE;
+			current->dec_flt = 0;
+		}
+
 		pte = pte_mkdirty(mk_pte(page, PAGE_SHARED));
 		shp->shm_pages[idx] = pte_val(pte);
 	} else
diff -urN linux-2.1.129/mm/filemap.c linux/mm/filemap.c
--- linux-2.1.129/mm/filemap.c	Thu Nov 19 20:44:18 1998
+++ linux/mm/filemap.c	Mon Nov 23 13:38:47 1998
@@ -167,15 +167,14 @@
 	case 1:
 		/* is it a swap-cache or page-cache page? */
 		if (page->inode) {
-			/* Throw swap-cache pages away more aggressively */
-			if (PageSwapCache(page)) {
-				delete_from_swap_cache(page);
-				return 1;
-			}
 			if (test_and_clear_bit(PG_referenced, &page->flags))
 				break;
 			if (pgcache_under_min())
 				break;
+			if (PageSwapCache(page)) {
+				delete_from_swap_cache(page);
+				return 1;
+			}
 			remove_inode_page(page);
 			return 1;
 		}
diff -urN linux-2.1.129/mm/page_alloc.c linux/mm/page_alloc.c
--- linux-2.1.129/mm/page_alloc.c	Thu Nov 19 20:44:18 1998
+++ linux/mm/page_alloc.c	Mon Nov 23 19:31:10 1998
@@ -236,6 +236,7 @@
 unsigned long __get_free_pages(int gfp_mask, unsigned long order)
 {
 	unsigned long flags;
+	int loop = 0;
 
 	if (order >= NR_MEM_LISTS)
 		goto nopage;
@@ -262,6 +263,7 @@
 				goto nopage;
 		}
 	}
+repeat:
 	spin_lock_irqsave(&page_alloc_lock, flags);
 	RMQUEUE(order, (gfp_mask & GFP_DMA));
 	spin_unlock_irqrestore(&page_alloc_lock, flags);
@@ -274,6 +276,8 @@
 	if (gfp_mask & __GFP_WAIT) {
 		current->policy |= SCHED_YIELD;
 		schedule();
+		if (!loop++ && nr_free_pages > freepages.low)
+			goto repeat;
 	}
 
 nopage:
@@ -380,6 +384,7 @@
 {
 	unsigned long page;
 	struct page *page_map;
+	int shared, old_rss = vma->vm_mm->rss;
 	
 	page_map = read_swap_cache(entry);
 
@@ -399,8 +404,18 @@
 	vma->vm_mm->rss++;
 	tsk->min_flt++;
 	swap_free(entry);
+	shared = is_page_shared(page_map);
 
-	if (!write_access || is_page_shared(page_map)) {
+	/* Increase life time of the page */
+	page_map->lifetime = 0;
+	if (old_rss == 0)
+		tsk->dec_flt++;
+	if (tsk->dec_flt > 3) {
+		page_map->lifetime = (shared ? 2 : 5) * PAGE_ADVANCE;
+		tsk->dec_flt = 0;
+	}
+
+	if (!write_access || shared) {
 		set_pte(page_table, mk_pte(page, vma->vm_page_prot));
 		return;
 	}
diff -urN linux-2.1.129/mm/vmscan.c linux/mm/vmscan.c
--- linux-2.1.129/mm/vmscan.c	Thu Nov 19 20:44:18 1998
+++ linux/mm/vmscan.c	Mon Nov 23 19:34:21 1998
@@ -131,12 +131,21 @@
 		return 0;
 	}
 
+	/* life time decay */
+	if (page_map->lifetime > PAGE_DECLINE)
+		page_map->lifetime -= PAGE_DECLINE;
+	else
+		page_map->lifetime = 0;
+	if (page_map->lifetime)
+		return 0;
+
 	if (pte_dirty(pte)) {
 		if (vma->vm_ops && vma->vm_ops->swapout) {
 			pid_t pid = tsk->pid;
 			vma->vm_mm->rss--;
-			if (vma->vm_ops->swapout(vma, address - vma->vm_start + vma->vm_offset, page_table))
+			if (vma->vm_ops->swapout(vma, address - vma->vm_start + vma->vm_offset, page_table)) {
 				kill_proc(pid, SIGBUS, 1);
+			}
 		} else {
 			/*
 			 * This is a dirty, swappable page.  First of all,
@@ -561,6 +570,7 @@
 int try_to_free_pages(unsigned int gfp_mask, int count)
 {
 	int retval = 1;
+	int is_dma = (gfp_mask & __GFP_DMA);
 
 	lock_kernel();
 	if (!(current->flags & PF_MEMALLOC)) {
@@ -568,6 +578,8 @@
 		do {
 			retval = do_try_to_free_page(gfp_mask);
 			if (!retval)
+				break;
+			if (!is_dma && nr_free_pages > freepages.high + SWAP_CLUSTER_MAX)
 				break;
 			count--;
 		} while (count > 0);
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Running 2.1.129 at extrem load [patch] (Was: Linux-2.1.129..)
  1998-11-23 20:53       ` Running 2.1.129 at extrem load [patch] (Was: Linux-2.1.129..) Dr. Werner Fink
@ 1998-11-23 21:59         ` Rik van Riel
  1998-11-23 22:35           ` Dr. Werner Fink
  0 siblings, 1 reply; 29+ messages in thread
From: Rik van Riel @ 1998-11-23 21:59 UTC (permalink / raw)
  To: Dr. Werner Fink
  Cc: Stephen C. Tweedie, Linus Torvalds, linux-mm, Kernel Mailing List

On Mon, 23 Nov 1998, Dr. Werner Fink wrote:

>  	struct page *next_hash;
>  	atomic_t count;
> -	unsigned int unused;
> +	unsigned int lifetime;
>  	unsigned long flags;	/* atomic flags, some possibly updated asynchronously */

Hmm, this looks suspiciously like a new incarnation of
page aging (which we want to avoid, at least in some
parts of the kernel).

> --- linux-2.1.129/mm/filemap.c	Thu Nov 19 20:44:18 1998
> +++ linux/mm/filemap.c	Mon Nov 23 13:38:47 1998
> @@ -167,15 +167,14 @@
>  	case 1:
>  		/* is it a swap-cache or page-cache page? */
>  		if (page->inode) {
> -			/* Throw swap-cache pages away more aggressively */
> -			if (PageSwapCache(page)) {
> -				delete_from_swap_cache(page);
> -				return 1;
> -			}
>  			if (test_and_clear_bit(PG_referenced, &page->flags))
>  				break;
>  			if (pgcache_under_min())
>  				break;
> +			if (PageSwapCache(page)) {
> +				delete_from_swap_cache(page);
> +				return 1;
> +			}

This piece looks good and will result in us keeping swap cached
pages when the page cache is low. We might want to include this
in the current kernel tree, together with the removal of the
free_after construction.

> diff -urN linux-2.1.129/mm/vmscan.c linux/mm/vmscan.c
> --- linux-2.1.129/mm/vmscan.c	Thu Nov 19 20:44:18 1998
> +++ linux/mm/vmscan.c	Mon Nov 23 19:34:21 1998
> @@ -131,12 +131,21 @@
>  		return 0;
>  	}
>  
> +	/* life time decay */
> +	if (page_map->lifetime > PAGE_DECLINE)
> +		page_map->lifetime -= PAGE_DECLINE;
> +	else
> +		page_map->lifetime = 0;
> +	if (page_map->lifetime)
> +		return 0;
> +

Sorry Werner, but this is exactly the place where we need to
remove any from of page aging. We can do some kind of aging
in the swap cache, page cache and buffer cache, but doing
aging here is just prohibitively expensive and needs to be
removed.

IMHO a better construction be to have a page->fresh flag
which would be set on unmapping from swap_out(). Then
shrink_mmap() would free pages with page->fresh reset
and reset page->fresh if it is set. This way we can
free a page at it's second scan so we avoid freeing
a page that was just unmapped (and giving each page a
bit of a chance to undergo cheap aging).

regards,

Rik -- slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Running 2.1.129 at extrem load [patch] (Was: Linux-2.1.129..)
  1998-11-23 21:59         ` Rik van Riel
@ 1998-11-23 22:35           ` Dr. Werner Fink
  1998-11-24 12:38             ` Dr. Werner Fink
  0 siblings, 1 reply; 29+ messages in thread
From: Dr. Werner Fink @ 1998-11-23 22:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Linus Torvalds, linux-mm, Kernel Mailing List


> >  	struct page *next_hash;
> >  	atomic_t count;
> > -	unsigned int unused;
> > +	unsigned int lifetime;
> >  	unsigned long flags;	/* atomic flags, some possibly updated asynchronously */
> 
> Hmm, this looks suspiciously like a new incarnation of
> page aging (which we want to avoid, at least in some
> parts of the kernel).

Yep ... we should avoid removing this `unused'.

> > --- linux-2.1.129/mm/filemap.c	Thu Nov 19 20:44:18 1998
> > +++ linux/mm/filemap.c	Mon Nov 23 13:38:47 1998
> > @@ -167,15 +167,14 @@
> >  	case 1:
> >  		/* is it a swap-cache or page-cache page? */
> >  		if (page->inode) {
> > -			/* Throw swap-cache pages away more aggressively */
> > -			if (PageSwapCache(page)) {
> > -				delete_from_swap_cache(page);
> > -				return 1;
> > -			}
> >  			if (test_and_clear_bit(PG_referenced, &page->flags))
> >  				break;
> >  			if (pgcache_under_min())
> >  				break;
> > +			if (PageSwapCache(page)) {
> > +				delete_from_swap_cache(page);
> > +				return 1;
> > +			}
> 
> This piece looks good and will result in us keeping swap cached
> pages when the page cache is low. We might want to include this
> in the current kernel tree, together with the removal of the
> free_after construction.

Hmmm ... don't forget the change in __get_free_pages(). Without this
page I see random SIGBUS at extreme load killing random processes.

[...]

> Sorry Werner, but this is exactly the place where we need to
> remove any from of page aging. We can do some kind of aging
> in the swap cache, page cache and buffer cache, but doing
> aging here is just prohibitively expensive and needs to be
> removed.
> 
> IMHO a better construction be to have a page->fresh flag
> which would be set on unmapping from swap_out(). Then
> shrink_mmap() would free pages with page->fresh reset
> and reset page->fresh if it is set. This way we can
> free a page at it's second scan so we avoid freeing
> a page that was just unmapped (and giving each page a
> bit of a chance to undergo cheap aging).

Furthermore highly used pages should go not to often
into the swap cache ... this leads to something like
a score list of often used pages.  Such a score value
instead of a flag could be easily decreased by
shrink_mmap() scanning all pages.


          Werner
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Running 2.1.129 at extrem load [patch] (Was: Linux-2.1.129..)
  1998-11-23 22:35           ` Dr. Werner Fink
@ 1998-11-24 12:38             ` Dr. Werner Fink
  0 siblings, 0 replies; 29+ messages in thread
From: Dr. Werner Fink @ 1998-11-24 12:38 UTC (permalink / raw)
  To: Rik van Riel, Stephen C. Tweedie, Linus Torvalds, linux-mm,
	Kernel Mailing List

> > Sorry Werner, but this is exactly the place where we need to
> > remove any from of page aging. We can do some kind of aging
> > in the swap cache, page cache and buffer cache, but doing
> > aging here is just prohibitively expensive and needs to be
> > removed.
> > 
> > IMHO a better construction be to have a page->fresh flag
> > which would be set on unmapping from swap_out(). Then
> > shrink_mmap() would free pages with page->fresh reset
> > and reset page->fresh if it is set. This way we can
> > free a page at it's second scan so we avoid freeing
> > a page that was just unmapped (and giving each page a
> > bit of a chance to undergo cheap aging).

Comments on the enclosed patch please :-)

 * Without the old ageing scheme within try_to_swap_out any
   bigger increase of the load causes a temporarily unusable system.

 * The `if (buffer_under_min()) break;' within shrink_one_page()
   reduces the average system CPU time in comparison to the user
   CPU time.


            Werner

--------------------------------------------------------------------------
diff -urN linux-2.1.129/include/linux/mm.h linux/include/linux/mm.h
--- linux-2.1.129/include/linux/mm.h	Thu Nov 19 20:49:37 1998
+++ linux/include/linux/mm.h	Tue Nov 24 00:09:29 1998
@@ -117,7 +117,7 @@
 	unsigned long offset;
 	struct page *next_hash;
 	atomic_t count;
-	unsigned int unused;
+	unsigned int lifetime;
 	unsigned long flags;	/* atomic flags, some possibly updated asynchronously */
 	struct wait_queue *wait;
 	struct page **pprev_hash;
diff -urN linux-2.1.129/ipc/shm.c linux/ipc/shm.c
--- linux-2.1.129/ipc/shm.c	Sun Oct 18 00:52:18 1998
+++ linux/ipc/shm.c	Tue Nov 24 12:38:07 1998
@@ -15,6 +15,7 @@
 #include <linux/stat.h>
 #include <linux/malloc.h>
 #include <linux/swap.h>
+#include <linux/swapctl.h>
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/init.h>
@@ -677,6 +678,11 @@
 			shm_swp--;
 		}
 		shm_rss++;
+
+		/* Increase life time of the page */
+		if (mem_map[MAP_NR(page)].lifetime < 3 && pgcache_under_max())
+			mem_map[MAP_NR(page)].lifetime++;
+
 		pte = pte_mkdirty(mk_pte(page, PAGE_SHARED));
 		shp->shm_pages[idx] = pte_val(pte);
 	} else
diff -urN linux-2.1.129/mm/filemap.c linux/mm/filemap.c
--- linux-2.1.129/mm/filemap.c	Thu Nov 19 20:44:18 1998
+++ linux/mm/filemap.c	Tue Nov 24 12:25:10 1998
@@ -136,6 +136,8 @@
 
 	if (PageLocked(page))
 		goto next;
+	if (page->lifetime)
+		page->lifetime--;
 	if ((gfp_mask & __GFP_DMA) && !PageDMA(page))
 		goto next;
 	/* First of all, regenerate the page's referenced bit
@@ -167,15 +169,16 @@
 	case 1:
 		/* is it a swap-cache or page-cache page? */
 		if (page->inode) {
-			/* Throw swap-cache pages away more aggressively */
-			if (PageSwapCache(page)) {
-				delete_from_swap_cache(page);
-				return 1;
-			}
 			if (test_and_clear_bit(PG_referenced, &page->flags))
 				break;
 			if (pgcache_under_min())
 				break;
+			if (PageSwapCache(page)) {
+				if (page->lifetime && pgcache_under_borrow())
+					break;
+				delete_from_swap_cache(page);
+				return 1;
+			}
 			remove_inode_page(page);
 			return 1;
 		}
@@ -183,7 +186,8 @@
 		 * If it has been referenced recently, don't free it */
 		if (test_and_clear_bit(PG_referenced, &page->flags))
 			break;
-
+		if (buffer_under_min())
+			break;
 		/* is it a buffer cache page? */
 		if (bh && try_to_free_buffer(bh, &bh, 6))
 			return 1;
diff -urN linux-2.1.129/mm/page_alloc.c linux/mm/page_alloc.c
--- linux-2.1.129/mm/page_alloc.c	Thu Nov 19 20:44:18 1998
+++ linux/mm/page_alloc.c	Tue Nov 24 12:37:30 1998
@@ -231,11 +231,13 @@
 		map += size; \
 	} \
 	atomic_set(&map->count, 1); \
+	map->lifetime = 0; \
 } while (0)
 
 unsigned long __get_free_pages(int gfp_mask, unsigned long order)
 {
 	unsigned long flags;
+	int loop = 0;
 
 	if (order >= NR_MEM_LISTS)
 		goto nopage;
@@ -262,6 +264,7 @@
 				goto nopage;
 		}
 	}
+repeat:
 	spin_lock_irqsave(&page_alloc_lock, flags);
 	RMQUEUE(order, (gfp_mask & GFP_DMA));
 	spin_unlock_irqrestore(&page_alloc_lock, flags);
@@ -274,6 +277,8 @@
 	if (gfp_mask & __GFP_WAIT) {
 		current->policy |= SCHED_YIELD;
 		schedule();
+		if (!loop++ && nr_free_pages > freepages.low)
+			goto repeat;
 	}
 
 nopage:
@@ -399,6 +404,10 @@
 	vma->vm_mm->rss++;
 	tsk->min_flt++;
 	swap_free(entry);
+
+	/* Increase life time of the page */
+	if (page_map->lifetime < 3 && pgcache_under_max())
+		page_map->lifetime++;
 
 	if (!write_access || is_page_shared(page_map)) {
 		set_pte(page_table, mk_pte(page, vma->vm_page_prot));
diff -urN linux-2.1.129/mm/swap.c linux/mm/swap.c
--- linux-2.1.129/mm/swap.c	Wed Sep  9 17:56:59 1998
+++ linux/mm/swap.c	Tue Nov 24 13:08:19 1998
@@ -76,7 +76,7 @@
 
 buffer_mem_t page_cache = {
 	5,	/* minimum percent page cache */
-	30,	/* borrow percent page cache */
+	25,	/* borrow percent page cache */
 	75	/* maximum */
 };
 
diff -urN linux-2.1.129/mm/vmscan.c linux/mm/vmscan.c
--- linux-2.1.129/mm/vmscan.c	Thu Nov 19 20:44:18 1998
+++ linux/mm/vmscan.c	Tue Nov 24 00:06:20 1998
@@ -561,6 +561,7 @@
 int try_to_free_pages(unsigned int gfp_mask, int count)
 {
 	int retval = 1;
+	int is_dma = (gfp_mask & __GFP_DMA);
 
 	lock_kernel();
 	if (!(current->flags & PF_MEMALLOC)) {
@@ -568,6 +569,8 @@
 		do {
 			retval = do_try_to_free_page(gfp_mask);
 			if (!retval)
+				break;
+			if (!is_dma && nr_free_pages > freepages.high + SWAP_CLUSTER_MAX)
 				break;
 			count--;
 		} while (count > 0);
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~1998-11-26 12:58 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.LNX.3.95.981119002335.838A-100000@penguin.transmeta.com>
1998-11-19 21:34 ` Linux-2.1.129 Dr. Werner Fink
1998-11-19 21:58   ` Linux-2.1.129 Rik van Riel
1998-11-20 12:09     ` Linux-2.1.129 Dr. Werner Fink
1998-11-19 22:33   ` Linux-2.1.129 Linus Torvalds
1998-11-23 17:13     ` Linux-2.1.129 Stephen C. Tweedie
1998-11-23 19:16       ` Linux-2.1.129 Eric W. Biederman
1998-11-23 20:02         ` Linux-2.1.129 Linus Torvalds
1998-11-23 21:25           ` Linux-2.1.129 Rik van Riel
1998-11-23 22:19           ` Linux-2.1.129 Dr. Werner Fink
1998-11-24  3:37           ` Linux-2.1.129 Eric W. Biederman
1998-11-24 15:25           ` Linux-2.1.129 Stephen C. Tweedie
1998-11-24 17:33             ` Linux-2.1.129 Linus Torvalds
1998-11-24 19:59               ` Linux-2.1.129 Rik van Riel
1998-11-24 20:45                 ` Linux-2.1.129 Linus Torvalds
1998-11-25 14:19               ` Linux-2.1.129 Stephen C. Tweedie
1998-11-25 21:07                 ` Linux-2.1.129 Eric W. Biederman
1998-11-26 12:57                   ` Linux-2.1.129 Stephen C. Tweedie
1998-11-25 20:33             ` Linux-2.1.129 Zlatko Calusic
1998-11-23 19:46       ` Linux-2.1.129 Eric W. Biederman
1998-11-23 21:18         ` Linux-2.1.129 Rik van Riel
1998-11-24  6:28           ` Linux-2.1.129 Eric W. Biederman
1998-11-24  7:56             ` Linux-2.1.129 Rik van Riel
1998-11-24 15:48             ` Linux-2.1.129 Stephen C. Tweedie
1998-11-24 15:38         ` Linux-2.1.129 Stephen C. Tweedie
1998-11-23 20:12       ` Linux-2.1.129 Rik van Riel
1998-11-23 20:53       ` Running 2.1.129 at extrem load [patch] (Was: Linux-2.1.129..) Dr. Werner Fink
1998-11-23 21:59         ` Rik van Riel
1998-11-23 22:35           ` Dr. Werner Fink
1998-11-24 12:38             ` Dr. Werner Fink

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox