linux-mm.kvack.org archive mirror
* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
       [not found] ` <Pine.LNX.4.21.0006291330520.1713-100000@inspiron.random>
@ 2000-06-29 13:00   ` Stephen C. Tweedie
  2000-07-06 10:35     ` Andrea Arcangeli
  0 siblings, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2000-06-29 13:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm

Hi,

On Thu, Jun 29, 2000 at 01:55:07PM +0200, Andrea Arcangeli wrote:
> 
> I agree, the current swap_out design is too fragile.
> 
> btw, in that area we also have a subtle hack: when we unmap a clean
> page we consider that a "fail" ;), even though we actually made some
> kind of progress.

I really think we need to avoid such hacks entirely and just fix the
design.  The thing is, fixing the design isn't actually all that hard.

Rik's multi-queue stuff is the place to start (this is not a
coincidence --- we spent quite a bit of time talking this through).
Aging process pages and unmapping them should be considered part of
the same job.  Removing pages from memory completely is a separate
job.  I can't emphasise this enough --- this separation just fixes so
many problems in our current VM that we really, really need it for
2.4.

Look at how such a separation affects the swap_out problem above.  We
now have two jobs to do --- the aging code needs to keep a certain
number of pages freeable on the last-chance list (whatever you happen
to call it), that number being dependent on current memory pressure.
That list consists of nothing but unmapped, clean pages.  (A separate
list for unmapped, dirty pages is probably desirable for completely
different reasons.)

Do this and there is no longer any confusion in the swapper itself
about whether a page has become freed or not.  Either a foreground
call to the swapout code, or a background kswapd loop, can keep
populating the last chance lists; it doesn't matter, because we
decouple the concept of swapout from the concept of freeing memory.
When we actually want to free pages now, we can *always* tell how much
cheap page reclaim can be done, just by looking at the length of the
last-chance list. 

We can play all sorts of games with this, easily.  For example, when
the real free page count gets too low, we can force all normal page
allocations to be done from the last-chance list instead of the free
list, allowing only GFP_ATOMIC allocations to use up genuine free
pages.  That gives us proper flow control for non-atomic memory
allocations without all of the current races between one process
freeing a page and then trying to allocate it once try_to_free_page()
has returned (right now, an interrupt may have gobbled the page in the
mean time because we use the same list for the pages returned by
swap_out as for allocations).

I really think we need to forget about tuning the 2.4 VM until we have
such fundamental structures in place.  Until we have done that hard
work, we're fine-tuning a system which is ultimately fragile.  Any
structural changes will make the fine-tuning obsolete, so we need to
get the changes necessary for a robust VM in _first_, and then do the
performance fine-tuning.

One obvious consequence of doing this is that we need to separate out
mechanisms from policy.  With multiple queues in the VM for these
different jobs --- aging, cleaning, reclaiming --- we can separate out
the different mechanisms in the VM much more easily, which makes it
far easier to tune the policy for performance optimisations later on.
Right now, to do policy tuning we end up playing with core mechanisms
like the flow control loops all over the place.  Nasty.

Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-06-29 13:00   ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Stephen C. Tweedie
@ 2000-07-06 10:35     ` Andrea Arcangeli
  2000-07-06 13:29       ` Stephen C. Tweedie
  2000-07-06 13:54       ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Roman Zippel
  0 siblings, 2 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-06 10:35 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Marcelo Tosatti, Rik van Riel, Jens Axboe, Alan Cox,
	Derek Martin, Linux Kernel, linux-mm, David S. Miller

On Thu, 29 Jun 2000, Stephen C. Tweedie wrote:

>Rik's multi-queue stuff is the place to start (this is not a
>coincidence --- we spent quite a bit of time talking this through).
>Aging process pages and unmapping them should be considered part of
>the same job.  Removing pages from memory completely is a separate
>job.  I can't emphasise this enough --- this separation just fixes so
>many problems in our current VM that we really, really need it for
>2.4.
>
>Look at how such a separation affects the swap_out problem above.  We
>now have two jobs to do --- the aging code needs to keep a certain
>number of pages freeable on the last-chance list (whatever you happen
>to call it), that number being dependent on current memory pressure.
>That list consists of nothing but unmapped, clean pages.  (A separate
>list for unmapped, dirty pages is probably desirable for completely
>different reasons.)
>
>Do this and there is no longer any confusion in the swapper itself
>about whether a page has become freed or not.  Either a foreground
>call to the swapout code, or a background kswapd loop, can keep
>populating the last chance lists; it doesn't matter, because we
>decouple the concept of swapout from the concept of freeing memory.
>When we actually want to free pages now, we can *always* tell how much
>cheap page reclaim can be done, just by looking at the length of the
>last-chance list. 
>
>We can play all sorts of games with this, easily.  For example, when
>the real free page count gets too low, we can force all normal page
>allocations to be done from the last-chance list instead of the free
>list, allowing only GFP_ATOMIC allocations to use up genuine free
>pages.  That gives us proper flow control for non-atomic memory
>allocations without all of the current races between one process
>freeing a page and then trying to allocate it once try_to_free_page()
>has returned (right now, an interrupt may have gobbled the page in the
>mean time because we use the same list for the pages returned by
>swap_out as for allocations).
>
>I really think we need to forget about tuning the 2.4 VM until we have
>such fundamental structures in place.  Until we have done that hard
>work, we're fine-tuning a system which is ultimately fragile.  Any
>structural changes will make the fine-tuning obsolete, so we need to
>get the changes necessary for a robust VM in _first_, and then do the
>performance fine-tuning.
>
>One obvious consequence of doing this is that we need to separate out
>mechanisms from policy.  With multiple queues in the VM for these
>different jobs --- aging, cleaning, reclaiming --- we can separate out
>the different mechanisms in the VM much more easily, which makes it
>far easier to tune the policy for performance optimisations later on.
>Right now, to do policy tuning we end up playing with core mechanisms
>like the flow control loops all over the place.  Nasty.

I'm not sure exactly what you planned to do (maybe we can talk about
this some time soon), but I'll tell you what I planned to do, taking
the basic throw-out-swap_out idea from the very _cool_ DaveM patch
floating around.  That's the _only_ recent 2.[34].x VM patch I've seen
that really excited me (I haven't studied all the details of his patch,
but I'm pretty sure its design is very similar, even if probably not
identical, to what I'm trying to do).

In classzone I already keep mapped pages off the lru, and I have two
lists, one for the swap_cache and one for the page_cache; that is
necessary to avoid cache pollution during swapping, and all the numbers
I received from users confirm it.  (The only bad report I got is from
hpa, and I think the problem was the suboptimal free_before_allocate
fix that I forward-ported and merged into the classzone patch I sent to
Alan for ac22-class, and that I then dropped immediately in
ac22-class++.  However, I never had confirmation that ac22-class++
fixed the bad behaviour with lots of memory and streaming I/O, so I
can't exclude that the problem is still there.  At least, to fix that
allocator race, I developed GFP-race-3 for 2.2.16, which seems to work
fine.)

The next step after what we have in classzone is, instead of only
removing the mapped pages from the lru_cache (as classzone does now),
to _refile_ (not just list_del) the mapped pages onto a separate
lru_mapped queue.  Anonymous and shm pages will then be chained on the
same lru_mapped list (I already have all the entry points for anonymous
memory in the current classzone; I only need to do the same for shm,
though as a first step I will probably leave shm_swap around, providing
backwards compatibility for memory, like shm, whose page->map_count is
left at zero).

Then we'll need a page-to-pte_chain reverse lookup.  Once we have that
too, we can remove swap_out and do everything (except the dcache/icache
things) in shrink_mmap (I'm sure Dave threw swap_out away outright, and
I'm pretty sure he did it in a very similar way).  In the longer term
the dcache/icache should also be placed on a page-based lru that lives
at the same level as the lru_cache lru (or alternatively between
lru_cache and lru_mapped).

So basically we'll have these completely different lists:

	lru_swap_cache
	lru_cache
	lru_mapped

The three caches have completely different importance, implicit in the
semantics of the memory they queue.  Shrinking the swap_cache first is
vital for performance under swap, for example (and I can already do
that in recent classzone patches).  Shrinking the lru_cache first is
vital for performance under streaming I/O in the scenario where we are
not low on freeable memory.

We'll only have to walk the swap cache, then fall back to the
lru_cache, and then fall back to lru_mapped.  In normal usage the
swap_cache lru will be empty.  (The mapped swap_cache can probably be
mixed into the lru_mapped cache.)  While browsing the lru_mapped list
we'll take care of the accessed bit in the pte by checking all the
ptes, clearing the accessed bit in each of them, and declining to free
the page if at least one pte had the accessed bit set.  For the pages
on all the lrus, the referenced bit will keep working as it does now,
to avoid rotating the lru on each cache hit.

For the pages in lru_cache and lru_swap_cache the pte chain has to be
empty, or we'll BUG().

We'll also completely avoid the problem we have now of not being able
to report success/non-success from swap_out for clean pages, since
we'll free them in one go after clearing the pte from shrink_mmap, or
convert them to swap_cache pages that we'll free later.

invalidate_inode_pages will also become trivial, since we'll be able to
correctly invalidate mapped pages too and clear their ptes.  This also
makes it trivial to optimize msync, by simply clearing the dirty
bitflag in all the ptes in each page's chain ;) and probably some other
currently-nasty things too.

The real downside of this design is that we'll have to chain the ptes,
with potential additional SMP locking and certainly a few more cycles
per page fault and per pte clearing.  The additional work is at least
O(1), and it will only be a ""mere"" lock+unlock plus a list_add or
list_del.  Still, the design looks promising to me even if the rework
is very intensive (probably more intensive than Dave's patch).

I usually prefer to talk about things once they're working on my box,
to avoid vapourware threads, but since I'm often reading about other
vapourware stuff too, I preferred to describe my so-far-only-vapourware
plan as well, so you're aware of the other alternative VM work going on
and can choose to join (or reject) it ;).

Comments?

Andrea


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-06 10:35     ` Andrea Arcangeli
@ 2000-07-06 13:29       ` Stephen C. Tweedie
  2000-07-09 17:11         ` Swap clustering with new VM Marcelo Tosatti
  2000-07-09 20:31         ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Andrea Arcangeli
  2000-07-06 13:54       ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Roman Zippel
  1 sibling, 2 replies; 42+ messages in thread
From: Stephen C. Tweedie @ 2000-07-06 13:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

Hi,

On Thu, Jul 06, 2000 at 12:35:58PM +0200, Andrea Arcangeli wrote:
> 
> I'm not sure exactly what you planned to do (maybe we can talk about
> this some time soon), but I'll tell you what I planned to do, taking
> the basic throw-out-swap_out idea from the very _cool_ DaveM patch
> floating around.  That's the _only_ recent 2.[34].x VM patch I've seen
> that really excited me (I haven't studied all the details of his patch,
> but I'm pretty sure its design is very similar, even if probably not
> identical, to what I'm trying to do).

Right, this is obviously needed for 2.5 (at least as an experimental
branch), but we simply can't do it in time for 2.4.  It's too big a
change.  If we get rid of swap_out, and do our reclaim based on
physical page lists, then suddenly a whole new class of problems
arises.  For example, our swap clustering relies on allocating
sequential swap addresses to sequentially scanned VM addresses, so
that clustered swapout and swapin work naturally.  Switch to
physically-ordered swapping and there's no longer any natural way of
getting the on-disk swap related to VA ordering, so that swapin
clustering breaks completely.  To fix this, you need the final swapout
to try to swap nearby pages in VA space at the same time.  It's a lot
of work to get it right.

> Then we'll need a page-to-pte_chain reverse lookup.

Right, and I think there are ways we can do this relatively cheaply.
Use the address_space's vma ring for shared pages, use the struct page
itself to encode the VA of the page for unshared anon pages, and keep
a separate hash of all shared anon ptes.

> Once we have that too, we can remove swap_out and do everything
> (except the dcache/icache things) in shrink_mmap

Right, but this is all completely orthogonal to the problems I was
talking about in my original email.  Those problems were to do with
things like write-throttling and managing free space, and did not
concern identifying which pages to throw out or how to age them.  The
latter is the territory of Rik's multi-queued code, or the new code
from Ludovic Fernandez which separates out page aging into a different
thread.

> So basically we'll have these completely different lists:
> 
> 	lru_swap_cache
> 	lru_cache
> 	lru_mapped
> 
> The three caches have completely different importance, implicit in
> the semantics of the memory they queue.

I think this is entirely the wrong way to be thinking about the
problem.  It seems to me to be much more important that we know:

1) What pages are unreferenced by the VM (except for page cache
references) and which can therefore be freed at a moment's notice;

2) What pages are queued for write;

3) What pages are referenced and in use for other reasons.

Completely unreferenced pages can be freed on a moment's notice.  If
we are careful with the spinlocks we can even free them from within an
interrupt.  

By measuring the throughput of these different page classes we can
work out what the VM pressure and write pressure is.  When we get a
write page fault, we can (for example) block until the write queue
comes down to a certain size, to obtain write flow control.

More importantly, the scanning of the dirty and in-use queues can go
on separately from the freeing of clean pages.  The more memory
pressure we are under --- ie. the faster we are gobbling unmapped
pages off the unreferenced queue --- the more rapidly we let the aging
thread walk the referenced pages and try to age pages onto the
unreferenced queue.

Cheers,
 Stephen

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-06 10:35     ` Andrea Arcangeli
  2000-07-06 13:29       ` Stephen C. Tweedie
@ 2000-07-06 13:54       ` Roman Zippel
  1 sibling, 0 replies; 42+ messages in thread
From: Roman Zippel @ 2000-07-06 13:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

Hi,

Andrea Arcangeli wrote:

> So basically we'll have these completely different lists:
> 
>         lru_swap_cache
>         lru_cache
>         lru_mapped
> 
> The three caches have completely different importance, implicit in
> the semantics of the memory they queue.  Shrinking the swap_cache
> first is vital for performance under swap, for example (and I can
> already do that in recent classzone patches).  Shrinking the
> lru_cache first is vital for performance under streaming I/O in the
> scenario where we are not low on freeable memory.

How do you want to synchronize and balance these caches?  Do you expect
that they are never used at the same time?  What happens with disk
blocks that end up in different caches?
IMO the problem gets worse if we want better direct I/O support,
especially on systems where the fs block size differs from the page
size.

bye, Roman

* Swap clustering with new VM
  2000-07-06 13:29       ` Stephen C. Tweedie
@ 2000-07-09 17:11         ` Marcelo Tosatti
  2000-07-09 20:53           ` Andrea Arcangeli
  2000-07-09 20:31         ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Andrea Arcangeli
  1 sibling, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2000-07-09 17:11 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Jens Axboe, Alan Cox, Linux Kernel, linux-mm,
	David S. Miller, Rik van Riel

On Thu, 6 Jul 2000, Stephen C. Tweedie wrote:

<snip> 

> For example, our swap clustering relies on allocating
> sequential swap addresses to sequentially scanned VM addresses, so
> that clustered swapout and swapin work naturally.  Switch to
> physically-ordered swapping and there's no longer any natural way of
> getting the on-disk swap related to VA ordering, so that swapin
> clustering breaks completely.  To fix this, you need the final swapout
> to try to swap nearby pages in VA space at the same time.  It's a lot
> of work to get it right.

AFAIK XFS's pagebuf structure contains a list of contiguous on-disk
buffers, so the filesystem can do I/O on a pagebuf structure, avoiding
disk seeks.

Do you plan to fix the swap clustering problem with a similar idea?



* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-06 13:29       ` Stephen C. Tweedie
  2000-07-09 17:11         ` Swap clustering with new VM Marcelo Tosatti
@ 2000-07-09 20:31         ` Andrea Arcangeli
  2000-07-11 11:50           ` Stephen C. Tweedie
  2000-07-11 17:32           ` Rik van Riel
  1 sibling, 2 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-09 20:31 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Marcelo Tosatti, Rik van Riel, Jens Axboe, Alan Cox,
	Derek Martin, Linux Kernel, linux-mm, David S. Miller

On Thu, 6 Jul 2000, Stephen C. Tweedie wrote:

>concern identifying which pages to throw out or how to age them.
>The latter is the territory of Rik's multi-queued code, or the new
>code from Ludovic Fernandez which separates out page aging into a
>different thread.

I don't think it's orthogonal (at least not completely).

>> So basically we'll have these completly different lists:
>> 
>> 	lru_swap_cache
>> 	lru_cache
>> 	lru_mapped
>> 
>> The three caches have completely different importance, implicit in
>> the semantics of the memory they queue.
>
>I think this is entirely the wrong way to be thinking about the
>problem.  It seems to me to be much more important that we know:

Think about what happens if we shrink lru_mapped first.  That would be
obviously wrong behaviour, and it proves we have to consider a priority
between the lists.

Note I'm not thinking of falling back to lru_mapped only when lru_cache
is empty; rather, something like freeing 3 times from lru_cache for
every 1 from lru_mapped could work.  The '3 times' should be a dynamic
variable that changes as a function of the pressure on the lru_cache.

Using an inactive lru to provide dynamic behaviour or to give mapped
pages a longer life looks like bloat; if anything, I think it's better
to do more aggressive aging within the lru_cache itself.

>1) What pages are unreferenced by the VM (except for page cache
>references) and which can therefore be freed at a moment's notice;

That's the lru_cache, and it's already implemented in the classzone
patch and working fine.  (It doesn't include any mapped pages; it only
includes unreferenced pages, unlike current 2.4.)

>2) What pages are queued for write;

I think it's better to have a global LRU_DIRTY (composed of all dirty
objects) and to let kupdate flush not only the old dirty buffers but
also the old dirty pages.  The pages have to be on the LRU_DIRTY and on
the lru_mapped or lru_cache at the same time, so LRU_DIRTY isn't really
a list in the same domain as lru_cache/lru_mapped.  And we'll need this
anyway for allocate-on-flush (we certainly don't want to enter the fs
in any way [except for accounting of the decreased available space, to
assure the flush will succeed] before the dirty pages become old).

>3) what pages are referenced and in use for other reasons.

That's the lru_mapped.  And to implement lru_mapped I will only need to
change the lru_cache_map/unmap macros of the classzone patch, since I
already have all the necessary hooks in place.

>Completely unreferenced pages can be freed on a moment's notice.  If

That's what I'm doing with lru_cache.

>we are careful with the spinlocks we can even free them from within an
>interrupt.  

That would require us to use an irq spinlock in shrink_mmap, and I'm
not sure that's a good idea.

Talking about irq spinlocks, I'd love to keep the pages under I/O out
of the lru too, but I can't do that trivially because I can't grab the
lru spinlock from the irq completion handler (since the lru's spinlock
isn't an irq spinlock).  To work around the irq spinlock issue (while
still being able to keep locked pages out of the lru), I thought of
splitting each list in two:

	lru_swap_cache
	lru_cache
	lru_mapped_cache

in:

	lru_swap_cache_irq
	lru_swap_cache

	lru_cache_irq
	lru_cache

	lru_mapped_cache_irq
	lru_mapped_cache

The lru_*_irq lists will be accessible _only_ with an irq spinlock held.

So when we want to swap out an anonymous page, instead of adding it to
the lru_swap_cache, we'll simply leave it off any lru list and start
the I/O, forgetting about it.  The I/O completion irq handler will then
insert the page into lru_swap_cache_irq and queue a tasklet for
execution.  This tasklet will simply grab the lru_swap_cache_irq
spinlock, extract the pages from that list, and put them onto the
lru_swap_cache (it can acquire the non-irq lru_swap_cache spinlock
because it won't run from irqs).  I guess I'll try this trick once the
stuff described in my previous email works.

>By measuring the throughput of these different page classes we can
>work out what the VM pressure and write pressure is.  When we get a

I agree.  The way I see it, the falling-back algorithm between the lrus
should be dynamic, and it should have some knowledge of the pressure
going on.

For the write pressure, the thing really should be in a separate
domain, namely the current BUF_DIRTY that we have now.

I think instead it should be the list that kupdate browses that should
also include the dirty pages (and the dirty pages can of course be in
the lru_mapped_cache at the same time, so from a VM point of view dirty
pages will stay in lru_cache and lru_mapped_cache, not on a dedicated
VM-lru list).

>write page fault, we can (for example) block until the write queue
>comes down to a certain size, to obtain write flow control.

Right, but we don't need a new page-lru for this; we simply need to
account for dirty pages in do_wp_page and do_no_page, extend the
current BUF_DIRTY list, and make kupdate work on them too.

I'm not sure whether somebody is abusing the missing (and I think also
broken) flush-old-buffers behaviour of MAP_SHARED segments to build a
kind of SHM memory with a sane interface; in that case the app should
be changed to use the new, sane shm_open interface, which is not that
different from the old trick after all.

>More importantly, the scanning of the dirty and in-use queues can go
>on separately from the freeing of clean pages.  The more memory

That's not what I planned to do for now.  I'd prefer to learn when it's
time to fall back between the lists rather than bloat things with an
additional list.  However, I may well change my mind over time.

Andrea


* Re: Swap clustering with new VM
  2000-07-09 17:11         ` Swap clustering with new VM Marcelo Tosatti
@ 2000-07-09 20:53           ` Andrea Arcangeli
  2000-07-11  9:36             ` Stephen C. Tweedie
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-09 20:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, Jens Axboe, Alan Cox, Linux Kernel, linux-mm,
	David S. Miller, Rik van Riel

On Sun, 9 Jul 2000, Marcelo Tosatti wrote:

>AFAIK XFS's pagebuf structure contains a list of contiguous on-disk
>buffers, so the filesystem can do IO on a pagebuf structure avoiding disk
>seek time.
>
>Do you plan to fix the swap clustering problem with a similar idea? 

I don't know pagebuf well enough to understand whether it can help.
However, I have a possible solution (not that it looks to me like there
are many other possible solutions, btw ;).

What worries me a bit is that whatever we do to improve swapin seeks,
it can always disagree with what the lru says has to be thrown away.

A dumb way to provide the current swapin-contiguous behaviour is to do
an unmap/swap-around of the pages pointed to by the pagetable slots
near the one that we found on the lru.

I guess we could leave a sysctl so that we can select between
swapin-optimized or lru-optimized behaviour at runtime, for handy
benchmarking.

Andrea


* Re: Swap clustering with new VM
  2000-07-09 20:53           ` Andrea Arcangeli
@ 2000-07-11  9:36             ` Stephen C. Tweedie
  0 siblings, 0 replies; 42+ messages in thread
From: Stephen C. Tweedie @ 2000-07-11  9:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Stephen C. Tweedie, Jens Axboe, Alan Cox,
	Linux Kernel, linux-mm, David S. Miller, Rik van Riel

Hi,

On Sun, Jul 09, 2000 at 10:53:35PM +0200, Andrea Arcangeli wrote:
> On Sun, 9 Jul 2000, Marcelo Tosatti wrote:
> 
> >AFAIK XFS's pagebuf structure contains a list of contiguous on-disk
> >buffers, so the filesystem can do IO on a pagebuf structure avoiding disk
> >seek time.
> >
> >Do you plan to fix the swap clustering problem with a similar idea? 
> 
> I don't know pagebuf well enough to understand whether it can help.

It can't --- not directly, at least --- but the underlying
kiobuf-based IO code can improve CPU efficiency for swap IO.

> What worries me a bit is that whatever we do to improve swapin seeks,
> it can always disagree with what the lru says has to be thrown away.

Sure, but disk seeks are so much more expensive than anything else
that you really want to minimise them at all costs.  In 2.2 we added
paging clustering, which performs extra IO to minimise seeks, at the
cost of potentially evicting too many useful pages from memory to make
room for clustered incoming pages which aren't guaranteed to be used.
It made things _enormously_ faster for paging.

> A dumb way to provide the current swapin-contiguous behaviour is to do
> an unmap/swap-around of the pages pointed to by the pagetable slots
> near the one that we found on the lru.

Ultimately we really need to be allocating swap pages based on VM
address, not on lru location, to get remotely good swap clustering.
The existing VM-based scanout achieves this cheaply as a side effect of
the scan order, but we need to realise that it isn't perfect and that
any move to a physical page-based scan algorithm will require us to
think about replacing the clustering mechanism.

Cheers,
 Stephen

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-09 20:31         ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Andrea Arcangeli
@ 2000-07-11 11:50           ` Stephen C. Tweedie
  2000-07-11 16:17             ` Andrea Arcangeli
  2000-07-11 17:32           ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2000-07-11 11:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

Hi,

On Sun, Jul 09, 2000 at 10:31:46PM +0200, Andrea Arcangeli wrote:
> 
> Think about what happens if we shrink lru_mapped first.

It's not supposed to work that way.

> Note I'm not thinking of falling back to lru_mapped only when
> lru_cache is empty; rather, something like freeing 3 times from
> lru_cache for every 1 from lru_mapped could work. The '3 times'
> should be a dynamic variable that changes as a function of the
> pressure on the lru_cache.

No, the mechanism would be that we only free pages from the scavenge
or cache lists.  The mapped list contains only pages which _can't_ be
freed.  The dynamic memory pressure is used to maintain a goal for the
number of pages in the cache list, and to achieve that goal, we
perform aging on the mapped list.  Pages which reach age zero can be
unmapped and added to the cache list, from where they can be
reclaimed.

In other words, the queues naturally assist us in breaking apart the
jobs of freeing pages and aging mappings.  

> I think it's better to have a global LRU_DIRTY (composed by any dirty
> object) and to let kupdate to flush not only the old dirty buffers, but
> also the old dirty pages.

We _must_ have separate dirty behaviour for dirty VM pages and for
writeback pages.  Think about a large simulation filling most of main
memory with dirty anon pages --- we don't want write throttling to
kick in and swap out all of that memory!  But for writeback data ---
data dirtied by the filesystem directly, not just by the VM --- we
definitely want to keep control of the amount of dirty memory.

Cheers,
 Stephen


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 11:50           ` Stephen C. Tweedie
@ 2000-07-11 16:17             ` Andrea Arcangeli
  2000-07-11 16:36               ` Juan J. Quintela
  2000-07-14  8:51               ` Stephen C. Tweedie
  0 siblings, 2 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-11 16:17 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Marcelo Tosatti, Rik van Riel, Jens Axboe, Alan Cox,
	Derek Martin, Linux Kernel, linux-mm, David S. Miller

On Tue, 11 Jul 2000, Stephen C. Tweedie wrote:

>Hi,
>
>On Sun, Jul 09, 2000 at 10:31:46PM +0200, Andrea Arcangeli wrote:
>> 
>> Think what happens if we shrink lru_mapped first.
>
>It's not supposed to work that way.

Think what happens if I shrink lru_mapped first: you could have 100 Mbytes
of clean and unmapped cache, and before shrinking it you would unmap `vi`.
So you would unmap/swap out vi from memory while you still have 100 Mbytes
of freeable cache. Isn't that broken? The other way around is much more
sane: with the other way around, at worst you will have zero fs cache and
you'll run just like DOS.

The point of this simple example is to show that the LRUs have different
priorities. These priorities will probably change as a function of the
workload, of course, but we can try to take care of that.

>> Note I'm not thinking to fallback into lru_mapped when lru_cache is empty,
>> but probably doing something like, free 3 times from lru_cache and 1 from
>> lru_mapped could work. The 3 times should be a dynamic variable that
>> changes in function of the pressure that we have in the lru_cache.
>
>No, the mechanism would be that we only free pages from the scavenge
>or cache lists.  The mapped list contains only pages which _can't_ be
>freed. [..]

We will be _able_ to free them on the fly instead. The only point of the
page2pte chain reverse lookup is to be able to free them on the fly and
nothing else.

>[..] The dynamic memory pressure is used to maintain a goal for the
>number of pages in the cache list, and to achieve that goal, we
>perform aging on the mapped list.  Pages which reach age zero can be
>unmapped and added to the cache list, from where they can be
>reclaimed.
>
>In other words, the queues naturally assist us in breaking apart the
>jobs of freeing pages and aging mappings.  

I see what you plan to do. The fact is that I'm not convinced it's
necessary, and I prefer to have a dynamic fallback algorithm between
caches that lets me avoid additional LRU lists and additional refiling
between LRUs. Also, I will be able to say when I have made progress,
because my progress will _always_ correspond to a page freed (so I'll
remove the non-robustness of the current swap_out completely).

>> I think it's better to have a global LRU_DIRTY (composed by any dirty
>> object) and to let kupdate to flush not only the old dirty buffers, but
>> also the old dirty pages.
>
>We _must_ have separate dirty behaviour for dirty VM pages and for
>writeback pages.  Think about a large simulation filling most of main
>memory with dirty anon pages --- we don't want write throttling to
>kick in and swap out all of that memory!  But for writeback data ---

Good point (I was always thinking about MAP_SHARED, but MAP_ANON is dirty
in the same way indeed). So I think as a first step I'll leave the dirty
pages in the lru_mapped LRU. With the locked-pages-out-of-the-lru trick
I could reinsert them at the bottom of the LRU (old pages) when the I/O
is completed, so that we could free them without rolling the LRU again.

Andrea


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 16:17             ` Andrea Arcangeli
@ 2000-07-11 16:36               ` Juan J. Quintela
  2000-07-11 17:33                 ` Andrea Arcangeli
  2000-07-14  8:51               ` Stephen C. Tweedie
  1 sibling, 1 reply; 42+ messages in thread
From: Juan J. Quintela @ 2000-07-11 16:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

>>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:

andrea> On Tue, 11 Jul 2000, Stephen C. Tweedie wrote:
>> Hi,
>> 
>> On Sun, Jul 09, 2000 at 10:31:46PM +0200, Andrea Arcangeli wrote:
>>> 
>>> Think what happens if we shrink lru_mapped first.
>> 
>> It's not supposed to work that way.

andrea> The object of this simple example is to show that the lrus have different
andrea> priorities. These priorities will probably change in function of the
andrea> workload of course but we can try to take care of that.

I agree with Stephen here, if my cache page is older than my mmaped vi
page, I want to unmap first the vi page.

andrea> I see what you plan to do. Fact is that I'm not convinced it's necessary
andrea> and I prefer to have a dynamic falling back algorithms between caches that
andrea> will avoid me to have additional lru lists and additional refile between
andrea> lrus. Also I will be able to say when I did progress because my progress
andrea> will _always_ correspond to a page freed (so I'll remove the unrobusteness
andrea> of the current swap_out completly).

Yes, but you have to find a _magic_ number for knowing when to free
from the mapped pages vs. the cache pages.  That number comes for free
with the inactive list implementation and is based on the actual
workload, i.e. we don't need to guess.

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-09 20:31         ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Andrea Arcangeli
  2000-07-11 11:50           ` Stephen C. Tweedie
@ 2000-07-11 17:32           ` Rik van Riel
  2000-07-11 17:41             ` Andrea Arcangeli
  1 sibling, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2000-07-11 17:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe, Alan Cox,
	Derek Martin, Linux Kernel, linux-mm, David S. Miller

On Sun, 9 Jul 2000, Andrea Arcangeli wrote:
> On Thu, 6 Jul 2000, Stephen C. Tweedie wrote:
> 
> >> So basically we'll have these completly different lists:
> >> 
> >> 	lru_swap_cache
> >> 	lru_cache
> >> 	lru_mapped
> >> 
> >> The three caches have completly different importance that is implicit by
> >> the semantics of the memory they are queuing.
> >
> >I think this is entirely the wrong way to be thinking about the
> >problem.  It seems to me to be much more important that we know:
> 
> Think what happens if we shrink lru_mapped first. That would be
> an obviously wrong behaviour and this proof we have to consider
> a priority between lists.

No. You just wrote down the strongest argument in favour of one
unified queue for all types of memory usage.

(insert QED here)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 16:36               ` Juan J. Quintela
@ 2000-07-11 17:33                 ` Andrea Arcangeli
  2000-07-11 17:45                   ` Rik van Riel
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-11 17:33 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

Hi Juan,

On 11 Jul 2000, Juan J. Quintela wrote:

>I agree with Stephen here, if my cache page is older than my mmaped vi
>page, I want to unmap first the vi page.

You said it the other way around ;) but never mind, I got your point
indeed.

With the logic "if my cache page is younger than my mmapped vi page, I
want to unmap the vi page first", then when you run:

	cp /dev/zero .

or also:

	find /usr/ -type f -exec cp {} /dev/null \;

(and also rsync of course)

you'll start hanging in gnus, while switching desktops, while switching
windows, while pressing a key in bash, and indeed also while pressing a
key in vi. For what? The cache got polluted because you only had
32 Mbytes of RAM, so the second run of the above command will cause
exactly the same hangs across all the tasks.

I don't think that is sane behaviour. I think the caches have very
different priorities due to the semantics of their objects.

And also consider the swap cache. When I add a page to the
lru_swap_cache for a swapout, it means that such a page is the least
interesting one in the whole VM; it is the page that we are not
interested in keeping in memory at all. It means we should throw it
away ASAP. The swap cache is only a locking entity that saves us from
allocating a static and slow swap lockmap for the swapouts. With the
current global LRU, to get rid of the least-interesting-of-all swap
cache we first have to throw away all the cache in the LRU, and that
hurts very much. That's one of the reasons classzone is more responsive
and delivers better performance under swap load: it knows it has to try
to throw away the swap cache first.

>andrea> I see what you plan to do. Fact is that I'm not convinced it's necessary
>andrea> and I prefer to have a dynamic falling back algorithms between caches that
>andrea> will avoid me to have additional lru lists and additional refile between
>andrea> lrus. Also I will be able to say when I did progress because my progress
>andrea> will _always_ correspond to a page freed (so I'll remove the unrobusteness
>andrea> of the current swap_out completly).
>
>Yes, but you have to find a _magic_ number for knowing when to free
>for the maped pages/cache pages.  That number comes for free with the
>inactive list implementation and is based in the actual workload,
>i.e. we don't need to guess.

Well, I'm pretty sure that with your design you'll end up needing a
magic number somewhere too, and it may well be more subtle than mine.
Also note that in some way I want that number to be dynamic. And of
course we have magic numbers in the current 2.[234].x as well.

Suppose you run out of lru_cache; then you start refiling the inactive
list, and then you'll have to choose how much to unmap from the active
list and put into the inactive list. How much stuff will you refile? 10
mapped pages, or 20, or 30? If you unmap only one page, then you could
as well move it directly to the lru_cache, dropping the inactive or
dirty list, right? ;)

Andrea


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 17:32           ` Rik van Riel
@ 2000-07-11 17:41             ` Andrea Arcangeli
  2000-07-11 17:47               ` Rik van Riel
                                 ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-11 17:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe, Alan Cox,
	Derek Martin, Linux Kernel, linux-mm, David S. Miller

On Tue, 11 Jul 2000, Rik van Riel wrote:

>No. You just wrote down the strongest argument in favour of one
>unified queue for all types of memory usage.

Do that and download a dozen ISO images over gigabit ethernet in the
background.

>(insert QED here)

What do you mean by "QED"?

Andrea


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 17:33                 ` Andrea Arcangeli
@ 2000-07-11 17:45                   ` Rik van Riel
  2000-07-11 17:54                     ` Andrea Arcangeli
  0 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2000-07-11 17:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Marcelo Tosatti,
	Jens Axboe, Alan Cox, Derek Martin, Linux Kernel, linux-mm,
	David S. Miller

On Tue, 11 Jul 2000, Andrea Arcangeli wrote:
> On 11 Jul 2000, Juan J. Quintela wrote:
> 
> >I agree with Stephen here, if my cache page is older than my mmaped vi
> >page, I want to unmap first the vi page.
> 
> You said it in the other way around ;) but never mind I got your point
> indeed.
> 
> With the logic "if my cache page is younger than my mmaped vi page, I want
> to unmap first the vi page" then when you'll run:
> 
> 	cp /dev/zero .
> 
> and you'll start hanging in gnus, while switching desktop, while
> switching window, while pressing a key in bash, and indeed also
> while pressing a key in vi. For what?

This is why LRU is wrong and we need page aging (which
approximates both LRU and NFU).

The idea is to remove from memory those pages which will
not be used again for the longest time, regardless of the
'state' in which they live in main memory.

(and proper page aging is a good approximation to this)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 17:41             ` Andrea Arcangeli
@ 2000-07-11 17:47               ` Rik van Riel
  2000-07-11 18:00                 ` Andrea Arcangeli
  2000-07-11 18:13               ` Juan J. Quintela
  2000-07-12 16:01               ` Kev
  2 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2000-07-11 17:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe, Alan Cox,
	Derek Martin, Linux Kernel, linux-mm, David S. Miller

On Tue, 11 Jul 2000, Andrea Arcangeli wrote:
> On Tue, 11 Jul 2000, Rik van Riel wrote:
> 
> >No. You just wrote down the strongest argument in favour of one
> >unified queue for all types of memory usage.
> 
> Do that and download an dozen of iso image with gigabit ethernet
> in background.

You need to forget about LRU for a moment. The fact that
LRU is fundamentally broken doesn't mean that it has
anything whatsoever to do with whether we age all pages
fairly or whether we prefer some pages over other pages.

If LRU is broken we need to fix that, a workaround like
your proposal doesn't fix anything in this case.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 17:45                   ` Rik van Riel
@ 2000-07-11 17:54                     ` Andrea Arcangeli
  2000-07-11 18:03                       ` Juan J. Quintela
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-11 17:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Marcelo Tosatti,
	Jens Axboe, Alan Cox, Derek Martin, Linux Kernel, linux-mm,
	David S. Miller

On Tue, 11 Jul 2000, Rik van Riel wrote:

>This is why LRU is wrong and we need page aging (which
>approximates both LRU and NFU).
>
>The idea is to remove those pages from memory which will
>not be used again for the longest time, regardless of in
>which 'state' they live in main memory.
>
>(and proper page aging is a good approximation to this)

It will still drop _all_ VM mappings from memory if you leave "cp
/dev/zero ." in the background for, say, 2 hours. This in turn means
that during streaming I/O you'll have _much_ more than the current
swapin/swapout troubles.

If I download a dozen CD images over gigabit ethernet I don't want
_anything_ to be unmapped from main RAM, and yes, I may have 8 gigabytes
of RAM, so I don't want to use O_DIRECT for the downloads.

Andrea


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 17:47               ` Rik van Riel
@ 2000-07-11 18:00                 ` Andrea Arcangeli
  2000-07-11 18:06                   ` Rik van Riel
  2000-07-14  9:01                   ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Stephen C. Tweedie
  0 siblings, 2 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-11 18:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe, Alan Cox,
	Derek Martin, Linux Kernel, linux-mm, David S. Miller

On Tue, 11 Jul 2000, Rik van Riel wrote:

>On Tue, 11 Jul 2000, Andrea Arcangeli wrote:
>> On Tue, 11 Jul 2000, Rik van Riel wrote:
>> 
>> >No. You just wrote down the strongest argument in favour of one
>> >unified queue for all types of memory usage.
>> 
>> Do that and download an dozen of iso image with gigabit ethernet
>> in background.
>
>You need to forget about LRU for a moment. The fact that
>LRU is fundamentally broken doesn't mean that it has
>anything whatsoever to do with whether we age all pages
>fairly or whether we prefer some pages over other pages.
>
>If LRU is broken we need to fix that, a workaround like
>your proposal doesn't fix anything in this case.

So tell me how, with your design, I can avoid the kernel unmapping
anything while running:

	cp /dev/zero .

forever.

Whatever aging algorithm you use, if you wait long enough the mapped
pages will be thrown away eventually.

If the above `cp` is able to throw away _everything_ eventually, that
is a major problem IMHO, and I don't agree with using a long-term
design that can't avoid such a common problem.

Andrea


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 17:54                     ` Andrea Arcangeli
@ 2000-07-11 18:03                       ` Juan J. Quintela
  2000-07-11 19:32                         ` Andrea Arcangeli
  0 siblings, 1 reply; 42+ messages in thread
From: Juan J. Quintela @ 2000-07-11 18:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

>>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:

andrea> On Tue, 11 Jul 2000, Rik van Riel wrote:
>> This is why LRU is wrong and we need page aging (which
>> approximates both LRU and NFU).
>> 
>> The idea is to remove those pages from memory which will
>> not be used again for the longest time, regardless of in
>> which 'state' they live in main memory.
>> 
>> (and proper page aging is a good approximation to this)

andrea> It will still drop _all_ VM mappings from memory if you left "cp /dev/zero
andrea> ." in background for say 2 hours. This in turn mean that during streming
andrea> I/O you'll have _much_ more than the current swapin/swapout troubles.

If a cp is running in the background and you don't touch your
vi/emacs/whatever pages for 2 hours (i.e. age = 0), then I think it is
ok for those pages to be swapped out.  Notice that the cache pages
will have the _initial age_ and the pages of the binaries will have an
_older_ age.

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 18:00                 ` Andrea Arcangeli
@ 2000-07-11 18:06                   ` Rik van Riel
  2000-07-17  7:09                     ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on Yannis Smaragdakis
  2000-07-14  9:01                   ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Stephen C. Tweedie
  1 sibling, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2000-07-11 18:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe, Alan Cox,
	Derek Martin, Linux Kernel, linux-mm, David S. Miller

On Tue, 11 Jul 2000, Andrea Arcangeli wrote:
> On Tue, 11 Jul 2000, Rik van Riel wrote:
> >On Tue, 11 Jul 2000, Andrea Arcangeli wrote:
> >> On Tue, 11 Jul 2000, Rik van Riel wrote:
> >> 
> >> >No. You just wrote down the strongest argument in favour of one
> >> >unified queue for all types of memory usage.
> >> 
> >> Do that and download an dozen of iso image with gigabit ethernet
> >> in background.
> >
> >You need to forget about LRU for a moment. The fact that
> >LRU is fundamentally broken doesn't mean that it has
> >anything whatsoever to do with whether we age all pages
> >fairly or whether we prefer some pages over other pages.
> >
> >If LRU is broken we need to fix that, a workaround like
> >your proposal doesn't fix anything in this case.
> 
> So tell me how with your design can I avoid the kernel to unmap anything
> while running:
> 
> 	cp /dev/zero .
> 
> forever.
> 
> Whatever aging algorithm you use if you wait enough time the
> mapped pages will be thrown away eventually.

And that is correct behaviour. The problem with LRU is that the
"eventually" is too short, but proper page aging is as close to
LFU (least _frequently_ used) as it is to LRU. In that case any
page which was used only once (or was last used a long time ago)
will be freed before a page which has been used more often
recently.

This effectively and efficiently protects things like X, xterm
and other things which are used over and over again, while still
swapping out things which are not used at all.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 17:41             ` Andrea Arcangeli
  2000-07-11 17:47               ` Rik van Riel
@ 2000-07-11 18:13               ` Juan J. Quintela
  2000-07-11 20:57                 ` Roger Larsson
  2000-07-12 16:01               ` Kev
  2 siblings, 1 reply; 42+ messages in thread
From: Juan J. Quintela @ 2000-07-11 18:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

>>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:

andrea> On Tue, 11 Jul 2000, Rik van Riel wrote:
>> No. You just wrote down the strongest argument in favour of one
>> unified queue for all types of memory usage.

andrea> Do that and download an dozen of iso image with gigabit ethernet in
andrea> background.

With gigabit ethernet, the pages that you are copying will never be
touched again, which means their age will never increase; the VM will
only remove pages from the cache that are younger or have gone a long
time without being used.  That looks quite ok to me.  Notice that the
fact that the pages came over gigabit ethernet makes no difference
compared with copying from any other medium; the only difference is
that you get them faster.

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 18:03                       ` Juan J. Quintela
@ 2000-07-11 19:32                         ` Andrea Arcangeli
  2000-07-12  0:05                           ` John Alvord
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-11 19:32 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Rik van Riel, Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

On 11 Jul 2000, Juan J. Quintela wrote:

>If you are copying in the background a cp and you don't touch your
>vi/emacs/whatever pages in 2 hours (i.e. age = 0) then I think that it
>is ok for that pages to be swaped out.  Notice that the cage pages
>will have _initial age_  and the pages of the binaries will have an
>_older_ age.

If we want to do that we can do that. My design doesn't forbid this; I
only avoid the overhead of the inactive list.

Also note that what I was really complaining about is treating the
lru_cache and lru_mapped lists equally. If you treat them equally you
get into trouble, as I pointed out. I just want to say that lru_mapped
has much higher priority than lru_cache. Whether you give the higher
priority with an aging factor, or I give it with a different fallback
behaviour, doesn't matter (with the difference that I avoid the
overhead of refiling between LRU lists, and I avoid rolling
ex-mapped pages through the lru_cache list just to decrease their age).

Andrea


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 18:13               ` Juan J. Quintela
@ 2000-07-11 20:57                 ` Roger Larsson
  2000-07-11 22:49                   ` Juan J. Quintela
  0 siblings, 1 reply; 42+ messages in thread
From: Roger Larsson @ 2000-07-11 20:57 UTC (permalink / raw)
  To: Juan J. Quintela, linux-kernel, linux-mm

"Juan J. Quintela" wrote:
> 
> >>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:
> 
> andrea> On Tue, 11 Jul 2000, Rik van Riel wrote:
> >> No. You just wrote down the strongest argument in favour of one
> >> unified queue for all types of memory usage.
> 
> andrea> Do that and download an dozen of iso image with gigabit ethernet in
> andrea> background.
> 
> With Gigabit etherenet, the pages that you are coping will never be
> touched again -> that means that its age will never will increase,
> that means that it will only remove pages from the cache that are
> younger/have been a lot of time without being used.  That looks quite
> ok to me.  Notice that the fact that the pages came from the Gigabit
> ethernet makes no diference that if you copy from other medium.  Only
> difference is that you will get them only faster.
> 

The problem is that you have to age all pages; at some point the newly
read pages will be older than the almost-never-reused ones.

Note: you cannot avoid aging all pages. Otherwise an easy attack would
be to reread some pages over and over... (they would never go away...)

Someone mentioned the 2Q algorithm earlier: pages had to prove
themselves to get added in the first place. Interesting approach.

/RogerL

--
Home page:
  http://www.norran.net/nra02596/

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 20:57                 ` Roger Larsson
@ 2000-07-11 22:49                   ` Juan J. Quintela
  0 siblings, 0 replies; 42+ messages in thread
From: Juan J. Quintela @ 2000-07-11 22:49 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel, linux-mm

>>>>> "roger" == Roger Larsson <roger.larsson@norran.net> writes:

Hi

roger> Problem is that you have to age all pages, at some point the newly read
roger> pages will be older than the almost never reused ones.

It is ok for the almost-never-reused pages to go to swap; they are
good candidates.  I.e., candidates to go to swap:
     1- unused pages
     2- almost unused pages

roger> Note: You can not avoid ageing all pages. If not an easy attack would be
roger> to reread some pages over and over... (they would never go away...)

I want to age all the pages.  That is not an attack: how would you
differentiate a program that legitimately touches its pages from that?
It is ok to do that.

Later, Juan.


-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 19:32                         ` Andrea Arcangeli
@ 2000-07-12  0:05                           ` John Alvord
  2000-07-12  0:52                             ` Andrea Arcangeli
  2000-07-12 18:02                             ` Rik van Riel
  0 siblings, 2 replies; 42+ messages in thread
From: John Alvord @ 2000-07-12  0:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Rik van Riel, Stephen C. Tweedie,
	Marcelo Tosatti, Jens Axboe, Alan Cox, Derek Martin,
	Linux Kernel, linux-mm, David S. Miller

On Tue, 11 Jul 2000 21:32:30 +0200 (CEST), Andrea Arcangeli
<andrea@suse.de> wrote:

>On 11 Jul 2000, Juan J. Quintela wrote:
>
>>If you are running a cp in the background and you don't touch your
>>vi/emacs/whatever pages in 2 hours (i.e. age = 0) then I think that it
>>is ok for those pages to be swapped out.  Notice that the cache pages
>>will have _initial age_  and the pages of the binaries will have an
>>_older_ age.
>
>If we want to do that we can do that. My design doesn't forbid this. I
>only avoid the overhead of the inactive list.
>
>Also note that what I was really complaining about is treating the
>lru_cache and lru_mapped lists equally. If you treat them equally you get
>into trouble as I pointed out. I just want to say that lru_mapped has much
>higher priority than lru_cache. Whether you give the higher priority with
>an aging factor, or I give higher priority with a different falling-back
>behaviour, it doesn't matter (with the difference that I avoid the overhead
>of refiling between lru lists and I avoid rolling ex-mapped pages in the
>lru_cache list just to decrease their age).

One question that puzzles me... cache for disk files and cache for
program data will have very different characteristics. Executable program
storage is typically more constant. Often disk files are read once and
thrown away, while program data is often reused. This isn't always true,
but it is very common.

My puzzle is how the MM system should balance between those three uses
of cache. Under pressure, it is very easy for disk file cache to
overwhelm program data and executable storage. And equally, program
data can overwhelm disk file cache storage.

If there is more than enough memory, no problem. When there is not
enough, what algorithm is used to achieve an effective balance of
usage?

Thanks,

John Alvord

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-12  0:05                           ` John Alvord
@ 2000-07-12  0:52                             ` Andrea Arcangeli
  2000-07-12 18:02                             ` Rik van Riel
  1 sibling, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2000-07-12  0:52 UTC (permalink / raw)
  To: John Alvord
  Cc: Juan J. Quintela, Rik van Riel, Stephen C. Tweedie,
	Marcelo Tosatti, Jens Axboe, Alan Cox, Derek Martin,
	Linux Kernel, linux-mm, David S. Miller

On Wed, 12 Jul 2000, John Alvord wrote:

>One question that puzzles me... cache for disk files and cache for
>program data will have very unlike characteristics. Executable program

Agreed. That is exactly what I'm trying to say by telling you that lru_cache
and lru_mapped_cache have different implicit priorities and we can't treat
them in the same way.

>enough, what algorithm is used to achieve an effective balance of
>usage?

In 2.[234].x we basically first try to shrink the cache for disk, and when
we run low on cache for disk (i.e. when we start to fail in shrinking it) we
fall back to shrinking the cache for programs. That's a sane algorithm, even
if currently it's not very smart in understanding when it's time to shrink
the cache for programs, and it's also not able to shrink it properly in
some cases.

Andrea


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 17:41             ` Andrea Arcangeli
  2000-07-11 17:47               ` Rik van Riel
  2000-07-11 18:13               ` Juan J. Quintela
@ 2000-07-12 16:01               ` Kev
  2 siblings, 0 replies; 42+ messages in thread
From: Kev @ 2000-07-12 16:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

> >(insert QED here)
> 
> What do you mean with "QED"?

"Quod Erat Demonstrandum," Latin for "which was to be demonstrated"; it's
used to indicate the end of mathematical proofs, among other things.
-- 
Kevin L. Mitchell <klmitch@mit.edu>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-12  0:05                           ` John Alvord
  2000-07-12  0:52                             ` Andrea Arcangeli
@ 2000-07-12 18:02                             ` Rik van Riel
  1 sibling, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2000-07-12 18:02 UTC (permalink / raw)
  To: John Alvord
  Cc: Andrea Arcangeli, Juan J. Quintela, Stephen C. Tweedie,
	Marcelo Tosatti, Jens Axboe, Alan Cox, Derek Martin,
	Linux Kernel, linux-mm, David S. Miller

On Wed, 12 Jul 2000, John Alvord wrote:

> One question that puzzles me... cache for disk files and cache
> for program data will have very unlike characteristics.
> Executable program storage is typically more constant. Often
> disk files are read once and throw away and program data is
> often reused. This isn't always true, but it is very common.

Page aging is the solution here. Doing proper page aging
allows us to make the distinction between use-once pages
and pages which are used over and over again.

And the best part is that we can do that without regard
for what type of cache a page happens to be in. We replace
pages based on observing their usage pattern and not on
some assumptions we make based on what is (should be) in
the page....

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 16:17             ` Andrea Arcangeli
  2000-07-11 16:36               ` Juan J. Quintela
@ 2000-07-14  8:51               ` Stephen C. Tweedie
  1 sibling, 0 replies; 42+ messages in thread
From: Stephen C. Tweedie @ 2000-07-14  8:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

Hi,

On Tue, Jul 11, 2000 at 06:17:25PM +0200, Andrea Arcangeli wrote:
> >It's not supposed to work that way.
> Think if I shrink the lru_mapped first: you could have 100mbyte of clean
> and unmapped cache and before shrinking it you would unmap `vi`. ...

Andrea:	"This case here is really bad."
Stephen:"But the proposed mechanism doesn't work that way."
Andrea:	"But if it did, this case here would be really bad."

Andrea, it doesn't work this way, so the case you are complaining
about does not happen!

These different lists are DIFFERENT LISTS.  They do different jobs.

Free pages are given out from the scavenge list.  That list can be
topped up from the inactive list.  _That_ list can be topped up from
the active list.

The active list is where we do page aging.  The inactive list is where
we do write-back of unmapped but dirty pages.  The scavenge list is a
list of clean, unmapped pages which can be reclaimed at any time.

So, we do page aging in the active list when the inactive list is too
short (and having these different lists means that, finally, we can
actually measure the amount of pressure on these different lists
quantitatively, so that we can --- for example --- set a goal of
having one second's worth of pages on each list given the current
demand for pages).  

We just don't shrink lru_mapped (or the active list or whatever you
want to call it) first.  We only shrink it on demand, and we don't
swap out pages when we do so.  
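
The refill flow described above can be sketched in user space roughly like
this (the struct layout, queue operations, and refill rules are all invented
for illustration; this is not the actual 2.4 VM code):

```c
/* User-space sketch of the three-queue refill flow described above.
 * All names (struct page, the queue layout, the refill rules) are
 * hypothetical -- this is not the actual kernel code. */
#include <assert.h>
#include <stddef.h>

struct page {
	int referenced, dirty, mapped;
	struct page *next;
};

struct queue {
	struct page *head;
	int count;
};

static struct page *pop(struct queue *q)
{
	struct page *p = q->head;
	if (p) {
		q->head = p->next;
		q->count--;
	}
	return p;
}

static void push(struct queue *q, struct page *p)
{
	p->next = q->head;
	q->head = p;
	q->count++;
}

/* The scavenge queue holds only clean, unmapped pages, so handing one
 * out needs no I/O.  It is topped up from the inactive queue (by
 * writing back dirty pages), which in turn is topped up from the
 * active queue (by aging and unmapping unreferenced pages). */
static struct page *get_free_page(struct queue *active,
				  struct queue *inactive,
				  struct queue *scavenge)
{
	while (!scavenge->head && inactive->head) {
		struct page *p = pop(inactive);
		p->dirty = 0;		/* pretend write-back completed */
		push(scavenge, p);
	}
	while (!scavenge->head && active->head) {
		struct page *p = pop(active);
		if (p->referenced) {	/* still in use: one more chance */
			p->referenced = 0;
			push(active, p);
			continue;
		}
		p->mapped = 0;		/* "unmap": age reached zero */
		push(p->dirty ? inactive : scavenge, p);
	}
	return pop(scavenge);
}
```

Note how the aging decision (active list) and the final reclaim (scavenge
list) never get confused: a clean, unmapped page on the scavenge queue is
always free to hand out, regardless of what the aging pass is doing.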

> The object of this simple example is to show that the lrus have different
> priorities. These priorities will probably change in function of the
> workload of course but we can try to take care of that.

More than that, the lrus have completely different functions.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2
  2000-07-11 18:00                 ` Andrea Arcangeli
  2000-07-11 18:06                   ` Rik van Riel
@ 2000-07-14  9:01                   ` Stephen C. Tweedie
  1 sibling, 0 replies; 42+ messages in thread
From: Stephen C. Tweedie @ 2000-07-14  9:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Stephen C. Tweedie, Marcelo Tosatti, Jens Axboe,
	Alan Cox, Derek Martin, Linux Kernel, linux-mm, David S. Miller

Hi,

On Tue, Jul 11, 2000 at 08:00:23PM +0200, Andrea Arcangeli wrote:
> 
> So tell me how with your design can I avoid the kernel to unmap anything
> while running:
> 
> 	cp /dev/zero .

Because the unmapped cache pages from the ./zero output file are on
the inactive queue.  We refill the scavenge queue from that inactive
queue.  Only once there are too few inactive pages will we do _any_
aging of active pages.

You are still left with the "cp /dev/zero ." command flushing other
unmapped pages out of cache, of course, but nothing mapped needs to
get unmapped.

Actually it's likely to be a bit more complex than this, because there
is memory pressure on the inactive queue in this case, so it _will_
grow initially until the write throttling achieves an equilibrium.
That's fine, because we really don't want a machine with zero memory
in cache to become unable to cache anything simply because it refuses
ever to swap out.

However, there's a second thing Rik and I were talking about relating
to this problem --- if there is constant memory pressure on the
inactive queue, we *want* to do a small amount of background page
aging.  Any mapped pages which are still in use, even if they are only
being used infrequently, should still end up being marked referenced.
Pages which are just not being touched can, and should, be swapped
eventually.

The trouble is that the kernel can't easily tell the difference
between the read of a cd image and the read of a compiler header file
just from the file access alone.  If we don't have enough cache to
cache the whole set of header files that the compiler is using, then
obviously (1) we never see multiple cache hits, because the data is
evicted from cache before the application requests it again; but also
(2) we _want_ to be able to increase the cache in this case, even at
the expense of swapping out other inactive processes.

The only way you can detect the "other inactive processes" (or at
least inactive pages) is to have background page aging going on while
you have memory pressure in the cache.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-11 18:06                   ` Rik van Riel
@ 2000-07-17  7:09                     ` Yannis Smaragdakis
  2000-07-17  9:28                       ` Stephen C. Tweedie
  2000-07-17 14:46                       ` Alan Cox
  0 siblings, 2 replies; 42+ messages in thread
From: Yannis Smaragdakis @ 2000-07-17  7:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Stephen C. Tweedie, Marcelo Tosatti,
	Jens Axboe, Alan Cox, Derek Martin, davem, linux-mm,
	Yannis Smaragdakis

On Tue, 11 Jul 2000, Rik van Riel wrote:
> And that is correct behaviour. The problem with LRU is that the
> "eventually" is too short, but proper page aging is as close to
> LFU (least _frequently_ used) as it is to LRU. In that case any
> page which was used only once (or was only used a long time ago)
> will be freed before a page which has been used more often
> recently will be.


I'm a Linux kernel newbie and perhaps I should keep my mouth shut, but
I have done a bit of work in memory management and I can't resist
putting my 2c in.


Although I agree with Rik in many major points, I disagree in that I
don't think that page aging should be frequency-based. Overall, I strongly
believe that frequency is the wrong thing to be measuring for deciding
which page to evict from RAM. The reason is that a page that is brought
to memory and touched 1000 times in relatively quick succession is *not*
more valuable than one that is brought to memory and only touched once. 
Both will cause exactly one page fault. Also, one should be cautious of
pages that are brought in RAM, touched many times, but then stay untouched
for a long time. Frequency should never outweigh recency--the latter is
a better predictor, as OS designers have found since the early 70s.


Having said that, LRU is certainly broken, but there are other ways to
fix it. I'll shamelessly plug a paper by myself, Scott Kaplan, and
Paul Wilson, from SIGMETRICS 99. It is in:
	http://www.cc.gatech.edu/~yannis/eelru.ps.gz
(Sorry for the PS, but it compresses well and the original is >3MB.)

I'll be glad to answer questions. The main idea is that we can keep
rough page "ages" (where "age" refers to recency) not only for pages
in RAM, but also for recently evicted pages. Then if we detect that
our overall eviction strategy is wrong (i.e., we touch lots of the
pages we recently evicted), we adapt it by evicting more recently 
touched pages (sounds hacky, but it is actually very clean).
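
The adaptive idea above might be sketched like this (the names, the history
size, and the trigger rule are all invented here; see the paper for the real
EELRU policy):

```c
/* Very rough sketch of the adaptation described above: remember which
 * pages were evicted recently, and if faults keep landing on exactly
 * those pages, flag that strict LRU eviction is misfiring so the
 * caller can start evicting more recently touched pages instead.
 * Names, sizes and the trigger rule are all invented. */
#include <assert.h>
#include <string.h>

#define GHOSTS 8	/* how many recent evictions to remember */

struct eelru_hint {
	int ghost[GHOSTS];	/* ids of recently evicted pages */
	int n;			/* slots in use */
	int next;		/* circular write position */
	int ghost_faults;	/* faults that re-touched an evicted page */
	int total_faults;
};

static void note_eviction(struct eelru_hint *h, int page)
{
	h->ghost[h->next] = page;
	h->next = (h->next + 1) % GHOSTS;
	if (h->n < GHOSTS)
		h->n++;
}

/* Record a fault; returns 1 when recent evictions look like mistakes,
 * i.e. LRU keeps evicting pages we are about to need again. */
static int note_fault(struct eelru_hint *h, int page)
{
	h->total_faults++;
	for (int i = 0; i < h->n; i++)
		if (h->ghost[i] == page)
			h->ghost_faults++;
	/* Made-up trigger: most recent faults hit just-evicted pages. */
	return h->total_faults >= 4 &&
	       2 * h->ghost_faults > h->total_faults;
}
```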

The results are very good (even better than in the paper, as we have
improved the algorithm since).


Back to my cave...
	Yannis.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17  7:09                     ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on Yannis Smaragdakis
@ 2000-07-17  9:28                       ` Stephen C. Tweedie
  2000-07-17 13:01                         ` James Manning
  2000-07-17 14:53                         ` Rik van Riel
  2000-07-17 14:46                       ` Alan Cox
  1 sibling, 2 replies; 42+ messages in thread
From: Stephen C. Tweedie @ 2000-07-17  9:28 UTC (permalink / raw)
  To: Yannis Smaragdakis
  Cc: Rik van Riel, Andrea Arcangeli, Stephen C. Tweedie,
	Marcelo Tosatti, Jens Axboe, Alan Cox, Derek Martin, davem,
	linux-mm

Hi,

On Mon, Jul 17, 2000 at 03:09:06AM -0400, Yannis Smaragdakis wrote:

> Although I agree with Rik in many major points, I disagree in that I
> don't think that page aging should be frequency-based. Overall, I strongly
> believe that frequency is the wrong thing to be measuring for deciding
> which page to evict from RAM. The reason is that a page that is brought
> to memory and touched 1000 times in relatively quick succession is *not*
> more valuable than one that is brought to memory and only touched once. 
> Both will cause exactly one page fault.

Not when you are swapping.  A page which is likely to be touched again
in the future will cause further page faults if we evict it.  A page
which isn't going to be touched again can be evicted without that
penalty.  The past behaviour is only useful in as much as it provides
a way of guessing future behaviour, and we want to make sure that we
evict those pages least likely to be touched again in the near future.
Access frequency *is* a credible way of assessing that, as there are
many common access patterns in which a large volume of data is
accessed exactly once --- LRU breaks down completely in that case, LFU
does not.

> Also, one should be cautious of
> pages that are brought in RAM, touched many times, but then stay untouched
> for a long time. Frequency should never outweigh recency--the latter is
> a better predictor, as OS designers have found since the early 70s.

No, they have not.  Look at the literature and you will see that OS
designers keep peppering their code with large numbers of special
cases to cope with the fact that LRU breaks down on large sequential
accesses.  FreeBSD, which uses an LFU-based design, needs no such
special cases.

> Having said that, LRU is certainly broken, but there are other ways to
> fix it.

Right.  LFU is just one way of fixing LRU.

--Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17  9:28                       ` Stephen C. Tweedie
@ 2000-07-17 13:01                         ` James Manning
  2000-07-17 14:32                           ` Scott F. Kaplan
  2000-07-17 14:53                         ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: James Manning @ 2000-07-17 13:01 UTC (permalink / raw)
  To: linux-mm

[Stephen C. Tweedie]
> > Having said that, LRU is certainly broken, but there are other ways to
> > fix it.
> 
> Right.  LFU is just one way of fixing LRU.

Just food for thought for anyone wanting to read up on other algorithms
and a decent explanation of the basic problem.

http://www.cs.wisc.edu/~solomon/cs537/paging.html
-- 
James Manning <jmm@computer.org>
GPG Key fingerprint = B913 2FBD 14A9 CE18 B2B7  9C8E A0BF B026 EEBB F6E4

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17 13:01                         ` James Manning
@ 2000-07-17 14:32                           ` Scott F. Kaplan
  0 siblings, 0 replies; 42+ messages in thread
From: Scott F. Kaplan @ 2000-07-17 14:32 UTC (permalink / raw)
  To: linux-mm

[Stephen C. Tweedie]
> > Having said that, LRU is certainly broken, but there are other ways to
> > fix it.
>
> Right.  LFU is just one way of fixing LRU.

I, too, am new to this mailing list, but since this comment was in
reference to the one made by Yannis, and I participated in the research
he mentioned, I'll chime in anyhow.

The problem is that LFU *doesn't* really fix LRU.  There are some cases
for which LRU performs as badly as possible.  (Imagine a 100 page memory
and a program that loops over 101 pages.)  In those cases, doing
*anything* that deviates from LRU will be an improvement; it's not much
of an accomplishment if LFU does well in this case, as RANDOM would be
an improvement as well.  Frequency isn't the right metric -- it just
allows for noise so that LFU can possibly do something different from
LRU.

There's lots of evidence that LFU can perform horribly, particularly
when the reference behavior changes (a.k.a. phase changes.)  Frequency
information doesn't reveal this change well, and the system can page
quite badly before the statistics come into line with the new behavior.

When LFU performs well, it's usually because of the skew in how often
recently used pages are re-used; that is, recently used pages *are* used
frequently.  It's when that association stops being true for a given set
of pages that a replacement policy must update its notion of the
program's behavior quickly.  LRU does so as quickly as the program can
touch some new pages.  LFU takes much longer.

LRU does the right thing in most cases.  With a little extra data, a
system can notice when LRU is doing the *wrong* thing, and only then
should non-LRU replacement be used.  At least, that's the basis of the
paper to which Yannis provided a reference.  I'll also throw out a
reference to my dissertation, which has a more thorough (and, I hope,
better written!) discussion of recency, its uses, and the failings of
frequency information.  So, for anyone interested,
<http://www.cs.amherst.edu/~sfkaplan/papers/sfkaplan-dissertation.ps.gz>.

Scott Kaplan
sfkaplan@cs.amherst.edu
http://www.cs.amherst.edu/~sfkaplan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17  7:09                     ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on Yannis Smaragdakis
  2000-07-17  9:28                       ` Stephen C. Tweedie
@ 2000-07-17 14:46                       ` Alan Cox
  2000-07-17 14:55                         ` Scott F. Kaplan
  1 sibling, 1 reply; 42+ messages in thread
From: Alan Cox @ 2000-07-17 14:46 UTC (permalink / raw)
  To: Yannis Smaragdakis
  Cc: Rik van Riel, Andrea Arcangeli, Stephen C. Tweedie,
	Marcelo Tosatti, Jens Axboe, Alan Cox, Derek Martin, davem,
	linux-mm

> Both will cause exactly one page fault. Also, one should be cautious of
> pages that are brought in RAM, touched many times, but then stay untouched
> for a long time. Frequency should never outweigh recency--the latter is
> a better predictor, as OS designers have found since the early 70s.

Modern OS designers are consistently seeing LFU work better. In our case
this is partly theory; in the FreeBSD case it's proven by trying it.

> pages we recently evicted), we adapt it by evicting more recently 
> touched pages (sounds hacky, but it is actually very clean).
> 
> The results are very good (even better than in the paper, as we have
> improved the algorithm since).

Interesting


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17  9:28                       ` Stephen C. Tweedie
  2000-07-17 13:01                         ` James Manning
@ 2000-07-17 14:53                         ` Rik van Riel
  2000-07-17 16:44                           ` Manfred Spraul
                                             ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Rik van Riel @ 2000-07-17 14:53 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Yannis Smaragdakis, Andrea Arcangeli, Marcelo Tosatti,
	Jens Axboe, Alan Cox, Derek Martin, davem, linux-mm

On Mon, 17 Jul 2000, Stephen C. Tweedie wrote:
> On Mon, Jul 17, 2000 at 03:09:06AM -0400, Yannis Smaragdakis wrote:
> 
> > Although I agree with Rik in many major points, I disagree in that I
> > don't think that page aging should be frequency-based. Overall, I strongly
> > believe that frequency is the wrong thing to be measuring for deciding
> > which page to evict from RAM. The reason is that a page that is brought
> > to memory and touched 1000 times in relatively quick succession is *not*
> > more valuable than one that is brought to memory and only touched once. 
> > Both will cause exactly one page fault.
> 
> Not when you are swapping.  A page which is likely to be touched
> again in the future will cause further page faults if we evict
> it.  A page which isn't going to be touched again can be evicted
> without that penalty.  The past behaviour is only useful in as
> much as it provides a way of guessing future behaviour, and we
> want to make sure that we evict those pages least likely to be
> touched again in the near future. Access frequency *is* a
> credible way of assessing that, as there are many common access
> patterns in which a large volume of data is accessed exactly
> once --- LRU breaks down completely in that case, LFU does not.

*nod*

LFU works great in preventing typical LRU breakdown in some
common situations, but pure page aging isn't enough either...

> > Also, one should be cautious of
> > pages that are brought in RAM, touched many times, but then stay untouched
> > for a long time. Frequency should never outweigh recency--the latter is
> > a better predictor, as OS designers have found since the early 70s.
> 
> No, they have not.  Look at the literature and you will see that OS
> designers keep peppering their code with large numbers of special
> cases to cope with the fact that LRU breaks down on large sequential
> accesses.  FreeBSD, which uses an LFU-based design, needs no such
> special cases.

Actually, FreeBSD has a special case in the page fault code
for sequential accesses and I believe we must have that too.

Both LRU and LFU break down on linear accesses to an array
that doesn't fit in memory. In that case you really want
MRU replacement, with some simple code that "detects the
window size" you need to keep in memory. This seems to be
the only way to get any speedup on such programs when you
increase memory size to something which is still smaller
than the total program size.
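
A toy simulation illustrates the breakdown: with 100 page frames and a loop
over 101 pages, LRU always evicts exactly the page that is about to be
touched next and gets zero hits after warm-up, while MRU keeps a stable
window resident. This is purely illustrative user-space code implementing
exact LRU and exact MRU over per-page timestamps, not kernel code:

```c
/* Toy replacement simulator: FRAMES page frames, a program looping
 * sequentially over PAGES pages (PAGES > FRAMES). */
#include <assert.h>
#include <string.h>

#define FRAMES	100
#define PAGES	101

/* Loop `rounds` times over PAGES pages; return the number of hits.
 * mru != 0 evicts the most recently used resident page, else the
 * least recently used one. */
static int simulate(int mru, int rounds)
{
	int resident[PAGES];
	long last_use[PAGES];
	int used = 0, hits = 0;
	long clock = 0;

	memset(resident, 0, sizeof(resident));
	memset(last_use, 0, sizeof(last_use));
	for (int r = 0; r < rounds; r++) {
		for (int p = 0; p < PAGES; p++) {
			clock++;
			if (resident[p]) {
				hits++;
			} else if (used < FRAMES) {
				resident[p] = 1;	/* free frame left */
				used++;
			} else {
				/* pick a victim by timestamp */
				int victim = -1;
				for (int q = 0; q < PAGES; q++) {
					if (!resident[q])
						continue;
					if (victim < 0 ||
					    (mru ? last_use[q] > last_use[victim]
						 : last_use[q] < last_use[victim]))
						victim = q;
				}
				resident[victim] = 0;
				resident[p] = 1;
			}
			last_use[p] = clock;
		}
	}
	return hits;
}
```

After the first (all-miss) round, LRU never hits again, while MRU hits
roughly FRAMES times per round: any deviation from LRU helps here.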

> > Having said that, LRU is certainly broken, but there are other ways to
> > fix it.
> 
> Right.  LFU is just one way of fixing LRU.

Since *both* recency and frequency are important, we can
simply use an algorithm which takes both into account.
Page aging nicely fits the bill here.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17 14:46                       ` Alan Cox
@ 2000-07-17 14:55                         ` Scott F. Kaplan
  2000-07-17 15:31                           ` Rik van Riel
  0 siblings, 1 reply; 42+ messages in thread
From: Scott F. Kaplan @ 2000-07-17 14:55 UTC (permalink / raw)
  To: linux-mm

Alan Cox wrote:
> Modern OS designers are consistently seeing LFU work better. In our case
> this is partly theory; in the FreeBSD case it's proven by trying it.

Have any of the FreeBSD people compiled some results to this effect? 
I'd be interested to see under what circumstances LFU works better, and
just what approximations of both LRU and LFU are being used.  There
could be something interesting in such results, as years of other
experiments have shown otherwise.

Scott Kaplan
sfkaplan@cs.amherst.edu
http://www.cs.amherst.edu/~sfkaplan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17 14:55                         ` Scott F. Kaplan
@ 2000-07-17 15:31                           ` Rik van Riel
  0 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2000-07-17 15:31 UTC (permalink / raw)
  To: Scott F. Kaplan; +Cc: linux-mm

On Mon, 17 Jul 2000, Scott F. Kaplan wrote:
> Alan Cox wrote:
> > Modern OS designers are consistently seeing LFU work better. In our case
> > this is partly theory; in the FreeBSD case it's proven by trying it.
> 
> Have any of the FreeBSD people compiled some results to this
> effect?  I'd be interested to see under what circumstances LFU
> works better,

Say you're in the situation where 1/2 of your memory is
memory used by programs, memory which is used over and
over again.

The other half of your memory is used to cache the
multimedia data you're streaming or the files you're
exporting over NFS. This is mostly use-once memory.

If a sudden burst of IO occurs, LRU would evict memory
from the programs, memory which will be used again soon.
LFU, on the other hand, correctly evicts memory from the
cache ... especially the memory which was used only once.

> and just what approximations of both LRU and LFU are being used.  

Page aging. Basically the pages in memory are scanned periodically
(with the period being driven by memory pressure); if a page was
referenced since the last scan, its age/act_count is increased,
otherwise its age/act_count is decreased. Pages are deactivated
(moved to the inactive list) when the age/act_count reaches 0.

if (test_and_clear_referenced(page)) {
	page->age += PG_AGE_ADV;
	if (page->age > PG_AGE_MAX)
		page->age = PG_AGE_MAX;
} else {
	page->age -= min(page->age, PG_AGE_DECL);
	if (page->age == 0)
		deactivate_page(page);
}

This is a nice approximation of LRU and LFU, one which comes
pretty close to LFU because of the linear decline. If we were
to use

	page->age /= 2;

as the page age decrement instead, we'd probably be closer to LRU.

It would be worth it to experiment a bit to see which of these
will work best.
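
The difference between the two decay policies is easy to see in a tiny
model (the PG_AGE_* constants here are made-up values for illustration):
a page referenced ten times builds up age 30, which the linear decline
forgets only after 30 idle scans, while the halving rule forgets it after 5.

```c
/* Tiny model of the aging step above.  PG_AGE_ADV/MAX/DECL are
 * made-up values; linear != 0 selects the linear decline, and
 * linear == 0 selects the "page->age /= 2" variant. */
#include <assert.h>

#define PG_AGE_ADV	3
#define PG_AGE_MAX	64
#define PG_AGE_DECL	1

static int age_page(int age, int referenced, int linear)
{
	if (referenced) {
		age += PG_AGE_ADV;
		if (age > PG_AGE_MAX)
			age = PG_AGE_MAX;
	} else if (linear) {
		age -= age < PG_AGE_DECL ? age : PG_AGE_DECL;
	} else {
		age /= 2;	/* exponential decline: forgets frequency fast */
	}
	return age;
}
```

This is why the linear decline behaves more like LFU: the accumulated
frequency information survives many scan intervals, whereas halving throws
it away almost immediately, leaving mostly recency information.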

> There could be something interesting in such results, as years
> of other experiments have shown otherwise.

I wonder if the speed differences between CPU, memory and hard disk
have changed over the years .. ;)

(or if system loads have changed a bit ... nowadays the working
set of processes usually fits in memory but there is a lot of
streaming IO going on in the background)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17 14:53                         ` Rik van Riel
@ 2000-07-17 16:44                           ` Manfred Spraul
  2000-07-17 17:02                             ` Rik van Riel
  2000-07-17 18:55                           ` Yannis Smaragdakis
  2000-07-17 19:57                           ` John Fremlin
  2 siblings, 1 reply; 42+ messages in thread
From: Manfred Spraul @ 2000-07-17 16:44 UTC (permalink / raw)
  To: Rik van Riel, linux-mm

Rik van Riel wrote:
> 
> 
> Actually, FreeBSD has a special case in the page fault code
> for sequential accesses and I believe we must have that too.
> 

Where is that code? I found a sysctl parameter vm_pageout_algorithm_lru,
but nothing else.

> Both LRU and LFU break down on linear accesses to an array
> that doesn't fit in memory. In that case you really want
> MRU replacement, with some simple code that "detects the
> window size" you need to keep in memory. This seems to be
> the only way to get any speedup on such programs when you
> increase memory size to something which is still smaller
> than the total program size.
> 

Do you have an idea how to detect that situation?

--
	Manfred

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17 16:44                           ` Manfred Spraul
@ 2000-07-17 17:02                             ` Rik van Riel
  0 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2000-07-17 17:02 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-mm

On Mon, 17 Jul 2000, Manfred Spraul wrote:
> Rik van Riel wrote:
> > 
> > Actually, FreeBSD has a special case in the page fault code
> > for sequential accesses and I believe we must have that too.
> 
> Where is that code?

It's in vm_fault.c, look for the readaround code.

> > Both LRU and LFU break down on linear accesses to an array
> > that doesn't fit in memory. In that case you really want
> > MRU replacement, with some simple code that "detects the
> > window size" you need to keep in memory. This seems to be
> > the only way to get any speedup on such programs when you
> > increase memory size to something which is still smaller
> > than the total program size.
> 
> Do you have an idea how to detect that situation?

I've got some ideas, but they need to be polished a bit
before I can put them into code. I'll probably do this
at OLS...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/



* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17 14:53                         ` Rik van Riel
  2000-07-17 16:44                           ` Manfred Spraul
@ 2000-07-17 18:55                           ` Yannis Smaragdakis
  2000-07-17 19:57                           ` John Fremlin
  2 siblings, 0 replies; 42+ messages in thread
From: Yannis Smaragdakis @ 2000-07-17 18:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: sct, andrea, marcelo, axboe, alan, derek, Yannis Smaragdakis,
	davem, linux-mm

Unfortunately, it sounded like I was arguing in favor of LRU, while
I was not. Also, I agree that a good algorithm should never swap out
program pages in favor of transient data. But I think it is 
overgeneralizing to go from "often pages are accessed only *once*"
to "frequency is good". The problem with frequency is that it's
very sensitive to phase behavior and may keep old pages around for
too long, just because they were accessed often some time ago.


Rik wrote:
> Both LRU and LFU break down on linear accesses to an array
> that doesn't fit in memory. In that case you really want
> MRU replacement, with some simple code that "detects the
> window size" you need to keep in memory. This seems to be

I agree and this is partly the point in our paper, only we argue that
this strategy can be generalized cleanly (instead of being a special
case hack).


> Since *both* recency and frequency are important, we can
> simply use an algorithm which takes both into account.
> Page aging nicely fits the bill here.

Proposal:
Why not define "frequency" as "references over *normalized* time"
instead of "references over time"? If you touch a page twice
and in the meantime you have touched a million other pages,
this is important. If you touch a page twice and
in the meantime you have only touched one other page, this should not
affect "page age". In short, the way the page's age is updated should
be a function of how many other pages were found to be recently
referenced.

Say you call the code that reads/resets the reference bits and you
find that n pages were referenced in total. Then each of those
gets its age incremented by a factor proportional to n. For efficiency,
one could use the "n" that was computed during the last scan.
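A minimal user-space sketch of this update rule, assuming a simple array of pages with software reference bits (in reality the bits live in the page tables, and the `prev_n / 64` scaling is just one arbitrary choice of "factor proportional to n"):

```c
#include <assert.h>
#include <stddef.h>

struct page_s {
    unsigned int referenced;  /* reference bit, read and cleared per scan */
    unsigned int age;
};

/* One aging pass: referenced pages get an age credit scaled by prev_n,
 * the number of pages found referenced on the *previous* scan (the
 * efficiency trick suggested above); unreferenced pages decay.
 * Returns this scan's n, to be fed back in on the next call. */
size_t age_pages(struct page_s *pages, size_t npages, size_t prev_n)
{
    size_t n = 0;
    for (size_t i = 0; i < npages; i++) {
        if (pages[i].referenced) {
            pages[i].age += 1 + prev_n / 64;  /* credit scaled by activity */
            pages[i].referenced = 0;
            n++;
        } else if (pages[i].age > 0) {
            pages[i].age /= 2;                /* exponential decay */
        }
    }
    return n;
}
```

With this rule, two references amid heavy global activity (large n) earn a page much more age than two references while the system was otherwise idle, which is exactly the "normalized time" behaviour proposed above.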


I think that this would get the effect you want and would alleviate
my concerns about "frequency".
	Yannis.


* Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on
  2000-07-17 14:53                         ` Rik van Riel
  2000-07-17 16:44                           ` Manfred Spraul
  2000-07-17 18:55                           ` Yannis Smaragdakis
@ 2000-07-17 19:57                           ` John Fremlin
  2 siblings, 0 replies; 42+ messages in thread
From: John Fremlin @ 2000-07-17 19:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Yannis Smaragdakis, Andrea Arcangeli,
	Marcelo Tosatti, Jens Axboe, Alan Cox, Derek Martin, davem,
	linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

[...]

> Both LRU and LFU break down on linear accesses to an array
> that doesn't fit in memory. In that case you really want
> MRU replacement, with some simple code that "detects the
> window size" you need to keep in memory. This seems to be
> the only way to get any speedup on such programs when you
> increase memory size to something which is still smaller
> than the total program size.

I think that a generational garbage-collection-like scheme might work
well here (it would also obviate the need for the "simple code").

In more detail: you keep a bunch of (possibly just 2) lists of pages
(generations). Every time you want pages you search the younger lists
first; if a page is still being used (how to measure this? -- if it's
faulted back quickly off the scavenge list?) it gets promoted to
the next generation and scanned less often. This could be tuned to
deal well with the pathological streaming IO case (i.e. even the app
doing the IO doesn't suffer at all), I don't know how well in general.
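A minimal sketch of the two-generation case described above (plain arrays stand in for the kernel's page lists, and all names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

enum gen { YOUNG, OLD, EVICTED };

struct gpage {
    unsigned int referenced;
    enum gen     gen;
};

/* Scan every page currently in generation g: a referenced page is
 * promoted to the old generation (which would be scanned less often);
 * an unreferenced one becomes a reclaim candidate.
 * Returns the number of pages evicted. */
size_t scan_generation(struct gpage *pages, size_t n, enum gen g)
{
    size_t evicted = 0;
    for (size_t i = 0; i < n; i++) {
        if (pages[i].gen != g)
            continue;
        if (pages[i].referenced) {
            pages[i].referenced = 0;
            pages[i].gen = OLD;      /* promote: survives another round */
        } else {
            pages[i].gen = EVICTED;  /* reclaim candidate */
            evicted++;
        }
    }
    return evicted;
}
```

Streaming-I/O pages are touched once and never re-referenced, so they fall off the young list without ever entering the old generation, which is what makes the scheme attractive for the pathological streaming case.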

I hope this stuff isn't just reiterating the obvious (alternatively I
hope it isn't too obvious I don't know what I'm talking about ;-) ).

-- 

	http://web.onetel.net.uk/~elephant/john


end of thread, other threads:[~2000-07-17 19:57 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20000629114407.A3914@redhat.com>
     [not found] ` <Pine.LNX.4.21.0006291330520.1713-100000@inspiron.random>
2000-06-29 13:00   ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Stephen C. Tweedie
2000-07-06 10:35     ` Andrea Arcangeli
2000-07-06 13:29       ` Stephen C. Tweedie
2000-07-09 17:11         ` Swap clustering with new VM Marcelo Tosatti
2000-07-09 20:53           ` Andrea Arcangeli
2000-07-11  9:36             ` Stephen C. Tweedie
2000-07-09 20:31         ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Andrea Arcangeli
2000-07-11 11:50           ` Stephen C. Tweedie
2000-07-11 16:17             ` Andrea Arcangeli
2000-07-11 16:36               ` Juan J. Quintela
2000-07-11 17:33                 ` Andrea Arcangeli
2000-07-11 17:45                   ` Rik van Riel
2000-07-11 17:54                     ` Andrea Arcangeli
2000-07-11 18:03                       ` Juan J. Quintela
2000-07-11 19:32                         ` Andrea Arcangeli
2000-07-12  0:05                           ` John Alvord
2000-07-12  0:52                             ` Andrea Arcangeli
2000-07-12 18:02                             ` Rik van Riel
2000-07-14  8:51               ` Stephen C. Tweedie
2000-07-11 17:32           ` Rik van Riel
2000-07-11 17:41             ` Andrea Arcangeli
2000-07-11 17:47               ` Rik van Riel
2000-07-11 18:00                 ` Andrea Arcangeli
2000-07-11 18:06                   ` Rik van Riel
2000-07-17  7:09                     ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on Yannis Smaragdakis
2000-07-17  9:28                       ` Stephen C. Tweedie
2000-07-17 13:01                         ` James Manning
2000-07-17 14:32                           ` Scott F. Kaplan
2000-07-17 14:53                         ` Rik van Riel
2000-07-17 16:44                           ` Manfred Spraul
2000-07-17 17:02                             ` Rik van Riel
2000-07-17 18:55                           ` Yannis Smaragdakis
2000-07-17 19:57                           ` John Fremlin
2000-07-17 14:46                       ` Alan Cox
2000-07-17 14:55                         ` Scott F. Kaplan
2000-07-17 15:31                           ` Rik van Riel
2000-07-14  9:01                   ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2 Stephen C. Tweedie
2000-07-11 18:13               ` Juan J. Quintela
2000-07-11 20:57                 ` Roger Larsson
2000-07-11 22:49                   ` Juan J. Quintela
2000-07-12 16:01               ` Kev
2000-07-06 13:54       ` [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on2.4.0-test2 Roman Zippel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox