* PTE chaining, kswapd and swapin readahead
@ 1998-06-16 22:10 Rik van Riel
  1998-06-17  9:24 ` Eric W. Biederman
From: Rik van Riel @ 1998-06-16 22:10 UTC (permalink / raw)
  To: Linux MM

Hi,

In the PTE chaining discussion/patches a while ago, I saw
that kswapd was changed so that it scans memory
in physical order instead of walking the pagetables.

This has the advantage of deallocating memory in physically
adjacent chunks, which will be nice while we still have the
primitive buddy allocator we're using now.

However, it will be a major performance bottleneck when we
get around to implementing the zone allocator and swapin
readahead. This is because we don't need physical deallocation
with the zone allocator, and because swapin readahead is just
an awful lot faster when the pages are contiguous in swap.

I write this to let the PTE people (Stephen and Ben) know
that they probably shouldn't remove the pagetable walking
routines from kswapd...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: PTE chaining, kswapd and swapin readahead
  1998-06-16 22:10 PTE chaining, kswapd and swapin readahead Rik van Riel
@ 1998-06-17  9:24 ` Eric W. Biederman
  1998-06-17 16:03   ` Rik van Riel
From: Eric W. Biederman @ 1998-06-17  9:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linux MM

>>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:

RR> Hi,
RR> In the PTE chaining discussion/patches a while ago, I saw
RR> that kswapd was changed so that it scans memory
RR> in physical order instead of walking the pagetables.

RR> This has the advantage of deallocating memory in physically
RR> adjacent chunks, which will be nice while we still have the
RR> primitive buddy allocator we're using now.

It also has the advantage that shared pages are only scanned once,
and that empty address space needn't be scanned.

RR> However, it will be a major performance bottleneck when we
RR> get around to implementing the zone allocator and swapin
RR> readahead. This is because we don't need physical deallocation
RR> with the zone allocator, and because swapin readahead is just
RR> an awful lot faster when the pages are contiguous in swap.

Just what is your zone allocator?  I have a few ideas based on the
name, but they don't seem to jibe with your description.
The part about not needing physically contiguous memory is really
puzzling.

RR> I write this to let the PTE people (Stephen and Ben) know
RR> that they probably shouldn't remove the pagetable walking
RR> routines from kswapd...

If we get around to using a true LRU algorithm, we aren't too likely
to swap out address-space-adjacent pages...  Though I can see the
advantage for pages of the same age.

Also, for swapin readahead the only effective strategy I know of is to
implement a kernel system call that says I'm going to be accessing
this chunk of my address space soon.  The clustering people have
already implemented a system call of this nature for their own use.
It would probably be a good idea to do something similar...

Eric


* Re: PTE chaining, kswapd and swapin readahead
  1998-06-17  9:24 ` Eric W. Biederman
@ 1998-06-17 16:03   ` Rik van Riel
  1998-06-18  4:06     ` Eric W. Biederman
From: Rik van Riel @ 1998-06-17 16:03 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux MM

On 17 Jun 1998, Eric W. Biederman wrote:
> >>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:
> 
> RR> This has the advantage of deallocating memory in physically
> RR> adjacent chunks, which will be nice while we still have the
> RR> primitive buddy allocator we're using now.
> 
> Also it has the advantage that shared pages are only scanned once, and
> empty address space needn't be scanned.

OK, this is a _very_ big advantage which I overlooked...

> Just what is your zone allocator?  I have a few ideas based on the
> name, but they don't seem to jibe with your description.
> The part about not needing physically contiguous memory is really
> puzzling.

Well, the idea is to divide memory into different areas
(of 32 to 256 pages in size, depending on the amount of
main memory) for different uses.
There are 3 different uses:
- process pages, buffers and page cache
- pagetables and small (order 0, 1 and maybe 2) SLAB areas
- large SLAB allocations (order 2, 3, 4 and 5)
On large memory machines (>=128M) we might even split the
SLAB areas into 3 types...

Allocation is always done in the fullest area. We keep
track of this by hashing the areas into doubly linked
lists, using perhaps 8 different degrees of 'fullness'.
When an area gets fuller than its queue is meant to be,
it 'promotes' one level up and is added to the _tail_ of
the queue above. When an area gets emptier than its
queue is supposed to be, it gets added to the _head_
of the queue below.

This way, the emptier areas get emptier and the fullest
areas get fuller, so we can force-free an area
(with PTE chaining) when we're short of memory.
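
To make the queue mechanics concrete, here's a rough userspace
sketch (the names, the 8 levels and the 64-page area size are just
illustrative choices within the above, it's not code from any patch):

#include <stdio.h>

#define FULLNESS_LEVELS	8
#define PAGES_PER_AREA	64	/* anywhere from 32 to 256 in the proposal */

struct area {
	struct area *next, *prev;
	int pages_used;			/* 0 .. PAGES_PER_AREA */
	int level;			/* which fullness queue we are on */
};

/* One dummy head per fullness level; each queue is a circular list. */
static struct area queue[FULLNESS_LEVELS];

static void queues_init(void)
{
	int i;

	for (i = 0; i < FULLNESS_LEVELS; i++)
		queue[i].next = queue[i].prev = &queue[i];
}

static void area_unlink(struct area *a)
{
	a->prev->next = a->next;
	a->next->prev = a->prev;
}

static void area_add_tail(struct area *a, struct area *head)
{
	a->prev = head->prev;
	a->next = head;
	head->prev->next = a;
	head->prev = a;
}

static void area_add_head(struct area *a, struct area *head)
{
	a->next = head->next;
	a->prev = head;
	head->next->prev = a;
	head->next = a;
}

static int level_of(int pages_used)
{
	return pages_used * FULLNESS_LEVELS / (PAGES_PER_AREA + 1);
}

/*
 * Re-queue an area after pages were allocated or freed in it.
 * A fuller area promotes to the _tail_ of the queue above, an
 * emptier one demotes to the _head_ of the queue below, so an
 * allocator that always takes from the highest non-empty queue
 * keeps filling the fullest areas while the emptiest ones drain.
 */
static void area_update(struct area *a)
{
	int level = level_of(a->pages_used);

	if (level == a->level)
		return;
	area_unlink(a);
	if (level > a->level)
		area_add_tail(a, &queue[level]);
	else
		area_add_head(a, &queue[level]);
	a->level = level;
}

int main(void)
{
	struct area a;

	queues_init();
	a.pages_used = 0;
	a.level = 0;
	area_add_tail(&a, &queue[0]);
	for (a.pages_used = 0; a.pages_used <= PAGES_PER_AREA;
	     a.pages_used += 16) {
		area_update(&a);
		printf("%2d pages used -> fullness queue %d\n",
		       a.pages_used, a.level);
	}
	return 0;
}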

Inside the user areas, we can simply use a linked list
to mark free pages. Alternatively, we can keep the
bookkeeping in a separate area of memory. This has
the advantage that we don't have to reread a page when
it's needed shortly after we swapped it out. In that
case we can simply use a bitmap and a slightly optimized
function.
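
A tiny sketch of that bitmap variant (32 pages here so one
unsigned long holds the map on a 32-bit box; again, names and sizes
are made up). Keeping the map outside the area means a freed page's
contents are never overwritten by free-list links, so the page can
be taken back without rereading it from disk:

#include <stdio.h>

#define PAGES_PER_AREA	32

struct area_map {
	unsigned long bits;	/* bit set = page in use */
};

/* Allocate one page: find the first clear bit, set it, return its index. */
static int area_page_alloc(struct area_map *m)
{
	int i;

	for (i = 0; i < PAGES_PER_AREA; i++) {
		unsigned long mask = 1UL << i;

		if (!(m->bits & mask)) {
			m->bits |= mask;
			return i;
		}
	}
	return -1;		/* area is full */
}

static void area_page_free(struct area_map *m, int i)
{
	m->bits &= ~(1UL << i);
}

int main(void)
{
	struct area_map m;
	int a, b;

	m.bits = 0;
	a = area_page_alloc(&m);
	b = area_page_alloc(&m);
	area_page_free(&m, a);
	printf("got pages %d and %d, then page %d again\n",
	       a, b, area_page_alloc(&m));
	return 0;
}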

For the SLAB areas, where we use different allocation
sizes, we could use a simple buddy allocator.
Because the SLAB data is usually either long-lived
or _very_ short-lived, and because we use only a few
different sizes in one area, the buddy allocator could
actually work here. Maybe we want the SLAB allocator
to give us a hint on whether it needs the memory for
a long or a short period, and use separate areas...

There's no code yet, because I'm having a lot of
trouble switching to glibc _and_ keeping PPP working :(
Maybe later this month.

> RR> I write this to let the PTE people (Stephen and Ben) know
> RR> that they probably shouldn't remove the pagetable walking
> RR> routines from kswapd...
> 
> If we get around to using a true LRU algorithm, we aren't too likely
> to swap out address-space-adjacent pages...  Though I can see the
> advantage for pages of the same age.

True LRU swapping might actually be a disadvantage. The way
we do things now (walking process address space) can result
in a much larger I/O bandwidth to/from the swapping device.

> Also for swapin readahead the only effective strategy I know is to
> implement a kernel system call, that says I'm going to be accessing

There are more possibilities. One of them is to use the
same readahead tactic that is being used for mmap()
readahead. To do this, we'll either have to rewrite
the mmap() stuff, or we can piggyback on the mmap() code
by writing a vnode system for normal memory areas.
The vnode system is probably easier, since that would
also allow for an easy implementation of shared memory
and easier tracking of memory movement (since we never
lose track of pages). Also, this vnode system will
make it possible to turn off memory overcommitment
(something that a lot of people have requested) and
do some other nice tricks...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: PTE chaining, kswapd and swapin readahead
  1998-06-17 16:03   ` Rik van Riel
@ 1998-06-18  4:06     ` Eric W. Biederman
  1998-06-18  7:25       ` Rik van Riel
From: Eric W. Biederman @ 1998-06-18  4:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linux MM

>>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:

RR> On 17 Jun 1998, Eric W. Biederman wrote:

>> If we get around to using a true LRU algorithm, we aren't too likely
>> to swap out address-space-adjacent pages...  Though I can see the
>> advantage for pages of the same age.

RR> True LRU swapping might actually be a disadvantage. The way
RR> we do things now (walking process address space) can result
RR> in a much larger I/O bandwidth to/from the swapping device.

The truly optimal algorithm is to read in the page(s) we are going
to use next, and to remove the page we won't use for the longest period
of time.  It's the same as LRU, but in reverse time order.  And both
of these algorithms have the important property that they avoid
Belady's anomaly: with more memory they won't cause more page
faults and more I/O.
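
To make that concrete, here's a small standalone simulation (a
userspace toy, not kernel code) of FIFO versus LRU replacement on
the classic reference string 1 2 3 4 1 2 5 1 2 3 4 5.  FIFO takes 9
faults with three frames but 10 with four (Belady's anomaly); LRU,
being a stack algorithm like OPT, drops from 10 faults to 8 when
given the extra frame:

#include <stdio.h>

/* Belady's classic reference string. */
static const int refs[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
#define NREFS	((int)(sizeof(refs) / sizeof(refs[0])))

/*
 * Count page faults with 'nframes' frames.  'age' holds the time of
 * insertion (FIFO) or of last use (LRU); the victim is always the
 * frame with the smallest age.
 */
static int count_faults(int nframes, int use_lru)
{
	int frame[8], age[8];
	int i, j, n = 0, faults = 0, now = 0;

	for (i = 0; i < NREFS; i++) {
		int page = refs[i], hit = -1, victim = 0;

		now++;
		for (j = 0; j < n; j++)
			if (frame[j] == page)
				hit = j;
		if (hit >= 0) {
			if (use_lru)
				age[hit] = now;	/* FIFO ignores hits */
			continue;
		}
		faults++;
		if (n < nframes) {		/* a free frame is left */
			frame[n] = page;
			age[n] = now;
			n++;
			continue;
		}
		for (j = 1; j < n; j++)		/* pick the oldest frame */
			if (age[j] < age[victim])
				victim = j;
		frame[victim] = page;
		age[victim] = now;
	}
	return faults;
}

int main(void)
{
	printf("FIFO: %d faults with 3 frames, %d with 4\n",
	       count_faults(3, 0), count_faults(4, 0));
	printf("LRU:  %d faults with 3 frames, %d with 4\n",
	       count_faults(3, 1), count_faults(4, 1));
	return 0;
}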

The goal should be to reduce disk I/O, as disk bandwidth is way below
memory bandwidth.  Using ``unused'' disk bandwidth for prepaging may
also help.

Note: much of the write I/O performance we achieve is because
get_swap_page() is very efficient at returning adjacent swap pages.
I don't see where location in memory makes a difference.

As to which pages should be put close to each other for read
performance, that is a different question.  Files tend to be
read/written sequentially, so it is a safe bet to anticipate this one
usage pattern and, if it is going on, capitalize on it; if not, don't
do any readahead.

We could probably add a few more likely cases to the vm system.  The
only simple special cases I can think of adding are reverse sequential
access, and stack access, where pages 1 2 3 4 are accessed and then 4 3
2 1 are accessed in reverse order.  But for the general case it is quite
difficult to predict, and a wrong prediction could make things worse.
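
Purely as an illustration of the safe-bet forward-sequential case
(nothing like this is in the current kernel, and the thresholds are
made up), the fault path could keep a little per-mapping state and
only read ahead once a forward run has established itself:

#include <stdio.h>

/*
 * Per-mapping readahead state: the last page that faulted and how
 * long the current forward-sequential run has been.
 */
struct ra_state {
	unsigned long prev_page;
	unsigned int run;
};

#define RA_MIN_RUN	2	/* faults to see before trusting the pattern */
#define RA_MAX_PAGES	8	/* cap on the readahead window */

/* How many extra pages to read ahead for a fault at 'page'. */
static unsigned int readahead_pages(struct ra_state *ra, unsigned long page)
{
	if (page == ra->prev_page + 1)
		ra->run++;		/* forward run continues */
	else
		ra->run = 0;		/* random access: predict nothing */
	ra->prev_page = page;

	if (ra->run < RA_MIN_RUN)
		return 0;
	return ra->run < RA_MAX_PAGES ? ra->run : RA_MAX_PAGES;
}

int main(void)
{
	static const unsigned long faults[] = { 10, 11, 12, 13, 40, 41, 7 };
	struct ra_state ra = { 0, 0 };
	unsigned int i;

	for (i = 0; i < sizeof(faults) / sizeof(faults[0]); i++)
		printf("fault at page %3lu -> read ahead %u page(s)\n",
		       faults[i], readahead_pages(&ra, faults[i]));
	return 0;
}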

>> Also, for swapin readahead the only effective strategy I know of is to
>> implement a kernel system call that says I'm going to be accessing

The point I was hoping to make is that for programs that find
themselves swapping frequently, a non-blocking read (for mmapped areas)
can be quite effective, because in certain circumstances a program
can in fact predict which pages it will be using next.  This would
allow a very close approximation of the true optimal paging
algorithm, and is fairly simple to implement.

For a really close approximation we might also want to have a system
call that says which pages we aren't likely to be using soon.
Perhaps:
mtouch(void *addr, int len, int when);
where ``when'' can be SOON or LATER ...

Or after looking at:
http://cesdis.gsfc.nasa.gov/beowulf/software/pager.html

struct mtouch_tuple {
	caddr_t addr;
	size_t extent;
	int when;
} prepage_list[NR_ELEMENTS];
mtouch(&prepage_list);

Where prepage_list is terminated with a special element.  I don't like
the suggested terminator of a NULL (0) address, because there are some
programs that actually use that address...

The ``when'' parameter is my idea...
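
To show how a program might use the tuple interface (remember,
mtouch(), the SOON/LATER values and the sentinel choice are all just
the proposal above, nothing that exists today), a big simulation
could hand the kernel its next working set like this:

#define _DEFAULT_SOURCE		/* for caddr_t in <sys/types.h> */
#include <stddef.h>
#include <sys/types.h>

/* The hypothetical interface sketched above. */
#define MTOUCH_SOON	1	/* please prepage this range */
#define MTOUCH_LATER	2	/* I won't need this range for a while */

struct mtouch_tuple {
	caddr_t addr;
	size_t extent;
	int when;
};

/* Stub standing in for the proposed system call. */
static int mtouch(struct mtouch_tuple *list)
{
	(void)list;
	return 0;
}

#define CHUNK_BYTES	(4 * 1024 * 1024)

int main(void)
{
	static char old_chunk[CHUNK_BYTES], next_chunk[CHUNK_BYTES];
	struct mtouch_tuple plan[3];

	plan[0].addr = (caddr_t)next_chunk;	/* about to walk this */
	plan[0].extent = sizeof(next_chunk);
	plan[0].when = MTOUCH_SOON;
	plan[1].addr = (caddr_t)old_chunk;	/* done with this for now */
	plan[1].extent = sizeof(old_chunk);
	plan[1].when = MTOUCH_LATER;
	plan[2].addr = (caddr_t)-1;		/* sentinel, deliberately not NULL */
	plan[2].extent = 0;
	plan[2].when = 0;

	mtouch(plan);				/* hint, then go touch next_chunk */
	next_chunk[0] = old_chunk[0];
	return 0;
}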

RR> There are more possibilities. One of them is to use the
RR> same readahead tactic that is being used for mmap()
RR> readahead. 

Actually that sounds like a decent idea.  But I doubt it will help
much. I will start on the vnodes fairly soon, after I get a kernel
pgflush daemon working.

Eric


* Re: PTE chaining, kswapd and swapin readahead
  1998-06-18  4:06     ` Eric W. Biederman
@ 1998-06-18  7:25       ` Rik van Riel
From: Rik van Riel @ 1998-06-18  7:25 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux MM

On 17 Jun 1998, Eric W. Biederman wrote:
> >>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:
> 
> RR> True LRU swapping might actually be a disadvantage. The way
> RR> we do things now (walking process address space) can result
> RR> in a much larger I/O bandwidth to/from the swapping device.
> 
> The goal should be to reduce disk I/O, as disk bandwidth is way below
> memory bandwidth.  Using ``unused'' disk bandwidth for prepaging may
> also help.

For normal disks, most I/O time is so completely dominated by
_seeks_ that transfer time can be almost completely ignored: an
average seek costs on the order of 10 ms, while transferring a
single 4 kB page takes well under a millisecond.  This is why we
should focus on reducing the number of disk I/Os, not the number
of blocks transferred.

> Note: much of the write I/O performance we achieve is because
> get_swap_page() is very efficient at returning adjacent swap pages.
> I don't see where location in memory makes a difference.

It makes a difference because:
- programs usually use larger areas in one piece
- swapout clustering saves a lot of I/O (like you just said)
- when pages from a process are adjacent on disk, we can
  save the same amount of I/O on _swapin_ too.
- this will result in a _much_ improved bandwidth

> We could probably add a few more likely cases to the vm system.  The
> only simple special cases I can think of adding are reverse sequential
> access, and stack access, where pages 1 2 3 4 are accessed and then 4 3
> 2 1 are accessed in reverse order.

Maybe that's why stacks grow down :-) Looking at the addresses
of a shrinking stack, you'll notice that linear forward readahead
still is the best algorithm.

> >> Also, for swapin readahead the only effective strategy I know of is to
> >> implement a kernel system call that says I'm going to be accessing
> 
> The point I was hoping to make is that for programs that find
> themselves swapping frequently, a non-blocking read (for mmapped areas)
> can be quite effective.

Agreed. And combined with your other (snipped) call, it might
give a _huge_ advantage for large simulations and other
processes which implement it.

> RR> There are more possibilities. One of them is to use the
> RR> same readahead tactic that is being used for mmap()
> RR> readahead. 
> 
> Actually that sounds like a decent idea.  But I doubt it will help
> much. I will start on the vnodes fairly soon, after I get a kernel
> pgflush daemon working.

It _does_ help very much. A lot of the perceived slowness
of Linux is the 'task switching' in X. By this I mean people
selecting another window as their foreground window,
causing X and the programs to read in huge amounts of
graphics data while simultaneously swapping out other data.

By implementing the same readahead tactic we use for
mmap()ed files, we could cut the number of I/Os by more
than a third, and probably even more.

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

