Re: PTE chaining, kswapd and swapin readahead

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Rik van Riel <H.H.vanRiel@phys.uu.nl>
To: "Eric W. Biederman" <ebiederm+eric@npwt.net>
Cc: Linux MM <linux-mm@kvack.org>
Subject: Re: PTE chaining, kswapd and swapin readahead
Date: Wed, 17 Jun 1998 18:03:14 +0200 (CEST)	[thread overview]
Message-ID: <Pine.LNX.3.96.980617173630.722A-100000@mirkwood.dummy.home> (raw)
In-Reply-To: <m17m2gz8hq.fsf@flinx.npwt.net>

On 17 Jun 1998, Eric W. Biederman wrote:
> >>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:
> 
> RR> This has the advantage of deallocating memory in physically
> RR> adjecant chunks, which will be nice while we still have the
> RR> primitive buddy allocator we're using now.
> 
> Also it has the advantage that shared pages are only scanned once, and
> empty address space needn't be scanned.

OK, this is a _very_ big advantage which I overlooked...

> Just what is your zone allocator?  I have a few ideas based on the
> name but my ideas don't seem to jive with your descriptions.
> This part about not needing physically contigous memory is really
> puzzling.

Well, the idea is to divide memory into different areas
(of 32 to 256 pages in size, depending on the amount of
main memory) for different uses.
There are 3 different uses:
- process pages, buffers and page cache
- pagetables and small (order 0, 1 and maybe 2) SLAB areas
- large SLAB allocations (order 2, 3, 4 and 5)
On large memory machines (>=128M) we might even split the
SLAB areas into 3 types...

Allocation is always done in the fullest area. We keep
track of this by hashing the area's in a doubly linked
list, using perhaps 8 different degrees of 'fullness'.
When an area get's fuller than the queue is meant to be,
it 'promotes' one level up and is added to the _tail_ of
the queue above. When an area get's emptier than it's
queue is supposed to be, it get's added to the _head_
of the queue below.

This way, the emptier areas get emptier and the fullest
area's get fuller. This way we can force-free an area
(with PTE chaining) when we're short of memory.

Inside the user area's, we can simply use a linked list
to mark free pages. Alternatively, we can keep the
administration in a separate area of memory. This has
the advantage that we don't have to reread a page when
it's needed shortly after we swapped it out. Then we
can simply use a bitmap and a slightly optimized
function.

For the SLAB area's, where we use different sizes of
allocation, we could use a simple buddy allocator.
Because the SLAB data is usually either long-lived
or _very_ short-lived and because we use only a few
different sizes in one area, the buddy allocator could
actually work here. Maybe we want the SLAB allocator
to give us a hint on whether it needs the memory for
a long or a short period and using separate area's...

There's no code yet, because I'm having a lot of
trouble switching to glibc _and_ keeping PPP working :(
Maybe later this month.

> RR> I write this to let the PTE people (Stephen and Ben) know
> RR> that they probably shouldn't remove the pagetable walking
> RR> routines from kswapd...
> 
> If we get around to using a true LRU algorithm we aren't too likely
> too to swap out address space adjacent pages...  Though I can see the
> advantage for pages of the same age.

True LRU swapping might actually be a disadvantage. The way
we do things now (walking process address space) can result
in a much larger I/O bandwidth to/from the swapping device.

> Also for swapin readahead the only effective strategy I know is to
> implement a kernel system call, that says I'm going to be accessing

There are more possibilities. One of them is to use the
same readahead tactic that is being used for mmap()
readahead. To do this, we'll either have to rewrite
the mmap() stuff, or we can piggyback the mmap() code
by writing a vnode system for normal memory area's.
The vnode system is probably easier, since that would
also allow for an easy implementation of shared memory
and easier tracking of memory movement (since we never
loose track of pages). Also, this vnode system will
make it possible to turn off memory overcommitment
(something that a lot of people have requested) and
do some other nice tricks...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

next prev parent reply	other threads:[~1998-06-17 16:45 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
1998-06-16 22:10 Rik van Riel
1998-06-17  9:24 ` Eric W. Biederman
1998-06-17 16:03   ` Rik van Riel [this message]
1998-06-18  4:06     ` Eric W. Biederman
1998-06-18  7:25       ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.3.96.980617173630.722A-100000@mirkwood.dummy.home \
    --to=h.h.vanriel@phys.uu.nl \
    --cc=ebiederm+eric@npwt.net \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox