linux-mm.kvack.org archive mirror
* Re: [RFC] 2.3/4 VM queues idea
@ 2000-05-24 19:37 Mark_H_Johnson
  2000-05-24 20:35 ` Matthew Dillon
  0 siblings, 1 reply; 35+ messages in thread
From: Mark_H_Johnson @ 2000-05-24 19:37 UTC (permalink / raw)
  To: riel; +Cc: acme, dillon, linux-mm, sct

I'll try to combine comments on the VM queues & Matt Dillon's material in
one response. I've added some analysis at the end - yes the reference is
OLD, but the math is still valid. The bottom line of my comments - make
sure the goals are right, put some measures into place so we can determine
"success", & build the solution based on sound principles. Also, thanks to
all who are working to make the VM system in Linux better.

Re: Goals
 - Robust design - absolutely essential. VM is one of those key
capabilities that must be done right or not at all.
 - page aging and "buffer for allocations" - not sure if I care which
methods we use as long as they work well. To me, the goal of VM is to have
the "right" pages in memory when the CPU executes an instruction. Page
aging and free lists are two methods, lookahead and clustered reads &
writes are others.
 - treat pages equally - I think I disagree with both you and Matt on this
one. We have different usage patterns for different kinds of data (e.g.
execution of code tends to be localized but not sequential vs. sequential
read of data in a file) & should have a means of distinguishing between
them. This does not mean that one algorithm won't do a good job for both
the VM & buffer cache, just recognize that we should have ways to treat
them differently. See my comments on "stress cases" below for my rationale.
 - simple changes - I agree, get this issue settled & move on.
To these goals I might add
 - predictable performance [essential if I deploy any large (or real time)
system on Linux]
 - no more automatic process kills - give the system [and/or operator] the
ability to recover without terminating jobs
 - make it adjustable by the user or system administrator for different
workloads
 - durable - works well with varied CPU, memory, and disk performance
A related item - put measures in place to show that the "goals have been
met". How do we know that method A works better than method B without some
hooks to measure the performance?

A final comment on goals - I honestly don't mind systems that swap - IF
they do a better job of running the active jobs as a result. The FUD that
Matt refers to on FreeBSD swapping may actually mean that FreeBSD runs
better than Linux. I can't tell, and it probably varies by application area
anyway.

Re: Design ideas [with ties into Matt's material]
 - not sure what the 3 [4?] queues are doing for us; I can't tie the queue
concept to how we are meeting one of the goals above.
 - allocations / second as a parameter to adjust free pages. Sounds OK, may
want a scalable factor here with min & max limits to free pages. For
example, in a real time system I want no paging - that doesn't mean that I
want no free pages. A burst of activity could occur where I need to do an
allocation & don't want to wait for a disk write before taking action
[though that usually means MY system design is broke...].
 - not sure of the distinction between "atomic" and "non atomic" allocations.
Please relate this to meeting the goals above, or add a goal.
 - not clear how the inactive queue scan & page cleaner methods meet the
goals either.
 - [Matt] real statistics - I think is a part of the data needed to
determine if A is better than B.
 - [Matt] several places talk about LRU or not strict LRU - LRU is a good
algorithm but has overhead involved. You may find that a simple FIFO or
clock algorithm gets >90% of the benefit of LRU without the overhead [can
be a net gain, something to consider].
 - [Matt] relate scan rate to perceived memory load. Looks OK if the
overhead makes it worth the investment.
 - [Matt] moving balance points & changing algorithms [swap, not page].
Ditto.
 - [Matt] adjustments to initial weight. Ditto - relates directly to the
"right page" goal.
See the analysis below to get an idea of what I mean by "overhead is worth
the investment". It looks like we can spend a lot of CPU cycles to save one
disk read or write and still run the system efficiently.

The remainder of Matt's message was somewhat harder to read - some mixture
of stress cases & methods used to implement VM. I've reorganized what he
said to characterize this somewhat more clearly [at least in my mind]...

  Stress cases [characterized as sequential/not, read only or read/write,
shared or not, locked or not]. I tried to note the application for each case
as well as the general technique that could be applied to memory management.
  - sequential read only access [file I/O or mmap'd file, use once &
discard]
  - sequential read/write access [file updates or mmap'd file, use once &
push to file on disk]
  - non-sequential read or execute access, not shared [typical executable,
locality of reference; when memory is tight, can discard & refresh from
disk]
  - non-sequential read/write access, not shared [stack or heap; locality
of reference, must push to file or swap when memory is tight]
  - read only shared access [typical shared library, higher "apparent cost"
if not in memory, can discard & refresh from disk]
  - read/write shared access [typical memory mapped file or shared memory
region, higher apparent cost if not in memory, has cache impact on multi
processors, must push to file or swap when memory is tight]
  - locked pages for I/O or real time applications [fixed in memory due to
system constraints, VM system must (?) leave alone, cannot (?) move in
memory, not "likely" needed on disk (?)]. I had a crazy idea as I was
writing this one - the application is to capture a stream of data from the
network onto disk. The user mmap's a file & locks it into memory to ensure
bandwidth can be met, fills the region with data from the network, then
unlocks & syncs the file when the transfer is done (see the sketch below).
How would FreeBSD/Linux handle that case?
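
A rough sketch of what that capture loop might look like (error handling
trimmed; the region size and fill loop are made up for illustration):

/* Sketch: capture a network stream into a locked, mmap'd file.
 * REGION_SIZE and the read loop are illustrative placeholders. */
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define REGION_SIZE (64UL * 1024 * 1024)

int capture_stream(const char *path, int sock)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, REGION_SIZE) < 0)
        return -1;

    void *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Pin the region so the VM system leaves it alone while we fill it. */
    if (mlock(buf, REGION_SIZE) < 0)
        return -1;

    size_t off = 0;
    while (off < REGION_SIZE) {
        ssize_t n = read(sock, (char *)buf + off, REGION_SIZE - off);
        if (n <= 0)
            break;
        off += n;
    }

    /* Transfer done: unlock, then push the dirty pages out to the file. */
    munlock(buf, REGION_SIZE);
    msync(buf, REGION_SIZE, MS_SYNC);
    munmap(buf, REGION_SIZE);
    close(fd);
    return 0;
}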

I expect there are other combinations possible - I just tried to
characterize the kind of memory usage patterns that are typical. It may
also be good to look at some real application areas - interactive X
applications, application compile/link, database updates, web server, real
time, streaming data, and so on; measure how they stress the system & build
the VM system to handle the varied loads - automatically if you can but
with help from the user/system administrator if you need it.

  Methods [recap of Matt's list]
  - write clustering - a good technique used on many systems
  - read clustering - ditto, especially note about priority reduction on
pages already read
  - sequential access detection - ditto
  - sequential write detection - ditto, would be especially helpful with
some of our shared memory test programs that currently stress Linux to
process kills; think of sequential access to memory mapped files in the
same way as sequential file I/O. They should have similar performance
characteristics if we do this right [relatively small resident set sizes,
high throughput].

Some analysis of the situation

I took a few minutes to go through some of my old reference books. Here are
some concepts that might help from "Operating System Principles" by Per
Brinch Hansen (1973), pp 182-191.

Your disk can supply at most one page every T seconds, a memory access takes
t seconds, and your program demands pages at a rate p(s), where s is the ratio
of the process's resident size to its total size. The processor thus tends to
demand one page every 1/p(s) memory accesses, i.e. every t/p(s) seconds. There
are three situations:
 - disk is idle, p(s) < t/T. Generally OK, desired if you need to keep
latency down.
 - CPU is idle, p(s) > t/T. Generally bad, leads to thrashing. May indicate
you need more memory or faster disks for your application.
 - balanced system, p(s) = t/T. Full utilization, may lead to latency
increases.
Let's feed some current performance numbers & see where it leads.

A disk access (T) is still a few milliseconds and a memory access (t) is
well under a microsecond (let's use T=5E-3 and t=5E-8 to keep the math
simple). The ratio t/T is thus 1/100000 - that means one disk page access
per 100000 memory accesses. Check the math: at 20 MHz memory accesses, your
balanced demand is 200 disk pages per second (perhaps high?). Note that many
of our stress tests have much higher ratios than this. I suggest a few real
measurements to get ratios for real systems rather than my made-up numbers.
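
For what it's worth, the arithmetic above is simple enough to check with a
throwaway program (my made-up numbers plugged in; adjust t and T for a real
machine):

/* Sketch of the balance arithmetic above; t and T are made-up numbers. */
#include <stdio.h>

int main(void)
{
    double T = 5e-3;   /* seconds per disk page transfer */
    double t = 5e-8;   /* seconds per memory access      */

    double ratio = t / T;        /* 1e-5: one disk page per 100000 accesses */
    double disk_rate = 1.0 / T;  /* disk pages/second the disk can supply   */
    double mem_rate = 1.0 / t;   /* memory accesses/second (20 MHz)         */

    printf("t/T = %g (balanced fault rate per memory access)\n", ratio);
    printf("disk supply = %.0f pages/sec, memory = %.0f accesses/sec\n",
           disk_rate, mem_rate);
    printf("balanced demand = %.0f disk pages/sec\n", mem_rate * ratio);
    return 0;
}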

Even if I'm off by 10x or 100x, this kind of ratio means that many of the
methods Matt describes make sense. It helps justify why spending time to
do read/write clusters pays off - it gets you more work done for the cost
of each disk access (T). It helps justify why detecting sequential access &
limiting resident set sizes in those cases is OK [discard pages that will
never be used again]. It shows how some of the measures Matt has described
(e.g., page weights & perceived memory load) help determine how we should
proceed.

The book goes on to talk about replacement algorithms. FIFO or
approximations to LRU can cause up to 2-3x more pages to be transferred
than the "ideal algorithm" (one that knows in advance the best choice).
This results in a 12-24% reduction of working set size but improves
utilization only a few percent. There is also a discussion of transfer
algorithms, basically a combination of elevator and/or clustering to reduce
latency & related overhead. I think Matt has covered better techniques than
were in this old book. I did have to chuckle at the biggest gain - fix
the application to improve locality - a problem then & still one today.
Small tools work better than big ones. Perhaps that's a reason to combat
code bloat in the tools we use.

Closing
First, let's make sure we agree what we really need from the VM system
(goals).
Second, put a few measurements in place so we can determine we've
succeeded.
Third, implement solutions that will get us there.

Don't forget to give thanks for all the hard work that has gone into what
we have today & will have tomorrow.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-24 19:37 [RFC] 2.3/4 VM queues idea Mark_H_Johnson
@ 2000-05-24 20:35 ` Matthew Dillon
  0 siblings, 0 replies; 35+ messages in thread
From: Matthew Dillon @ 2000-05-24 20:35 UTC (permalink / raw)
  To: Mark_H_Johnson; +Cc: riel, acme, linux-mm, sct

    I should make a clarification on the LRU.  It's a center-weighted LRU,
    statistically based, which is different from a strict LRU.  In a
    strict LRU design when you access an element it will be moved to the end
    of the LRU queue.  In a weighted LRU design accessing the element
    adjusts the weight, and the queue placement depends on the new weight
    (i.e. the element may not be moved to the end of the queue).  The idea
    here is to get a statistical view of how 'active' the page is that
    is fairly consistent no matter when you sample it.  This is a very 
    different result than what you get with a strict LRU model because
    the statistical model incorporates the page's use history whereas a
    strict LRU model does not.

    The statistical model cannot predict if a page will be accessed 'soon',
    but it can do a reasonable job of predicting whether a page will be
    accessed continuously (over and over again), and it is the latter
    prediction which is the most important when trying to regulate a
    system's paging load.

    In regards to the overhead of maintaining the weight-ordering of pages
    in the queue -- this is one of the (several) reasons why you use a 
    multi-queue model rather than a single-queue model.  With a multi-queue
    model you do not have to spend a lot of time trying to keep the elements
    in each queue sorted, which reduces the per-page complexity of entering
    a page into the queue from O(N) to nearly O(1).  Taking the FBsd 
    implementation as an example, the pageout code will scan the active 
    queue looking for pages with 0 weightings.  It will scan the *entire*
    queue if it needs to but, in general, the loose ordering of pages
    within that queue results in being able to cut the scan off early
    in most cases.  This is what I mean by 'loose ordering within the queue'.
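
    A loose sketch of the shape of this, with structure names and constants
    made up purely for illustration (this is not the actual FreeBSD code):

/* Sketch only: a weighted, loosely ordered active queue. */
struct page_stat {
    struct page_stat *next;     /* active queue link            */
    int weight;                 /* statistical "activity" score */
    int referenced;             /* copy of the referenced bit   */
};

#define WEIGHT_MAX   64
#define WEIGHT_BUMP   8
#define WEIGHT_DECAY  1

/* Scanning a page adjusts its weight; the page is NOT moved to the queue
 * tail, so insertion stays O(1) and the queue stays loosely ordered. */
static void age_page(struct page_stat *p)
{
    if (p->referenced) {
        p->referenced = 0;
        p->weight += WEIGHT_BUMP;
        if (p->weight > WEIGHT_MAX)
            p->weight = WEIGHT_MAX;
    } else if (p->weight > 0) {
        p->weight -= WEIGHT_DECAY;
    }
}

/* Pageout: walk the active queue looking for weight-0 pages; the loose
 * ordering usually lets the scan stop early once enough are found. */
static int scan_active(struct page_stat *head, int wanted,
                       void (*deactivate)(struct page_stat *))
{
    int found = 0;
    for (struct page_stat *p = head; p && found < wanted; p = p->next) {
        age_page(p);
        if (p->weight == 0) {
            deactivate(p);      /* move to the inactive queue */
            found++;
        }
    }
    return found;
}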

: - treat pages equally - I think I disagree with both you and Matt on this
:one. We have different usage patterns for different kinds of data (e.g.
:execution of code tends to be localized but not sequential vs. sequential
:read of data in a file) & should have a means of distinguishing between
:them. This does not mean that one algorithm won't do a good job for both
:the VM & buffer cache, just recognize that we should have ways to treat
:them differently. See my comments on "stress cases" below for my rationale.

    Finally, on how to treat pages.  Here's the problem:  When one 
    allocates a new page there is enough contextual information to determine
    how the page is likely to be used, allowing you to adjust the initial
    weight of the page.

    But once the page has been allocated you can't really make any assumptions
    about the ongoing use of the page short of actually checking to see if
    it has been accessed - not without putting the processes running on the
    system that happen not to operate under your assumptions at a huge
    disadvantage and, as a consequence, placing the system under more
    stress.

    What page load comes down to is simply this:  It's the kernel deciding
    to reuse a page that some process immediately tries to fault back in,
    requiring an I/O to get it back, or the kernel flushing a dirty page to
    backing store that some process immediately re-dirties, costing an
    unnecessary I/O.  This extra overhead creates avoidable memory strain on
    the system.  It doesn't matter what kind of page the strain
    was related to (swap, data file, binary, NFS, anything...).  We can't
    prevent the above from occurring (you can't fully predict when a page will
    be needed), but what the statistical weighting gives us is the ability
    to prevent the above situation from *re*occurring, over and over again,
    for any given page.

    Actually measuring the usage statistically (the 'weight' of the page)
    is the only way to get reasonably close to actual use.  If you skew
    the results by making continuing assumptions (beyond the calculation 
    of the initial weight) on how the page will be used simply based on 
    the type of page you have, then *any* process in the system that happens
    to operate differently from those assumptions will not just cause
    inefficient paging to occur, it will *always* cause inefficient paging
    to occur.  Many of the memory stress situations Linux has come up
    against in the last few years are due directly to this scenario.

    The statistical model, on the other hand, has a much better chance (not
    perfect obviously, but *better*) of adapting itself to the usage pattern
    of a process whatever that pattern happens to be.

    What you want to do instead is have heuristics which 'detect' certain
    page-use patterns and then adjust the weight based on that.  You aren't
    really making blatant assumptions here, you are simply figuring out what
    the actual pattern is and then acting upon it.  This is what all those
    'sequential detection' and other heuristics do.  These sorts of 
    adjustments to page weighting are very, very different from adjustments
    based on assumptions.
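
    As a concrete (and entirely made-up) illustration, a sequential-read
    detector adjusting page weights might look something like this:

/* Sketch: detect sequential faults on an object and give the pages a low
 * initial weight so they get recycled first.  All names and thresholds
 * here are invented for illustration. */
struct seq_state {
    unsigned long last_index;   /* index of the previous fault */
    int run;                    /* length of the current run   */
};

#define SEQ_THRESHOLD     4     /* faults in a row before we believe it */
#define WEIGHT_SEQUENTIAL 1     /* low weight for use-once data         */
#define WEIGHT_DEFAULT    8

static int initial_weight(struct seq_state *s, unsigned long index)
{
    if (index == s->last_index + 1)
        s->run++;
    else
        s->run = 0;
    s->last_index = index;

    /* A long enough run means "use once and discard": a low weight lets
     * the pageout scan reclaim these pages early, and this would also be
     * the natural point to kick off read-ahead of the next few pages. */
    return (s->run >= SEQ_THRESHOLD) ? WEIGHT_SEQUENTIAL : WEIGHT_DEFAULT;
}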

					-Matt
					Matthew Dillon
					<dillon@backplane.com>


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 20:41                             ` Jamie Lokier
@ 2000-05-28 22:42                               ` Stephen Tweedie
  0 siblings, 0 replies; 35+ messages in thread
From: Stephen Tweedie @ 2000-05-28 22:42 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Matthew Dillon, Rik van Riel, linux-mm

Hi,

On Fri, May 26, 2000 at 10:41:13PM +0200, Jamie Lokier wrote:
> Stephen C. Tweedie wrote:
> > > The stacked private address_spaces I described don't have to be shared
> > > between address_spaces in a single mm.  You can have one per vma --
> > > that's simple but maybe wasteful.
> > 
> > That's basically exactly what davem's stuff did.
> 
> Ok, I shall look more carefully at davem's code.
> Is this the most recent one?:

Yes, I think he dropped it at this point.

--Stephen

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 17:15                           ` Stephen C. Tweedie
@ 2000-05-26 20:41                             ` Jamie Lokier
  2000-05-28 22:42                               ` Stephen Tweedie
  0 siblings, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2000-05-26 20:41 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Matthew Dillon, Rik van Riel, linux-mm

Stephen C. Tweedie wrote:
> > The stacked private address_spaces I described don't have to be shared
> > between address_spaces in a single mm.  You can have one per vma --
> > that's simple but maybe wasteful.
> 
> That's basically exactly what davem's stuff did.

Ok, I shall look more carefully at davem's code.
Is this the most recent one?:

> Date:   Wed, 17 May 2000 11:00:34 +0100
> Subject: [davem@redhat.com: my paging work]
> From:   "Stephen C. Tweedie" <sct@redhat.com>
> 
> Hi,
> 
> Here's davem's page-based swapout snapshot.  It's UNFINISHED + DANGEROUS
> WILL_EAT_YOUR_DISK (his words!), but somebody may want to archive this
> and pick up on the work in 2.5.
> 
> --Stephen
> 
> ----- Forwarded message from "David S. Miller" <davem@redhat.com> -----

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 17:35                             ` Matthew Dillon
@ 2000-05-26 17:46                               ` Stephen C. Tweedie
  0 siblings, 0 replies; 35+ messages in thread
From: Stephen C. Tweedie @ 2000-05-26 17:46 UTC (permalink / raw)
  To: Matthew Dillon; +Cc: Jamie Lokier, Stephen C. Tweedie, Rik van Riel, linux-mm

Hi,

On Fri, May 26, 2000 at 10:35:30AM -0700, Matthew Dillon wrote:
> 
>     Hmm.  I know apps which use madvise() to manage allocated/free pages
>     efficiently, but not any that use mprotect().

Persistent and distributed data stores use it to mark existing
pages as PROT_NONE, so that they can trap accesses to the data
and perform the necessary locking to make the local storage
consistent with the on-disk data (persistent object stores) or
with other machines in the cluster (for distributed shared 
memory).
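
Roughly like this, as a minimal user-space sketch (the handler body is
obviously application-specific; the locking/fetching step is omitted):

/* Sketch: trap the first access to a PROT_NONE page via SIGSEGV, do the
 * store's locking/fetching, then re-enable access so the access retries. */
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <stdint.h>
#include <unistd.h>

static long page_size;

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    void *page = (void *)((uintptr_t)si->si_addr & ~(page_size - 1));
    (void)sig; (void)ctx;

    /* ... take the persistent-store / DSM lock, pull the page's contents
     *     in from disk or from another node ... */

    /* Make the page accessible again; the faulting instruction restarts. */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

static void install_handler(void)
{
    struct sigaction sa;
    page_size = sysconf(_SC_PAGESIZE);
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}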

--Stephen

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 17:05                           ` Jamie Lokier
@ 2000-05-26 17:35                             ` Matthew Dillon
  2000-05-26 17:46                               ` Stephen C. Tweedie
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Dillon @ 2000-05-26 17:35 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Rik van Riel, linux-mm

:Matthew Dillon wrote:
:>     In regards to overhead, anything that collects a bunch of pages together
:>     (e.g. vm_map_entry, vm_object under FBsd, VMA in Jamie's scheme)
:>     simply does not create a memory overhead issue.  None at all.  It's
:>     the things that eat memory on a per-page basis that get annoying.
:
:Stephen's point is that there are applications which use mprotect() on a
:per-page basis.  Some garbage collectors for example, to track dirtied
:pages.
:
:And having to scan hundreds of vmas to find one pte sucks :-)
:
:But I addressed that a few minutes ago.  In my scheme you don't have to
:scan lots of vmas to find that pte.  Only one or two, or you can choose
:to increase that number while decreasing memory requirements a little.
:
:-- Jamie

    Hmm.  I know apps which use madvise() to manage allocated/free pages
    efficiently, but not any that use mprotect().  The madvise() flags
    typically used affect the underlying pages directly and should not fragment
    the VMA's at all.  In any case, it's not a big deal because even if you
    did have to fragment the VMA's, you can still collapse adjacent entries
    together.  i.e. if the garbage collector protects page 1 and then later
    on protects page 2 the same way, you still need only one VMA to
    represent both pages.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 17:02                         ` Jamie Lokier
@ 2000-05-26 17:15                           ` Stephen C. Tweedie
  2000-05-26 20:41                             ` Jamie Lokier
  0 siblings, 1 reply; 35+ messages in thread
From: Stephen C. Tweedie @ 2000-05-26 17:15 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Matthew Dillon, Rik van Riel, linux-mm

Hi,

On Fri, May 26, 2000 at 07:02:08PM +0200, Jamie Lokier wrote:
> 
> The stacked private address_spaces I described don't have to be shared
> between address_spaces in a single mm.  You can have one per vma --
> that's simple but maybe wasteful.

That's basically exactly what davem's stuff did.

--Stephen

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 16:55                         ` Matthew Dillon
@ 2000-05-26 17:05                           ` Jamie Lokier
  2000-05-26 17:35                             ` Matthew Dillon
  0 siblings, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2000-05-26 17:05 UTC (permalink / raw)
  To: Matthew Dillon; +Cc: Stephen C. Tweedie, Rik van Riel, linux-mm

Matthew Dillon wrote:
>     In regards to overhead, anything that collects a bunch of pages together
>     (e.g. vm_map_entry, vm_object under FBsd, VMA in Jamie's scheme)
>     simply does not create a memory overhead issue.  None at all.  It's
>     the things that eat memory on a per-page basis that get annoying.

Stephen's point is that there are applications which use mprotect() on a
per-page basis.  Some garbage collectors for example, to track dirtied
pages.

And having to scan hundreds of vmas to find one pte sucks :-)

But I addressed that a few minutes ago.  In my scheme you don't have to
scan lots of vmas to find that pte.  Only one or two, or you can choose
to increase that number while decreasing memory requirements a little.

-- Jamie

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 16:40                       ` Stephen C. Tweedie
  2000-05-26 16:55                         ` Matthew Dillon
@ 2000-05-26 17:02                         ` Jamie Lokier
  2000-05-26 17:15                           ` Stephen C. Tweedie
  1 sibling, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2000-05-26 17:02 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Matthew Dillon, Rik van Riel, linux-mm

Stephen C. Tweedie wrote:
> > That's ok.  VA == vma->pgoff + page_offset.  Move a vma and that's still
> > true.  The ptes are found by looking at the list of all vmas referring
> > to all the address_spaces that refer to a page.
> 
> And that is _exactly_ the problem --- especially with heavy mprotect()
> use, processes can have enormous numbers of vmas.  Electric fence and
> distributed shared memory/persistent object stores are the two big,
> obvious cases here.

The stacked private address_spaces I described don't have to be shared
between address_spaces in a single mm.  You can have one per vma --
that's simple but maybe wasteful.  Or one per several vmas.  When you
divide a single vma into zillions using mprotect(), you're free to split
off new spaces to limit the number of vmas per space.

-- Jamie


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 16:40                       ` Stephen C. Tweedie
@ 2000-05-26 16:55                         ` Matthew Dillon
  2000-05-26 17:05                           ` Jamie Lokier
  2000-05-26 17:02                         ` Jamie Lokier
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Dillon @ 2000-05-26 16:55 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Jamie Lokier, Rik van Riel, linux-mm

:Hi,
:
:On Fri, May 26, 2000 at 06:36:40PM +0200, Jamie Lokier wrote:
:
:> That's ok.  VA == vma->pgoff + page_offset.  Move a vma and that's still
:> true.  The ptes are found by looking at the list of all vmas referring
:> to all the address_spaces that refer to a page.
:
:And that is _exactly_ the problem --- especially with heavy mprotect()
:use, processes can have enormous numbers of vmas.  Electric fence and
:distributed shared memory/persistent object stores are the two big,
:obvious cases here.
:
:--Stephen

    I don't think this will be a problem.  FreeBSD's vm_map_entry scheme 
    is very similar and we found it to be fairly trivial to optimize adjacent
    entries in many cases for both madvise() and mprotect().  In fact, half
    the madvise() calls (such as MADV_WILLNEED, MADV_DONTNEED, and MADV_FREE)
    have no effect on the vm_map_entry (VMA equivalent) at all, they operate
    directly on the associated pages.

    The only fragmentation issue I've ever seen with our vm_map_entry scheme
    occurs when you use mprotect() to create a guard page for each thread's
    stack.  The creation of the guard page forces the vm_map_entry to be
    broken up, preventing the vm_map_entries for adjacent stack segments
    from being collapsed together into a single entry.

    You wind up with two vm_map_entry's per thread.  Not really a problem,
    but somewhat of an eyesore.
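
    In code, the per-thread setup is basically this (just a sketch with
    made-up sizes, not our actual library code):

/* Sketch: carve out a thread stack with an inaccessible guard page at its
 * low end.  The mprotect() is what forces the stack's mapping to be
 * described by two map entries instead of one. */
#include <sys/mman.h>

#define STACK_SIZE      (256 * 1024)
#define PAGE_SIZE_GUESS 4096    /* use sysconf(_SC_PAGESIZE) in real code */

void *alloc_thread_stack(void)
{
    char *base = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Guard page: a stack overflow faults instead of scribbling on the
     * next thread's stack. */
    if (mprotect(base, PAGE_SIZE_GUESS, PROT_NONE) < 0) {
        munmap(base, STACK_SIZE);
        return NULL;
    }
    return base + STACK_SIZE;   /* stacks grow down */
}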

    In regards to overhead, anything that collects a bunch of pages together
    (e.g. vm_map_entry, vm_object under FBsd, VMA in Jamie's scheme)
    simply does not create a memory overhead issue.  None at all.  It's
    the things that eat memory on a per-page basis that get annoying.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 16:36                     ` Jamie Lokier
@ 2000-05-26 16:40                       ` Stephen C. Tweedie
  2000-05-26 16:55                         ` Matthew Dillon
  2000-05-26 17:02                         ` Jamie Lokier
  0 siblings, 2 replies; 35+ messages in thread
From: Stephen C. Tweedie @ 2000-05-26 16:40 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Matthew Dillon, Rik van Riel, linux-mm

Hi,

On Fri, May 26, 2000 at 06:36:40PM +0200, Jamie Lokier wrote:

> That's ok.  VA == vma->pgoff + page_offset.  Move a vma and that's still
> true.  The ptes are found by looking at the list of all vmas referring
> to all the address_spaces that refer to a page.

And that is _exactly_ the problem --- especially with heavy mprotect()
use, processes can have enormous numbers of vmas.  Electric fence and
distributed shared memory/persistent object stores are the two big,
obvious cases here.

--Stephen

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 14:38                   ` Stephen C. Tweedie
  2000-05-26 15:59                     ` Matthew Dillon
@ 2000-05-26 16:36                     ` Jamie Lokier
  2000-05-26 16:40                       ` Stephen C. Tweedie
  1 sibling, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2000-05-26 16:36 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Matthew Dillon, Rik van Riel, linux-mm

Stephen C. Tweedie wrote:
> > mremaps that simply expand or shrink a segment are fine by themselves.
> > mremaps that move a segment are fine by themselves.
> 
> No, they are not fine.  When you move a segment, you end up with pages
> which have the same offset but are now at a different VA.  What that 
> means is that you have no way of finding out, for a given physical page,
> what the VA of all of the mappings of that page may be.  That means that
> you have no way to find all of the ptes short of scanning all the vmas
> in order.

That's ok.  VA == vma->pgoff + page_offset.  Move a vma and that's still
true.  The ptes are found by looking at the list of all vmas referring
to all the address_spaces that refer to a page.

-- Jamie

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 14:38                   ` Stephen C. Tweedie
@ 2000-05-26 15:59                     ` Matthew Dillon
  2000-05-26 16:36                     ` Jamie Lokier
  1 sibling, 0 replies; 35+ messages in thread
From: Matthew Dillon @ 2000-05-26 15:59 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Jamie Lokier, Rik van Riel, linux-mm

:Hi,
:
:On Fri, May 26, 2000 at 04:31:29PM +0200, Jamie Lokier wrote:
:
:> You didn't mention it, but that leaves mremap.  This is a fiddly one!
:
:Yes, we know this.  :-)
:
:> mremaps that simply expand or shrink a segment are fine by themselves.
:> mremaps that move a segment are fine by themselves.
:
:No, they are not fine.  When you move a segment, you end up with pages
:which have the same offset but are now at a different VA.  What that 
:means is that you have no way of finding out, for a given physical page,
:what the VA of all of the mappings of that page may be.  That means that
:you have no way to find all of the ptes short of scanning all the vmas
:in order.
:
:--Stephen

    Basically you have two choices:  Either track all the mappings to an
    underlying object in such a way that you can locate all the 
    potential (object,index) -> (process,va) mappings, or you can
    track the PTE's themselves as FreeBSD does with its PV entry junk.

    I personally hate the PV entry junk in FreeBSD.  It's fast, but it has
    a lot of memory overhead.

    I would not be afraid of adding appropriate linked-list fields to your
    various tracking structures to be able to locate the potential mappings
    more easily, and I'll bet dollars to doughnuts that you would be able
    to refine the scheme to make it just as fast as our PV entry scheme 
    but without the overhead.

    In any case, locating the pte's would go like this:

	* physical page candidate

	* direct knowledge of (object,index) for physical page (any given 
	  physical page exists in just one VM object.  I'm using a FreeBSD
	  style VM object as a reference here).

	* scan mapping structures linked to the object for mappings that
	  cover (index).

	* Lookup the pte associated with each such mapping (the pte may or may
	  not exist in the actual page table, depending on whether it has 
	  been faulted or not, or overloaded by another VM object layer).

	* done (you are able to locate all the pte's associated with a physical 
	  page)

    In FreeBSD, locating the pte's goes like this:

	* physical page candidate

	* (direct knowledge of (object,index) for physical page, but FreeBSD
	  doesn't use it for this particular operation).

	* scan linked list of PV's based in physical page structure.

	* each PV represents a *mapped* pte.

	(Which is a lot faster and less stressful on the cache, but which
	also eats a truly disgusting amount of memory having to have a
	PV entry structure for each pte).
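
    In rough pseudo-C the object-scan version might look like this; every
    structure and field name below is invented for illustration:

/* Sketch only: walk the mappings of a page's object to find its ptes. */
struct mm;

struct mapping {
    struct mapping *next;       /* next mapping of the same object     */
    struct mm *mm;              /* address space this mapping lives in */
    unsigned long va_start;     /* virtual start of the mapping        */
    unsigned long obj_offset;   /* object index the mapping starts at  */
    unsigned long npages;       /* length of the mapping in pages      */
};

struct object {
    struct mapping *mappings;   /* list of all mappings of this object */
};

/* Call fn() for every pte that may map (obj, index); the pte may not be
 * present if it was never faulted in or is shadowed by a higher layer. */
static void for_each_pte(struct object *obj, unsigned long index,
                         void (*fn)(struct mm *, unsigned long va))
{
    for (struct mapping *m = obj->mappings; m; m = m->next) {
        if (index >= m->obj_offset && index < m->obj_offset + m->npages) {
            unsigned long va = m->va_start +
                ((index - m->obj_offset) << 12 /* PAGE_SHIFT */);
            fn(m->mm, va);      /* look up the pte for (mm, va) here */
        }
    }
}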

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 14:31                 ` Jamie Lokier
  2000-05-26 14:38                   ` Stephen C. Tweedie
@ 2000-05-26 15:45                   ` Matthew Dillon
  1 sibling, 0 replies; 35+ messages in thread
From: Matthew Dillon @ 2000-05-26 15:45 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Rik van Riel, linux-mm

:Stephen C. Tweedie wrote:
:> > Agreed.  I looked at that code though and it seemed very... large.
:> > I think COW address_space gets the same results with less code.  Fast, too.
:> > I know what I've got to do to prove it :-)
:> 
:> How will it deal with fork() cases where the child starts mprotecting
:> arbitrary regions, so that you have completely independent vmas all
:> sharing the same private pages?
:
:Each VMA points to an address_space, and each private address_space can
:have a parent.  Pages that aren't hashed in a private address space are
:found in the parent's address space.
:
:When a VMA is cloned for fork(), they have the same address_space which
:is now marked as requiring COW.  When you modify a page in either, a new
:space is created which contains the modified pages and the appropriate
:VMA refers to the new space.  Now if it was from a file there were page
:modifications at all stages by everyone, you have a small tree of 4
:address_spaces:
:
:                      1 - underlying file
:                      |
:                      2 - privately modified pages from the file,
:                     / \  shared by child & parent
:                    /   \
:pages only seen by 3     4 pages only seen by the child
:the parent                          
:
:The beauty here is that the sharing structure is quite explicit.
:
:Note that stacked address_spaces are only created when they actually
:contain pages, and page counters are used to collapse layers when
:appropriate.
:...

    This appears to be very close to what FreeBSD does with its vm_map_entry
    and vm_object structures.  If you haven't read my article on how VM 
    objects work in FreeBSD, you really should, because you are going to hit
    exactly the same problems.  Ignore the linux jabs in the article :-)
    and skip to the 'VM objects' section.

	http://www.daemonnews.org/200001/freebsd_vm.html

    The article describes VM objects, which represent logical entities such
    as files or anonymous memory areas.   VM objects can be stacked to
    implement private address spaces (MAP_PRIVATE mappings).

    However, in FreeBSD a VM object represents a complete logical entity
    (such as a file), *NOT* a memory mapping.  There is a separate structure
    called a vm_map_entry which is responsible for mapping portions of a
    process's address space to portions of a VM object.   Things like COW
    flags and madvise() flags are stored in the vm_map_entry, not the 
    vm_object.  The actual function of doing a copy-on-write involves 
    stacking a new anonymous-memory (swap-backed) VM object in front of the
    one that took the COW hit, based on the COW flag in the vm_map_entry.
    Once the vm_map_entry is repointed to the new 'top layer' for that
    process, the COW flag can be cleared.  Write faults always occur in
    the top layer, so if you attempt to write to a page in a MAP_PRIVATE
    mapped file and that page cannot be found in the top level VM object
    (swap-backed anonymous memory object), it will be copied up from the
    lower level VM object (the one representing the actual file).
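
    The write-fault path then looks roughly like this (pseudo-C with
    invented names, not the actual FreeBSD code):

/* Sketch of the COW write-fault copy-up described above. */
struct page;
struct vm_object;

struct vm_map_entry {
    struct vm_object *object;   /* top of the shadow chain            */
    int cow;                    /* entry still needs a shadow object? */
};

struct vm_object {
    struct vm_object *backing;  /* lower layer (e.g. the mapped file) */
};

extern struct vm_object *alloc_anon_object(struct vm_object *backing);
extern struct page *lookup_page(struct vm_object *obj, unsigned long index);
extern struct page *copy_page_up(struct vm_object *top, unsigned long index,
                                 struct page *src);

struct page *cow_write_fault(struct vm_map_entry *e, unsigned long index)
{
    /* First write through this entry: push an anonymous, swap-backed
     * object in front of whatever the entry pointed at, clear COW. */
    if (e->cow) {
        e->object = alloc_anon_object(e->object);
        e->cow = 0;
    }

    /* Writes always land in the top layer; copy the page up from the
     * backing object (the file) if the top layer doesn't have it yet. */
    struct page *p = lookup_page(e->object, index);
    if (!p)
        p = copy_page_up(e->object, index,
                         lookup_page(e->object->backing, index));
    return p;
}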

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 14:31                 ` Jamie Lokier
@ 2000-05-26 14:38                   ` Stephen C. Tweedie
  2000-05-26 15:59                     ` Matthew Dillon
  2000-05-26 16:36                     ` Jamie Lokier
  2000-05-26 15:45                   ` Matthew Dillon
  1 sibling, 2 replies; 35+ messages in thread
From: Stephen C. Tweedie @ 2000-05-26 14:38 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Matthew Dillon, Rik van Riel, linux-mm

Hi,

On Fri, May 26, 2000 at 04:31:29PM +0200, Jamie Lokier wrote:

> You didn't mention it, but that leaves mremap.  This is a fiddly one!

Yes, we know this.  :-)

> mremaps that simply expand or shrink a segment are fine by themselves.
> mremaps that move a segment are fine by themselves.

No, they are not fine.  When you move a segment, you end up with pages
which have the same offset but are now at a different VA.  What that 
means is that you have no way of finding out, for a given physical page,
what the VA of all of the mappings of that page may be.  That means that
you have no way to find all of the ptes short of scanning all the vmas
in order.

--Stephen


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 13:15               ` Stephen C. Tweedie
@ 2000-05-26 14:31                 ` Jamie Lokier
  2000-05-26 14:38                   ` Stephen C. Tweedie
  2000-05-26 15:45                   ` Matthew Dillon
  0 siblings, 2 replies; 35+ messages in thread
From: Jamie Lokier @ 2000-05-26 14:31 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Matthew Dillon, Rik van Riel, linux-mm

Stephen C. Tweedie wrote:
> > Agreed.  I looked at that code though and it seemed very... large.
> > I think COW address_space gets the same results with less code.  Fast, too.
> > I know what I've got to do to prove it :-)
> 
> How will it deal with fork() cases where the child starts mprotecting
> arbitrary regions, so that you have completely independent vmas all
> sharing the same private pages?

Each VMA points to an address_space, and each private address_space can
have a parent.  Pages that aren't hashed in a private address space are
found in the parent's address space.

When a VMA is cloned for fork(), both VMAs share the same address_space,
which is now marked as requiring COW.  When you modify a page in either, a new
space is created which contains the modified pages and the appropriate
VMA refers to the new space.  Now if the mapping came from a file and there
were page modifications at all stages by everyone, you have a small tree of 4
address_spaces:

                      1 - underlying file
                      |
                      2 - privately modified pages from the file,
                     / \  shared by child & parent
                    /   \
pages only seen by 3     4 pages only seen by the child
the parent                          

The beauty here is that the sharing structure is quite explicit.

Note that stacked address_spaces are only created when they actually
contain pages, and page counters are used to collapse layers when
appropriate.
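
In struct form the idea is roughly this (the names are only illustrative,
this isn't a patch):

/* Sketch of the stacked, COW-able address_space idea above. */
struct page;
struct page_hash;
extern struct page *hash_lookup(struct page_hash *h, unsigned long index);

struct cow_address_space {
    struct cow_address_space *parent;  /* lookups fall through to here   */
    struct page_hash *pages;           /* pages private to this layer    */
    int need_cow;                      /* set on fork(), cleared on copy */
    int nr_pages;                      /* used to collapse empty layers  */
};

struct vma {
    struct cow_address_space *space;   /* which layer this vma sees      */
    struct vma *share_next;            /* vmas that COW as a single unit */
    unsigned long pgoff;
};

/* Page lookup walks up the parent chain until some layer owns the page. */
struct page *cow_lookup(struct cow_address_space *s, unsigned long index)
{
    for (; s; s = s->parent) {
        struct page *p = hash_lookup(s->pages, index);
        if (p)
            return p;
    }
    return 0;   /* not resident anywhere in the chain */
}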

mprotect & partial munmap are fine.  What happens here is that the VMAs
created by those functions refer to the same address_space -- this time
without COW semantics.  For this, all VMAs sharing an address_space that
COW as a single unit are linked together.  A modification to any one
that COWs its address_space updates all its linked VMAs.

You didn't mention it, but that leaves mremap.  This is a fiddly one!
mremaps that simply expand or shrink a segment are fine by themselves.
mremaps that move a segment are fine by themselves.  But the combination
can cause page offset values to duplicate for different pages.

So mremap needs a fixup to create new address_spaces in certain unusual
cases and rehash pages when that happens.  I don't think those cases
occur in any usual use of mremap.

thanks,
-- Jamie

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 11:22             ` Jamie Lokier
@ 2000-05-26 13:15               ` Stephen C. Tweedie
  2000-05-26 14:31                 ` Jamie Lokier
  0 siblings, 1 reply; 35+ messages in thread
From: Stephen C. Tweedie @ 2000-05-26 13:15 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Matthew Dillon, Rik van Riel, linux-mm

Hi,

On Fri, May 26, 2000 at 01:22:19PM +0200, Jamie Lokier wrote:
> 
> Agreed.  I looked at that code though and it seemed very... large.
> I think COW address_space gets the same results with less code.  Fast, too.
> I know what I've got to do to prove it :-)

How will it deal with fork() cases where the child starts mprotecting
arbitrary regions, so that you have completely independent vmas all
sharing the same private pages?

--Stephen

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 11:08           ` Stephen C. Tweedie
  2000-05-26 11:22             ` Jamie Lokier
@ 2000-05-26 12:04             ` Rik van Riel
  1 sibling, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2000-05-26 12:04 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Jamie Lokier, Matthew Dillon, linux-mm

On Fri, 26 May 2000, Stephen C. Tweedie wrote:
> On Thu, May 25, 2000 at 06:50:59PM +0200, Jamie Lokier wrote:
> > 
> > Fwiw, with COW address_spaces (I posted an article a couple of weeks ago
> > explaining) it should be fairly simple to find all the ptes for a given
> > page without the space overhead of pte chaining.
> 
> Davem's anon area stuff already implements a large chunk of what
> is needed.

It would be cool if somebody could take the time and implement
the rest of what's needed. I'm currently working at making page
aging and deferred swapout work, so we have the basic mechanisms
for aging the active pages and doing swapout from the inactive
queue.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 11:11 ` Stephen C. Tweedie
@ 2000-05-26 11:49   ` Rik van Riel
  0 siblings, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2000-05-26 11:49 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm, Matthew Dillon, Arnaldo Carvalho de Melo

On Fri, 26 May 2000, Stephen C. Tweedie wrote:
> On Wed, May 24, 2000 at 12:11:35PM -0300, Rik van Riel wrote:
> 
> > - try to keep about one second worth of allocations around in
> >   the inactive queue (we do 100 allocations/second -> at least
> >   100 inactive pages), we do this in order to:
> >   - get some aging in that queue (one second to be reclaimed)
> >   - have enough old pages around to free
> 
> Careful here.  If your box is running several Gig Ethernet
> interfaces, it could well be allocating 100s of MB of skbuffs
> every second, each allocation being very short-lived.  The rate
> of allocation is not a good indicator of memory load.  The rate
> of allocations which could not be satisfied immediately would be
> a far better metric.

Oh definitely. The number of pages taken off of the cache
queue (aka. scavenge list) per second, averaged over time,
will probably be the measure to use.
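
Probably something as simple as a decaying average will do; a sketch with
names and constants picked out of thin air:

/* Sketch: smoothed rate of pages taken off the inactive/scavenge queue. */
static unsigned long reclaims_this_second;   /* bumped on every reclaim   */
static unsigned long reclaim_rate_avg;       /* smoothed pages per second */

/* Called once a second (e.g. from the kswapd loop or a timer). */
static void update_reclaim_rate(void)
{
    /* new = 7/8 old + 1/8 of this second's sample */
    reclaim_rate_avg = reclaim_rate_avg - reclaim_rate_avg / 8
                     + reclaims_this_second / 8;
    reclaims_this_second = 0;
}

/* Target number of inactive pages: about one smoothed second's worth of
 * reclaims, with a floor so the queue never runs completely dry. */
static unsigned long inactive_target(void)
{
    return reclaim_rate_avg > 100 ? reclaim_rate_avg : 100;
}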

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-25 17:53             ` Matthew Dillon
@ 2000-05-26 11:38               ` Jamie Lokier
  0 siblings, 0 replies; 35+ messages in thread
From: Jamie Lokier @ 2000-05-26 11:38 UTC (permalink / raw)
  To: Matthew Dillon; +Cc: Rik van Riel, linux-mm

Matthew Dillon wrote:
>     When you scan by physical page, then locate the VM mappings for that
>     page, you have:
> 
> 	* a count of the number of mappings
> 	* a count of how many of those referenced the page since the
> 	  last check.
> 	* more determinism (see below)

>     When you scan by virtual page, then locate the physical mapping:
> 
> 	* you cannot tell how many other virtual mappings referenced the
> 	  page (short of checking, at which point you might as well be
> 	  scanning by physical page)

But you can at least accumulate the info.  If it's just a question of
bumping a decaying "page referenced recently" measure, then with virtual
scanning you bump it more often (if it's being used in many places);
with physical scanning you bump it more in one go.

> 	* you have no way of figuring out how many discrete physical pages
> 	  your virtual page scan has covered.  For all you know you could
> 	  scan 500 virtual mappings and still only have gotten through a
> 	  handful of physical pages.  Big problem!

Counters in struct page.

>     Another example of why physical page scanning is better then
>     virtual page scanning:  When there is memory pressure and you are
>     scanning by physical page, and the weight reaches 0, you can then
>     turn around and unmap ALL of its virtual pte's all at once (or mark
>     them read-only for a dirty page to allow it to be flushed).  Sure
>     you have to eat cpu to find those virtual pte's, but the end result
>     is a page which is now cleanable or freeable.

That's obviously a _big_ advantage under memory pressure.  A fast
feedback loop, so much more robust dynamics.

>     Now try this with a virtual scan:  You do a virtual scan, locate
>     a page you decide is idle, and then... what?  Unmap just that one
>     instance of the pte?  What about the others?  You would have to unmap
>     them too, which would cost as much as it would when doing a physical
>     page scan *EXCEPT* that you are running through a whole lot more virtual
>     pages during the virtual page scan to get the same effect as with
>     the physical page scan (when trying to locate idle pages).  It's
>     the difference between O(N) and O(N^2).  If the physical page queues
>     are reasonably well ordered, its the difference between O(1) and O(N^2).

[ Virtual is not O(N^2) if you've limited the number of ptes mapped, but I
agree the fixed overhead of that is rather high. ]

Conclusion: _Everyone_ agrees that physical page scanning is better in
every way except one: Linux 2.4 wants to be released :-)

But, given its conceptual simplicity, I wonder if it isn't worth trying
to implement physical scanning anyway?  It would simplify the paging
dynamics; I expect tuning etc. would be easier and more robust.  And
Rik's queues proposal is more or less orthogonal.  We need good queues
however pages are freed up.

-- Jamie

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-26 11:08           ` Stephen C. Tweedie
@ 2000-05-26 11:22             ` Jamie Lokier
  2000-05-26 13:15               ` Stephen C. Tweedie
  2000-05-26 12:04             ` Rik van Riel
  1 sibling, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2000-05-26 11:22 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Matthew Dillon, Rik van Riel, linux-mm

Stephen C. Tweedie wrote:
> > Fwiw, with COW address_spaces (I posted an article a couple of weeks ago
> > explaining) it should be fairly simple to find all the ptes for a given
> > page without the space overhead of pte chaining.
> 
> Davem's anon area stuff already implements a large chunk of what is needed.

Agreed.  I looked at that code though and it seemed very... large.
I think COW address_space gets the same results with less code.  Fast, too.
I know what I've got to do to prove it :-)

-- Jamie

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-24 15:11 Rik van Riel
  2000-05-24 22:44 ` Juan J. Quintela
@ 2000-05-26 11:11 ` Stephen C. Tweedie
  2000-05-26 11:49   ` Rik van Riel
  1 sibling, 1 reply; 35+ messages in thread
From: Stephen C. Tweedie @ 2000-05-26 11:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, Matthew Dillon, Stephen C. Tweedie, Arnaldo Carvalho de Melo

Hi,

On Wed, May 24, 2000 at 12:11:35PM -0300, Rik van Riel wrote:

> - try to keep about one second worth of allocations around in
>   the inactive queue (we do 100 allocations/second -> at least
>   100 inactive pages), we do this in order to:
>   - get some aging in that queue (one second to be reclaimed)
>   - have enough old pages around to free

Careful here.  If your box is running several Gig Ethernet interfaces,
it could well be allocating 100s of MB of skbuffs every second, each
allocation being very short-lived.  The rate of allocation is not a 
good indicator of memory load.  The rate of allocations which could
not be satisfied immediately would be a far better metric.

--Stephen

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-25 16:50         ` Jamie Lokier
  2000-05-25 17:17           ` Rik van Riel
@ 2000-05-26 11:08           ` Stephen C. Tweedie
  2000-05-26 11:22             ` Jamie Lokier
  2000-05-26 12:04             ` Rik van Riel
  1 sibling, 2 replies; 35+ messages in thread
From: Stephen C. Tweedie @ 2000-05-26 11:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Matthew Dillon, Rik van Riel, linux-mm, Stephen Tweedie

Hi,

On Thu, May 25, 2000 at 06:50:59PM +0200, Jamie Lokier wrote:
> 
> Fwiw, with COW address_spaces (I posted an article a couple of weeks ago
> explaining) it should be fairly simple to find all the ptes for a given
> page without the space overhead of pte chaining.

Davem's anon area stuff already implements a large chunk of what is needed.

--Stephen

* Re: [RFC] 2.3/4 VM queues idea
  2000-05-25 17:17           ` Rik van Riel
@ 2000-05-25 17:53             ` Matthew Dillon
  2000-05-26 11:38               ` Jamie Lokier
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Dillon @ 2000-05-25 17:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Jamie Lokier, linux-mm

    Another big difference is that when you scan by physical page, you
    can collect a whole lot of information together to help you make
    the decision on how to adjust the weight.

    When you scan by physical page, then locate the VM mappings for that
    page, you have:

	* a count of the number of mappings
	* a count of how many of those referenced the page since the
	  last check.
	* more determinism (see below)

    When you scan by virtual page, then locate the physical mapping:

	* you cannot tell how many other virtual mappings referenced the
	  page (short of checking, at which point you might as well be
	  scanning by physical page)

	* you have no way of figuring out how many discrete physical pages
	  your virtual page scan has covered.  For all you know you could
	  scan 500 virtual mappings and still only have gotten through a
	  handful of physical pages.  Big problem!

	* you have much less information available to make the decision on
	  how to adjust the weight.

:> How so?  You're only scanning currently mapped ptes, and one
:> goal is to keep that number small enough that you can gather
:> good LRU stats of page usage.
:
:Page aging may well be cheaper than continuously unmapping ptes
:(including tlb flushes and cache flushes of the page tables) and
:softfaulting them back in.

    It's definitely cheaper.  If you unmap a page and then have to
    take a page fault to get it back, the cost is going to be
    roughly 300 instructions plus other overhead.

    Another example of why physical page scanning is better than
    virtual page scanning:  When there is memory pressure and you are
    scanning by physical page, and the weight reaches 0, you can then
    turn around and unmap ALL of its virtual pte's all at once (or mark
    them read-only for a dirty page to allow it to be flushed).  Sure
    you have to eat cpu to find those virtual pte's, but the end result
    is a page which is now cleanable or freeable.

    Now try this with a virtual scan:  You do a virtual scan, locate
    a page you decide is idle, and then... what?  Unmap just that one
    instance of the pte?  What about the others?  You would have to unmap
    them too, which would cost as much as it would when doing a physical
    page scan *EXCEPT* that you are running through a whole lot more virtual
    pages during the virtual page scan to get the same effect as with
    the physical page scan (when trying to locate idle pages).  It's
    the difference between O(N) and O(N^2).  If the physical page queues
    are reasonably well ordered, it's the difference between O(1) and O(N^2).

					-Matt


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-25 16:50         ` Jamie Lokier
@ 2000-05-25 17:17           ` Rik van Riel
  2000-05-25 17:53             ` Matthew Dillon
  2000-05-26 11:08           ` Stephen C. Tweedie
  1 sibling, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2000-05-25 17:17 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Matthew Dillon, linux-mm

On Thu, 25 May 2000, Jamie Lokier wrote:
> Matthew Dillon wrote:
> >     Yes, but at an unreasonable cost:  Artificially limiting the number
> >     of discrete processes able to share a given amount of memory.  Any
> >     reasonable limit would still be an order of magnitude more expensive
> >     than a physical page scan.
> 
> I disagree.  The "amount of sharing" scan-time overhead is there
> whether you do a physical or virtual scan.  For a physical scan,
> if you have a lot of sharing then you look at a lot of ptes per
> page.

Not really. If there are enough inactive pages, the active
pages will never be scanned. And when the active pages get
scanned, chances are that only a few of them (the ones near
the "end" of the queue) need to be scanned.

With virtual scanning, OTOH, you'll need to scan all pages
since you have no "candidate list" of which pages are more
likely to be suitable candidates for swapout.

> One possible goal is to limit the total number of mapped ptes in
> the system.  You can still permit a lot of sharing: the number
> does not have to be limited per task or per mm.

No. The goal is to have an efficient memory management system
which supports running a lot of applications efficiently.

This idea would place an artificial limit on the number of
shared pages between processes. If your particular workload
needs more, you'll spend your time handling soft pagefaults
and finding pages to unmap, even though the machine has
enough physical memory to hold all the pages (and their
pte mappings) and there is no memory load.

If we do NOT have such an artificial restriction, then we
could end up with slightly more scanning overhead when we
have memory pressure, but at least the system would run
fine in situations where we do have enough memory.

> >     And even with limits you still wind up with extremely non-deterministic
> >     scanning.
> 
> How so?  You're only scanning currently mapped ptes, and one
> goal is to keep that number small enough that you can gather
> good LRU stats of page usage.

Page aging may well be cheaper than continuously unmapping ptes
(including tlb flushes and cache flushes of the page tables) and
softfaulting them back in.

> Fwiw, with COW address_spaces (I posted an article a couple of
> weeks ago explaining) it should be fairly simple to find all the
> ptes for a given page without the space overhead of pte
> chaining.

But what if you have a page which is 1) mapped in multiple
addresses by different apps  and 2) COW shared by subsets
of those multiple apps?

We still need pte chaining or something similar.
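
A note on terminology: "pte chaining" here just means keeping a per-page
list of the pte slots that map the page.  A minimal sketch of such a
layout, with invented names, might look like:

	/* Toy reverse-mapping layout: one chain node per pte that maps the
	 * page, so a COW page shared by several processes at different
	 * addresses can still find every pte pointing at it. */
	struct pte_chain {
		unsigned long *pte;          /* the pte slot mapping this page */
		struct pte_chain *next;
	};

	struct page_rmap {
		struct pte_chain *chain;     /* all ptes currently mapping the page */
	};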

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-25 16:18       ` Matthew Dillon
@ 2000-05-25 16:50         ` Jamie Lokier
  2000-05-25 17:17           ` Rik van Riel
  2000-05-26 11:08           ` Stephen C. Tweedie
  0 siblings, 2 replies; 35+ messages in thread
From: Jamie Lokier @ 2000-05-25 16:50 UTC (permalink / raw)
  To: Matthew Dillon; +Cc: Rik van Riel, linux-mm

Matthew Dillon wrote:
>     Yes, but at an unreasonable cost:  Artificially limiting the number
>     of discrete processes able to share a given amount of memory.  Any
>     reasonable limit would still be an order of magnitude more expensive
>     than a physical page scan.

I disagree.  The "amount of sharing" scan-time overhead is there whether
you do a physical or virtual scan.  For a physical scan, if you have a
lot of sharing then you look at a lot of ptes per page.

One possible goal is to limit the total number of mapped ptes in the
system.  You can still permit a lot of sharing: the number does not have
to be limited per task or per mm.

>     And even with limits you still wind up with extremely non-deterministic
>     scanning.

How so?  You're only scanning currently mapped ptes, and one goal is to
keep that number small enough that you can gather good LRU stats of page
usage.  So the set of mapped ptes at any time should reasonably reflect
current usage.  If current usage is chewing up a lot of pages, you
simply get a lot of page faults and very good LRU stats :-)  I don't see
how this is particularly non-deterministic.

>     I'm not advocating a physical page scan for the current linux kernel,
>     it just isn't designed for it, but I would argue that doing a physical
>     page scan should be considered after you get past your next release.

Agreed.  Fwiw, I rather like the idea of physical scanning.  A smaller,
simpler feedback loop should show much less pathological behaviour with
unusual loads.

Fwiw, with COW address_spaces (I posted an article a couple of weeks ago
explaining) it should be fairly simple to find all the ptes for a given
page without the space overhead of pte chaining.

-- Jamie
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-25  9:52     ` Jamie Lokier
@ 2000-05-25 16:18       ` Matthew Dillon
  2000-05-25 16:50         ` Jamie Lokier
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Dillon @ 2000-05-25 16:18 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Rik van Riel, linux-mm

:
:Matthew Dillon wrote:
:>     Virtual
:>     page scanning has severe scalability problems over physical page
:>     scanning.  For example, what happens when you have an oracle database
:>     running with a hundred independent (non-threaded) processes mapping
:>     300MB+ of shared memory?
:
:Actually you can make this scalable quite easily.   I think it's
:asymptotically equivalent to physical page scanning.
:
:First, ensure the async. unmapper can limit the number of mapped
:ptes.  Method: whenever the number of established ptes increases
:above a high water mark (e.g. due to a page fault), invoke the unmapper
:synchronously to push the number below a low water mark.  (Both marks
:can be the same).
:
:Second, make the scanner scale independently of the virtual addresses
:used.  Method: store boundary tags in the /unused/ ptes so that
:scanning skips unused ptes.  Ok, this can have fiddly interactions with
:not-present swap entries.
:
:In this way, the work required to scan _all_ mapped pages can be
:strictly bounded.
:
:cheers,
:-- Jamie

    Yes, but at an unreasonable cost:  Artificially limiting the number
    of discrete processes able to share a given amount of memory.  Any
    reasonable limit would still be an order of magnitude more expensive
    than a physical page scan.

    And even with limits you still wind up with extremely non-deterministic
    scanning.

    I'm not advocating a physical page scan for the current linux kernel,
    it just isn't designed for it, but I would argue that doing a physical
    page scan should be considered after you get past your next release.


					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-24 20:57   ` Matthew Dillon
  2000-05-24 22:44     ` Rik van Riel
@ 2000-05-25  9:52     ` Jamie Lokier
  2000-05-25 16:18       ` Matthew Dillon
  1 sibling, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2000-05-25  9:52 UTC (permalink / raw)
  To: Matthew Dillon; +Cc: Rik van Riel, linux-mm

Matthew Dillon wrote:
>     Virtual
>     page scanning has severe scalability problems over physical page
>     scanning.  For example, what happens when you have an oracle database
>     running with a hundred independent (non-threaded) processes mapping
>     300MB+ of shared memory?

Actually you can make this scalable quite easily.   I think it's
asymptotically equivalent to physical page scanning.

First, ensure the async. unmapper can limit the number of mapped
ptes.  Method: whenever the number of established ptes increases
above a high water mark (e.g. due to a page fault), invoke the unmapper
synchronously to push the number below a low water mark.  (Both marks
can be the same).

Second, make the scanner scale independently of the virtual addresses
used.  Method: store boundary tags in the /unused/ ptes so that
scanning skips unused ptes.  Ok, this can have fiddly interactions with
not-present swap entries.

In this way, the work required to scan _all_ mapped pages can be
strictly bounded.
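
A minimal sketch of the water-mark half of this scheme, with invented
names (nr_mapped_ptes, unmap_one_pte()) standing in for whatever the
real bookkeeping would be:

	/* Toy bookkeeping for the number of established ptes; names and
	 * thresholds are made up for illustration. */
	static long nr_mapped_ptes;
	static const long pte_high_mark = 100000;
	static const long pte_low_mark  = 100000;   /* both marks may be equal */

	int unmap_one_pte(void);    /* assumed: scanner unmaps one pte, 0 if none left */

	/* Called from the fault path after a new pte has been established. */
	void note_pte_mapped(void)
	{
		if (++nr_mapped_ptes <= pte_high_mark)
			return;

		/* Above the high mark: synchronously push the count back below
		 * the low mark before letting the faulting task continue. */
		while (nr_mapped_ptes > pte_low_mark) {
			if (!unmap_one_pte())
				break;
			nr_mapped_ptes--;
		}
	}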

cheers,
-- Jamie
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-24 22:44 ` Juan J. Quintela
@ 2000-05-24 23:32   ` Rik van Riel
  0 siblings, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2000-05-24 23:32 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: linux-mm, Matthew Dillon, Stephen C. Tweedie, Arnaldo Carvalho de Melo

On 25 May 2000, Juan J. Quintela wrote:

> we need to be able to write pages synchronously to disk if they are
> dirty, there are no free pages around, and we can sleep.

Indeed. Luckily this is covered in my design ;)

> Another question: how do you write the pages from the laundry list to
> disk if they are dirty pages, not dirty buffers?  You need to look at
> the ptes to be able to allocate a swap_entry.  Or am I missing something
> obvious here?

The swap entry is allocated at swap_out (page unmapping) time and
put in the page struct.

> I still have my problem: how do you write out a dirty page that has
> no swap_entry defined?

The solution is to not have such pages around. Allocating a swap
entry when we unmap the page is easy...
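
Roughly, "allocate the swap entry at unmap time and put it in the page
struct" could be sketched like this, using made-up helpers and fields
(swap_alloc(), swap_slot) rather than the real Linux interfaces:

	/* Toy types and helpers; everything here is invented for the sketch. */
	typedef long swp_slot_t;            /* 0 means "no swap slot assigned" */

	struct toy_page {
		int dirty;
		swp_slot_t swap_slot;       /* filled in at unmap time */
	};

	swp_slot_t swap_alloc(void);        /* assumed: hands out one free slot */

	/* Unmap-time hook: don't queue any I/O here; just pick the swap slot,
	 * remember it in the page, and mark the page dirty so the laundry /
	 * page-cleaner path can write it out later. */
	int unmap_and_tag_for_swap(struct toy_page *page)
	{
		if (!page->swap_slot) {
			page->swap_slot = swap_alloc();
			if (!page->swap_slot)
				return -1;  /* out of swap */
		}
		page->dirty = 1;
		return 0;
	}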

> I think that the design is quite right, but I have that problem
> right now with the current design: we skip over dirty pages in
> shrink_mmap because we don't know what to do with them, and I see
> the same problem here.

No we don't. We will move pages from the active list to the
inactive list regardless of whether they're clean or dirty.
And pages will stay on the inactive list either until they're
referenced by something or until they're reclaimed for something
else.

Once we end up with an inactive queue full of dirty pages,
we'll be syncing stuff to disk instead of putting memory pressure
on the active pages.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-24 15:11 Rik van Riel
@ 2000-05-24 22:44 ` Juan J. Quintela
  2000-05-24 23:32   ` Rik van Riel
  2000-05-26 11:11 ` Stephen C. Tweedie
  1 sibling, 1 reply; 35+ messages in thread
From: Juan J. Quintela @ 2000-05-24 22:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, Matthew Dillon, Stephen C. Tweedie, Arnaldo Carvalho de Melo

>>>>> "rik" == Rik van Riel <riel@conectiva.com.br> writes:

Hi 

rik> The goals of this design:
rik> - robustness, currently small changes to the design usually result
rik>   in breaking VM performance for everybody and there doesn't seem
rik>   to be any particular version of the VM subsystem which works right 
rik>   for everybody ... having a design that is more robust to both
rik>   changes and wildly variable workloads would be better
rik> - somewhat better page aging
rik> - making sure we always have a "buffer" around to do allocations
rik>   from
rik> - treat all pages in the system equally wrt. page aging and
rik> flushing

This point is important.  Right now we are having a *lot* of problems
with dirty pages from mmaped files: we find dirty buffers in
shrink_mmap, we have to skip them and wait for swap_out to find them,
but it can take a long time before swap_out does.  By then they are at
the end of the LRU list, and we keep finding more dirty pages, and so
on.  This process repeats until we start reaching the pages that have
been selected for swap_out and have buffer entries.  At that moment we
send a *lot* of async writes to the disk, causing the actual slowdowns
of up to 5 seconds we see here.

rik> - keeping changes to the code base simple, since we're already
rik>   at the 2.4-pre stage !!

rik> 	DESIGN IDEAS

rik> - have three LRU queues instead of the current one queue
rik>   - a cache/scavenge queue, which contains clean, unmapped pages
rik>   - a laundry queue, which contains dirty, unmapped pages
rik>   - an inactive queue, which contains both dirty and clean unmapped
rik>     pages
rik>   - an active queue, which contains active and/or mapped pages

That's four queues :)

rik> - keep a decaying average of the number of allocations per second
rik>   around
rik> - try to keep about one second worth of allocations around in
rik>   the inactive queue (we do 100 allocations/second -> at least
rik>   100 inactive pages), we do this in order to:
rik>   - get some aging in that queue (one second to be reclaimed)
rik>   - have enough old pages around to free
rik> - keep zone->pages_high of free pages + cache pages around,
rik>   with at least pages_min of really free pages for atomic
rik>   allocations   // FIXME: buddy fragmentation and defragmentation
rik> - pages_min can be a lot lower than what it is now, since we only
rik>   need to use pages from the free list for atomic allocations
rik> - non-atomic allocations take a free page if we have a lot of free
rik>   pages, they take a page from the cache queue otherwise
rik> - when the number of free+cache pages gets too low:
rik>   - scan the inactive queue
rik> 	- put clean pages on the cache list
rik> 	- put dirty pages on the laundry list
rik> 	- stop when we have enough cache pages
rik>   - the page cleaner will clean the dirty pages asynchronously
rik>     and put them on the cache list when they are clean
rik> 	- stop when we have no more dirty pages
rik> 	- if we have dirty pages, sync them to disk,
rik> 	  periodically scanning the list to see if
rik> 	  pages are clean now

we need to be able to write pages synchronously to disk if they are
dirty, there are no free pages around, and we can sleep.

Another question: how do you write the pages from the laundry list to
disk if they are dirty pages, not dirty buffers?  You need to look at
the ptes to be able to allocate a swap_entry.  Or am I missing something
obvious here?

rik> 	CODE CHANGES

rik> - try_to_swap_out() will no longer queue a swapout, but allocate
rik>   the swap entry and mark the page dirty
rik> - shrink_mmap() will be split into multiple functions
rik> 	- reclaim_cache() to reclaim pages from the cache list
rik> 	- kflushd (???) could get the task of laundering pages
rik> 	- reclaim_inactive() to move inactive pages to the cached
rik> 	  and dirty list
              (laundry list?)
rik> 	- refill_inactive(), which scans the active list to refill
rik> 	  the inactive list and calls swap_out() if needed
rik> 	- kswapd will refill the free list by freeing pages it
rik> 	  gets using reclaim_cache()
rik> - __alloc_pages() will call reclaim_cache() to fulfill non-atomic
rik>   allocations and do rmqueue() if:
rik> 	- we're dealing with an atomic allocation, or
rik> 	- we have "too many" free pages
rik> - if an inactive, laundry or cache page is faulted back into a
rik>   process, we reactivate the page, move the page to the active
rik>   list, adjust the statistics and wake up kswapd if needed

I still have my problem: how do you write out a dirty page that has
no swap_entry defined?

I think that the design is quite right, but I have that problem right
now with the current design: we skip over dirty pages in shrink_mmap
because we don't know what to do with them, and I see the same problem
here.

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-24 20:57   ` Matthew Dillon
@ 2000-05-24 22:44     ` Rik van Riel
  2000-05-25  9:52     ` Jamie Lokier
  1 sibling, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2000-05-24 22:44 UTC (permalink / raw)
  To: Matthew Dillon; +Cc: linux-mm

On Wed, 24 May 2000, Matthew Dillon wrote:

> :>        Two things can be done:  First, you collect a bunch of pages to be
> :>        laundered before issuing the I/O, allowing you to sort the I/O
> :>        (this is what you suggest in your design ideas email).  (p.p.s.
> :>        don't launder more than 64 or so pages at a time, doing so will just
> :>        stall other processes trying to do normal I/O).
> :> 
> :>        Second, you can locate other pages nearby the ones you've decided to
> :>        launder and launder them as well, getting the most out of the disk
> :>        seeking you have to do anyway.
> :
> :Virtual page scanning should provide us with some of these
> :benefits. Also, we'll allocate the swap entry at unmapping
> :time and can make sure to unmap virtually close pages at
> :the same time so they'll end up close to each other in the
> :inactive queue.
> :
> :This isn't going to be as good as it could be, but it's
> :probably as good as it can get without getting more invasive
> :with our changes to the source tree...
> 
>     Virtual page scanning will help with clustering, but unless you
>     already have a good page candidate to base your virtual scan on
>     you will not be able to *find* a good page candidate to base the
>     clustering around.  Or at least not find one easily.  Virtual
>     page scanning has severe scalability problems over physical page
>     scanning.  For example, what happens when you have an oracle database
>     running with a hundred independent (non-threaded) processes mapping
>     300MB+ of shared memory?

Ohhh definitely. It's just that coding up the administrative changes
required to support this would be too big a change for Linux 2.4...

>     So it can be a toss-up.  I don't think *anyone* (linux, freebsd, solaris,
>     or anyone else) has yet written the definitive swap allocation algorithm!

We still have some time. There's little chance of implementing it in
Linux before kernel version 2.5, so we should have some time left to
design the "definitive" algorithm.

For now I'll be focussing on having something decent in kernel 2.4,
we really need it to be better than 2.2. Keeping the virtual
scanning but combining it with a multi-queue system for the unmapped
pages (with all mapped pages residing in the active queue) should
at least provide us with a predictable, robust and moderately good
VM subsystem for the next stable kernel series.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-24 18:51 ` Rik van Riel
@ 2000-05-24 20:57   ` Matthew Dillon
  2000-05-24 22:44     ` Rik van Riel
  2000-05-25  9:52     ` Jamie Lokier
  0 siblings, 2 replies; 35+ messages in thread
From: Matthew Dillon @ 2000-05-24 20:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm

    Virtual page scanning will help with clustering, but unless you
    already have a good page candidate to base your virtual scan on
    you will not be able to *find* a good page candidate to base the
    clustering around.  Or at least not find one easily.  Virtual
    page scanning has severe scalability problems over physical page
    scanning.  For example, what happens when you have an oracle database
    running with a hundred independent (non-threaded) processes mapping
    300MB+ of shared memory?

    On the swap allocation -- I think there are several approaches to this
    problem, all equally viable.  If you do not do object clustering for
    pageouts then allocating the swap at unmap time is viable -- due to
    the time delay between unmap and the actual I/O & page-selection
    for cleaning, your pageouts will be slower but you *will* get locality
    of reference on your pageins (pageins will be faster).

    If you do object clustering then you get the benefit of both worlds.
    FreeBSD delays swap allocation until it actually decides to swap
    something, which means that it can take a collection of unrelated
    pages to be cleaned and assign contiguous swap to them.  This results
    in a very fast, deterministic pageout capability but, without clustering,
    there will be no locality of reference for pageins.  So pageins would
    be slow.

    With clustering, however, the at-swap-time allocation tends to have more
    locality of reference due to there being additional nearby pages 
    selected from the objects in the mix.  It still does not approach the
    performance you can get from an object-oriented swap allocation scheme,
    but at least it would no longer be considered 'slow'.

    So it can be a toss-up.  I don't think *anyone* (linux, freebsd, solaris,
    or anyone else) has yet written the definitive swap allocation algorithm!
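
    As an illustration of the trade-off, a toy version of the FreeBSD-style
    "assign contiguous swap at pageout time" idea might look like the
    following (invented helper, not actual FreeBSD code):

	/* Toy bump allocator handing out runs of consecutive swap slots. */
	static long next_free_slot = 1;

	/* Give n consecutive slots to a batch of pages being laundered.
	 * The pageout becomes one mostly-sequential write, but since the
	 * batch may contain unrelated pages, a later pagein of one
	 * process's working set can still be scattered over many runs. */
	void assign_contig_swap(long *slots, int n)
	{
		long base = next_free_slot;
		int i;

		next_free_slot += n;
		for (i = 0; i < n; i++)
			slots[i] = base + i;
	}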

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
  2000-05-24 16:16 Matthew Dillon
@ 2000-05-24 18:51 ` Rik van Riel
  2000-05-24 20:57   ` Matthew Dillon
  0 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2000-05-24 18:51 UTC (permalink / raw)
  To: Matthew Dillon; +Cc: linux-mm

On Wed, 24 May 2000, Matthew Dillon wrote:

>      All right!  I think your spec is coming together nicely!   The
>      multi-queue approach is the right way to go (for the same reason
>      FBsd took that approach).  The most important aspect of using
>      a multi-queue design is to *not* blow-off the page weighting tests
>      within each queue.  Having N queues alone is not fine enough granularity,
>      but having N queues and locating the lowest (in FreeBSD's case 0)
>      weighted pages within a queue is the magic of making it work well.
> 
>      I actually tried to blow off the weighting tests in FreeBSD, even just
>      a little, but when I did FreeBSD immediately started to stall as the
>      load increased.  Needless to say I threw away that patchset :-).

OK, I'll look at implementing this in Linux as well. Maybe it
won't work due to the virtual page scanning, but I'll look into
it and try a few things. This change should be relatively easy
to make.

>      I have three comments:
> 
>      * On the laundry list.  In FreeBSD 3.x we laundered pages as we went
>        through the inactive queue.   In 4.x I changed this to a two-pass
>        algorithm (vm_pageout_scan() line 674 vm/vm_pageout.c around the 
>        rescan0: label).  It tries to locate clean inactive pages in pass1,
>        and if there is still a page shortage (line 927 vm/vm_pageout.c,
>        the launder_loop conditional) we go back up and try again, this 
>        time laundering pages.
> 
>        There is also a heuristic prior to the first loop, around line 650
>        ('Figure out what to do with dirty pages...'), where it tries to 
>        figure out whether it is worth doing two passes or whether it should
>        just start laundering pages immediately.

Another good idea to implement. I don't know in how far it'll
interfere with the "age one second" idea though...
(maybe we want to have the inactive list "4 seconds big" so we
can implement this FreeBSD idea and still keep our second of aging?)

>      * On page aging.   This is going to be the most difficult item for you
>        to implement under linux.  In FreeBSD the PV entry mmu tracking 
>        structures make it fairly easy to scan *physical* pages then check
>        whether they've been used or not by locating all the pte's mapping them,
>        via the PV structures.  
> 
>        In linux this is harder to do, but I still believe it is the right
>        way to do it - that is, have the main page scan loop scan physical 
>        pages rather than virtual pages, for reasons I've outlined in previous
>        emails (fairness in the weighting calculation).
> 
>        (I am *not* advocating a PV tracking structure for linux.  I really 
>        hate the PV stuff in FBsd).

This is something for the 2.5 kernel. The changes involved in
doing this are just too invasive right now...

>      * On write clustering.  In a completely fair aging design, the pages
>        you extract for laundering will tend to appear to be 'random'.  
>        Flushing them to disk can be expensive due to seeking.
> 
>        Two things can be done:  First, you collect a bunch of pages to be
>        laundered before issuing the I/O, allowing you to sort the I/O
>        (this is what you suggest in your design ideas email).  (p.p.s.
>        don't launder more than 64 or so pages at a time, doing so will just
>        stall other processes trying to do normal I/O).
> 
>        Second, you can locate other pages nearby the ones you've decided to
>        launder and launder them as well, getting the most out of the disk
>        seeking you have to do anyway.

Virtual page scanning should provide us with some of these
benefits. Also, we'll allocate the swap entry at unmapping
time and can make sure to unmap virtually close pages at
the same time so they'll end up close to each other in the
inactive queue.

This isn't going to be as good as it could be, but it's
probably as good as it can get without getting more invasive
with our changes to the source tree...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: [RFC] 2.3/4 VM queues idea
@ 2000-05-24 16:16 Matthew Dillon
  2000-05-24 18:51 ` Rik van Riel
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Dillon @ 2000-05-24 16:16 UTC (permalink / raw)
  To: Rik van Riel

     All right!  I think your spec is coming together nicely!   The
     multi-queue approach is the right way to go (for the same reason
     FBsd took that approach).  The most important aspect of using
     a multi-queue design is to *not* blow-off the page weighting tests
     within each queue.  Having N queues alone is not fine enough granularity,
     but having N queues and locating the lowest (in FreeBSD's case 0)
     weighted pages within a queue is the magic of making it work well.

     I actually tried to blow off the weighting tests in FreeBSD, even just
     a little, but when I did FreeBSD immediately started to stall as the
     load increased.  Needless to say I threw away that patchset :-).


     I have three comments:

     * On the laundry list.  In FreeBSD 3.x we laundered pages as we went
       through the inactive queue.   In 4.x I changed this to a two-pass
       algorithm (vm_pageout_scan() line 674 vm/vm_pageout.c around the 
       rescan0: label).  It tries to locate clean inactive pages in pass1,
       and if there is still a page shortage (line 927 vm/vm_pageout.c,
       the launder_loop conditional) we go back up and try again, this 
       time laundering pages.

       There is also a heuristic prior to the first loop, around line 650
       ('Figure out what to do with dirty pages...'), where it tries to 
       figure out whether it is worth doing two passes or whether it should
       just start laundering pages immediately.

     * On page aging.   This is going to be the most difficult item for you
       to implement under linux.  In FreeBSD the PV entry mmu tracking 
       structures make it fairly easy to scan *physical* pages then check
       whether they've been used or not by locating all the pte's mapping them,
       via the PV structures.  

       In linux this is harder to do, but I still believe it is the right
       way to do it - that is, have the main page scan loop scan physical 
       pages rather than virtual pages, for reasons I've outlined in previous
       emails (fairness in the weighting calculation).

       (I am *not* advocating a PV tracking structure for linux.  I really 
       hate the PV stuff in FBsd).

     * On write clustering.  In a completely fair aging design, the pages
       you extract for laundering will tend to appear to be 'random'.  
       Flushing them to disk can be expensive due to seeking.

       Two things can be done:  First, you collect a bunch of pages to be
       laundered before issuing the I/O, allowing you to sort the I/O
       (this is what you suggest in your design ideas email).  (p.p.s.
       don't launder more than 64 or so pages at a time, doing so will just
       stall other processes trying to do normal I/O).

       Second, you can locate other pages nearby the ones you've decided to
       launder and launder them as well, getting the most out of the disk
       seeking you have to do anyway.

       The first item is important.  The second item will help extend the
       life of the system in a heavy-load environment by being able to
       sustain a higher pageout rate.

       In tests with FBsd, the nearby-write-clustering doubled the pageout
       rate capability under high disk load situations.  This is one of the
       main reasons why we do 'the weird two-level page scan' stuff.
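
       Putting the two suggestions together -- a clean-pages-first pass
       plus a bounded, offset-sorted laundering batch -- a user-space
       sketch could look like the following (toy page and queue
       structures, with assumed helpers reclaim_clean() and
       start_async_write()):

	#include <stdlib.h>

	#define LAUNDER_BATCH 64        /* don't launder more than ~64 at once */

	struct toy_page {
		int dirty;
		long disk_offset;       /* where a writeback of this page lands */
		struct toy_page *next;
	};

	void reclaim_clean(struct toy_page *p);      /* assumed helpers */
	void start_async_write(struct toy_page *p);

	static int cmp_offset(const void *a, const void *b)
	{
		const struct toy_page *x = *(struct toy_page * const *)a;
		const struct toy_page *y = *(struct toy_page * const *)b;

		return (x->disk_offset > y->disk_offset) -
		       (x->disk_offset < y->disk_offset);
	}

	/* Pass 1: reclaim clean inactive pages.  Pass 2, only if still short:
	 * collect at most LAUNDER_BATCH dirty pages, sort them by disk offset
	 * to cut down on seeking, and start the writes. */
	void shrink_inactive(struct toy_page *inactive, int shortage)
	{
		struct toy_page *p, *batch[LAUNDER_BATCH];
		int n = 0, i;

		for (p = inactive; p && shortage > 0; p = p->next)
			if (!p->dirty) {
				reclaim_clean(p);
				shortage--;
			}

		if (shortage <= 0)
			return;

		for (p = inactive; p && n < LAUNDER_BATCH; p = p->next)
			if (p->dirty)
				batch[n++] = p;

		qsort(batch, n, sizeof(batch[0]), cmp_offset);
		for (i = 0; i < n; i++)
			start_async_write(batch[i]);
	}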

       (ok to reprint this email too!)

						-Matt

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* [RFC] 2.3/4 VM queues idea
@ 2000-05-24 15:11 Rik van Riel
  2000-05-24 22:44 ` Juan J. Quintela
  2000-05-26 11:11 ` Stephen C. Tweedie
  0 siblings, 2 replies; 35+ messages in thread
From: Rik van Riel @ 2000-05-24 15:11 UTC (permalink / raw)
  To: linux-mm; +Cc: Matthew Dillon, Stephen C. Tweedie, Arnaldo Carvalho de Melo

Hi,

I've been talking with some people lately and have come up with
the following plan for short-term changes to the Linux VM subsystem.

The goals of this design:
- robustness, currently small changes to the design usually result
  in breaking VM performance for everybody and there doesn't seem
  to be any particular version of the VM subsystem which works right 
  for everybody ... having a design that is more robust to both
  changes and wildly variable workloads would be better
- somewhat better page aging
- making sure we always have a "buffer" around to do allocations
  from
- treat all pages in the system equally wrt. page aging and flushing
- keeping changes to the code base simple, since we're already
  at the 2.4-pre stage !!


	DESIGN IDEAS

- have three LRU queues instead of the current one queue
  - a cache/scavenge queue, which contains clean, unmapped pages
  - a laundry queue, which contains dirty, unmapped pages
  - an inactive queue, which contains both dirty and clean unmapped
    pages
  - an active queue, which contains active and/or mapped pages
- keep a decaying average of the number of allocations per second
  around
- try to keep about one second worth of allocations around in
  the inactive queue (we do 100 allocations/second -> at least
  100 inactive pages), we do this in order to:
  - get some aging in that queue (one second to be reclaimed)
  - have enough old pages around to free
- keep zone->pages_high of free pages + cache pages around,
  with at least pages_min of really free pages for atomic
  allocations   // FIXME: buddy fragmentation and defragmentation
- pages_min can be a lot lower than what it is now, since we only
  need to use pages from the free list for atomic allocations
- non-atomic allocations take a free page if we have a lot of free
  pages, they take a page from the cache queue otherwise
- when the number of free+cache pages gets too low:
  - scan the inactive queue
	- put clean pages on the cache list
	- put dirty pages on the laundry list
	- stop when we have enough cache pages
  - the page cleaner will clean the dirty pages asynchronously
    and put them on the cache list when they are clean
	- stop when we have no more dirty pages
	- if we have dirty pages, sync them to disk,
	  periodically scanning the list to see if
	  pages are clean now

(hmm, the page cleaning thing doesn't sound completely right ...
what should I change here?)
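
The "decaying average of allocations per second" bookkeeping and the
inactive-queue target derived from it could be sketched roughly as
follows; the names (alloc_rate, inactive_target()) and the 3/4
weighting factor are invented for illustration:

	/* Toy bookkeeping for the decaying allocation rate. */
	static unsigned long allocs_this_second;    /* bumped on every allocation */
	static unsigned long alloc_rate;            /* decaying allocations/second */

	/* Run from a once-a-second timer: fold the last second's count into
	 * the running average, weighting recent activity more heavily. */
	void update_alloc_rate(void)
	{
		alloc_rate = (3 * alloc_rate + allocs_this_second) / 4;
		allocs_this_second = 0;
	}

	/* Keep roughly one second worth of allocations on the inactive queue,
	 * so a page spends about a second there before it can be reclaimed. */
	unsigned long inactive_target(void)
	{
		return alloc_rate ? alloc_rate : 100;   /* arbitrary idle floor */
	}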


	CODE CHANGES

- try_to_swap_out() will no longer queue a swapout, but allocate
  the swap entry and mark the page dirty
- shrink_mmap() will be split into multiple functions
	- reclaim_cache() to reclaim pages from the cache list
	- kflushd (???) could get the task of laundering pages
	- reclaim_inactive() to move inactive pages to the cached
	  and dirty list
	- refill_inactive(), which scans the active list to refill
	  the inactive list and calls swap_out() if needed
	- kswapd will refill the free list by freeing pages it
	  gets using reclaim_cache()
- __alloc_pages() will call reclaim_cache() to fulfill non-atomic
  allocations and do rmqueue() if:
	- we're dealing with an atomic allocation, or
	- we have "too many" free pages
- if an inactive, laundry or cache page is faulted back into a
  process, we reactivate the page, move the page to the active
  list, adjust the statistics and wake up kswapd if needed
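
The allocation rule above (really free pages only for atomic
allocations or when free pages are plentiful, otherwise reclaim from
the cache queue) might be sketched like this, reusing the names from
the list (reclaim_cache(), rmqueue()) but with toy types standing in
for the real kernel structures:

	#include <stdbool.h>

	/* Toy zone with the two thresholds mentioned above. */
	struct toy_zone {
		long free_pages;
		long pages_min;         /* reserve kept for atomic allocations */
		long pages_high;        /* "plenty of free pages" threshold */
	};

	struct toy_page;                /* opaque here */

	struct toy_page *rmqueue(struct toy_zone *z);        /* assumed: take a free page */
	struct toy_page *reclaim_cache(struct toy_zone *z);  /* assumed: recycle a cache page */

	/* Non-atomic allocations normally recycle a clean cache page; only
	 * atomic allocations (which cannot sleep) or a genuinely full free
	 * list dip into the really-free pool. */
	struct toy_page *toy_alloc_pages(struct toy_zone *z, bool atomic)
	{
		if (atomic || z->free_pages > z->pages_high)
			return rmqueue(z);

		return reclaim_cache(z);
	}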

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/



Thread overview: 35+ messages
2000-05-24 19:37 [RFC] 2.3/4 VM queues idea Mark_H_Johnson
2000-05-24 20:35 ` Matthew Dillon
  -- strict thread matches above, loose matches on Subject: below --
2000-05-24 16:16 Matthew Dillon
2000-05-24 18:51 ` Rik van Riel
2000-05-24 20:57   ` Matthew Dillon
2000-05-24 22:44     ` Rik van Riel
2000-05-25  9:52     ` Jamie Lokier
2000-05-25 16:18       ` Matthew Dillon
2000-05-25 16:50         ` Jamie Lokier
2000-05-25 17:17           ` Rik van Riel
2000-05-25 17:53             ` Matthew Dillon
2000-05-26 11:38               ` Jamie Lokier
2000-05-26 11:08           ` Stephen C. Tweedie
2000-05-26 11:22             ` Jamie Lokier
2000-05-26 13:15               ` Stephen C. Tweedie
2000-05-26 14:31                 ` Jamie Lokier
2000-05-26 14:38                   ` Stephen C. Tweedie
2000-05-26 15:59                     ` Matthew Dillon
2000-05-26 16:36                     ` Jamie Lokier
2000-05-26 16:40                       ` Stephen C. Tweedie
2000-05-26 16:55                         ` Matthew Dillon
2000-05-26 17:05                           ` Jamie Lokier
2000-05-26 17:35                             ` Matthew Dillon
2000-05-26 17:46                               ` Stephen C. Tweedie
2000-05-26 17:02                         ` Jamie Lokier
2000-05-26 17:15                           ` Stephen C. Tweedie
2000-05-26 20:41                             ` Jamie Lokier
2000-05-28 22:42                               ` Stephen Tweedie
2000-05-26 15:45                   ` Matthew Dillon
2000-05-26 12:04             ` Rik van Riel
2000-05-24 15:11 Rik van Riel
2000-05-24 22:44 ` Juan J. Quintela
2000-05-24 23:32   ` Rik van Riel
2000-05-26 11:11 ` Stephen C. Tweedie
2000-05-26 11:49   ` Rik van Riel
