linux-mm.kvack.org archive mirror
* Re: [RFC] 2.3/4 VM queues idea
@ 2000-05-24 19:37 Mark_H_Johnson
  2000-05-24 20:35 ` Matthew Dillon
  0 siblings, 1 reply; 35+ messages in thread
From: Mark_H_Johnson @ 2000-05-24 19:37 UTC (permalink / raw)
  To: riel; +Cc: acme, dillon, linux-mm, sct

I'll try to combine comments on the VM queues & Matt Dillon's material in
one response. I've added some analysis at the end - yes the reference is
OLD, but the math is still valid. The bottom line of my comments - make
sure the goals are right, put some measures into place so we can determine
"success", & build the solution based on sound principles. Also, thanks to
all who are working to make the VM system in Linux better.

Re: Goals
 - Robust design - absolutely essential. VM is one of those key
capabilities that must be done right or not at all.
 - page aging and "buffer for allocations" - not sure if I care which
methods we use as long as they work well. To me, the goal of VM is to have
the "right" pages in memory when the CPU executes an instruction. Page
aging and free lists are two methods, lookahead and clustered reads &
writes are others.
 - treat pages equally - I think I disagree with both you and Matt on this
one. We have different usage patterns for different kinds of data (e.g.
execution of code tends to be localized but not sequential vs. sequential
read of data in a file) & should have a means of distinguishing between
them. This does not mean that one algorithm won't do a good job for both
the VM & buffer cache, just recognize that we should have ways to treat
them differently. See my comments on "stress cases" below for my rationale.
 - simple changes - I agree, get this issue settled & move on.
To these goals I might add
 - predictable performance [essential if I deploy any large (or real time)
system on Linux]
 - no more automatic process kills - give the system [and/or operator] the
ability to recover without terminating jobs
 - make it adjustable by the user or system administrator for different
workloads
 - durable - works well with varied CPU, memory, and disk performance
A related item - measure whether the "goals have been met". How do we know
that method A works better than method B without some hooks to measure the
performance?

A final comment on goals - I honestly don't mind systems that swap - IF
they do a better job of running the active jobs as a result. The FUD that
Matt refers to on FreeBSD swapping may actually mean that FreeBSD runs
better than Linux. I can't tell, and it probably varies by application area
anyway.

Re: Design ideas [with ties into Matt's material]
 - not sure what the 3 [4?] queues are doing for us; I can't tie the queue
concept to how we are meeting one of the goals above.
 - allocations / second as a parameter to adjust free pages. Sounds OK, may
want a scalable factor here with min & max limits to free pages. For
example, in a real time system I want no paging - that doesn't mean that I
want no free pages. A burst of activity could occur where I need to do an
allocation & don't want to wait for a disk write before taking action
[though that usually means MY system design is broken...].
 - not sure of the distinction between "atomic" and "non-atomic" allocations.
Please relate this to meeting the goals above, or add a goal.
 - not clear how the inactive queue scan & page cleaner methods meet the
goals either.
 - [Matt] real statistics - I think this is part of the data needed to
determine if A is better than B.
 - [Matt] several places talk about LRU or not-strict LRU - LRU is a good
algorithm but has overhead involved. You may find that a simple FIFO or
clock algorithm gets >90% of the benefit of LRU without the overhead [can
be a net gain, something to consider; see the sketch after this list].
 - [Matt] relate scan rate to perceived memory load. Looks OK if the
overhead is worth the investment.
 - [Matt] moving balance points & changing algorithms [swap, not page].
Ditto.
 - [Matt] adjustments to initial weight. Ditto - relates directly to the
"right page" goal.
See the analysis below to get an idea of what I mean by "overhead is worth
the investment". It looks like we can spend a lot of CPU cycles to save one
disk read or write and still run the system efficiently.
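
To make the FIFO/clock comment above concrete, here is a minimal clock
(second-chance) victim selector. This is only an illustration of the
general technique - not Linux or FreeBSD code - and the names (page_slot,
clock_pick_victim) are made up for the example:

#include <stddef.h>

struct page_slot {
    int referenced;    /* set whenever the page is touched */
    int in_use;
};

/* Sweep a clock hand over the slots: a referenced page gets a second
 * chance (bit cleared, hand moves on); the first unreferenced in-use page
 * is the victim.  Assumes at least one slot is in use. */
static size_t clock_pick_victim(struct page_slot *slots, size_t nslots,
                                size_t *hand)
{
    for (;;) {
        struct page_slot *p = &slots[*hand];
        size_t victim = *hand;

        *hand = (*hand + 1) % nslots;
        if (!p->in_use)
            continue;
        if (p->referenced) {
            p->referenced = 0;     /* give it a second chance */
            continue;
        }
        return victim;             /* not touched since the last sweep */
    }
}

The only per-page state is a referenced bit, which is why this costs so
much less to maintain than a strict LRU ordering.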

The remainder of Matt's message was somewhat harder to read - some mixture
of stress cases & methods used to implement VM. I've reorganized what he
said to characterize this somewhat more clearly [at least in my mind]...

  Stress cases [characterized as sequential/not, read-only or read/write,
shared or not, locked or not]. I tried to note the application for each case
as well as the general technique that could be applied to memory
management.
  - sequential read only access [file I/O or mmap'd file, use once &
discard]
  - sequential read/write access [file updates or mmap'd file, use once &
push to file on disk]
  - non-sequential read or execute access, not shared [typical executable,
locality of reference; when memory is tight, can discard & refresh from
disk]
  - non-sequential read/write access, not shared [stack or heap; locality
of reference, must push to file or swap when memory is tight]
  - read only shared access [typical shared library, higher "apparent cost"
if not in memory, can discard & refresh from disk]
  - read/write shared access [typical memory mapped file or shared memory
region, higher apparent cost if not in memory, has cache impact on multi
processors, must push to file or swap when memory is tight]
  - locked pages for I/O or real time applications [fixed in memory due to
system constraints, VM system must (?) leave alone, cannot (?) move in
memory, not "likely" needed on disk (?)]. I had a crazy idea as I was
writing this one - application is to capture a stream of data from the
network onto disk. The user mmap's a file & locks into memory to ensure
bandwidth can be met, fills region with data from network, unlocks & sync's
the file when transfer is done. How would FreeBSD/Linux handle that case?
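
For concreteness, the capture sequence I have in mind looks roughly like
this - standard POSIX calls, with the file name, the 64 MB size, and the
memset() standing in for the real network receive loop all made up, and
assuming the process is allowed to mlock() that much memory:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CAPTURE_SIZE (64 * 1024 * 1024)    /* made-up capture window size */

int main(void)
{
    int fd = open("capture.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, CAPTURE_SIZE) < 0)
        return 1;

    char *buf = mmap(NULL, CAPTURE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* pin the region so page faults cannot stall the capture */
    if (mlock(buf, CAPTURE_SIZE) < 0)
        return 1;

    /* ... fill buf from the network; memset is just a stand-in ... */
    memset(buf, 0, CAPTURE_SIZE);

    /* transfer done: unpin, then push the dirty pages out to the file */
    munlock(buf, CAPTURE_SIZE);
    msync(buf, CAPTURE_SIZE, MS_SYNC);
    munmap(buf, CAPTURE_SIZE);
    close(fd);
    return 0;
}

The interesting window for the VM is right at the end: between munlock()
and msync() there are suddenly thousands of dirty, no-longer-locked pages
that the pageout code has to deal with all at once.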

I expect there are other combinations possible - I just tried to
characterize the kind of memory usage patterns that are typical. It may
also be good to look at some real application areas - interactive X
applications, application compile/link, database updates, web server, real
time, streaming data, and so on; measure how they stress the system & build
the VM system to handle the varied loads - automatically if you can but
with help from the user/system administrator if you need it.

  Methods [recap of Matt's list]
  - write clustering - a good technique used on many systems
  - read clustering - ditto, especially the note about priority reduction on
pages already read
  - sequential access detection - ditto
  - sequential write detection - ditto, would be especially helpful with
some of our shared memory test programs that currently stress Linux to the
point of process kills; think of sequential access to memory mapped files in the
same way as sequential file I/O. They should have similar performance
characteristics if we do this right [relatively small resident set sizes,
high throughput].

Some analysis of the situation

I took a few minutes to go through some of my old reference books. Here are
some concepts that might help from "Operating System Principles" by Per
Brinch Hansen (1973), pp 182-191.

Your disk can supply at most one page every T seconds, a memory access takes
t seconds, and your program demands pages at a rate p(s), where s is the
ratio of resident size to total size of the process. The processor thus
demands one page about every 1/p(s) memory accesses, i.e. every t/p(s)
seconds. There are three situations:
 - disk is idle, p(s) < t/T. Generally OK, desired if you need to keep
latency down.
 - CPU is idle, p(s) > t/T. Generally bad, leads to thrashing. May indicate
you need more memory or faster disks for your application.
 - balanced system, p(s) = t/T. Full utilization, may lead to latency
increases.
Let's feed some current performance numbers & see where it leads.

A disk access (T) is still a few milliseconds and a memory access (t) is
well under a microsecond (let's use T=5E-3 and t=5E-8 to keep the math
simple). The ratio t/T is thus 1/100000 - that means one disk page access
per 100000 memory accesses. Check the math: at a 20 MHz memory access rate
(2E7 accesses per second), the balanced demand is 200 disk pages per second
(perhaps high?). Note that many of our stress tests have much higher ratios
than this. I suggest a few real measurements to get ratios for real systems
rather than my made-up numbers.
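
If anyone wants to redo this calculation with measured numbers, it fits in
a few lines of C (T and t below are just my assumed values from above):

#include <stdio.h>

int main(void)
{
    double T = 5e-3;    /* seconds per disk page transfer (assumed) */
    double t = 5e-8;    /* seconds per memory access (assumed, ~20 MHz) */

    double balance  = t / T;            /* p(s) at the balance point: 1e-5 */
    double accesses = 1.0 / t;          /* memory accesses per second: 2e7 */
    double pages    = accesses * balance;   /* disk pages/second at balance */

    printf("balance point p(s) = %g (one fault per %.0f accesses)\n",
           balance, 1.0 / balance);
    printf("disk pages per second at balance = %.0f\n", pages);
    return 0;
}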

Even if I'm off by 10x or 100x, this kind of ratio means that many of the
methods Matt describes make sense. It helps justify why spending time to
do read/write clusters pays off - it gets you more work done for the cost
of each disk access (T). It helps justify why detecting sequential access &
limiting resident set sizes in those cases is OK [discard pages that will
never be used again]. It shows how some of the measures Matt has described
(e.g., page weights & perceived memory load) help determine how we should
proceed.

The book goes on to talk about replacement algorithms. FIFO or
approximations to LRU can cause up to 2-3x more pages to be transferred
than the "ideal algorithm" (one that knows in advance the best choice).
This results in a 12-24% reduction of working set size but improves
utilization only a few percent. There is also a discussion of transfer
algorithms, basically a combination of elevator and/or clustering to reduce
latency & related overhead. I think Matt has covered better techniques than
were in this old book. I did have to chuckle at the biggest gain - fix
the application to improve locality - a problem then & still one today.
Small tools work better than big ones. Perhaps that's a reason to combat
code bloat in the tools we use.

Closing
First, let's make sure we agree what we really need from the VM system
(goals).
Second, put a few measurements in place so we can determine whether we've
succeeded.
Third, implement solutions that will get us there.

Don't forget to give thanks for all the hard work that has gone into what
we have today & will have tomorrow.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

* Re: [RFC] 2.3/4 VM queues idea
@ 2000-05-24 16:16 Matthew Dillon
  2000-05-24 18:51 ` Rik van Riel
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Dillon @ 2000-05-24 16:16 UTC (permalink / raw)
  To: Rik van Riel

     All right!  I think your spec is coming together nicely!   The
     multi-queue approach is the right way to go (for the same reason
     FBsd took that approach).  The most important aspect of using
     a multi-queue design is to *not* blow off the page weighting tests
     within each queue.  Having N queues alone is not fine enough granularity,
     but having N queues and locating the lowest (in FreeBSD's case 0)
     weighted pages within a queue is the magic of making it work well.

     I actually tried to blow off the weighting tests in FreeBSD, even just
     a little, but when I did FreeBSD immediately started to stall as the
     load increased.  Needless to say I threw away that patchset :-).


     I have three comments:

     * On the laundry list.  In FreeBSD 3.x we laundered pages as we went
       through the inactive queue.   In 4.x I changed this to a two-pass
       algorithm (vm_pageout_scan() line 674 vm/vm_pageout.c around the 
       rescan0: label).  It tries to locate clean inactive pages in pass1,
       and if there is still a page shortage (line 927 vm/vm_pageout.c,
       the launder_loop conditional) we go back up and try again, this 
       time laundering pages.

       There is also a heuristic prior to the first loop, around line 650
       ('Figure out what to do with dirty pages...'), where it tries to 
       figure out whether it is worth doing two passes or whether it should
       just start laundering pages immediately.
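
       (To make the two-pass idea concrete, the loop below is a toy
       rendering of it - not the real vm_pageout_scan() code, and the
       struct and names are made up purely for illustration.  Pass 1 only
       reclaims clean inactive pages; pass 2 runs only if that was not
       enough and starts laundering the dirty ones, which become
       reclaimable once their writes complete.)

       #include <stdio.h>

       struct toy_page { int dirty; int reclaimed; int laundering; };

       static int scan_inactive(struct toy_page *q, int n, int needed)
       {
           int freed = 0;

           for (int pass = 0; pass < 2 && freed < needed; pass++) {
               for (int i = 0; i < n && freed < needed; i++) {
                   if (q[i].reclaimed || q[i].laundering)
                       continue;
                   if (!q[i].dirty) {
                       q[i].reclaimed = 1;    /* clean: reclaim right away */
                       freed++;
                   } else if (pass == 1) {
                       q[i].laundering = 1;   /* shortage persists: launder */
                   }
               }
           }
           return freed;
       }

       int main(void)
       {
           struct toy_page q[8] = { {1,0,0}, {0,0,0}, {1,0,0}, {0,0,0},
                                    {1,0,0}, {1,0,0}, {0,0,0}, {1,0,0} };
           printf("reclaimed %d clean pages\n", scan_inactive(q, 8, 6));
           return 0;
       }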

     * On page aging.   This is going to be the most difficult item for you
       to implement under linux.  In FreeBSD the PV entry mmu tracking 
       structures make it fairly easy to scan *physical* pages then check
       whether they've been used or not by locating all the pte's mapping them,
       via the PV structures.  

       In linux this is harder to do, but I still believe it is the right
       way to do it - that is, have the main page scan loop scan physical 
       pages rather than virtual pages, for reasons I've outlined in previous
       emails (fairness in the weighting calculation).

       (I am *not* advocating a PV tracking structure for linux.  I really 
       hate the PV stuff in FBsd).

     * On write clustering.  In a completely fair aging design, the pages
       you extract for laundering will tend to appear to be 'random'.  
       Flushing them to disk can be expensive due to seeking.

       Two things can be done:  First, you collect a bunch of pages to be
       laundered before issuing the I/O, allowing you to sort the I/O
       (this is what you suggest in your design ideas email).  (p.p.s.
       don't launder more than 64 or so pages at a time; doing so will just
       stall other processes trying to do normal I/O).

       Second, you can locate other pages nearby the ones you've decided to
       launder and launder them as well, getting the most out of the disk
       seeking you have to do anyway.

       The first item is important.  The second item will help extend the
       life of the system in a heavy-load environment by being able to
       sustain a higher pageout rate.  

       In tests with FBsd, the nearby-write-clustering doubled the pageout
       rate capability under high disk load situations.  This is one of the
       main reasons why we do 'the weird two-level page scan' stuff.
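
       (In pseudo-C, the first point boils down to something like this -
       hypothetical types and names, just to show the batch-and-sort shape;
       the 64-page cap is the limit mentioned above.)

       #include <stdlib.h>

       #define LAUNDER_BATCH 64

       struct launder_req {
           unsigned long disk_block;    /* location on the backing store */
           void *page;
       };

       static int cmp_block(const void *a, const void *b)
       {
           const struct launder_req *x = a, *y = b;
           if (x->disk_block < y->disk_block) return -1;
           if (x->disk_block > y->disk_block) return  1;
           return 0;
       }

       /* Caller collects up to a batch of pages to launder; sorting by
        * disk block before issuing the writes turns a random scatter of
        * dirty pages into nearly sequential I/O for the elevator. */
       static void flush_launder_batch(struct launder_req *reqs, int n,
                                       void (*write_page)(struct launder_req *))
       {
           if (n > LAUNDER_BATCH)
               n = LAUNDER_BATCH;
           qsort(reqs, n, sizeof(*reqs), cmp_block);
           for (int i = 0; i < n; i++)
               write_page(&reqs[i]);
       }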

       (ok to reprint this email too!)

						-Matt


* [RFC] 2.3/4 VM queues idea
@ 2000-05-24 15:11 Rik van Riel
  2000-05-24 22:44 ` Juan J. Quintela
  2000-05-26 11:11 ` Stephen C. Tweedie
  0 siblings, 2 replies; 35+ messages in thread
From: Rik van Riel @ 2000-05-24 15:11 UTC (permalink / raw)
  To: linux-mm; +Cc: Matthew Dillon, Stephen C. Tweedie, Arnaldo Carvalho de Melo

Hi,

I've been talking with some people lately and have come up with
the following plan for short-term changes to the Linux VM subsystem.

The goals of this design:
- robustness: currently, small changes to the design usually result
  in breaking VM performance for everybody, and there doesn't seem
  to be any particular version of the VM subsystem which works right
  for everybody. Having a design that is more robust to both
  changes and wildly variable workloads would be better
- somewhat better page aging
- making sure we always have a "buffer" around to do allocations
  from
- treat all pages in the system equally wrt. page aging and flushing
- keeping changes to the code base simple, since we're already
  at the 2.4-pre stage !!


	DESIGN IDEAS

- have three LRU queues instead of the current one queue
  - a cache/scavenge queue, which contains clean, unmapped pages
  - a laundry queue, which contains dirty, unmapped pages
  - an inactive queue, which contains both dirty and clean unmapped
    pages
  - an active queue, which contains active and/or mapped pages
- keep a decaying average of the number of allocations per second
  around (see the sketch below)
- try to keep about one second worth of allocations around in
  the inactive queue (we do 100 allocations/second -> at least
  100 inactive pages), we do this in order to:
  - get some aging in that queue (one second to be reclaimed)
  - have enough old pages around to free
- keep zone->pages_high of free pages + cache pages around,
  with at least pages_min of really free pages for atomic
  allocations   // FIXME: buddy fragmentation and defragmentation
- pages_min can be a lot lower than what it is now, since we only
  need to use pages from the free list for atomic allocations
- non-atomic allocations take a free page if we have a lot of free
  pages, they take a page from the cache queue otherwise
- when the number of free+cache pages gets too low:
  - scan the inactive queue
	- put clean pages on the cache list
	- put dirty pages on the laundry list
	- stop when we have enough cache pages
  - the page cleaner will clean the dirty pages asynchronously
    and put them on the cache list when they are clean
	- stop when we have no more dirty pages
	- if we have dirty pages, sync them to disk,
	  periodically scanning the list to see if
	  pages are clean now

(hmm, the page cleaning thing doesn't sound completely right ...
what should I change here?)
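
To pin down the decaying average and the one-second inactive target, here
is a rough userspace sketch. The 0.75 decay factor, the once-per-second
sampling and the 32-page floor are assumptions for illustration only, not
part of the proposal:

#include <stdio.h>

static double alloc_rate_avg;    /* decaying average of allocations/second */

/* called once per second with the number of allocations seen in that
 * interval; recent activity is weighted at 25% */
static void update_alloc_rate(unsigned long allocs_this_second)
{
    alloc_rate_avg = 0.75 * alloc_rate_avg + 0.25 * (double)allocs_this_second;
}

/* target size of the inactive queue: about one second worth of
 * allocations, with a small floor so it never shrinks to nothing */
static unsigned long inactive_target(void)
{
    unsigned long target = (unsigned long)(alloc_rate_avg + 0.5);
    return target > 32 ? target : 32;
}

int main(void)
{
    unsigned long samples[] = { 50, 200, 400, 100 };

    for (int i = 0; i < 4; i++) {
        update_alloc_rate(samples[i]);
        printf("avg %.1f allocs/sec -> inactive target %lu pages\n",
               alloc_rate_avg, inactive_target());
    }
    return 0;
}

Nothing here is kernel code; it is just the shape of the bookkeeping.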


	CODE CHANGES

- try_to_swap_out() will no longer queue a swapout, but allocate
  the swap entry and mark the page dirty
- shrink_mmap() will be split into multiple functions
	- reclaim_cache() to reclaim pages from the cache list
	- kflushd (???) could get the task of laundering pages
	- reclaim_inactive() to move inactive pages to the cache
	  and laundry lists
	- refill_inactive(), which scans the active list to refill
	  the inactive list and calls swap_out() if needed
	- kswapd will refill the free list by freeing pages it
	  gets using reclaim_cache()
- __alloc_pages() will call reclaim_cache() to fulfill non-atomic
  allocations and do rmqueue() if:
	- we're dealing with an atomic allocation, or
	- we have "too many" free pages
- if an inactive, laundry or cache page is faulted back into a
  process, we reactivate the page, move the page to the active
  list, adjust the statistics and wake up kswapd if needed
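
Schematically, the __alloc_pages() rule above comes down to the toy code
below (not a patch - the zone structure, the helpers and the dummy page are
all made up for illustration):

#include <stdbool.h>
#include <stddef.h>

struct toy_zone {
    unsigned long free_pages, cache_pages, pages_min, pages_high;
};

static char dummy_page[4096];    /* stand-in for a real page frame */

static void *take_free_page(struct toy_zone *z)
{
    z->free_pages--;
    return dummy_page;
}

static void *take_cache_page(struct toy_zone *z)
{
    z->cache_pages--;
    return dummy_page;
}

static void *toy_alloc_page(struct toy_zone *z, bool atomic)
{
    /* atomic allocations never sleep and only touch the free list */
    if (atomic)
        return z->free_pages > 0 ? take_free_page(z) : NULL;

    /* lots of free pages: just take one (the rmqueue() case) */
    if (z->free_pages > z->pages_high)
        return take_free_page(z);

    /* otherwise reclaim from the cache/scavenge queue */
    if (z->cache_pages > 0)
        return take_cache_page(z);

    /* nothing cached either; a real implementation would wake kswapd
     * and/or wait here instead of dipping below pages_min */
    return z->free_pages > z->pages_min ? take_free_page(z) : NULL;
}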

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/



