From: Rik van Riel <riel@conectiva.com.br>
To: linux-mm@kvack.org
Cc: Matthew Dillon <dillon@apollo.backplane.com>
Subject: discussion with Matt Dillon
Date: Wed, 24 May 2000 13:02:57 -0300 (BRST) [thread overview]
Message-ID: <Pine.LNX.4.21.0005241256270.24993-170000@duckman.distro.conectiva> (raw)
[-- Attachment #1: Type: TEXT/PLAIN, Size: 854 bytes --]
Hi,
over the last few days I have had a very interesting discussion
about virtual memory with Matt Dillon. It would be cool
if people here could read these emails and give us comments,
reactions, questions, ... about them.
I believe it would be a good thing for Linux to have a
whole bunch of people who understand virtual memory so
we can keep an eye on each other and make sure something
good gets built.
Also, I'd *really* appreciate some comments on the 2.3/4
VM plan I sent to the list earlier today. I want to start
implementing it RSN (as in, after lunch) so I need to know
if I missed any critical design issues...
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
[-- Attachment #2.1: Linux VM/IO balancing (fwd to linux-mm?) (fwd) --]
[-- Type: MESSAGE/RFC822, Size: 17225 bytes --]
Return-Path: <dillon@apollo.backplane.com>
Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156])
by duckman.distro.conectiva (8.9.3/8.8.7) with ESMTP id DAA18483
for <riel@duckman.distro.conectiva>; Tue, 23 May 2000 03:32:33 -0300
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
by perninha.conectiva.com.br (8.9.3/8.9.3) with ESMTP id DAA00990
for <riel@conectiva.com.br>; Tue, 23 May 2000 03:32:56 -0300
Received: (from dillon@localhost)
by apollo.backplane.com (8.9.3/8.9.1) id XAA64159;
Mon, 22 May 2000 23:32:20 -0700 (PDT)
(envelope-from dillon)
Date: Mon, 22 May 2000 23:32:20 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200005230632.XAA64159@apollo.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Subject: Linux VM/IO balancing (fwd to linux-mm?)
I sent this to Alan who suggested that I send this to you! I've
fleshed it out a bit more from the version I sent to Alan.
I've been following the linux VM/memory subsystem issues closely and
I have a few suggestions on how to clean it up. Unfortunately,
being FreeBSD-centric these days, I do not want to create controversy.
But at this point I think even linux developers know that the
algorithms being used in the linux kernel are struggling
with the issue. The time may be ripe for some outside input (though I
will note that I was heavy into linux in earlier years, as Alan's
mail archives attest to!).
What I do below is essentially describe the FreeBSD VM/memory subsystem
from an algorithmic point of view, minus some of the fluff. It is very
straightforward in concept and the benefits should be obvious to anyone
who has worked on the VM system. I think it would be a fine blueprint
to use in linux.
I would like you to consider posting this to linux-kernel. I leave it
up to you -- I think if I were to post it independently it would simply
result in controversy and seem more like a FreeBSD vs Linux thing rather
than a 'how to fix the VM system' thing. But if you post it
and preface it appropriately, the result might be better. So I am
sending it to you. It may help kickstart creating a formal algorithmic
blueprint rather than hacking up patches that solve one problem only
to create other problems.
Please feel free to edit it in any way if you think you can reduce
potential us-vs-them issues even more. I think there is a chance here
to get the whole of the linux developer community on the same page
in regards to the memory-load issue.
I didn't intend to turn this into a paper, but I've spent so much time
describing it that I think I am going to submit it to daemon news in
a cleaned-up form :-).
Thanks!
-Matt
---------
Even though I don't do much coding for Linux these days, I've always
avidly tracked the kernel mailing list. I've been tracking the memory
balancing thread for a while now and, having dealt with similar issues
in FreeBSD, I believe I can offer a suggestion.
First of all, I am not trying to turn this into a comparison or anything.
Don't think of the FreeBSD memory subsystem as being 'FreeBSD' so much
as being the work of a handful of very good theorists and programmers
(John Dyson being the main element there), and years of testing under
varying loads. The algorithms FreeBSD uses are NOT 15 years old; they
are more like 3 years old, and much of the work I myself have done is
less than one year old. Also, keep in mind that the standard FUD about
FreeBSD 'swapping more' than linux is just that --- FUD. FreeBSD only
swaps when it needs to, or thinks it may benefit by freeing up an
idle dirty page for more active reuse. I make an attempt to describe
in exact detail how and why FreeBSD pages things out, and why it works
so well, somewhere down below :-).
My suggestions are as follows:
First, stop treating the buffer cache as an entity separate from the
VM system. The two are inextricably bound together, especially considering
the massive use of mmap() (both file-backed and anonymous mmaps) in
modern programming. Depending on what you are running, a properly
balanced system might have anywhere from 80% of its memory assigned
as file cache to 80% of its memory assigned to hold anonymous memory
for processes. It is NOT possible to impose limitations and still
get a reasonably scalable, balanced system. DO NOT TREAT THESE
AS TWO DIFFERENT CACHES!
Second, start keeping real statistics on memory use on a
physical-page basis. That means tracking how often VM pages are
referenced (statistically) as well as how often filesystem pages are
referenced by discrete I/O calls (deterministically). Keep track of
a real per-physical-page statistical 'weight'. (What this means for
linux is that you really need to test the pte's associated with physical
pages by iterating through the physical pages in your outer loop, NOT
by trying to iterate through every page table of every process!).
FreeBSD keeps a center-weighted statistic for every page of memory
(buffer cache or VM cache, it makes no difference). This has turned
out to be a nearly perfect balancing algorithm and I strongly recommend
that linux adopt a similar model. But what makes it work is that
FreeBSD is willing to eat a couple of cpu cycles to keep accurate
statistics of page use by the VM system in order to avoid the bad
things that happen when one would otherwise choose the wrong page to
reuse or clean.
What I describe below is the essential core of the algorithm FreeBSD
uses. It's not an exact representation, but it gets to the heart of
FreeBSD's memory-balancing success.
The algorithm is a *modified* LRU. Let's say you decide on a weighting
between 0 and 10. When a page is first allocated (either to the
buffer cache or for anonymous memory) its statistical weight is
set to the middle (5). If the page is used often the statistical
weight slowly rises to its maximum (10). If the page remains idle
(or was just used once) the statistical weight slowly drops to its
minimum (0).
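In rough C, the aging step might look something like the following; this is
a minimal sketch only, with field and constant names made up rather than
taken from FreeBSD's actual code:

    #include <stdbool.h>

    #define WEIGHT_MIN  0
    #define WEIGHT_INIT 5
    #define WEIGHT_MAX  10

    struct page_stat {
        int  weight;      /* center-weighted use statistic */
        bool referenced;  /* snapshot of the hardware referenced bit */
    };

    /* Called when a page first enters the cache (file or anonymous). */
    static void page_init_weight(struct page_stat *p)
    {
        p->weight = WEIGHT_INIT;
    }

    /* Called from the I/O paths and from the periodic referenced-bit scan. */
    static void page_age(struct page_stat *p)
    {
        if (p->referenced) {
            if (p->weight < WEIGHT_MAX)
                p->weight++;
            p->referenced = false;   /* clear so the next scan sees fresh activity */
        } else if (p->weight > WEIGHT_MIN) {
            p->weight--;
        }
    }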
The statistical weight is updated in real time by I/O system calls,
and updated statistically (by checking and clearing the page-referenced
bit in pte's) for mapped memory. When you mmap() a file and issue
syscalls on the descriptor, the weight may be updated by BOTH methods.
The rate at which the statistical page-reference updating operates depends
on the perceived memory load. A lightly loaded system (unstressed
memory) doesn't bother to scan the page-referenced bit all that often,
while a heavy memory load scans the page-referenced bit quite often
to keep the statistical weight intact.
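A tiny illustration of that load-dependent pacing; the numbers here are
arbitrary, only the shape of the behaviour matters:

    /* Illustrative only: more perceived memory pressure means more pages get
     * their referenced bit sampled per pass; an idle system barely scans. */
    static long pages_to_scan(long total_pages, int load_percent)
    {
        if (load_percent < 10)
            return total_pages / 256;          /* token scan when unstressed */
        return (total_pages * load_percent) / 100;
    }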
When memory is allocated and no free pages are available, a clean page
is discarded from the cache (all 'clean' pages are considered to be
cache pretty much), lowest weight first. This in itself does NOT
contribute to the memory load calculation. That is, if you are scanning
a 10GB file you are not creating any memory stress on the system.
The LRU nature of the order of the pages in the queue is not strict.
The real parameter is the statistic; the ordering of the pages in the
queue uses a heuristic -- the pages 'migrate' over time so they are
reasonably well ordered within the queue, but no great effort is made
to order them exactly. The VM system will scan a portion of the queue
to locate a reasonable page to reuse (for example, it will look for
a page with a weighting less than 2).
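A rough sketch of such a bounded reclaim scan, assuming a single loosely
ordered queue of pages; all names and limits here are hypothetical, and the
fallback to the lowest-weight page seen is my own embellishment:

    #include <stddef.h>

    #define REUSE_THRESHOLD 2    /* "low enough" weight to reclaim immediately */
    #define SCAN_LIMIT      32   /* only inspect a slice of the queue */

    struct qpage {
        int weight;
        int dirty;
        struct qpage *next;      /* loosely ordered clean-page queue */
    };

    /* Take the first sufficiently idle clean page from a bounded slice of the
     * queue, falling back to the lowest-weight clean page seen in that slice. */
    static struct qpage *reclaim_clean_page(struct qpage *head)
    {
        struct qpage *best = NULL;
        int scanned = 0;

        for (struct qpage *p = head; p && scanned < SCAN_LIMIT; p = p->next, scanned++) {
            if (p->dirty)
                continue;
            if (p->weight < REUSE_THRESHOLD)
                return p;
            if (best == NULL || p->weight < best->weight)
                best = p;
        }
        return best;             /* NULL means no clean page found in the slice */
    }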
The pagedaemon's scan rate is based on the perceived memory load
and ONLY the perceived memory load. It is perfectly acceptable to
have most of the system memory in 'active' use if allocations are not
occurring often, perfectly acceptable to have most of the system memory
backing file pages if processes aren't doing a lot of pageins, perfectly
acceptable for the system memory to be mostly dedicated to process
anonymous memory if processes have big ACTIVE footprints, perfectly
acceptable for most of the pages to be dirty if they are all in active
use and the memory subsystem is not otherwise being stressed.
The reason FreeBSD's memory subsystem works so well is precisely because
it does not impose any artificial limitations on the balance point.
Memory load is calculated in two ways: first, by whether the memory system
finds itself reusing active pages (in my example, any page with a statistical
weight greater than 5); second, by the dirty:clean page ratio. A
high ratio does not itself cause paging to occur, but a high ratio
combined with the system reusing active pages does.
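Read literally, the trigger could be expressed like this; the thresholds are
mine, purely for illustration, not FreeBSD's actual numbers:

    #include <stdbool.h>

    struct vm_load {
        long clean_pages;
        long dirty_pages;
        long reclaims_total;
        long reclaims_active;    /* reclaimed pages whose weight was above the middle */
    };

    /* Hypothetical trigger: cleaning starts only when we are both reusing
     * active pages and carrying a high dirty:clean ratio; either alone is fine. */
    static bool should_start_cleaning(const struct vm_load *s)
    {
        bool reusing_active = s->reclaims_total > 0 &&
                              s->reclaims_active * 4 > s->reclaims_total;  /* >25%, arbitrary */
        bool ratio_high = s->dirty_pages > s->clean_pages;                 /* >1:1, arbitrary */

        return reusing_active && ratio_high;
    }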
The dirty/clean ratio is treated as an INDEPENDENT problem. The
same statistic is kept for dirty pages as it is for clean pages, but
dirty pages are placed on their own independent LRUish queue and do
not take part in the 'normal' memory allocation algorithm. A
separate algorithm (also part of the pageout daemon) controls the
cleaning of dirty pages.
When the memory load increases, an attempt is made to balance the
dirty/clean ratio by 'cleaning' dirty pages, which of course means
paging them out. FreeBSD makes NO distinction between writing a dirty
file-backed page and allocating swap for a dirty anonymous memory page.
The same per-page memory-use statistic is also used to determine which
dirty pages to clean first. In effect, it is precisely this attempt
to balance the dirty/clean ratio which increases the number of clean
pages available to reuse. The idea here is to increase the number of
clean pages to the point where the system is no longer being forced
to reuse 'active' pages. Once this is achieved there is no longer any
need to clean the remaining dirty pages.
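A sketch of that feedback loop, under the assumption that the pageout daemon
simply launders the lowest-weight dirty pages until clean pages stop being
scarce; the helpers and limits are made up for the example:

    #include <stddef.h>

    struct lpage {
        int weight;
        int dirty;
    };

    /* Stand-in for queueing an asynchronous pageout; here it just marks clean. */
    static void start_pageout(struct lpage *p)
    {
        p->dirty = 0;
    }

    /* Launder the lowest-weight dirty pages, but only until the clean-page
     * shortage goes away (or the per-pass limit is hit); the rest stay dirty. */
    static void launder_until_balanced(struct lpage *pages, size_t n,
                                       int max_launder,
                                       int (*clean_pages_short)(void))
    {
        while (max_launder-- > 0 && clean_pages_short()) {
            struct lpage *victim = NULL;

            for (size_t i = 0; i < n; i++)
                if (pages[i].dirty &&
                    (victim == NULL || pages[i].weight < victim->weight))
                    victim = &pages[i];
            if (victim == NULL)
                break;
            start_pageout(victim);
        }
    }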
Under extreme memory loads the balance point moves on its own to a
point where FreeBSD tries to keep as many pages in a clean state as
possible. When the memory load gets to this point the system is
considered to be thrashing and we start taking anti-thrashing measures,
such as swapping out whole processes and holding them idle for 20-second
spans. It rarely gets to this point, but even when it does the system
is still kept reasonably balanced.
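As a sketch, that hold period could be enforced with nothing more than a
timestamp per process; the 20-second figure is the only number taken from the
text, everything else here is illustrative:

    #include <time.h>

    #define SWAP_HOLD_SECONDS 20   /* the hold time mentioned above */

    struct task_swap {
        time_t swapped_out_at;     /* 0 while the process is resident */
    };

    /* Once a whole process has been swapped out as an anti-thrashing measure,
     * keep it out for the full hold period before letting it back in. */
    static int may_swap_back_in(const struct task_swap *t, time_t now)
    {
        return t->swapped_out_at == 0 ||
               (now - t->swapped_out_at) >= SWAP_HOLD_SECONDS;
    }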
It should be noted that the center-weighting algorithm works in virtually
all situations, including working WONDERFULLY when you have I/O
centric programs (i.e. a program that reads or writes gigabytes of
data). By making slight adjustments to the initial weight (or even no
adjustments at all) the VM system will tend to reuse used-once memory
(think of scanning a file) before it tries to reuse more actively used
memory.
Now, of course, there are other kernel processes messing with memory.
The filesystem update daemon, for example. But these processes are
not designed to handle heavy memory loads and we do it that way on
purpose. At most the update daemon will speed up a little under intense
filesystem loads, but that is as far as it goes. Only one process is
designed to handle heavy memory loads and that is the pageout daemon.
---
Stress Cases
* Stressing dirty pages in the system via I/O calls (read/write)
The algorithm tends to cause sequential I/O calls to give pages
a middling weight, and since the pages are not reused they tend
to be recycled within their domain (so you don't blow the rest
of the cache).
* Stressing dirty pages in the system via mmap (shared R+W)
The system tends to run low on clean pages, detected by the
fact that new allocations are reusing clean pages which have high
weights. When this occurs the pageout daemon attempts to 'clean'
dirty pages (page them out) in order to increase the number of
clean pages available. Having a larger number of clean pages
available tends to give them more time to age, thus reducing the
average weight the allocator sees. This is a negative feedback
loop which results in balance.
* I/O (read/shared-mmap) stress
The algorithm tends to weight the clean pages according to use.
The weightings for filesystem cache pages read via read() are
adjusted at the time of the read() while VM pages are adjusted
statistically (The VM page scan rate depends on the level of
stress). Since in modern systems mmap() is used heavily, no
special consideration is given to one access method versus the
other.
* VM (anonymous memory) stress
Anonymous swap-backed memory is treated no differently from
file-backed (filesystem buffers / mmap) memory. Clean anonymous
pages (most likely with swap already allocated if they are clean)
can be reused just the same as pages belonging to the filesystem
buffer cache. Swap is assigned to dirty anonymous pages on the
fly, only when the pageout daemon decides to actually clean the
page. Once swap is assigned the clean page can be reused.
If a swap-backed page is brought back into memory, it is brought
back in clean (swap is left assigned). Swap is only freed if
the page is re-dirtied by the process.
Thus most anonymous-memory pages in a heavily loaded system tend
to remain clean, allowing them to be reused more easily and extending
the life of the system further along the curve before it reaches a
thrashing state.
* Write Clustering.
Whenever the system decides to clean a dirty page it will, on the
fly, attempt to locate dirty nearby pages. FreeBSD is actually
quite sophisticated in this regard in that it actually goes and does
the calculation to ensure that only pages physically contiguous
on the disk are clustered for the write. The cluster is then written
and marked clean all in one go (cluster size limit is 64-128K).
* Sequential Detection Heuristic for read clustering (read())
A heuristic detects sequential read behavior and implements two
optimizations. (1) it implements read-aheads (as long as they
are reasonably contiguous on the physical media, we explicitly do
not try to issue read-aheads if it would cause an extra disk seek),
(2) it implements priority depression read-behind (reduce by 1 the
statistical weight of pages that have already been read). Reuse of
the pages can still cause the statistical weighting to increase to
the maximum, but this optimization has a tendency to greatly reduce
the stress that large sequential reads have on the rest of the
memory subsystem.
* Sequential Detection Heuristic for read clustering (VM fault)
A heuristic detects sequential VM fault operation, either forwards
or backwards and adjusts the cluster window around the fault taken,
either shifting it forwards or backwards, or making the window
smaller (e.g. if random fault operation is detected). Fault-ahead
I/O is initiated based on the algorithm and anything found cached
is pre-faulted into the page table. (The window size in FreeBSD is
approximately 64KBytes for this particular algorithm). The window
is further restricted to ensure that only media-contiguous blocks
are clustered.
* Sequential Detection Heuristic for write clustering (write())
In the case of write() I/O (write system call), in order to
avoid saturating the memory system with dirty pages, if the
sequential detection heuristic determines that writes are
occurring sequentially, FreeBSD implements write-behind. That
is, it issues the I/O on the dirty buffers preceding the write
point immediately (and asynchronously), in order to get the
pages into a clean state and thus reusable, thus avoiding
stressing the memory system. In this case there is also a
limit placed on the number of dirty filesystem buffers
allowed to accumulate (since I/O is slower than the write()
calls creating the dirty buffers).
What you wind up with in this case is maximum disk throughput for the
sequential write without thousands of unnecessary dirty pages,
which is asynchronous up to a reasonable point and then starts
blocking to give the I/O the chance to catch up a little in
order to avoid starving the clean page cache. (A toy sketch of this
sequential-write detection appears after the stress-case list below.)
* Sequential Detection Heuristic for write clustering (mmap)
Currently not implemented under FreeBSD. This used to be a big
problem because you could completely saturate the VM system with
dirty pages before the system even realized it. To fix this we
threw in a memory-stress check in vm_fault to block when dirtying
pages in the face of having too many dirty pages already, giving
I/O a chance to catch up a little.
This actually improved performance because it left a greater number
of clean pages available and so the page selection algorithm in the
allocator worked better (tended to select idle pages rather than
active pages).
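To make the write() write-behind case above concrete, here is a toy
sequential-write detector of the sort described; the 64K figure echoes the
clustering limit mentioned earlier, but the trigger count, names and flush
condition are all my own assumptions, not FreeBSD code:

    #include <stdint.h>
    #include <stdbool.h>

    #define CLUSTER_BYTES (64 * 1024)   /* flush in 64K clusters, per the text */
    #define SEQ_TRIGGER   3             /* arbitrary: sequential writes before acting */

    struct seq_write_state {
        uint64_t next_expected;         /* file offset right after the last write */
        unsigned seq_count;             /* consecutive sequential writes seen */
    };

    /* Record a write(); return true when the just-completed cluster should be
     * written behind (asynchronously) so its pages become clean and reusable. */
    static bool note_write(struct seq_write_state *s, uint64_t offset, uint64_t len)
    {
        bool sequential = (offset == s->next_expected);

        s->seq_count = sequential ? s->seq_count + 1 : 0;
        s->next_expected = offset + len;

        return sequential && s->seq_count >= SEQ_TRIGGER &&
               (s->next_expected % CLUSTER_BYTES) == 0;
    }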
-Matt
Matthew Dillon
<dillon@backplane.com>
[-- Attachment #2.2: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd) --]
[-- Type: MESSAGE/RFC822, Size: 17343 bytes --]
Date: Tue, 23 May 2000 09:34:23 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
X-Sender: riel@duckman.distro.conectiva
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)
In-Reply-To: <200005230632.XAA64159@apollo.backplane.com>
Message-ID: <Pine.LNX.4.21.0005230836110.19121-100000@duckman.distro.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
On Mon, 22 May 2000, Matthew Dillon wrote:
> What I do below is essentially describe the FreeBSD VM/memory subsystem
> from an algorithmic point of view, minus some of the fluff. It is very
> straightforward in concept and the benefits should be obvious to anyone
> who has worked on the VM system. I think it would be a fine blueprint
> to use in linux.
I'm already looking at implementing this (and have been for a few
weeks). I'm reading vm/vm_pageout.c in detail now, mainly to scan
for pitfalls which I may have overlooked in my own design.
Davem has implemented some nice active/inactive queue code too; I
will probably cut'n'paste some of his code. I'll also try to avoid
some of the problems FreeBSD has (using the active/inactive queue
as the primary scanning point and _then_ skipping to the virtual object
for IO clustering). That, together with the results obtained with my
anti-hog code, suggests that we want to scan virtual objects only, with
an active list, cache list and laundry list as the only lists.
> My suggestions are as follows:
>
> First, stop treating the buffer cache as an entity separate from the
> VM system. The two are inextricably bound together, especially considering
> the massive use of mmap() (both file-backed and anonymous mmaps) in
> modern programming. Depending on what you are running, a properly
> balanced system might have anywhere from 80% of its memory assigned
> as file cache to 80% of its memory assigned to hold anonymous memory
> for processes. It is NOT possible to impose limitations and still
> get a reasonably scalable, balanced system. DO NOT TREAT THESE
> AS TWO DIFFERENT CACHES!
We stopped doing this around kernel version 2.3.8 ;)
For all intents and purposes the buffer cache and the page
cache are one. There are still a few special cases where we
haven't figured out how to put the data into the page cache
cleanly, but those are real exceptions.
> Second, start keeping real statistics on memory use on a
> physical-page basis. That means tracking how often VM pages are
> referenced (statistically) as well as how often filesystem pages are
> referenced by discrete I/O calls (deterministically). Keep track of
> a real per-physical-page statistical 'weight'. (What this means for
> linux is that you really need to test the pte's associated with physical
> pages by iterating through the physical pages in your outer loop, NOT
> by trying to iterate through every page table of every process!).
Hmmm, why would it matter in what order we scan stuff?
> FreeBSD keeps a center-weighted statistic for every page of memory
> (buffer cache or VM cache, it makes no difference). This has turned
> out to be a nearly perfect balancing algorithm and I strongly recommend
> that linux adopt a similar model. But what makes it work is that
> FreeBSD is willing to eat a couple of cpu cycles to keep accurate
> statistics of page use by the VM system in order to avoid the bad
> things that happen when one would otherwise choose the wrong page to
> reuse or clean.
I fully agree on this point. Linus and Andrea have always been
strong opponents of any possible "overhead", leading to a VM
subsystem that is optimised for CPU use, not for managing memory ;(
> The algorithm is a *modified* LRU. Let's say you decide on a weighting
> between 0 and 10. When a page is first allocated (either to the
> buffer cache or for anonymous memory) its statistical weight is
> set to the middle (5). If the page is used often the statistical
> weight slowly rises to its maximum (10). If the page remains idle
> (or was just used once) the statistical weight slowly drops to its
> minimum (0).
*nod* We may want to age physical pages in a way like this.
> The statistical weight is updated in real time by I/O system calls,
> and updated statistically (by checking and clearing the page-referenced
> bit in pte's) for mapped memory. When you mmap() a file and issue
> syscalls on the descriptor, the weight may be updated by BOTH methods.
> The rate at which the statistical page-reference updating operates depends
> on the perceived memory load. A lightly loaded system (unstressed
> memory) doesn't bother to scan the page-referenced bit all that often,
> while a heavy memory load scans the page-referenced bit quite often
> to keep the statistical weight intact.
Nice.
> When memory is allocated and no free pages are available, a clean page
> is discarded from the cache (all 'clean' pages are considered to be
> cache pretty much), lowest weight first. This in itself does NOT
> contribute to the memory load calculation. That is, if you are scanning
> a 10GB file you are not creating any memory stress on the system.
*nod*
> The LRU nature of the order of the pages in the queue is not strict.
> The real parameter is the statistic; the ordering of the pages in the
> queue uses a heuristic -- the pages 'migrate' over time so they are
> reasonably well ordered within the queue, but no great effort is made
> to order them exactly. The VM system will scan a portion of the queue
> to locate a reasonable page to reuse (for example, it will look for
> a page with a weighting less than 2).
*nod*
> The pagedaemon's scan rate is based on the perceived memory load
> and ONLY the perceived memory load. It is perfectly acceptable to
> have most of the system memory in 'active' use if allocations are not
> occurring often, perfectly acceptable to have most of the system memory
> backing file pages if processes aren't doing a lot of pageins, perfectly
> acceptable for the system memory to be mostly dedicated to process
> anonymous memory if processes have big ACTIVE footprints, perfectly
> acceptable for most of the pages to be dirty if they are all in active
> use and the memory subsystem is not otherwise being stressed.
>
> The reason FreeBSD's memory subsystem works so well is precisely because
> it does not impose any artificial limitations on the balance point.
Yup. This is a good thing which will be implemented in Linux too.
> Memory load is calculated in two ways: first, by whether the memory system
> finds itself reusing active pages (in my example, any page with a statistical
> weight greater than 5); second, by the dirty:clean page ratio. A
> high ratio does not itself cause paging to occur, but a high ratio
> combined with the system reusing active pages does.
Hmmmm, this isn't what I'm seeing in the code. Could you point me
at the code where this is happening?
> The dirty/clean ratio is treated as an INDEPENDENT problem. The
> same statistic is kept for dirty pages as it is for clean pages, but
> dirty pages are placed on their own independent LRUish queue and do
> not take part in the 'normal' memory allocation algorithm. A
> separate algorithm (also part of the pageout daemon) controls the
> cleaning of dirty pages.
This is planned for Linux as well.
> When the memory load increases, an attempt is made to balance the
> dirty/clean ratio by 'cleaning' dirty pages, which of course means
> paging them out. FreeBSD makes NO distinction between writing a dirty
> file-backed page and allocating swap for a dirty anonymous memory page.
> The same per-page memory-use statistic is also used to determine which
> dirty pages to clean first. In effect, it is precisely this attempt
> to balance the dirty/clean ratio which increases the number of clean
> pages available to reuse. The idea here is to increase the number of
> clean pages to the point where the system is no longer being forced
> to reuse 'active' pages. Once this is achieved there is no longer any
> need to clean the remaining dirty pages.
I've read the code and it seems to do this:
- put inactive pages in the cached queue
- launder up to maxlaunder pages per scan (if needed)
> Under extreme memory loads the balance point moves on its own to a
> point where FreeBSD tries to keep as many pages in a clean state as
> possible. When the memory load gets to this point the system is
> considered to be thrashing and we start taking anti-thrashing measures,
> such as swapping out whole processes and holding them idle for 20-second
> spans. It rarely gets to this point, but even when it does the system
> is still kept reasonably balanced.
Uhmmmm. This does not seem to be exactly what the code does...
(then again, I could be wrong)
> It should be noted that the center-weighting algorithm works in virtually
> all situations, including working WONDERFULLY when you have I/O
> centric programs (i.e. a program that reads or writes gigabytes of
> data). By making slight adjustments to the initial weight (or even no
> adjustments at all) the VM system will tend to reuse used-once memory
> (think of scanning a file) before it tries to reuse more actively used
> memory.
*nod*
Always trying to keep 1/3rd of the mapped pages in an "inactive"
age is probably a good idea for Linux. The biggest problem with
our code seems to be that we're *not* trying to have a balanced
aging of pages ...
> Now, of course, there are other kernel processes messing with memory.
> The filesystem update daemon, for example. But these processes are
> not designed to handle heavy memory loads and we do it that way on
> purpose. At most the update daemon will speed up a little under intense
> filesystem loads, but that is as far as it goes. Only one process is
> designed to handle heavy memory loads and that is the pageout daemon.
*nod*
> ---
> Stress Cases
>
> * Stressing dirty pages in the system via I/O calls (read/write)
>
> The algorithm tends to cause sequential I/O calls to give pages
> a middling weight, and since the pages are not reused they tend
> to be recycled within their domain (so you don't blow the rest
> of the cache).
This is a good argument for starting at a middle weight. Thanks
for bringing this case to my attention; we really want something
like this.
> * Stressing dirty pages in the system via mmap (shared R+W)
>
> The system tends to run low on clean pages, detected by the
> fact that new allocations are reusing clean pages which have high
> weights. When this occurs the pageout daemon attempts to 'clean'
> dirty pages (page them out) in order to increase the number of
> clean pages available. Having a larger number of clean pages
> available tends to give them more time to age, thus reducing the
> average weight the allocator sees. This is a negative feedback
> loop which results in balance.
Hummm, where in the code can I find info on this?
It certainly sounds interesting...
> * I/O (read/shared-mmap) stress
>
> The algorithm tends to weight the clean pages according to use.
> The weightings for filesystem cache pages read via read() are
> adjusted at the time of the read() while VM pages are adjusted
> statistically (The VM page scan rate depends on the level of
> stress). Since in modern systems mmap() is used heavily, no
> special consideration is given to one access method versus the
> other.
*nod*
> * VM (anonymous memory) stress
>
> Anonymous swap-backed memory is treated no differently from
> file-backed (filesystem buffers / mmap) memory. Clean anonymous
> pages (most likely with swap already allocated if they are clean)
> can be reused just the same as pages belonging to the filesystem
> buffer cache. Swap is assigned to dirty anonymous pages on the
> fly, only when the pageout daemon decides to actually clean the
> page. Once swap is assigned the clean page can be reused.
>
> If a swap-backed page is brought back into memory, it is brought
> back in clean (swap is left assigned). Swap is only freed if
> the page is re-dirtied by the process.
>
> Thus most anonymous-memory pages in a heavily loaded system tend
> to remain clean, allowing them to be reused more easily and extending
> the life of the system further along the curve before it reaches a
> thrashing state.
A few points:
- it may be easier to assign swap entries based on virtual
scanning of processes, otherwise you'll end up with the
strange jumping between queue scanning and virtual scanning
that FreeBSD is doing now
- vnodes are why FreeBSD is able to delay writing anonymous
pages till later on
- not every page weighs equally ... you want some fairness
between memory hogs and small processes
- Linux already does the don't-rewrite-clean-swap thing
> * Write Clustering.
>
> Whenever the system decides to clean a dirty page it will, on the
> fly, attempt to locate dirty nearby pages. FreeBSD is actually
> quite sophisticated in this regard in that it actually goes and does
> the calculation to ensure that only pages physically contiguous
> on the disk are clustered for the write. The cluster is then written
> and marked clean all in one go (cluster size limit is 64-128K).
Neat...
> * Sequential Detection Heuristic for read clustering (read())
>
> A heuristic detects sequential read behavior and implements two
> optimizations. (1) it implements read-aheads (as long as they
> are reasonably contiguous on the physical media, we explicitly do
> not try to issue read-aheads if it would cause an extra disk seek),
> (2) it implements priority depression read-behind (reduce by 1 the
> statistical weight of pages that have already been read). Reuse of
> the pages can still cause the statistical weighting to increase to
> the maximum, but this optimization has a tendency to greatly reduce
> the stress that large sequential reads have on the rest of the
> memory subsystem.
Ohhhh, nice ;)
> * Sequential Detection Heuristic for read clustering (VM fault)
>
> A heuristic detects sequential VM fault operation, either forwards
> or backwards and adjusts the cluster window around the fault taken,
> either shifting it forwards or backwards, or making the window
> smaller (e.g. if random fault operation is detected). Fault-ahead
> I/O is initiated based on the algorithm and anything found cached
> is pre-faulted into the page table. (The window size in FreeBSD is
> approximately 64KBytes for this particular algorithm). The window
> is further restricted to ensure that only media-contiguous blocks
> are clustered.
Again ... neat.
> * Sequential Detection Heuristic for write clustering (write())
>
> In the case of write() I/O (write system call), in order to
> avoid saturating the memory system with dirty pages, if the
> sequential detection heuristic determines that writes are
> occurring sequentially, FreeBSD implements write-behind. That
> is, it issues the I/O on the dirty buffers preceding the write
> point immediately (and asynchronously), in order to get the
> pages into a clean state and thus reusable, thus avoiding
> stressing the memory system. In this case there is also a
> limit placed on the number of dirty filesystem buffers
> allowed to accumulate (since I/O is slower than the write()
> calls creating the dirty buffers).
>
> What you wind up with in this case is maximum disk throughput for the
> sequential write without thousands of unnecessary dirty pages,
> which is asynchronous up to a reasonable point and then starts
> blocking to give the I/O the chance to catch up a little in
> order to avoid starving the clean page cache.
Cool.
> * Sequential Detection Heuristic for write clustering (mmap)
>
> Currently not implemented under FreeBSD. This used to be a big
> problem because you could completely saturate the VM system with
> dirty pages before the system even realized it. To fix this we
> threw in a memory-stress check in vm_fault to block when dirtying
> pages in the face of having too many dirty pages already, giving
> I/O a chance to catch up a little.
>
> This actually improved performance because it left a greater number
> of clean pages available and so the page selection algorithm in the
> allocator worked better (tended to select idle pages rather than
> active pages).
This is an extremely good idea. When we have too many (or just
a lot of) dirty pages, we should delay faults in write-enabled
VMAs, while allowing faults in read-only VMAs to proceed as
fast as possible ...
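Something along these lines, purely as a sketch -- the threshold, structure
and names are placeholders, not existing Linux code:

    #include <stdbool.h>

    struct fault_info {
        bool write_fault;     /* fault in a write-enabled VMA */
        long dirty_pages;
        long total_pages;
    };

    /* Placeholder policy: read-only faults never wait; write faults wait while
     * dirty pages exceed roughly 40% of memory, giving writeback a chance. */
    static bool should_throttle_fault(const struct fault_info *f)
    {
        if (!f->write_fault)
            return false;
        return f->dirty_pages * 5 > f->total_pages * 2;
    }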
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
From: Rik van Riel <riel@conectiva.com.br>
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)
Date: Tue, 23 May 2000 09:34:23 -0300 (BRST)
Message-ID: <Pine.LNX.4.21.0005230836110.19121-100000@duckman.distro.conectiva>
On Mon, 22 May 2000, Matthew Dillon wrote:
> What I do below is essentially describe the FreeBSD VM/memory subsystem
> from an algorithmic point of view, minus some of the fluff. It is very
> straightforward in concept and the benefits should be obvious to anyone
> who has worked on the VM system. I think it would be a fine blueprint
> to use in linux.
I'm already looking at implementing this (and have been for a few
weeks). I'm reading vm/vm_pageout.c in detail now, mainly to scan
for pitfalls which I may have overlooked in my own design.
Davem has implemented some nice active/inactive queue code too, I
will probably cut'n'paste some of his code. I'll also try to avoid
some of the problems FreeBSD has (using the active/inactive queue
as primary scanning point and _then_ skipping to the virtual object
for IO clustering, as well as results obtained with my anti-hog code
suggest that we want to do scanning of virtual objects only, with an
active list, cache list and laundry list as only lists).
> My suggestions are as follows:
>
> First, stop treating the buffer cache as an entity separate from the
> VM system. The two are inexorably bound together, especially considering
> the massive use of mmap() (both file-backed and anonymous mmaps) in
> modern programming. Depending on what you are running a properly
> balanced system might have anywhere from 80% of its memory assigned
> as file cache to 80% of its memory assigned to hold anonymous memory
> for processes. it is NOT possible to impose limitations and still
> get a reasonably scaleable balanced system. DO NOT TREAT THESE
> AS TWO DIFFERENT CACHES!
We stopped doing this around kernel version 2.3.8 ;)
For all intents and purposes the buffer cache and the page
cache are one. There are still a few special cases where we
haven't figured out how to put the data into the page cache
cleanly, but those are real exceptions.
> Second, start keeping real statistics on memory use across on a
> physical-page-basis. That means tracking how often VM pages are
> referenced (statistically) as well as how often filesystem pages are
> referenced by discrete I/O calls (deterministically). Keep track of
> a real per-physical-page statistical 'weight'. (What this means for
> linux is that you really need to test the pte's associated with physical
> pages by iterating through the physical pagse in your outer loop, NOT
> by trying to iterate through every page table of every process!).
Hmmm, why would it matter in what order we scan stuff?
> FreeBSD keeps a center-weighted statistic for every page of memory
> (buffer cache or VM cache, it makes no difference). This has turned
> out to be a nearly perfect balancing algorithm and I strongly recommend
> that linux adopt a similar model. But what makes it work is that
> FreeBSD is willing to eat a couple of cpu cycles to keep accurate
> statistics of page use by the VM system in order to avoid the bad
> things that happen when one would otherwise choose the wrong page to
> reuse or clean.
I fully agree on this point. Linus and Andrea have always been
a strong opponent of any possible "overhead", leading to a VM
subsystem that is optimised for CPU use, not for managing memory ;(
> The algorithm is a *modified* LRU. Lets say you decide on a weighting
> betweeen 0 and 10. When a page is first allocated (either to the
> buffer cache or for anonymous memory) its statistical weight is
> set to the middle (5). If the page is used often the statistical
> weight slowly rises to its maximum (10). If the page remains idle
> (or was just used once) the statistical weight slowly drops to its
> minimum (0).
*nod* We may want to age physical pages in a way like this.
> The statistical weight is updated in real time by I/O system calls,
> and updated statistically (by checking and clearing the page-referenced
> bit in pte's) for mapped memory. When you mmap() a file and issue
> syscalls on the descriptor, the weight may be updated by BOTH methods.
> The rate at which the statistical page-reference updating operates depends
> on the perceived memory load. A lightly loaded system (unstressed
> memory) doesn't bother to scan the page-referenced bit all that often,
> while a heavy memory load scans the page-referenced bit quite often
> to keep the statistical weight intact.
Nice.
> When memory is allocated and no free pages are available, a clean page
> is discarded from the cache (all 'clean' pages are considered to be
> cache pretty much), lowest weight first. This in itself does NOT
> contribute to the memory load calculation. That is, if you are scanning
> a 10GB file you are not creating any memory stress on the system.
*nod*
> The LRU nature of the order of the pages in the queue is not strict.
> The real parameter is the statistic, the ordering of the pages in the
> queue uses a heuristic -- the pages 'migrate' over time so they are
> reasonably well ordered within the queue, but no great effort is made
> to order them exactly. The VM system will scan a portion of the queue
> to locate a reasonable page to reuse (for example, it will look for
> a page with a weighting less then 2).
*nod*
> The pagedaemon's scan rate is based on the perceived memory load
> and ONLY the perceived memory load. It is perfectly acceptable to
> have most of the system memory in 'active' use if allocations are not
> occuring often, perfectly acceptable to have most of the system memory
> backing file pages if processes aren't doing a lot of pageins, perfectly
> acceptable for the system memory to be mostly dedicated to process
> anonymous memory if processes have big ACTIVE footprints, perfectly
> acceptable for most of the pages to be dirty if they are all in active
> use and the memory subsystem is not otherwise being stressed.
>
> The reason FreeBSD's memory subsystem works so well is precisely because
> it does not impose any artificial limitations on the balance point.
Yup. This is a good thing which will be implemented in Linux too.
> Memory load is calculated in two ways: First, if the memory system finds
> itself reusing active pages (in my example, any page with a statistical
> weight greater then 5), second based on the dirty:clean page ratio. A
> high ratio does not itself cause paging to occur, but a high ratio
> combined with the system reusing active pages does.
Hmmmm, this isn't what I'm seeing in the code. Could you point me
at the code where this is happening?
> The dirty/clean ratio is treated as an INDEPENDENT problem. The
> same statistic is kept for dirty pages as it is for clean pages, but
> dirty pages are placed on their own independent LRU-ish queue and do
> not take part in the 'normal' memory allocation algorithm. A
> separate algorithm (also part of the pageout daemon) controls the
> cleaning of dirty pages.
This is planned for Linux as well.
> When the memory load increases, an attempt is made to balance the
> dirty/clean ratio by 'cleaning' dirty pages, which of course means
> paging them out. FreeBSD makes NO distinction between writing a dirty
> file-backed page and allocating swap for a dirty anonymous memory page.
> The same per-page memory-use statistic is also used to determine which
> dirty pages to clean first. In effect, it is precisely this attempt
> to balance the dirty/clean ratio which increases the number of clean
> pages available to reuse. The idea here is to increase the number of
> clean pages to the point where the system is no longer being forced
> to reuse 'active' pages. Once this is achieved there is no longer any
> need to clean the remaining dirty pages.
I've read the code and it seems to do this:
- put inactive pages in the cached queue
- launder up to maxlaunder pages per scan (if needed)
> Under extreme memory loads the balance point moves on its own to a
> point where FreeBSD tries to keep as many pages in a clean state as
> possible. When the memory load gets to this point the system is
> considered to be thrashing and we start taking anti-thrashing measures,
> such as swapping out whole processes and holding them idle for 20-second
> spans. It rarely gets to this point, but even when it does the system
> is still kept reasonably balanced.
Uhmmmm. This does not seem to be exactly what the code does...
(then again, I could be wrong)
> It should be noted that the center-weighting algorithm works in virtually
> all situations, including working WONDERFULLY when you have I/O
> centric programs (i.e. a program that reads or writes gigabytes of
> data). By making slight adjustments to the initial weight (or even no
> adjustments at all) the VM system will tend to reuse used-once memory
> (think of scanning a file) before it tries to reuse more actively used
> memory.
*nod*
Always trying to keep 1/3rd of the mapped pages in an "inactive"
age is probably a good idea for Linux. The biggest problem with
our code seems to be that we're *not* trying to have a balanced
aging of pages ...
> Now, of course, there are other kernel processes messing with memory.
> The filesystem update daemon, for example. But these processes are
> not designed to handle heavy memory loads and we do it that way on
> purpose. At most the update daemon will speed up a little under intense
> filesystem loads, but that is as far as it goes. Only one process is
> designed to handle heavy memory loads and that is the pageout daemon.
*nod*
> ---
> Stress Cases
>
> * Stressing dirty pages in the system via I/O calls (read/write)
>
> The algorithm tends to cause sequential I/O calls to give pages
> a middling weight, and since the pages are not reused they tend
> to be recycled within their domain (so you don't blow the rest
> of the cache).
This is a good argument for starting at a middle weight. Thanks
for bringing this case to my attention, we really want something
like this.
> * Stressing dirty pages in the system via mmap (shared R+W)
>
> The system tends to run low on clean pages, detected by the
> fact that new allocations are reusing clean pages which have high
> weights. When this occurs the pageout daemon attempts to 'clean'
> dirty pages (page them out) in order to increase the number of
> clean pages available. Having a larger number of clean pages
> available tends to give them more time to age, thus reducing the
> average weight the allocator sees. This is a negative feedback
> loop which results in balance.
Hummm, where in the code can I find info on this?
It certainly sounds interesting...
> * I/O (read/shared-mmap) stress
>
> The algorithm tends to weight the clean pages according to use.
> The weightings for filesystem cache pages read via read() are
> adjusted at the time of the read() while VM pages are adjusted
> statistically (The VM page scan rate depends on the level of
> stress). Since in modern systems mmap() is used heavily, no
> special consideration is given to one access method versus the
> other.
*nod*
> * VM (anonymous memory) stress
>
> Anonymous swap-backed memory is treated no differently from
> file-backed (filesystem buffers / mmap) memory. Clean anonymous
> pages (most likely with swap already allocated if they are clean)
> can be reused just the same as pages belonging to the filesystem
> buffer cache. Swap is assigned to dirty anonymous pages on the
> fly, only when the pageout daemon decides to actually clean the
> page. Once swap is assigned the clean page can be reused.
>
> If a swap-backed page is brought back into memory, it is brought
> back in clean (swap is left assigned). Swap is only freed if
> the page is re-dirtied by the process.
>
> Thus most anonymous-memory pages in a heavily loaded system tend
> to remain clean, allowing them to be reused more easily and extending
> the life of the system further along the curve before it reaches a
> thrashing state.
A few points:
- it may be easier to assign swap entries based on virtual
scanning of processes, otherwise you'll end up with the
strange jumping between queue scanning and virtual scanning
that FreeBSD is doing now
- vnodes are why FreeBSD is able to delay writing anonymous
pages till later on
- not every page weighs equally ... you want some fairness
between memory hogs and small processes
- Linux already does the don't-rewrite-clean-swap thing
> * Write Clustering.
>
> Whenever the system decides to clean a dirty page it will, on the
> fly, attempt to locate dirty nearby pages. FreeBSD is actually
> quite sophisticated in this regard in that it actually goes and does
> the calculation to ensure that only pages physically contiguous
> on the disk are clustered for the write. The cluster is then written
> and marked clean all in one go (cluster size limit is 64-128K).
Neat...
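If I read this right, the clustering is roughly this sketch (invented
structures, not the real FreeBSD code):

/* hypothetical sketch -- invented names, not real kernel code */
#define CLUSTER_MAX_BYTES (64 * 1024)
#define PAGE_BYTES        4096

struct vpage {
        int dirty;
        long disk_blk;          /* block address on the backing device */
        struct vpage *next;     /* neighbour in object/file order */
        struct vpage *prev;
};

/* grow the cluster around the chosen dirty page, but only over pages
 * that are dirty AND physically contiguous on the disk */
static int cluster_pageout(struct vpage *p)
{
        struct vpage *first = p, *last = p;
        int npages = 1;

        while (last->next && last->next->dirty &&
               last->next->disk_blk == last->disk_blk + 1 &&
               (npages + 1) * PAGE_BYTES <= CLUSTER_MAX_BYTES) {
                last = last->next;
                npages++;
        }
        while (first->prev && first->prev->dirty &&
               first->prev->disk_blk + 1 == first->disk_blk &&
               (npages + 1) * PAGE_BYTES <= CLUSTER_MAX_BYTES) {
                first = first->prev;
                npages++;
        }
        /* issue ONE write for [first..last], then mark them all clean */
        return npages;
}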
> * Sequential Detection Heuristic for read clustering (read())
>
> A heuristic detects sequential read behavior and implements two
> optimizations. (1) it implements read-aheads (as long as they
> are reasonably contiguous on the physical media, we explicitly do
> not try to issue read-aheads if it would cause an extra disk seek),
> (2) it implements priority depression read-behind (reduce by 1 the
> statistical weight of pages that have already been read). Reuse of
> the pages can still cause the statistical weighting to increase to
> the maximum, but this optimization has a tendency to greatly reduce
> the stress that large sequential reads have on the rest of the
> memory subsystem.
Ohhhh, nice ;)
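Just to make sure I get the heuristic, something like this sketch
(invented names, not real code):

/* hypothetical sketch -- invented names, not real kernel code */
struct seq_state {
        long next_expected;     /* offset we expect the next read() at */
        int  run;               /* length of the current sequential run */
};

/* returns non-zero when read-ahead (and the read-behind weight
 * depression described above) should kick in */
static int note_read(struct seq_state *s, long offset, long len)
{
        if (offset == s->next_expected)
                s->run++;
        else
                s->run = 0;             /* random access: reset the run */
        s->next_expected = offset + len;

        /* on a sequential run:
         *  - read ahead, but only while the blocks stay disk-contiguous
         *  - reduce by one the weight of the pages already read        */
        return s->run >= 3;
}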
> * Sequential Detection Heuristic for read clustering (VM fault)
>
> A heuristic detects sequential VM fault operation, either forwards
> or backwards and adjusts the cluster window around the fault taken,
> either shifting it forwards or backwards, or making the window
> smaller (e.g. if random fault operation is detected). Fault-ahead
> I/O is initiated based on the algorithm and anything found cached
> is pre-faulted into the page table. (The window size in FreeBSD is
> approximately 64KBytes for this particular algorithm). The window
> is further restricted to ensure that only media-contiguous blocks
> are clustered.
Again ... neat.
> * Sequential Detection Heuristic for write clustering (write())
>
> In the case of write() I/O (write system call), in order to
> avoid saturating the memory system with dirty pages, if the
> sequential detection heuristic determines that writes are
> occurring sequentially, FreeBSD implements write-behind. That
> is, it issues the I/O on the dirty buffers preceding the write
> point immediately (and asynchronously), in order to get the
> pages into a clean state and thus reusable, thus avoiding
> stressing the memory system. In this case there is also a
> limit placed on the number of dirty filesystem buffers
> allowed to accumulate (since I/O is slower than the write()
> calls creating the dirty buffers).
>
> What you wind up with in this case is maximum disk throughput for the
> sequential write without thousands of unnecessary dirty pages,
> which is asynchronous up to a reasonable point and then starts
> blocking to give the I/O the chance to catch up a little in
> order to avoid starving the clean page cache.
Cool.
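So the write-behind would be roughly (invented helpers, just a sketch):

/* hypothetical sketch -- invented names, not real kernel code */
#define DIRTY_BUF_LIMIT 256             /* assumed limit, needs tuning */

extern int  nr_dirty_bufs;              /* assumed global counter  */
extern void start_async_writeback(long blk, int nblks); /* assumed */
extern void wait_for_writeback(void);                   /* assumed */

static void write_behind(long cur_blk, int seq_detected, int cluster_blks)
{
        /* push out the buffers just behind the write point, async */
        if (seq_detected && cur_blk >= cluster_blks)
                start_async_writeback(cur_blk - cluster_blks, cluster_blks);

        /* and block if too many dirty buffers have piled up */
        if (nr_dirty_bufs > DIRTY_BUF_LIMIT)
                wait_for_writeback();   /* let the I/O catch up a little */
}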
> * Sequential Detection Heuristic for write clustering (mmap)
>
> Currently not implemented under FreeBSD. This used to be a big
> problem because you could completely saturate the VM system with
> dirty pages before the system even realized it. To fix this we
> threw in a memory-stress check in vm_fault to block when dirtying
> pages in the face of having too many dirty pages already, giving
> I/O a chance to catch up a little.
>
> This actually improved performance because it left a greater number
> of clean pages available and so the page selection algorithm in the
> allocator worked better (tended to select idle pages rather then
> active pages).
This is an extremely good idea. When we have too many (or just,
a lot of) dirty pages, we should delay faults in write-enabled
VMAs, while allowing faults in read-only VMAs to proceed as
fast as possible ...
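Something as simple as this sketch might already do it for us (all
names invented, and the threshold would need tuning):

/* hypothetical sketch -- invented names, not real kernel code */
extern long nr_dirty_pages;             /* assumed global counter    */
extern long dirty_page_threshold;       /* assumed tunable threshold */
extern void wakeup_flush_daemon(void);  /* assumed helper            */
extern void wait_a_little(void);        /* assumed helper            */

/* called from the fault path; read-only faults are never delayed */
static void maybe_throttle_fault(int write_fault)
{
        while (write_fault && nr_dirty_pages > dirty_page_threshold) {
                wakeup_flush_daemon();  /* get some dirty pages cleaned */
                wait_a_little();        /* and let the I/O catch up     */
        }
}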
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
[-- Attachment #2.3: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd) --]
[-- Type: MESSAGE/RFC822, Size: 21743 bytes --]
Return-Path: <dillon@apollo.backplane.com>
Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156])
by duckman.distro.conectiva (8.9.3/8.8.7) with ESMTP id OAA20874
for <riel@duckman.distro.conectiva>; Tue, 23 May 2000 14:07:04 -0300
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
by perninha.conectiva.com.br (8.9.3/8.9.3) with ESMTP id OAA31919
for <riel@conectiva.com.br>; Tue, 23 May 2000 14:07:19 -0300
Received: (from dillon@localhost)
by apollo.backplane.com (8.9.3/8.9.1) id KAA68220;
Tue, 23 May 2000 10:06:38 -0700 (PDT)
(envelope-from dillon)
Date: Tue, 23 May 2000 10:06:38 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200005231706.KAA68220@apollo.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)
References: <Pine.LNX.4.21.0005230836110.19121-100000@duckman.distro.conectiva>
:I'm already looking at implementing this (and have been for a few
:weeks). I'm reading vm/vm_pageout.c in detail now, mainly to scan
:for pitfalls which I may have overlooked in my own design.
Great!
:Davem has implemented some nice active/inactive queue code too, I
:will probably cut'n'paste some of his code. I'll also try to avoid
:some of the problems FreeBSD has (using the active/inactive queue
:as primary scanning point and _then_ skipping to the virtual object
:for IO clustering, as well as results obtained with my anti-hog code
:suggest that we want to do scanning of virtual objects only, with an
:active list, cache list and laundry list as the only lists).
I would not characterize this as a problem in FreeBSD, it's actually
one of its more powerful features. What FreeBSD is doing is using
the center-weighted algorithm to locate a page that it believes
should be reused or paged out. Once the page is selected FreeBSD
tries to make the best of the disk seek that it's about to do by
clustering other nearby pages along with it. You can't find the best
candidate by scanning via the VM objects.
The problem with trying to locate the page to reuse from the VM object
alone is that it doesn't work. You might end up scanning all the VM
objects to locate a good page to reuse. If you just scan a few VM
objects, there is a very good chance that you will *not* choose the best
page to reuse/flush. For example, there was some code in the linux
kernel a while ago that tried to scan the 'biggest' VM object under
the assumption that good candidates for page reuse would likely be
found there... if you happen to choose a VM object which is being
very actively used by the system and then try to reuse/flush a page from
it, you wind up stalling whatever process was using it even if other
VM objects have better candidates.
Nor do you want to scan ALL VM objects to locate the best page -- that
would lead to O(N^2) cpu overhead in locating the page. FreeBSD
locates the page candidate using the page queues because it can do so
in an O(N) or better manner (usually it's more around O(1)).
:> Second, start keeping real statistics on memory use on a
:> per-physical-page basis. That means tracking how often VM pages are
:> referenced (statistically) as well as how often filesystem pages are
:> referenced by discrete I/O calls (deterministically). Keep track of
:> a real per-physical-page statistical 'weight'. (What this means for
:> linux is that you really need to test the pte's associated with physical
:> pages by iterating through the physical pages in your outer loop, NOT
:> by trying to iterate through every page table of every process!).
:
:Hmmm, why would it matter in what order we scan stuff?
The problem here is that any given page of physical memory may exist
in the virtual address space of many processes. If you scan by iterating
through process page tables you wind up giving an artificial advantage
(or disadvantage) to shared pages. This is not a good measure of whether
the page can be reused. Also, iterating through the process page tables
means going through an unknown number of pte's and blowing your L1
cache in the process -- it doesn't scale. Iterating through physical
memory and then locating the associated pte's treats each page on an
equal footing when updating its statistics, regardless of how much or
how little it is being shared.
A massively shared page that happens to be very active already gets the
benefit of being accessed by multiple processes (setting the ref'd
bit), so you don't need to give it any special handling.
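In rough C the outer loop I'm describing looks something like this
(invented structures, not the actual FreeBSD code):

/* hypothetical sketch -- invented names, not real kernel code */
struct pte_ref {                        /* one mapping of a physical page */
        int *referenced_bit;            /* the pte's referenced bit */
        struct pte_ref *next;
};
struct phys_page {
        int weight;                     /* the per-page statistic */
        struct pte_ref *mappings;       /* all pte's mapping this page */
};

static void update_page_stats(struct phys_page *pages, int npages)
{
        for (int i = 0; i < npages; i++) {
                int referenced = 0;
                struct pte_ref *m;

                for (m = pages[i].mappings; m; m = m->next) {
                        if (*m->referenced_bit) {
                                referenced = 1;
                                *m->referenced_bit = 0; /* clear for next pass */
                        }
                }
                if (referenced && pages[i].weight < 10)
                        pages[i].weight++;
                else if (!referenced && pages[i].weight > 0)
                        pages[i].weight--;
        }
}

Each physical page gets exactly one statistics update per pass, no
matter how many processes happen to map it.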
:> FreeBSD keeps a center-weighted statistic for every page of memory
:> (buffer cache or VM cache, it makes no difference). This has turned
:> out to be a nearly perfect balancing algorithm and I strongly recommend
:> that linux adopt a similar model. But what makes it work is that
:> FreeBSD is willing to eat a couple of cpu cycles to keep accurate
:> statistics of page use by the VM system in order to avoid the bad
:> things that happen when one would otherwise choose the wrong page to
:> reuse or clean.
:
:I fully agree on this point. Linus and Andrea have always been
:strong opponents of any possible "overhead", leading to a VM
:subsystem that is optimised for CPU use, not for managing memory ;(
I think you could sell it to them if you make it clear that the
statistics do not eat cpu unless the system is stressed (FreeBSD doesn't
even bother doing the page scan unless it thinks the system is
stressed), and when the system is stressed it makes a whole lot more
sense to eat cpu choosing the right page, since choosing the wrong page
can lead to hundreds of thousands of wasted cpu cycles doing unnecessary
paging and even more lost real time in process stalls.
:>...... (all sorts of good stuff removed)
:> Memory load is calculated in two ways: First, if the memory system finds
:> itself reusing active pages (in my example, any page with a statistical
:> weight greater than 5), second based on the dirty:clean page ratio. A
:> high ratio does not itself cause paging to occur, but a high ratio
:> combined with the system reusing active pages does.
:
:Hmmmm, this isn't what I'm seeing in the code. Could you point me
:at the code where this is happening?
Yes. It's confusing because FreeBSD implements the stress calculation
with additional page queues rather than attempting to keep a single
queue sorted by weight, which can be expensive. When the system
allocates a page (vm_page_alloc() in vm/vm_page.c) it uses free pages
first (there usually aren't very many truly free pages), then cache
pages (basically clean 'idle' pages), and if this fails it wakes up
the page daemon, blocks, and retries. The page daemon is responsible
for moving pages into the cache for use by the allocator.
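In condensed form the allocation path is basically this (invented
helper names, not the real vm_page_alloc()):

/* hypothetical sketch -- invented names, not real kernel code */
struct vpage;                                   /* opaque page handle */
extern struct vpage *take_free_page(void);      /* assumed helpers... */
extern struct vpage *take_cache_page(void);
extern void wake_pagedaemon_and_wait(void);

static struct vpage *alloc_page_sketch(void)
{
        for (;;) {
                struct vpage *p;

                if ((p = take_free_page()) != NULL)     /* truly free  */
                        return p;
                if ((p = take_cache_page()) != NULL)    /* clean, idle */
                        return p;
                wake_pagedaemon_and_wait();             /* block, retry */
        }
}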
The page daemon (vm/vm_pageout.c, vm_pageout() and vm_pageout_scan())
is responsible for adjusting act_count (see vm_pageout_scan()) and
adjusting the position of pages in their queues.
The page daemon starts by scanning pages in the inactive queue. Pages
in this queue have no mmu mappings and have been determined to be
relatively inactive. Within the queue the pages are loosely ordered
by weight (trying to order them exactly would waste too much cpu).
See the 'rescan0' label in vm/vm_pageout.c. While scanning this queue
the pageout daemon updates the activity weighting for the page by
checking the PG_REFERENCED bit.
The inactive queue also tends to contain dirty pages to launder. The
page daemon does not flush out dirty pages immediately; if it can't
find enough clean pages in the inactive queue (line 927 in
vm/vm_pageout.c, see 'If we still have a page shortage'), it goes up
and rescans the inactive queue to start the pageout ops on some of
its pages before going on to the active queue.
If there is still a page shortage the page daemon then scans the active
queue. Line 943 'Scan the active queue for things we can deactivate'.
This queue is also loosely sorted by the page weighting and as with the
inactive queue scan the loop also checks the page-referenced bit to
update the weighting on the fly while scanning the active queue.
Under normal conditions, pages are not deactivated until the active
count reaches 0. Line 1013 of vm/vm_pageout.c.
This is where the stress test is implied. If the page daemon scans the
active queue and does not find any pages whose weighting has reached 0,
then it starts to hit around line 1077 (the vm_paging_target() test)
and becomes much more aggressive with the syncer and swapper. At
line 1097 if we determine that we have run out of swap and also do
not have sufficient clean pages, we start killing processes.
I'll get back to the vm_paging_target() function in a second.
The vm_daemon() is responsible for the next stage of aggressiveness
in a heavily loaded system. The daemon is responsible for testing
processes against their resident-memory rlimit and forcefully deactivating
pages. The daemon also notices processes marked as being swapped out
and is more aggressive on those processes (the act of marking a process
as being swapped out does not actually page out any of its pages, but
if the system swapper is activated in a heavy-use situation it will
use the mark to then aggressively swap out that process's pages).
Ok, back to vm_paging_target(). This function (sys/vmmeter.h) is
responsible for balancing the clean/dirty ratio by forcing pages
to be moved from the active queue to the inactive queue if insufficient
pages were found and no pages in the active queue were found with
a 0 weighting. This is the stress test. If the page daemon was not
able to find a sufficient number of pages in the active queue with a 0
weighting, it assumes that the memory system is under stress and speeds
up the syncer and more aggressively enforces RSS resource limits
(vm/vm_pageout.c line 1081).
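Condensed down, the implied stress test is roughly this (a sketch, not
the actual vm_paging_target() code):

/* hypothetical sketch -- invented names, not real kernel code */
struct scan_result {
        int pages_freed;        /* clean pages found/laundered this pass */
        int zero_weight_found;  /* pages whose weight had dropped to 0   */
};

static int memory_stressed(const struct scan_result *r, int paging_target)
{
        /* not enough reclaimable pages, and nothing aged down to 0:
         * speed up the syncer and enforce RSS limits harder */
        return r->pages_freed < paging_target && !r->zero_weight_found;
}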
I will readily admit that FreeBSD suffers from accretions here... there
has been a lot of hacking of the code. The main reason FreeBSD is not
using a completely pure center-weighted LRU algorithm is to avoid
thrashing. It attempts to enforce leaving a page in the active queue
for a certain minimum period of time (based on the weighting having to
drop to 0 in order for the page to leave the queue), and FreeBSD doesn't
mind stalling processes a bit if this winds up occurring, because if it
takes recently activated pages out of the active queue too quickly,
the system's thrash point occurs a lot sooner.
:...
:
:> When the memory load increases, an attempt is made to balance the
:> dirty/clean ratio by 'cleaning' dirty pages, which of course means
:> paging them out. FreeBSD makes NO distinction between writing a dirty
:> file-backed page and allocating swap for a dirty anonymous memory page.
:> The same per-page memory-use statistic is also used to determine which
:> dirty pages to clean first. In effect, it is precisely this attempt
:> to balance the dirty/clean ratio which increases the number of clean
:> pages available to reuse. The idea here is to increase the number of
:> clean pages to the point where the system is no longer being forced
:> to reuse 'active' pages. Once this is achieved there is no longer any
:> need to clean the remaining dirty pages.
:
:I've read the code and it seems to do this:
:- put inactive pages in the cached queue
:- launder up to maxlaunder pages per scan (if needed)
Correct. Basically the way to think of the algorithm is to ignore
the fact that an active queue exists at all. The active queue holds
pages in an unstressed memory environment. In a stressed memory
environment the active queue imposes a minimum in-core-resident time
for the page to avoid thrashing.
The system implements the page weighting independently for both the
active and inactive queues. Under medium stress conditions pages
are moved from the active to the inactive queue based on their weighting
reaching 0. Once in the inactive queue (under medium stress conditions),
pages tend to be LRU ordered (all their weightings tend to be 0 at this
point). But under heavier loading conditions pages may be moved from
the active to the inactive queue forcefully, before their weighting
reaches 0. In this case the inactive queue serves the same loosely-sorted
weighting function that the active queue was supposed to serve until
it too runs out of pages.
The page laundering code is stuffed in the middle there somewhere using
a simple heuristic to try to start laundering pages during the inactive
queue scan. It is not strictly based on page weighting but it winds up
being that way due to the loosely weighted ordering of the pages in the
queue.
:> Under extreme memory loads the balance point moves on its own to a
:> point where FreeBSD tries to keep as many pages in a clean state as
:> possible. When the memory load gets to this point the system is
:> considered to be thrashing and we start taking anti-thrashing measures,
:> such as swapping out whole processes and holding them idle for 20-second
:> spans. It rarely gets to this point, but even when it does the system
:> is still kept reasonably balanced.
:
:Uhmmmm. This does not seem to be exactly what the code does...
:(then again, I could be wrong)
You're right. It isn't *exactly* what the code does, but it's the
algorithmic effect minus the confusion of implementing the algorithm
over multiple page queues (active/inactive/cache), which was done
solely to avoid page thrashing.
:Always trying to keep 1/3rd of the mapped pages in an "inactive"
:age is probably a good idea for Linux. The biggest problem with
:our code seems to be that we're *not* trying to have a balanced
:aging of pages ...
That's my take too. Also note that FreeBSD doesn't bother deactivating
pages unless there is actually some stress on the memory system... if
you look at a lightly loaded FreeBSD box you will see a huge number of
'active' pages -- most of those are probably idle. The moving of pages
from queue to queue only occurs when the memory allocator starts to run
out. This alone does not indicate stress... it's only when the page
daemon is unable to find sufficient pages in the active queue with a
0 weighting that the stress starts to come into play.
:> * Stressing dirty pages in the system via I/O calls (read/write)
:>
:> The algorithm tends to cause sequential I/O calls to give pages
:> a middling weight, and since the pages are not reused they tend
:> to be recycled within their domain (so you don't blow the rest
:> of the cache).
:
:This is a good argument for starting at a middle weight. Thanks
:for bringing this case to my attention, we really want something
:like this.
This is something I added to FreeBSD 4.x after tests showed that
FreeBSD 3.x was 'blowing' the page cache unnecessarily when people
would do things like tar up directory trees. It made a big difference
in the performance of the rest of the system without affecting the
I/O-bound process.
There is a gotcha, though -- you have to be careful to allow some
balancing to occur so you do not recycle too few pages within the
domain. E.g. if the domain is only 10 pages, then the tar or dd or
whatever is throwing away potentially cacheable file pages much too soon.
I have a hack in 4.x now that slowly balances this case. See
vm_page_dontneed() in vm/vm_page.c.
:
:> * Stressing dirty pages in the system via mmap (shared R+W)
:>
:> The system tends to run low on clean pages, detected by the
:> fact that new allocations are reusing clean pages which have high
:> weights. When this occurs the pageout daemon attempts to 'clean'
:> dirty pages (page them out) in order to increase the number of
:> clean pages available. Having a larger number of clean pages
:> available tends to give them more time to age, thus reducing the
:> average weight the allocator sees. This is a negative feedback
:> loop which results in balance.
:
:Hummm, where in the code can I find info on this?
:It certainly sounds interesting...
This is a side effect of the maxlaunder stuff in the inactive page
scan done by the page daemon. However, under heavier load situations
we also speed up the filesystem syncer, and we also, through other means,
speed up the buffer daemon (which is responsible for staging out I/O
related to dirty filesystem buffers).
:> page. Once swap is assigned the clean page can be reused.
:>
:> If a swap-backed page is brought back into memory, it is brought
:> back in clean (swap is left assigned). Swap is only freed if
:> the page is re-dirtied by the process.
:>
:> Thus most anonymous-memory pages in a heavily loaded system tend
:> to remain clean, allowing them to be reused more easily and extending
:> the life of the system further along the curve before it reaches a
:> thrashing state.
:
:A few points:
:- it may be easier to assign swap entries based on virtual
: scanning of processes, otherwise you'll end up with the
: strange jumping between queue scanning and virtual scanning
: that FreeBSD is doing now
No, because a physical page may be associated with many processes
and thus have multiple virtual copies. Scanning by virtual results
in improperly weighting these pages and winds up being an O(N^2)
scanning algorithm rather than an O(N) or O(1) algorithm.
FreeBSD scans physical pages in order to guarantee deterministic
algorithmic overhead and to avoid unnecessary redundant scanning.
The two-phase I/O clustering operation is done on purpose... it may seem
a bit strange to you but it actually works! The first phase is to
locate the page candidate (the center weighted LRU algorithm), the second
phase is to locate clusterable pages near that candidate. The second
phase is explicitly restricted to candidates which wind up being
contiguous on the physical media... if we are going to do an I/O we might
as well make the best of it. The clustering goes to the core of the
reason why FreeBSD remains reasonably efficient even when under very
heavy loads, but it only works if FreeBSD is able to choose the right
page candidate in the first place.
:- vnodes are why FreeBSD is able to delay writing anonymous
: pages till later on
:- not every page weighs equally ... you want some fairness
: between memory hogs and small processes
This is handled by the 'memoryuse' resource limit. You cannot otherwise
determine which is the more important process. You definitely do not
want to make the assumption that the processes with larger RSS should
be paged out more... that will lead to serious performance problems
with things like, oh, your X session.
So FreeBSD does not make the distinction until it sees serious memory
stress on the system. FreeBSD tries to operate strictly based on the
page-use weighting.
:- Linux already does the don't-rewrite-clean-swap thing
:
:... (more good stuff removed)
:>
:> A heuristic detects sequential read behavior and implements two
:> optimizations. (1) it implements read-aheads (as long as they
:> are reasonably contiguous on the physical media, we explicitly do
:...
:
:Ohhhh, nice ;)
:
:> * Sequential Detection Heuristic for read clustering (VM fault)
:>
:> A heuristic detects sequential VM fault operation, either forwards
:...
:
:Again ... neat.
Yes. These optimizations have the side effect of reducing memory
stress as well, which makes the LRU center weighting algorithm work
better.
:> * Sequential Detection Heuristic for write clustering (write())
:>
:> In the case of write() I/O (write system call), in order to
:> avoid saturating the memory system with dirty pages, if the
:...
:
:Cool.
Same effect on the LRU cw algorithm.
:> * Sequential Detection Heuristic for write clustering (mmap)
:>
:> Currently not implemented under FreeBSD. This used to be a big
:> problem because you could completely saturate the VM system with
:> dirty pages before the system even realized it. To fix this we
:> threw in a memory-stress check in vm_fault to block when dirtying
:> pages in the face of having too many dirty pages already, giving
:> I/O a chance to catch up a little.
:>
:> This actually improved performance because it left a greater number
:> of clean pages available and so the page selection algorithm in the
:> allocator worked better (tended to select idle pages rather then
:> active pages).
:
:This is an extremely good idea. When we have too many (or just,
:a lot of) dirty pages, we should delay faults in write-enabled
:VMAs, while allowing faults in read-only VMAs to proceed as
:fast as possible ...
:
:regards,
:
:Rik
Happy to help! If you have any other questions don't hesitate to
give me an email ring, I really enjoy talking tech with people who
know what they are doing :-).
-Matt
Matthew Dillon
<dillon@backplane.com>
[-- Attachment #2.4: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd) --]
[-- Type: MESSAGE/RFC822, Size: 3608 bytes --]
Date: Tue, 23 May 2000 15:11:04 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
X-Sender: riel@duckman.distro.conectiva
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)
In-Reply-To: <200005231706.KAA68220@apollo.backplane.com>
Message-ID: <Pine.LNX.4.21.0005231454380.19121-100000@duckman.distro.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
On Tue, 23 May 2000, Matthew Dillon wrote:
> :I'm already looking at implementing this (and have been for a few
> :weeks). I'm reading vm/vm_pageout.c in detail now, mainly to scan
> :for pitfalls which I may have overlooked in my own design.
>
> Great!
I'm currently looking at which part of the code I can implement
before 2.4 is released by Linus. I've read your email and would
like to think I'm understanding most of the rationale behind the
FreeBSD VM subsystem...
> :- not every page weighs equally ... you want some fairness
> : between memory hogs and small processes
>
> This is handled by the 'memoryuse' resource limit. You cannot otherwise
> determine which is the more important process. You definitely do not
> want to make the assumption that the processes with larger RSS should
> be paged out more... that will lead to serious performance problems
> with things like, oh, your X session.
This is not exactly what the antihog code does. It will scan
bigger processes more often, leading (hopefully) to:
1) better statistics about which pages are really used in big
processes
2) better interactive performance because the lightly used memory
of the user's shell is not paged out
3) more memory pressure on big processes, I've seen this speed
up the system because
a) small processes had fewer page faults
b) the memory hog(s) had less disk wait time
I haven't seen this code cause any problems for X or other interactive
processes ... if the memory is really used it won't be paged out,
it's just that a big process will have to use its pages more heavily
than a smaller process. To express this mathematically:
A = process a, rss size(A)
B = process b, rss size(B) [N times size(A)]
P(X) = memory pressure on process X
P(B) = P(A) * (size(B)/size(A)) * sqrt(size(B)/size(A))
and the memory pressure on every page from process B will be:
P(page in B) = P(page in A) * sqrt(size(B)/size(A))
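For example, with these formulas: if size(B) = 4 * size(A), then
P(page in B) = sqrt(4) * P(page in A) = 2 * P(page in A), and
P(B) = 4 * 2 * P(A) = 8 * P(A).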
I know this doesn't give the best system-wide aging of pages,
but it should provide fairer distribution of memory between
processes. There are no artificial restrictions on processes
with a big RSS, the only thing is that we pressure their pages
harder (and will inactivate their pages sooner).
This makes sure that a big process (that puts a lot of load on
the VM subsystem) does not "transfer" its memory load to smaller
processes. Lightly used pages in smaller processes will still be
paged out in favor of heavily used pages in the big process, but
small processes have a better chance of running well.
This is doing something comparable to what the swap code is
doing (penalise big processes more than small ones), but does
it in the paging code. It allows a few big processes to grow
until the point where they are thrashing without impacting
the rest of the system too much.
Then again, this VMS-influenced idea could be all wrong. I'd
*love* to hear your view on this ...
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
[-- Attachment #2.5: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd) --]
[-- Type: MESSAGE/RFC822, Size: 5927 bytes --]
Return-Path: <dillon@apollo.backplane.com>
Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156])
by duckman.distro.conectiva (8.9.3/8.8.7) with ESMTP id QAA21732
for <riel@duckman.distro.conectiva>; Tue, 23 May 2000 16:39:31 -0300
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
by perninha.conectiva.com.br (8.9.3/8.9.3) with ESMTP id QAA30959
for <riel@conectiva.com.br>; Tue, 23 May 2000 16:39:47 -0300
Received: (from dillon@localhost)
by apollo.backplane.com (8.9.3/8.9.1) id MAA69005;
Tue, 23 May 2000 12:39:07 -0700 (PDT)
(envelope-from dillon)
Date: Tue, 23 May 2000 12:39:07 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200005231939.MAA69005@apollo.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)
References: <Pine.LNX.4.21.0005231454380.19121-100000@duckman.distro.conectiva>
:This is not exactly what the antihog code does. It will scan
:bigger processes more often, leading (hopefully) to:
:1) better statistics about which pages are really used in big
: processes
:2) better interactive performance because the lightly used memory
: of the user's shell is not paged out
:3) more memory pressure on big processes, I've seen this speed
: up the system because
: a) small processes had less page faults
: b) the memory hog(s) had less disk wait time
:
:I haven't seen this code be any problem to X or other interactive
:processes ... if the memory is really used it won't be paged out,
:it's just that a big process will have to use its pages more heavily
:than a smaller process. To express this mathematically:
:
:A = process a, rss size(A)
:B = process b, rss size(B) [N times size(A)]
:P(X) = memory pressure on process X
:
: P(B) = P(A) * (size(B)/size(A)) * sqrt(size(B)/size(A))
:
:and the memory pressure on every page from process B will be:
:
: P(page in B) = P(page in A) * sqrt(size(B)/size(A))
:
:I know this doesn't give the best system-wide aging of pages,
:but it should provide fairer distribution of memory between
:processes. There are no artificial restrictions on processes
:with a big RSS, the only thing is that we pressure their pages
:harder (and will inactivate their pages sooner).
:
:This makes sure that a big process (that puts a lot of load on
:the VM subsystem) does not "transfer" its memory load to smaller
:processes. Lightly used pages in smaller processes will still be
:paged out in favor of heavily used pages in the big process, but
:small processes have a better chance of running well.
:
:This is doing something comparable to what the swap code is
:doing (penalise big processes more than small ones), but does
:it in the paging code. It allows a few big processes to grow
:until the point where they are thrashing without impacting
:the rest of the system too much.
:
:Then again, this VMS-influenced idea could be all wrong. I'd
:*love* to hear your view on this ...
:
:regards,
:
:Rik
Well, I have a pretty strong opinion on trying to rationalize
penalizing big processes simply because they are big. It's a bad
idea for several reasons, not the least of which being that by
making such a rationalization you are assuming a particular system
topology -- you are assuming, for example, that the system may contain
a few large less-important processes and a reasonable number of
small processes. But if the system contains hundreds of small processes
or if some of the large processes turn out to be important, the
rationalization fails.
Also if the large process in question happens to really need the pages
(is accessing them all the time), trying to page those pages out
gratuitously does nothing but create a massive paging load on the
system. Unless you have a mechanism (such as FreeBSD has) to
impose a 20-second forced sleep under extreme memory loads, any focus
on large processes will simply result in thrashing (read: screw up
the system).
Another reason for not trying to weight things in favor of small
processes is that the LRU algorithm *already* weights things in favor
of small processes, simply by virtue of the fact that small processes
tend to access/touch any given page in their RSS set much more often
than large processes access/touch their pages. The same goes for
shared pages --- which is why basing the algorithms on a physical page
scan works well. Basing algorithms on a virtual page scan, with lots
of physical pages being redundantly scanned, skews the statistics badly.
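A tiny user-space illustration of that skew (the mappings below are
invented; this is not kernel code): with three processes all mapping the
same shared physical page, one physical scan visits that page once, while
one virtual scan visits it once per mapping, so its reference statistics
get counted three times.

    #include <stdio.h>

    #define NPHYS 4
    #define NPROC 3

    int main(void)
    {
        /* map[p][v]: physical page backing virtual page v of process p.
         * Physical page 0 is shared (think: a libc page); 1-3 are private. */
        int map[NPROC][2] = { {0, 1}, {0, 2}, {0, 3} };
        int phys_visits[NPHYS] = {0, 0, 0, 0};
        int virt_visits[NPHYS] = {0, 0, 0, 0};
        int p, v;

        for (p = 0; p < NPHYS; p++)        /* one physical scan */
            phys_visits[p]++;

        for (p = 0; p < NPROC; p++)        /* one virtual scan  */
            for (v = 0; v < 2; v++)
                virt_visits[map[p][v]]++;

        for (p = 0; p < NPHYS; p++)
            printf("physical page %d: %d visit(s) physical scan, "
                   "%d visit(s) virtual scan\n",
                   p, phys_visits[p], virt_visits[p]);
        return 0;
    }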
FreeBSD has two mechanisms to deal with large processes, both used only
under duress. The first occurs when FreeBSD hits the memory
stress point - it starts enforcing the 'memoryuse' resource limit on
processes. The second occurs when FreeBSD hits the 'holy shit we're
starting to thrash' point - it starts forcefully swapping out processes
and holding them idle for 20-second periods (p.s. I have never personally
seen a production FreeBSD box get to the second stress point!).
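Roughly, as a user-space sketch of those two stages (every name, number
and threshold below is invented for illustration; this is not the actual
FreeBSD code):

    #include <stdio.h>

    struct proc {
        const char *name;
        long rss;        /* resident pages   */
        long memoryuse;  /* rlimit, in pages */
    };

    int main(void)
    {
        struct proc procs[] = {
            { "shell",      40,  2000 },
            { "xserver",  3000,  8000 },
            { "simhog",  90000, 20000 },
        };
        long free_pages = 150;                  /* pretend memory is tight  */
        int memory_stress = free_pages < 1000;  /* stage 1 trigger, assumed */
        int thrashing     = free_pages < 200;   /* stage 2 trigger, assumed */
        int i;

        if (memory_stress)                      /* stage 1: enforce rlimits */
            for (i = 0; i < 3; i++)
                if (procs[i].rss > procs[i].memoryuse) {
                    printf("stage 1: deactivate %ld pages of %s (over memoryuse)\n",
                           procs[i].rss - procs[i].memoryuse, procs[i].name);
                    procs[i].rss = procs[i].memoryuse;
                }

        if (thrashing)                          /* stage 2: swap out + sleep */
            printf("stage 2: swap out %s and hold it asleep for ~20 seconds\n",
                   procs[2].name);
        return 0;
    }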
Keep in mind that memory stress is best defined as "the system reused or
paged-out a page that some process then tries to use soon after". When
this occurs, processes stall, and the system is forced to issue I/O
(pageins) to unstall them. Worse, if the system just paged the page out
and then has to turn around and page it back in, the processes stall even
longer AND you wind up with a double whammy on the I/O load -- this is
my definition of thrashing. Attempting to reuse pages from large
processes which are truly accessing those pages will thrash the system
much sooner than simply treating all pages and processes the same and
focusing on the page-activity weighting.
-Matt
Matthew Dillon
<dillon@backplane.com>
[-- Attachment #2.6: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd) --]
[-- Type: MESSAGE/RFC822, Size: 6065 bytes --]
Date: Tue, 23 May 2000 20:18:25 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
X-Sender: riel@duckman.distro.conectiva
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)
In-Reply-To: <200005231939.MAA69005@apollo.backplane.com>
Message-ID: <Pine.LNX.4.21.0005231823290.19121-100000@duckman.distro.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
On Tue, 23 May 2000, Matthew Dillon wrote:
> :A = process a, rss size(A)
> :B = process b, rss size(B) [N times size(A)]
> :P(X) = memory pressure on process X
> :
> : P(B) = P(A) * (size(B)/size(A)) * sqrt(size(B)/size(A))
> :
> :and the memory pressure on every page from process B will be:
> :
> : P(page in B) = P(page in A) * sqrt(size(B)/size(A))
>
> Well, I have a pretty strong opinion on trying to rationalize
> penalizing big processes simply because they are big. It's a bad
> idea for several reasons, not the least of which being that by
> making such a rationalization you are assuming a particular system
> topology -- you are assuming, for example, that the system may contain
> a few large less-important processes and a reasonable number of
> small processes. But if the system contains hundreds of small processes
> or if some of the large processes turn out to be important, the
> rationalization fails.
The main assumption I was making here was that memory usage
would be dominated by a small number of bigger programs and not
by a larger number of small processes. I see your point that
I may be wrong here in some situations.
Also, memory pressure per page increases only two-fold if the
bigger process is four times bigger than the smaller process...
(likewise, 4 times for a 16 times bigger process, etc)
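The arithmetic, as a stand-alone check (plain user-space C taking
P(A) = 1.0; this is only the formula above, not the anti-hog code itself):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double p_a = 1.0;                     /* pressure on small process A */
        double ratios[] = { 1.0, 4.0, 16.0, 64.0 };
        int i;

        for (i = 0; i < 4; i++) {
            double n = ratios[i];             /* size(B) / size(A)      */
            double p_b = p_a * n * sqrt(n);   /* pressure on process B  */
            double p_page = p_a * sqrt(n);    /* pressure per page of B */
            printf("size ratio %4.0f: P(B) = %6.1f, P(page in B) = %4.1f\n",
                   n, p_b, p_page);
        }
        return 0;                             /* compile with -lm */
    }

For a size ratio of 4 the per-page pressure comes out at 2.0, for a ratio
of 16 it comes out at 4.0, matching the statement above.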
> Also if the large process in question happens to really need the pages
> (is accessing them all the time), trying to page those pages out
> gratuitously does nothing but create a massive paging load on the
> system.
But we don't. All we do is scan those pages more often. If
the big process is really using them all the time we will
not page them out. But we *will* try the pages of the big
process more often than the pages of smaller processes...
> Unless you have a mechanism (such as FreeBSD has) to
> impose a 20-second forced sleep under extreme memory loads, any focus
> on large processes will simply result in thrashing (read: screw up
> the system).
I have a proposal for this that could be implemented relatively
easily. It is a variant on the swapping scheme that ITS had a
long long time ago and is "somewhat" different from what any of
the Unices has. Since I didn't get a lot of response from the
Linux community (some "sweet, I hope it works" replies on IRC)
I'd be really interested in what someone like you would have to
say (or other people in the know ... do you know any?).
http://mail.nl.linux.org/linux-mm/2000-05/msg00273.html
> Another reason for not trying to weight things in favor of small
> processes is that the LRU algorithm *already* weights things in favor
> of small processes, simply by virtue of the fact that small processes
> tend to access/touch any given page in their RSS set much more often
> than large processes access/touch their pages. The same goes for
> shared pages --- which is why basing the algorithms on a physical page
> scan works well. Basing algorithms on a virtual page scan, with lots
> of physical pages being redundantly scanned, skews the statistics badly.
I'm curious about this ... would a small interactive task like
an xterm or a shell *really* touch its pages as much as a Netscape
or Mathematica?
I have no doubts about the second point though .. that needs to
be fixed. (but probably not before Linux 2.5, the changes in the
code would be too invasive)
> FreeBSD has two mechanisms to deal with large processes, both used only
> under duress. The first occurs when FreeBSD hits the memory
> stress point - it starts enforcing the 'memoryuse' resource limit on
I've grepped the source for this but haven't been able to find
the code which does this. I'm very interested in what it does
and how it is supposed to work (and how it would compare to my
"push memory hogs harder" idea).
> processes. The second occurs when FreeBSD hits the 'holy shit we're
> starting to thrash' point - it starts forcefully swapping out processes
> and holding them idle for 20-second periods (p.s. I have never personally
> seen a production FreeBSD box get to the second stress point!).
hehe
> Keep in mind that memory stress is best defined as "the system reused or
> paged-out a page that some process then tries to use soon after". When
> this occurs, processes stall, and the system is forced to issue I/O
> (pageins) to unstall them. Worse, if the system just paged the page out
> and then has to turn around and page it back in, the processes stall even
> longer AND you wind up with a double whammy on the I/O load -- this is
> my definition of thrashing. Attempting to reuse pages from large
> processes which are truly accessing those pages will thrash the system
> much sooner than simply treating all pages and processes the same and
> focusing on the page-activity weighting.
Ahh, but if a process is truly using those pages, we won't page
them out. The trick is that the anti-hog code does *not* impose
any limits ... it simply makes sure that a big process *really*
uses its memory before it can stay at a large RSS size.
(but I agree that this is slightly worse from a global, system
throughput point of view .. and I must admit that I have no
idea how much worse it would be)
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
[-- Attachment #2.7: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd) --]
[-- Type: MESSAGE/RFC822, Size: 9176 bytes --]
Return-Path: <dillon@apollo.backplane.com>
Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156])
by duckman.distro.conectiva (8.9.3/8.8.7) with ESMTP id VAA23117
for <riel@duckman.distro.conectiva>; Tue, 23 May 2000 21:32:56 -0300
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
by perninha.conectiva.com.br (8.9.3/8.9.3) with ESMTP id VAA06276
for <riel@conectiva.com.br>; Tue, 23 May 2000 21:32:46 -0300
Received: (from dillon@localhost)
by apollo.backplane.com (8.9.3/8.9.1) id RAA70157;
Tue, 23 May 2000 17:31:50 -0700 (PDT)
(envelope-from dillon)
Date: Tue, 23 May 2000 17:31:50 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200005240031.RAA70157@apollo.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)
References: <Pine.LNX.4.21.0005231823290.19121-100000@duckman.distro.conectiva>
:> Also if the large process in question happens to really need the pages
:> (is accessing them all the time), trying to page those pages out
:> gratuitously does nothing but create a massive paging load on the
:> system.
:
:But we don't. All we do is scan those pages more often. If
:the big process is really using them all the time we will
:not page them out. But we *will* try the pages of the big
:process more often than the pages of smaller processes...
:...
I don't think this really accomplishes your goal because the effect
of choosing the wrong page to reuse/flush is the same whether that
page comes from a large process or a small process. The effect
is a disk I/O and (if the wrong page is chosen) the process faulting
it back in immediately (a second disk I/O). This affects the entire
system, not just that one process.
When you skew the algorithm towards the larger process you are in effect
penalizing the large process, which I think is what you are trying to
accomplish. BUT, you are also penalizing the entire system by virtue
of causing the extra disk I/O. That's two marks against the idea...
penalizing a large process without knowing whether it deserves to be
penalized, and causing extra disk I/O by choosing the 'wrong' page
(that is, choosing a page that is likely to result in more disk activity
by being more active than the skewed statistic believes it is).
In a heavily loaded environment, using the normal center-weighted
LRU algorithm, a larger process is going to be penalized anyway by
virtue of the fact that it has more pages to be potentially reused.
I don't think you want to penalize it even more.
Quick mathematical proof: If you have a small process with 10 pages in
core and a large process with 100 pages in core, and all the pages have
the same weighting, and the allocator tries to allocate a new page,
90% of the allocations are going to be taken from the large process
and only 10% will be taken from the small process. Thus the large
process is already being penalized by the normal algorithm... you
don't have to penalize it any more. The algorithm is self-correcting
for large processes.
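The same point as a runnable check (a toy uniform-random reclaim over 110
equally weighted pages; not kernel code). It lands near 91%/9%, i.e. in
proportion to resident size:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int small_pages = 10, large_pages = 100;
        const int total = small_pages + large_pages;
        const int reclaims = 100000;
        int from_small = 0, i;

        srand(1);
        for (i = 0; i < reclaims; i++)
            if (rand() % total < small_pages)  /* page picked uniformly */
                from_small++;

        printf("small process loses %5.1f%% of the reclaimed pages\n",
               100.0 * from_small / reclaims);
        printf("large process loses %5.1f%% of the reclaimed pages\n",
               100.0 * (reclaims - from_small) / reclaims);
        return 0;
    }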
:I have a proposal for this that could be implemented relatively
:..
:
: http://mail.nl.linux.org/linux-mm/2000-05/msg00273.html
Very interesting idea! In reference to your comment
'swapping doesn't actually swap anything...' near the end...
this is *precisely* the method that FreeBSD uses when
the idle-swap system feature is turned on. FreeBSD also
depresses the activity weight for the pages to make them
more likely to be reused (FreeBSD does this simply by moving the
page out of the active queue and onto the inactive queue).
I'm not sure you can use p->state as one of the bases for your
memory load calculation, because there are many classes of I/O
bound applications which nevertheless have low memory footprints.
You have to be careful not to generate a positive feedback loop.
That is, where you make some calculation X which has consequence Y
which has the side effect of increasing the calculation X2, which
has consequence Y2, and so forth until the calculation of X becomes
absurd and is no longer a proper measure of the state of the system.
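As a toy illustration of that runaway shape (all coefficients below are
made up; only the divergence matters): if the load estimate is driven by
how many processes are blocked, and the pageout it triggers blocks more
processes, the estimate feeds on itself.

    #include <stdio.h>

    int main(void)
    {
        double stalled = 2.0;                   /* processes blocked on I/O    */
        int step;

        for (step = 0; step < 8; step++) {
            double load = stalled;              /* estimate based on p->state  */
            double pageouts = 3.0 * load;       /* higher load -> more pageout */
            stalled += 0.5 * pageouts;          /* pageouts stall more procs   */
            printf("step %d: load %.1f, pageouts %.1f, now stalled %.1f\n",
                   step, load, pageouts, stalled);
        }
        return 0;
    }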
:I'm curious about this ... would a small interactive task like
:an xterm or a shell *really* touch its pages as much as a Netscape
:or Mathematica?
:
:I have no doubts about the second point though .. that needs to
:be fixed. (but probably not before Linux 2.5, the changes in the
:code would be too invasive)
Yes, absolutely. It's simply due to tighter code... with fewer
active pages, smaller processes tend to touch more of them whenever
they do anything (like hitting a key in an xterm). Larger processes
tend to zone their memory. They will have a core set of pages
they touch as much as any other process, but the more complex
operations of larger processes will also touch additional pages depending on
what they are doing.
So with something like Netscape the 'broader core' set of pages it
touches will be different depending on what site you are visiting.
As you visit more sites, the RSS footprint starts to bloat, but many
of those pages will tend to be more 'idle' than the core pages in the
smaller process.
:> FreeBSD has two mechanisms to deal with large processes, both used only
:> under duress. The first occurs when FreeBSD hits the memory
:> stress point - it starts enforcing the 'memoryuse' resource limit on
:
:I've grepped the source for this but haven't been able to find
:the code which does this. I'm very interested in what it does
:and how it is supposed to work (and how it would compare to my
:"push memory hogs harder" idea).
vm/vm_pageout.c, vm_daemon() process, around line 1392. The vm_daemon
is only woken up under certain heavy load situations. The actual
limit check occurs on line 1417. The actual forced page deactivation
occurs on line 1431 (the call to vm_pageout_map_deactivate_pages).
The enforced sleep is handled by the swapin code. vm/vm_glue.c
line 364 (in scheduler()). If a process is forcefully swapped out,
it is not allowed to wakeup until the system has recovered sufficiently
for it to be able to run reasonably well. The longer the process
stays in its forced sleep, the more likely it is to be given the cpu back,
so after a certain point it *will* wake up again, even if the system
is still thrashing. It's not strictly 20 seconds any more (it was in
pre-freeware BSDs).
( I'm assuming you are using the 4.x source tree as your reference
here ).
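A trivial sketch of the wakeup preference described there (not the
scheduler() code itself; just the rule that, other things being equal,
the process asleep longest is brought back first):

    #include <stdio.h>

    struct swapped { const char *name; int seconds_out; };

    int main(void)
    {
        struct swapped procs[] = { {"hog1", 5}, {"hog2", 31}, {"hog3", 12} };
        int i, best = 0;

        for (i = 1; i < 3; i++)          /* pick the longest-sleeping one */
            if (procs[i].seconds_out > procs[best].seconds_out)
                best = i;

        printf("swap back in: %s (asleep %d seconds)\n",
               procs[best].name, procs[best].seconds_out);
        return 0;
    }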
:> Keep in mind that memory stress is best defined as "the system reused or
:> paged-out a page that some process then tries to use soon after". When
:> this occurs, processes stall, and the system is forced to issue I/O
:> (pageins) to unstall them. Worse, if the system just paged the page out
:...
:
:Ahh, but if a process is truly using those pages, we won't page
:them out. The trick is that the anti-hog code does *not* impose
:any limits ... it simply makes sure that a big process *really*
:uses its memory before it can stay at a large RSS size.
Everything starts to break down when you hit extreme loads. Even
simple things like trying to give priority to one type of process
over another. If you weight in favor of smaller processes and happen
to be running on a system with larger processes, your algorithm could
(for example) wind up scanning their pages *too* often, depressing their
priority so quickly that the system may believe it can page out their
pages when, in fact, those processes are still using the pages.
Another example: When a system starts to thrash, *ALL* processes
doing any sort of I/O (paging or normal I/O) begin to stall. If under
heavy load conditions your algorithm 'speeds up' the page scan on an
unequal basis (only for certain processes), then the combination of
the sped-up scan AND the lower page access rate that is occurring due
to the thrashing itself may lead your algorithm to believe it can
steal pages from those processes when in reality stealing pages will
only result in even worse thrashing.
In fact, one of the reasons why FreeBSD went to a multi-queue approach
was to avoid these sorts of positive feedback loops. It has a separate
'active' queue as a means of guaranteeing that when a page is faulted in,
it *stays* in for a certain minimum period of time before being available
for paging out again, and if the enforced wait causes the system to run
low on memory and stall more, then too bad -- the alternative (insane
thrashing) is worse.
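A minimal user-space sketch of that guarantee (the queue names mirror the
description above; the residency threshold and everything else are
invented): a page that was just faulted in stays on the active queue for a
minimum number of ticks before it may be moved to the inactive queue, no
matter how short memory is.

    #include <stdio.h>

    #define NPAGES        4
    #define MIN_RESIDENCY 3    /* ticks a page must stay active (assumed) */

    struct page { int id; int faulted_at; int active; };

    int main(void)
    {
        struct page pages[NPAGES] = {
            {0, 0, 1}, {1, 1, 1}, {2, 3, 1}, {3, 5, 1}
        };
        int tick, i;

        for (tick = 0; tick < 8; tick++)
            for (i = 0; i < NPAGES; i++)
                if (pages[i].active &&
                    tick - pages[i].faulted_at >= MIN_RESIDENCY) {
                    /* only now may the page be made reclaimable */
                    pages[i].active = 0;
                    printf("tick %d: page %d moves to the inactive queue\n",
                           tick, pages[i].id);
                }
        return 0;
    }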
:(but I agree that this is slightly worse from a global, system
:throughput point of view .. and I must admit that I have no
:idea how much worse it would be)
:
:regards,
:
:Rik
-Matt
Matthew Dillon
<dillon@backplane.com>