From: Mark Hemment <markhe@veritas.com>
To: "Stephen C. Tweedie" <sct@redhat.com>
Cc: yodaiken@fsmlabs.com, Jamie Lokier <lk@tantalophile.demon.co.uk>,
Alan Cox <alan@lxorguk.ukuu.org.uk>,
mingo@elte.hu, Andrea Arcangeli <andrea@suse.de>,
Marcelo Tosatti <marcelo@conectiva.com.br>,
Linus Torvalds <torvalds@transmeta.com>,
Rik van Riel <riel@conectiva.com.br>,
Roger Larsson <roger.larsson@norran.net>,
MM mailing list <linux-mm@kvack.org>,
linux-kernel@vger.kernel.org
Subject: Re: the new VMt
Date: Tue, 26 Sep 2000 13:10:30 +0100 (BST) [thread overview]
Message-ID: <Pine.LNX.4.21.0009261020020.11007-100000@alloc> (raw)
In-Reply-To: <20000925213201.C2615@redhat.com>
Hi,
On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:
> So you have run out of physical memory --- what do you do about it?
Why let the system get into the state where it is neccessary to kill a
process?
Per-user/task resource counters should prevent unprivileged users from
soaking up too many resources. That is the DoS protection.
So an OOM is possibly;
1) A privileged, legally resource hungry, app(s) has taken all
the memory. Could be too important to simply kill (it
should exit gracefully).
2) Simply too many tasks*(memory-requirements-of-each-task).
Ignoring allocations done by the kernel, the suitation comes down to the
fact that the system has over committed its memory resources. ie. it has
sold too many tickets for the number of seats in the plane, and all the
passengers have turned up.
(note, I use the term "memory" and not "physical memory", I'm including
swap space).
Why not protect the system from over committing its memory resources?
It is possible to do true, system wide, resource counting of physical
memory and swap space, and to deny a fork() or mmap() which would cause
over committing of memoy resources if everyone cashed in their
requirements.
Named pages (those which came from a file) are the simplest to
handle. If dirty, they already have allocated backing store, so we know
there is somewhere to put them when memory is low.
How many named pages need to be held in physical memory at any one
instance for the system to function? Only a few, although if you reach
that state, the system will be thrashing itself to death.
Anonymous and copied (those faulted from a write to an
MAP_PRIVATE|MAP_WRITE mapping) pages can be stored in either physical
memory or on swap. To avoid getting into the OOM suitation, when these
mappings are created the system needs to check that it has (and will have,
in the future) space for every page that _could_ be allocated for the
mapping - ie. work out the worst case (including page-tables).
This space could be on swap or in physical memory. It is the accounting
which needs to be done, not the actual allocation (and not even the
decision of where to store the page when allocated - that is made much
later, when it needs to be). If a machine has 2GB of RAM, a 1MB
swap, and 1GB of dirty anon or copied pages, that is fine.
I'm stressing this point, as the scheme of reserving space for an (as
yet) unallocated page is sometimes refered to as "eager swap
allocation" (or some such similar term). This is confusing. People then
start to believe they need backing store for each anon/copied pages. You
don't. You simply need somewhere to store it, and that could be a
physical page. It is all in the accounting. :)
Allocations made by the kernel, for the kernel, are (obviously) pinned
memory. To ensure kernel allocations do not completely exhaust physical
memory (or cause phyiscal memory to be over committed if the worst case
occurs), they need to be limited.
How to limit?
As I first guess (and this is only a guess);
1) don't let kernel allocations exceed 25% of physical memory
(tunable)
2) don't let kernel allocations succeed if they would cause
over commitment.
Both conditions would need to pass before an allocation could succeed.
This does need much more thought. Should some tuning be per subsystem?
I don't know....
Perhaps 1) isn't needed. I'm not sure.
Because of 2), the total physical memory accounted for anon/copied
pages needs to have a high watermark. Otherwise, in the accounting, the
system could allow too much physical memory to be reserved for these
types of pages (there doesn't need to be space on swap for each
anon/copied page, just space somewhere - a watermark would prevent too
much of this being physical memory). Note, this doesn't mean start
swapping earlier - remember, this is accounting of anon/copied pages to
avoid over commitment.
For named pages, the page cache needs to have a reserved number of
physical pages (ie. how small is it allowed to get, before pruning
stops). Again, these reserved pages are in the accounting.
mlock()ed pages need to have accouting also to prevent over commitment of
physical memory. All fun.
The disadvantages;
1) Extra code to do the accouting.
This shouldn't be too heavy.
2) mmap(MAP_ANON)/mmap(MAP_PRIVATE|MAP_SHARED) can fail more readily.
Programs which expect to memory map areas (which would created
anon/copied pages when written to) will see an increased failure
rate in mmap(). This can be very annoying, espically when you
know the mapping will be used sparsely.
One solution is to add a new mmap() flag, which tells the kernel
to let this mmap() exceed the actually resources.
With such a flag, the mmap() will be allowed, but the task should
expected to be killed if memory is exhausted. (It could be
possible for the kernel to deliver a SIGDANGER signal to such a
task, as in AIX, to give it a chance of reducing its requirments
on the system or to exit gracefully.)
Another solution is to allow the strict resource accounting to be
over ridden on a global basis. Say, by allowing the system to
over commit the memory resources by 10%. This does remove the
absolute protection, but leaves some in place. The OOM killer
would come into play if the system did over commit.
Those who don't need/want protection, could set the over commit to
some large value. 500%?
3) fork() failures.
There is the problem of a large(ish) process wanting to run a
small program. Say, a shell wanting to run a simple utility.
Because of the memory resource accounting, the fork() is
disallowed as the newly created child could (in theory) write to
mmap()ed areas, creating anon/copied pages which would cause the
kernel to (in the worst case) be OOM for user-pages. Given that
the child will almost immediately do an exec(), which could well
succeed, this is frustrating.
Again, a small over commit kludge would reduce (but not
eliminate), this occurance.
An idea from a colleague, is to allow such a fork() to succeed,
but to run the child process in a "container".
Inside the container, the child is allowed to perform operations
which would be expected before an exec(). Such operations could
be closing file descriptors. However, if it tries to do something
which would _seriously_ affect the state of the system (such as
remove a file), then it is killed. ie. given it a chance to
do an exec(). This could be done by running with an alternative
system call table for the child process, which refers to bounce
functions within the kernel where the checks are done (ie. don't
load the common code path with the checks).
This could be tricky to do, and there could well be a few
system (library?) calls which would make it impossible. However,
if it could be achieved, it would remove one of the most annoying
"features" of over commitment protection.
This sort of protection isn't to prevent DoS attacks; as said above,
they need to be on a per user/task level. This protection is to protect
against asynchronous failures on page faults due to OOM, and to make
them synchronous (from mmap(), fork(), mlock(), etc) where programs
expected to test for an error code.
There isn't much an application can do with a synchronous memory
failure; sleep and try again, release some of its own resources, or exit
gracefully.
Anyway, I've skipped over a lot of interesting details (and problems).
This stuff isn't new. Some commercial OS have this type of protection.
Comments?
Mark
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
next prev parent reply other threads:[~2000-09-26 12:10 UTC|newest]
Thread overview: 243+ messages / expand[flat|nested] mbox.gz Atom feed top
2000-09-24 10:11 __GFP_IO && shrink_[d|i]cache_memory()? Ingo Molnar
2000-09-24 18:11 ` Linus Torvalds
2000-09-24 18:40 ` Ingo Molnar
2000-09-24 18:39 ` Linus Torvalds
2000-09-24 18:46 ` Linus Torvalds
2000-09-24 18:59 ` Ingo Molnar
2000-09-24 19:34 ` [patch] vmfixes-2.4.0-test9-B2 Ingo Molnar
2000-09-24 20:20 ` Rui Sousa
2000-09-24 20:24 ` Andrea Arcangeli
2000-09-24 20:26 ` Ingo Molnar
2000-09-24 21:12 ` Andrea Arcangeli
2000-09-24 21:12 ` Ingo Molnar
2000-09-24 21:43 ` Stephen C. Tweedie
2000-09-24 22:13 ` Andrea Arcangeli
2000-09-24 22:36 ` [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks bert hubert
2000-09-24 23:41 ` Andrea Arcangeli
2000-09-25 16:24 ` Stephen C. Tweedie
2000-09-25 17:03 ` Andrea Arcangeli
2000-09-25 18:06 ` Stephen C. Tweedie
2000-09-25 19:32 ` Andrea Arcangeli
2000-09-25 19:26 ` Rik van Riel
2000-09-25 22:28 ` Andrea Arcangeli
2000-09-25 22:26 ` Rik van Riel
2000-09-25 22:51 ` Andrea Arcangeli
2000-09-25 22:30 ` Linus Torvalds
2000-09-25 23:03 ` Andrea Arcangeli
2000-09-25 23:18 ` Linus Torvalds
2000-09-26 0:32 ` Andrea Arcangeli
2000-09-25 22:30 ` Juan J. Quintela
2000-09-25 23:00 ` Andrea Arcangeli
2000-09-25 19:54 ` Stephen C. Tweedie
2000-09-25 22:44 ` Andrea Arcangeli
2000-09-25 22:42 ` Rik van Riel
2000-09-26 6:54 ` Christoph Rohland
2000-09-26 14:05 ` Andrea Arcangeli
2000-09-26 16:20 ` Christoph Rohland
2000-09-26 17:10 ` Andrea Arcangeli
2000-09-27 8:11 ` Christoph Rohland
2000-09-27 8:28 ` Ingo Molnar
2000-09-27 9:24 ` Christoph Rohland
2000-09-27 13:56 ` Andrea Arcangeli
2000-09-27 16:56 ` Christoph Rohland
2000-09-27 17:42 ` Andrea Arcangeli
2000-09-27 18:25 ` Erik Andersen
2000-09-27 18:55 ` Andrea Arcangeli
2000-09-28 10:08 ` Rik van Riel
2000-09-28 11:16 ` Rik van Riel
2000-09-28 14:52 ` Andrea Arcangeli
2000-09-29 14:39 ` Rik van Riel
2000-09-29 14:55 ` Andrea Arcangeli
2000-09-29 15:40 ` Rik van Riel
2000-09-28 11:31 ` Ingo Molnar
2000-09-28 14:54 ` Andrea Arcangeli
2000-09-28 15:13 ` Ingo Molnar
2000-09-28 15:23 ` Andrea Arcangeli
2000-09-28 16:16 ` Juan J. Quintela
2000-09-28 14:31 ` Andrea Arcangeli
2000-09-25 17:21 ` bert hubert
2000-09-25 17:49 ` Andrea Arcangeli
2000-09-25 15:09 ` Miles Lane
2000-09-25 15:51 ` Stephen C. Tweedie
2000-09-25 16:05 ` Ingo Molnar
2000-09-25 16:06 ` Alexander Viro
2000-09-25 16:20 ` Ingo Molnar
2000-09-25 16:29 ` Andrea Arcangeli
2000-09-25 4:56 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
2000-09-25 5:19 ` Alexander Viro
2000-09-25 6:06 ` Linus Torvalds
2000-09-25 6:17 ` Alexander Viro
2000-09-25 21:21 ` Alexander Viro
2000-09-26 13:42 ` [CFT][PATCH] ext2 directories in pagecache Alexander Viro
2000-09-26 21:29 ` Alexander Viro
2000-09-26 22:16 ` Marko Kreen
2000-09-26 22:31 ` Alexander Viro
2000-09-26 22:47 ` Marko Kreen
2000-09-27 7:32 ` Ingo Molnar
2000-09-27 9:22 ` Alexander Viro
2000-09-26 23:19 ` Andreas Dilger
2000-09-26 23:33 ` Alexander Viro
2000-09-26 23:44 ` Alexander Viro
2000-09-25 0:09 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
2000-09-25 0:49 ` Alexander Viro
2000-09-25 0:53 ` Marcelo Tosatti
2000-09-25 1:45 ` Andrea Arcangeli
2000-09-25 2:39 ` Marcelo Tosatti
2000-09-25 15:36 ` Andrea Arcangeli
2000-09-25 10:42 ` the new VM Ingo Molnar
2000-09-25 13:02 ` Andrea Arcangeli
2000-09-25 13:02 ` Ingo Molnar
2000-09-25 13:08 ` Andrea Arcangeli
2000-09-25 13:12 ` Ingo Molnar
2000-09-25 13:30 ` Andrea Arcangeli
2000-09-25 13:39 ` Ingo Molnar
2000-09-25 14:04 ` Andrea Arcangeli
2000-09-25 14:04 ` Ingo Molnar
2000-09-25 14:23 ` Andrea Arcangeli
2000-09-25 14:27 ` Ingo Molnar
2000-09-25 14:39 ` Andrea Arcangeli
2000-09-25 14:43 ` Ingo Molnar
2000-09-25 15:01 ` Andrea Arcangeli
2000-09-25 15:10 ` Ingo Molnar
2000-09-25 15:24 ` Andrea Arcangeli
2000-09-25 15:26 ` Ingo Molnar
2000-09-25 15:22 ` yodaiken
2000-09-26 19:10 ` Pavel Machek
2000-09-26 20:16 ` Andrea Arcangeli
2000-09-27 7:42 ` Ingo Molnar
2000-09-27 12:11 ` yodaiken
2000-09-27 14:08 ` Andrea Arcangeli
2000-09-25 16:09 ` Rik van Riel
2000-09-25 14:26 ` Marcelo Tosatti
2000-09-25 14:50 ` Andrea Arcangeli
2000-09-25 14:47 ` Alan Cox
2000-09-25 15:16 ` Ingo Molnar
2000-09-25 15:16 ` the new VMt Alan Cox
2000-09-25 15:33 ` the new VM Ingo Molnar
2000-09-25 15:41 ` the new VMt Andrea Arcangeli
2000-09-25 16:02 ` Ingo Molnar
2000-09-25 16:04 ` Andi Kleen
2000-09-25 16:19 ` Ingo Molnar
2000-09-25 16:18 ` Andi Kleen
2000-09-25 16:41 ` Andrea Arcangeli
2000-09-25 16:35 ` Linus Torvalds
2000-09-25 16:41 ` Rik van Riel
2000-09-25 16:49 ` Linus Torvalds
2000-09-25 17:03 ` Ingo Molnar
2000-09-25 17:17 ` Andrea Arcangeli
2000-09-25 17:10 ` Rik van Riel
2000-09-25 17:27 ` Andrea Arcangeli
2000-09-25 17:15 ` Andrea Arcangeli
2000-09-27 7:14 ` Rusty Russell
2000-09-25 20:23 ` Russell King
2000-09-25 16:28 ` Rik van Riel
2000-09-25 16:11 ` Andrea Arcangeli
2000-09-25 16:22 ` Ingo Molnar
2000-09-25 16:17 ` Alexander Viro
2000-09-25 16:36 ` Jeff Garzik
2000-09-25 16:57 ` Alan Cox
2000-09-25 17:01 ` Alexander Viro
2000-09-25 17:06 ` Alan Cox
2000-09-25 17:31 ` Oliver Xymoron
2000-09-25 17:51 ` Jeff Garzik
2000-09-25 19:03 ` the new VMt [4MB+ blocks] Matti Aarnio
2000-09-25 20:02 ` Stephen Williams
2000-09-25 16:33 ` the new VMt Andrea Arcangeli
2000-09-26 8:38 ` Jes Sorensen
2000-09-26 8:52 ` Ingo Molnar
2000-09-26 9:02 ` Jes Sorensen
2000-09-25 16:53 ` Alan Cox
2000-09-25 15:42 ` Stephen C. Tweedie
2000-09-25 16:05 ` Andrea Arcangeli
2000-09-25 16:22 ` Rik van Riel
2000-09-25 16:42 ` Andrea Arcangeli
2000-09-25 17:39 ` Stephen C. Tweedie
2000-09-25 16:51 ` Alan Cox
2000-09-25 17:43 ` Stephen C. Tweedie
2000-09-25 18:13 ` Alan Cox
2000-09-25 18:21 ` Stephen C. Tweedie
2000-09-25 19:09 ` Alan Cox
2000-09-25 19:21 ` Stephen C. Tweedie
2000-09-25 16:52 ` yodaiken
2000-09-25 17:18 ` Jamie Lokier
2000-09-25 17:51 ` yodaiken
2000-09-25 18:04 ` Jamie Lokier
2000-09-25 18:13 ` yodaiken
2000-09-25 18:24 ` Stephen C. Tweedie
2000-09-25 18:34 ` yodaiken
2000-09-25 18:48 ` Jamie Lokier
2000-09-25 19:25 ` Stephen C. Tweedie
2000-09-25 20:04 ` yodaiken
2000-09-25 20:23 ` Alan Cox
2000-09-25 20:35 ` yodaiken
2000-09-25 20:46 ` Alan Cox
2000-09-25 21:07 ` yodaiken
2000-09-26 9:54 ` Stephen C. Tweedie
2000-09-26 13:17 ` yodaiken
2000-09-25 20:47 ` Benjamin C.R. LaHaise
2000-09-25 21:12 ` yodaiken
2000-09-26 10:07 ` Stephen C. Tweedie
2000-09-26 13:30 ` yodaiken
2000-09-25 20:32 ` Stephen C. Tweedie
2000-09-26 12:10 ` Mark Hemment [this message]
2000-09-27 10:13 ` Andrey Savochkin
2000-09-27 12:55 ` Hugh Dickins
2000-09-28 3:25 ` Andrey Savochkin
2000-09-25 23:14 ` Erik Andersen
2000-09-26 15:17 ` yodaiken
2000-09-26 16:04 ` Stephen C. Tweedie
2000-09-26 17:02 ` Erik Andersen
2000-09-26 17:08 ` Stephen C. Tweedie
2000-09-26 17:45 ` Erik Andersen
2000-09-27 10:20 ` Andrey Savochkin
2000-09-26 21:13 ` Eric Lowe
2000-09-25 18:20 ` Andrea Arcangeli
2000-09-25 16:16 ` Rik van Riel
2000-09-25 16:55 ` Alan Cox
2000-09-25 15:48 ` the new VM Andrea Arcangeli
2000-09-25 15:40 ` Stephen C. Tweedie
2000-09-25 16:01 ` Andrea Arcangeli
2000-09-25 14:37 ` Rik van Riel
2000-09-25 20:34 ` Christoph Rohland
2000-10-06 16:14 ` Rik van Riel
2000-10-09 7:37 ` Christoph Rohland
2000-09-25 13:04 ` Ingo Molnar
2000-09-25 13:19 ` Andrea Arcangeli
2000-09-25 13:18 ` Ingo Molnar
2000-09-25 13:21 ` Ingo Molnar
2000-09-25 13:31 ` Andrea Arcangeli
2000-09-25 13:47 ` Ingo Molnar
2000-09-25 14:04 ` Andrea Arcangeli
2000-09-25 1:31 ` [patch] vmfixes-2.4.0-test9-B2 Andrea Arcangeli
2000-09-25 1:27 ` Alexander Viro
2000-09-25 2:02 ` Andrea Arcangeli
2000-09-25 2:01 ` Alexander Viro
2000-09-25 13:47 ` Stephen C. Tweedie
2000-09-25 10:13 ` Ingo Molnar
2000-09-25 12:58 ` Andrea Arcangeli
2000-09-25 13:10 ` Ingo Molnar
2000-09-25 13:49 ` Jens Axboe
2000-09-25 14:11 ` Ingo Molnar
2000-09-25 14:05 ` Jens Axboe
2000-09-25 16:46 ` Linus Torvalds
2000-09-25 17:05 ` Ingo Molnar
2000-09-25 17:23 ` Andrea Arcangeli
2000-09-25 14:20 ` Andrea Arcangeli
2000-09-25 14:11 ` Jens Axboe
2000-09-25 14:33 ` Andrea Arcangeli
2000-09-25 13:56 ` Andrea Arcangeli
2000-09-25 13:57 ` Ingo Molnar
2000-09-25 14:13 ` Andrea Arcangeli
2000-09-25 14:08 ` Jens Axboe
2000-09-25 14:29 ` Andrea Arcangeli
2000-09-25 14:18 ` Jens Axboe
2000-09-25 14:47 ` Andrea Arcangeli
2000-09-25 21:28 ` Jens Axboe
2000-09-25 22:14 ` Andrea Arcangeli
2000-09-25 14:13 ` Ingo Molnar
2000-09-25 14:29 ` Ingo Molnar
2000-09-25 14:46 ` Andrea Arcangeli
2000-09-25 14:53 ` Ingo Molnar
2000-09-25 15:02 ` Andrea Arcangeli
2000-09-24 21:38 ` __GFP_IO && shrink_[d|i]cache_memory()? Stephen C. Tweedie
2000-09-24 23:20 ` Alan Cox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.21.0009261020020.11007-100000@alloc \
--to=markhe@veritas.com \
--cc=alan@lxorguk.ukuu.org.uk \
--cc=andrea@suse.de \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lk@tantalophile.demon.co.uk \
--cc=marcelo@conectiva.com.br \
--cc=mingo@elte.hu \
--cc=riel@conectiva.com.br \
--cc=roger.larsson@norran.net \
--cc=sct@redhat.com \
--cc=torvalds@transmeta.com \
--cc=yodaiken@fsmlabs.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox