From: Andrew Morton <akpm@digeo.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: Andrew Morton <akpm@zip.com.au>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: inactive_dirty list
Date: Fri, 06 Sep 2002 17:29:58 -0700
Message-ID: <3D794886.D9167993@digeo.com>
In-Reply-To: <Pine.LNX.4.44L.0209062048320.1857-100000@imladris.surriel.com>

Rik van Riel wrote:
>
> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > My current code wastes CPU in the situation where the
> > zone is choked with dirty pagecache. It works happily
> > with mem=768M, because only 40% of the pages in the zone
> > are dirty - worst case, we get a 60% reclaim success rate.
>
> Which still doesn't deal with the situation where the
> dirty pages are primarily anonymous or MAP_SHARED
> pages, which don't fall under your dirty page accounting.

That's right: at present we write those pages out as soon as the
scanner finds them.  If we moved them over to the dirty page list when
their dirtiness is discovered, the normal writeback machinery would
kick in.  But it's laggy, of course.
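
Roughly the shape of it, as a sketch only - the structures and the
list handling below are simplified stand-ins, not the real mm code:

#include <stddef.h>

/*
 * Sketch: the scanner, on discovering that a page is dirty, parks it
 * on the zone's dirty list instead of starting IO itself; the normal
 * writeback path then deals with it later.
 */
struct page {
	struct page *next;		/* stand-in for the LRU linkage */
	int dirty;			/* dirtiness discovered at scan time */
};

struct zone {
	struct page *inactive_clean;
	struct page *inactive_dirty;
	unsigned long nr_inactive_clean;
	unsigned long nr_inactive_dirty;
};

/* Take one page off the clean list; defer it if it turns out dirty. */
static struct page *reclaim_or_defer(struct zone *zone)
{
	struct page *page = zone->inactive_clean;

	if (page == NULL)
		return NULL;

	zone->inactive_clean = page->next;
	zone->nr_inactive_clean--;

	if (page->dirty) {
		/* don't submit IO from here; writeback will find it */
		page->next = zone->inactive_dirty;
		zone->inactive_dirty = page;
		zone->nr_inactive_dirty++;
		return NULL;
	}
	return page;			/* clean: the caller can reuse it */
}
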
> > So I'm looking for ways to fix that. The proposal is to
> > move those known-to-be-unreclaimable pages elsewhere.
>
> Basically, when scanning the zone we'll see "hmmm, all pages
> were dirty and I scheduled a whole bunch for writeout" and
> we _know_ it doesn't make sense for other threads to also
> scan this zone over and over again, at least not until a
> significant amount of IO has completed.

Yup.  But with this proposal it's "hmm, the inactive_clean
list has zero pages, and the inactive_dirty list has 100,000
pages".  The VM knows exactly what is going on, without any
scanning.

The appropriate action would be to kick pdflush, advance to
the next zone, and if that fails, take a nap.
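
In other words, something like this - a toy sketch, where the zone
struct and the helpers (wakeup_pdflush() included) are just stubs for
the real things, and all the locking is omitted:

#include <stdio.h>

/* Stand-in types and stubbed-out helpers, not the real code. */
struct zone {
	const char *name;
	unsigned long nr_inactive_clean;
	unsigned long nr_inactive_dirty;
};

static void wakeup_pdflush(void) { printf("kick pdflush\n"); }
static void io_wait_nap(void)    { printf("nap until some IO has completed\n"); }

static int reclaim_clean_pages(struct zone *z, int nr)
{
	printf("reclaim %d clean pages from %s\n", nr, z->name);
	z->nr_inactive_clean -= nr;
	return nr;
}

/*
 * The decision is made from the list counters alone - no scanning.
 * A zone with clean inactive pages gets reclaimed from; a zone with
 * only dirty ones gets pdflush kicked and is skipped; if every zone
 * is choked with dirty pages, take a nap until IO has made progress.
 */
static int try_to_reclaim(struct zone *zones, int nr_zones, int nr_pages)
{
	int i;

	for (i = 0; i < nr_zones; i++) {
		struct zone *z = &zones[i];

		if (z->nr_inactive_clean >= (unsigned long)nr_pages)
			return reclaim_clean_pages(z, nr_pages);
		if (z->nr_inactive_dirty)
			wakeup_pdflush();
		/* nothing reclaimable here right now: try the next zone */
	}
	io_wait_nap();
	return 0;
}

int main(void)
{
	struct zone zones[] = {
		{ "HighMem", 0,     100000 },	/* choked with dirty pages */
		{ "Normal",  50000, 1000   },
	};

	return try_to_reclaim(zones, 2, 32) == 32 ? 0 : 1;
}
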
> > Another possibility might be to say "gee, all dirty. Try
> > the next zone".
>
> Note that this also means we shouldn't submit ALL dirty
> pages we run into for IO. If we submit a GB worth of dirty
> pages from ZONE_HIGHMEM for IO, it'll take _ages_ before
> the IO for ZONE_NORMAL completes.

The mapping->dirty_pages-based writeback doesn't know about
zones...

Which is good in a way, because we can schedule IO in
filesystem-friendly patterns.

> Worse, if we're keeping the IO queues busy with ZONE_HIGHMEM
> pages we could create starvation of the other zones.

Right.  So for a really big high:low ratio, that could be a
problem.

For these systems, in practice, we know where the cleanable
ZONE_NORMAL pagecache lives:
blockdev_superblock->inodes->mapping->dirty_pages.

So we could easily schedule IO specifically targeted at the
normal zone if needed.  But it will be slow whatever we do,
because dirty blockdev pagecache is splattered all over the
platter.
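
For concreteness, a sketch of aiming writeback at the blockdev
pagecache - with deliberately simplified stand-in types; the real walk
would go through the blockdev superblock's dirty inodes and each
inode's ->mapping->dirty_pages:

struct address_space {
	unsigned long nr_dirty_pages;
};

struct inode {
	struct inode *i_next;			/* next dirty inode */
	struct address_space *i_mapping;
};

struct super_block {
	struct inode *s_dirty;			/* dirty inodes on this sb */
};

/* stand-in for "submit up to 'quota' of this mapping's dirty pages" */
static unsigned long write_some_pages(struct address_space *m,
				      unsigned long quota)
{
	unsigned long n = m->nr_dirty_pages < quota ? m->nr_dirty_pages : quota;

	m->nr_dirty_pages -= n;
	return n;
}

/*
 * All of the dirty blockdev pagecache is reachable from one place, so
 * when ZONE_NORMAL is the zone in trouble we can aim the IO straight
 * at it instead of going through the global dirty-memory writeback.
 */
static void writeback_blockdev_pagecache(struct super_block *blockdev_sb,
					 unsigned long quota)
{
	struct inode *inode;

	for (inode = blockdev_sb->s_dirty; inode && quota; inode = inode->i_next)
		quota -= write_some_pages(inode->i_mapping, quota);
}
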
> Another effect is that a GB of writes is sure to slow down
> any subsequent reads, even if 100 MB of RAM has already been
> freed...
>
> Because of this I want to make sure we only submit a sane
> amount of pages for IO at once, maybe <pulls number out of hat>
> max(zone->pages_high, 4 * (zone->pages_high - zone->free_pages)) ?

And what, may I ask, was wrong with 42? ;)

Point taken on the IO starvation thing.  But you know
my opinion of the read-vs-write policy in the IO scheduler...
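
Your formula, written out as a helper so we're talking about the same
thing (the struct is a stand-in with just the two fields the
arithmetic needs):

struct zone_counters {
	unsigned long pages_high;
	unsigned long free_pages;
};

/* max(pages_high, 4 * (pages_high - free_pages)) */
static unsigned long writeout_quota(const struct zone_counters *z)
{
	unsigned long shortfall = 0;

	if (z->pages_high > z->free_pages)
		shortfall = z->pages_high - z->free_pages;

	return 4 * shortfall > z->pages_high ? 4 * shortfall : z->pages_high;
}

With pages_high at 256 and the zone completely drained that caps us at
1024 pages per pass, and a zone sitting at pages_high never gets more
than 256 queued on its behalf.  Sane enough.
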
> > more hm. It's possible that, because of the per-zone-lru,
> > we end up putting way too much swap pressure onto anon pages
> > in highmem. For the 1G boxes. This is an interaction between
> > per-zone LRU and the page allocator's highmem-first policy.
> >
> > Have you seen this in 2.4-rmap? It would happen there, I suspect.
>
> Shouldn't happen in 2.4-rmap, I've been careful to avoid any
> kind of worst-case scenarios like that by having a number of
> different watermarks.
>
> Basically kswapd won't free pages from a zone which isn't in
> severe trouble if we don't have a global memory shortage, so
> we will have allocated memory from each zone already before
> freeing the next batch of highmem pages.

I'm not sure that works...  If the machine has 800M normal
and 200M highmem and is cruising along with 190M of dirty
pagecache (steady state, via balance_dirty_state) then surely
the poor little 10M of anon pages which are in the highmem zone
will be swapped out quite quickly?

Probably it doesn't matter much - chances are they'll get swapped
back into ZONE_NORMAL and then live a happy life.
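
For reference, the interaction I mean, boiled down to a toy (stand-in
types and numbers, not the real allocator): user allocations try the
highest zone first and only fall back when it runs low, so the small
highmem zone is where the anon pages land, and per-zone reclaim there
leans on them hard.

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

struct toy_zone {
	unsigned long free_pages;
	unsigned long pages_low;
};

/* highmem-first fallback order for user/anon allocations */
static const int highuser_order[NR_ZONES] = {
	ZONE_HIGHMEM, ZONE_NORMAL, ZONE_DMA
};

static int alloc_user_page(struct toy_zone zones[NR_ZONES])
{
	int i;

	for (i = 0; i < NR_ZONES; i++) {
		struct toy_zone *z = &zones[highuser_order[i]];

		if (z->free_pages > z->pages_low) {
			z->free_pages--;
			return highuser_order[i];  /* zone the page landed in */
		}
	}
	return -1;				/* all low: go and reclaim */
}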