From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from digeo-nav01.digeo.com (digeo-nav01.digeo.com [192.168.1.233]) by packet.digeo.com (8.9.3+Sun/8.9.3) with SMTP id RAA18595 for ; Fri, 6 Sep 2002 17:30:08 -0700 (PDT) Message-ID: <3D794886.D9167993@digeo.com> Date: Fri, 06 Sep 2002 17:29:58 -0700 From: Andrew Morton MIME-Version: 1.0 Subject: Re: inactive_dirty list References: <3D793B9E.AAAC36CA@zip.com.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Rik van Riel Cc: Andrew Morton , "linux-mm@kvack.org" List-ID: Rik van Riel wrote: > > On Fri, 6 Sep 2002, Andrew Morton wrote: > > > My current code wastes CPU in the situation where the > > zone is choked with dirty pagecache. It works happily > > with mem=768M, because only 40% of the pages in the zone > > are dirty - worst case, we get a 60% reclaim success rate. > > Which still doesn't deal with the situation where the > dirty pages are primarily anonymous or MAP_SHARED > pages, which don't fall under your dirty page accounting. That's right - we're writing those things out as soon as we scan them at present. If we move them over to the dirty page list when their dirtiness is discovered then the normal writeback stuff would kick in. But it's laggy, of course. > > So I'm looking for ways to fix that. The proposal is to > > move those known-to-be-unreclaimable pages elsewhere. > > Basically, when scanning the zone we'll see "hmmm, all pages > were dirty and I scheduled a whole bunch for writeout" and > we _know_ it doesn't make sense for other threads to also > scan this zone over and over again, at least not until a > significant amount of IO has completed. Yup. But with this proposal it's "hmm, the inactive_clean list has zero pages, and the inactive_dirty list has 100,000 pages". The VM knows exactly what is going on, without any scanning. The appropriate action would be to kick pdflush, advance to the next zone, and if that fails, take a nap. > > Another possibility might be to say "gee, all dirty. Try > > the next zone". > > Note that this also means we shouldn't submit ALL dirty > pages we run into for IO. If we submit a GB worth of dirty > pages from ZONE_HIGHMEM for IO, it'll take _ages_ before > the IO for ZONE_NORMAL completes. The mapping->dirty_pages-based writeback doesn't know about zones... Which is good in a way, because we can schedule IO in filesystem-friendly patterns. > Worse, if we're keeping the IO queues busy with ZONE_HIGHMEM > pages we could create starvation of the other zones. Right. So for a really big high:low ratio, that could be a problem. For these systems, in practice, we know where the cleanable ZONE_NORMAL pagecache lives: blockdev_superblock->inodes->mapping->dirty_pages. So we could easily schedule IO specifically targetted at the normal zone if needed. But it will be slow whatever we do, because dirty blockdev pagecache is splattered all over the platter. > Another effect is that a GB of writes is sure to slow down > any subsequent reads, even if 100 MB of RAM has already been > freed... > > Because of this I want to make sure we only submit a sane > amount of pages for IO at once, maybe > max(zone->pages_high, 4 * (zone->pages_high - zone->free_pages) ? And what, may I ask, was wrong with 42? ;) Point taken on the IO starvation thing. But you know my opinion of the read-vs-write policy in the IO scheduler... > > more hm. It's possible that, because of the per-zone-lru, > > we end up putting way too much swap pressure onto anon pages > > in highmem. For the 1G boxes. This is an interaction between > > per-zone LRU and the page allocator's highmem-first policy. > > > > Have you seen this in 2.4-rmap? It would happen there, I suspect. > > Shouldn't happen in 2.4-rmap, I've been careful to avoid any > kind of worst-case scenarios like that by having a number of > different watermarks. > > Basically kswapd won't free pages from a zone which isn't in > severe trouble if we don't have a global memory shortage, so > we will have allocated memory from each zone already before > freeing the next batch of highmem pages. I'm not sure that works... If the machine has 800M normal and 200M highmem and is cruising along with 190M of dirty pagecache (steady state, via balance_dirty_state) then surely the poor little 10M of anon pages which are in the highmem zone will be swapped out quite quickly? Probably it doesn't matter much - chances are they'll get swapped back into ZONE_NORMAL and then live a happy life. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/