From: Andrew Morton
Date: Fri, 06 Sep 2002 22:28:44 -0700
To: Rik van Riel
Cc: "linux-mm@kvack.org"
Subject: Re: inactive_dirty list

Rik van Riel wrote:
>
> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > I have a silly feeling that setting DEF_PRIORITY to "12" will
> > simply fix this.
> >
> > Duh.
>
> Ideally we'd get rid of DEF_PRIORITY altogether and would
> just scan each zone once.

What I'm doing now is:

	#define DEF_PRIORITY 12		/* puke */

	for (priority = DEF_PRIORITY; priority; priority--) {
		int total_scanned = 0;

		nr_pages = shrink_caches(priority, &total_scanned);
		if (nr_pages > 0) {
			/* that didn't work: kick writeback, then nap */
			wakeup_bdflush(total_scanned);
			blk_congestion_wait(WRITE, HZ/4);
		}
	}

and in shrink_caches():

	max_scan = zone->nr_inactive >> priority;
	if (max_scan < nr_pages * 2)
		max_scan = nr_pages * 2;
	nr_pages = shrink_zone(zone, max_scan, gfp_mask, nr_pages);

So in effect, for a 32-page reclaim attempt we'll scan 64 pages of
ZONE_HIGHMEM, then 128 pages of ZONE_NORMAL/DMA.  If that doesn't
yield 32 pages we ask pdflush to write back 3 * 64 pages
(total_scanned, summed over the three zones).  Then take a nap.

Then do it again: scan 64 pages of ZONE_HIGHMEM, then 128 of
ZONE_NORMAL/DMA, then write back 192 pages, then nap.

Then do it again: scan 128 pages of ZONE_HIGHMEM, then 256 of
ZONE_NORMAL/DMA, then write back 384 pages, then nap.  Etc.

Plus there are the actual pages which we started IO against during
the LRU scan - there can be up to 32 of those.

BTW, it turns out that the main reason why kswapd was going silly is
that the VM is *not* treating `priority' as a logarithmic thing at
all:

	int max_scan = nr_inactive_pages / priority;

so the claims about scanning 1/64th of the list are crap.  That
thing scans 1/6th of the queue on the first pass.  In the mem=1G
case, that's 30,000 damn pages.  Maybe someone should take a look
at Marcelo's kernel?

There are a few warts: pdflush_operation() will fail if all the
pdflush threads are out doing something (pretty unlikely with the
nonblocking stuff; it might happen if writeback has to run
get_block()).  But we'll be writing back stuff anyway.

I changed blk_congestion_wait() a bit too.  The first version would
return immediately if no queues were congested (more than 75% full).
Now it will sleep even if no queues are congested, and it will return
as soon as someone puts back a write request.  If someone is silly
enough to call blk_congestion_wait() when there are no write requests
in flight at all, they get to take the full 1/4 second sleep.
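To put numbers on the linear-versus-logarithmic point above, here's
a throwaway userspace sketch.  The 180,000-page list length is an
assumption, picked only so that the linear formula reproduces the
~30,000-page first pass seen on the mem=1G box:

	#include <stdio.h>

	int main(void)
	{
		int nr_inactive = 180000;	/* assumed list length */
		int priority;

		/* 2.4-style linear scaling, DEF_PRIORITY == 6 */
		for (priority = 6; priority; priority--)
			printf("linear p=%2d: scan %6d pages\n",
				priority, nr_inactive / priority);

		/* logarithmic scaling with the nr_pages * 2 floor */
		for (priority = 12; priority; priority--) {
			int max_scan = nr_inactive >> priority;

			if (max_scan < 32 * 2)	/* 32-page attempt */
				max_scan = 32 * 2;
			printf("log    p=%2d: scan %6d pages\n",
				priority, max_scan);
		}
		return 0;
	}

The linear version is already scanning 30,000 pages on its very
first pass; the logarithmic one starts at 64 and doubles its way up.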
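And to spell out the new blk_congestion_wait() semantics, here's a
rough userspace analogy using pthreads.  It illustrates the wakeup
behaviour only - it is not the block-layer code, and the names are
made up:

	#include <pthread.h>
	#include <stdio.h>
	#include <time.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t put_back = PTHREAD_COND_INITIALIZER;

	/* "block layer": a write request was just put back */
	static void put_back_write_request(void)
	{
		pthread_mutex_lock(&lock);
		pthread_cond_broadcast(&put_back);
		pthread_mutex_unlock(&lock);
	}

	/* sleep until a request is put back, or timeout_ms at most */
	static void congestion_wait(int timeout_ms)
	{
		struct timespec ts;

		clock_gettime(CLOCK_REALTIME, &ts);
		ts.tv_nsec += (long)timeout_ms * 1000000;
		ts.tv_sec += ts.tv_nsec / 1000000000;
		ts.tv_nsec %= 1000000000;

		pthread_mutex_lock(&lock);
		/* no "is any queue congested?" check - always sleep */
		pthread_cond_timedwait(&put_back, &lock, &ts);
		pthread_mutex_unlock(&lock);
	}

	int main(void)
	{
		congestion_wait(250);	/* nothing in flight: full nap */
		printf("took the whole 250ms\n");
		return 0;
	}

If no writes are in flight, nothing ever signals, so the caller eats
the full timeout - exactly the "silly caller" case above.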
The mem=1G corner case is fixed, and page reclaim just doesn't figure
in the profile:

	address   ticks  % of ticks  symbol
	c012c034    288    0.317709  do_wp_page
	c0144ae0    316    0.348597  __block_commit_write
	c012c910    342    0.377279  do_anonymous_page
	c0143efc    353    0.389414  __find_get_block
	c012f7e0    356    0.392724  find_lock_page
	c012f9f0    356    0.392724  do_generic_file_read
	c01832bc    367    0.404858  ext2_free_branches
	c0136e70    371    0.409271  __free_pages_ok
	c010e7b4    386    0.425818  timer_interrupt
	c01e3cfc    414    0.456707  radix_tree_lookup
	c0141894    434    0.47877   vfs_write
	c012f580    474    0.522896  unlock_page
	c0134348    500    0.551578  kmem_cache_alloc
	c01347d0    531    0.585776  kmem_cache_free
	c013712c    574    0.633212  rmqueue
	c0141320    605    0.667409  generic_file_llseek
	c0156924    616    0.679544  count_list
	c0142c04    617    0.680647  fget
	c01091e0    793    0.874803  system_call
	c0155914    860    0.948714  __d_lookup
	c0144674   1076    1.187     __block_prepare_write
	c014c63c   1184    1.30614   link_path_walk
	c012fcd4  10932   12.0597    file_read_actor
	c0130674  16443   18.1392    generic_file_write_nolock
	c0107048  31293   34.5211    poll_idle

The balancing of the zones looks OK at first glance, and of course
the change in system behaviour under heavy writeout loads is
profound.

Let's do the MAP_SHARED-pages-get-a-second-round thing, and it'd be
good if we could come up with some algorithm for setting the current
dirty-pagecache clamping level rather than relying on the dopey
/proc/sys/vm/dirty_async_ratio magic number.

I'm thinking that dirty_async_ratio becomes a maximum ratio, and that
we dynamically lower it when large amounts of dirty pagecache would
be embarrassing.  Or maybe there's just no need for this.  Dunno.
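If we did go dynamic with it, one purely illustrative shape is
sketched below.  The linear scaling rule, the 5% floor and the 40%
ceiling are all assumptions for the sake of the example - none of
this is in the tree:

	#include <stdio.h>

	#define DIRTY_RATIO_MAX	40	/* old magic number, now a ceiling */

	/*
	 * Scale the permitted dirty-pagecache percentage down as free
	 * memory shrinks, so writeback starts earlier under pressure.
	 */
	static int effective_dirty_ratio(long free_pages, long total_pages)
	{
		int ratio = DIRTY_RATIO_MAX * free_pages / total_pages;

		if (ratio < 5)
			ratio = 5;	/* arbitrary floor */
		return ratio;
	}

	int main(void)
	{
		long total = 262144;	/* ~1G of 4k pages */
		long free;

		for (free = total; free >= total / 32; free /= 2)
			printf("free=%6ld pages -> dirty limit %d%%\n",
				free, effective_dirty_ratio(free, total));
		return 0;
	}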