From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Wed, 18 Jul 2001 10:54:52 +0200 (CEST) From: Mike Galbraith Subject: Re: [PATCH] Separate global/perzone inactive/free shortage In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org Return-Path: To: Bulent Abali Cc: "Stephen C. Tweedie" , Marcelo Tosatti , Rik van Riel , Dirk Wetter , linux-mm@kvack.org List-ID: On Mon, 16 Jul 2001, Bulent Abali wrote: > >> On Sat, 14 Jul 2001, Marcelo Tosatti wrote: > > > >> On highmem machines, wouldn't it save a LOT of time to prevent > allocation > >> of ZONE_DMA as VM pages? Or, if we really need to, get those pages into > >> the swapcache instantly? Crawling through nearly 4 gig of VM looking > for > >> 16 MB of ram has got to be very expensive. Besides, those pages are > just > >> too precious to allow some user task to sit on them. > > > >Can't we balance that automatically? > > > >Why not just round-robin between the eligible zones when allocating, > >biasing each zone based on size? On a 4GB box you'd basically end up > >doing 3 times as many allocations from the highmem zone as the normal > >zone and only very occasionally would you try to dig into the dma > >zone. > >Cheers, > > Stephen > > If I understood page_alloc.c:build_zonelists() correctly > ZONE_HIGHMEM includes ZONE_NORMAL which includes ZONE_DMA. > Memory allocators (other than ZONE_DMA) will dip in to the dma zone > only when there are no highmem and/or normal zone pages available. > So, the current method is more conservative (better) than round-robin > it seems to me. Not really. As soon as ZONE_NORMAL is engaged such that free_pages hits pages_low, we will pilpher ZONE_DMA. That's guaranteed to happen because that's exactly what we balance for. Once ZONE_NORMAL reaches pages_low, we will fall back to allocating ZONE_DMA exclusively. (problem yes?.. if agree, skip to 'possible solution' below;) Thinking about doing a find /usr on my box: we commit ZONE_NORMAL, then transition to exclusive use of ZONE_DMA instantly. These pages will go to kernel structures. (except for metadata. Metadata will be aged/laundered, and become available for more kernel structures) The tendancy is for ever increasing quantities to become invisible to the balancing mechanisms. Thinking about Dirk's logs of rsync leads me to believe that this must be the case. kreclaimd is eating cpu. It can't possibly be any other zone. When rsync has had 30 minutes of cpu, kswapd has had 40 minutes. kreclaimd has eaten 15 solid minutes. It can't possibly accumulate that much time unless ZONE_DMA is the problem.. the other zones are just too easy to find/launder/reclaim. Much worse is the case of Dirk's two 2gig simulations on a dual cpu 4gig box. It will guaranteed allocate all of the dma zone to his two tasks vm. It will also guaranteed unbalance the zone. Doesn't that also guarantee that we will walk pagetables endlessly each and every time a ZONE_DMA page is issued? Try to write out a swapcache page, and you might get a dma page, try to do _anything_ and you might get a ZONE_DMA page. With per zone balancing, you will turn these pages over much much faster than before, and the aging will be unfair in the extreme.. it absolutely has to be. SCT's suggestion would make the total pressure equal, but it would not (could not) correct the problem of searching for this handful of pages, the very serious cost of owning a ZONE_DMA page, nor the problem of a specific request for GFP_DMA pages having a reasonable chance of succeeding. IMHO, it is really really bad, that any fallback allocation can also bring the dma zone into critical, and these allocations may end up in kernel structures which are invisible to the balancing logic, making a search a complete waste of time. In any case, on a machine with lots of ram, the search is going to be disproportionately expensive due to the size of the search area. Possible solution: Effectively reserving the last ~meg (pick a number, scaled by ramsize would be better) of ZONE_DMA for real GFP_DMA allocations would cure Dirk's problem I bet, and also cure most of the others too, simply by ensuring that the ONLY thing that could unbalance that zone would be real GFP_DMA pressure. That way, you'd only eat the incredible cost of balancing that zone when it really really had to be done. > I think Marcello is proposing to make ZONE_DMA exclusive in large > memory machines, which might make it better for allocators > needing ZONE_DMA pages... That would have to save very much time on HIGHMEM boxes. No box with >= a gig of ram would even miss (a lousy;) 16MB. The very small bit of extra balancing of other zones would easily be paid for by the reduced search time (supposition). You'd certainly be doing large tasks a big favor by refusing to give them ZONE_DMA pages on a fallback allocation. I'd almost bet money (will bet a bogobeer:) that disabling fallback to ZONE_DMA entirely on Dirk's box will make his troubles instantly gone. Not that we don't need per zone balancing anyway mind you.. it's just the tiny zone case that is an absolute guaranteed performance killer. Comments? -Mike -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/