From mboxrd@z Thu Jan 1 00:00:00 1970
From: kanoj@google.engr.sgi.com (Kanoj Sarcar)
Message-Id: <200003202029.MAA75378@google.engr.sgi.com>
Subject: Re: More VM balancing issues..
Date: Mon, 20 Mar 2000 12:29:17 -0800 (PST)
In-Reply-To: from "Linus Torvalds" at Mar 17, 2000 06:59:19 PM
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Linus Torvalds
Cc: linux-mm@kvack.org, Ben LaHaise, Christopher Zimmerman,
	Stephen Tweedie, Kanoj Sarcar
List-ID:

Okay, here it comes, you asked for it ... You know most of it anyway,
but seeing it all together might help.

1. In a theoretical sense, there are _only_ memory classes: DMA class
memory, direct mapped class memory and the rest. Code will ask for
dma, regular or other class memory (proactive balancing is needed for
intr context allocations, or otherwise when page stealing is
impossible or deadlock prone). Hence, theoretically, it makes sense to
decide how many pages in each memory _class_ we want to keep free for
such requests (based on application type, #cpu, memory, devices and fs
activity). Decisions on when pages need to be stolen should really be
_class_ based.

2. Linux uses zones to implement memory classes. The DMA zone
represents the DMA class, the DMA+regular zones represent the regular
class, and the DMA+regular+himem zones represent the other class.
Theoretically, that is why decisions on page stealing need to be
cumulative over the zones (a rough sketch of such a cumulative check
follows after point 5). This explains why I did most of the code that
way.

3. Implementation can of course diverge from theory (like using NRU in
place of LRU). In Documentation/vm/balance, I have tried laying down
the pros and cons of local vs cumulative balancing:

"In 2.3, zone balancing can be done in one of two ways: depending on
the zone size (and possibly of the size of lower class zones), we can
decide at init time how many free pages we should aim for while
balancing any zone. The good part is, while balancing, we do not need
to look at sizes of lower class zones, the bad part is, we might do
too frequent balancing due to ignoring possibly lower usage in the
lower class zones. Also, with a slight change in the allocation
routine, it is possible to reduce the memclass() macro to be a simple
equality.

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible.
Also, the balancing algorithm works the same way on the various
architectures, which have different numbers and types of zones. If we
wanted to get fancy, we could assign different weights to free pages
in different zones in the future."

4. In 2.3.50 and pre1, zone_balance_ratio[] is the ratio of each
_class_ of memory that you want free, which is intuitive.

5. For true NUMA machines, there will be memory nodes, and each node
will possibly have dma/regular/himem zones. For memory-hole
architectures, ie DISCONTIG machines, there will again be nodes, but
there will be a lot of nodes with only one class of memory (don't know
yet, there are not too many people working on this).
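To make the cumulative check from points 2 and 3 concrete, here is a
rough user-space sketch. This is illustration only, not the 2.3 kernel
code: the struct, the function name and the numbers are all made up. A
class is considered low on memory only when the free pages of its zone
plus all lower class zones drop below 1/64th of their combined size.

	/*
	 * Illustration only: made-up struct and numbers, not kernel code.
	 * zones[0..idx] stands for a memory class, ie the zone itself
	 * plus all of its lower class zones.
	 */
	#include <stdio.h>

	struct zone_sketch {
		const char *name;
		unsigned long size;		/* total pages in the zone */
		unsigned long free_pages;	/* free pages in the zone */
	};

	static int class_low_on_memory(struct zone_sketch *zones, int idx)
	{
		unsigned long size = 0, free = 0;
		int i;

		/* cumulate over the zone and all its lower class zones */
		for (i = 0; i <= idx; i++) {
			size += zones[i].size;
			free += zones[i].free_pages;
		}
		return free < size / 64;
	}

	int main(void)
	{
		struct zone_sketch zones[] = {
			{ "dma",     4096,  900 },
			{ "regular", 61440, 200 },
		};

		printf("dma class low on memory:     %d\n",
		       class_low_on_memory(zones, 0));
		printf("regular class low on memory: %d\n",
		       class_low_on_memory(zones, 1));
		return 0;
	}

With these (made-up) numbers, a purely zone-local check would call the
regular zone low on memory (200 < 61440/64 = 960), while the
cumulative check would not (900 + 200 = 1100 > 65536/64 = 1024), which
illustrates the kind of divergence between local and cumulative
balancing discussed in point 3.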
Coming specifically to the 2.3.99-pre2 code, I see a couple of bugs:

1. __alloc_pages needs to return NULL instead of doing
zone_balance_memory for the PF_MEMALLOC case.

	if (!(current->flags & PF_MEMALLOC))
		return(NULL);
	if (zone_balance_memory(zonelist)) {

2. The body of zone_balance_memory() should be replaced with the pre1
code, otherwise there are too many differences/problems to enumerate.
Unless you are also proposing changes in this area.

I attach a patch against 2.3.99-pre2 to fix these.

The other issues are:

1. In the face of races, you probably want to loop back in
__alloc_pages after zone_balance_memory() returns success. Something
like

	if (zone_balance_memory(zonelist)) {
		if (retry)
			return(NULL);
		retry++;
		goto tryagain;
	}

2. Due to purely zone-local computation, the pre2 version will more
easily fall back to lower zones while allocating memory (when it is
not necessary). Especially interesting will be cases where the regular
zone is much smaller than the dma zone, or the himem zone is tiny
compared to the regular zone. So, gone will be the protection that dma
and regular zones enjoyed in older versions.

Kanoj

--- mm/page_alloc.c	Mon Mar 20 09:38:48 2000
+++ mm/page_alloc.c	Mon Mar 20 11:48:02 2000
@@ -152,10 +152,10 @@
 
 	spin_unlock_irqrestore(&zone->lock, flags);
 
-	if (zone->free_pages > zone->pages_high) {
+	if (zone->free_pages > zone->pages_low)
 		zone->zone_wake_kswapd = 0;
+	if (zone->free_pages > zone->pages_high)
 		zone->low_on_memory = 0;
-	}
 }
 
 #define MARK_USED(index, order, area) \
@@ -233,21 +233,22 @@
 	zone = zonelist->zones;
 	for (;;) {
 		zone_t *z = *(zone++);
+		unsigned long free;
 		if (!z)
 			break;
-		if (z->free_pages > z->pages_low)
-			continue;
-
-		z->zone_wake_kswapd = 1;
-		wake_up_interruptible(&kswapd_wait);
+		free = z->free_pages;
+		if (free <= z->pages_high) {
+			if (free <= z->pages_low) {
+				z->zone_wake_kswapd = 1;
+				wake_up_interruptible(&kswapd_wait);
+			}
+			if (free <= z->pages_min)
+				z->low_on_memory = 1;
+		}
 
 		/* Are we reaching the critical stage? */
-		if (!z->low_on_memory) {
-			/* Not yet critical, so let kswapd handle it.. */
-			if (z->free_pages > z->pages_min)
-				continue;
-			z->low_on_memory = 1;
-		}
+		if (!z->low_on_memory)
+			continue;
 		/*
 		 * In the atomic allocation case we only 'kick' the
 		 * state machine, but do not try to free pages
@@ -307,6 +308,8 @@
 			return page;
 		}
 	}
+	if (!(current->flags & PF_MEMALLOC))
+		return(NULL);
 	if (zone_balance_memory(zonelist)) {
 		zone = zonelist->zones;
 		for (;;) {
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/