Date: Wed, 26 Apr 2000 03:07:04 +0200 (CEST)
From: Andrea Arcangeli
Subject: Re: 2.3.x mem balancing
To: Linus Torvalds
Cc: linux-mm@kvack.org
Sender: owner-linux-mm@kvack.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Tue, 25 Apr 2000, Linus Torvalds wrote:

>On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
>>
>> The design I'm using is in fact that each zone knows about each other, each
>> zone has a free_pages and a classzone_free_pages. The additional
>> classzone_free_pages gives us the information about the free pages on the
>> classzone and it's also inclusive of the free_pages of all the lower zones.
>
>AND WHAT ABOUT SETUPS WHERE THERE IS NO INCLUSION?

They're simpler. The classzone for them matches the zone.

Actually I told the kernel that a classzone is always composed of the zone
itself joined with all the lower zones (since I thought we're not going to
have a non lower-zone-inclusive setup any time soon, and so I made the code
faster by hardcoding that assumption), but if that stops being true we only
need to tell the code that updates classzone_free_pages which zones the
classzone is composed of. That can be achieved by keeping the list of the
zones that compose the classzone in an array embedded in the zone_t. A new
zone->classzones[MAX_NR_ZONES+1] will do the trick. If there's no inclusion,
zone->classzones[0] will be equal to zone and [1] will be NULL. I can do
that now if you think we'll _soon_ need that genericity.

Please consider this. When we do a GFP_KERNEL allocation, we want to
allocate from 1 zone. Not from two zones. When we do:

	alloc_page(GFP_KERNEL);

we want to allocate from 1 zone that is between 0 and 2giga.

	0                                              2g
	-------------------------------------------------
	|  ZONE_DMA   |          ZONE_NORMAL            |
	-------------------------------------------------
	|                  GFP_KERNEL                   |
	-------------------------------------------------

Incidentally we also have somebody (GFP_DMA) that wants to allocate from
ZONE_DMA, and so to be able to serve GFP_DMA allocations in O(1) we can't
create a single independent ZONE_NORMAL that spans 0 to 2g; we have to
split the place where GFP_KERNEL allocates into two zones, ZONE_DMA and
ZONE_NORMAL. But alloc_pages really has to treat 0-2g as a single zone,
because alloc_pages gets the semantics of the zone from the user, and the
user considers the 0-2g range a single zone.

If it's true, as you say, that we can make alloc_pages work on the 0-2g
zone using two completely separate and not-related-in-any-way zones, then
we could also split ZONE_DMA into ZONE_DMA0 and ZONE_DMA1 and be able to
keep the ZONE_DMA1 classzone balanced as if it were still ZONE_DMA. We
could in the same way split ZONE_NORMAL into several zones and still keep
ZONE_NORMAL in perfect balance. If the current zone based design is not
broken, then we can split any zone into as _many_ zones as we _want_
without hurting anything (except some more CPU and memory resources wasted
in the zone arrays since there would be more zones).

This is obviously wrong, and it also proves the current design is broken.

>Andrea, face it, the design is WRONG!
>
>You've made both free_pages() and alloc_pages() more complex and costly,
>and the per-zone spinlock cannot exist any more. WHICH IS BAD BAD BAD.

rmqueue and __free_pages_ok are now doing at once the work that
nr_free_buffer_pages was doing all the time. I don't know how performance
sensitive these changes are.

For the spinlock I fully agree, the previous code was much nicer.
However I couldn't find a way to stop parallel alloc_pages() calls from
fooling the memory balancing while using a per-zone lock (I could acquire
more than one spinlock in zone-decreasing order but that was going to hurt
too much), and so IMHO the previous code was risky independent of the zone
problems.

>Think of the case of a NUMA architecture - you sure as hell want to have
>all "local" zones share the same spinlock, because then you'll have to
>grab a global spinlock for each allocation, even if you only allocate from
>a local zone close to the node you're actually running on.

I'm sorry but I'm not sure I understand what you mean. NUMA scaling is not
penalized more than non-NUMA scaling as far as I can tell. The spinlock is
now in the node, so you are still allowed to allocate from two _different_
nodes at the same time as before.

	&node0->node_zones[0] != &node1->node_zones[0] &&
	&node0->freelist_lock != &node1->freelist_lock

>I tell you - you are doing the wrong thing. The earlier you realize that,
>the better.

I apologise but I still think I'm doing the right thing, and that the
current strict zone based design is not correct and will end up doing the
wrong thing sometime, with non obvious drawbacks.

Linus, you can convince me immediately if you show me how I can make sure
that all of ZONE_NORMAL is empty before going to allocate from ZONE_DMA
with the current design.

I'd also like to know how I can stop kswapd from trying to raise
ZONE_NORMAL->free_pages over ZONE_NORMAL->pages_high after kswapd has
successfully put ZONE_DMA over the ZONE_DMA->pages_high limit in the
previous pass (assuming ZONE_DMA->pages_high >= ZONE_NORMAL->pages_high,
which is a perfectly allowed value).

Also, from swap_out, how can I know how critical a page is by only looking
at how many free pages are in its zone (think of the case where the lower
zones are completely free)?

>> >> Now assume rest of memory zone (ZONE_NORMAL) is getting under the
>> >> zone_normal->page_low watermark.
>> >> We must definitely not start kswapd if
>> >> there are still 16mbyte of free memory in the classzone.
>> >
>> >No.
>> >
>> >We should just not allocate from that zone. Look at what
>> >__get_free_pages() does: it tries to first find _any_ zone that it can
>> >allocate from, and if it cannot find such a zone only _then_ does it
>> >decide to start kswapd.
>>
>> Woops I gave the wrong example to explain the spurious kswapd run and I
>> also didn't explain the real problem in that scenario.
>>
>> The real problem in that scenario is that you don't empty the ZONE_NORMAL
>> but you stop at the low watermark, while you should empty the ZONE_NORMAL
>> completely for allocation that supports ZONE_NORMAL memory (so for non
>> ISA_DMA memory) before falling back into the ZONE_DMA zone.
>
>Oh?
>
>And what is wrong with just changing the water-marks, instead of your
>global (and in my opinion stupid and wrong) change?
>
>Why didn't you just do a small little initialization routine that made the
>watermark for the DMA zone go up, and the watermark for the bigger zones
>go down?
>
>Let's face it, with the current Linux per-zone memory allocation, doing
>things like that is _trivial_. There are no magic couplings between
>different zones, so if you want to make sure that it's primarily only the
>DMA zone that is kept free, then you can do _exactly_ that by saying
>simply that the "critical watermark for the DMA zone should be 5%, while
>the critical watermark for the regular zone should be just 1%".

First, this means you'll leave 1% of ZONE_NORMAL free, while you should
have allocated that remaining 1% too before falling back to ZONE_DMA, so
your solution doesn't solve the problem but only hides it better. There's
not a single good reason to leave that 1% free, while there are obvious
good reasons to allocate that 1% too before falling back to the lower
zone.
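
To make this first point concrete, here is a minimal user-space sketch. It
is not the kernel code and not my patch: the toy_* names, the zone sizes
and the watermark values are all made up for illustration, and the
classzone check is only my assumption of how a single watermark over "zone
plus all lower zones" could be tested. It counts how many ZONE_NORMAL pages
are still free at the moment the first page gets taken from ZONE_DMA.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model, NOT kernel code. Index 0 plays ZONE_DMA and
 * index 1 plays ZONE_NORMAL; every name here is hypothetical. */
struct toy_zone {
	const char *name;
	long free_pages;	/* pages currently free in this zone */
	long pages_low;		/* per-zone low watermark */
};

/* Strict per-zone policy: walk down from the highest allowed zone and
 * take a page from the first zone still above its own watermark.
 * Returns the zone index used, or -1 if every zone is at its watermark. */
int toy_alloc_per_zone(struct toy_zone *z, int highest)
{
	assert(z != NULL && highest >= 0);
	for (int i = highest; i >= 0; i--) {
		if (z[i].free_pages > z[i].pages_low) {
			z[i].free_pages--;
			return i;
		}
	}
	return -1;
}

/* Classzone-style policy: compare one watermark against the free pages
 * of the zone plus all lower zones, then take the page from the highest
 * zone that has any page left. */
int toy_alloc_classzone(struct toy_zone *z, int highest, long class_low)
{
	long class_free = 0;

	assert(z != NULL && highest >= 0);
	for (int i = 0; i <= highest; i++)
		class_free += z[i].free_pages;
	if (class_free <= class_low)
		return -1;
	for (int i = highest; i >= 0; i--) {
		if (z[i].free_pages > 0) {
			z[i].free_pages--;
			return i;
		}
	}
	return -1;
}

/* Allocate until the first page comes out of ZONE_DMA and report how
 * many ZONE_NORMAL pages were still free at that moment. */
long toy_normal_left_at_first_dma(int use_classzone)
{
	struct toy_zone z[2] = {
		{ "ZONE_DMA",      4096,  205 },	/* made-up size, ~5% */
		{ "ZONE_NORMAL", 100000, 1000 },	/* made-up size, 1% */
	};
	const long class_low = 205 + 1000;	/* made-up classzone watermark */

	for (;;) {
		int got = use_classzone ? toy_alloc_classzone(z, 1, class_low)
					: toy_alloc_per_zone(z, 1);
		if (got != 1)
			return got == 0 ? z[1].free_pages : -1;
	}
}
```

With these made-up numbers toy_normal_left_at_first_dma(0) returns 1000,
i.e. the strict per-zone policy starts eating ZONE_DMA while 1000
ZONE_NORMAL pages are still free, and toy_normal_left_at_first_dma(1)
returns 0, i.e. ZONE_NORMAL is emptied first.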

Second, using 5% and 1% critical watermarks won't give you a 6% watermark
for the ZONE_NORMAL _class_zone; it will give you a 1% watermark instead,
and you probably wanted a 6% watermark to provide reasonable space for
atomic allocations and to have more chances of satisfying high order
allocations. (Note that the numbers 1%/5% and the global 6% are just random
values for me now; I'm not even thinking about whether they are good
defaults or not, just assume they are good defaults for the following
example.)

Assume you want a 6% watermark on the ZONE_NORMAL classzone, ok? Now
suppose 95% of ZONE_DMA is mlocked and an ISA-DMA network card allocates
the last 5% from irqs. Suppose ZONE_NORMAL is all _freeable_, unmapped page
cache, but 99% of ZONE_NORMAL is allocated in page cache and only 1% is
free.

Then the VM with the current design will keep only 1% free in the
ZONE_NORMAL _classzone_ because it has no knowledge of the classzone. If
pages_high had been 6% and referred to the classzone, the VM would have
immediately and correctly shrunk an additional 5% from the freeable
ZONE_NORMAL.

See? You can't fix that problem without changing the design.

>I did no tuning at all of the watermarks when I changed the zone
>behaviour. My bad. But that does NOT mean that the whole mm behaviour
>should then be reverted to the old and broken setup. It only means that
>the zone behaviour should be tuned.

If you try to be friendly with ZONE_DMA (by allocating from ZONE_NORMAL
when possible) you'll make allocations from the higher zones less reliable,
and you could end up doing a different kind of wrong thing, as explained a
few lines above.

>> You can't
>> optimize the zone usage without knowledge on the whole classzone. That's
>> the first basic thing where the strict zone based design will be always
>> inferior compared to a classzone based design.
>
>You are wrong. And you are so FUNDAMENTALLY wrong that I'm at a loss to
>even explain why.
>
>What's the matter with just realizing that the whole issue of DMA vs
>non-DMA, and local zone vs remote zone is just a very generic case of
>trying to balance memory usage. It has nothing at all to do with
>"inclusion", and I personally think that the whole notion of "inclusion"
>is fundamentally flawed. It adds a rule that shouldn't be there, and has
>no meaning.
>
>Andrea, how do you ever propose to handle the case of four different
>memory zones, all "equal", but all separate in that while all of memory
>is accessible from each CPU, each zone is "closer" to certain CPU's? Let's
>say that CPU's 0-3 have direct access to zone 0, CPU's 4-7 have direct
>access to zone 1, etc etc.. Whenever a CPU touches memory on a non-local
>zone, it takes longer for the cache miss, but it still works.

Note that what you call zones are really nodes. Your zone 0 is not a zone_t
but a pg_data_t instead.

>Now, the way =I= propose that this be handled is:
>
> - each cluster has its own zone list: cluster 0 has the list 0, 1, 2, 3,
>   while cluster 1 has the list 1, 2, 3, 0 etc. In short, each of them
>   preferentially allocate from their _own_ cluster. Together with
>   cluster-affinity for the processes, this way we can naturally (and with
>   no artificial code) try to keep cross-cluster memory accesses to
>   a minimum.

That has to be done in alloc_pages_node, which will later fall back into
alloc_pages. This problem is not relevant for alloc_pages IMVHO. In your
scenario you'll have only 1 zone per node, and there has never been any
problem with only one zone per node, since the classzone design is
completely equal to the zone design in that case. All the zones in a single
node are, by design, equally near to or far from any given cpu, since they
all belong to the same node.

>And note - the above works _now_ with the current code. And it will never
>ever work with your approach. Because zone 0 is not a subset of zone 1,
>they are both equal.
>It's just that there is a preferential order for
>allocation depending on who does the allocation.

IMVHO you put the NUMA stuff in the wrong domain by abusing the zones
instead of using the proper nodes.

>> About kswapd suppose we were low on memory on both the two zones and so we
>> started kswapd on both zones and we keep allocating waiting to reach the
>> min watermark. Then while kswapd is running a process exits and releases
>> all the first 16mbyte of RAM. kswapd correctly stops freeing the ISADMA
>> zone (because free_pages set zone_wake_kswapd back to zero) but kswapd
>> will keep freeing the normal zone for no good reason. You have no way to
>> stop the wrong kswapd in such scenario without a classzone design. If you
>> keep looking only at zone->free_pages and zone->pages_high then kswapd
>> will keep running for no good reason.
>
>The fact that you think that this is true obviously means that you haven't
>thought AT ALL about the ways to just make the watermarks work that way.

I don't see how playing with the watermarks can help. Playing with the
watermarks was my first idea, but I discarded it immediately when I saw
what would happen by lowering the watermarks of the higher zones.

And also, a strictly per-zone watermark doesn't make any sense in the first
place. You can't say how much you should free from ZONE_NORMAL if you don't
know the state of the lower ZONE_DMA zone.

All such pages_{high,low,min} watermarks can make sense only if they refer
to the classzone.

>Give it five seconds, and you'll see that the above is _exactly_ what the
>watermarks control, and _exactly_ by making all the flags per-zone
>(instead of global or per-class). And by just using the watermark

IMVHO we have used the wrong term from the start. Kanoj wrote a function
called classzone around 2.3.[345]? time, and so I inherited his 'classzone'
name in my emails. But what I did is not really a classzone design; I only
reconstructed the real _zone_ where GFP_KERNEL wants to allocate from.
In 2.3.99-pre6-5 the monolithic zone where GFP_KERNEL wants to allocate
from has been broken into two unrelated, disconnected pieces. I only
re-joined it into one piece since it's a _single_ zone. It's a more
problematic zone since it overlaps with the lower zones, and so we have to
update it also when we change the lower zones, but it's really a _single_
zone that shares the same properties as the other zones.

>In short, you've only convinced me _not_ to touch your patches, by showing
>that you haven't even thought about what the current setup really means.

Note that I started out saying "let's try to give the current code better
balance since it seems we're calling swap_out two more times than
necessary". And the first time I read the new alloc_pages code I thought
"cute idea to implement alloc_pages that way, we avoid the cost of having
to calculate the number of free pages in the classzone". _Then_ when I
tried to give the smarter code the expected right behaviour I noticed I
couldn't in lots of cases, and that's the only reason that caused me to
change the design of the zones.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/