From mboxrd@z Thu Jan 1 00:00:00 1970
From: kanoj@google.engr.sgi.com (Kanoj Sarcar)
Message-Id: <200003202217.OAA94775@google.engr.sgi.com>
Subject: Re: More VM balancing issues..
Date: Mon, 20 Mar 2000 14:17:27 -0800 (PST)
In-Reply-To: from "Linus Torvalds" at Mar 20, 2000 01:27:46 PM
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Linus Torvalds
Cc: linux-mm@kvack.org, Ben LaHaise, Christopher Zimmerman, Stephen Tweedie
List-ID:

>
> They happen to be inclusive on x86 (ie DMA <= direct-mapped <=
> everything), but I think it is a mistake to consider that a design. It's
> obviously not true on NUMA if you have per-CPU classes that fall back onto
> other CPU's zones. I would imagine, for example, that on NUMA the best
> arrangement would be something like
>
>  - when making a NODE1 allocation, the "class" list is
>
>	NODE1, NODE2, NODE3, NODE4, NULL
>
>  - when making a NODE2 allocation it would be
>
>	NODE2, NODE3, NODE4, NODE1, NULL
>
>  etc...
>
> (So each node would preferentially always allocate from its own zone, but
> would fall back on other nodes memory if the local zone fills up).

Okay, I think the crux of this discussion lies in this statement. I do not
believe this is what the numa code will do, but note that we are not
100% certain at this stage. The numa code will be layered on top of the
generic code (the primary goal being that generic code should be impacted
by numa minimally), so, for example, the numa version of alloc_pages() will
invoke __alloc_pages() on different nodes. The other thing to note is that
the sequence of nodes to allocate from is not static, but dynamic (depending
on other data structures that the numa code will track). This gives the numa
code the most flexibility to do the best thing performance-wise for a wide
variety of apps under different situations.
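For reference, the static arrangement in the quoted example can be sketched as below. This is only a minimal illustration of that fixed rotation, not of the numa code, whose node sequence is expected to be chosen dynamically. The names NR_NODES and build_fallback_order() are hypothetical, and -1 stands in for the NULL terminator of the quoted lists.

```c
#include <assert.h>

#define NR_NODES 4	/* assumption: four nodes, as in the quoted example */

/*
 * Fill order[] with the static fallback sequence for an allocation made
 * on node `home` (0-based): the home node first, then the remaining nodes
 * in round-robin order. order[NR_NODES] is set to -1 as the terminator.
 * Returns the number of node entries written.
 */
static int build_fallback_order(int home, int order[NR_NODES + 1])
{
	int i;

	assert(home >= 0 && home < NR_NODES);

	for (i = 0; i < NR_NODES; i++)
		order[i] = (home + i) % NR_NODES;
	order[NR_NODES] = -1;

	return NR_NODES;
}
```

Calling build_fallback_order(1, order) yields 1, 2, 3, 0, -1, which corresponds to the quoted "NODE2, NODE3, NODE4, NODE1, NULL" list once the nodes are numbered from 1.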
So a priori, you cannot claim the class list for NODE1 allocation will be
"NODE1, NODE2, NODE3, NODE4, NULL".

I am ccing Hubertus Franke from IBM; we have been working on numa issues
together.

>
> With something like the above, there is no longer any true inclusion. Each
> class covers an "equal" amount of zones, but has a different structure.
>

The only example I can think of is a hole architecture, as I mentioned
before, but even that can be handled with a "true inclusion" assumption.
Unless you can point to a processor/architecture to the contrary, for the
2.4 timeframe, I would think we can assume true inclusion. (And that will
be true even if we come up with a ZONE_PCI32 for 64bit machines).

> > 2. The body of zone_balance_memory() should be replaced with the pre1
> > code, otherwise there are too many differences/problems to enumerate.
> > Unless you are also proposing changes in this area.
>
> The pre1 code was broken, and never checked pages_low. The changes were
> definitely pre-meditated - trying to think of the balancing as a "list of
> zones" issue.

Agreed, I pointed out the breakage when the balancing patch was sent out.
I patched the pre1 code to get back to 2.3.50 behavior, and Christopher
Zimmerman (zim@av.com) tested it out.

>
> And I think it's fine that kswapd continues to run until we reach "high".
> Your patch makes kswapd stop when it reaches "low", but that makes kswapd
> go into this kind of "start/stop/start/stop" behaviour at around the "low"
> watermark.
>
> Maybe you meant to clear the flags the other way around: keep kswapd
> running until it hits high, but remove the "low_on_mem" flag when we are
> above "low" (but we've gotten away from "min"). That might work, but I
> think clearing both flags at "high" is actually the right thing to do,
> because that way we will not get into a state where kswapd runs all the
> time because somebody is still allocating pages without helping to free
> anything up.
>

Okay, that is a change on top of 2.3.50 behavior; this can be easily
implemented. As I mention in Documentation/vm/balance, low_on_memory is a
hysteresis flag, while zone_wake_kswapd/kswapd poking is not; we can change
that. Do you want me to create a new patch against 2.3.99-pre2?

Kanoj

> The pre-3 behaviour is: if you ever hit "min", you set a flag that means
> "ok, kswapd can't do this on its own, and needs some help from the people
> that allocate memory all the time". If you think of it that way, I think
> you'll agree that it shouldn't be cleared until after kswapd says
> everything is ok again.
>
> I don't know..
>
>		Linus
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
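The watermark behaviour argued for in the quoted text (wake kswapd at "low", ask allocators for help at "min", clear both flags only at "high") can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the kernel's actual code: struct zone_sketch and zone_update_flags() are hypothetical names, and the exact comparisons (<= versus <) are guesses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-zone state in this thread. */
struct zone_sketch {
	unsigned long free_pages;
	unsigned long pages_min, pages_low, pages_high;	/* min < low < high */
	bool wake_kswapd;	/* kswapd should keep running for this zone */
	bool low_on_memory;	/* allocators must help free pages */
};

/* Re-evaluate both flags after free_pages has changed. */
static void zone_update_flags(struct zone_sketch *z)
{
	assert(z->pages_min < z->pages_low && z->pages_low < z->pages_high);

	if (z->free_pages >= z->pages_high) {
		/* Everything is ok again: clear both flags together. */
		z->wake_kswapd = false;
		z->low_on_memory = false;
		return;
	}
	if (z->free_pages <= z->pages_low)
		z->wake_kswapd = true;
	if (z->free_pages <= z->pages_min)
		z->low_on_memory = true;
	/* Between thresholds the flags keep their old value (hysteresis). */
}
```

With both flags cleared only at "high", a zone that has dipped below "min" keeps low_on_memory set while it climbs back through "low", which avoids the start/stop behaviour around the "low" watermark described in the quoted text.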