From mboxrd@z Thu Jan 1 00:00:00 1970
From: frankeh@us.ibm.com
Message-ID: <852568AA.0057CE24.00@D51MTA07.pok.ibm.com>
Date: Wed, 22 Mar 2000 10:56:15 -0500
Subject: Re: More VM balancing issues.. (fwd)
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
Return-Path: 
To: Kanoj Sarcar
Cc: torvalds@transmeta.com, bcrl@redhat.com, zim@av.com, sct@redhat.com, linux-mm@kvack.org
List-ID: 

Let me add my 2 cents here, since I have been looking into this together
with Kanoj.  I have been tracking the 2.3.30++ kernel and have tried to
make sense of some of the developments.  Let me state first that we have
a 2.3.36 kernel running on a true NUMA machine: four 4-way Xeon SMP nodes
interconnected via a NUMA switch, each node having 960MB of main memory
located at full GB boundaries.

> From owner-linux-mm@kvack.org Mon Mar 20 14:20:04 2000
> From: kanoj (Kanoj Sarcar)
> Message-Id: <200003202217.OAA94775@google.engr.sgi.com>
> Subject: Re: More VM balancing issues..
> To: torvalds@transmeta.com (Linus Torvalds)
> Date: Mon, 20 Mar 2000 14:17:27 -0800 (PST)
> Cc: linux-mm@kvack.org, bcrl@redhat.com (Ben LaHaise),
>         zim@av.com (Christopher Zimmerman), sct@redhat.com (Stephen Tweedie)
> In-Reply-To: from "Linus Torvalds" at Mar 20, 2000 01:27:46 PM
> X-Mailer: ELM [version 2.5 PL2]
> MIME-Version: 1.0
> Content-Type: text/plain; charset=us-ascii
> Content-Transfer-Encoding: 7bit
> X-Orcpt: rfc822;linux-mm@kvack.org
> Sender: owner-linux-mm@kvack.org
> Precedence: bulk
> X-Loop: majordomo@kvack.org
>
> >
> > They happen to be inclusive on x86 (ie DMA <= direct-mapped <=
> > everything), but I think it is a mistake to consider that a design. It's
> > obviously not true on NUMA if you have per-CPU classes that fall back onto
> > other CPU's zones. I would imagine, for example, that on NUMA the best
> > arrangement would be something like
> >
> >  - when making a NODE1 allocation, the "class" list is
> >
> >	NODE1, NODE2, NODE3, NODE4, NULL
> >
> >  - when making a NODE2 allocation it would be
> >
> >	NODE2, NODE3, NODE4, NODE1, NULL
> >
> > etc...
> >
> > (So each node would preferentially always allocate from its own zone, but
> > would fall back on other nodes memory if the local zone fills up).
>
> Okay, I think the crux of this discussion lies in this statement. I do
> not believe this is what the numa code will do, but note that we are
> not 100% certain at this stage. The numa code will be layered on top
> of the generic code, (the primary goal being generic code should be
> impacted by numa minimally), so for example, the numa version of
> alloc_pages() will invoke __alloc_pages() on different nodes. The
> other thing to note is, the sequence of nodes to allocate is not
> static, but dynamic (depending on other data structures that numa
> code will track). This gives the most flexibility to numa code to
> do the best thing performance wise for a wide variety of apps
> under different situations. So apriori, you can not claim the class
> list for NODE1 allocation will be "NODE1, NODE2, NODE3, NODE4, NULL".
> I am ccing Hubertus Franke from IBM, we have been working on numa
> issues together.

It has been shown that locality in a NUMA machine can greatly improve
performance.  We see a need for flexible policies governing memory
allocation and node affinity.  What we would like to see, and argue for,
is something like the resource sets used on NUMA-Q and similar systems:
a process specifies a preferred set of nodes (and maybe a fallback set
of nodes) on which to (a) allocate memory and (b) execute.  This
affinity-based memory allocation can be layered on top of the current
__alloc_pages() framework; a rough sketch follows.
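To make the resource-set idea a bit more concrete, here is a minimal
sketch of the allocation side.  This is only an illustration, not a
patch: struct node_set, numa_alloc_pages() and alloc_pages_on_node()
are invented names, and alloc_pages_on_node() merely stands in for
whatever per-node entry point would ultimately call __alloc_pages()
with that node's zone list.

/*
 * Illustrative sketch only -- not kernel code.  All names here are
 * invented; alloc_pages_on_node() is a stand-in for "__alloc_pages()
 * against a particular node's zone list".
 */
#include <stddef.h>

#define MAX_NUMNODES	4

struct page;					/* opaque here */

struct node_set {
	int preferred[MAX_NUMNODES + 1];	/* try these first, -1 terminated */
	int fallback[MAX_NUMNODES + 1];		/* then these, -1 terminated */
};

/* stand-in for a per-node allocator built on __alloc_pages() */
extern struct page *alloc_pages_on_node(int nid, int gfp_mask,
					unsigned long order);

struct page *numa_alloc_pages(const struct node_set *set,
			      int gfp_mask, unsigned long order)
{
	struct page *page;
	const int *nid;

	/* first pass: the nodes the process asked for */
	for (nid = set->preferred; *nid >= 0; nid++) {
		page = alloc_pages_on_node(*nid, gfp_mask, order);
		if (page)
			return page;
	}

	/* second pass: the fallback set, whose ordering the NUMA layer
	 * may reshuffle dynamically for balancing */
	for (nid = set->fallback; *nid >= 0; nid++) {
		page = alloc_pages_on_node(*nid, gfp_mask, order);
		if (page)
			return page;
	}

	return NULL;	/* caller falls into the usual reclaim path */
}

The point is simply that the per-process policy (which nodes, in what
order) stays above __alloc_pages(), so the generic code is not impacted.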
Regarding fallback zones and inclusion: it is not clear to us whether
allocation within a set of nodes should go horizontally or vertically
first.  That is, if we ask for a HIGHMEM page, should one first allocate
from the HIGHMEM zones of all specified nodes (horizontal), or should
one first fall back to the lower zones of the current node before moving
on to other nodes (vertical)?  What might weigh into the discussion is
that HIGHMEM pages have overhead associated with them (copies etc.).
We adopted the horizontal approach.

We have encountered various problems (i.e. strange behavior) with the
current scheme: paging either sets in too early or too late; it never
seems quite right.  We have seen (2.3.36 - 2.3.51) that apps get killed
although there is plenty of swap space available, which we attribute to
running out of the low memory pages that are necessary to make forward
progress.  We have also seen that simply running __alloc_pages() in a
fixed order results in paging requests without the file caches actually
being shrunk.  For each potential set of nodes (2**N) we order the
allocation priority dynamically to provide some memory balancing within
each node-set.

> >
> > With something like the above, there is no longer any true inclusion. Each
> > class covers an "equal" amount of zones, but has a different structure.
> >
>
> The only example I can think of is a hole architecture, as I mentioned
> before, but even that can be handled with a "true inclusion" assumption.
>
> Unless you can point to a processor/architecture to the contrary, for
> the 2.4 timeframe, I would think we can assume true inclusion. (And that
> will be true even if we come up with a ZONE_PCI32 for 64bit machines).
>

Well, on our architecture, due to the IA32 memory layout, all nodes but
NODE-0 provide only HIGHMEM (> 1GB).  So inclusion doesn't exist, and
there is no equal number of zones per node.  For NUMA it would be
advantageous to actually have HIGHMEM and NORMAL memory on each node.
For that it would be sufficient to redefine the __va and __pa macros to
provide a real non-linear translation (I did that once in the 2.2.7
kernel, using 4MB memory segments, and it seemed to work).
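Roughly along those lines, here is a sketch of what such a non-linear
translation could look like.  This is not the 2.2.7 patch itself: the
table names (vseg_to_pseg, pseg_to_vseg) and the helpers are invented
for illustration, and the tables are assumed to be filled in at boot
from the firmware memory map.

/*
 * Sketch of a non-linear __va/__pa over 4MB segments -- illustrative
 * only.  vseg_to_pseg[]/pseg_to_vseg[] are invented names for
 * boot-time lookup tables.
 */
#ifndef PAGE_OFFSET
#define PAGE_OFFSET	0xC0000000UL	/* ia32 default, for illustration */
#endif

#define SEG_SHIFT	22		/* 4MB segments */
#define SEG_SIZE	(1UL << SEG_SHIFT)
#define SEG_MASK	(SEG_SIZE - 1)
#define MAX_SEGS	1024		/* 4GB / 4MB */

extern unsigned long vseg_to_pseg[MAX_SEGS];	/* kernel-virtual seg -> physical seg */
extern unsigned long pseg_to_vseg[MAX_SEGS];	/* physical seg -> kernel-virtual seg */

static inline unsigned long nonlinear_pa(void *vaddr)
{
	unsigned long off = (unsigned long)vaddr - PAGE_OFFSET;

	return (vseg_to_pseg[off >> SEG_SHIFT] << SEG_SHIFT) | (off & SEG_MASK);
}

static inline void *nonlinear_va(unsigned long paddr)
{
	unsigned long vseg = pseg_to_vseg[paddr >> SEG_SHIFT];

	return (void *)(PAGE_OFFSET + (vseg << SEG_SHIFT) + (paddr & SEG_MASK));
}

The idea would be that each node contributes a slice of the kernel's
directly mapped virtual window, so a NORMAL zone could be carved out per
node even though the physical memory of nodes 1-3 sits entirely above
the 1GB line.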
> > > 2. The body of zone_balance_memory() should be replaced with the pre1
> > > code, otherwise there are too many differences/problems to enumerate.
> > > Unless you are also proposing changes in this area.
> >
> > The pre1 code was broken, and never checked pages_low. The changes were
> > definitely pre-meditated - trying to think of the balancing as a "list of
> > zones" issue.
>
> Agreed, I pointed out the breakage when the balancing patch was sent out.
> I patched the pre1 code to get back to 2.3.50 behavior, and Christopher
> Zimmerman zim@av.com tested it out.
>
> >
> > And I think it's fine that kswapd continues to run until we reach "high".
> > Your patch makes kswapd stop when it reaches "low", but that makes kswapd
> > go into this kind of "start/stop/start/stop" behaviour at around the "low"
> > watermark.
> >
> > Maybe you meant to clear the flags the other way around: keep kswapd
> > running until it hits high, but remove the "low_on_mem" flag when we are
> > above "low" (but we've gotten away from "min"). That might work, but I
> > think clearing both flags at "high" is actually the right thing to do,
> > because that way we will not get into a state where kswapd runs all the
> > time because somebody is still allocating pages without helping to free
> > anything up.
> >
>
> Okay, that is a change on top of 2.3.50 behavior, this can be easily
> implemented. As I mention in Documentation/vm/balance, low_on_memory
> is a hysteric flag, zone_wake_kswapd/kswapd poking is not, we can
> change that. Do you want me to create a new patch against 2.3.99-pre2?
>
> Kanoj
>
> > The pre-3 behaviour is: if you ever hit "min", you set a flag that means
> > "ok, kswapd can't do this on its own, and needs some help from the people
> > that allocate memory all the time". If you think of it that way, I think
> > you'll agree that it shouldn't be cleared until after kswapd says
> > everything is ok again.
> >
> > I don't know..
> >
> > 	Linus
> >

Another issue that I'd like to bring up in this context is that of
hot-swap memory support.  With NUMA systems we should expect
partitionability, at least at the switch level.  We are working on
dynamic node migration in the context of such systems; this is a
feature that pretty much all high-end servers now provide.  For that,
the kernel must be ready to accept new resources and to release a set
of resources.  The NUMA discussions here seem to be at the heart of
this issue.

Kanoj and I have come up with some means to allow higher-level policy
implementation, as described above, without impacting the base kernel
code.  We also have a patch based on this that allows one to define a
NUMA machine, from a memory point of view, on a single SMP, which makes
it possible to explore some of the issues mentioned above for those who
don't have a NUMA machine available.  If there is some interest, I can
make this available.

--
Hubertus Franke
IBM T.J.Watson Research Center
frankeh@us.ibm.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/