From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark_H_Johnson.RTS@raytheon.com
Message-ID: <852568CD.0057D4FC.00@raylex-gh01.eo.ray.com>
Date: Wed, 26 Apr 2000 11:03:58 -0500
Subject: Re: 2.3.x mem balancing
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Andrea Arcangeli
Cc: linux-mm@kvack.org, riel@nl.linux.org, torvalds@transmeta.com
List-ID:

Some of what's been discussed here about NUMA has me concerned. You can't
treat a NUMA system the same as a regular shared memory system. Let me take
a moment to describe some of the issues I have with NUMA and see if this
changes the way you interpret what needs to be done with memory
balancing.... I'll let someone else comment on the other issues.

NUMA - Non Uniform Memory Access - means what it says: access to memory is
not uniform. To the user of a system [not the kernel developer], NUMA
behaves much like cache memory. If the memory you access is "local" to
where the processing is taking place, the access is much faster than if the
memory is "far away". The difference in latency can be more than 10:1.

Let's use a specific shared memory vs. NUMA example to illustrate. Many
years ago, SGI produced the Challenge product line with a high speed
backplane connecting CPUs and shared memory (a traditional shared memory
system). More recently, SGI developed "cache coherent NUMA" as part of the
Origin 2000 product line. We have been considering the Origin platform and
its successors as an upgrade path for existing Challenge XL systems (24
CPUs, 2G shared memory). To us, the main difference is that the Origin's
performance range is much better than the Challenge's; however, access to
memory is equally fast across the entire memory range on the Challenge and
"non uniform" [faster and slower] on the Origin. Some reported numbers on
the Origin indicate a maximum latency of 200 nsec to 700 nsec on systems
with 16 to 32 processors. More processors make the effect somewhat worse,
with the "absolute worst case" around 1 microsecond (1000 nsec). To me,
these kinds of numbers make the cost of a cache miss staggering when
compared to the cycle times of new processors.

Our basic concern with NUMA is that the structure of our application must
be changed to account for that latency. NUMA works best when you can put
the data and the processing in the same area. However, our current
implementation for exchanging information between processes is a large
shared memory area. That area will only be "close" to a few processors -
the rest will access it remotely. Yes, the connections are very fast, but I
worry much more about the latency [and the resulting execution stalls]. To
us, it means that we must arrange to have the information sent across those
fast interfaces before we expect to need it at the destination. Those extra
"memory copies" are something we didn't have to worry about before.

I see similar problems in the kernel. In the context of "memory balancing",
all processors and all memory are NOT equal in a NUMA system. To get the
best performance from the hardware, you prefer to put "all" of the memory
for each process into a single memory unit - then run that process from a
processor "near" that memory unit.
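To make the "memory near the processor" idea concrete, here is a minimal
user-space sketch. It uses the libnuma API (numa_alloc_onnode(),
numa_run_on_node()), which postdates this thread; the node number and
buffer size are arbitrary. It only illustrates the placement principle, not
any kernel interface discussed here.

/*
 * Illustration only: place a process's working set on one memory unit
 * ("node") and run the process on a CPU near it.  Uses the later libnuma
 * user-space API, which did not exist when this thread was written;
 * node number and buffer size are arbitrary.  Build with -lnuma.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	int node = 0;                     /* the chosen memory unit */
	size_t len = 16UL * 1024 * 1024;  /* 16 MB working set */

	/* Allocate the working set from that node's local memory... */
	char *buf = numa_alloc_onnode(len, node);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}

	/* ...and run this process on a CPU near that memory. */
	if (numa_run_on_node(node) < 0)
		perror("numa_run_on_node");

	memset(buf, 0, len);              /* all touches stay node-local */

	numa_free(buf, len);
	return 0;
}

The kernel-side question raised here is the inverse of this sketch: the
allocator and scheduler have to make this kind of placement decision
automatically, without hints from the application.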
This seemingly simple principle has a lot of problems behind it. What
about...
  - shared read only memory (e.g., libraries) [to clone or not?]
  - shared read/write memory [how to schedule work to be done when
    load >> "local capacity"]
  - when memory is low, which pages should I remove?
  - when I start a new job, even when there is lots of free memory, where
    should I load the job?
These are issues that need to be addressed if you expect to use this high
cost hardware effectively. Please don't implement a solution for virtual
memory that cannot scale to solve the problems of NUMA.

Thanks.
  --Mark H Johnson

Andrea Arcangeli wrote on 04/26/00 09:19 AM:
  To: riel@nl.linux.org
  cc: Linus Torvalds, linux-mm@kvack.org
      (bcc: Mark H Johnson/RTS/Raytheon/US)
  Subject: Re: 2.3.x mem balancing

On Tue, 25 Apr 2000, Rik van Riel wrote:

>On Wed, 26 Apr 2000, Andrea Arcangeli wrote:
>> On Tue, 25 Apr 2000, Linus Torvalds wrote:
>>
>> >On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
>> >>
>> >> The design I'm using is in fact that each zone knows about each
>> >> other; each zone has a free_pages and a classzone_free_pages. The
>> >> additional classzone_free_pages gives us the information about the
>> >> free pages in the classzone and it's also inclusive of the
>> >> free_pages of all the lower zones.
>> >
>> >AND WHAT ABOUT SETUPS WHERE THERE IS NO INCLUSION?
>>
>> They're simpler. The classzone for them matches with the zone.
>
>It doesn't. Think NUMA.

NUMA is irrelevant. If there's no inclusion the classzone matches with the
zone.

[snip]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/