linux-mm.kvack.org archive mirror
* Re: 2.3.x mem balancing
@ 2000-04-26 16:03 Mark_H_Johnson.RTS
  2000-04-26 17:06 ` Andrea Arcangeli
  2000-04-26 17:43 ` Kanoj Sarcar
  0 siblings, 2 replies; 24+ messages in thread
From: Mark_H_Johnson.RTS @ 2000-04-26 16:03 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, riel, torvalds


Some of what's been discussed here about NUMA has me concerned. You can't treat
a NUMA system the same as a regular shared memory system. Let me take a moment
to describe some of the issues I have with NUMA and see if this changes the way
you interpret what needs to be done with memory balancing. I'll let someone
else comment on the other issues.

NUMA - Non Uniform Memory Access - means what it says: access to memory is not
uniform. To the user of a system [not the kernel developer], NUMA works much
like cache memory. If the memory you access is "local" to where the processing
is taking place, the access is much faster than if the memory is "far away".
The difference in performance can be over 10:1 in terms of latency.

Let's use a specific shared memory vs. NUMA example to illustrate. Many years
ago, SGI produced the Challenge product line with a high speed backplane
connecting CPUs and shared memory (a traditional shared memory system). More
recently, SGI developed "cache coherent NUMA" as part of the Origin 2000 product
line. We have been considering the Origin platform and its successors as an
upgrade path for existing Challenge XL systems (24 CPUs, 2 GB shared memory).

To us, the main difference between a Challenge and an Origin is that the
Origin's performance range is much better than the Challenge's. However, access
to memory is equally fast across the entire memory range on the Challenge and
"non uniform" [faster & slower] on the Origin. Some reported numbers on the
Origin indicate maximum latencies of 200 nsec to 700 nsec on systems with 16 to
32 processors. More processors make the effect somewhat worse, with the
"absolute worst case" around 1 microsecond (1000 nsec). To me, these kinds of
numbers make the cost of a cache miss staggering when compared to the cycle
times of new processors.

Our basic concern with NUMA is that the structure of our application must be
changed to account for that latency. NUMA works best when you can put the data
and the processing in the same area. However, our current implementation for
exchanging information between processes is through a large shared memory area.
That area will only be "close" to a few processors - the rest will be accessing
it remotely. Yes, the connections are very fast, but I worry about the latency
[and resulting execution stalls] much more. To us, it means that we must arrange
to have the information sent across those fast interfaces before we expect to
need it at the destination. Those extra "memory copies" are something we didn't
have to worry about before. I see similar problems in the kernel.

In the context of "memory balancing" - all processors and all memory are NOT
equal in a NUMA system. To get the best performance from the hardware, you
prefer to put "all" of the memory for each process into a single memory unit -
then run that process from a processor "near" that memory unit. This seemingly
simple principle has a lot of problems behind it. What about...
 - shared read-only memory (e.g., libraries) [to clone or not?]
 - shared read/write memory [how to schedule work to be done when
   load >> "local capacity"]
 - when memory is low, which pages should I remove?
 - when I start a new job, even when there is lots of free memory, where
   should I load the job?
These are issues that need to be addressed if you expect to use this high-cost
hardware effectively. Please don't implement a solution for virtual memory that
does not have the ability to scale to solve the problems with NUMA. Thanks.
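
To make the placement idea above concrete, here is a minimal sketch of the kind
of policy I mean (purely illustrative - the node-distance table and helper
names are made up, not existing kernel interfaces):

    /* Pick a home node for a new job: prefer the node "nearest" to the CPU it
     * will run on that can hold the whole job.  nodes[] and node_distance[][]
     * are hypothetical. */
    #define MAX_NODES 8

    struct node_info {
            unsigned long free_pages;
    };

    extern struct node_info nodes[MAX_NODES];
    extern int node_distance[MAX_NODES][MAX_NODES];  /* relative latency */

    int pick_node_for_new_job(int cpu_node, unsigned long pages_needed)
    {
            int n, best = -1;

            for (n = 0; n < MAX_NODES; n++) {
                    if (nodes[n].free_pages < pages_needed)
                            continue;        /* job does not fit locally */
                    if (best < 0 ||
                        node_distance[cpu_node][n] < node_distance[cpu_node][best])
                            best = n;        /* prefer the closest fitting node */
            }
            return best;     /* -1 means no single node can hold the job */
    }

Even this toy policy immediately runs into the questions above - what counts as
"the job's memory" when pages are shared, and what to do when nothing fits -
which is exactly my point.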

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


Andrea Arcangeli <andrea@suse.de> wrote on 04/26/00 09:19 AM:

  To:      riel@nl.linux.org
  cc:      Linus Torvalds <torvalds@transmeta.com>, linux-mm@kvack.org
           (bcc: Mark H Johnson/RTS/Raytheon/US)
  Subject: Re: 2.3.x mem balancing



On Tue, 25 Apr 2000, Rik van Riel wrote:

>On Wed, 26 Apr 2000, Andrea Arcangeli wrote:
>> On Tue, 25 Apr 2000, Linus Torvalds wrote:
>>
>> >On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
>> >>
>> >> The design I'm using is in fact that each zone knows about each other; each
>> >> zone has a free_pages and a classzone_free_pages. The additional
>> >> classzone_free_pages gives us the information about the free pages in the
>> >> classzone and it's also inclusive of the free_pages of all the lower zones.
>> >
>> >AND WHAT ABOUT SETUPS WHERE THERE IS NO INCLUSION?
>>
>> They're simpler. The classzone for them matches with the zone.
>
>It doesn't. Think NUMA.

NUMA is irrelevant. If there's no inclusion the classzone matches with the
zone.
[snip]
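
(For concreteness, the "inclusive" accounting described above amounts to
something like the following sketch; the enum and array are stand-ins for the
real zone_t fields, not actual kernel code.)

    /* classzone free pages = the zone's own free pages plus those of every
     * lower zone.  Illustrative only. */
    enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

    unsigned long free_pages[NR_ZONES];     /* per-zone free page counts */

    unsigned long classzone_free_pages(int classzone)
    {
            unsigned long sum = 0;
            int z;

            for (z = ZONE_DMA; z <= classzone; z++)
                    sum += free_pages[z];   /* lower zones are included */
            return sum;
    }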





* Re: 2.3.x mem balancing
@ 2000-04-26 19:06 frankeh
  0 siblings, 0 replies; 24+ messages in thread
From: frankeh @ 2000-04-26 19:06 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Andrea Arcangeli, Mark_H_Johnson.RTS, linux-mm, riel,
	Linus Torvalds, pratap

Kanoj, this is the issue I raised earlier on the board, but didn't get a
reply...

Yes, one NUMA machine here at IBM Research consists of a 4-node cluster of
4-way Xeon boxes. When NUMA-d together, each memory controller simply relocates
its own node memory to a designated 1-GB range and forwards other requests to
the appropriate nodes while maintaining cache coherence.

This of course leads to the situation that only the first node will have
DMA memory, given the 1 GB kernel limitation.

I used to have a software solution to this, namely rewriting the __pa and
__va macros to do some remapping, which would allow each node to provide
some kernel-virtual DMA memory.
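
Roughly, the idea looks something like the following (a sketch only - the node
base table, window size and constants are placeholders, not values from that
old code):

    /* Node-aware __pa()/__va() replacements: each node contributes a fixed
     * window of memory that gets a slot in the kernel virtual space.  Only
     * addresses inside those windows are covered.  All values are placeholders. */
    #define NR_NODES        4
    #define NODE_SHIFT      26                      /* 64 MB window per node */
    #define NODE_MASK       ((1UL << NODE_SHIFT) - 1)
    #define PAGE_OFFSET     0xC0000000UL

    extern unsigned long node_phys_base[NR_NODES];  /* ascending window bases */

    static inline unsigned long node_pa(void *vaddr)
    {
            unsigned long off = (unsigned long)vaddr - PAGE_OFFSET;

            return node_phys_base[off >> NODE_SHIFT] + (off & NODE_MASK);
    }

    static inline void *node_va(unsigned long paddr)
    {
            int nid;

            /* find the node whose window contains paddr (bases are ascending) */
            for (nid = NR_NODES - 1; nid > 0; nid--)
                    if (paddr >= node_phys_base[nid])
                            break;
            return (void *)(PAGE_OFFSET + ((unsigned long)nid << NODE_SHIFT)
                            + (paddr - node_phys_base[nid]));
    }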

Now, how do you believe the architectures (particularly x86-based NUMA
systems) will evolve?

Now, with respect to some of the other messages regarding the zones:

With respect to NUMA allocation, I would still like to see happen what was
pointed out for IRIX, and which is for instance also available on NUMAQ/Dynix:
namely, resource classes.

A resource class would be a set of basic resources (CPUs and memory, i.e.
nodes) to which execution and allocation for user processes can be restricted.

(a) We have a full CPU affinity patch, driven by a system call interface
that restricts execution to a set of specified CPUs. Any takers?

(b) Kanoj and I made a first attempt (~2.3.48 timeframe) to restrict
allocation to certain nodes, but the swapping behavior never properly worked,
and with the constant changes under 2.3.99-preX, I put this on ice until the
VM becomes somewhat more stable.
    Again, I want to specify a set of nodes from which to allocate memory.
    Given a node set specification, I would like to treat the zones of the
same class (e.g. ZONE_HIGH) on all the specified nodes as a single target
class. Only if the allocator cannot allocate within that combined class on
the specified set of nodes should it descend into the next lower class.
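
Something along these lines (pure sketch, all names invented - not code from
the old patch):

    /* Try one zone class on every node in the caller's node set before
     * descending to the next lower class.  Illustrative only. */
    #define MAX_NODES 8

    enum zone_class { CLASS_HIGH, CLASS_NORMAL, CLASS_DMA, NR_CLASSES };

    struct page;
    struct page *try_alloc_zone(int node, int class, unsigned int order); /* hypothetical */

    struct page *alloc_from_node_set(unsigned long node_mask, unsigned int order)
    {
            int class, node;
            struct page *page;

            for (class = CLASS_HIGH; class < NR_CLASSES; class++) {
                    for (node = 0; node < MAX_NODES; node++) {
                            if (!(node_mask & (1UL << node)))
                                    continue;       /* node not in the resource class */
                            page = try_alloc_zone(node, class, order);
                            if (page)
                                    return page;
                    }
                    /* combined class exhausted on all listed nodes: descend */
            }
            return NULL;
    }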

   Open in this spec, of course, is what will be affected by the memory
specification: only user pages, or pages that go to memory-mapped files as
well?






kanoj@google.engr.sgi.com (Kanoj Sarcar) on 04/26/2000 01:36:48 PM

To:   andrea@suse.de (Andrea Arcangeli)
cc:   Mark_H_Johnson.RTS@raytheon.com, linux-mm@kvack.org,
      riel@nl.linux.org, torvalds@transmeta.com (Linus Torvalds)
Subject:  Re: 2.3.x mem balancing




>
> On NUMA hardware you have only one zone per node since nobody uses ISA-DMA
> on such machines and you have PCI64 or you can use the PCI-DMA sg for
> PCI32. So on NUMA hardware you are going to have only one zone per node
> (at least this was the setup of the NUMA machine I was playing with). So
> you don't mind at all about classzone/zone. Classzone and zone are the
> same thing in such a setup, they both are the plain ZONE_DMA zone_t.
> Finished. Said that you don't care anymore about the changes of how the
> overlapped zones are handled since you don't have overlapped zones in
> first place.

Andrea, are you talking about the SGI Origin platform, or are you
using some other NUMA platform? In any case, the SGI platform in fact
does not support ISA-DMA, but unfortunately, I don't think that just because
it has PCI mapping registers you can assume that all memory is DMA-able.
For us to be able to consider all memory as DMA-able, before each DMA
operation starts we need to have a pci-dma type hook to program the
mapping registers. As far as I know, such a hook is not used by all
drivers (in the 2.4 timeframe), so very unfortunately, I think we need
to keep the option open about each node having more than just ZONE_DMA.
Finally, I am not sure how things will work; we are still busy trying
to get the Origin/Linux port going.
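
(One concrete form of such a per-transfer hook is the pci_map_single()/
pci_unmap_single() interface; roughly, a converted driver does something like
the following - the device and buffer details are invented for the example.)

    #include <linux/pci.h>

    /* Map the buffer just before DMA starts, giving the platform a chance to
     * program its mapping registers, and unmap it when the transfer is done. */
    dma_addr_t start_dma(struct pci_dev *dev, void *buf, size_t len)
    {
            dma_addr_t bus = pci_map_single(dev, buf, len, PCI_DMA_TODEVICE);

            /* ... hand 'bus' to the device and kick off the transfer ... */
            return bus;
    }

    void finish_dma(struct pci_dev *dev, dma_addr_t bus, size_t len)
    {
            pci_unmap_single(dev, bus, len, PCI_DMA_TODEVICE);
    }

Only drivers converted to this style let the platform treat all memory as
DMA-able.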

FWIW, I think the IBM/Sequent NUMA machines in fact have nodes that
have only nondmaable memory.

>
> If you move the NUMA balancing and node selection into the higher layer
> as I was proposing, instead you can do clever things.
>

For an example and a (old) patch for this, look at

     http://oss.sgi.com/projects/numa/download/numa.gen.42b
     http://oss.sgi.com/projects/numa/download/numa.plat.42b

Kanoj


end of thread, other threads:[~2000-04-27 13:22 UTC | newest]

Thread overview: 24+ messages
2000-04-26 16:03 2.3.x mem balancing Mark_H_Johnson.RTS
2000-04-26 17:06 ` Andrea Arcangeli
2000-04-26 17:36   ` Kanoj Sarcar
2000-04-26 21:58     ` Andrea Arcangeli
2000-04-26 17:43 ` Kanoj Sarcar
  -- strict thread matches above, loose matches on Subject: below --
2000-04-26 19:06 frankeh
     [not found] <Pine.LNX.4.21.0004250401520.4898-100000@alpha.random>
2000-04-25 16:57 ` Linus Torvalds
2000-04-25 17:50   ` Rik van Riel
2000-04-25 18:11     ` Jeff Garzik
2000-04-25 18:33       ` Rik van Riel
2000-04-25 18:53     ` Linus Torvalds
2000-04-25 19:27       ` Rik van Riel
2000-04-26  0:26         ` Linus Torvalds
2000-04-26  1:19           ` Rik van Riel
2000-04-26  1:07   ` Andrea Arcangeli
2000-04-26  2:10     ` Rik van Riel
2000-04-26 11:24       ` Stephen C. Tweedie
2000-04-26 16:44         ` Linus Torvalds
2000-04-26 17:13           ` Rik van Riel
2000-04-26 17:24             ` Linus Torvalds
2000-04-27 13:22               ` Stephen C. Tweedie
2000-04-26 14:19       ` Andrea Arcangeli
2000-04-26 16:52         ` Linus Torvalds
2000-04-26 17:49           ` Andrea Arcangeli
