Re: 2.3.x mem balancing - Mark_H

From: Mark_H_Johnson.RTS@raytheon.com
To: Andrea Arcangeli <andrea@suse.de>
Cc: linux-mm@kvack.org, riel@nl.linux.org, torvalds@transmeta.com
Subject: Re: 2.3.x mem balancing
Date: Wed, 26 Apr 2000 11:03:58 -0500
Message-ID: <852568CD.0057D4FC.00@raylex-gh01.eo.ray.com>

Some of what's been discussed here about NUMA has me concerned. You can't treat
a NUMA system the same as a regular shared memory system. Let me take a
moment to describe some of the issues I have with NUMA and see if this changes
the way you interpret what needs to be done with memory balancing. I'll let
someone else comment on the other issues.

NUMA - Non Uniform Memory Access - means what it says: access to memory is not
uniform. To the user of a system [not the kernel developer], NUMA works much
like cache memory. If the memory you access is "local" to where the processing
is taking place, the access is much faster than if the memory is "far away". The
difference in latency can be over 10:1.

Let's use a specific shared memory vs. NUMA example to illustrate. Many years
ago, SGI produced the Challenge product line with a high speed backplane
connecting CPUs and shared memory (a traditional shared memory system). More
recently, SGI developed "cache coherent NUMA" as part of the Origin 2000 product
line. We have been considering the Origin platform and its successors as an
upgrade path for existing Challenge XL systems (24 CPUs, 2G shared memory).

To us, the main difference between a Challenge and an Origin is that the
Origin's performance range is much better than the Challenge's. However, access
to memory is equally fast across the entire memory range on the Challenge and
"non uniform" [faster & slower] on the Origin. Some reported numbers for the
Origin indicate a maximum latency of 200 nsec to 700 nsec on systems with 16 to
32 processors. More processors make the effect somewhat worse, with the
"absolute worst case" around 1 microsecond (1000 nsec). To me, these kinds of
numbers make the cost of a cache miss staggering when compared to the cycle
times of new processors.
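
To put those latencies in terms of processor cycles, here is a minimal sketch.
The latencies are the ones quoted above; the 500 MHz clock rate is an assumption
picked only for illustration, not a figure for any particular Origin
configuration, and the function name is hypothetical.

```python
# Convert a memory access latency into the number of processor cycles
# spent waiting for it.
# ASSUMPTION: the 500 MHz clock rate below is illustrative only.

def stall_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Cycles a processor running at clock_mhz waits during latency_ns."""
    cycle_ns = 1000.0 / clock_mhz  # duration of one cycle in nanoseconds
    return latency_ns / cycle_ns

ASSUMED_CLOCK_MHZ = 500.0

# 200 and 700 nsec are the reported range, 1000 nsec the "absolute worst case".
for latency_ns in (200, 700, 1000):
    cycles = stall_cycles(latency_ns, ASSUMED_CLOCK_MHZ)
    print(f"{latency_ns:5d} nsec -> {cycles:.0f} cycles at {ASSUMED_CLOCK_MHZ:.0f} MHz")
```

Under that assumed clock, every remote miss costs hundreds of cycles, and a
faster clock only makes the ratio worse.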

Our basic concern with NUMA is that the structure of our application must be
changed to account for that latency. NUMA works best when you can put the data
and the processing in the same area. However, our current implementation for
exchanging information between processes is through a large shared memory area.
That area will only be "close" to a few processors - the rest will be accessing
it remotely. Yes, the connections are very fast, but I worry about the latency
[and resulting execution stalls] much more. To us, it means that we must arrange
to have the information sent across those fast interfaces before we expect to
need it at the destination. Those extra "memory copies" are something we didn't
have to worry about before. I see similar problems in the kernel.

In the context of "memory balancing" - all processors and all memory are NOT
equal in a NUMA system. To get the best performance from the hardware, you
prefer to put "all" of the memory for each process into a single memory unit -
then run that process from a processor "near" that memory unit. This seemingly
simple principle has a lot of problems behind it. What about...
 - shared read only memory (e.g., libraries) [to clone or not?]
 - shared read/write memory [how to schedule work to be done when load >> "local
   capacity"]
 - when memory is low, which pages should I remove?
 - when I start a new job, even when there is lots of free memory, where should
   I load the job?
These are issues that need to be addressed if you expect to use this high cost
hardware effectively. Please don't implement a solution for virtual memory that
does not have the ability to scale to solve the problems with NUMA. Thanks.
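
As a rough illustration of the "where should I load the job?" question, here is
a minimal sketch of one possible placement policy: pick a single node that can
hold all of the job's memory and has an idle processor next to it. The Node
fields, the pick_node name and the tie-breaking rule are all assumptions made
for the example; this is not a description of any existing kernel code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """Hypothetical view of one NUMA memory unit and its nearby CPUs."""
    node_id: int
    free_pages: int
    idle_cpus: int

def pick_node(nodes: List[Node], pages_needed: int) -> Optional[Node]:
    """Return a node that can hold the whole job locally, or None.

    ASSUMPTION: among nodes that fit, prefer the one with the most free
    pages. Returning None means no single node fits, so the caller has to
    split the job across nodes or reclaim memory first.
    """
    candidates = [
        n for n in nodes
        if n.free_pages >= pages_needed and n.idle_cpus > 0
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda n: n.free_pages)

# Node 1 has the most free memory but no idle processor near it,
# so a 200 page job goes to node 2.
nodes = [Node(0, 100, 1), Node(1, 500, 0), Node(2, 300, 2)]
chosen = pick_node(nodes, 200)
print(chosen.node_id if chosen else "no single node fits")
```

Note that plenty of free memory system-wide does not help here: a 1000 page job
fits in the 900 free pages of no single node above, which is exactly the case a
global free page count hides.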

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>

From:    Andrea Arcangeli <andrea@suse.de>
Date:    04/26/00 09:19 AM
To:      riel@nl.linux.org
cc:      Linus Torvalds <torvalds@transmeta.com>, linux-mm@kvack.org,
         (bcc: Mark H Johnson/RTS/Raytheon/US)
Subject: Re: 2.3.x mem balancing

On Tue, 25 Apr 2000, Rik van Riel wrote:

>On Wed, 26 Apr 2000, Andrea Arcangeli wrote:
>> On Tue, 25 Apr 2000, Linus Torvalds wrote:
>>
>> >On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
>> >>
>> >> The design I'm using is in fact that each zone knows about each other, each
>> >> zone has a free_pages and a classzone_free_pages. The additional
>> >> classzone_free_pages gives us the information about the free pages on the
>> >> classzone and it's also inclusive of the free_pages of all the lower zones.
>> >
>> >AND WHAT ABOUT SETUPS WHERE THERE IS NO INCLUSION?
>>
>> They're simpler. The classzone for them matches with the zone.
>
>It doesn't. Think NUMA.

NUMA is irrelevant. If there's no inclusion the classzone matches with the
zone.
[snip]
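
For readers following the quoted exchange, the classzone accounting it describes
can be sketched as a running total over zones ordered from lowest to highest.
Only free_pages and classzone_free_pages come from the quoted text; the function
signature, the inclusive flag and the sample page counts are assumptions made
for this example, not the code under discussion.

```python
from typing import List

def classzone_free_pages(free_pages: List[int], inclusive: bool = True) -> List[int]:
    """Per-zone classzone totals, with zones ordered lowest first.

    With inclusion, each classzone counts its own free pages plus those of
    every lower zone. Without inclusion, the classzone matches the zone.
    """
    if not inclusive:
        return list(free_pages)
    totals = []
    running = 0
    for pages in free_pages:
        running += pages
        totals.append(running)
    return totals

# Hypothetical free page counts for three zones, lowest zone first.
zone_free = [10, 200, 50]
print(classzone_free_pages(zone_free))                   # inclusive zones
print(classzone_free_pages(zone_free, inclusive=False))  # no inclusion
```

What this sketch does not capture is the NUMA concern raised above: a total that
is correct for one node's zones says nothing about how far away those pages are
from the processor asking for them.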

