Re: 2.3.x mem balancing

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: 2.3.x mem balancing
@ 2000-04-26 16:03 Mark_H_Johnson.RTS
  2000-04-26 17:06 ` Andrea Arcangeli
  2000-04-26 17:43 ` Kanoj Sarcar
  0 siblings, 2 replies; 24+ messages in thread
From: Mark_H_Johnson.RTS @ 2000-04-26 16:03 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, riel, torvalds

Some of what's been discussed here about NUMA has me concerned. You can't treat
a system with NUMA the same as a regular shared memory system. Let me take a
moment to describe some of the issues I have w/ NUMA & see if this changes the
way you interpret what needs to be done with memory balancing.... I'll let
someone else comment on the other issues.

NUMA - Non Uniform Memory Access means what it says - access to memory is not
uniform. To the user of a system [not the kernel developer], NUMA works similar
to cache memory. If the memory you access is "local" to where the processing is
taking place, the access is much faster than if the memory is "far away". The
difference in performance can be over 10:1 in terms of latency.

Let's use a specific shared memory vs. NUMA example to illustrate. Many years
ago, SGI produced the Challenge product line with a high speed backplane
connecting CPU's and shared memory (a traditional shared memory system). More
recently, SGI developed "cache coherent NUMA" as part of the Origin 2000 product
line. We have been considering the Origin platform and its successors as an
upgrade path for existing Challenge XL systems (24 CPU's, 2G shared memory).

To us, the main difference between a Challenge and Origin is that the Origin
performance range is much better than on the Challenge.  However, access to the
memory is equally fast across the entire memory range on the Challenge and "non
uniform" [faster & slower] on the Origin. Some reported numbers on the Origin
indicate a maximum latency of 200 nsec to 700 nsec with systems with 16 to 32
processors. More processors makes the effect somewhat worse with the "absolute
worst case" around 1 microsecond (1000 nsec). To me, these kind of numbers make
the cost of a cache miss staggering when compared to the cycle times of new
processors.

Our concern with NUMA basically is that the structure of our application must be
changed to account for that latency. NUMA works best when you can put the data
and the processing in the same area. However, our current implementation for
exchanging information between processes is through a large shared memory area.
That area will only be "close" to a few processors - the rest will be accessing
it remotely. Yes, the connections are very fast, but I worry about the latency
[and resulting execution stalls] much more. To us, it means that we must arrange
to have the information sent across those fast interfaces before we expect to
need it at the destination. Those extra "memory copies" are something we didn't
have to worry about before. I see similar problems in the kernel.

In the context of "memory balancing" - all processors and all memory is NOT
equal in a NUMA system. To get the best performance from the hardware, you
prefer to put "all" of the memory for each process into a single memory unit -
then run that process from a processor "near" that memory unit. This seemingly
simple principle has a lot of problems behind it. What about...
 - shared read only memory (e.g., libraries) [to clone or not?]
 - shared read/write memory [how to schedule work to be done when load >> "local
capacity"]
 - when memory is low, which pages should I remove?
 - when I start a new job, even when there is lots of free memory, where should
I load the job?
These are issues that need to be addressed if you expect to use this high cost
hardware effectively. Please don't implement a solution for virtual memory that
does not have the ability to scale to solve the problems with NUMA. Thanks.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>

|--------+----------------------->
|        |          Andrea       |
|        |          Arcangeli    |
|        |          <andrea@suse.|
|        |          de>          |
|        |                       |
|        |          04/26/00     |
|        |          09:19 AM     |
|        |                       |
|--------+----------------------->
  >----------------------------------------------------------------------------|
  |                                                                            |
  |       To:     riel@nl.linux.org                                            |
  |       cc:     Linus Torvalds <torvalds@transmeta.com>, linux-mm@kvack.org, |
  |       (bcc: Mark H Johnson/RTS/Raytheon/US)                                |
  |       Subject:     Re: 2.3.x mem balancing                                 |
  >----------------------------------------------------------------------------|

On Tue, 25 Apr 2000, Rik van Riel wrote:

>On Wed, 26 Apr 2000, Andrea Arcangeli wrote:
>> On Tue, 25 Apr 2000, Linus Torvalds wrote:
>>
>> >On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
>> >>
>> >> The design I'm using is infact that each zone know about each other, each
>> >> zone have a free_pages and a classzone_free_pages. The additional
>> >> classzone_free_pages gives us the information about the free pages on the
>> >> classzone and it's also inclusve of the free_pages of all the lower zones.
>> >
>> >AND WHAT ABOUT SETUPS WHERE THERE ISNO INCLUSION?
>>
>> They're simpler. The classzone for them matches with the zone.
>
>It doesn't. Think NUMA.

NUMA is irrelevant. If there's no inclusion the classzone matches with the
zone.
[snip]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 16:03 2.3.x mem balancing Mark_H_Johnson.RTS
@ 2000-04-26 17:06 ` Andrea Arcangeli
  2000-04-26 17:36   ` Kanoj Sarcar
  2000-04-26 17:43 ` Kanoj Sarcar
  1 sibling, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2000-04-26 17:06 UTC (permalink / raw)
  To: Mark_H_Johnson.RTS; +Cc: linux-mm, riel, Linus Torvalds

On Wed, 26 Apr 2000 Mark_H_Johnson.RTS@raytheon.com wrote:

>In the context of "memory balancing" - all processors and all memory is NOT
>equal in a NUMA system. To get the best performance from the hardware, you
>prefer to put "all" of the memory for each process into a single memory unit -
>then run that process from a processor "near" that memory unit. This seemingly

The classzone approch (aka overlapped zones approch) is irrelevant with
NUMA problematics as far I can tell.

NUMA is a problematic that belongs outside the pg_data_t. It doesn't
matter how we restructure the internal of the zone_t.

I only changed the internal structure of one node. Not at all how to
policy the allocations and the balance between different nodes (that
decisions have to live in the linux/arch/ tree and not in __alloc_pages).

On NUMA hardware you have only one zone per node since nobody uses ISA-DMA
on such machines and you have PCI64 or you can use the PCI-DMA sg for
PCI32. So on NUMA hardware you are going to have only one zone per node
(at least this was the setup of the NUMA machine I was playing with). So
you don't mind at all about classzone/zone. Classzone and zone are the
same thing in such a setup, they both are the plain ZONE_DMA zone_t.
Finished. Said that you don't care anymore about the changes of how the
overlapped zones are handled since you don't have overlapped zones in
first place.

Now on NUMA when you want to allocate memory you have to use
alloc_pages_node so that you can tell also which node allocate from.

Here Linus was proposing of making alloc_pages_node this way:

	alloc_pages_node(nid, gfpmask, order)
	{
		zonelist_t ** zonelist = nid2zonelist(nid, gfpmask);

		__alloc_pages(zonelist, order);
	}

and then having the automatic falling back between nodes and numa memory
balancing handled by __alloc_pages and by the current 2.3.99-pre6-5
zonelist falling back trick.

I care to explain why I think that's not the right approch for handling
NUMA allocations and balancing decisions.

As first it's clear that the above described NUMA approch is abusing
zonelist_t by looking the size of the zonelist_t structure:

	typedef struct zonelist_struct {
		zone_t * zones [MAX_NR_ZONES+1]; // NULL delimited
		int gfp_mask;
	} zonelist_t;

If zonelist was designed for NUMA it should be something like:

	typedef struct zonelist_struct {
		zone_t * zones [max(MAX_NR_ZONES*MAX_NR_NODES)+1]; // NULL delimited
		int gfp_mask;
	} zonelist_t;

however we can fix that easily by enlarging the zones array in the
zonelist.

and as second the zonelist-NUMA solution isn't enough flexible since if
there's lots of cache allocate in one node we may prefer to move or shrink
the cache than to allocate mapped areas of the same task in different
nodes (as the __alloc_pages would do).

With the zonelist_t abused to do NUMA we _don't_ have flexibility.

If you move the NUMA balancing and node selection into the higher layer
as I was proposing, instead you can do clever things.

And as soon as you move the decisions at the higher layer you don't mind
anymore about the node internals. Or better you only care to be able to
find the current life-state of a node and you of course can do that. Then
once you know the state of the interesting node you can do the decision of
what to do _still_ at the highlever layer.

At the highlevel layer you can see that the node is filled with 90% of
cache, and then you can say: ok allocate from this node anyway and let it 
to shrink some cache if necessary.

Then the lower layer (__alloc_pages) will do automatically the balancing
and it will try to allocate memory from such node. You can also grab the
per-node spinlock in the highlever layer before checking the state of the
node so that you'll know the state of the node won't change from under you
while doing the allocation.

>These are issues that need to be addressed if you expect to use this

I always tried to keep these issues in mind (also before writing the
classzone approch).

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 17:06 ` Andrea Arcangeli
@ 2000-04-26 17:36   ` Kanoj Sarcar
  2000-04-26 21:58     ` Andrea Arcangeli
  0 siblings, 1 reply; 24+ messages in thread
From: Kanoj Sarcar @ 2000-04-26 17:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Mark_H_Johnson.RTS, linux-mm, riel, Linus Torvalds

> 
> On NUMA hardware you have only one zone per node since nobody uses ISA-DMA
> on such machines and you have PCI64 or you can use the PCI-DMA sg for
> PCI32. So on NUMA hardware you are going to have only one zone per node
> (at least this was the setup of the NUMA machine I was playing with). So
> you don't mind at all about classzone/zone. Classzone and zone are the
> same thing in such a setup, they both are the plain ZONE_DMA zone_t.
> Finished. Said that you don't care anymore about the changes of how the
> overlapped zones are handled since you don't have overlapped zones in
> first place.

Andrea, are you talking about the SGI Origin platform, or are you 
using some other NUMA platform? In any case, the SGI platform in fact
does not support ISA-DMA, but unfortunately, I don't think just because
it has PCI mapping registers, you can assume that all memory is DMAable.
For us to be able to consider all memory as dmaable, before each dma
operation starts, we need to have a pci-dma type hook to program the
mapping registers. As far as I know, such a hook is not used on all
drivers (in 2.4 timeframe), so very unfortunately, I think we need
to keep the option open about each node having more than just ZONE_DMA.
Finally, I am not sure how things will work, we are still busy trying
to get the Origin/Linux port going.

FWIW, I think the IBM/Sequent NUMA machines in fact have nodes that 
have only nondmaable memory.

> 
> If you move the NUMA balancing and node selection into the higher layer
> as I was proposing, instead you can do clever things.
>

For an example and a (old) patch for this, look at 

	http://oss.sgi.com/projects/numa/download/numa.gen.42b
	http://oss.sgi.com/projects/numa/download/numa.plat.42b

Kanoj 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 17:36   ` Kanoj Sarcar
@ 2000-04-26 21:58     ` Andrea Arcangeli
  0 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2000-04-26 21:58 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Mark_H_Johnson.RTS, linux-mm, riel, Linus Torvalds

On Wed, 26 Apr 2000, Kanoj Sarcar wrote:

>[..] As far as I know, such a hook is not used on all
>drivers (in 2.4 timeframe), [..]

That's also why I still have the alpha HIGHMEM support in my TODO list so
we can ship a binary only kernel that doesn't risk to break with >2g RAM.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 16:03 2.3.x mem balancing Mark_H_Johnson.RTS
  2000-04-26 17:06 ` Andrea Arcangeli
@ 2000-04-26 17:43 ` Kanoj Sarcar
  1 sibling, 0 replies; 24+ messages in thread
From: Kanoj Sarcar @ 2000-04-26 17:43 UTC (permalink / raw)
  To: Mark_H_Johnson.RTS; +Cc: Andrea Arcangeli, linux-mm, riel, torvalds

> In the context of "memory balancing" - all processors and all memory is NOT
> equal in a NUMA system. To get the best performance from the hardware, you
> prefer to put "all" of the memory for each process into a single memory unit -
> then run that process from a processor "near" that memory unit. This seemingly
> simple principle has a lot of problems behind it. What about...
>  - shared read only memory (e.g., libraries) [to clone or not?]
>  - shared read/write memory [how to schedule work to be done when load >> "local
> capacity"]
>  - when memory is low, which pages should I remove?
>  - when I start a new job, even when there is lots of free memory, where should
> I load the job?

The problem is, every app has different requirements, and performs best under
different resource (cpu/memory) scheduling policies. IRIX provides a tool 
called "dplace", that will allow performance experts specify which threads
of a program should be run on cpus on which node, and how different sections
of the address space should have their pages allocated (that is, on which 
nodes; possible policies: firsttouch, ie, allocate the page on the node 
which has the processor that first accesses that page, roundrobin, ie, 
round robin the allocations across all nodes, etc etc). 

Linux is a little away from providing such flexible options, specially
since it is not even possible to pin a process to a cpu or node yet. 
The page allocation strategies are of course much more work to implement.

For global issues like "when memory is low, which pages should I remove"
the problem is a little more complex. Having a kswapd per node is an option,
although I think it is too early to decide that. I am hoping we can get
a multinode system up soon, and investigate these issues.

Kanoj

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
@ 2000-04-26 19:06 frankeh
  0 siblings, 0 replies; 24+ messages in thread
From: frankeh @ 2000-04-26 19:06 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Andrea Arcangeli, Mark_H_Johnson.RTS, linux-mm, riel,
	Linus Torvalds, pratap

Kanoj, this is the issue I raised earlier on the board, but didn't get a
reply...

Yes, one NUMA machine here at IBM research consists of a 4-node cluster of
4-way xeon boxes.
When NUMA-d together, each memory controller simply relocates its own node
memory to
a designated 1-GB range and forwards other requests to the appropriate
nodes while maintaining cache coherence.

This ofcourse leads to the situation, that only the first node will have
DMA memory, given the 1GB kernel limitation.

I used to have  a software solution to this namely by rewritting the __pa
and __va macros to do some remapping
which would allow each node to provide some kernel virtual DMA memory.

Now how do you believe the architectures (particular x86 based NUMA
systems) will evolve ?

As with respect to some other messages regarding the zones.

With respect to NUMA allocation, I still like to see happening what was
pointed out for the IRIX and which is for instance
also available on NUMAQ/Dynix as well. Namely resource classes.

A resource class to be a set of basic resources such as (CPUs and memory,
i.e nodes) on which to restrict execution and allocation for user processes

(a) we have a full CPU affinity patch, driven by a system call interface
that restricts execution to a set of specified CPUs     .. any takers ...

(b) kanoj and I made a first attempt (~2.3.48 timeframe) to restrict
allocation to certain nodes, but the swapping behavior never properly
worked and with
     the constant changes under 2.3.99-preX, I put this on ice until the vm
becomes somewhat more stable.
    Again, I want to specify a set of nodes from where to allocate memory .
   Given a node set specification, I would like to treat the zones of the
same class on all those specified nodes (e.g. ZONE_HIGH) as a single target
class. Only if it can not allocate within that combined class  on the
specified set of nodes, should the allocator decent into the next lower
class.

   Open ofcourse in this spec is what will be effected by the memory
specification ??? only user pages, or pages that go to memory mapped files
as well?

kanoj@google.engr.sgi.com (Kanoj Sarcar) on 04/26/2000 01:36:48 PM

To:   andrea@suse.de (Andrea Arcangeli)
cc:   Mark_H_Johnson.RTS@raytheon.com, linux-mm@kvack.org,
      riel@nl.linux.org, torvalds@transmeta.com (Linus Torvalds)
Subject:  Re: 2.3.x mem balancing

>
> On NUMA hardware you have only one zone per node since nobody uses
ISA-DMA
> on such machines and you have PCI64 or you can use the PCI-DMA sg for
> PCI32. So on NUMA hardware you are going to have only one zone per node
> (at least this was the setup of the NUMA machine I was playing with). So
> you don't mind at all about classzone/zone. Classzone and zone are the
> same thing in such a setup, they both are the plain ZONE_DMA zone_t.
> Finished. Said that you don't care anymore about the changes of how the
> overlapped zones are handled since you don't have overlapped zones in
> first place.

Andrea, are you talking about the SGI Origin platform, or are you
using some other NUMA platform? In any case, the SGI platform in fact
does not support ISA-DMA, but unfortunately, I don't think just because
it has PCI mapping registers, you can assume that all memory is DMAable.
For us to be able to consider all memory as dmaable, before each dma
operation starts, we need to have a pci-dma type hook to program the
mapping registers. As far as I know, such a hook is not used on all
drivers (in 2.4 timeframe), so very unfortunately, I think we need
to keep the option open about each node having more than just ZONE_DMA.
Finally, I am not sure how things will work, we are still busy trying
to get the Origin/Linux port going.

FWIW, I think the IBM/Sequent NUMA machines in fact have nodes that
have only nondmaable memory.

>
> If you move the NUMA balancing and node selection into the higher layer
> as I was proposing, instead you can do clever things.
>

For an example and a (old) patch for this, look at

     http://oss.sgi.com/projects/numa/download/numa.gen.42b
     http://oss.sgi.com/projects/numa/download/numa.plat.42b

Kanoj
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <Pine.LNX.4.21.0004250401520.4898-100000@alpha.random>]

* Re: 2.3.x mem balancing
       [not found] <Pine.LNX.4.21.0004250401520.4898-100000@alpha.random>
@ 2000-04-25 16:57 ` Linus Torvalds
  2000-04-25 17:50   ` Rik van Riel
  2000-04-26  1:07   ` Andrea Arcangeli
  0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2000-04-25 16:57 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
> 
> The design I'm using is infact that each zone know about each other, each
> zone have a free_pages and a classzone_free_pages. The additional
> classzone_free_pages gives us the information about the free pages on the
> classzone and it's also inclusve of the free_pages of all the lower zones.

AND WHAT ABOUT SETUPS WHERE THERE ISNO INCLUSION?

Andrea, face it, the design is WRONG!

You've made both free_pages() and alloc_pages() more complex and costly,
and the per-zone spinlock cannot exist any more. WHICH IS BAD BAD BAD.

Thing of the case of a NUMA architecture - you sure as hell want to have
all "local" zones share the same spinlock, because then you'll have to
grab a global spinlock for each allocation, even if you only allocate from
a local zone closeto the node you're actually running on.

I tell you - you are doing the wrong thing. The earlier you realize that,
the better.

> >> Now assume rest of memory zone (ZONE_NORMAL) is getting under the
> >> zone_normal->page_low watermark. We must definitely not start kswapd if
> >> there are still 16mbyte of free memory in the classzone.
> >
> >No.
> >
> >We should just not allocate from that zone. Look at what
> >__get_free_pages() does: it tries to first find _any_ zone that it can
> >allocate from, and if it cannot find such a zone only _then_ does it
> >decide to start kswapd.
> 
> Woops I did wrong example to explain the suprious kswapd run and I also
> didn't explained the real problem in that scenario.
> 
> The real problem in that scenario is that you don't empty the ZONE_NORMAL
> but you stop at the low watermark, while you should empty the ZONE_NORMAL
> complelty for allocation that supports ZONE_NORMAL memory (so for non
> ISA_DMA memory) before falling back into the ZONE_DMA zone.

Oh?

And what is wrong with just changing the water-marks, instead of your
global (and in my opinion stupid and wrong) change?

Why didn't you just do a small little initialization routine that made the
watermark for the DMA zone go up, and the watermark for the bigger zones
go down?

Let's face it, with the current Linux per-zone memory allocation, doing
things like that is _trivial_. There are no magic couplings between
different zones, so if you want to make sure that it's primarily only the
DMA zone that is kept free, then you can do _exactly_ that by saying
simply that the "critical watermark for the DMA zone should be 5%, while
the critical watermark for the regular zone should be just 1%".

I did no tuning at all of the watermarks when I changed the zone
behaviour. My bad. But that does NOT mean that the whole mm behaviour
should then be reverted to the old and broken setup. It only means that
the zone behaviour should be tuned.

> You can't
> optimize the zone usage without knowledge on the whole classzone. That's
> the first basic thing where the strict zone based design will be always
> inferior compared to a classzone based design.

You are wrong. And you are so FUNDAMENTALLY wrong that I'm at a loss to
even explain why.

What's the matter with just realizing that the whole issue of DMA vs
non-DMA, and local zone vs remote zone is just a very generic case of
trying to balance memory usage. It has nothing at all to do with
"inclusion", and I personally think that the whole notion of "inclusion"
is fundamentally flawed. It adds a rule that shouldn't be there, and has
no meaning.

Andrea, how do you ever propose to handle the case of four different
memory zones, all "equal", but all separate in that while all of memory
isaccessible from each CPU, each zone is "closer" to certain CPU's? Let's
say that CPU's 0-3 have direct access to zone 0, CPU's 4-7 have direct
access to zone 1, etc etc.. Whenever a CPU touches memory on a non-local
zone, it takes longer for the cache miss, but it still works.

Now, the way =I= propose that this be handled is:

 - each cluster has its own zone list: cluster 0 has the list 0, 1, 2, 3,
   while cluster 1 has the list 1, 2, 3, 0 etc. In short, each of them
   preferentially allocate from their _own_ cluster. Together with
   cluster-affinity for the processes, this way we can naturally (and with
   no artificial code) try to keep cross-cluster memory accesses to
   a minimum.

And note - the above works _now_ with the current code. And it will never
ever work with your approach. Because zone 0 is not a subset of zone 1,
they are both equal. It's just that there is a preferential order for
allocation depending on who does the allocation.

> About kswapd suppose we were low on memory on both the two zones and so we
> started kswapd on both zones and we keep allocating waiting to reach the
> min watermark. Then while kswapd it's running a process exits and release
> all the first 16mbyte of RAM. kswapd correctly stops freeing the ISADMA
> zone (because free_pages set zone_wake_kswapd back to zero) but kswapd
> will keep freeing the normal zone for no good reason. You have no way to
> stop the wrong kswapd in such scenario without a classzone design. If you
> keep looking only at zone->free_pages and zone->pages_high then kswapd
> will keep running for no good reason.

The fact that you think that this is true obviously means that you haven't
thought AT ALL about the ways to just make the watermarks work that way.

Give it five seconds, and you'll see that the above is _exactly_ what the
watermarks control, and _exactly_ by making all the flags per-zone
(instead of global or per-class). And by just using the watermark
heuristics to determine which zone to use for new allocations too, you get
into a situation where you can always say "this zone is currently the best
for new allocations, and these other zones are being free'd up because
they are getting low on memory".

In short, you've only convinced me _not_ to touch your patches, by showing
that you haven't even though about what the current setup really means.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-25 16:57 ` Linus Torvalds
@ 2000-04-25 17:50   ` Rik van Riel
  2000-04-25 18:11     ` Jeff Garzik
  2000-04-25 18:53     ` Linus Torvalds
  2000-04-26  1:07   ` Andrea Arcangeli
  1 sibling, 2 replies; 24+ messages in thread
From: Rik van Riel @ 2000-04-25 17:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrea Arcangeli, linux-mm

On Tue, 25 Apr 2000, Linus Torvalds wrote:

> Andrea, how do you ever propose to handle the case of four
> different memory zones, all "equal", but all separate in that
> while all of memory isaccessible from each CPU, each zone is
> "closer" to certain CPU's? Let's say that CPU's 0-3 have direct
> access to zone 0, CPU's 4-7 have direct access to zone 1, etc
> etc.. Whenever a CPU touches memory on a non-local zone, it
> takes longer for the cache miss, but it still works.

[snip different zonelists for each node at allocation time]

There's only one small addition that I'd like to see. Memory
should be reclaimed on a more or less _global_ level because
the processes in node 0 could use much less memory than the
processes in node 1.

Doing strict per-zone memory balancing in this case means that
node 0 will have a bunch of idle pages lying around while node
1 is swapping...

(and yes, I have this implemented in the patch I'm working on
and it mostly works. It just needs to be tuned some more before
it's ready for inclusion)

Another thing which we probably want before 2.4 is scanning
big processes more agressively than small processes. I've
implemented most of what is needed for that and it seems to
have a good influence on performance because:
- small processes suffer less from the presence of memory hogs
- memory hogs have their pages aged more agressively, making it
  easier for them to do higher throughput from/to swap or disk

The algorithm I'm using for that now is quite simple. At the 
time where we assign mm->swap_cnt we remember the biggest
process. After that we do a second loop, and reduce mm->swap_cnt
for smaller processes using this simple formula:

                 /* small processes are swapped out less */
                 while ((mm->swap_cnt << 2 * i) < max_cnt)
                          i++;
                 mm->swap_cnt >>= i;
                 mm->swap_cnt += i; /* in case swap_cnt reaches 0 */

We may want to refine this a bit in the future, but this form
seems to work quite well. A possible addition is to set a flag
for all the "big" processes (where i is 0) and have them run
swap_out() on every memory allocation...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-25 17:50   ` Rik van Riel
@ 2000-04-25 18:11     ` Jeff Garzik
  2000-04-25 18:33       ` Rik van Riel
  2000-04-25 18:53     ` Linus Torvalds
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff Garzik @ 2000-04-25 18:11 UTC (permalink / raw)
  To: riel; +Cc: Linus Torvalds, Andrea Arcangeli, linux-mm

Rik van Riel wrote:
> Another thing which we probably want before 2.4 is scanning
> big processes more agressively than small processes. I've
> implemented most of what is needed for that and it seems to
> have a good influence on performance because:
> - small processes suffer less from the presence of memory hogs
> - memory hogs have their pages aged more agressively, making it
>   easier for them to do higher throughput from/to swap or disk

Since you do not mention a new sysctl here...

The change you propose is policy.  Favoring interactivity over memory
hogs is not always a good idea and should be left up to the sysadmin not
kernel hacker to decide.

	Jeff




-- 
Jeff Garzik              | Nothing cures insomnia like the
Building 1024            | realization that it's time to get up.
MandrakeSoft, Inc.       |        -- random fortune
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-25 18:11     ` Jeff Garzik
@ 2000-04-25 18:33       ` Rik van Riel
  0 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2000-04-25 18:33 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linus Torvalds, Andrea Arcangeli, linux-mm

On Tue, 25 Apr 2000, Jeff Garzik wrote:
> Rik van Riel wrote:
> > Another thing which we probably want before 2.4 is scanning
> > big processes more agressively than small processes. I've
> > implemented most of what is needed for that and it seems to
> > have a good influence on performance because:
> > - small processes suffer less from the presence of memory hogs
> > - memory hogs have their pages aged more agressively, making it
> >   easier for them to do higher throughput from/to swap or disk
> 
> Since you do not mention a new sysctl here...

Yeah, I forgot to mention that. This is something which can
be made switchable by the admin very easily.

I'll add the sysctl switch (and remove some old redundant
ones) later, when the code has stabilised and we know what's
needed.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-25 17:50   ` Rik van Riel
  2000-04-25 18:11     ` Jeff Garzik
@ 2000-04-25 18:53     ` Linus Torvalds
  2000-04-25 19:27       ` Rik van Riel
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2000-04-25 18:53 UTC (permalink / raw)
  To: riel; +Cc: Andrea Arcangeli, linux-mm

On Tue, 25 Apr 2000, Rik van Riel wrote:
>
> There's only one small addition that I'd like to see. Memory
> should be reclaimed on a more or less _global_ level because
> the processes in node 0 could use much less memory than the
> processes in node 1.
> 
> Doing strict per-zone memory balancing in this case means that
> node 0 will have a bunch of idle pages lying around while node
> 1 is swapping...

This is why the page allocator has to have some knowledge about the whole
list of zones it allocates from.

The current one actually does that: before it tries to start paging it
first tries to find a zone that doesn't need any paging. So in this case
if node 0 is full, but there are empty pages in node 1, the page allocator
_will_ allocate from node 1 instead.

That is obviously not to say that the current code gets the heuristics
actually =right=. There are certainly bugs in the heuristics, as shown by
bad performance. David Miller pointed out that there also seems to be a
memory leak in the swap cache handling, so it can be more than just the
zone balancing that is wrong.

My argument is really not that the current code is perfect - it very
obviously is not. But I am 100% convinced that it is much better to have
independent memory-allocators with some common heuristics than it is to
try to force a global order on them.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-25 18:53     ` Linus Torvalds
@ 2000-04-25 19:27       ` Rik van Riel
  2000-04-26  0:26         ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2000-04-25 19:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrea Arcangeli, linux-mm

On Tue, 25 Apr 2000, Linus Torvalds wrote:
> On Tue, 25 Apr 2000, Rik van Riel wrote:
> >
> > There's only one small addition that I'd like to see. Memory
> > should be reclaimed on a more or less _global_ level because
> > the processes in node 0 could use much less memory than the
> > processes in node 1.
> > 
> > Doing strict per-zone memory balancing in this case means that
> > node 0 will have a bunch of idle pages lying around while node
> > 1 is swapping...
> 
> This is why the page allocator has to have some knowledge about
> the whole list of zones it allocates from.
> The current one actually does that:

Certainly, the current allocator is excellent for what we
do now. The only improvement I could see is quantification
of the memory load on zones and having the allocator eg. not
chose a non-local page if the memory load on the other node
is more than 90% of the memory load here.

(Stephen and me have some ideas on this, if I get the code
stable before 2.4 I'll submit it)

> That is obviously not to say that the current code gets the
> heuristics actually =right=. There are certainly bugs in the
> heuristics, as shown by bad performance.

The only bug I can see is that page _freeing_ in the current
code is done on a per-zone basis, so that we could end up with
a whole bunch of underused pages in one zone and too much
memory pressure in the other zone.

The allocation algorithm correctly choses the right zone to
allocate, but it can only do that if the freeing of pages is
done in such a way that those hints are available to the
allocation code. My patch (not completely ready yet) aims to
fix that.

> My argument is really not that the current code is perfect - it very
> obviously is not. But I am 100% convinced that it is much better to have
> independent memory-allocators with some common heuristics than it is to
> try to force a global order on them.

*nod*  Without this we can forget Linux on NUMA...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-25 19:27       ` Rik van Riel
@ 2000-04-26  0:26         ` Linus Torvalds
  2000-04-26  1:19           ` Rik van Riel
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2000-04-26  0:26 UTC (permalink / raw)
  To: riel; +Cc: Andrea Arcangeli, linux-mm

On Tue, 25 Apr 2000, Rik van Riel wrote:
> 
> > That is obviously not to say that the current code gets the
> > heuristics actually =right=. There are certainly bugs in the
> > heuristics, as shown by bad performance.
> 
> The only bug I can see is that page _freeing_ in the current
> code is done on a per-zone basis, so that we could end up with
> a whole bunch of underused pages in one zone and too much
> memory pressure in the other zone.

Yes.

I removed the "zone" argument from try_to_swap_out(), because I just
always found it odious that it walked the page tables but then ignored all
the work it did if the page happened to be from the wrong zone - even if
that zone happened to be low on memory too. Just because we passed in one
zone, and the page to be free'd was of another zone equally well suited to
being free'd.

However, when I removed that argument, I completely removed the logic to
avoid freeing pages from an ok zone. And it really should be there, but it
should look something like

	/* Don't free a page if the zone in question is fine */
	if (!page->zone->zone_wake_kswapd)
		return 0;

instead of what it used to be (ie used to be something like

	if (zone && page->zone != zone)
		return 0;

which is just bogus, in my opinion, and I prefer removing bogus code
completely in order for it to not entrench itself too much).

That single test might actually improve things a lot, but this is the kind
of issue that needs testing/thinking beyond just my kind of "I think this
is the way things should work" approach. I tend to think that I'm good at
knowing how things _should_ work, but I'm just horribly lazy and bad at
the final test-and-tweak kind of thing. And this definitely needs some of
that.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26  0:26         ` Linus Torvalds
@ 2000-04-26  1:19           ` Rik van Riel
  0 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2000-04-26  1:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrea Arcangeli, linux-mm

On Tue, 25 Apr 2000, Linus Torvalds wrote:
> On Tue, 25 Apr 2000, Rik van Riel wrote:
> > 
> > The only bug I can see is that page _freeing_ in the current
> > code is done on a per-zone basis, so that we could end up with
> > a whole bunch of underused pages in one zone and too much
> > memory pressure in the other zone.
> 
> 	/* Don't free a page if the zone in question is fine */
> 	if (!page->zone->zone_wake_kswapd)
> 		return 0;

This will only start freeing memory from a zone *after*
it has been low on memory, which could take ages if the
memory movement in that zone is very low...

if (page->zone->free_pages > page->zone->pages_high)
	return 0;

This way we'll always free the least used pages from
other zones, up to zone->pages_high, regardless of
memory pressure on that zone. This means that the
allocator has an easier job of identifying "idle" zones
and that load balancing between zones is way faster.

I've been running this code (in shrink_mmap()) for almost
one week now and it seems to work pretty well.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-25 16:57 ` Linus Torvalds
  2000-04-25 17:50   ` Rik van Riel
@ 2000-04-26  1:07   ` Andrea Arcangeli
  2000-04-26  2:10     ` Rik van Riel
  1 sibling, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2000-04-26  1:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-mm

On Tue, 25 Apr 2000, Linus Torvalds wrote:

>On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
>> 
>> The design I'm using is infact that each zone know about each other, each
>> zone have a free_pages and a classzone_free_pages. The additional
>> classzone_free_pages gives us the information about the free pages on the
>> classzone and it's also inclusve of the free_pages of all the lower zones.
>
>AND WHAT ABOUT SETUPS WHERE THERE ISNO INCLUSION?

They're simpler. The classzone for them matches with the zone. Actually I
told the kernel that a classzone is always composed by the zone itself
joined with all the lower zones (since I tought we're not going to have a
non lower-zone-inclusive setup any time soon and so I made the code faster
harcoding such assumption), but if that won't be true any longer we only
need to tell the code that updates classzone_free_pages what zones the
clazzone is composed by. That can be achieved by having the list of the
zones that compose the classzone in an array allocated into the zone_t. A
new zone->classzones[MAX_NR_ZONES+1] will do the trick. If there's no
inclusion zones->classzones[0] will be equal to zone and [1] will be NULL.
I can do that now if you think we'll _soon_ need that genericity.

Please consider this. When we do a GFP_KERNEL allocation, we want to
allocate from 1 zone. Not from two zones. When we do:

	alloc_page(GFP_KERNEL);

we want to allocate from 1 zone that is between 0 and 2giga.

	0						2g
	-------------------------------------------------
	| ZONE_DMA	| ZONE_NORMAL			|
	-------------------------------------------------
	|            GFP_KERNEL				|
	-------------------------------------------------

Incidentally we also have somebody (GFP_DMA) that wants to allocate from
the ZONE_DMA and so to be able to provide GFP_DMA allocations in O(1) we
can't create a single indpendent ZONE_NORMAL that spawns between 0 and 2g,
but we have to split the GFP_KERNEL allocation place in two zones that are
ZONE_DMA and ZONE_NORMAL.

But alloc_pages really have to consider the 0-2g a single zone because
alloc_pages gets the semantic of the zone by the user that consider the
0-2g range a single zone.

If it's true as you say that we can make to work alloc_pages on the 0-2g
zone using two completly separated and not-related-in-any-way zones, then
we could also split the ZONE_DMA is ZONE_DMA0 and ZONE_DMA1 and be able to
keep the ZONE_DMA1 classzone balanced as if it would be still ZONE_DMA.
We could in the same way split ZONE_NORMAL in several zones and still keep
ZONE_NORMAL in perfect balance.

If the current zone based design is not broken, then we can split any zone
in as _many_ zones as we _want_ without hurting anything (except some
more CPU and memory resources wasted in the zone arrays since there would
be more zones).

This is obviously wrong and this also proof the current design is
broken.

>Andrea, face it, the design is WRONG!
>
>You've made both free_pages() and alloc_pages() more complex and costly,
>and the per-zone spinlock cannot exist any more. WHICH IS BAD BAD BAD.

rmqueue and __free_pages_ok are now doing at once the work that
nr_free_buffer_pages was doing all the time. I don't know how much these
changes are sensitive in performance.

For the spinlock I fully agree, previous code was very nicer. However I
couldn't find a way to avoid parallel alloc_pages() to fool the memory
balancing by using a per-zone lock (I could acquire more than one spinlock
in zone-decreasing-order but that was going to hurt too much) and so IMHO
the previous code was risky indipendent of the zone problems.

>Thing of the case of a NUMA architecture - you sure as hell want to have
>all "local" zones share the same spinlock, because then you'll have to
>grab a global spinlock for each allocation, even if you only allocate from
>a local zone closeto the node you're actually running on.

I'm sorry but I'm not sure to understand what you mean. NUMA scaling is
not more penalized than non NUMA scaling as far I can tell. The spinlock
is now in the node so you are still allowed to allocate from two
_different_ nodes at the same time as before.

node0->node_zones[0] != node1->node_zones[0] && node0->freelist_lock != node1->freelist_lock

>I tell you - you are doing the wrong thing. The earlier you realize that,
>the better.

I apologise but I still think I'm doing the right thing and that the
current strict zone based design is not correct and that it will end doing
the wrong thing sometime with non obvious drawbacks.

Linus, you can convinced me immediatly if you show me how I can make sure
that all the ZONE_NORMAL is empty before going to allocate in the ZONE_DMA
with the current design. I also like to know how can I stop kswapd that is
trying to raise the ZONE_NORMAL->free_pages over ZONE_NORMAL->pages_high
after kswapd succesfully put the ZONE_DMA over the ZONE_DMA->pages_high
limit in the previous pass (assuming ZONE_DMA->pages_high >=
ZONE_NORMAL->pages_high, that's perfectly allowed value). Also from
swap_out how can I know how much a page is critical by only looking at how
many free pages are in its zone (think if the lower zones are completly
free).

>> >> Now assume rest of memory zone (ZONE_NORMAL) is getting under the
>> >> zone_normal->page_low watermark. We must definitely not start kswapd if
>> >> there are still 16mbyte of free memory in the classzone.
>> >
>> >No.
>> >
>> >We should just not allocate from that zone. Look at what
>> >__get_free_pages() does: it tries to first find _any_ zone that it can
>> >allocate from, and if it cannot find such a zone only _then_ does it
>> >decide to start kswapd.
>> 
>> Woops I did wrong example to explain the suprious kswapd run and I also
>> didn't explained the real problem in that scenario.
>> 
>> The real problem in that scenario is that you don't empty the ZONE_NORMAL
>> but you stop at the low watermark, while you should empty the ZONE_NORMAL
>> complelty for allocation that supports ZONE_NORMAL memory (so for non
>> ISA_DMA memory) before falling back into the ZONE_DMA zone.
>
>Oh?
>
>And what is wrong with just changing the water-marks, instead of your
>global (and in my opinion stupid and wrong) change?
>
>Why didn't you just do a small little initialization routine that made the
>watermark for the DMA zone go up, and the watermark for the bigger zones
>go down?
>
>Let's face it, with the current Linux per-zone memory allocation, doing
>things like that is _trivial_. There are no magic couplings between
>different zones, so if you want to make sure that it's primarily only the
>DMA zone that is kept free, then you can do _exactly_ that by saying
>simply that the "critical watermark for the DMA zone should be 5%, while
>the critical watermark for the regular zone should be just 1%".

As first this mean you'll left an 1% of the ZONE_NORMAL free, while you
should have allocated also such remaining 1% before falling back in the
ZONE_DMA and so your solution doesn't solve the problem but only hides it
better. There's no one single good reason for which you should left such
1% free, while there are obvious good reason for which you should allocate
also such 1% before falling back on the lower zone.

As second using 5% and 1% of critical watermarks won't give you a 6%
watermark for the ZONE_NORMAL _class_zone but it will give you a 1%
watermark instead and you probably wanted a 6% watermark to provide
rasonable space for atomic allocations and for having more chances of
doing high order allocations.

(note the numbers 1%/5% and global 6% are just random values for me now,
I'm not even thinking if they are a good default or not, just assume they
are a good default for the following example)

Assume you want a 6% watermark on the zone-normal classzone, ok?

Now suppose 95% of the ZONE_DMA is mlocked and a ISA-DMA network card
allocates from irqs the latest 5%. Suppose the ZONE_NORMAL is all
_freeable_ in not mapped page cache but that 99% of the ZONE_NORMAL is
allocated in page cache and only the 1% is free. Then the VM with the
current design will take only 1% free in the ZONE_NORMAL _classzone_
because it have no knowledge about the classzone.

If pages_high would been the 6% and referred to the classzone the VM would
have immediatly and correctly shrunk an additional 5% from the freeable
ZONE_NORMAL. See?

You can't fix that problem without changing design.

>I did no tuning at all of the watermarks when I changed the zone
>behaviour. My bad. But that does NOT mean that the whole mm behaviour
>should then be reverted to the old and broken setup. It only means that
>the zone behaviour should be tuned.

If you'll try to be friendly with ZONE_DMA (by allocating from the
ZONE_NORMAL when possible) you'll make the allocation from the higher
zones less reliable and you could end doing a differently kind of wrong
thing as explained a few lines above.

>> You can't
>> optimize the zone usage without knowledge on the whole classzone. That's
>> the first basic thing where the strict zone based design will be always
>> inferior compared to a classzone based design.
>
>You are wrong. And you are so FUNDAMENTALLY wrong that I'm at a loss to
>even explain why.
>
>What's the matter with just realizing that the whole issue of DMA vs
>non-DMA, and local zone vs remote zone is just a very generic case of
>trying to balance memory usage. It has nothing at all to do with
>"inclusion", and I personally think that the whole notion of "inclusion"
>is fundamentally flawed. It adds a rule that shouldn't be there, and has
>no meaning.
>
>Andrea, how do you ever propose to handle the case of four different
>memory zones, all "equal", but all separate in that while all of memory
>isaccessible from each CPU, each zone is "closer" to certain CPU's? Let's
>say that CPU's 0-3 have direct access to zone 0, CPU's 4-7 have direct
>access to zone 1, etc etc.. Whenever a CPU touches memory on a non-local
>zone, it takes longer for the cache miss, but it still works.

Note that what you call zones are really nodes. Your zone 0 is not a
zone_t but a pg_data_t instead.

>Now, the way =I= propose that this be handled is:
>
> - each cluster has its own zone list: cluster 0 has the list 0, 1, 2, 3,
>   while cluster 1 has the list 1, 2, 3, 0 etc. In short, each of them
>   preferentially allocate from their _own_ cluster. Together with
>   cluster-affinity for the processes, this way we can naturally (and with
>   no artificial code) try to keep cross-cluster memory accesses to
>   a minimum.

That have to be done in alloc_pages_node that later will fallback into the
alloc_pages. This problematic is not relevant for alloc_pages IMVHO.

In your scenario you'll have only 1 zone per node and there's never been
any problem with only one zone per node since classzone design is
completly equal to zone design in such case.

All zones in a single node by design have to be all near or far in the
same way to all the cpus since they all belong to the same node.

>And note - the above works _now_ with the current code. And it will never
>ever work with your approach. Because zone 0 is not a subset of zone 1,
>they are both equal. It's just that there is a preferential order for
>allocation depending on who does the allocation.

IMVHO you put the NUMA stuff in the wrong domain by abusing the zones
instead of using the proper nodes.

>> About kswapd suppose we were low on memory on both the two zones and so we
>> started kswapd on both zones and we keep allocating waiting to reach the
>> min watermark. Then while kswapd it's running a process exits and release
>> all the first 16mbyte of RAM. kswapd correctly stops freeing the ISADMA
>> zone (because free_pages set zone_wake_kswapd back to zero) but kswapd
>> will keep freeing the normal zone for no good reason. You have no way to
>> stop the wrong kswapd in such scenario without a classzone design. If you
>> keep looking only at zone->free_pages and zone->pages_high then kswapd
>> will keep running for no good reason.
>
>The fact that you think that this is true obviously means that you haven't
>thought AT ALL about the ways to just make the watermarks work that way.

I don't see how playing with the watermarks can help.

Playing with the watermarks is been my first idea but I discarded it
immediatly when I seen what would be happened by lowering the higher zones
watermarks. And also the watermark strict per zone doesn't make any sense
in first place. You can't say how much you should free from ZONE_NORMAL if
you don't know the state of the lower ZONE_DMA zone. All such
pages_{high,low,min} watermarks can make sense only if referred to the
classzone.

>Give it five seconds, and you'll see that the above is _exactly_ what the
>watermarks control, and _exactly_ by making all the flags per-zone
>(instead of global or per-class). And by just using the watermark

IMVO we used the wrong term since the first place. Kanoj wrote a function
called classzone at 2.3.[345]? time, and so I inherit his 'classzone' name
in my emails. But what I did is not really a classzone design, but I only
reconstructed the real _zone_ where GFP_KERNEL wants to allocate from. In
2.3.99-pre6-5 the monolithic zone where GFP_KERNEL wants to allocate from,
it's been broken in two unrelated, disconnected pieces. I only re-joined
it in once piece since it's a _single_ zone. It's a more problematic zone
since it's overlapped with the lower zones and so we have to update it
also when we change the lower zones, but it's really a _single_ zone that
shares the same equal properties of the other zones.

>In short, you've only convinced me _not_ to touch your patches, by showing
>that you haven't even though about what the current setup really means.

Note that I have started saying "let's try to give the current code better
balance since it seems we're calling swap_out two more times than
necessary". And the first time I read the new alloc_pages code I thought
"cute idea to implement alloc_pages that way, we avoid the cost of having
to calculate the number of free pages in the classzone".

_Then_ when I tried to give to the smarter code the expected right
behaviour I noticed I couldn't in lots of cases and that's the only reason
that caused me to change the design of the zones.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26  1:07   ` Andrea Arcangeli
@ 2000-04-26  2:10     ` Rik van Riel
  2000-04-26 11:24       ` Stephen C. Tweedie
  2000-04-26 14:19       ` Andrea Arcangeli
  0 siblings, 2 replies; 24+ messages in thread
From: Rik van Riel @ 2000-04-26  2:10 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-mm

On Wed, 26 Apr 2000, Andrea Arcangeli wrote:
> On Tue, 25 Apr 2000, Linus Torvalds wrote:
> 
> >On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
> >> 
> >> The design I'm using is infact that each zone know about each other, each
> >> zone have a free_pages and a classzone_free_pages. The additional
> >> classzone_free_pages gives us the information about the free pages on the
> >> classzone and it's also inclusve of the free_pages of all the lower zones.
> >
> >AND WHAT ABOUT SETUPS WHERE THERE ISNO INCLUSION?
> 
> They're simpler. The classzone for them matches with the zone.

It doesn't. Think NUMA.

> clazzone is composed by. That can be achieved by having the list of the
> zones that compose the classzone in an array allocated into the zone_t. A
> new zone->classzones[MAX_NR_ZONES+1] will do the trick. If there's no
> inclusion zones->classzones[0] will be equal to zone and [1] will be NULL.
> I can do that now if you think we'll _soon_ need that genericity.

This sounds like the current code. The code your patch
deletes...

> ZONE_NORMAL that spawns between 0 and 2g, but we have to split
> the GFP_KERNEL allocation place in two zones that are ZONE_DMA
> and ZONE_NORMAL.
> 
> But alloc_pages really have to consider the 0-2g a single zone
> because alloc_pages gets the semantic of the zone by the user
> that consider the 0-2g range a single zone.

It does. If you read mm/page_alloc.c::__alloc_pages()
carefully, you'll see this code fragment which does
exactly that.

        for (;;) {
                zone_t *z = *(zone++);

                /* Are we supposed to free memory? Don't make it worse.. */
                if (!z->zone_wake_kswapd && z->free_pages > z->pages_low) {
                        struct page *page = rmqueue(z, order);
                        if (page)
                                return page;
                }
        }

Here it scans the entire zonelist for each allocation and
allocates from the first zone where we have enough free
pages. This will spead memory load across zones just fine.

> If the current zone based design is not broken, then we can
> split any zone in as _many_ zones as we _want_ without hurting
> anything (except some more CPU and memory resources wasted in
> the zone arrays since there would be more zones).

We can do this just fine. Splitting a box into a dozen more
zones than what we have currently should work just fine,
except for (as you say) higher cpu use by kwapd.

If I get my balancing patch right, most of that disadvantage
should be gone as well. Maybe we *do* want to do this on
bigger SMP boxes so each processor can start out with a
separate zone and check the other zone later to avoid lock
contention?

> This is obviously wrong and this also proof the current design
> is broken.

What's wrong with being able to split memory in arbitrary
zones without running into any kind of performance trouble?

> >I tell you - you are doing the wrong thing. The earlier you realize that,
> >the better.
> 
> Linus, you can convinced me immediatly if you show me how I can
> make sure that all the ZONE_NORMAL is empty before going to
> allocate in the ZONE_DMA with the current design.

[snip]

> As second using 5% and 1% of critical watermarks won't give you a 6%
> watermark for the ZONE_NORMAL _class_zone but it will give you a 1%
> watermark instead and you probably wanted a 6% watermark to provide
> rasonable space for atomic allocations and for having more chances of
> doing high order allocations.

So the 1% watermark for ZONE_NORMAL is too low ... fix that.

> If pages_high would been the 6% and referred to the classzone
> the VM would have immediatly and correctly shrunk an additional
> 5% from the freeable ZONE_NORMAL. See?

I see the situation and I don't see any problem with it.
Could you please explain to us what the problem with this
situation is?

> >I did no tuning at all of the watermarks when I changed the zone
> >behaviour. My bad. But that does NOT mean that the whole mm behaviour
> >should then be reverted to the old and broken setup. It only means that
> >the zone behaviour should be tuned.
> 
> If you'll try to be friendly with ZONE_DMA (by allocating from
> the ZONE_NORMAL when possible) you'll make the allocation from
> the higher zones less reliable and you could end doing a
> differently kind of wrong thing as explained a few lines above.

What's wrong with this?  We obviously need to set the limits
for ZONE_NORMAL to such a number that it's possible to do
higher-order allocations. That is no change from your proposal
and just means that your 1% example value is probably not
feasible. Then again, that's just an example value and has
absolutely nothing to do with the design principles of the
current code.

> >Now, the way =I= propose that this [NUMA] be handled is:
> >
> > - each cluster has its own zone list: cluster 0 has the list 0, 1, 2, 3,
> >   while cluster 1 has the list 1, 2, 3, 0 etc. In short, each of them
> >   preferentially allocate from their _own_ cluster. Together with
> >   cluster-affinity for the processes, this way we can naturally (and with
> >   no artificial code) try to keep cross-cluster memory accesses to
> >   a minimum.
> 
> That have to be done in alloc_pages_node that later will fallback into the
> alloc_pages. This problematic is not relevant for alloc_pages IMVHO.
> 
> In your scenario you'll have only 1 zone per node and there's never been
> any problem with only one zone per node since classzone design is
> completly equal to zone design in such case.
> 
> All zones in a single node by design have to be all near or far in the
> same way to all the cpus since they all belong to the same node.

You may want to do a s/cluster/node/ in Linus' paragraph and
try again, if that makes things more obvious. What you are
saying makes absolutely no sense at all to anybody who knows
how NUMA works. You may want to get a book on computer architecture
(or more sleep, if that was the problem here .. I guess we all have
that every once in a while).

> >And note - the above works _now_ with the current code. And it will never
> >ever work with your approach. Because zone 0 is not a subset of zone 1,
> >they are both equal. It's just that there is a preferential order for
> >allocation depending on who does the allocation.
> 
> IMVHO you put the NUMA stuff in the wrong domain by abusing the zones
> instead of using the proper nodes.

I'm sorry to say this, but you don't seem to understand how NUMA
works... We're talking about preferential memory allocation here
and IMHO the ONLY place to do memory allocation is in the memory
allocator.

> >The fact that you think that this is true obviously means that you haven't
> >thought AT ALL about the ways to just make the watermarks work that way.
> 
> I don't see how playing with the watermarks can help.
> 
> Playing with the watermarks is been my first idea but I
> discarded it immediatly when I seen what would be happened by
> lowering the higher zones watermarks. And also the watermark
> strict per zone doesn't make any sense in first place. You can't
> say how much you should free from ZONE_NORMAL if you don't know
> the state of the lower ZONE_DMA zone.

If there is a lot of free memory in ZONE_DMA, the memory
allocator will do the next allocations there, relieving the
other zones from memory pressure and not freeing pages there
at all. This seems to be _exactly_ what is needed and what
your classzones achieve at a much higher complexity and lower
flexibility...

> >Give it five seconds, and you'll see that the above is _exactly_ what the
> >watermarks control, and _exactly_ by making all the flags per-zone
> >(instead of global or per-class). And by just using the watermark
> 
> IMVO we used the wrong term since the first place. Kanoj wrote a function
> called classzone at 2.3.[345]? time, and so I inherit his 'classzone' name
> in my emails. But what I did is not really a classzone design, but I only
> reconstructed the real _zone_ where GFP_KERNEL wants to allocate from. In
> 2.3.99-pre6-5 the monolithic zone where GFP_KERNEL wants to allocate from,
> it's been broken in two unrelated, disconnected pieces. I only re-joined
> it in once piece since it's a _single_ zone.

Which doesn't make any sense at all since the allocator will do the
balancing between the zones automatically.

> >In short, you've only convinced me _not_ to touch your patches, by showing
> >that you haven't even though about what the current setup really means.
> 
> Note that I have started saying "let's try to give the current
> code better balance since it seems we're calling swap_out two
> more times than necessary"

Indeed, there's a small performance bug in the page freeing code
in the current kernels. The obvious place to fix that is (surprise)
in the page freeing code. The allocation code is fine and has been
for a long time.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26  2:10     ` Rik van Riel
@ 2000-04-26 11:24       ` Stephen C. Tweedie
  2000-04-26 16:44         ` Linus Torvalds
  2000-04-26 14:19       ` Andrea Arcangeli
  1 sibling, 1 reply; 24+ messages in thread
From: Stephen C. Tweedie @ 2000-04-26 11:24 UTC (permalink / raw)
  To: riel; +Cc: Andrea Arcangeli, Linus Torvalds, linux-mm

Hi,

On Tue, Apr 25, 2000 at 11:10:56PM -0300, Rik van Riel wrote:

> > As second using 5% and 1% of critical watermarks won't give you a 6%
> > watermark for the ZONE_NORMAL _class_zone but it will give you a 1%
> > watermark instead and you probably wanted a 6% watermark to provide
> > rasonable space for atomic allocations and for having more chances of
> > doing high order allocations.
> 
> So the 1% watermark for ZONE_NORMAL is too low ... fix that.

We just shouldn't need to keep much memory free.

I'd much rather see a scheme in which we have two separate goals for 
the VM.  Goal one would be to keep a certain number of free pages in 
each class, for use by atomic allocations.  Goal two would be to have
a minimum number of pages in each class either free or on a global LRU
list which contains only pages known to be clean and unmapped (and
hence available for instant freeing without IO).

That gives us many advantages:

 * We can split kswapd into two tasks: a kswapd task for swapping, and a
   kreclaimd for freeing pages on the clean LRU.  Even while we are 
   swapping, kreclaimd can kick in to keep atomic allocations happy.

 * We can still keep the free page lists topped up on a per-zone (or
   per-class) basis, but have a global LRU of clean pages by which to
   balance the reclamation of memory between zones if we want it

 * There will be a bigger pool of pages available for reuse at short
   notice, without us having to actually throw away potentially 
   useful data until the time that the memory is actually needed.

Cleaning dirty pages for reuse, and actually freeing those pages, are
already two distinct concepts in our VM.  We ought to make that 
explicit in the free pages LRUs.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 11:24       ` Stephen C. Tweedie
@ 2000-04-26 16:44         ` Linus Torvalds
  2000-04-26 17:13           ` Rik van Riel
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2000-04-26 16:44 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: riel, Andrea Arcangeli, linux-mm

On Wed, 26 Apr 2000, Stephen C. Tweedie wrote:
> 
> We just shouldn't need to keep much memory free.
> 
> I'd much rather see a scheme in which we have two separate goals for 
> the VM.  Goal one would be to keep a certain number of free pages in 
> each class, for use by atomic allocations.  Goal two would be to have
> a minimum number of pages in each class either free or on a global LRU
> list which contains only pages known to be clean and unmapped (and
> hence available for instant freeing without IO).

This would work. However, there is a rather subtle issue with allocating
contiguous chunks of memory - something that is frowned upon, but however
hard we've triedthere has always been people that really need to do it.

And that subtle issue is that in order for the buddy system to work for
contiguous areas, you cannot have "free" pages _outside_ the buddy system.

The reason the buddy system works for contiguous allocations >1 pages is
_not_ simply that it has the data structures to keep track of power-of-
two pages. The bigger reason for why the buddy system works at all is that
it is inherenty anti-fragmenting - whenever there are free pages, the
buddy system coalesces them, and has a very strong bias to returning
already-fragmented areas over contiguous areas on new allocations.

This advantage of the buddy system is also why keeping a "free list" is
not actually necessarily that great of an idea. Because the free list will
make fragmentation much worse by not allowing the coalescing - which in
turn is needed in order to try to keep future allocations from fragmenting
the heap more.

And yes, part of having memory free is to have low latency - oneof the
huge advantages of kswapd is that it allows us to do background freeing so
that the perceived latency to the occasional page allocator is great. And
that is important, and the "almost free" list would work quite well for
that.

However, the contiguous area concern is also a real concern. That iswhy I
want to keep "alloc_page()" and "free_page()" as the main memory
allocators: the buddy system is certainly not the fastest memory allocator
around, but it's so far the only one I've seen that has reasonable
behaviour wrt contiguous areas without excessive overhead.

[ Side comment: maybe somebody remembers the _original_ page allocator in
  Linux. It was based on a very very simple linked list of free pages -
  and it was fast as hell. There is absolutely no allocator that does it
  faster: getting a new page was not just constant time, but it was just a
  few cycles. FAST. The reason I moved to the buddy allocator was that the
  flexibility of being able to allocate two or four pages at a time
  outweighed the speed disadvantage. I'd hate for people to unwittingly
  lose that advantage by just not thinking about these issues.. ]

However, it's certainly true that ourmemory freeing machinery could be
cleaned up a bit, and having the "two phase" thing encoded explicitly in
the page freeing logic might not be a bad idea. I just wanted to point out
some reasons why it might not be all that sensible to count the "easily
freed queue" as real free memory..

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 16:44         ` Linus Torvalds
@ 2000-04-26 17:13           ` Rik van Riel
  2000-04-26 17:24             ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2000-04-26 17:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stephen C. Tweedie, Andrea Arcangeli, linux-mm

On Wed, 26 Apr 2000, Linus Torvalds wrote:
> On Wed, 26 Apr 2000, Stephen C. Tweedie wrote:
> > 
> > We just shouldn't need to keep much memory free.
> > 
> > I'd much rather see a scheme in which we have two separate goals for 
> > the VM.  Goal one would be to keep a certain number of free pages in 
> > each class, for use by atomic allocations.  Goal two would be to have
> > a minimum number of pages in each class either free or on a global LRU
> > list which contains only pages known to be clean and unmapped (and
> > hence available for instant freeing without IO).
> 
> This would work. However, there is a rather subtle issue with allocating
> contiguous chunks of memory - something that is frowned upon, but however
> hard we've triedthere has always been people that really need to do it.
> 
> And that subtle issue is that in order for the buddy system to work for
> contiguous areas, you cannot have "free" pages _outside_ the buddy system.

This is easy to fix. We can keep a fairly large amount (maybe 4
times more than pages_high?) amount of these "free" pages on the
queue. If we are low on contiguous pages, we can bypass the queue
for these pages or scan memory for pages on this queue (marked with
as special flag) and take them from the queue...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 17:13           ` Rik van Riel
@ 2000-04-26 17:24             ` Linus Torvalds
  2000-04-27 13:22               ` Stephen C. Tweedie
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2000-04-26 17:24 UTC (permalink / raw)
  To: riel; +Cc: Stephen C. Tweedie, Andrea Arcangeli, linux-mm

On Wed, 26 Apr 2000, Rik van Riel wrote:
> > 
> > And that subtle issue is that in order for the buddy system to work for
> > contiguous areas, you cannot have "free" pages _outside_ the buddy system.
> 
> This is easy to fix. We can keep a fairly large amount (maybe 4
> times more than pages_high?) amount of these "free" pages on the
> queue. If we are low on contiguous pages, we can bypass the queue
> for these pages or scan memory for pages on this queue (marked with
> as special flag) and take them from the queue...

Note that there are many work-loads that normally have a ton of dirty
pages. Under those kinds of work-loads it is generally hard to keep a lot
of "free" pages around, without just wasting a lot of time flushing them
out all the time.

So I doubt it is "trivial". But it might be somewhere in-between balance,
where you have a heuristic along the lines of "let's try to have enough
'truly free' pages, and if we have lots of 'almost free' pages around we
can somewhat relax the requirements on the 'truly free' page
availability".

The other danger with the 'almost free' pages is that it really is very
load-dependent, and some loads have lots of easily free'd pages. If we
eagerly reap of the 'easily free' component, then that may be extremely
unfair towards one class of users that gets their pages stolen from under
them by another class of users that has less easily freeable pages.. So
fairness may also be an issue.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 17:24             ` Linus Torvalds
@ 2000-04-27 13:22               ` Stephen C. Tweedie
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen C. Tweedie @ 2000-04-27 13:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: riel, Stephen C. Tweedie, Andrea Arcangeli, linux-mm

Hi,

On Wed, Apr 26, 2000 at 10:24:55AM -0700, Linus Torvalds wrote:
> 
> On Wed, 26 Apr 2000, Rik van Riel wrote:
> > > 
> > > And that subtle issue is that in order for the buddy system to work for
> > > contiguous areas, you cannot have "free" pages _outside_ the buddy system.
> > 
> > This is easy to fix. We can keep a fairly large amount (maybe 4
> > times more than pages_high?) amount of these "free" pages on the
> > queue.
> 
> Note that there are many work-loads that normally have a ton of dirty
> pages. Under those kinds of work-loads it is generally hard to keep a lot
> of "free" pages around, without just wasting a lot of time flushing them
> out all the time.

You have an instant win if the second-chance list is protected by an 
interrupt-safe spinlock.  Do that, and you basically don't ever need 
any free pages at all.  An atomic allocation can go throught the second-
chance list freeing pages until either a buddy page of the required
order becomes available, or we exhaust the list.

With a second-chance list of the same size as our current free page
goals, we'd have exactly the same chance as today of finding a high-
order page.  The advantage would be that our current pessimistic 
free-page management would become truly a lazy reclaim mechanism, 
never freeing a page until it is absolutely necessary.

The cost, of course, is a slightly longer latency while allocating 
memory in interrupts: we've moved some of the kswapd work into the
interrupt itself.  The overall system CPU time, however, should be
reduced if we can avoid unnecessarily freeing pages.

> The other danger with the 'almost free' pages is that it really is very
> load-dependent, and some loads have lots of easily free'd pages.

We have that today.  Whether we are populating the free list or the
last-chance list, we're still having to make that judgement.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26  2:10     ` Rik van Riel
  2000-04-26 11:24       ` Stephen C. Tweedie
@ 2000-04-26 14:19       ` Andrea Arcangeli
  2000-04-26 16:52         ` Linus Torvalds
  1 sibling, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2000-04-26 14:19 UTC (permalink / raw)
  To: riel; +Cc: Linus Torvalds, linux-mm

On Tue, 25 Apr 2000, Rik van Riel wrote:

>On Wed, 26 Apr 2000, Andrea Arcangeli wrote:
>> On Tue, 25 Apr 2000, Linus Torvalds wrote:
>> 
>> >On Tue, 25 Apr 2000, Andrea Arcangeli wrote:
>> >> 
>> >> The design I'm using is infact that each zone know about each other, each
>> >> zone have a free_pages and a classzone_free_pages. The additional
>> >> classzone_free_pages gives us the information about the free pages on the
>> >> classzone and it's also inclusve of the free_pages of all the lower zones.
>> >
>> >AND WHAT ABOUT SETUPS WHERE THERE ISNO INCLUSION?
>> 
>> They're simpler. The classzone for them matches with the zone.
>
>It doesn't. Think NUMA.

NUMA is irrelevant. If there's no inclusion the classzone matches with the
zone.

>> clazzone is composed by. That can be achieved by having the list of the
>> zones that compose the classzone in an array allocated into the zone_t. A
>> new zone->classzones[MAX_NR_ZONES+1] will do the trick. If there's no
>> inclusion zones->classzones[0] will be equal to zone and [1] will be NULL.
>> I can do that now if you think we'll _soon_ need that genericity.
>
>This sounds like the current code. The code your patch
>deletes...

What I was talking about with zone->classzones isn't the zonelist. The
zonelist is in the pgdat, the classzones[] have to be in the zone_t
instead.

>> ZONE_NORMAL that spawns between 0 and 2g, but we have to split
>> the GFP_KERNEL allocation place in two zones that are ZONE_DMA
>> and ZONE_NORMAL.
>> 
>> But alloc_pages really have to consider the 0-2g a single zone
>> because alloc_pages gets the semantic of the zone by the user
>> that consider the 0-2g range a single zone.
>
>It does. If you read mm/page_alloc.c::__alloc_pages()
>carefully, you'll see this code fragment which does
>exactly that.
>
>        for (;;) {
>                zone_t *z = *(zone++);
>
>                /* Are we supposed to free memory? Don't make it worse.. */
>                if (!z->zone_wake_kswapd && z->free_pages > z->pages_low) {
>                        struct page *page = rmqueue(z, order);
>                        if (page)
>                                return page;
>                }
>        }

Please read what you quoted above. If the current zone is under the low
watermark we fallback in the following zone. That is obviously wrong. We
must fallback on the following zone _only_ if the current zone is empty.

>> If the current zone based design is not broken, then we can
>> split any zone in as _many_ zones as we _want_ without hurting
>> anything (except some more CPU and memory resources wasted in
>> the zone arrays since there would be more zones).
>
>We can do this just fine. Splitting a box into a dozen more
>zones than what we have currently should work just fine,
>except for (as you say) higher cpu use by kwapd.

For induction split the ZONE_NORMAL in ZONE_NORMAL->size>>PAGE_SHIFT
zones. Each zone have 1 page. The watermark have can be 0 or 1. You have
to set it to 1 _on_all_zones_ to keep the system stable. Now your VM will
try to keep all pages in the ZONE_NORMAL free all the time despite of the
settings of the other one-pages-sized zones that compose the original
ZONE_NORMAL. This behaviour is broken in the same way the 2.3.99-pre6-5 VM
is broken, the difference is _only_ that with lots of zones the broken
behaviour is more biased, and with mere two zones that 98% of machines out
there are using, the broken behavour of the current code may remain well
hided, but that doesn't change the design is fully broken and it have to
be fixed if we want something of stable and that is possible to use on all
possible hardware scenarios.

Right way to fix it is to give relation to the overlapped zones and that's
exactly what I did.

>If I get my balancing patch right, most of that disadvantage
>should be gone as well. Maybe we *do* want to do this on
>bigger SMP boxes so each processor can start out with a
>separate zone and check the other zone later to avoid lock
>contention?

See, if the current design wouldn't be broken we could get per-page
scalability in the allocation by creating all one-page-sized-zones.

>> This is obviously wrong and this also proof the current design
>> is broken.
>
>What's wrong with being able to split memory in arbitrary
>zones without running into any kind of performance trouble?

The point is that you'll run into troubles, kswapd and the VM will run
completly out of control.

>> >I tell you - you are doing the wrong thing. The earlier you realize that,
>> >the better.
>> 
>> Linus, you can convinced me immediatly if you show me how I can
>> make sure that all the ZONE_NORMAL is empty before going to
>> allocate in the ZONE_DMA with the current design.
>
>[snip]
>
>> As second using 5% and 1% of critical watermarks won't give you a 6%
>> watermark for the ZONE_NORMAL _class_zone but it will give you a 1%
>> watermark instead and you probably wanted a 6% watermark to provide
>> rasonable space for atomic allocations and for having more chances of
>> doing high order allocations.
>
>So the 1% watermark for ZONE_NORMAL is too low ... fix that.

Why do you think Linus suggested to lower the watermark of the higher
zones? Answer: because we should not left any page free in the ZONE_NORMAL
before falling back into the ZONE_DMA (assuming the ZONE_DMA is completly
free of course). We are the only ones that are able to use the ZONE_NORMAL
memory, and we should use it _now_ to be friendly with the ZONE_DMA users.

This is the same obviously right principle that is been introduced some
month ago into 2.2.1x. Right now we can do it right with 2.2.1x with a
10liner patch and we fail to do this fully right in 2.3.99-pre6-5.

>> If pages_high would been the 6% and referred to the classzone
>> the VM would have immediatly and correctly shrunk an additional
>> 5% from the freeable ZONE_NORMAL. See?
>
>I see the situation and I don't see any problem with it.
>Could you please explain to us what the problem with this
>situation is?

The VM should shrink the 5% from the ZONE_NORMAL and it doesn't do that
but it keeps trying to free the unfreeable ZONE_DMA without any
success. The VM does the wrong thing in such scenario because is not
aware of the relation between the zones.

>> >I did no tuning at all of the watermarks when I changed the zone
>> >behaviour. My bad. But that does NOT mean that the whole mm behaviour
>> >should then be reverted to the old and broken setup. It only means that
>> >the zone behaviour should be tuned.
>> 
>> If you'll try to be friendly with ZONE_DMA (by allocating from
>> the ZONE_NORMAL when possible) you'll make the allocation from
>> the higher zones less reliable and you could end doing a
>> differently kind of wrong thing as explained a few lines above.
>
>What's wrong with this?  We obviously need to set the limits
>for ZONE_NORMAL to such a number that it's possible to do
>higher-order allocations. That is no change from your proposal

Then we'll end having ZONE_DMA->pages_high+ZONE_NORMAL->pages_high free
almost all the time while we would only need pages_high memory free in the
ZONE_NORMAL _class_zone. That's memory wasted that we should be able to
use for production instead.

>and just means that your 1% example value is probably not
>feasible. Then again, that's just an example value and has
>absolutely nothing to do with the design principles of the
>current code.
>
>> >Now, the way =I= propose that this [NUMA] be handled is:
>> >
>> > - each cluster has its own zone list: cluster 0 has the list 0, 1, 2, 3,
>> >   while cluster 1 has the list 1, 2, 3, 0 etc. In short, each of them
>> >   preferentially allocate from their _own_ cluster. Together with
>> >   cluster-affinity for the processes, this way we can naturally (and with
>> >   no artificial code) try to keep cross-cluster memory accesses to
>> >   a minimum.
>> 
>> That have to be done in alloc_pages_node that later will fallback into the
>> alloc_pages. This problematic is not relevant for alloc_pages IMVHO.
>> 
>> In your scenario you'll have only 1 zone per node and there's never been
>> any problem with only one zone per node since classzone design is
>> completly equal to zone design in such case.
>> 
>> All zones in a single node by design have to be all near or far in the
>> same way to all the cpus since they all belong to the same node.
>
>You may want to do a s/cluster/node/ in Linus' paragraph and

thanks, good hint, I didn't understood well this last night, sorry.

>try again, if that makes things more obvious. What you are
>saying makes absolutely no sense at all to anybody who knows
>how NUMA works. You may want to get a book on computer architecture

Now I understood, (sorry for having missed that previously). What Linus
proposed was:

	node0->zone_zonelist[2] == node1->zone_zonelist[1] == node2->zone_zonelist[0]

That's dirty design IMVHO since it cause zones outside the node to be
referenced by the node itself.

What's wrong in putting the code that does the falling back between nodes
before starting page-freeing on them into the alloc_pages_node API? This
way a pgdat only includes stuff inside the node and gets not mixed with
the stuff from the other nodes.

>(or more sleep, if that was the problem here .. I guess we all have
>that every once in a while).
>
>> >And note - the above works _now_ with the current code. And it will never
>> >ever work with your approach. Because zone 0 is not a subset of zone 1,
>> >they are both equal. It's just that there is a preferential order for
>> >allocation depending on who does the allocation.
>> 
>> IMVHO you put the NUMA stuff in the wrong domain by abusing the zones
>> instead of using the proper nodes.
>
>I'm sorry to say this, but you don't seem to understand how NUMA
>works... We're talking about preferential memory allocation here

The way Linus proposed to do NUMA is not how NUMA works(tm), but it's one
interesting possible implementation (and yes, you are right last night I
didn't understood what Linus proposed, thanks for the clarification).

My only point is an implementation issue and is that the falling back
between nodes should be done in the higher layer (the layer where we also
ask to the interface which node we prefer to allocate from).

>and IMHO the ONLY place to do memory allocation is in the memory
>allocator.
>
>> >The fact that you think that this is true obviously means that you haven't
>> >thought AT ALL about the ways to just make the watermarks work that way.
>> 
>> I don't see how playing with the watermarks can help.
>> 
>> Playing with the watermarks is been my first idea but I
>> discarded it immediatly when I seen what would be happened by
>> lowering the higher zones watermarks. And also the watermark
>> strict per zone doesn't make any sense in first place. You can't
>> say how much you should free from ZONE_NORMAL if you don't know
>> the state of the lower ZONE_DMA zone.
>
>If there is a lot of free memory in ZONE_DMA, the memory
>allocator will do the next allocations there, relieving the

And who will stop kswapd from continuing to free the other zones? How can
you know how much a page is critical for allocations only looking
page->zone->pages_high and page->zone->free_pages? You _can't_, and any
heuristic that will consider a page critical if "page->zone->pages_high"
is major than "page->zone->free_pages" will be flawed by design and it
will end doing the wrong thing eventually.

>other zones from memory pressure and not freeing pages there
>at all. This seems to be _exactly_ what is needed and what
>your classzones achieve at a much higher complexity and lower
>flexibility...

The additional complexity is necessary to keep the overlapped zones in
relation. The flexibility point can be achieved instead. Since _none_
memory allocation in 2.3.99-pre6-5 needed further flexibility I avoided to
add such stuff in my patch ;).

>> >Give it five seconds, and you'll see that the above is _exactly_ what the
>> >watermarks control, and _exactly_ by making all the flags per-zone
>> >(instead of global or per-class). And by just using the watermark
>> 
>> IMVO we used the wrong term since the first place. Kanoj wrote a function
>> called classzone at 2.3.[345]? time, and so I inherit his 'classzone' name
>> in my emails. But what I did is not really a classzone design, but I only
>> reconstructed the real _zone_ where GFP_KERNEL wants to allocate from. In
>> 2.3.99-pre6-5 the monolithic zone where GFP_KERNEL wants to allocate from,
>> it's been broken in two unrelated, disconnected pieces. I only re-joined
>> it in once piece since it's a _single_ zone.
>
>Which doesn't make any sense at all since the allocator will do the

What does not make sense to you? Don't you believe the join between
ZONE_DMA and ZONE_NORMAL is really a single zone?

Ask yourself what do_anonymous_pages wants to allocate in a write fault.
do_anonymous_pages want to allocate a page that cames from the zone that
corresponds to the whole memory available. The zone is "0-end_of_memory".
It's 1 zone. It's 1 zone even if such zone is overlapped with lower zones.

On and on some machine it's also not overlapped and a single zone_t have
included all the memory there.

Think, think, on a machine where there's no braindamage with overlapped
zones and so where you have 1 zone from 0 to end_of_memory like here:

Apr 26 15:55:40 alpha kernel: NonDMA: 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB)
Apr 26 15:55:40 alpha kernel: DMA: 1*8kB 1*16kB 10*32kB 3*64kB 3*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 102*4096kB = 421272kB)
Apr 26 15:55:40 alpha kernel: BIGMEM: 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB)

the VM would scale worse than on x86 with the 2.3.99-pre6-5 design due the
zone->lock (there's only one zone, so there's only one lock instead of
two/three locks).

So with the current design the more you have overlapped zones the better
you scale. This doesn't make any sense, and you _have_ to pay having
subtle and unexpected drawbacks elsewhere.

And that drawbacks are exactly the scenarios I'm describing and that I'm
trying to put at the light of your eyes in my last 3/4 emails.

And yes, if you were right the current design is not broken, then I would
also be stupid to not split my 512mbyte wide ZONE_DMA in nr_cpus zones and
to let each CPU to prefer to allocate from a different zones before
falling back on the other zones.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 14:19       ` Andrea Arcangeli
@ 2000-04-26 16:52         ` Linus Torvalds
  2000-04-26 17:49           ` Andrea Arcangeli
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2000-04-26 16:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: riel, linux-mm

On Wed, 26 Apr 2000, Andrea Arcangeli wrote:
> 
> NUMA is irrelevant. If there's no inclusion the classzone matches with the
> zone.

But then all your arguments evaporate.

If you argue that memory balancing should work even in the instance where
the classzone has degenerated into a single zone, then I'll just say "why
have the classzone concept at all, then?".

Which is exactly what I'm saying.

I think we should have zones. Not classzones. And we should have
"zonelists", but those would not be first-class data structures, they'd
just be lists of zones that are acceptable for an allocation.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 2.3.x mem balancing
  2000-04-26 16:52         ` Linus Torvalds
@ 2000-04-26 17:49           ` Andrea Arcangeli
  0 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2000-04-26 17:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: riel, linux-mm

On Wed, 26 Apr 2000, Linus Torvalds wrote:

>On Wed, 26 Apr 2000, Andrea Arcangeli wrote:
>> 
>> NUMA is irrelevant. If there's no inclusion the classzone matches with the
>> zone.
>
>But then all your arguments evaporate.
>
>If you argue that memory balancing should work even in the instance where
>the classzone has degenerated into a single zone, [..]

Yes, I argue this otherwise my alpha box would not run stable anymore ;).

>[..] then I'll just say "why
>have the classzone concept at all, then?".

Because it's necessary to handle correctly the other case: setups where we
have to handle overlapped zones.

Note that the ZONE_DMA is classzone composed by one single zone too and of
course memory balancing have to work correctly with ZONE_DMA too.

>I think we should have zones. Not classzones. And we should have
>"zonelists", but those would not be first-class data structures, they'd
>just be lists of zones that are acceptable for an allocation.

My only problem is that I don't see how to solve the subtle drawbacks
elecated in my previous emails by keeping the strict zone based approch
and without considering the other zone_t that compose the real zone
(classzone) that we want to allocate from.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2000-04-27 13:22 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-04-26 16:03 2.3.x mem balancing Mark_H_Johnson.RTS
2000-04-26 17:06 ` Andrea Arcangeli
2000-04-26 17:36   ` Kanoj Sarcar
2000-04-26 21:58     ` Andrea Arcangeli
2000-04-26 17:43 ` Kanoj Sarcar
  -- strict thread matches above, loose matches on Subject: below --
2000-04-26 19:06 frankeh
     [not found] <Pine.LNX.4.21.0004250401520.4898-100000@alpha.random>
2000-04-25 16:57 ` Linus Torvalds
2000-04-25 17:50   ` Rik van Riel
2000-04-25 18:11     ` Jeff Garzik
2000-04-25 18:33       ` Rik van Riel
2000-04-25 18:53     ` Linus Torvalds
2000-04-25 19:27       ` Rik van Riel
2000-04-26  0:26         ` Linus Torvalds
2000-04-26  1:19           ` Rik van Riel
2000-04-26  1:07   ` Andrea Arcangeli
2000-04-26  2:10     ` Rik van Riel
2000-04-26 11:24       ` Stephen C. Tweedie
2000-04-26 16:44         ` Linus Torvalds
2000-04-26 17:13           ` Rik van Riel
2000-04-26 17:24             ` Linus Torvalds
2000-04-27 13:22               ` Stephen C. Tweedie
2000-04-26 14:19       ` Andrea Arcangeli
2000-04-26 16:52         ` Linus Torvalds
2000-04-26 17:49           ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox