From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Wed, 26 Apr 2000 19:06:57 +0200 (CEST) From: Andrea Arcangeli Subject: Re: 2.3.x mem balancing In-Reply-To: <852568CD.0057D4FC.00@raylex-gh01.eo.ray.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org Return-Path: To: Mark_H_Johnson.RTS@raytheon.com Cc: linux-mm@kvack.org, riel@nl.linux.org, Linus Torvalds List-ID: On Wed, 26 Apr 2000 Mark_H_Johnson.RTS@raytheon.com wrote: >In the context of "memory balancing" - all processors and all memory is NOT >equal in a NUMA system. To get the best performance from the hardware, you >prefer to put "all" of the memory for each process into a single memory unit - >then run that process from a processor "near" that memory unit. This seemingly The classzone approch (aka overlapped zones approch) is irrelevant with NUMA problematics as far I can tell. NUMA is a problematic that belongs outside the pg_data_t. It doesn't matter how we restructure the internal of the zone_t. I only changed the internal structure of one node. Not at all how to policy the allocations and the balance between different nodes (that decisions have to live in the linux/arch/ tree and not in __alloc_pages). On NUMA hardware you have only one zone per node since nobody uses ISA-DMA on such machines and you have PCI64 or you can use the PCI-DMA sg for PCI32. So on NUMA hardware you are going to have only one zone per node (at least this was the setup of the NUMA machine I was playing with). So you don't mind at all about classzone/zone. Classzone and zone are the same thing in such a setup, they both are the plain ZONE_DMA zone_t. Finished. Said that you don't care anymore about the changes of how the overlapped zones are handled since you don't have overlapped zones in first place. Now on NUMA when you want to allocate memory you have to use alloc_pages_node so that you can tell also which node allocate from. Here Linus was proposing of making alloc_pages_node this way: alloc_pages_node(nid, gfpmask, order) { zonelist_t ** zonelist = nid2zonelist(nid, gfpmask); __alloc_pages(zonelist, order); } and then having the automatic falling back between nodes and numa memory balancing handled by __alloc_pages and by the current 2.3.99-pre6-5 zonelist falling back trick. I care to explain why I think that's not the right approch for handling NUMA allocations and balancing decisions. As first it's clear that the above described NUMA approch is abusing zonelist_t by looking the size of the zonelist_t structure: typedef struct zonelist_struct { zone_t * zones [MAX_NR_ZONES+1]; // NULL delimited int gfp_mask; } zonelist_t; If zonelist was designed for NUMA it should be something like: typedef struct zonelist_struct { zone_t * zones [max(MAX_NR_ZONES*MAX_NR_NODES)+1]; // NULL delimited int gfp_mask; } zonelist_t; however we can fix that easily by enlarging the zones array in the zonelist. and as second the zonelist-NUMA solution isn't enough flexible since if there's lots of cache allocate in one node we may prefer to move or shrink the cache than to allocate mapped areas of the same task in different nodes (as the __alloc_pages would do). With the zonelist_t abused to do NUMA we _don't_ have flexibility. If you move the NUMA balancing and node selection into the higher layer as I was proposing, instead you can do clever things. And as soon as you move the decisions at the higher layer you don't mind anymore about the node internals. Or better you only care to be able to find the current life-state of a node and you of course can do that. Then once you know the state of the interesting node you can do the decision of what to do _still_ at the highlever layer. At the highlevel layer you can see that the node is filled with 90% of cache, and then you can say: ok allocate from this node anyway and let it to shrink some cache if necessary. Then the lower layer (__alloc_pages) will do automatically the balancing and it will try to allocate memory from such node. You can also grab the per-node spinlock in the highlever layer before checking the state of the node so that you'll know the state of the node won't change from under you while doing the allocation. >These are issues that need to be addressed if you expect to use this I always tried to keep these issues in mind (also before writing the classzone approch). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/