From mboxrd@z Thu Jan 1 00:00:00 1970
From: frankeh@us.ibm.com
Message-ID: <852568D3.00670E44.00@D51MTA07.pok.ibm.com>
Date: Tue, 2 May 2000 14:46:14 -0400
Subject: Re: kswapd @ 60-80% CPU during heavy HD i/o.
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
Return-Path:
To: riel@nl.linux.org
Cc: linux-mm@kvack.org
List-ID:

Hi, Rik...

Rik van Riel on 05/02/2000 02:15:18 PM

Please respond to riel@nl.linux.org

To:      Hubertus Franke/Watson/IBM@IBMUS
cc:      Andrea Arcangeli, Roger Larsson, linux-kernel@vger.rutgers.edu,
         linux-mm@kvack.org
Subject: Re: kswapd @ 60-80% CPU during heavy HD i/o.

On Tue, 2 May 2000 frankeh@us.ibm.com wrote:

> It makes sense to me to make the number of pools configurable
> and not tie them directly to the number of nodes in a NUMA
> system. In particular, allow memory pools (i.e. instances of
> pg_data_t) to be smaller than a full node.

*nod*  We should have different memory zones per node on
Intel-handi^Wequipped NUMA machines.

Wouldn't that be orthogonal? Anyway, I believe x86 NUMA machines will
exist in the future, so I am not ready to trash them right now, whether
I like their architecture or not.

> The smart thing that I see has to happen is to allow a set of processes
> to be attached to a set of memory pools and the OS basically enforcing
> allocation within those constraints. I brought this up before and I
> think Andrea proposed something similar. Allocation should take place
> in those pools along the allocation levels based on GFP_MASK, so first
> allocate on HIGH along all specified pools and, if unsuccessful, fall
> back to a previous level.

That idea is broken if you don't do balancing of VM load
between zones.

> With each pool we should associate a kswapd.

How will local page replacement help you if the node next
door has practically unloaded virtual memory? You need to
do global page replacement of some sort...

You wouldn't balance a zone until you have checked the same level (e.g.
HIGHMEM) on all the specified nodes. Then, and only then, do you fall
back. So we aren't doing any local page replacement unless I cannot
satisfy a page request within the given resource set.

That means something along the lines of the following pseudo code:

    forall zonelevel
        forall node in resource set
            zone = pgdat[node].zones[zonelevel];
            if (zone->free_pages > threshold) {
                alloc_page;
                return;
            }
    set kswapd_required flag (kick);
    balance zones;   // couldn't allocate a page in the desired
                     // resource set, so start balancing

Now balancing zones kicks the kswapds or helps out... Global balancing
can take place by servicing the pg_data_t with the highest number of
kicks...

I think it is ok to have pools with unused memory lying around if a
particular resource set does not include those pools. How else are you
planning to control locality and affinity within memory other than by
using resource sets?

We take the same approach in the kernel, for instance we
have a minimum file cache size, because we know that we
can increase throughput by doing so.
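Spelled out a little more concretely, the allocation walk in the pseudo
code above might look like the C sketch below. This is only an
illustration of the fallback order being described: mem_zone, mem_pool,
resource_set, grab_page() and kick_kswapd() are invented names, not the
actual kernel data structures or interfaces.

    /*
     * Illustration only: mem_zone, mem_pool, resource_set, grab_page()
     * and kick_kswapd() are invented names, not real kernel interfaces.
     */
    #include <stddef.h>

    #define MAX_LEVELS 3                /* e.g. HIGHMEM, NORMAL, DMA */
    #define MAX_POOLS  8

    struct mem_zone {
        unsigned long free_pages;
        unsigned long threshold;    /* don't dip below this w/o balancing */
    };

    struct mem_pool {                   /* roughly one pg_data_t */
        struct mem_zone zones[MAX_LEVELS];
        unsigned long   kicks;          /* how often this pool ran dry */
    };

    struct resource_set {               /* the pools this process may use */
        int              nr_pools;
        struct mem_pool *pool[MAX_POOLS];
    };

    struct page;                                           /* opaque here */
    struct page *grab_page(struct mem_pool *p, int level); /* hypothetical */
    void kick_kswapd(struct resource_set *rs);             /* hypothetical */

    struct page *alloc_page_in_set(struct resource_set *rs, int top_level)
    {
        int level, i;

        /* Try every pool in the set at the preferred zone level before
         * falling back to the next (lower) level. */
        for (level = top_level; level >= 0; level--) {
            for (i = 0; i < rs->nr_pools; i++) {
                struct mem_zone *z = &rs->pool[i]->zones[level];

                if (z->free_pages > z->threshold)
                    return grab_page(rs->pool[i], level);
            }
        }

        /* The whole resource set is short on memory: record a kick on
         * each pool so a balancer can pick the neediest one, then wake
         * the balancing code for this set only. */
        for (i = 0; i < rs->nr_pools; i++)
            rs->pool[i]->kicks++;

        kick_kswapd(rs);
        return NULL;                    /* caller retries or sleeps */
    }

The point is just the ordering: every pool in the resource set is tried
at each zone level before falling back a level, and balancing is only
triggered once the whole set has run dry.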
> Making the size of the pools configurable allows us to control the
> rate at which we can swap out. Standard queueing theory: if we can't
> get the desired throughput, then increase the number of servers, here
> kswapd.

What we _could_ do is have one (or maybe even a few) kswapds doing
global replacement with io-less and more fine-grained swap_out() and
shrink_mmap() functions, and per-node kswapds taking care of the IO and
maybe even a per-node inactive list (though that would probably be
*bad* for page replacement).

That is workable...

Then again, if your machine can't get the desired throughput, how would
adding kswapds help??? Have you taken a look at
mm/page_alloc.c::alloc_pages()? If kswapd can't keep up, the biggest
memory consumers will lend a hand and prevent the rest of the system
from thrashing too much.

Correct... However, having finer-grained pools also allows you to deal
with potential lock contention, which is one of the biggest impediments
to scaling up. NUMA machines are characterized by large memories and
large numbers of CPUs. This implies that there will be increased lock
contention, for instance on the lock that protects the memory pool.
Increased lock contention can also arise from increased lock hold time,
which I assume is somewhat related to the size of the memory. So
reducing lock hold time by limiting the number of pages managed per
pool could remove an emerging bottleneck (a rough sketch of per-pool
locking follows at the very end of this message).

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/        http://www.surriel.com/

regards... Hubertus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/
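To make the lock-contention argument a bit more concrete, here is a
minimal sketch of per-pool locking; mem_pool, pool_lock and
pool_get_page() are again invented names, not the kernel's real
allocator or locking scheme. The idea is simply that CPUs allocating
from different pools never touch each other's lock, and each critical
section covers fewer pages, so lock hold times shrink as well.

    /*
     * Illustration only: mem_pool, pool_lock and pool_get_page() are
     * invented names, not the kernel's real allocator or locking code.
     */
    #include <stddef.h>

    struct pool_lock { volatile int locked; };     /* stand-in spinlock   */
    void pool_lock_acquire(struct pool_lock *l);   /* assume the usual    */
    void pool_lock_release(struct pool_lock *l);   /*   spinlock behavior */

    struct page_frame { struct page_frame *next; };

    struct mem_pool {
        struct pool_lock   lock;       /* protects only this pool's pages */
        struct page_frame *free_list;  /* fewer pages per pool means      */
        unsigned long      nr_free;    /*   less work under the lock      */
    };

    /*
     * With one global pool every CPU serializes on a single lock.  With
     * N smaller pools, CPUs working out of different pools never
     * contend, and each critical section is shorter because each pool
     * manages fewer pages.
     */
    struct page_frame *pool_get_page(struct mem_pool *pool)
    {
        struct page_frame *pf = NULL;

        pool_lock_acquire(&pool->lock);
        if (pool->free_list) {
            pf = pool->free_list;
            pool->free_list = pf->next;
            pool->nr_free--;
        }
        pool_lock_release(&pool->lock);
        return pf;
    }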