Date: Wed, 21 Nov 2007 22:54:03 +0000
From: Mel Gorman
To: Christoph Lameter
Cc: Andrew Morton, linux-mm@kvack.org, apw@shadowen.org, Martin Bligh
Subject: Re: [PATCH] Page allocator: Get rid of the list of cold pages
Message-ID: <20071121225403.GD31674@csn.ul.ie>
References: <20071115162706.4b9b9e2a.akpm@linux-foundation.org> <20071121222059.GC31674@csn.ul.ie>

On (21/11/07 14:28), Christoph Lameter didst pronounce:
> On Wed, 21 Nov 2007, Mel Gorman wrote:
>
> > 1. In general, the split lists are faster than the combined list
> > 2. Disabling Per-CPU has comparable performance to having the lists
>
> That is only true for the single threaded case (actually I am measuring a
> slight performance benefit if I switch them off). If you have multiple
> processes allocating from the same zone then you can get the zone locks
> hot.

Um, I thought I went through this, but to be clear: I didn't just test
single-threaded, and you will see that the test C program forks children
to do the work (a minimal sketch of the harness shape is at the end of
this mail). The 1instance graphs are a single process doing the work;
the 4instance graphs are 4 processes simultaneously doing the work (1
per CPU), and they showed comparable performance of split lists vs
no-PCP lists. Each process is also bound to one CPU in an effort to
maximise the use of the PCPU lists. There was some evidence this was
beginning to change when 12 instances (3 per CPU) were running, but I
hadn't set up the test to run with more.

> That was the reason for the recent regression in SLUB. The networking
> layer went from an order 0 alloc to order 1. Zonelock contention then
> dropped performance by 50% on an 8p! The potential for lock contention is
> higher the more processors per node are involved. So you are not going to
> see this as high on a standard NUMA config with 2p per node.
>

Ok. I've queued the test to re-run on a 16-way non-NUMA x86_64 machine
and an 8-way 2-node NUMA PPC64 machine. I haven't worked on these
machines before, but hopefully the tests will run to completion.

> The main point at this juncture of the pcp lists seems to be avoiding
> zone lock contention!

I get that. I was surprised by the results too, and it leads me to
wonder whether the zone lock is being avoided elsewhere (quicklists or
slab per-cpu lists, maybe) or whether there is a filesystem lock so big
that it doesn't matter what the PCPU allocator is doing. I don't have
other profile data available.

> The overhead of extracting a page from the buddy
> lists is not such a problem.
>

Ok, the higher-CPU machines may show the zone-lock contention. It could
also be the case that file extend/truncate is not the right workload for
these measurements. Read the code and see what you think.

> > single-pcplist-batch8: This is Christoph's patch with pcp->high == 8*batch
> >     as suggested by Martin Bligh (I agreed with him that keeping lists
> >     the same size made sense)
>
> Ack.
>
> I have not had a look at the details of your performance measurements yet.
> More later.
>

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
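
For reference, here is a minimal sketch of the shape of the harness
described above: fork one worker per instance and bind each worker to a
CPU with sched_setaffinity(). This is not the actual test program; the
worker body (dirtying pagecache and then truncating it away), the file
path and the sizes are assumptions standing in for the real file
extend/truncate workload.

/*
 * Minimal sketch of the harness shape, not the actual test program.
 * Fork one worker per instance, pin each to a CPU, and have it churn
 * page allocations via a pagecache extend/truncate loop.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define ITERATIONS	1000
#define EXTEND_SIZE	(4 * 1024 * 1024)

static char buf[EXTEND_SIZE];

static void worker(int cpu)
{
	cpu_set_t mask;
	char path[64];
	int fd, i;

	/* Pin the child to one CPU to maximise reuse of that CPU's PCP lists */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		exit(1);
	}

	snprintf(path, sizeof(path), "/tmp/pcptest-%d", getpid());
	fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	unlink(path);	/* file disappears when the fd is closed */

	/* Extend the file with dirty pagecache, then truncate it away,
	 * so every iteration allocates and frees a burst of pages. */
	for (i = 0; i < ITERATIONS; i++) {
		if (pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf) ||
		    ftruncate(fd, 0) < 0) {
			perror("extend/truncate");
			exit(1);
		}
	}
	close(fd);
	exit(0);
}

int main(int argc, char **argv)
{
	int instances = argc > 1 ? atoi(argv[1]) : 1;
	int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	pid_t pid;
	int i;

	memset(buf, 0xa5, sizeof(buf));

	/* e.g. "./harness 4" on a 4-CPU box is one worker per CPU,
	 * "./harness 12" is three per CPU */
	for (i = 0; i < instances; i++) {
		pid = fork();
		if (pid == 0)
			worker(i % ncpus);
		if (pid < 0) {
			perror("fork");
			exit(1);
		}
	}

	while (wait(NULL) > 0)
		;
	return 0;
}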
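
On the 8*batch choice quoted above: as I read the unpatched
setup_pageset(), the hot list is capped at 6*batch and the cold list at
2*batch, so a single merged list capped at 8*batch keeps the same total
capacity, which is the "same size" argument. A sketch of where the
tuning sits, modelled on the 2.6-era mm/page_alloc.c; the exact code in
Christoph's patch may differ:

/*
 * Minimal sketch, not Christoph's actual patch: the merged-list
 * version of setup_pageset() with the 8*batch cap. Field names
 * follow the 2.6-era struct per_cpu_pages.
 */
static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
	struct per_cpu_pages *pcp = &p->pcp;

	memset(p, 0, sizeof(*p));

	pcp->count = 0;
	pcp->high = 8 * batch;	/* was 6*batch (hot) + 2*batch (cold) */
	pcp->batch = max(1UL, batch);
	INIT_LIST_HEAD(&pcp->list);
}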