Date: Wed, 21 Nov 2007 18:20:49 -0800 (PST)
From: Christoph Lameter
Subject: Re: [PATCH] Page allocator: Get rid of the list of cold pages
In-Reply-To: <20071122014455.GH31674@csn.ul.ie>
Message-ID:
References: <20071115162706.4b9b9e2a.akpm@linux-foundation.org>
 <20071121222059.GC31674@csn.ul.ie> <20071121230041.GE31674@csn.ul.ie>
 <20071121235849.GG31674@csn.ul.ie> <20071122014455.GH31674@csn.ul.ie>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Mel Gorman
Cc: Andrew Morton, linux-mm@kvack.org, apw@shadowen.org, Martin Bligh
List-ID:

On Thu, 22 Nov 2007, Mel Gorman wrote:

> And the results were better as well. Running one instance per-CPU, the
> joined lists ignoring temperature was marginally faster than no-PCPU or
> the hotcold-PCPU up to 0.5MB, which roughly corresponds to the sum of the
> L1 caches of the CPUs. At higher sizes, it starts to look slower but even
> at 8MB files, it is by a much smaller amount. With list manipulations,
> it is about 0.3 seconds slower. With just the lists joined, it's 0.1
> seconds, and I think the patch could simplify the paths more than what we
> have currently. The full graph is at

Hmmm... This sounds like we could improve the situation by just having
single linked lists? The update effort is then much less. (A rough sketch
of what I mean is at the end of this mail.)

Here is a matrix of page allocator performance (2.6.24-rc2 with sparsemem)
done with my page allocator test from

http://git.kernel.org/?p=linux/kernel/git/christoph/slab.git;a=log;h=tests

All tests are in cycles.

Single thread testing
=====================

1. Repeatedly allocate then free test

1000 times alloc_page(,0)  ->   616 cycles   __free_pages(,0)  ->   295 cycles
1000 times alloc_page(,1)  ->   576 cycles   __free_pages(,1)  ->   341 cycles
1000 times alloc_page(,2)  ->   712 cycles   __free_pages(,2)  ->   380 cycles
1000 times alloc_page(,3)  ->   966 cycles   __free_pages(,3)  ->   467 cycles
1000 times alloc_page(,4)  ->  1435 cycles   __free_pages(,4)  ->   662 cycles
1000 times alloc_page(,5)  ->  2201 cycles   __free_pages(,5)  ->  1044 cycles
1000 times alloc_page(,6)  ->  3770 cycles   __free_pages(,6)  ->  2550 cycles
1000 times alloc_page(,7)  ->  6781 cycles   __free_pages(,7)  ->  7652 cycles
1000 times alloc_page(,8)  -> 13592 cycles   __free_pages(,8)  -> 17999 cycles
1000 times alloc_page(,9)  -> 27970 cycles   __free_pages(,9)  -> 36335 cycles
1000 times alloc_page(,10) -> 58586 cycles   __free_pages(,10) -> 72323 cycles

2. alloc/free test

1000 times alloc( ,0)/free  ->   349 cycles
1000 times alloc( ,1)/free  ->   531 cycles
1000 times alloc( ,2)/free  ->   571 cycles
1000 times alloc( ,3)/free  ->   663 cycles
1000 times alloc( ,4)/free  ->   853 cycles
1000 times alloc( ,5)/free  ->  1220 cycles
1000 times alloc( ,6)/free  ->  2092 cycles
1000 times alloc( ,7)/free  ->  3640 cycles
1000 times alloc( ,8)/free  ->  6524 cycles
1000 times alloc( ,9)/free  -> 12421 cycles
1000 times alloc( ,10)/free -> 30197 cycles

This shows that order 1 allocations, which bypass the pcp lists, are
actually faster! We save the overhead of extracting pages from the buddy
lists and putting them into the pcp. The alloc/free test shows that the
pcp lists are effective when cache hot.
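For readers not following the tests branch, the two single-thread numbers
above come out of loops roughly like the ones below. This is only a sketch
of how I understand the harness, not the actual code from the git tree:
cycles() is a stand-in for whatever TSC-based timing the real test uses,
and NR_TEST_PAGES, alloc_then_free_test() and alloc_free_test() are
made-up names (error handling omitted).

/*
 * Sketch of the two single-thread tests above; not the code from the
 * tests branch. cycles() is a stand-in for a TSC read.
 */
#define NR_TEST_PAGES 1000

static struct page *test_pages[NR_TEST_PAGES];

static void alloc_then_free_test(unsigned int order)
{
	unsigned long start, alloc_cycles, free_cycles;
	int i;

	/* Test 1: allocate 1000 pages of the given order, then free them all */
	start = cycles();
	for (i = 0; i < NR_TEST_PAGES; i++)
		test_pages[i] = alloc_pages(GFP_KERNEL, order);
	alloc_cycles = (cycles() - start) / NR_TEST_PAGES;

	start = cycles();
	for (i = 0; i < NR_TEST_PAGES; i++)
		__free_pages(test_pages[i], order);
	free_cycles = (cycles() - start) / NR_TEST_PAGES;

	printk("order %u: alloc %lu cycles, free %lu cycles\n",
	       order, alloc_cycles, free_cycles);
}

static void alloc_free_test(unsigned int order)
{
	unsigned long start;
	int i;

	/* Test 2: allocate and immediately free, so everything stays cache hot */
	start = cycles();
	for (i = 0; i < NR_TEST_PAGES; i++)
		__free_pages(alloc_pages(GFP_KERNEL, order), order);
	printk("order %u: alloc/free %lu cycles\n",
	       order, (cycles() - start) / NR_TEST_PAGES);
}

The first loop has to get every page from somewhere new (with the pcp
refilled from the buddy lists in batches), while the second loop keeps
recycling the same few cache hot pages, which is the case the pcp lists
are built for.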
Concurrent allocs
=================

Page alloc N*alloc N*free(0): 0=8266/8635 1=9667/8129 2=8501/8585 3=9485/8129 4=7870/8635 5=9761/7957 6=7687/8456 7=9749/7681 Average=8873/8276
Page alloc N*alloc N*free(1): 0=28917/22006 1=30057/26753 2=28930/23925 3=30099/26779 4=28845/23717 5=30166/26733 6=28250/23744 7=30149/26677 Average=29427/25042
Page alloc N*alloc N*free(2): 0=25316/23430 1=28749/26527 2=24858/22929 3=28804/26636 4=24871/23368 5=28496/26621 6=25188/22057 7=28730/26228 Average=26877/24725
Page alloc N*alloc N*free(3): 0=22414/23618 1=26397/27478 2=22359/24237 3=26413/27060 4=22328/24021 5=26098/27879 6=22391/23731 7=26322/27802 Average=24340/25728
Page alloc N*alloc N*free(4): 0=24922/26358 1=28126/30480 2=24733/26177 3=28267/30540 4=25016/25688 5=28150/30563 6=24938/24902 7=28247/30650 Average=26550/28170
Page alloc N*alloc N*free(5): 0=25211/27315 1=29504/32577 2=25796/27681 3=29565/32272 4=26056/26588 5=29471/32728 6=25967/26619 7=29447/32744 Average=27627/29816

The difference between order 0 and order 1 shows that the pcp lists are
effective at reducing zone lru lock overhead. The difference is about a
factor of 3 at 8p.

----Fastpath----

Page N*(alloc free)(0): 0=363 1=360 2=379 3=363 4=362 5=363 6=363 7=360 Average=364
Page N*(alloc free)(1): 0=41014 1=44448 2=40416 3=44367 4=40980 5=44411 6=40760 7=44265 Average=42583
Page N*(alloc free)(2): 0=42686 1=45588 2=42202 3=45509 4=42733 5=45561 6=42716 7=45485 Average=44060
Page N*(alloc free)(3): 0=40567 1=43556 2=39699 3=43404 4=40435 5=43274 6=39614 7=43545 Average=41762
Page N*(alloc free)(4): 0=43310 1=45097 2=43326 3=45405 4=43219 5=45372 6=42492 7=45378 Average=44200
Page N*(alloc free)(5): 0=42765 1=45370 2=42029 3=44979 4=42567 5=45336 6=42929 7=45016 Average=43874

This is just allocating and freeing the same page all the time. Here the
pcps are orders of magnitude faster.
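To make the single linked list idea from the top of this mail a bit more
concrete, something along these lines is what I have in mind. Just a
sketch, not a patch: pcp_single_list, pcp_push() and pcp_pop() are made-up
names, and chaining order-0 pages through page->lru.next is only one
possible way to do the linking.

/*
 * Sketch only: a singly linked per-cpu freelist for order-0 pages.
 * pcp_push()/pcp_pop() are invented names; reusing page->lru.next as
 * the link pointer is just one possibility.
 */
struct pcp_single_list {
	struct list_head *head;		/* first free page, via page->lru.next */
	int count;
};

static inline void pcp_push(struct pcp_single_list *l, struct page *page)
{
	/* two pointer stores instead of the four that list_add() does */
	page->lru.next = l->head;
	l->head = &page->lru;
	l->count++;
}

static inline struct page *pcp_pop(struct pcp_single_list *l)
{
	struct page *page;

	if (!l->head)
		return NULL;
	page = list_entry(l->head, struct page, lru);
	l->head = l->head->next;
	l->count--;
	return page;
}

Freeing a page then becomes a single push and allocating one a single pop;
since the list is never walked backwards, the prev pointer maintenance
that list_add()/list_del() do on the hot path simply goes away.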