Date: Mon, 14 Jan 2008 11:24:02 +0000
From: Mel Gorman
To: Christoph Lameter
Cc: akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [PATCH] Eliminate the hot/cold distinction in the page allocator
Message-ID: <20080114112401.GA32446@csn.ul.ie>

On (10/01/08 20:13), Christoph Lameter didst pronounce:
> This is on top of the patch that adds cold pages to the end of the pcp
> list. It drops all the distinctions between hot and cold pages which
> improves performance. See the discussion and the tests that Mel Gorman
> performed with this patch at
>
> http://marc.info/?t=119507025400001&r=1&w=2
>

To be sure, I ran some tests on this. They take a while to run, hence the
delay in responding. The tests were based on 2.6.24-rc7 with the
per-cpu-related patches and this patch rebased to mainline instead of -mm
(see http://www.csn.ul.ie/~mel/postings/percpu-20080114/remove-hotcoldpcp.diff).

It is still the case that performance with or without the list split is
very close. With only one exception, the unified per-cpu list was slower
on average, but by such a small amount that it is mostly within the
standard deviation between runs. Based on these tests, I still think it is
safe to get rid of the hot/cold PCP split.
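For anyone following the thread without the patches to hand, the hot/cold
distinction only controls which end of the per-CPU free list a page is
placed on and taken from. Below is a simplified userspace sketch of that
idea under the single-list scheme - an illustration written for this
mail, not the real mm/page_alloc.c code; the function names are borrowed
but the types and list handling are cut down to the bare minimum.

/*
 * Simplified sketch of the per-cpu page list behaviour under
 * discussion (illustration only, not the real mm/page_alloc.c code).
 * Pages freed "hot" (likely cache-hot) go to the head of the list,
 * pages freed "cold" go to the tail; allocations take hot pages from
 * the head or cold pages from the tail.
 */
#include <stdio.h>

struct page {
	struct page *prev, *next;
	int id;
};

static struct page pcp = { &pcp, &pcp, -1 };	/* circular list head */

static void list_add(struct page *new, struct page *head)
{
	new->prev = head;
	new->next = head->next;
	head->next->prev = new;
	head->next = new;
}

static void list_add_tail(struct page *new, struct page *head)
{
	list_add(new, head->prev);	/* insert before the head == at tail */
}

static void free_hot_cold_page(struct page *page, int cold)
{
	if (cold)
		list_add_tail(page, &pcp);	/* cold: cache-cold end */
	else
		list_add(page, &pcp);		/* hot: likely still in cache */
}

static struct page *buffered_rmqueue(int cold)
{
	struct page *page = cold ? pcp.prev : pcp.next;

	if (page == &pcp)
		return NULL;			/* list is empty */
	page->prev->next = page->next;
	page->next->prev = page->prev;
	return page;
}

int main(void)
{
	struct page pages[4];
	int i;

	for (i = 0; i < 4; i++) {
		pages[i].id = i;
		free_hot_cold_page(&pages[i], i & 1);	/* alternate hot/cold */
	}
	printf("hot alloc:  page %d\n", buffered_rmqueue(0)->id); /* 2: last hot free  */
	printf("cold alloc: page %d\n", buffered_rmqueue(1)->id); /* 3: last cold free */
	return 0;
}

Unifying the lists removes the cold cases entirely, so every free is a
list_add() and every allocation comes from the head; the question these
tests are asking is whether losing that cache hint is measurable.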
Test Machine A: bl6-13    X86-64 (BladeCenter LS20)
Test Machine B: elm3a68   X86    (xSeries 345, Xeon based)
Test Machine C: gekko-lp3 PPC64  (System p5 570)

Kernbench
---------

X86-64 bl6-13
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
                            Min                        Average                      Max                Std. Deviation
                 ---------------------------  ---------------------------  ---------------------------  ---------------------
User CPU time      84.86/84.82    (  0.05%)     85.32/84.87    (  0.53%)     85.59/84.94    (  0.76%)   0.28/0.05 (  84.03%)
System CPU time    33.14/33.46    ( -0.97%)     33.55/33.72    ( -0.49%)     34.14/33.83    (  0.91%)   0.37/0.15 (  59.84%)
Total CPU time    118.73/118.40   (  0.28%)    118.87/118.58   (  0.24%)    119.00/118.67   (  0.28%)   0.10/0.11 ( -12.02%)
Elapsed time       34.06/36.01    ( -5.73%)     35.49/36.78    ( -3.65%)     36.48/37.71    ( -3.37%)   0.91/0.65 (  28.53%)

X86 elm3a68
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
                            Min                        Average                      Max                Std. Deviation
                 ---------------------------  ---------------------------  ---------------------------  ---------------------
User CPU time    1251.30/1251.25  (  0.00%)   1251.40/1251.97  ( -0.05%)   1251.55/1253.07  ( -0.12%)   0.09/0.68 (-638.66%)
System CPU time   271.00/274.00   ( -1.11%)    272.32/274.22   ( -0.70%)    272.98/274.45   ( -0.54%)   0.78/0.21 (  73.70%)
Total CPU time   1522.55/1525.28  ( -0.18%)   1523.72/1526.19  ( -0.16%)   1524.37/1527.07  ( -0.18%)   0.71/0.64 (  10.63%)
Elapsed time      387.55/388.19   ( -0.17%)    388.94/389.76   ( -0.21%)    391.27/392.51   ( -0.32%)   1.47/1.77 ( -20.72%)

PPC64 gekko-lp3
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
                            Min                        Average                      Max                Std. Deviation
                 ---------------------------  ---------------------------  ---------------------------  ---------------------
User CPU time     308.92/308.29   (  0.20%)    309.10/308.60   (  0.16%)    309.35/308.86   (  0.16%)   0.16/0.23 ( -44.74%)
System CPU time    16.80/16.78    (  0.12%)     16.82/16.80    (  0.12%)     16.83/16.81    (  0.12%)   0.01/0.01 (   0.00%)
Total CPU time    325.72/325.07   (  0.20%)    325.92/325.39   (  0.16%)    326.16/325.66   (  0.15%)   0.16/0.23 ( -44.74%)
Elapsed time      164.03/163.29   (  0.45%)    164.20/163.85   (  0.21%)    164.36/164.21   (  0.09%)   0.12/0.34 (-191.99%)

The bl6-13 elapsed time regression looks severe but it is within the
standard deviation. gekko-lp3 was the only machine (out of 12 I tested)
that showed an improvement here. However, gekko-lp1, which is very
similar to gekko-lp3, showed a small regression, so I guess this is
something that varies. Either way, I would conclude that the difference
here is so minimal that it does not justify splitting the per-cpu lists
on its own.

Create/Delete
-------------

This is based on the create-delete.c test from ext3, mentioned last by
Andrew here: http://marc.info/?l=linux-mm&m=119517308705439&w=2. The test
is run multiple times with different numbers of clients and mapping
sizes. The results linked here are for one client running per CPU in the
system (i.e. 4 clients):

bl6-13:    http://www.csn.ul.ie/~mel/postings/percpu-20080114/bl6-13-comparison-anonfilemapping-4.ps
elm3a68:   http://www.csn.ul.ie/~mel/postings/percpu-20080114/elm3a68-comparison-anonfilemapping-4.ps
gekko-lp3: http://www.csn.ul.ie/~mel/postings/percpu-20080114/gekko-lp3-comparison-anonfilemapping-4.ps

On bl6-13, anonymous mappings were comparable. With file mappings,
splitting the per-cpu lists is comparable until the mapping size is
larger than the L2 cache, after which it gets slower (11% at the end). In
contrast, on elm3a68 and gekko-lp3, unifying the lists is sometimes
marginally faster throughout.
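For anyone who does not have create-delete.c to hand, the general shape
of this style of test is a tight create/map/touch/delete loop per client.
The following is my paraphrase of one client for illustration, not the
actual create-delete.c source; the file name, iteration count and default
size are made up:

/*
 * Rough sketch of a create/delete client (illustration only; the real
 * create-delete.c may differ). Create a file, map it, touch every page
 * to force allocation, then tear it all down again so the pages cycle
 * through the per-cpu lists.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t size = (argc > 1) ? strtoul(argv[1], NULL, 0) : 1 << 20;
	long pagesize = sysconf(_SC_PAGESIZE);
	int i;

	for (i = 0; i < 1000; i++) {
		int fd = open("cd-testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
		char *map;
		size_t off;

		if (fd < 0 || ftruncate(fd, size) < 0) {
			perror("open/ftruncate");
			exit(1);
		}
		map = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (off = 0; off < size; off += pagesize)
			map[off] = 1;	/* touch each page */
		munmap(map, size);
		close(fd);
		unlink("cd-testfile");	/* delete, freeing pages again */
	}
	return 0;
}

The interesting region is where the mapping size crosses the L2 cache
size, which is where the bl6-13 file-mapping results above diverge.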
HackBench
---------

While this test is aimed at the scheduler, we have seen SLAB and SLUB
show different performance characteristics on it. While the nature of
that regression has no relevance here, I thought it would not hurt to do
a comparison just in case we were very unlucky with the batch sizes and
PCP watermarks.

bl6-13:    http://www.csn.ul.ie/~mel/postings/percpu-20080114/bl6-13-comparison-hackbench.ps
elm3a68:   http://www.csn.ul.ie/~mel/postings/percpu-20080114/elm3a68-comparison-hackbench.ps
gekko-lp3: http://www.csn.ul.ie/~mel/postings/percpu-20080114/gekko-lp3-comparison-hackbench.ps

With bl6-13, performance is again very close. Unifying the lists seemed
marginally faster with sockets and marginally slower with pipes - too
small a margin to really say much about. It is a similar story with
elm3a68. With gekko-lp3, unifying seems slightly *slower* with sockets
but similar with pipes.

HighAlloc Comparison
--------------------

I'm not going to say much about this as it is not a performance issue. On
some machines it helped and on others it hurt. I do not have specific
details as to why it makes a difference at all, but analysing that will
be done independently of this patch.

Ideally, sysbench and volanomark would also be run, but I'm still in the
process of getting them fully automated for this type of testing. As it
stands, I still see no problems with the patches.

> Signed-off-by: Christoph Lameter

Signed-off-by: Mel Gorman

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab