linux-mm.kvack.org archive mirror
* [PATCH] Eliminate the hot/cold distinction in the page allocator
@ 2008-01-11  4:13 Christoph Lameter
  2008-01-14 11:24 ` Mel Gorman
  0 siblings, 1 reply; 2+ messages in thread
From: Christoph Lameter @ 2008-01-11  4:13 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Mel Gorman

This is on top of the patch that adds cold pages to the end of the pcp
list. It drops all the distinctions between hot and cold pages, which
improves performance. See the discussion and the tests that Mel Gorman
performed with this patch at

http://marc.info/?t=119507025400001&r=1&w=2
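
As a toy illustration of what is being removed: the pcp list is a
per-CPU cache of order-0 pages. Freeing a "hot" (presumed cache-warm)
page put it at the head of the list and a "cold" page at the tail, and
__GFP_COLD allocations then searched from the tail. The user-space
sketch below mimics the old and the unified behaviour. It is only a
model under those assumptions; none of the helpers are kernel APIs.

/* Toy user-space model of the pcp free list; illustration only. */
#include <stdio.h>

struct page { int id; struct page *prev, *next; };

/* Sentinel-headed circular list, like the kernel's struct list_head. */
static struct page pcp = { .prev = &pcp, .next = &pcp };

static void list_add_head(struct page *p)	/* "hot": likely cache-warm */
{
	p->next = pcp.next;
	p->prev = &pcp;
	pcp.next->prev = p;
	pcp.next = p;
}

static void list_add_tail(struct page *p)	/* "cold": ages before reuse */
{
	p->prev = pcp.prev;
	p->next = &pcp;
	pcp.prev->next = p;
	pcp.prev = p;
}

/* Before the patch: every caller had to guess the cache temperature. */
static void free_hot_cold(struct page *p, int cold)
{
	if (cold)
		list_add_tail(p);
	else
		list_add_head(p);
}

/* After the patch: one path; allocation always takes from the head. */
static void free_a_page_toy(struct page *p)
{
	list_add_head(p);
}

int main(void)
{
	struct page a = { .id = 1 }, b = { .id = 2 };

	free_hot_cold(&a, 1);	/* old cold path: tail */
	free_a_page_toy(&b);	/* unified path: head */
	for (struct page *p = pcp.next; p != &pcp; p = p->next)
		printf("page %d\n", p->id);	/* prints 2 then 1 */
	return 0;
}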

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>

---
 include/linux/gfp.h |    3 +--
 mm/page_alloc.c     |   34 +++++++---------------------------
 mm/swap.c           |    2 +-
 3 files changed, 9 insertions(+), 30 deletions(-)

Index: linux-2.6.24-rc6-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/gfp.h	2008-01-10 20:03:24.965516788 -0800
+++ linux-2.6.24-rc6-mm1/include/linux/gfp.h	2008-01-10 20:08:12.117206294 -0800
@@ -220,8 +220,7 @@ extern unsigned long FASTCALL(get_zeroed
 
 extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
 extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
-extern void FASTCALL(free_hot_page(struct page *page));
-extern void FASTCALL(free_cold_page(struct page *page));
+extern void FASTCALL(free_a_page(struct page *page));
 
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr),0)
Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c	2008-01-10 20:03:24.977516887 -0800
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c	2008-01-10 20:03:28.169508169 -0800
@@ -993,7 +993,7 @@ void mark_free_pages(struct zone *zone)
 /*
  * Free a 0-order page
  */
-static void free_hot_cold_page(struct page *page, int cold)
+void free_a_page(struct page *page)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
@@ -1013,10 +1013,7 @@ static void free_hot_cold_page(struct pa
 	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
-	if (cold)
-		list_add_tail(&page->lru, &pcp->list);
-	else
-		list_add(&page->lru, &pcp->list);
+	list_add(&page->lru, &pcp->list);
 	set_page_private(page, get_pageblock_migratetype(page));
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
@@ -1027,16 +1024,6 @@ static void free_hot_cold_page(struct pa
 	put_cpu();
 }
 
-void free_hot_page(struct page *page)
-{
-	free_hot_cold_page(page, 0);
-}
-	
-void free_cold_page(struct page *page)
-{
-	free_hot_cold_page(page, 1);
-}
-
 /*
  * split_page takes a non-compound higher-order page, and splits it into
  * n (1<<order) sub-pages: page[0..n]
@@ -1065,7 +1052,6 @@ static struct page *buffered_rmqueue(str
 {
 	unsigned long flags;
 	struct page *page;
-	int cold = !!(gfp_flags & __GFP_COLD);
 	int cpu;
 	int migratetype = allocflags_to_migratetype(gfp_flags);
 
@@ -1084,15 +1070,9 @@ again:
 		}
 
 		/* Find a page of the appropriate migrate type */
-		if (cold) {
-			list_for_each_entry_reverse(page, &pcp->list, lru)
-				if (page_private(page) == migratetype)
-					break;
-		} else {
-			list_for_each_entry(page, &pcp->list, lru)
-				if (page_private(page) == migratetype)
-					break;
-		}
+		list_for_each_entry(page, &pcp->list, lru)
+			if (page_private(page) == migratetype)
+				break;
 
 		/* Allocate more to the pcp list if necessary */
 		if (unlikely(&page->lru == &pcp->list)) {
@@ -1755,14 +1735,14 @@ void __pagevec_free(struct pagevec *pvec
 	int i = pagevec_count(pvec);
 
 	while (--i >= 0)
-		free_hot_cold_page(pvec->pages[i], pvec->cold);
+		free_a_page(pvec->pages[i]);
 }
 
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
 		if (order == 0)
-			free_hot_page(page);
+			free_a_page(page);
 		else
 			__free_pages_ok(page, order);
 	}
Index: linux-2.6.24-rc6-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap.c	2008-01-10 20:07:59.497196870 -0800
+++ linux-2.6.24-rc6-mm1/mm/swap.c	2008-01-10 20:08:12.117206294 -0800
@@ -54,7 +54,7 @@ static void __page_cache_release(struct 
 		del_page_from_lru(zone, page);
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
-	free_hot_page(page);
+	free_a_page(page);
 }
 
 static void put_compound_page(struct page *page)


* Re: [PATCH] Eliminate the hot/cold distinction in the page allocator
  2008-01-11  4:13 [PATCH] Eliminate the hot/cold distinction in the page allocator Christoph Lameter
@ 2008-01-14 11:24 ` Mel Gorman
  0 siblings, 0 replies; 2+ messages in thread
From: Mel Gorman @ 2008-01-14 11:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-mm

On (10/01/08 20:13), Christoph Lameter didst pronounce:
> This is on top of the patch that adds cold pages to the end of the pcp
> list. It drops all the distinctions between hot and cold pages which
> improves performance. See the discussion and the tests that Mel Gorman
> performed with this patch at
> 
> http://marc.info/?t=119507025400001&r=1&w=2
> 

To be sure, I ran some tests on this. They take a while to run, hence
the delay in responding. The tests were based on 2.6.24-rc7 with the
per-cpu-related patches and this patch rebased to mainline instead of -mm (see
http://www.csn.ul.ie/~mel/postings/percpu-20080114/remove-hotcoldpcp.diff). It
is still the case that performance with or without the list split is
very close. With only one exception, the unified per-cpu list was
slower on average, but by such a small amount that it's mostly within
the standard deviation between runs. Based on these tests, I still
think it's safe to get rid of the hot/cold PCP split.

Test Machine A: bl6-13 X86-64 (BladeCenter LS20)
Test Machine B: elm3a68 X86 (xSeries 345, Xeon based)
Test Machine C: gekko-lp3 PPC64 (System p5 570)

Kernbench
---------

X86-64 bl6-13
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
                     Min                         Average                     Max                         Std. Deviation              
                     --------------------------- --------------------------- --------------------------- ----------------------------
User   CPU time         84.86/84.82    (  0.05%)    85.32/84.87    (  0.53%) 85.59/84.94    (  0.76%)     0.28/0.05     (  84.03%)
System CPU time         33.14/33.46    ( -0.97%)    33.55/33.72    ( -0.49%) 34.14/33.83    (  0.91%)     0.37/0.15     (  59.84%)
Total  CPU time        118.73/118.40   (  0.28%)   118.87/118.58   (  0.24%) 119.00/118.67   (  0.28%)     0.10/0.11     ( -12.02%)
Elapsed    time         34.06/36.01    ( -5.73%)    35.49/36.78    ( -3.65%) 36.48/37.71    ( -3.37%)     0.91/0.65     (  28.53%)

X86 elm3a68
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
                     Min                         Average                     Max                         Std. Deviation              
                     --------------------------- --------------------------- --------------------------- ----------------------------
User   CPU time       1251.30/1251.25  (  0.00%)  1251.40/1251.97  ( -0.05%)  1251.55/1253.07  ( -0.12%)     0.09/0.68     (-638.66%)
System CPU time        271.00/274.00   ( -1.11%)   272.32/274.22   ( -0.70%)   272.98/274.45   ( -0.54%)     0.78/0.21     (  73.70%)
Total  CPU time       1522.55/1525.28  ( -0.18%)  1523.72/1526.19  ( -0.16%)  1524.37/1527.07  ( -0.18%)     0.71/0.64     (  10.63%)
Elapsed    time        387.55/388.19   ( -0.17%)   388.94/389.76   ( -0.21%)   391.27/392.51   ( -0.32%)     1.47/1.77     ( -20.72%)

PPC64 gekko-lp3
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
                     Min                         Average                     Max                         Std. Deviation              
                     --------------------------- --------------------------- --------------------------- ----------------------------
User   CPU time        308.92/308.29   (  0.20%)   309.10/308.60   (  0.16%)   309.35/308.86   (  0.16%)     0.16/0.23     ( -44.74%)
System CPU time         16.80/16.78    (  0.12%)    16.82/16.80    (  0.12%)    16.83/16.81    (  0.12%)     0.01/0.01     (   0.00%)
Total  CPU time        325.72/325.07   (  0.20%)   325.92/325.39   (  0.16%)   326.16/325.66   (  0.15%)     0.16/0.23     ( -44.74%)
Elapsed    time        164.03/163.29   (  0.45%)   164.20/163.85   (  0.21%)   164.36/164.21   (  0.09%)     0.12/0.34     (-191.99%)

The bl6-13 elapsed time regression looks severe, but it's within the
standard deviation. gekko-lp3 was the only machine (out of 12 I tested)
that showed an improvement here. However, gekko-lp1, which is very
similar to gekko-lp3, showed a small regression, so I guess this is
something that varies.

However, I would conclude that the difference here is so minimal that
it doesn't justify splitting the per-cpu lists on its own.

Create/Delete
-------------

This is based on the create-delete.c test from ext3, last mentioned by
Andrew here http://marc.info/?l=linux-mm&m=119517308705439&w=2. The
test is run multiple times with different numbers of clients and
mapping sizes; a simplified sketch of the access pattern follows the
links below. The results linked here are for one client running per
CPU in the system (i.e. 4 clients).

bl6-13:    http://www.csn.ul.ie/~mel/postings/percpu-20080114/bl6-13-comparison-anonfilemapping-4.ps
elm3a68:   http://www.csn.ul.ie/~mel/postings/percpu-20080114/elm3a68-comparison-anonfilemapping-4.ps
gekko-lp3: http://www.csn.ul.ie/~mel/postings/percpu-20080114/gekko-lp3-comparison-anonfilemapping-4.ps
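
As a rough idea of the access pattern being timed (a simplified
sketch, not the actual create-delete.c; the size and iteration count
here are arbitrary):

#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	/* Mapping size is the varied parameter in the real test. */
	size_t size = argc > 1 ? strtoul(argv[1], NULL, 0) : 1UL << 20;

	for (int i = 0; i < 10000; i++) {
		char *m = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (m == MAP_FAILED)
			return 1;
		memset(m, 0, size);	/* fault every page in */
		munmap(m, size);	/* pages go back through the pcp lists */
	}
	return 0;
}

Each "client" is one such process; the file-mapping variant maps a
file instead of anonymous memory.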

On bl6-13, anonymous mappings were comparable. With file mappings, the
unified list is comparable until the size is larger than the L2 cache,
after which it gets slower (11% at the end). In contrast, on elm3a68
and gekko-lp3, unifying the lists is sometimes marginally faster
throughout.

HackBench
---------

While this test is aimed at the scheduler, we've seen cases where SLAB
and SLUB have different performance characteristics on it. The nature
of that regression has no relevance here, but I thought it wouldn't
hurt to do a comparison just in case we were very unlucky with the
batch sizes and PCP watermarks.

bl6-13:    http://www.csn.ul.ie/~mel/postings/percpu-20080114/bl6-13-comparison-hackbench.ps
elm3a68:   http://www.csn.ul.ie/~mel/postings/percpu-20080114/elm3a68-comparison-hackbench.ps
gekko-lp3: http://www.csn.ul.ie/~mel/postings/percpu-20080114/gekko-lp3-comparison-hackbench.ps

With bl6-13, performance is again very close. Unifying the lists
seemed marginally faster with sockets and marginally slower with
pipes, too small a margin to say much about either way. It's a similar
story with elm3a68. With gekko-lp3, unifying seems slightly *slower*
with sockets but similar with pipes.

HighAlloc Comparison
--------------------

I'm not going to say much about this as it's not a performance issue.
On some machines it helped and on others it hurt. I don't have
specific details as to why it makes a difference at all, but analysing
that will be done independently of this patch.

Ideally, sysbench and volanomark would also be run, but I'm still in
the process of getting them fully automated for this type of testing.
As it is, I still see no problems with the patches.

> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
