[PATCH] Eliminate the hot/cold distinction in the page allocator
From: Christoph Lameter @ 2008-01-11 4:13 UTC
To: akpm; +Cc: linux-mm, Mel Gorman
This is on top of the patch that adds cold pages to the end of the pcp
list. It drops all distinction between hot and cold pages, which
improves performance. See the discussion and the tests that Mel Gorman
performed with this patch at
http://marc.info/?t=119507025400001&r=1&w=2
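For readers unfamiliar with the pcp semantics, here is a minimal userspace
model of the list behaviour involved (an illustration only, not the kernel
code; the real implementation uses struct per_cpu_pages and <linux/list.h>,
as the diff below shows). Previously, hot frees added at the head of the
per-cpu list and cold frees at the tail, so hot allocations reused
cache-warm pages first; with this patch, every free adds at the head.

/*
 * Minimal userspace model of the pcp list semantics (illustration
 * only -- the kernel uses struct per_cpu_pages and <linux/list.h>).
 */
#include <stdio.h>

struct page {
	int id;
	struct page *prev, *next;
};

static struct page pcp_list;	/* circular list with a dummy head */

static void pcp_init(void)
{
	pcp_list.prev = pcp_list.next = &pcp_list;
}

static void add_head(struct page *p)
{
	p->next = pcp_list.next;
	p->prev = &pcp_list;
	pcp_list.next->prev = p;
	pcp_list.next = p;
}

static void add_tail(struct page *p)
{
	p->prev = pcp_list.prev;
	p->next = &pcp_list;
	pcp_list.prev->next = p;
	pcp_list.prev = p;
}

/* Old behaviour: free_hot_cold_page(page, cold) */
static void free_hot_cold_page(struct page *p, int cold)
{
	if (cold)
		add_tail(p);	/* cache-cold: reuse last */
	else
		add_head(p);	/* cache-hot: reuse first */
}

/* New behaviour: free_a_page(page) -- no distinction */
static void free_a_page(struct page *p)
{
	add_head(p);
}

int main(void)
{
	struct page pages[3] = { { .id = 0 }, { .id = 1 }, { .id = 2 } };
	struct page *p;

	pcp_init();
	free_hot_cold_page(&pages[0], 0);	/* hot  -> head */
	free_hot_cold_page(&pages[1], 1);	/* cold -> tail */
	free_a_page(&pages[2]);			/* always head  */

	for (p = pcp_list.next; p != &pcp_list; p = p->next)
		printf("page %d\n", p->id);	/* prints 2, 0, 1 */
	return 0;
}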
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/gfp.h | 3 +--
mm/page_alloc.c | 34 +++++++---------------------------
mm/swap.c | 2 +-
3 files changed, 9 insertions(+), 30 deletions(-)
Index: linux-2.6.24-rc6-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/gfp.h 2008-01-10 20:03:24.965516788 -0800
+++ linux-2.6.24-rc6-mm1/include/linux/gfp.h 2008-01-10 20:08:12.117206294 -0800
@@ -220,8 +220,7 @@ extern unsigned long FASTCALL(get_zeroed
extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
-extern void FASTCALL(free_hot_page(struct page *page));
-extern void FASTCALL(free_cold_page(struct page *page));
+extern void FASTCALL(free_a_page(struct page *page));
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr),0)
Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c 2008-01-10 20:03:24.977516887 -0800
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c 2008-01-10 20:03:28.169508169 -0800
@@ -993,7 +993,7 @@ void mark_free_pages(struct zone *zone)
/*
* Free a 0-order page
*/
-static void free_hot_cold_page(struct page *page, int cold)
+void free_a_page(struct page *page)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
@@ -1013,10 +1013,7 @@ static void free_hot_cold_page(struct pa
pcp = &zone_pcp(zone, get_cpu())->pcp;
local_irq_save(flags);
__count_vm_event(PGFREE);
- if (cold)
- list_add_tail(&page->lru, &pcp->list);
- else
- list_add(&page->lru, &pcp->list);
+ list_add(&page->lru, &pcp->list);
set_page_private(page, get_pageblock_migratetype(page));
pcp->count++;
if (pcp->count >= pcp->high) {
@@ -1027,16 +1024,6 @@ static void free_hot_cold_page(struct pa
put_cpu();
}
-void free_hot_page(struct page *page)
-{
- free_hot_cold_page(page, 0);
-}
-
-void free_cold_page(struct page *page)
-{
- free_hot_cold_page(page, 1);
-}
-
/*
* split_page takes a non-compound higher-order page, and splits it into
* n (1<<order) sub-pages: page[0..n]
@@ -1065,7 +1052,6 @@ static struct page *buffered_rmqueue(str
{
unsigned long flags;
struct page *page;
- int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
int migratetype = allocflags_to_migratetype(gfp_flags);
@@ -1084,15 +1070,9 @@ again:
}
/* Find a page of the appropriate migrate type */
- if (cold) {
- list_for_each_entry_reverse(page, &pcp->list, lru)
- if (page_private(page) == migratetype)
- break;
- } else {
- list_for_each_entry(page, &pcp->list, lru)
- if (page_private(page) == migratetype)
- break;
- }
+ list_for_each_entry(page, &pcp->list, lru)
+ if (page_private(page) == migratetype)
+ break;
/* Allocate more to the pcp list if necessary */
if (unlikely(&page->lru == &pcp->list)) {
@@ -1755,14 +1735,14 @@ void __pagevec_free(struct pagevec *pvec
int i = pagevec_count(pvec);
while (--i >= 0)
- free_hot_cold_page(pvec->pages[i], pvec->cold);
+ free_a_page(pvec->pages[i]);
}
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
if (order == 0)
- free_hot_page(page);
+ free_a_page(page);
else
__free_pages_ok(page, order);
}
Index: linux-2.6.24-rc6-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap.c 2008-01-10 20:07:59.497196870 -0800
+++ linux-2.6.24-rc6-mm1/mm/swap.c 2008-01-10 20:08:12.117206294 -0800
@@ -54,7 +54,7 @@ static void __page_cache_release(struct
del_page_from_lru(zone, page);
spin_unlock_irqrestore(&zone->lru_lock, flags);
}
- free_hot_page(page);
+ free_a_page(page);
}
static void put_compound_page(struct page *page)
Re: [PATCH] Eliminate the hot/cold distinction in the page allocator
From: Mel Gorman @ 2008-01-14 11:24 UTC
To: Christoph Lameter; +Cc: akpm, linux-mm
On (10/01/08 20:13), Christoph Lameter didst pronounce:
> This is on top of the patch that adds cold pages to the end of the pcp
> list. It drops all the distinctions between hot and cold pages which
> improves performance. See the discussion and the tests that Mel Gorman
> performed with this patch at
>
> http://marc.info/?t=119507025400001&r=1&w=2
>
To be sure, I ran some tests on this. They take a while to run, hence
the delay in responding. The tests were based on 2.6.24-rc7 with the
per-cpu-related patches and this patch rebased to mainline instead of -mm (see
http://www.csn.ul.ie/~mel/postings/percpu-20080114/remove-hotcoldpcp.diff). It
is still the case that performance with or without the list split is
very close. With only one exception, the unified per-cpu list was slower
on average, but by such a small amount that it is mostly within the standard
deviation between runs. Based on these tests, I still think it's safe to
get rid of the hot/cold PCP split.
Test Machine A: bl6-13 X86-64 (BladeCenter LS20)
Test Machine B: elm3a68 X86 (xSeries 345, Xeon based)
Test Machine C: gekko-lp3 PPC64 (System p5 570)
Kernbench
---------
X86-64 bl6-13
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
Min Average Max Std. Deviation
--------------------------- --------------------------- --------------------------- ----------------------------
User CPU time 84.86/84.82 ( 0.05%) 85.32/84.87 ( 0.53%) 85.59/84.94 ( 0.76%) 0.28/0.05 ( 84.03%)
System CPU time 33.14/33.46 ( -0.97%) 33.55/33.72 ( -0.49%) 34.14/33.83 ( 0.91%) 0.37/0.15 ( 59.84%)
Total CPU time 118.73/118.40 ( 0.28%) 118.87/118.58 ( 0.24%) 119.00/118.67 ( 0.28%) 0.10/0.11 ( -12.02%)
Elapsed time 34.06/36.01 ( -5.73%) 35.49/36.78 ( -3.65%) 36.48/37.71 ( -3.37%) 0.91/0.65 ( 28.53%)
X86 elm3a68
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
Min Average Max Std. Deviation
--------------------------- --------------------------- --------------------------- ----------------------------
User CPU time 1251.30/1251.25 ( 0.00%) 1251.40/1251.97 ( -0.05%) 1251.55/1253.07 ( -0.12%) 0.09/0.68 (-638.66%)
System CPU time 271.00/274.00 ( -1.11%) 272.32/274.22 ( -0.70%) 272.98/274.45 ( -0.54%) 0.78/0.21 ( 73.70%)
Total CPU time 1522.55/1525.28 ( -0.18%) 1523.72/1526.19 ( -0.16%) 1524.37/1527.07 ( -0.18%) 0.71/0.64 ( 10.63%)
Elapsed time 387.55/388.19 ( -0.17%) 388.94/389.76 ( -0.21%) 391.27/392.51 ( -0.32%) 1.47/1.77 ( -20.72%)
PPC64 gekko-lp3
KernBench Timing Comparison (2.6.24-rc7-hot-cold-pcp/2.6.24-rc7-unified-pcp)
Min Average Max Std. Deviation
--------------------------- --------------------------- --------------------------- ----------------------------
User CPU time 308.92/308.29 ( 0.20%) 309.10/308.60 ( 0.16%) 309.35/308.86 ( 0.16%) 0.16/0.23 ( -44.74%)
System CPU time 16.80/16.78 ( 0.12%) 16.82/16.80 ( 0.12%) 16.83/16.81 ( 0.12%) 0.01/0.01 ( 0.00%)
Total CPU time 325.72/325.07 ( 0.20%) 325.92/325.39 ( 0.16%) 326.16/325.66 ( 0.15%) 0.16/0.23 ( -44.74%)
Elapsed time 164.03/163.29 ( 0.45%) 164.20/163.85 ( 0.21%) 164.36/164.21 ( 0.09%) 0.12/0.34 (-191.99%)
The bl6-13 elapsed time regression looks severe, but it is within the standard
deviation. gekko-lp3 was the only machine (out of 12 I tested) that showed
an improvement here. However, gekko-lp1, which is very similar to gekko-lp3,
showed a small regression, so the difference appears to vary from machine to
machine. Overall, I would conclude that the difference here is so minimal that
it does not justify splitting the per-cpu lists on its own.
Create/Delete
-------------
This is based on the create-delete.c test from ext3 mentioned last by Andrew
here http://marc.info/?l=linux-mm&m=119517308705439&w=2. The test is run
multiple times with different numbers of clients and mapping sizes. The results
linked here are for 1 client running per CPU in the system (i.e. 4 clients):
bl6-13: http://www.csn.ul.ie/~mel/postings/percpu-20080114/bl6-13-comparison-anonfilemapping-4.ps
elm3a68: http://www.csn.ul.ie/~mel/postings/percpu-20080114/elm3a68-comparison-anonfilemapping-4.ps
gekko-lp3: http://www.csn.ul.ie/~mel/postings/percpu-20080114/gekko-lp3-comparison-anonfilemapping-4.ps
On bl6-13, anonymous mappings were comparable. With file mappings, splitting
the per-cpu lists is comparable until the size is larger than the L2 cache,
then it gets slower (11% at the end). In contrast, on elm3a68 and gekko-lp3,
unifying the lists is sometimes marginally faster throughout.
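For the curious, the following sketch shows the sort of map/touch/unmap loop
such a test performs (an assumption based on the description above, not the
actual create-delete.c; the size and iteration count are made up). Each
munmap() pushes the pages back through the per-cpu free lists, which is
exactly the path the hot/cold distinction affected.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	/* 8MB is an arbitrary size chosen to exceed most L2 caches */
	const size_t size = 8 << 20;
	int i;

	for (i = 0; i < 1000; i++) {
		char *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (map == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(map, 1, size);	/* fault in every page */
		munmap(map, size);	/* free them back to the pcp lists */
	}
	return 0;
}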
HackBench
---------
Although this test targets the scheduler, we've seen SLAB and SLUB show
different performance characteristics on it. While the nature of that regression
has no relevance here, I thought it wouldn't hurt to do a comparison just
in case we were very unlucky with the batch sizes and PCP watermarks.
bl6-13: http://www.csn.ul.ie/~mel/postings/percpu-20080114/bl6-13-comparison-hackbench.ps
elm3a68: http://www.csn.ul.ie/~mel/postings/percpu-20080114/elm3a68-comparison-hackbench.ps
gekko-lp3: http://www.csn.ul.ie/~mel/postings/percpu-20080114/gekko-lp3-comparison-hackbench.ps
With bl6-13, performance is again very close. Unifying the lists seemed
marginally faster with sockets and marginally slower with pipes - too small
a margin to really say much about. Similar story with elm3a68. With gekko-lp3,
unifying seems slightly *slower* with sockets but similar with pipes.
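As a rough picture of what the sockets case exercises, here is a minimal
model of a hackbench-style sender/receiver pair (an illustration only; the
real hackbench forks groups of many tasks, and the message size and count
here are made up). The constant stream of messages keeps the kernel
allocating and freeing buffer pages, which is why unlucky batch sizes and
PCP watermarks could in principle show up here.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

#define MSG_SIZE 100
#define NR_MSGS  10000

int main(void)
{
	int sv[2];
	char buf[MSG_SIZE];
	int i;

	/* SOCK_DGRAM preserves message boundaries for the reader */
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}
	if (fork() == 0) {		/* receiver */
		for (i = 0; i < NR_MSGS; i++)
			read(sv[1], buf, MSG_SIZE);
		_exit(0);
	}
	memset(buf, 0, MSG_SIZE);	/* sender */
	for (i = 0; i < NR_MSGS; i++)
		write(sv[0], buf, MSG_SIZE);
	wait(NULL);
	return 0;
}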
HighAlloc Comparison
--------------------
I'm not going to say much about this as it's not a performance issue. On some
machines it helped and on others it hurt. I don't have specific details on
why it makes a difference at all, but it will be analysed independently
of this patch.
Ideally, sysbench and volanomark would also be run but I'm still in the
process of getting them fully automated for this type of testing. As
it is, I still see no problems with the patches.
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab