* [RFC] buddy allocator without bitmap [2/4]
@ 2004-08-26 12:03 Hiroyuki KAMEZAWA
From: Hiroyuki KAMEZAWA @ 2004-08-26 12:03 UTC (permalink / raw)
To: Linux Kernel ML; +Cc: linux-mm, LHMS, William Lee Irwin III, Dave Hansen
This is the 3rd part, for page allocation.
PG_private is used to indicate:
"this page is the head of a run of contiguous free pages whose length is 2^(page->private)"
-- Kame
========================
This patch removes the bitmap operations from alloc_pages().
Instead of the MARK_USED() bitmap operation, the patch records each
chunk's order in the page struct itself, in the page->private field.
While zone->lock is held, the returned page's PG_private flag is cleared,
and the new heads of the remaining contiguous 2^n-page chunks are linked
onto free_area[]; they are all marked PG_private, and their page->private
fields record their orders.
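In code terms, the invariant is roughly this (an illustrative sketch, not
part of the patch; the helper name is made up):

	/*
	 * With this patch, a free buddy head is recognizable without any
	 * bitmap: PG_private is set and page->private holds the chunk's
	 * order.  Callers hold zone->lock, so non-atomic access is fine.
	 */
	static inline int page_is_free_head(struct page *page, unsigned int order)
	{
		return (page->flags & (1 << PG_private)) &&
		       page->private == order;
	}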
example) allocating 1 page from an 8-page chunk
start ) before calling alloc_pages()
free_area[3] -> page[0],order=3
free_area[2] ->
free_area[1] ->
free_area[0] ->
An 8-page chunk starting at page[0] is linked on free_area[3].free_list;
free_area[2], free_area[1], and free_area[0] are empty.
step1 ) before calling expand()
free_area[3] ->
free_area[2] ->
free_area[1] ->
free_area[0] ->
return page -> page[0],order=invalid
Because free_area[2], free_area[1], and free_area[0] are empty,
page[0] is taken from free_area[3].
expand() is called to divide page[0-7] into suitably sized chunks.
step2 ) expand loop 1st
free_area[3] ->
free_area[2] -> page[4],order = 2
free_area[1] ->
free_area[0] ->
return page -> page[0],order=invalid
The upper half of page[0-7], page[4-7], is now free with order 2;
page[4] is linked on free_area[2].free_list.
step3 ) expand loop 2nd
free_area[3] ->
free_area[2] -> page[4],order = 2
free_area[1] -> page[2],order = 1
free_area[0] ->
return page -> page[0],order=invalid
The upper half of page[0-3], page[2-3], is now free with order 1;
page[2] is linked on free_area[1].free_list.
step4 ) expand loop 3rd
free_area[3] ->
free_area[2] -> page[4],order = 2
free_area[1] -> page[2],order = 1
free_area[0] -> page[1],order = 0
return page -> page[0],order=invalid
The upper half of page[0-1], page[1], is now free with order 0;
page[1] is linked on free_area[0].free_list.
end )
The page[0-7] chunk has been divided into:
page[4-7] of order 2
page[2-3] of order 1
page[1] of order 0
page[0] is allocated.
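The loop that produces the steps above is essentially the following (a
simplified sketch of expand() as modified by this patch, with the
BUG_ON() checking dropped):

	/* split a 2^high chunk down to 2^low, freeing the upper halves */
	while (high > low) {
		area--;			/* step down to the next-lower order's free_area */
		high--;
		size >>= 1;
		list_add(&page[size].lru, &area->free_list);
		page[size].flags |= (1 << PG_private);	/* mark new free head */
		page[size].private = high;		/* record its order */
	}
	return page;	/* page[0] goes back to the caller */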
---
linux-2.6.8.1-mm4-kame-kamezawa/mm/page_alloc.c | 16 ++++++----------
1 files changed, 6 insertions(+), 10 deletions(-)
diff -puN mm/page_alloc.c~eliminate-bitmap-alloc mm/page_alloc.c
--- linux-2.6.8.1-mm4-kame/mm/page_alloc.c~eliminate-bitmap-alloc 2004-08-26 08:43:16.000000000 +0900
+++ linux-2.6.8.1-mm4-kame-kamezawa/mm/page_alloc.c 2004-08-26 11:40:29.461979560 +0900
@@ -288,9 +288,6 @@ void __free_pages_ok(struct page *page,
free_pages_bulk(page_zone(page), 1, &list, order);
}
-#define MARK_USED(index, order, area) \
- __change_bit((index) >> (1+(order)), (area)->map)
-
/*
* The order of subdivision here is critical for the IO subsystem.
* Please do not alter this order without good reasons and regression
@@ -307,7 +304,7 @@ void __free_pages_ok(struct page *page,
*/
static inline struct page *
expand(struct zone *zone, struct page *page,
- unsigned long index, int low, int high, struct free_area *area)
+ int low, int high, struct free_area *area)
{
unsigned long size = 1 << high;
@@ -317,7 +314,8 @@ expand(struct zone *zone, struct page *p
size >>= 1;
BUG_ON(bad_range(zone, &page[size]));
list_add(&page[size].lru, &area->free_list);
- MARK_USED(index + size, high, area);
+ page[size].flags |= (1 << PG_private);
+ page[size].private = high;
}
return page;
}
@@ -371,7 +369,6 @@ static struct page *__rmqueue(struct zon
struct free_area * area;
unsigned int current_order;
struct page *page;
- unsigned int index;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = zone->free_area + current_order;
@@ -380,11 +377,10 @@ static struct page *__rmqueue(struct zon
page = list_entry(area->free_list.next, struct page, lru);
list_del(&page->lru);
- index = page - zone->zone_mem_map;
- if (current_order != MAX_ORDER-1)
- MARK_USED(index, current_order, area);
+ /* Atomic operation is needless here */
+ page->flags &= ~(1 << PG_private);
zone->free_pages -= 1UL << order;
- return expand(zone, page, index, order, current_order, area);
+ return expand(zone, page, order, current_order, area);
}
return NULL;
_
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-26 15:50 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA; +Cc: Linux Kernel ML, linux-mm, lhms, William Lee Irwin III

On Thu, 2004-08-26 at 05:03, Hiroyuki KAMEZAWA wrote:
> -		MARK_USED(index + size, high, area);
> +		page[size].flags |= (1 << PG_private);
> +		page[size].private = high;
>  	}
>  	return page;
>  }
...
> +	/* Atomic operation is needless here */
> +	page->flags &= ~(1 << PG_private);

See linux/page-flags.h:

#define SetPagePrivate(page)	set_bit(PG_private, &(page)->flags)
#define ClearPagePrivate(page)	clear_bit(PG_private, &(page)->flags)
#define PagePrivate(page)	test_bit(PG_private, &(page)->flags)

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-26 23:05 UTC (permalink / raw)
To: Dave Hansen; +Cc: Linux Kernel ML, linux-mm, lhms, William Lee Irwin III

Hi,

I understand that using these macros cleans up the code; I used them in
my previous version.

In the previous version, I used
SetPagePrivate()/ClearPagePrivate()/PagePrivate().
But these are "atomic" operations and look very slow.
This is why I didn't use these macros in this version.

My previous version, which used set_bit/test_bit/clear_bit, showed very
bad performance in my tests, so I replaced them.

If I made a mistake in measuring the performance and
set_bit/test_bit/clear_bit are faster than I think, I'd like to use
them again.

-- Kame

Dave Hansen wrote:
> On Thu, 2004-08-26 at 05:03, Hiroyuki KAMEZAWA wrote:
>> -		MARK_USED(index + size, high, area);
>> +		page[size].flags |= (1 << PG_private);
>> +		page[size].private = high;
>>  	}
>>  	return page;
>>  }
>
> ...
>
>> +	/* Atomic operation is needless here */
>> +	page->flags &= ~(1 << PG_private);
>
> See linux/page-flags.h:
>
> #define SetPagePrivate(page)	set_bit(PG_private, &(page)->flags)
> #define ClearPagePrivate(page)	clear_bit(PG_private, &(page)->flags)
> #define PagePrivate(page)	test_bit(PG_private, &(page)->flags)
>
> -- Dave

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-26 23:11 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA; +Cc: Linux Kernel ML, linux-mm, lhms, William Lee Irwin III

On Thu, 2004-08-26 at 16:05, Hiroyuki KAMEZAWA wrote:
> I understand that using these macros cleans up the code; I used them
> in my previous version.
>
> In the previous version, I used
> SetPagePrivate()/ClearPagePrivate()/PagePrivate().
> But these are "atomic" operations and look very slow.
> This is why I didn't use these macros in this version.
>
> My previous version, which used set_bit/test_bit/clear_bit, showed
> very bad performance in my tests, so I replaced them.
>
> If I made a mistake in measuring the performance and
> set_bit/test_bit/clear_bit are faster than I think, I'd like to use
> them again.

Sorry, I misread your comment:

	/* Atomic operation is needless here */

I read "needless" as "needed". Would it make any more sense to you to
say "already have lock, don't need atomic ops" instead?

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-26 23:28 UTC (permalink / raw)
To: Dave Hansen; +Cc: Linux Kernel ML, linux-mm, lhms, William Lee Irwin III

Dave Hansen wrote:
> Sorry, I misread your comment:
>
>	/* Atomic operation is needless here */
>
> I read "needless" as "needed". Would it make any more sense to you to
> say "already have lock, don't need atomic ops" instead?

Thanks. I'm not so good at writing comments, as you know :).
I'll rewrite it.

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Andrew Morton @ 2004-08-27 0:18 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA; +Cc: haveblue, linux-kernel, linux-mm, lhms-devel, wli

Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> In the previous version, I used
> SetPagePrivate()/ClearPagePrivate()/PagePrivate().
> But these are "atomic" operations and look very slow.
> This is why I didn't use these macros in this version.
>
> My previous version, which used set_bit/test_bit/clear_bit, showed
> very bad performance in my tests, so I replaced them.

That's surprising. But if you do intend to use non-atomic bitops then
please add __SetPagePrivate() and __ClearPagePrivate().
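Presumably these would mirror the existing atomic helpers but use the
non-atomic bitops, i.e. something like the following (a sketch; the
exact definitions were not spelled out in this thread):

	/*
	 * Non-atomic variants: only safe while zone->lock (or some other
	 * form of exclusion) guarantees nobody else can touch page->flags.
	 */
	#define __SetPagePrivate(page)   __set_bit(PG_private, &(page)->flags)
	#define __ClearPagePrivate(page) __clear_bit(PG_private, &(page)->flags)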
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 0:27 UTC (permalink / raw)
To: Andrew Morton; +Cc: haveblue, linux-kernel, linux-mm, lhms-devel, wli

Okay, I'll do more testing, and if I find that the atomic ops are slow,
I'll add __XXXPagePrivate() macros.

P.S. I usually test code on a 2-way Xeon 1.8GHz server.

-- Kame

Andrew Morton wrote:
> That's surprising. But if you do intend to use non-atomic bitops then
> please add __SetPagePrivate() and __ClearPagePrivate().

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 4:48 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA
Cc: Andrew Morton, haveblue, linux-kernel, linux-mm, lhms-devel, wli

[-- Attachment #1: Type: text/plain --]

Hi,

I tested the set_bit()/__set_bit() ops, atomic and non-atomic, on my
Xeon. This test is not perfect, but it shows some aspects of the
performance of atomic ops.

Program:
The program touches memory in a tight loop, using atomic and non-atomic
set_bit(). The memory size is 512k, the L2 cache size.
I attach it to this mail, but it is configured for my Xeon and looks
ugly :).

My CPU, from /proc/cpuinfo:
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) XEON(TM) MP CPU 1.90GHz
stepping	: 2
cpu MHz		: 1891.582
cache size	: 512 KB
CPU		: Intel Xeon 1.8GHz

Result:
[root@kanex2 atomic]# nice -10 ./test-atomics
score 0 is   64011  note: cache hit, no atomic
score 1 is  543011  note: cache hit, atomic
score 2 is  303901  note: cache hit, mixture
score 3 is  344261  note: cache miss, no atomic
score 4 is 1131085  note: cache miss, atomic
score 5 is  593443  note: cache miss, mixture
score 6 is  118455  note: cache hit, dependency, noatomic
score 7 is  416195  note: cache hit, dependency, mixture

A smaller score is better.
Scores 0-2 show set_bit/__set_bit performance with a good cache hit rate.
Scores 3-5 show set_bit/__set_bit performance with a bad cache hit rate.
Scores 6-7 show set_bit/__set_bit performance with a good cache hit rate
but a data dependency between accesses in the tight loop.

To Dave:
The cost of prefetch() is not measured here, because I found it is very
sensitive to what is done in the loop and difficult to measure in this
program. I found the cost of calling prefetch is a bit high; I'll
measure again whether prefetch() in the buddy allocator is good or bad.

I think this result shows I should use non-atomic ops when I can.

Thanks,
Kame

Hiroyuki KAMEZAWA wrote:
> Okay, I'll do more testing, and if I find that the atomic ops are slow,
> I'll add __XXXPagePrivate() macros.
>
> P.S. I usually test code on a 2-way Xeon 1.8GHz server.

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[-- Attachment #2: test-atomics.c --]

#include <stdio.h>
#include <stdlib.h>		/* malloc(), free() */
#include <string.h>		/* memset() */
#include <sys/mman.h>

/* Note: this program is written for Xeon */

/*
 * Stolen from Linux.
 */
#define ADDR (*(volatile long *) addr)

/*
 * set_bit - Atomically set a bit in memory
 * @nr: the bit to set
 * @addr: the address to start counting from
 *
 * This function is atomic and may not be reordered.  See __set_bit()
 * if you do not require the atomic guarantees.
 *
 * Note: there are no guarantees that this function will not be reordered
 * on non-x86 architectures, so if you are writing portable code,
 * make sure not to rely on its reordering guarantees.
 *
 * Note that @nr may be almost arbitrarily large; this function is not
 * restricted to acting on a single-word quantity.
 */
static inline void set_bit(int nr, volatile unsigned long *addr)
{
	__asm__ __volatile__(
		"lock ;"
		"btsl %1,%0"
		: "=m" (ADDR)
		: "Ir" (nr));
}

/**
 * __set_bit - Set a bit in memory
 * @nr: the bit to set
 * @addr: the address to start counting from
 *
 * Unlike set_bit(), this function is non-atomic and may be reordered.
 * If it's called on the same region of memory simultaneously, the effect
 * may be that only one operation succeeds.
 */
static inline void __set_bit(int nr, volatile unsigned long *addr)
{
	__asm__(
		"btsl %1,%0"
		: "=m" (ADDR)
		: "Ir" (nr));
}

#define rdtsc(low, high) \
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))

/*
 * Test params.
 */
#define CACHESIZE	(512 * 1024)		/* L2 cache size */
#define LCACHESIZE	(CACHESIZE/sizeof(long))
#define PAGESIZE	4096
#define LPAGESIZE	(PAGESIZE/sizeof(long))
#define MAX_TRY		(100)

#define NOCACHEMISS_NOATOMIC	0
#define NOCACHEMISS_ATOMIC	1
#define NOCACHEMISS_MIXTURE	2
#define NOATOMIC		3
#define ATOMIC			4
#define MIXTURE			5
#define NOATOMIC_DEPEND		6
#define MIXTURE_DEPEND		7
#define NR_OPS			8

char message[NR_OPS][64] = {
	"cache hit, no atomic",
	"cache hit, atomic",
	"cache hit, mixture",
	"cache miss, no atomic",
	"cache miss, atomic",
	"cache miss, mixture",
	"cache hit, dependency, noatomic",
	"cache hit, dependency, mixture"
};

#define LINESIZE	128			/* L2 line size */
#define LLINESIZE	(LINESIZE/sizeof(long))

/*
 * functions for preparing the cache state
 */
void hot_cache(char *buffer, int size)
{
	/* touch the buffer so it is resident in L2 */
	memset(buffer, 0, size);
}

void cold_cache(char *buffer, int size)
{
	/* touch a different buffer of the same size to push ours out of L2 */
	char *addr = malloc(size);
	memset(addr, 0, size);
	free(addr);
}

#define prefetch(addr) \
	__asm__ __volatile__("prefetcht0 %0" : : "m" (addr))

int main(int argc, char *argv[])
{
	unsigned long long score[NR_OPS][MAX_TRY];
	unsigned long long average_score[NR_OPS];
	unsigned long *map, *addr;
	struct {
		unsigned long low;
		unsigned long high;
	} start, end;
	int try, i, j;
	unsigned long long lstart, lend;

	map = mmap(NULL, CACHESIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANON, 0, 0);

	for (try = 0; try < MAX_TRY; try++) {
		/* no page fault, cache hit: no-atomic case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, map);
			__set_bit(2, map + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOCACHEMISS_NOATOMIC][try] = lend - lstart;

		/* no page fault, cache hit: atomic case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			set_bit(1, map);
			set_bit(2, map + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOCACHEMISS_ATOMIC][try] = lend - lstart;

		/* no page fault, cache hit: mixture case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, map);
			set_bit(2, map + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOCACHEMISS_MIXTURE][try] = lend - lstart;

		/* expire cache: no-atomic case */
		cold_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, addr);
			__set_bit(2, addr + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOATOMIC][try] = lend - lstart;

		/* expire cache: atomic case */
		cold_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			set_bit(1, addr);
			set_bit(2, addr + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[ATOMIC][try] = lend - lstart;

		/* expire cache: mixture case */
		cold_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, addr);
			set_bit(2, addr + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[MIXTURE][try] = lend - lstart;

		/* hot cache, data dependency: no-atomic case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, addr);
			__set_bit(2, addr);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOATOMIC_DEPEND][try] = lend - lstart;

		/* hot cache, data dependency: mixture case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, addr);
			set_bit(2, addr);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[MIXTURE_DEPEND][try] = lend - lstart;
	}

	for (j = 0; j < NR_OPS; j++) {
		average_score[j] = 0;
		for (i = 0; i < try; i++)
			average_score[j] += score[j][i];
		printf("score %d is %16lld  note: %s\n",
		       j, average_score[j] / try, message[j]);
	}
	return 0;
}
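For anyone wanting to reproduce the numbers: the program is
self-contained C with x86 inline assembly, so something along these
lines should build and run it (assuming gcc on an x86 box):

	gcc -O2 -o test-atomics test-atomics.c
	nice -10 ./test-atomics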
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Andrew Morton @ 2004-08-27 4:59 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA; +Cc: haveblue, linux-kernel, linux-mm, lhms-devel, wli

Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> I tested the set_bit()/__set_bit() ops, atomic and non-atomic, on my
> Xeon. This test is not perfect, but it shows some aspects of the
> performance of atomic ops.

Oh, atomic ops on a P4 hurt like hell. Try doing this:

	time dd if=/dev/null of=foo bs=1 count=1M

on an SMP kernel and compare it with a uniprocessor kernel. The
difference is large.

Certainly, executing an atomic op in a tight loop will show a lot of
difference. But that doesn't mean that making these operations
non-atomic makes a significant difference to overall kernel performance!

But whatever - it all adds up. The microoptimisation is fine - let's go
that way.

> Result:
> [root@kanex2 atomic]# nice -10 ./test-atomics
> score 0 is   64011  note: cache hit, no atomic
> score 1 is  543011  note: cache hit, atomic
> score 2 is  303901  note: cache hit, mixture
> score 3 is  344261  note: cache miss, no atomic
> score 4 is 1131085  note: cache miss, atomic
> score 5 is  593443  note: cache miss, mixture
> score 6 is  118455  note: cache hit, dependency, noatomic
> score 7 is  416195  note: cache hit, dependency, mixture
>
> A smaller score is better.
> Scores 0-2 show set_bit/__set_bit performance with a good cache hit rate.
> Scores 3-5 show set_bit/__set_bit performance with a bad cache hit rate.
> Scores 6-7 show set_bit/__set_bit performance with a good cache hit rate
> but a data dependency between accesses in the tight loop.

I _think_ the above means atomic ops are 10x more costly, yes?
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 5:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: haveblue, linux-kernel, linux-mm, lhms-devel, wli

Andrew Morton wrote:
> Certainly, executing an atomic op in a tight loop will show a lot of
> difference. But that doesn't mean that making these operations
> non-atomic makes a significant difference to overall kernel
> performance!

Thanks. My test before posting the patch called mmap()/munmap() with
4-16 megabytes. munmap() over that many megabytes causes many calls to
__free_pages_bulk(), and many pages are coalesced at once. That means
atomic ops in a heavily called tight loop (I called them 3 times in the
innermost loop...), and my test showed bad performance ;).

> But whatever - it all adds up. The microoptimisation is fine - let's
> go that way.

I'd like to add the macros and keep my code clean.

> I _think_ the above means atomic ops are 10x more costly, yes?

Yes, when the L2 cache hits, I think.

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
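For reference, the microbenchmark described above would look roughly
like this (a reconstruction from the description, not the exact program
used; the size and iteration count are assumptions):

	#include <string.h>
	#include <sys/mman.h>

	#define SZ	(8UL << 20)	/* somewhere in the 4-16MB range used */

	int main(void)
	{
		int i;
		for (i = 0; i < 1000; i++) {
			char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			memset(p, 0, SZ);	/* fault in every page */
			munmap(p, SZ);		/* bulk-frees pages through
						 * __free_pages_bulk() */
		}
		return 0;
	}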
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-27 5:04 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

On Thu, 2004-08-26 at 21:48, Hiroyuki KAMEZAWA wrote:
> I tested the set_bit()/__set_bit() ops, atomic and non-atomic, on my
> Xeon. This test is not perfect, but it shows some aspects of the
> performance of atomic ops.
>
> Program:
> The program touches memory in a tight loop, using atomic and
> non-atomic set_bit(). The memory size is 512k, the L2 cache size.
> I attach it to this mail, but it is configured for my Xeon and looks
> ugly :).
...
> To Dave:
> The cost of prefetch() is not measured here, because I found it is
> very sensitive to what is done in the loop and difficult to measure in
> this program. I found the cost of calling prefetch is a bit high; I'll
> measure again whether prefetch() in the buddy allocator is good or
> bad.
>
> I think this result shows I should use non-atomic ops when I can.

I think we all know that locked instructions are going to be slower.
However, what I wanted to see is how they influence a slightly more
realistic test, actually in the context of the kernel. Let's see how
much impact using prefetch() and atomic vs. non-atomic ops has when
they're used *in* the kernel, in a less contrived, less microbenchmarky
test.

How about finding some kind of benchmark that does a bunch of forking
and a bunch of page faulting, to cause lots of activity in the
allocator? I'd suggest something like http://ck.kolivas.org/kernbench/
or SDET if you can get your hands on it. Anybody else have some
suggestions?

About the atomic ops you're probably right, but it would still be nice
to have some hard data. As for prefetch(), we could scatter it and
unlikely() all over the kernel, but we only tend to do so when we can
either demonstrate a concrete gain, or it is a super-hot path. With
hot-and-cold pages around, even the allocator functions don't
necessarily count as super-hot.

I'll run kernbench and SDET with and without the atomic ops and
prefetch on some of my hardware and see what I come up with.

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 5:31 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

Dave Hansen wrote:
> I think we all know that locked instructions are going to be slower.
> However, what I wanted to see is how they influence a slightly more
> realistic test, actually in the context of the kernel.
>
> How about finding some kind of benchmark that does a bunch of forking
> and a bunch of page faulting, to cause lots of activity in the
> allocator?

I couldn't find a suitable one, so I tested with a microbenchmark
calling mmap() and munmap(). As you say, a real-world workload test is
more suitable for measuring the kernel's performance.

> I'd suggest something like http://ck.kolivas.org/kernbench/ or SDET if
> you can get your hands on it. Anybody else have some suggestions?

Thanks.

> About the atomic ops you're probably right, but it would still be nice
> to have some hard data. As for prefetch(), we could scatter it and
> unlikely() all over the kernel, but we only tend to do so when we can
> either demonstrate a concrete gain, or it is a super-hot path. With
> hot-and-cold pages around, even the allocator functions don't
> necessarily count as super-hot.

Hmm, my test program mmap()s/munmap()s megabytes of pages, so the
hot-and-cold page lists don't help much, because the current batch size
is 16. My test might be a bit of a special case.

-- Kame

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-27 5:31 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

On Thu, 2004-08-26 at 22:31, Hiroyuki KAMEZAWA wrote:
> I couldn't find a suitable one, so I tested with a microbenchmark
> calling mmap() and munmap(). As you say, a real-world workload test is
> more suitable for measuring the kernel's performance.

Sorry, I thought you were just running a loop with atomic operations
inside of it, not actually exercising the kernel itself. The test you
described in your mail to Andrew sounds much more useful than what I
thought it was.

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-27 5:47 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

I can't seem to get patch 2 to apply. Were they against a plain
2.6.8.1-mm4 tree, or were some other patches involved too? They were
supposed to go in numerical order, right?

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 6:09 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

[-- Attachment #1: Type: text/plain --]

Hi,

They are against a pure 2.6.8.1-mm4 tree. I'm afraid there was a cut &
paste mistake on my part, so I'm sending them again as a tgz file.

Patch order:
1. eliminate-bitmap-includes.patch
2. eliminate-bitmap-init.patch
3. eliminate-bitmap-alloc.patch
4. eliminate-bitmap-free.patch
5. eliminate-bitmap-prefetch.patch

-- Kame

Dave Hansen wrote:
> I can't seem to get patch 2 to apply. Were they against a plain
> 2.6.8.1-mm4 tree, or were some other patches involved too? They were
> supposed to go in numerical order, right?

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[-- Attachment #2: eliminate-bitmap.tgz --]
[-- Type: application/x-compressed, Size: 7538 bytes --]