* [RFC] buddy allocator without bitmap [2/4]
@ 2004-08-26 12:03 Hiroyuki KAMEZAWA
From: Hiroyuki KAMEZAWA @ 2004-08-26 12:03 UTC (permalink / raw)
To: Linux Kernel ML; +Cc: linux-mm, LHMS, William Lee Irwin III, Dave Hansen
This is the 3rd part, for page allocation.
PG_private is used to indicate:
"this page is the head of a run of contiguous free pages whose length is 2^(page->private)"
-- Kame
========================
This patch removes the bitmap operations from alloc_pages().
Instead of the MARK_USED() bitmap operation, the patch records each
chunk's order in the page struct itself, in the page->private field.
While zone->lock is held, the returned page's PG_private flag is cleared,
and the new heads of the remaining contiguous 2^n-page chunks are linked
onto free_area[]; they are all marked PG_private, and their page->private
fields record their orders.
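In code terms, the invariant is roughly this (an illustrative sketch, not
part of the patch; the helper name is made up):

	/*
	 * With this patch, a free buddy head is recognizable without any
	 * bitmap: PG_private is set and page->private holds the chunk's
	 * order.  Callers hold zone->lock, so non-atomic access is fine.
	 */
	static inline int page_is_free_head(struct page *page, unsigned int order)
	{
		return (page->flags & (1 << PG_private)) &&
		       page->private == order;
	}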
example) allocating 1 page from an 8-page chunk
start ) before calling alloc_pages()
free_area[3] -> page[0],order=3
free_area[2] ->
free_area[1] ->
free_area[0] ->
An 8-page chunk starting at page[0] is linked on free_area[3].free_list;
free_area[2], free_area[1], and free_area[0] are empty.
step1 ) before calling expand()
free_area[3] ->
free_area[2] ->
free_area[1] ->
free_area[0] ->
return page -> page[0],order=invalid
Because free_area[2], free_area[1], and free_area[0] are empty,
page[0] is taken from free_area[3].
expand() is called to divide page[0-7] into suitably sized chunks.
step2 ) expand loop 1st
free_area[3] ->
free_area[2] -> page[4],order = 2
free_area[1] ->
free_area[0] ->
return page -> page[0],order=invalid
The upper half of page[0-7], page[4-7], is now free with order 2;
page[4] is linked on free_area[2].free_list.
step3 ) expand loop 2nd
free_area[3] ->
free_area[2] -> page[4],order = 2
free_area[1] -> page[2],order = 1
free_area[0] ->
return page -> page[0],order=invalid
The upper half of page[0-3], page[2-3], is now free with order 1;
page[2] is linked on free_area[1].free_list.
step4 ) expand loop 3rd
free_area[3] ->
free_area[2] -> page[4],order = 2
free_area[1] -> page[2],order = 1
free_area[0] -> page[1],order = 0
return page -> page[0],order=invalid
The upper half of page[0-1], page[1], is now free with order 0;
page[1] is linked on free_area[0].free_list.
end )
The page[0-7] chunk has been divided into:
page[4-7] of order 2
page[2-3] of order 1
page[1] of order 0
page[0] is allocated.
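The loop that produces the steps above is essentially the following (a
simplified sketch of expand() as modified by this patch, with the
BUG_ON() checking dropped):

	/* split a 2^high chunk down to 2^low, freeing the upper halves */
	while (high > low) {
		area--;			/* step down to the next-lower order's free_area */
		high--;
		size >>= 1;
		list_add(&page[size].lru, &area->free_list);
		page[size].flags |= (1 << PG_private);	/* mark new free head */
		page[size].private = high;		/* record its order */
	}
	return page;	/* page[0] goes back to the caller */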
---
linux-2.6.8.1-mm4-kame-kamezawa/mm/page_alloc.c | 16 ++++++----------
1 files changed, 6 insertions(+), 10 deletions(-)
diff -puN mm/page_alloc.c~eliminate-bitmap-alloc mm/page_alloc.c
--- linux-2.6.8.1-mm4-kame/mm/page_alloc.c~eliminate-bitmap-alloc 2004-08-26 08:43:16.000000000 +0900
+++ linux-2.6.8.1-mm4-kame-kamezawa/mm/page_alloc.c 2004-08-26 11:40:29.461979560 +0900
@@ -288,9 +288,6 @@ void __free_pages_ok(struct page *page,
free_pages_bulk(page_zone(page), 1, &list, order);
}
-#define MARK_USED(index, order, area) \
- __change_bit((index) >> (1+(order)), (area)->map)
-
/*
* The order of subdivision here is critical for the IO subsystem.
* Please do not alter this order without good reasons and regression
@@ -307,7 +304,7 @@ void __free_pages_ok(struct page *page,
*/
static inline struct page *
expand(struct zone *zone, struct page *page,
- unsigned long index, int low, int high, struct free_area *area)
+ int low, int high, struct free_area *area)
{
unsigned long size = 1 << high;
@@ -317,7 +314,8 @@ expand(struct zone *zone, struct page *p
size >>= 1;
BUG_ON(bad_range(zone, &page[size]));
list_add(&page[size].lru, &area->free_list);
- MARK_USED(index + size, high, area);
+ page[size].flags |= (1 << PG_private);
+ page[size].private = high;
}
return page;
}
@@ -371,7 +369,6 @@ static struct page *__rmqueue(struct zon
struct free_area * area;
unsigned int current_order;
struct page *page;
- unsigned int index;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = zone->free_area + current_order;
@@ -380,11 +377,10 @@ static struct page *__rmqueue(struct zon
page = list_entry(area->free_list.next, struct page, lru);
list_del(&page->lru);
- index = page - zone->zone_mem_map;
- if (current_order != MAX_ORDER-1)
- MARK_USED(index, current_order, area);
+ /* Atomic operation is needless here */
+ page->flags &= ~(1 << PG_private);
zone->free_pages -= 1UL << order;
- return expand(zone, page, index, order, current_order, area);
+ return expand(zone, page, order, current_order, area);
}
return NULL;
_
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-26 15:50 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA; +Cc: Linux Kernel ML, linux-mm, lhms, William Lee Irwin III

On Thu, 2004-08-26 at 05:03, Hiroyuki KAMEZAWA wrote:
> -		MARK_USED(index + size, high, area);
> +		page[size].flags |= (1 << PG_private);
> +		page[size].private = high;
>  	}
>  	return page;
>  }
...
> +	/* Atomic operation is needless here */
> +	page->flags &= ~(1 << PG_private);

See linux/page-flags.h:

#define SetPagePrivate(page)	set_bit(PG_private, &(page)->flags)
#define ClearPagePrivate(page)	clear_bit(PG_private, &(page)->flags)
#define PagePrivate(page)	test_bit(PG_private, &(page)->flags)

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-26 23:05 UTC (permalink / raw)
To: Dave Hansen; +Cc: Linux Kernel ML, linux-mm, lhms, William Lee Irwin III

Hi,

I understand that using these macros cleans up the code; I used them in
my previous version.

In the previous version, I used
SetPagePrivate()/ClearPagePrivate()/PagePrivate().
But these are "atomic" operations and look very slow.
This is why I didn't use these macros in this version.

My previous version, which used set_bit/test_bit/clear_bit, showed very
bad performance in my tests, so I replaced them.

If I made a mistake in measuring the performance and
set_bit/test_bit/clear_bit are faster than I think, I'd like to use
them again.

-- Kame

Dave Hansen wrote:
> On Thu, 2004-08-26 at 05:03, Hiroyuki KAMEZAWA wrote:
>> -		MARK_USED(index + size, high, area);
>> +		page[size].flags |= (1 << PG_private);
>> +		page[size].private = high;
>>  	}
>>  	return page;
>>  }
>
> ...
>
>> +	/* Atomic operation is needless here */
>> +	page->flags &= ~(1 << PG_private);
>
> See linux/page-flags.h:
>
> #define SetPagePrivate(page)	set_bit(PG_private, &(page)->flags)
> #define ClearPagePrivate(page)	clear_bit(PG_private, &(page)->flags)
> #define PagePrivate(page)	test_bit(PG_private, &(page)->flags)
>
> -- Dave

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-26 23:11 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA; +Cc: Linux Kernel ML, linux-mm, lhms, William Lee Irwin III

On Thu, 2004-08-26 at 16:05, Hiroyuki KAMEZAWA wrote:
> I understand that using these macros cleans up the code; I used them
> in my previous version.
>
> In the previous version, I used
> SetPagePrivate()/ClearPagePrivate()/PagePrivate().
> But these are "atomic" operations and look very slow.
> This is why I didn't use these macros in this version.
>
> My previous version, which used set_bit/test_bit/clear_bit, showed
> very bad performance in my tests, so I replaced them.
>
> If I made a mistake in measuring the performance and
> set_bit/test_bit/clear_bit are faster than I think, I'd like to use
> them again.

Sorry, I misread your comment:

	/* Atomic operation is needless here */

I read "needless" as "needed". Would it make any more sense to you to
say "already have lock, don't need atomic ops" instead?

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-26 23:28 UTC (permalink / raw)
To: Dave Hansen; +Cc: Linux Kernel ML, linux-mm, lhms, William Lee Irwin III

Dave Hansen wrote:
> Sorry, I misread your comment:
>
>	/* Atomic operation is needless here */
>
> I read "needless" as "needed". Would it make any more sense to you to
> say "already have lock, don't need atomic ops" instead?

Thanks. I'm not so good at writing comments, as you know :).
I'll rewrite it.

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Andrew Morton @ 2004-08-27 0:18 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA; +Cc: haveblue, linux-kernel, linux-mm, lhms-devel, wli

Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> In the previous version, I used
> SetPagePrivate()/ClearPagePrivate()/PagePrivate().
> But these are "atomic" operations and look very slow.
> This is why I didn't use these macros in this version.
>
> My previous version, which used set_bit/test_bit/clear_bit, showed
> very bad performance in my tests, so I replaced them.

That's surprising. But if you do intend to use non-atomic bitops then
please add __SetPagePrivate() and __ClearPagePrivate().
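Presumably these would mirror the existing atomic helpers but use the
non-atomic bitops, i.e. something like the following (a sketch; the
exact definitions were not spelled out in this thread):

	/*
	 * Non-atomic variants: only safe while zone->lock (or some other
	 * form of exclusion) guarantees nobody else can touch page->flags.
	 */
	#define __SetPagePrivate(page)   __set_bit(PG_private, &(page)->flags)
	#define __ClearPagePrivate(page) __clear_bit(PG_private, &(page)->flags)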
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 0:27 UTC (permalink / raw)
To: Andrew Morton; +Cc: haveblue, linux-kernel, linux-mm, lhms-devel, wli

Okay, I'll do more testing, and if I find that the atomic ops are slow,
I'll add __XXXPagePrivate() macros.

P.S. I usually test code on a 2-way Xeon 1.8GHz server.

-- Kame

Andrew Morton wrote:
> That's surprising. But if you do intend to use non-atomic bitops then
> please add __SetPagePrivate() and __ClearPagePrivate().

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 4:48 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA
Cc: Andrew Morton, haveblue, linux-kernel, linux-mm, lhms-devel, wli

[-- Attachment #1: Type: text/plain --]

Hi,

I tested the set_bit()/__set_bit() ops, atomic and non-atomic, on my
Xeon. This test is not perfect, but it shows some aspects of the
performance of atomic ops.

Program:
The program touches memory in a tight loop, using atomic and non-atomic
set_bit(). The memory size is 512k, the L2 cache size.
I attach it to this mail, but it is configured for my Xeon and looks
ugly :).

My CPU, from /proc/cpuinfo:
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) XEON(TM) MP CPU 1.90GHz
stepping	: 2
cpu MHz		: 1891.582
cache size	: 512 KB
CPU		: Intel Xeon 1.8GHz

Result:
[root@kanex2 atomic]# nice -10 ./test-atomics
score 0 is   64011  note: cache hit, no atomic
score 1 is  543011  note: cache hit, atomic
score 2 is  303901  note: cache hit, mixture
score 3 is  344261  note: cache miss, no atomic
score 4 is 1131085  note: cache miss, atomic
score 5 is  593443  note: cache miss, mixture
score 6 is  118455  note: cache hit, dependency, noatomic
score 7 is  416195  note: cache hit, dependency, mixture

A smaller score is better.
Scores 0-2 show set_bit/__set_bit performance with a good cache hit rate.
Scores 3-5 show set_bit/__set_bit performance with a bad cache hit rate.
Scores 6-7 show set_bit/__set_bit performance with a good cache hit rate
but a data dependency between accesses in the tight loop.

To Dave:
The cost of prefetch() is not measured here, because I found it is very
sensitive to what is done in the loop and difficult to measure in this
program. I found the cost of calling prefetch is a bit high; I'll
measure again whether prefetch() in the buddy allocator is good or bad.

I think this result shows I should use non-atomic ops when I can.

Thanks,
Kame

Hiroyuki KAMEZAWA wrote:
> Okay, I'll do more testing, and if I find that the atomic ops are slow,
> I'll add __XXXPagePrivate() macros.
>
> P.S. I usually test code on a 2-way Xeon 1.8GHz server.

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[-- Attachment #2: test-atomics.c --]

#include <stdio.h>
#include <stdlib.h>		/* malloc(), free() */
#include <string.h>		/* memset() */
#include <sys/mman.h>

/* Note: this program is written for Xeon */

/*
 * Stolen from Linux.
 */
#define ADDR (*(volatile long *) addr)

/*
 * set_bit - Atomically set a bit in memory
 * @nr: the bit to set
 * @addr: the address to start counting from
 *
 * This function is atomic and may not be reordered.  See __set_bit()
 * if you do not require the atomic guarantees.
 *
 * Note: there are no guarantees that this function will not be reordered
 * on non-x86 architectures, so if you are writing portable code,
 * make sure not to rely on its reordering guarantees.
 *
 * Note that @nr may be almost arbitrarily large; this function is not
 * restricted to acting on a single-word quantity.
 */
static inline void set_bit(int nr, volatile unsigned long *addr)
{
	__asm__ __volatile__(
		"lock ;"
		"btsl %1,%0"
		: "=m" (ADDR)
		: "Ir" (nr));
}

/**
 * __set_bit - Set a bit in memory
 * @nr: the bit to set
 * @addr: the address to start counting from
 *
 * Unlike set_bit(), this function is non-atomic and may be reordered.
 * If it's called on the same region of memory simultaneously, the effect
 * may be that only one operation succeeds.
 */
static inline void __set_bit(int nr, volatile unsigned long *addr)
{
	__asm__(
		"btsl %1,%0"
		: "=m" (ADDR)
		: "Ir" (nr));
}

#define rdtsc(low, high) \
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))

/*
 * Test params.
 */
#define CACHESIZE	(512 * 1024)		/* L2 cache size */
#define LCACHESIZE	(CACHESIZE/sizeof(long))
#define PAGESIZE	4096
#define LPAGESIZE	(PAGESIZE/sizeof(long))
#define MAX_TRY		(100)

#define NOCACHEMISS_NOATOMIC	0
#define NOCACHEMISS_ATOMIC	1
#define NOCACHEMISS_MIXTURE	2
#define NOATOMIC		3
#define ATOMIC			4
#define MIXTURE			5
#define NOATOMIC_DEPEND		6
#define MIXTURE_DEPEND		7
#define NR_OPS			8

char message[NR_OPS][64] = {
	"cache hit, no atomic",
	"cache hit, atomic",
	"cache hit, mixture",
	"cache miss, no atomic",
	"cache miss, atomic",
	"cache miss, mixture",
	"cache hit, dependency, noatomic",
	"cache hit, dependency, mixture"
};

#define LINESIZE	128			/* L2 line size */
#define LLINESIZE	(LINESIZE/sizeof(long))

/*
 * functions for preparing the cache state
 */
void hot_cache(char *buffer, int size)
{
	/* touch the buffer so it is resident in L2 */
	memset(buffer, 0, size);
}

void cold_cache(char *buffer, int size)
{
	/* touch a different buffer of the same size to push ours out of L2 */
	char *addr = malloc(size);
	memset(addr, 0, size);
	free(addr);
}

#define prefetch(addr) \
	__asm__ __volatile__("prefetcht0 %0" : : "m" (addr))

int main(int argc, char *argv[])
{
	unsigned long long score[NR_OPS][MAX_TRY];
	unsigned long long average_score[NR_OPS];
	unsigned long *map, *addr;
	struct {
		unsigned long low;
		unsigned long high;
	} start, end;
	int try, i, j;
	unsigned long long lstart, lend;

	map = mmap(NULL, CACHESIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANON, 0, 0);

	for (try = 0; try < MAX_TRY; try++) {
		/* no page fault, cache hit: no-atomic case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, map);
			__set_bit(2, map + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOCACHEMISS_NOATOMIC][try] = lend - lstart;

		/* no page fault, cache hit: atomic case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			set_bit(1, map);
			set_bit(2, map + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOCACHEMISS_ATOMIC][try] = lend - lstart;

		/* no page fault, cache hit: mixture case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, map);
			set_bit(2, map + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOCACHEMISS_MIXTURE][try] = lend - lstart;

		/* expire cache: no-atomic case */
		cold_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, addr);
			__set_bit(2, addr + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOATOMIC][try] = lend - lstart;

		/* expire cache: atomic case */
		cold_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			set_bit(1, addr);
			set_bit(2, addr + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[ATOMIC][try] = lend - lstart;

		/* expire cache: mixture case */
		cold_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, addr);
			set_bit(2, addr + LLINESIZE);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[MIXTURE][try] = lend - lstart;

		/* hot cache, data dependency: no-atomic case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, addr);
			__set_bit(2, addr);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[NOATOMIC_DEPEND][try] = lend - lstart;

		/* hot cache, data dependency: mixture case */
		hot_cache((char *)map, CACHESIZE);
		rdtsc(start.low, start.high);
		for (addr = map; addr != map + LCACHESIZE; addr += LLINESIZE * 2) {
			__set_bit(1, addr);
			set_bit(2, addr);
		}
		rdtsc(end.low, end.high);
		lstart = (unsigned long long)start.high << 32 | start.low;
		lend = (unsigned long long)end.high << 32 | end.low;
		score[MIXTURE_DEPEND][try] = lend - lstart;
	}

	for (j = 0; j < NR_OPS; j++) {
		average_score[j] = 0;
		for (i = 0; i < try; i++)
			average_score[j] += score[j][i];
		printf("score %d is %16lld  note: %s\n",
		       j, average_score[j] / try, message[j]);
	}
	return 0;
}
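For anyone wanting to reproduce the numbers: the program is
self-contained C with x86 inline assembly, so something along these
lines should build and run it (assuming gcc on an x86 box):

	gcc -O2 -o test-atomics test-atomics.c
	nice -10 ./test-atomics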
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Andrew Morton @ 2004-08-27 4:59 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA; +Cc: haveblue, linux-kernel, linux-mm, lhms-devel, wli

Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> I tested the set_bit()/__set_bit() ops, atomic and non-atomic, on my
> Xeon. This test is not perfect, but it shows some aspects of the
> performance of atomic ops.

Oh, atomic ops on a P4 hurt like hell. Try doing this:

	time dd if=/dev/null of=foo bs=1 count=1M

on an SMP kernel and compare it with a uniprocessor kernel. The
difference is large.

Certainly, executing an atomic op in a tight loop will show a lot of
difference. But that doesn't mean that making these operations
non-atomic makes a significant difference to overall kernel performance!

But whatever - it all adds up. The microoptimisation is fine - let's go
that way.

> Result:
> [root@kanex2 atomic]# nice -10 ./test-atomics
> score 0 is   64011  note: cache hit, no atomic
> score 1 is  543011  note: cache hit, atomic
> score 2 is  303901  note: cache hit, mixture
> score 3 is  344261  note: cache miss, no atomic
> score 4 is 1131085  note: cache miss, atomic
> score 5 is  593443  note: cache miss, mixture
> score 6 is  118455  note: cache hit, dependency, noatomic
> score 7 is  416195  note: cache hit, dependency, mixture
>
> A smaller score is better.
> Scores 0-2 show set_bit/__set_bit performance with a good cache hit rate.
> Scores 3-5 show set_bit/__set_bit performance with a bad cache hit rate.
> Scores 6-7 show set_bit/__set_bit performance with a good cache hit rate
> but a data dependency between accesses in the tight loop.

I _think_ the above means atomic ops are 10x more costly, yes?
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 5:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: haveblue, linux-kernel, linux-mm, lhms-devel, wli

Andrew Morton wrote:
> Certainly, executing an atomic op in a tight loop will show a lot of
> difference. But that doesn't mean that making these operations
> non-atomic makes a significant difference to overall kernel
> performance!

Thanks. My test before posting the patch called mmap()/munmap() with
4-16 megabytes. munmap() over that many megabytes causes many calls to
__free_pages_bulk(), and many pages are coalesced at once. That means
atomic ops in a heavily called tight loop (I called them 3 times in the
innermost loop...), and my test showed bad performance ;).

> But whatever - it all adds up. The microoptimisation is fine - let's
> go that way.

I'd like to add the macros and keep my code clean.

> I _think_ the above means atomic ops are 10x more costly, yes?

Yes, when the L2 cache hits, I think.

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
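For reference, the microbenchmark described above would look roughly
like this (a reconstruction from the description, not the exact program
used; the size and iteration count are assumptions):

	#include <string.h>
	#include <sys/mman.h>

	#define SZ	(8UL << 20)	/* somewhere in the 4-16MB range used */

	int main(void)
	{
		int i;
		for (i = 0; i < 1000; i++) {
			char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			memset(p, 0, SZ);	/* fault in every page */
			munmap(p, SZ);		/* bulk-frees pages through
						 * __free_pages_bulk() */
		}
		return 0;
	}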
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-27 5:04 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

On Thu, 2004-08-26 at 21:48, Hiroyuki KAMEZAWA wrote:
> I tested the set_bit()/__set_bit() ops, atomic and non-atomic, on my
> Xeon. This test is not perfect, but it shows some aspects of the
> performance of atomic ops.
>
> Program:
> The program touches memory in a tight loop, using atomic and
> non-atomic set_bit(). The memory size is 512k, the L2 cache size.
> I attach it to this mail, but it is configured for my Xeon and looks
> ugly :).
...
> To Dave:
> The cost of prefetch() is not measured here, because I found it is
> very sensitive to what is done in the loop and difficult to measure in
> this program. I found the cost of calling prefetch is a bit high; I'll
> measure again whether prefetch() in the buddy allocator is good or
> bad.
>
> I think this result shows I should use non-atomic ops when I can.

I think we all know that locked instructions are going to be slower.
However, what I wanted to see is how they influence a slightly more
realistic test, actually in the context of the kernel. Let's see how
much impact using prefetch() and atomic vs. non-atomic ops has when
they're used *in* the kernel, in a less contrived, less microbenchmarky
test.

How about finding some kind of benchmark that does a bunch of forking
and a bunch of page faulting, to cause lots of activity in the
allocator? I'd suggest something like http://ck.kolivas.org/kernbench/
or SDET if you can get your hands on it. Anybody else have some
suggestions?

About the atomic ops you're probably right, but it would still be nice
to have some hard data. As for prefetch(), we could scatter it and
unlikely() all over the kernel, but we only tend to do so when we can
either demonstrate a concrete gain, or it is a super-hot path. With
hot-and-cold pages around, even the allocator functions don't
necessarily count as super-hot.

I'll run kernbench and SDET with and without the atomic ops and
prefetch on some of my hardware and see what I come up with.

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 5:31 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

Dave Hansen wrote:
> I think we all know that locked instructions are going to be slower.
> However, what I wanted to see is how they influence a slightly more
> realistic test, actually in the context of the kernel.
>
> How about finding some kind of benchmark that does a bunch of forking
> and a bunch of page faulting, to cause lots of activity in the
> allocator?

I couldn't find a suitable one, so I tested with a microbenchmark
calling mmap() and munmap(). As you say, a real-world workload test is
more suitable for measuring the kernel's performance.

> I'd suggest something like http://ck.kolivas.org/kernbench/ or SDET if
> you can get your hands on it. Anybody else have some suggestions?

Thanks.

> About the atomic ops you're probably right, but it would still be nice
> to have some hard data. As for prefetch(), we could scatter it and
> unlikely() all over the kernel, but we only tend to do so when we can
> either demonstrate a concrete gain, or it is a super-hot path. With
> hot-and-cold pages around, even the allocator functions don't
> necessarily count as super-hot.

Hmm, my test program mmap()s/munmap()s megabytes of pages, so the
hot-and-cold page lists don't help much, because the current batch size
is 16. My test might be a bit of a special case.

-- Kame

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-27 5:31 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

On Thu, 2004-08-26 at 22:31, Hiroyuki KAMEZAWA wrote:
> I couldn't find a suitable one, so I tested with a microbenchmark
> calling mmap() and munmap(). As you say, a real-world workload test is
> more suitable for measuring the kernel's performance.

Sorry, I thought you were just running a loop with atomic operations
inside of it, not actually exercising the kernel itself. The test you
described in your mail to Andrew sounds much more useful than what I
thought it was.

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Dave Hansen @ 2004-08-27 5:47 UTC (permalink / raw)
To: Hiroyuki KAMEZAWA
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

I can't seem to get patch 2 to apply. Were they against a plain
2.6.8.1-mm4 tree, or were some other patches involved too? They were
supposed to go in numerical order, right?

-- Dave
* Re: [Lhms-devel] [RFC] buddy allocator without bitmap [2/4]
From: Hiroyuki KAMEZAWA @ 2004-08-27 6:09 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, lhms, William Lee Irwin III

[-- Attachment #1: Type: text/plain --]

Hi,

They are against a pure 2.6.8.1-mm4 tree. I'm afraid there was a cut &
paste mistake on my part, so I'm sending them again as a tgz file.

Patch order:
1. eliminate-bitmap-includes.patch
2. eliminate-bitmap-init.patch
3. eliminate-bitmap-alloc.patch
4. eliminate-bitmap-free.patch
5. eliminate-bitmap-prefetch.patch

-- Kame

Dave Hansen wrote:
> I can't seem to get patch 2 to apply. Were they against a plain
> 2.6.8.1-mm4 tree, or were some other patches involved too? They were
> supposed to go in numerical order, right?

--
--the clue is these footmarks leading to the door.--
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[-- Attachment #2: eliminate-bitmap.tgz --]
[-- Type: application/x-compressed, Size: 7538 bytes --]