* [rfc] lockless pagecache
@ 2005-06-27 6:29 Nick Piggin
2005-06-27 6:32 ` [patch 1] mm: PG_free flag Nick Piggin
` (3 more replies)
0 siblings, 4 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:29 UTC (permalink / raw)
To: linux-kernel, Linux Memory Management
Hi,
This is going to be a fairly long and probably incoherent post. The
idea and implementation are not completely analysed for holes, and
I wouldn't be surprised if some (even fatal ones) exist.
That said, I wanted something to talk about at Ottawa and I think
this is a promising idea - it is at the stage where it would be good
to have interested parties pick it apart. BTW. this is my main reason
for the PageReserved removal patches, so if this falls apart then
some good will have come from it! :)
OK, so my aim is to remove the requirement to take mapping->tree_lock
when looking up pagecache pages (eg. for a read/write or nopage fault).
Note that this does not deal with insertion and removal of pages from
pagecache mappings - that is usually a slower path operation associated
with IO or page reclaim or truncate. However if there was interest in
making these paths more scalable, there are possibilities for that too.
What for? Well there are probably lots of reasons, but suppose you have
a big app with lots of processes all mmaping and playing around with
various parts of the same big file (say, a shared memory file), then
you might start seeing problems if you want to scale this workload up
to say 32+ CPUs.
Now the tree_lock was recently(ish) converted to an rwlock, precisely
for such a workload and that was apparently very successful. However
an rwlock is significantly heavier, and as machines get faster and
bigger, rwlocks (and any locks) will tend to use more and more of Paul
McKenney's toilet paper due to cacheline bouncing.
So in the interest of saving some trees, let's try it without any locks.
First I'll put up some numbers to get you interested - of a 64-way Altix
with 64 processes each read-faulting in their own 512MB part of a 32GB
file that is preloaded in pagecache (with the proper NUMA memory
allocation).
[best of 5 runs]
plain 2.6.12-git4:
1 proc 0.65u 1.43s 2.09e 99%CPU
64 proc 0.75u 291.30s 4.92e 5927%CPU
64 proc prof:
3242763 total 0.5366
1269413 _read_unlock_irq 19834.5781
842042 do_no_page 355.5921
779373 cond_resched 3479.3438
100667 ia64_pal_call_static 524.3073
96469 _spin_lock 1004.8854
92857 default_idle 241.8151
25572 filemap_nopage 15.6691
11981 ia64_load_scratch_fpregs 187.2031
11671 ia64_save_scratch_fpregs 182.3594
2566 page_fault 2.5867
It has slowed down by a factor of 2.5 when going from serial to 64-way, and
that is due to mapping->tree_lock. The serial run is even at the disadvantage
of reading from remote memory 62 times out of 64.
2.6.12-git4-lockless:
1 proc 0.66u 1.38s 2.04e 99%CPU
64 proc 0.68u 1.42s 0.12e 1686%CPU
64 proc prof:
81934 total 0.0136
31108 ia64_pal_call_static 162.0208
28394 default_idle 73.9427
3796 ia64_save_scratch_fpregs 59.3125
3736 ia64_load_scratch_fpregs 58.3750
2208 page_fault 2.2258
1380 unmap_vmas 0.3292
1298 __mod_page_state 8.1125
1089 do_no_page 0.4599
830 find_get_page 2.5938
781 ia64_do_page_fault 0.2805
So we have increased performance exactly 17x when going from 1 to 64-way;
however, if you look at the CPU utilisation figure and the elapsed time,
you'll see my test didn't provide enough work to keep all CPUs busy, and
for the amount of CPU time used, we appear to have perfect scalability.
In fact, it is slightly superlinear, probably due to remote memory access
on the serial run.
I'll reply to this post with the series of commented patches which is
probably the best way to explain how it is done. They are against
2.6.12-git4 + some future iteration of the PageReserved patches. I
can provide the complete rollup privately on request.
Comments, flames, laughing me out of town, etc. are all very welcome.
Nick
--
SUSE Labs, Novell Inc.
* [patch 1] mm: PG_free flag
2005-06-27 6:29 [rfc] lockless pagecache Nick Piggin
@ 2005-06-27 6:32 ` Nick Piggin
2005-06-27 6:32 ` [patch 2] mm: speculative get_page Nick Piggin
2005-06-27 6:43 ` VFS scalability (was: [rfc] lockless pagecache) Nick Piggin
` (2 subsequent siblings)
3 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:32 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
[-- Attachment #1: Type: text/plain, Size: 28 bytes --]
--
SUSE Labs, Novell Inc.
[-- Attachment #2: mm-PG_free-flag.patch --]
[-- Type: text/plain, Size: 2886 bytes --]
In a future patch we can no longer rely on page_count being stable at any
time, so we can no longer overload PagePrivate && page_count == 0 to mean
the page is free and on the buddy lists.
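The point can be seen in a small user-space sketch (struct page_model, the PG_* values and both helpers are illustrative stand-ins, not the kernel's definitions): once a speculative get_page can transiently raise the refcount of a page that is in fact free, the old "PagePrivate && page_count == 0" test gives a false negative, while a dedicated flag stays reliable.

```c
#include <assert.h>
#include <stdbool.h>

#define PG_private (1u << 0)
#define PG_free    (1u << 1)

struct page_model {
	unsigned int flags;
	int count;		/* models page_count() */
	int order;
};

/* Old test: only correct if page_count is stable at 0 for free pages. */
static bool page_is_buddy_old(const struct page_model *p, int order)
{
	return (p->flags & PG_private) && p->order == order && p->count == 0;
}

/* New test: a speculative get_page may transiently elevate ->count on
 * a free page, so the buddy check must not consult the refcount. */
static bool page_is_buddy_new(const struct page_model *p, int order)
{
	return (p->flags & PG_free) && p->order == order;
}
```

With a transient speculative reference (count == 1) on a free page, the old check wrongly refuses to treat it as a buddy, while the flag-based check still succeeds.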
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -76,6 +76,8 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
+#define PG_free 20 /* Page is on the free lists */
+
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
* allowed.
@@ -306,6 +308,10 @@ extern void __mod_page_state(unsigned lo
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageFree(page) test_bit(PG_free, &(page)->flags)
+#define __SetPageFree(page) __set_bit(PG_free, &(page)->flags)
+#define __ClearPageFree(page) __clear_bit(PG_free, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -114,7 +114,8 @@ static void bad_page(const char *functio
1 << PG_slab |
1 << PG_swapcache |
1 << PG_writeback |
- 1 << PG_reserved );
+ 1 << PG_reserved |
+ 1 << PG_free );
set_page_count(page, 0);
reset_page_mapcount(page);
page->mapping = NULL;
@@ -191,12 +192,12 @@ static inline unsigned long page_order(s
static inline void set_page_order(struct page *page, int order) {
page->private = order;
- __SetPagePrivate(page);
+ __SetPageFree(page);
}
static inline void rmv_page_order(struct page *page)
{
- __ClearPagePrivate(page);
+ __ClearPageFree(page);
page->private = 0;
}
@@ -242,9 +243,7 @@ __find_combined_index(unsigned long page
*/
static inline int page_is_buddy(struct page *page, int order)
{
- if (PagePrivate(page) &&
- (page_order(page) == order) &&
- page_count(page) == 0)
+ if (PageFree(page) && (page_order(page) == order))
return 1;
return 0;
}
@@ -327,7 +326,8 @@ static inline void free_pages_check(cons
1 << PG_slab |
1 << PG_swapcache |
1 << PG_writeback |
- 1 << PG_reserved )))
+ 1 << PG_reserved |
+ 1 << PG_free )))
bad_page(function, page);
if (PageDirty(page))
__ClearPageDirty(page);
@@ -456,7 +456,8 @@ static void prep_new_page(struct page *p
1 << PG_slab |
1 << PG_swapcache |
1 << PG_writeback |
- 1 << PG_reserved )))
+ 1 << PG_reserved |
+ 1 << PG_free )))
bad_page(__FUNCTION__, page);
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
* [patch 2] mm: speculative get_page
2005-06-27 6:32 ` [patch 1] mm: PG_free flag Nick Piggin
@ 2005-06-27 6:32 ` Nick Piggin
2005-06-27 6:33 ` [patch 3] radix tree: lookup_slot Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:32 UTC (permalink / raw)
To: linux-kernel, Linux Memory Management
[-- Attachment #1: Type: text/plain, Size: 28 bytes --]
--
SUSE Labs, Novell Inc.
[-- Attachment #2: mm-speculative-get_page.patch --]
[-- Type: text/plain, Size: 6511 bytes --]
If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to
the page.
This can be done if get_page/put_page behaves in the same manner
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).
There needs to be some careful logic for freed pages so they aren't
freed again, and also some careful logic for pages in the process
of being removed from pagecache.
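The protocol can be modelled in user space with C11 atomics (all names here are invented; the real kernel helper in the patch below additionally disables preemption and uses the biased get_page_testone refcount, whereas this sketch uses an unbiased count where 0 means free):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct spage {
	atomic_int count;	/* unbiased model of page_count(): 0 == free */
	int freeing;		/* models PG_freeing */
};

static struct spage *speculative_get(struct spage *_Atomic *slot)
{
	struct spage *page = atomic_load(slot);

	if (page == NULL)
		return NULL;

	/* Try to elevate the refcount; if it was zero we raced with the
	 * page being freed, so drop our increment and back off. */
	if (atomic_fetch_add(&page->count, 1) == 0) {
		atomic_fetch_sub(&page->count, 1);
		return NULL;
	}

	/* Re-check after taking the reference: the page may be in the
	 * middle of being freed, or the slot may now hold another page. */
	if (page->freeing || atomic_load(slot) != page) {
		atomic_fetch_sub(&page->count, 1);	/* put_page */
		return NULL;
	}
	return page;
}
```

The key property is that a failed speculative get leaves the refcount exactly as it found it, so the allocator's view of a free page is only perturbed transiently.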
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -77,6 +77,7 @@
#define PG_uncached 19 /* Page has been mapped as uncached */
#define PG_free 20 /* Page is on the free lists */
+#define PG_freeing 21 /* Pagecache page is being freed */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -312,6 +313,11 @@ extern void __mod_page_state(unsigned lo
#define __SetPageFree(page) __set_bit(PG_free, &(page)->flags)
#define __ClearPageFree(page) __clear_bit(PG_free, &(page)->flags)
+#define PageFreeing(page) test_bit(PG_freeing, &(page)->flags)
+#define SetPageFreeing(page) set_bit(PG_freeing, &(page)->flags)
+#define ClearPageFreeing(page) clear_bit(PG_freeing, &(page)->flags)
+#define __ClearPageFreeing(page) __clear_bit(PG_freeing, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -50,6 +50,42 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+static inline struct page *page_cache_get_speculative(struct page **pagep)
+{
+ struct page *page;
+
+ preempt_disable();
+ page = *pagep;
+ if (!page)
+ goto out_failed;
+
+ if (unlikely(get_page_testone(page))) {
+ /* Picked up a freed page */
+ __put_page(page);
+ goto out_failed;
+ }
+ /*
+ * preempt can really be enabled here (only needs to be disabled
+ * because page allocation can spin on the elevated refcount, but
+ * we don't want to hold a reference on an unrelated page for too
+ * long, so keep preempt off until we know we have the right page
+ */
+
+ if (unlikely(PageFreeing(page)) ||
+ unlikely(page != *pagep)) {
+ /* Picked up a page being freed, or one that's been reused */
+ put_page(page);
+ goto out_failed;
+ }
+ preempt_enable();
+
+ return page;
+
+out_failed:
+ preempt_enable();
+ return NULL;
+}
+
static inline struct page *page_cache_alloc(struct address_space *x)
{
return alloc_pages(mapping_gfp_mask(x)|__GFP_NORECLAIM, 0);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,7 +116,6 @@ static void bad_page(const char *functio
1 << PG_writeback |
1 << PG_reserved |
1 << PG_free );
- set_page_count(page, 0);
reset_page_mapcount(page);
page->mapping = NULL;
tainted |= TAINT_BAD_PAGE;
@@ -316,7 +315,6 @@ static inline void free_pages_check(cons
{
if ( page_mapcount(page) ||
page->mapping != NULL ||
- page_count(page) != 0 ||
(page->flags & (
1 << PG_lru |
1 << PG_private |
@@ -424,7 +422,7 @@ expand(struct zone *zone, struct page *p
void set_page_refs(struct page *page, int order)
{
#ifdef CONFIG_MMU
- set_page_count(page, 1);
+ get_page(page);
#else
int i;
@@ -434,7 +432,7 @@ void set_page_refs(struct page *page, in
* - eg: access_process_vm()
*/
for (i = 0; i < (1 << order); i++)
- set_page_count(page + i, 1);
+ get_page(page + i);
#endif /* CONFIG_MMU */
}
@@ -445,7 +443,6 @@ static void prep_new_page(struct page *p
{
if ( page_mapcount(page) ||
page->mapping != NULL ||
- page_count(page) != 0 ||
(page->flags & (
1 << PG_lru |
1 << PG_private |
@@ -464,7 +461,13 @@ static void prep_new_page(struct page *p
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked | 1 << PG_mappedtodisk);
page->private = 0;
+
set_page_refs(page, order);
+ smp_mb();
+ /* Wait for speculative get_page after count has been elevated. */
+ while (unlikely(page_count(page) > 1))
+ cpu_relax();
+
kernel_map_pages(page, 1 << order, 1);
}
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -504,6 +504,7 @@ static int shrink_list(struct list_head
if (!mapping)
goto keep_locked; /* truncate got there first */
+ SetPageFreeing(page);
write_lock_irq(&mapping->tree_lock);
/*
@@ -513,6 +514,7 @@ static int shrink_list(struct list_head
*/
if (page_count(page) != 2 || PageDirty(page)) {
write_unlock_irq(&mapping->tree_lock);
+ ClearPageFreeing(page);
goto keep_locked;
}
@@ -533,6 +535,7 @@ static int shrink_list(struct list_head
free_it:
unlock_page(page);
+ __ClearPageFreeing(page);
reclaimed++;
if (!pagevec_add(&freed_pvec, page))
__pagevec_release_nonlru(&freed_pvec);
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -278,17 +278,19 @@ static unsigned long __init free_all_boo
if (gofast && v == ~0UL) {
int j, order;
+ prefetchw(page);
count += BITS_PER_LONG;
- __ClearPageReserved(page);
+
order = ffs(BITS_PER_LONG) - 1;
- set_page_refs(page, order);
- for (j = 1; j < BITS_PER_LONG; j++) {
- if (j + 16 < BITS_PER_LONG)
- prefetchw(page + j + 16);
+ for (j = 0; j < BITS_PER_LONG; j++) {
+ if (j + 1 < BITS_PER_LONG)
+ prefetchw(page + j + 1);
__ClearPageReserved(page + j);
set_page_count(page + j, 0);
}
+ set_page_refs(page, order);
__free_pages(page, order);
+
i += BITS_PER_LONG;
page += BITS_PER_LONG;
} else if (v) {
@@ -297,6 +299,7 @@ static unsigned long __init free_all_boo
if (v & m) {
count++;
__ClearPageReserved(page);
+ set_page_count(page, 0);
set_page_refs(page, 0);
__free_page(page);
}
* [patch 3] radix tree: lookup_slot
2005-06-27 6:32 ` [patch 2] mm: speculative get_page Nick Piggin
@ 2005-06-27 6:33 ` Nick Piggin
2005-06-27 6:34 ` [patch 4] radix tree: lockless readside Nick Piggin
2005-06-27 14:12 ` [patch 2] mm: speculative get_page William Lee Irwin III
2005-06-28 12:45 ` Andy Whitcroft
2 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:33 UTC (permalink / raw)
To: linux-kernel, Linux Memory Management
[-- Attachment #1: Type: text/plain, Size: 28 bytes --]
--
SUSE Labs, Novell Inc.
[-- Attachment #2: radix-tree-lookup_slot.patch --]
[-- Type: text/plain, Size: 2673 bytes --]
From: Hans Reiser <reiser@namesys.com>
Reiser4 uses radix trees to solve a problem reiser4_readdir has when serving
NFS requests.
Unfortunately, the radix tree API lacks an operation suitable for modifying
an existing entry. This patch adds radix_tree_lookup_slot, which returns a
pointer to the found item within the tree. That location can then be updated.
Signed-off-by: Andrew Morton <akpm@osdl.org>
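The distinction between the two lookups can be shown with a toy model (a flat array stands in for the tree, and the toy_* names are illustrative; unlike the real radix_tree_lookup_slot, the toy returns a slot even for an absent entry):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SIZE 16
static void *toy_tree[TOY_SIZE];

/* Like radix_tree_lookup_slot: return the address of the slot holding
 * the item, so the caller can update the entry in place. */
static void **toy_lookup_slot(unsigned long index)
{
	return index < TOY_SIZE ? &toy_tree[index] : NULL;
}

/* Like radix_tree_lookup: return the item itself, built on the slot
 * lookup exactly as the patch rewrites radix_tree_lookup. */
static void *toy_lookup(unsigned long index)
{
	void **slot = toy_lookup_slot(index);
	return slot != NULL ? *slot : NULL;
}
```

Having the slot address means an existing entry can be replaced with a single store, with no delete+insert pair and no second tree descent.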
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -46,6 +46,7 @@ do { \
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
+void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -276,14 +276,8 @@ int radix_tree_insert(struct radix_tree_
}
EXPORT_SYMBOL(radix_tree_insert);
-/**
- * radix_tree_lookup - perform lookup operation on a radix tree
- * @root: radix tree root
- * @index: index key
- *
- * Lookup the item at the position @index in the radix tree @root.
- */
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+static inline void **__lookup_slot(struct radix_tree_root *root,
+ unsigned long index)
{
unsigned int height, shift;
struct radix_tree_node **slot;
@@ -306,7 +300,36 @@ void *radix_tree_lookup(struct radix_tre
height--;
}
- return *slot;
+ return (void **)slot;
+}
+
+/**
+ * radix_tree_lookup_slot - lookup a slot in a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the slot corresponding to the position @index in the radix tree
+ * @root. This is useful for update-if-exists operations.
+ */
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
+{
+ return __lookup_slot(root, index);
+}
+EXPORT_SYMBOL(radix_tree_lookup_slot);
+
+/**
+ * radix_tree_lookup - perform lookup operation on a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ */
+void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+{
+ void **slot;
+
+ slot = __lookup_slot(root, index);
+ return slot != NULL ? *slot : NULL;
}
EXPORT_SYMBOL(radix_tree_lookup);
* [patch 4] radix tree: lockless readside
2005-06-27 6:33 ` [patch 3] radix tree: lookup_slot Nick Piggin
@ 2005-06-27 6:34 ` Nick Piggin
2005-06-27 6:34 ` [patch 5] mm: lockless pagecache lookups Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:34 UTC (permalink / raw)
To: linux-kernel, Linux Memory Management
[-- Attachment #1: Type: text/plain, Size: 28 bytes --]
--
SUSE Labs, Novell Inc.
[-- Attachment #2: radix-tree-lockless-readside.patch --]
[-- Type: text/plain, Size: 5957 bytes --]
Make radix tree lookups safe to perform without locks.
Also introduce a lock-free gang_lookup_slot, which will be used
by a future patch.
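The central trick in the patch is to store the height in each node and publish the node only after its height is initialised. A user-space sketch of that publish pattern (toy_* names are invented; the kernel pairs a plain store with smp_wmb(), and C11 release/acquire ordering models the same guarantee):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_node {
	unsigned int height;
	void *slots[4];
};

static _Atomic(struct toy_node *) toy_root;

/* Writer: fully initialise ->height, then publish the node.  Release
 * ordering ensures ->height is visible before the node pointer is. */
static void toy_publish(unsigned int height)
{
	struct toy_node *n = calloc(1, sizeof(*n));

	n->height = height;
	atomic_store_explicit(&toy_root, n, memory_order_release);
}

/* Lockless reader: take the height from the node itself rather than
 * from a separate root->height field, which may lag behind during a
 * concurrent tree extension. */
static unsigned int toy_read_height(void)
{
	struct toy_node *n = atomic_load_explicit(&toy_root,
						  memory_order_acquire);
	return n ? n->height : 0;
}
```

This is why __lookup_slot and __lookup in the patch read root->rnode first and derive the height from the node, instead of trusting root->height.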
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -45,6 +45,7 @@
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
struct radix_tree_node {
+ unsigned int height; /* Height from the bottom */
unsigned int count;
void *slots[RADIX_TREE_MAP_SIZE];
unsigned long tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
@@ -196,6 +197,7 @@ static int radix_tree_extend(struct radi
}
do {
+ unsigned int newheight;
if (!(node = radix_tree_node_alloc(root)))
return -ENOMEM;
@@ -208,9 +210,13 @@ static int radix_tree_extend(struct radi
tag_set(node, tag, 0);
}
+ newheight = root->height+1;
+ node->height = newheight;
node->count = 1;
+ /* Make ->height visible before node visible via ->rnode */
+ smp_wmb();
root->rnode = node;
- root->height++;
+ root->height = newheight;
} while (height > root->height);
out:
return 0;
@@ -250,6 +256,9 @@ int radix_tree_insert(struct radix_tree_
/* Have to add a child node. */
if (!(tmp = radix_tree_node_alloc(root)))
return -ENOMEM;
+ tmp->height = height;
+ /* Make ->height visible before node visible via slot */
+ smp_wmb();
*slot = tmp;
if (node)
node->count++;
@@ -282,12 +291,14 @@ static inline void **__lookup_slot(struc
unsigned int height, shift;
struct radix_tree_node **slot;
- height = root->height;
+ if (root->rnode == NULL)
+ return NULL;
+ slot = &root->rnode;
+ height = (*slot)->height;
if (index > radix_tree_maxindex(height))
return NULL;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = &root->rnode;
while (height > 0) {
if (*slot == NULL)
@@ -491,21 +502,24 @@ EXPORT_SYMBOL(radix_tree_tag_get);
#endif
static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_root *root, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
+ unsigned long i;
unsigned int nr_found = 0;
unsigned int shift;
- unsigned int height = root->height;
+ unsigned int height;
struct radix_tree_node *slot;
- shift = (height-1) * RADIX_TREE_MAP_SHIFT;
slot = root->rnode;
+ if (!slot)
+ goto out;
+ height = slot->height;
+ shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- while (height > 0) {
- unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
-
- for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
+ for (;;) {
+ for (i = (index >> shift) & RADIX_TREE_MAP_MASK;
+ i < RADIX_TREE_MAP_SIZE; i++) {
if (slot->slots[i] != NULL)
break;
index &= ~((1UL << shift) - 1);
@@ -516,21 +530,23 @@ __lookup(struct radix_tree_root *root, v
if (i == RADIX_TREE_MAP_SIZE)
goto out;
height--;
- if (height == 0) { /* Bottom level: grab some items */
- unsigned long j = index & RADIX_TREE_MAP_MASK;
-
- for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
- index++;
- if (slot->slots[j]) {
- results[nr_found++] = slot->slots[j];
- if (nr_found == max_items)
- goto out;
- }
- }
+ if (height == 0) {
+ /* Bottom level: grab some items */
+ break;
}
shift -= RADIX_TREE_MAP_SHIFT;
slot = slot->slots[i];
}
+
+ for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
+ index++;
+ if (slot->slots[i]) {
+ results[nr_found++] = &(slot->slots[i]);
+ if (nr_found == max_items)
+ goto out;
+ }
+ }
+
out:
*next_index = index;
return nr_found;
@@ -558,6 +574,43 @@ radix_tree_gang_lookup(struct radix_tree
unsigned int ret = 0;
while (ret < max_items) {
+ unsigned int nr_found, i;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ nr_found = __lookup(root, (void ***)results + ret, cur_index,
+ max_items - ret, &next_index);
+ for (i = 0; i < nr_found; i++)
+ results[ret + i] = *(((void ***)results)[ret + i]);
+ ret += nr_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ * radix_tree_gang_lookup_slot - perform multiple lookup on a radix tree
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ *
+ * Same as radix_tree_gang_lookup, but returns an array of pointers
+ * (slots) to the stored items instead of the items themselves.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items)
+{
+ const unsigned long max_index = radix_tree_maxindex(root->height);
+ unsigned long cur_index = first_index;
+ unsigned int ret = 0;
+
+ while (ret < max_items) {
unsigned int nr_found;
unsigned long next_index; /* Index of next search */
@@ -572,7 +625,8 @@ radix_tree_gang_lookup(struct radix_tree
}
return ret;
}
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
/*
* FIXME: the two tag_get()s here should use find_next_bit() instead of
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -51,6 +51,9 @@ void *radix_tree_delete(struct radix_tre
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items);
int radix_tree_preload(int gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
* [patch 5] mm: lockless pagecache lookups
2005-06-27 6:34 ` [patch 4] radix tree: lockless readside Nick Piggin
@ 2005-06-27 6:34 ` Nick Piggin
2005-06-27 6:35 ` [patch 6] mm: spinlock tree_lock Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:34 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
[-- Attachment #1: Type: text/plain, Size: 28 bytes --]
--
SUSE Labs, Novell Inc.
[-- Attachment #2: mm-lockless-pagecache-lookups.patch --]
[-- Type: text/plain, Size: 11976 bytes --]
Use the speculative get_page and the lockless radix tree lookups
to introduce lockless pagecache lookups (ie. no mapping->tree_lock).
The only atomicity change this should introduce is the use of a
non-atomic pagevec lookup for truncate; however, whatever atomicity
guarantees existed there were probably not very useful anyway.
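After a lockless lookup, callers such as find_lock_page and find_trylock_page must lock the page and then revalidate its identity, since truncate may have detached it in the meantime. A minimal sketch of that revalidation step (struct mpage and the helper name are illustrative, not kernel types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mpage {
	void *mapping;
	unsigned long index;
};

/* Once the page is locked, truncate can no longer change its identity,
 * so checking ->mapping and ->index at that point is race-free: a
 * mismatch means the page was truncated (or reused) before we got the
 * lock, and the caller must drop it and retry or fail. */
static bool page_still_at(const struct mpage *page,
			  const void *mapping, unsigned long index)
{
	return page->mapping == mapping && page->index == index;
}
```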
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -378,18 +378,25 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ page_cache_get(page);
+ __SetPageLocked(page);
+ page->mapping = mapping;
+ page->index = offset;
+
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
- page_cache_get(page);
- SetPageLocked(page);
- page->mapping = mapping;
- page->index = offset;
mapping->nrpages++;
pagecache_acct(1);
}
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
+
+ if (error) {
+ page->mapping = NULL;
+ __put_page(page);
+ __ClearPageLocked(page);
+ }
}
return error;
}
@@ -499,13 +506,13 @@ EXPORT_SYMBOL(__lock_page);
*/
struct page * find_get_page(struct address_space *mapping, unsigned long offset)
{
- struct page *page;
+ struct page **pagep;
+ struct page *page = NULL;
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
- read_unlock_irq(&mapping->tree_lock);
+ pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
+ offset);
+ if (pagep)
+ page = page_cache_get_speculative(pagep);
return page;
}
@@ -518,12 +525,24 @@ struct page *find_trylock_page(struct ad
{
struct page *page;
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page && TestSetPageLocked(page))
- page = NULL;
- read_unlock_irq(&mapping->tree_lock);
- return page;
+ page = find_get_page(mapping, offset);
+ if (page) {
+ if (TestSetPageLocked(page))
+ goto out_failed;
+ /* Has the page been truncated before being locked? */
+ if (page->mapping != mapping || page->index != offset) {
+ unlock_page(page);
+ goto out_failed;
+ }
+
+ /* Silly interface requires us to drop the refcount */
+ __put_page(page);
+ return page;
+
+out_failed:
+ page_cache_release(page);
+ }
+ return NULL;
}
EXPORT_SYMBOL(find_trylock_page);
@@ -544,25 +563,17 @@ struct page *find_lock_page(struct addre
{
struct page *page;
- read_lock_irq(&mapping->tree_lock);
repeat:
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = find_get_page(mapping, offset);
if (page) {
- page_cache_get(page);
- if (TestSetPageLocked(page)) {
- read_unlock_irq(&mapping->tree_lock);
- lock_page(page);
- read_lock_irq(&mapping->tree_lock);
-
- /* Has the page been truncated while we slept? */
- if (page->mapping != mapping || page->index != offset) {
- unlock_page(page);
- page_cache_release(page);
- goto repeat;
- }
+ lock_page(page);
+ /* Has the page been truncated before being locked? */
+ if (page->mapping != mapping || page->index != offset) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
}
}
- read_unlock_irq(&mapping->tree_lock);
return page;
}
@@ -645,6 +656,30 @@ unsigned find_get_pages(struct address_s
return ret;
}
+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+ unsigned int nr_pages, struct page **pages)
+{
+ unsigned int i;
+ unsigned int ret;
+ unsigned int ret2;
+
+ /*
+ * We do some unsightly casting to use the array first for storing
+ * pointers to the page pointers, and then for the pointers to
+ * the pages themselves that the caller wants.
+ */
+ ret = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, start, nr_pages);
+ ret2 = 0;
+ for (i = 0; i < ret; i++) {
+ struct page *page;
+ page = page_cache_get_speculative(((struct page ***)pages)[i]);
+ if (page)
+ pages[ret2++] = page;
+ }
+ return ret2;
+}
+
/*
* Like find_get_pages, except we only return pages which are tagged with
* `tag'. We update *index to index the next page for the traversal.
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -272,27 +272,24 @@ __do_page_cache_readahead(struct address
/*
* Preallocate as many pages as we will need.
*/
- read_lock_irq(&mapping->tree_lock);
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
unsigned long page_offset = offset + page_idx;
if (page_offset > end_index)
break;
+ /* Don't need mapping->tree_lock - lookup can be racy */
page = radix_tree_lookup(&mapping->page_tree, page_offset);
if (page)
continue;
- read_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
- read_lock_irq(&mapping->tree_lock);
if (!page)
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
ret++;
}
- read_unlock_irq(&mapping->tree_lock);
/*
* Now start the IO. We ignore I/O errors - if the page is not
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -76,19 +76,26 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ page_cache_get(page);
+ SetPageLocked(page);
+ SetPageSwapCache(page);
+ page->private = entry.val;
+
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
- page_cache_get(page);
- SetPageLocked(page);
- SetPageSwapCache(page);
- page->private = entry.val;
total_swapcache_pages++;
pagecache_acct(1);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
+
+ if (error) {
+ __put_page(page);
+ ClearPageLocked(page);
+ ClearPageSwapCache(page);
+ }
}
return error;
}
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -167,16 +167,13 @@ extern void __mod_page_state(unsigned lo
/*
* Manipulation of page state flags
*/
-#define PageLocked(page) \
- test_bit(PG_locked, &(page)->flags)
-#define SetPageLocked(page) \
- set_bit(PG_locked, &(page)->flags)
-#define TestSetPageLocked(page) \
- test_and_set_bit(PG_locked, &(page)->flags)
-#define ClearPageLocked(page) \
- clear_bit(PG_locked, &(page)->flags)
-#define TestClearPageLocked(page) \
- test_and_clear_bit(PG_locked, &(page)->flags)
+#define PageLocked(page) test_bit(PG_locked, &(page)->flags)
+#define SetPageLocked(page) set_bit(PG_locked, &(page)->flags)
+#define __SetPageLocked(page) __set_bit(PG_locked, &(page)->flags)
+#define TestSetPageLocked(page) test_and_set_bit(PG_locked, &(page)->flags)
+#define ClearPageLocked(page) clear_bit(PG_locked, &(page)->flags)
+#define __ClearPageLocked(page) __clear_bit(PG_locked, &(page)->flags)
+#define TestClearPageLocked(page) test_and_clear_bit(PG_locked, &(page)->flags)
#define PageError(page) test_bit(PG_error, &(page)->flags)
#define SetPageError(page) set_bit(PG_error, &(page)->flags)
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -108,6 +108,8 @@ extern struct page * find_or_create_page
unsigned long index, unsigned int gfp_mask);
unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages);
+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+ unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
int tag, unsigned int nr_pages, struct page **pages);
Index: linux-2.6/include/linux/pagevec.h
===================================================================
--- linux-2.6.orig/include/linux/pagevec.h
+++ linux-2.6/include/linux/pagevec.h
@@ -25,6 +25,8 @@ void __pagevec_lru_add_active(struct pag
void pagevec_strip(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+ struct address_space *mapping, pgoff_t start, unsigned nr_pages);
unsigned pagevec_lookup_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, int tag,
unsigned nr_pages);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -380,6 +380,19 @@ unsigned pagevec_lookup(struct pagevec *
return pagevec_count(pvec);
}
+/**
+ * pagevec_lookup_nonatomic - non-atomic pagevec_lookup
+ *
+ * This routine is non-atomic in that it does not take an atomic
+ * snapshot of the radix tree: under concurrent insertion or removal
+ * it may miss pages or find pages that have already been removed
+ * from the mapping. Callers (such as truncate) must tolerate this.
+ */
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+ struct address_space *mapping, pgoff_t start, unsigned nr_pages)
+{
+ pvec->nr = find_get_pages_nonatomic(mapping, start,
+ nr_pages, pvec->pages);
+ return pagevec_count(pvec);
+}
+
unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
pgoff_t *index, int tag, unsigned nr_pages)
{
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -126,7 +126,7 @@ void truncate_inode_pages(struct address
pagevec_init(&pvec, 0);
next = start;
- while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ while (pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index = page->index;
@@ -160,7 +160,7 @@ void truncate_inode_pages(struct address
next = start;
for ( ; ; ) {
cond_resched();
- if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ if (!pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
if (next == start)
break;
next = start;
@@ -206,7 +206,7 @@ unsigned long invalidate_mapping_pages(s
pagevec_init(&pvec, 0);
while (next <= end &&
- pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -811,6 +811,7 @@ int mapping_tagged(struct address_space
unsigned long flags;
int ret;
+ /* XXX: radix_tree_tagged is safe to run without the lock */
read_lock_irqsave(&mapping->tree_lock, flags);
ret = radix_tree_tagged(&mapping->page_tree, tag);
read_unlock_irqrestore(&mapping->tree_lock, flags);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -338,6 +338,7 @@ int remove_exclusive_swap_page(struct pa
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
+ SetPageFreeing(page);
write_lock_irq(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
@@ -345,6 +346,7 @@ int remove_exclusive_swap_page(struct pa
retval = 1;
}
write_unlock_irq(&swapper_space.tree_lock);
+ ClearPageFreeing(page);
}
swap_info_put(p);
^ permalink raw reply [flat|nested] 56+ messages in thread
* [patch 6] mm: spinlock tree_lock
2005-06-27 6:34 ` [patch 5] mm: lockless pagecache lookups Nick Piggin
@ 2005-06-27 6:35 ` Nick Piggin
0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:35 UTC (permalink / raw)
To: linux-kernel, Linux Memory Management
[-- Attachment #1: Type: text/plain, Size: 28 bytes --]
--
SUSE Labs, Novell Inc.
[-- Attachment #2: mm-spinlock-tree_lock.patch --]
[-- Type: text/plain, Size: 13830 bytes --]
With practically all the read locks gone from mapping->tree_lock,
convert the lock from an rwlock back to a spinlock.
The remaining lock takers, including the read-side ones, mainly
deal with IO submission rather than the lookup fastpaths.
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -875,7 +875,7 @@ int __set_page_dirty_buffers(struct page
spin_unlock(&mapping->private_lock);
if (!TestSetPageDirty(page)) {
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping))
inc_page_state(nr_dirty);
@@ -883,7 +883,7 @@ int __set_page_dirty_buffers(struct page
page_index(page),
PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -194,7 +194,7 @@ void inode_init_once(struct inode *inode
sema_init(&inode->i_sem, 1);
init_rwsem(&inode->i_alloc_sem);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
- rwlock_init(&inode->i_data.tree_lock);
+ spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -336,7 +336,7 @@ struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
- rwlock_t tree_lock; /* and rwlock protecting it */
+ spinlock_t tree_lock; /* and lock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -120,9 +120,9 @@ void remove_from_page_cache(struct page
BUG_ON(!PageLocked(page));
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}
static int sync_page(void *word)
@@ -383,13 +383,13 @@ int add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = offset;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
mapping->nrpages++;
pagecache_acct(1);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
if (error) {
@@ -647,12 +647,12 @@ unsigned find_get_pages(struct address_s
unsigned int i;
unsigned int ret;
- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pages, start, nr_pages);
for (i = 0; i < ret; i++)
page_cache_get(pages[i]);
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return ret;
}
@@ -690,14 +690,14 @@ unsigned find_get_pages_tag(struct addre
unsigned int i;
unsigned int ret;
- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
for (i = 0; i < ret; i++)
page_cache_get(pages[i]);
if (ret)
*index = pages[ret - 1]->index + 1;
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return ret;
}
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -35,7 +35,7 @@ static struct backing_dev_info swap_back
struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = RW_LOCK_UNLOCKED,
+ .tree_lock = SPIN_LOCK_UNLOCKED,
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
@@ -81,14 +81,14 @@ static int __add_to_swap_cache(struct pa
SetPageSwapCache(page);
page->private = entry.val;
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
total_swapcache_pages++;
pagecache_acct(1);
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
if (error) {
@@ -210,9 +210,9 @@ void delete_from_swap_cache(struct page
entry.val = page->private;
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
swap_free(entry);
page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -339,13 +339,13 @@ int remove_exclusive_swap_page(struct pa
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
SetPageFreeing(page);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
ClearPageFreeing(page);
}
swap_info_put(p);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -76,15 +76,15 @@ invalidate_complete_page(struct address_
if (PagePrivate(page) && !try_to_release_page(page, 0))
return 0;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page)) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -505,7 +505,7 @@ static int shrink_list(struct list_head
goto keep_locked; /* truncate got there first */
SetPageFreeing(page);
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
/*
* The non-racy check for busy page. It is critical to check
@@ -513,7 +513,7 @@ static int shrink_list(struct list_head
* not in use by anybody. (pagecache + us == 2)
*/
if (page_count(page) != 2 || PageDirty(page)) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageFreeing(page);
goto keep_locked;
}
@@ -522,7 +522,7 @@ static int shrink_list(struct list_head
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page->private };
__delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
swap_free(swap);
__put_page(page); /* The pagecache ref */
goto free_it;
@@ -530,7 +530,7 @@ static int shrink_list(struct list_head
#endif /* CONFIG_SWAP */
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__put_page(page);
free_it:
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -623,7 +623,7 @@ int __set_page_dirty_nobuffers(struct pa
struct address_space *mapping2;
if (mapping) {
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -632,7 +632,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host,
@@ -707,17 +707,17 @@ int test_clear_page_dirty(struct page *p
unsigned long flags;
if (mapping) {
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
if (TestClearPageDirty(page)) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
if (mapping_cap_account_dirty(mapping))
dec_page_state(nr_dirty);
return 1;
}
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
}
return TestClearPageDirty(page);
@@ -762,13 +762,13 @@ int test_clear_page_writeback(struct pag
if (mapping) {
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
if (ret)
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
}
@@ -783,7 +783,7 @@ int test_set_page_writeback(struct page
if (mapping) {
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret)
radix_tree_tag_set(&mapping->page_tree,
@@ -793,7 +793,7 @@ int test_set_page_writeback(struct page
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestSetPageWriteback(page);
}
@@ -812,9 +812,9 @@ int mapping_tagged(struct address_space
int ret;
/* XXX: radix_tree_tagged is safe to run without the lock */
- read_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = radix_tree_tagged(&mapping->page_tree, tag);
- read_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
return ret;
}
EXPORT_SYMBOL(mapping_tagged);
Index: linux-2.6/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux-2.6.orig/drivers/mtd/devices/block2mtd.c
+++ linux-2.6/drivers/mtd/devices/block2mtd.c
@@ -59,7 +59,7 @@ void cache_readahead(struct address_spac
end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
for (i = 0; i < PAGE_READAHEAD; i++) {
pagei = index + i;
if (pagei > end_index) {
@@ -71,16 +71,16 @@ void cache_readahead(struct address_spac
break;
if (page)
continue;
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (!page)
break;
page->index = pagei;
list_add(&page->lru, &page_pool);
ret++;
}
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (ret)
read_cache_pages(mapping, &page_pool, filler, NULL);
}
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -315,9 +315,9 @@ flush_cache_page(struct vm_area_struct *
extern void flush_dcache_page(struct page *);
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_user_range(vma,page,addr,len) \
flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -57,9 +57,9 @@ flush_user_icache_range(unsigned long st
extern void flush_dcache_page(struct page *page);
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_page(vma,page) do { flush_kernel_dcache_page(page_address(page)); flush_kernel_icache_page(page_address(page)); } while (0)
^ permalink raw reply [flat|nested] 56+ messages in thread
* VFS scalability (was: [rfc] lockless pagecache)
2005-06-27 6:29 [rfc] lockless pagecache Nick Piggin
2005-06-27 6:32 ` [patch 1] mm: PG_free flag Nick Piggin
@ 2005-06-27 6:43 ` Nick Piggin
2005-06-27 7:13 ` Andi Kleen
2005-06-27 7:46 ` [rfc] lockless pagecache Andrew Morton
2005-06-29 10:49 ` Hirokazu Takahashi
3 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:43 UTC (permalink / raw)
To: linux-kernel, Linux Memory Management
Just an interesting aside, when first testing the patch I was
using read(2) instead of nopage faults. I ran into some surprising
results there which I don't have the time to follow up at the
moment - it might be worth investigating if someone has the time,
regardless the state of the lockless pagecache work.
For the parallel workload as described in the parent post (but
read instead of fault), the vanilla kernel profile looks like
this:
74453 total 0.0121
25839 update_atime 44.8594
19595 _read_unlock_irq 306.1719
13025 do_generic_mapping_read 5.5758
9374 rw_verify_area 29.2937
1739 ia64_pal_call_static 9.0573
1567 default_idle 4.0807
1114 __copy_user 0.4704
848 _spin_lock 8.8333
786 ia64_spinlock_contention 8.1875
246 ia64_save_scratch_fpregs 3.8438
187 ia64_load_scratch_fpregs 2.9219
16 file_read_actor 0.0263
15 fsys_bubble_down 0.0586
12 vfs_read 0.0170
This is with the filesystem mounted as noatime, so I can't work
out why update_atime is so high on the list. I suspect maybe a
false sharing issue with some other fields.
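[Editorial aside, not part of the original message: the false-sharing hypothesis above is that i_atime merely shares a 64-byte cacheline with an inode field that other CPUs write on every read(2), so its readers eat the invalidations even though noatime prevents any atime writes. A toy layout check, with made-up field names that only loosely mirror the 2.6-era struct inode:]

```c
#include <stddef.h>

/* Toy stand-in for a 2.6-era inode; field names and layout are
 * illustrative only, not the real struct inode. */
struct toy_inode {
	unsigned long	i_state;	/* written under inode_lock */
	long		i_count;	/* written on every open/iput */
	long		i_atime_sec;	/* only *read* when mounted noatime */
};

#define TOY_CACHELINE 64

/* Returns 1 if the write-hot field and the read-mostly field share a
 * cacheline, in which case every remote write invalidates the line
 * and the "innocent" reader of i_atime_sec eats the cache miss. */
static int toy_false_sharing(void)
{
	return offsetof(struct toy_inode, i_count) / TOY_CACHELINE ==
	       offsetof(struct toy_inode, i_atime_sec) / TOY_CACHELINE;
}
```

The fix, then and now, is padding or regrouping hot fields rather than anything in the fault path itself.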
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: VFS scalability (was: [rfc] lockless pagecache)
2005-06-27 6:43 ` VFS scalability (was: [rfc] lockless pagecache) Nick Piggin
@ 2005-06-27 7:13 ` Andi Kleen
2005-06-27 7:33 ` VFS scalability Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2005-06-27 7:13 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, linux-mm
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> This is with the filesystem mounted as noatime, so I can't work
> out why update_atime is so high on the list. I suspect maybe a
> false sharing issue with some other fields.
Did all the 64CPUs write to the same file?
Then update_atime was just the messenger - it is the first function
to read the inode so it eats the cache miss overhead.
Maybe adding a prefetch for it at the beginning of sys_read()
might help, but then with 64CPUs writing to parts of the inode
it will always thrash no matter how many prefetches.
-Andi
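[Editorial sketch, not from the message: the prefetch Andi suggests would be a hint of roughly this shape. __builtin_prefetch is a real GCC builtin; where exactly in sys_read() it would be issued, and against which inode field, is left open here.]

```c
/* Issue a read prefetch for a cacheline we expect to need soon
 * (e.g. the inode line, at syscall entry).  __builtin_prefetch
 * compiles to a prefetch hint, or to nothing on targets without
 * one, and never faults even for a bad address. */
static inline void prefetch_for_read(const void *addr)
{
	__builtin_prefetch(addr, 0 /* rw: read */, 3 /* high temporal locality */);
}
```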
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: VFS scalability
2005-06-27 7:13 ` Andi Kleen
@ 2005-06-27 7:33 ` Nick Piggin
2005-06-27 7:44 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 7:33 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, linux-mm
Andi Kleen wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> writes:
>
>
>>This is with the filesystem mounted as noatime, so I can't work
>>out why update_atime is so high on the list. I suspect maybe a
>>false sharing issue with some other fields.
>
>
> Did all the 64CPUs write to the same file?
>
Yes.
> Then update_atime was just the messenger - it is the first function
> to read the inode so it eats the cache miss overhead.
>
I agree.
> Maybe adding a prefetch for it at the beginning of sys_read()
> might help, but then with 64CPUs writing to parts of the inode
> it will always thrash no matter how many prefetches.
>
True. I'm just not sure what is causing the bouncing - I guess
->f_count due to get_file()?
rw_verify_area is another that is taking a lot of hits - probably
due to the same cacheline(s) as update_atime.
Unless I'm mistaken, the big difference between the read fault and
the read(2) cases is that mmap holds a reference on the file, while
open(2) doesn't?
I guess if anyone really cares about that, they could hack up a flag
to tell the file to remain pinned.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: VFS scalability
2005-06-27 7:33 ` VFS scalability Nick Piggin
@ 2005-06-27 7:44 ` Andi Kleen
2005-06-27 8:03 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2005-06-27 7:44 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, linux-kernel, linux-mm
On Mon, Jun 27, 2005 at 05:33:43PM +1000, Nick Piggin wrote:
> >Maybe adding a prefetch for it at the beginning of sys_read()
> >might help, but then with 64CPUs writing to parts of the inode
> >it will always thrash no matter how many prefetches.
> >
>
> True. I'm just not sure what is causing the bouncing - I guess
> ->f_count due to get_file()?
That's in the file, not in the inode. It must be some inode field.
I don't know which one.
There is probably some oprofile/perfmon event that could tell
you which function dirties the cacheline.
-Andi
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [rfc] lockless pagecache
2005-06-27 6:29 [rfc] lockless pagecache Nick Piggin
2005-06-27 6:32 ` [patch 1] mm: PG_free flag Nick Piggin
2005-06-27 6:43 ` VFS scalability (was: [rfc] lockless pagecache) Nick Piggin
@ 2005-06-27 7:46 ` Andrew Morton
2005-06-27 8:02 ` Nick Piggin
` (2 more replies)
2005-06-29 10:49 ` Hirokazu Takahashi
3 siblings, 3 replies; 56+ messages in thread
From: Andrew Morton @ 2005-06-27 7:46 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, linux-mm
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> First I'll put up some numbers to get you interested - of a 64-way Altix
> with 64 processes each read-faulting in their own 512MB part of a 32GB
> file that is preloaded in pagecache (with the proper NUMA memory
> allocation).
I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
16-page faultahead.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [rfc] lockless pagecache
2005-06-27 7:46 ` [rfc] lockless pagecache Andrew Morton
@ 2005-06-27 8:02 ` Nick Piggin
2005-06-27 8:15 ` Andrew Morton
` (2 more replies)
2005-06-27 14:08 ` Martin J. Bligh
2005-06-27 17:49 ` Christoph Lameter
2 siblings, 3 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 8:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm
Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>First I'll put up some numbers to get you interested - of a 64-way Altix
>> with 64 processes each read-faulting in their own 512MB part of a 32GB
>> file that is preloaded in pagecache (with the proper NUMA memory
>> allocation).
>
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.
>
>
Definitely, for the microbenchmark I was testing with.
However I think for Oracle and others that use shared memory like
this, they are probably not doing linear access, so that would be a
net loss. I'm not completely sure (I don't have access to real loads
at the moment), but I would have thought those guys would have looked
into fault ahead if it were a possibility.
Also, the memory usage regression cases that fault ahead brings make it
a bit contentious.
I like that the lockless patch completely removes the problem at its
source and even makes the serial path lighter. The other thing is, the
speculative get_page may be useful for more code than just pagecache
lookups. But it is fairly tricky I'll give you that.
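[Editorial note for readers unfamiliar with the term: "speculative get_page" means taking a page reference only if the refcount is still non-zero, so a page concurrently being freed is never resurrected. A minimal userspace model of the idea (my sketch, not code from the patchset; the kernel primitive this idea later became is get_page_unless_zero()):]

```c
#include <stdatomic.h>

/* Toy model of a page's reference count. */
struct toy_page {
	atomic_int	count;
};

/* Take a reference only if count is still non-zero: cmpxchg succeeds
 * only when nobody changed the count under us, so a page whose count
 * has reached zero (i.e. is being freed) can never be re-pinned. */
static int toy_get_page_unless_zero(struct toy_page *page)
{
	int old = atomic_load(&page->count);

	while (old != 0) {
		/* on failure, old is updated to the current value */
		if (atomic_compare_exchange_weak(&page->count, &old, old + 1))
			return 1;	/* got our reference */
	}
	return 0;	/* page was being freed; the lookup must retry */
}
```

The trickiness Nick refers to is everything around this: the lookup must then re-check that the page it pinned is still the page at that index in that mapping.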
Anyway it is obviously not something that can go in tomorrow. At the
very least the PageReserved patches need to go in first, and even they
will need a lot of testing out of tree.
Perhaps it can be discussed at KS and we can think about what to do with
it after that - that kind of time frame. No rush.
Oh yeah, and obviously it would be nice if it provided real improvements
on real workloads too ;)
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: VFS scalability
2005-06-27 7:44 ` Andi Kleen
@ 2005-06-27 8:03 ` Nick Piggin
0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 8:03 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, linux-mm
Andi Kleen wrote:
> On Mon, Jun 27, 2005 at 05:33:43PM +1000, Nick Piggin wrote:
>
>>>Maybe adding a prefetch for it at the beginning of sys_read()
>>>might help, but then with 64CPUs writing to parts of the inode
>>>it will always thrash no matter how many prefetches.
>>>
>>
>>True. I'm just not sure what is causing the bouncing - I guess
>>->f_count due to get_file()?
>
>
> That's in the file, not in the inode. It must be some inode field.
> I don't know which one.
>
Oh yes, my mistake.
> There is probably some oprofile/perfmon event that could tell
> you which function dirties the cacheline.
>
I'll see if I can work it out. Thanks.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [rfc] lockless pagecache
2005-06-27 8:02 ` Nick Piggin
@ 2005-06-27 8:15 ` Andrew Morton
2005-06-27 8:28 ` Nick Piggin
2005-06-27 8:56 ` Lincoln Dale
2005-06-27 13:17 ` Benjamin LaHaise
2 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2005-06-27 8:15 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, linux-mm
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Also, the memory usage regression cases that fault ahead brings make it
> a bit contentious.
faultahead consumes no more memory: if the page is present then point a pte
at it. It'll make reclaim work a bit harder in some situations.
> I like that the lockless patch completely removes the problem at its
> source and even makes the serial path lighter. The other things is, the
> speculative get_page may be useful for more code than just pagecache
> lookups. But it is fairly tricky I'll give you that.
Yes, it's scary-looking stuff.
> Anyway it is obviously not something that can go in tomorrow. At the
> very least the PageReserved patches need to go in first, and even they
> will need a lot of testing out of tree.
>
> Perhaps it can be discussed at KS and we can think about what to do with
> it after that - that kind of time frame. No rush.
>
> Oh yeah, and obviously it would be nice if it provided real improvements
> on real workloads too ;)
umm, yes.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [rfc] lockless pagecache
2005-06-27 8:15 ` Andrew Morton
@ 2005-06-27 8:28 ` Nick Piggin
0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 8:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm
Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>Also, the memory usage regression cases that fault ahead brings make it
>> a bit contentious.
>
>
> faultahead consumes no more memory: if the page is present then point a pte
> at it. It'll make reclaim work a bit harder in some situations.
>
Oh OK we'll call that faultahead and Christoph's thing prefault then.
I suspect it may still be a net loss for those that are running into
tree_lock contention, but we'll see.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [rfc] lockless pagecache
2005-06-27 8:02 ` Nick Piggin
2005-06-27 8:15 ` Andrew Morton
@ 2005-06-27 8:56 ` Lincoln Dale
2005-06-27 9:04 ` Nick Piggin
2005-06-27 13:17 ` Benjamin LaHaise
2 siblings, 1 reply; 56+ messages in thread
From: Lincoln Dale @ 2005-06-27 8:56 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm
Nick Piggin wrote:
[..]
> However I think for Oracle and others that use shared memory like
> this, they are probably not doing linear access, so that would be a
> net loss. I'm not completely sure (I don't have access to real loads
> at the moment), but I would have thought those guys would have looked
> into fault ahead if it were a possibility.
I thought those guys used O_DIRECT - in which case, wouldn't the page
cache not be used?
cheers,
lincoln.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [rfc] lockless pagecache
2005-06-27 8:56 ` Lincoln Dale
@ 2005-06-27 9:04 ` Nick Piggin
2005-06-27 18:14 ` Chen, Kenneth W
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 9:04 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Andrew Morton, linux-kernel, linux-mm
Lincoln Dale wrote:
> Nick Piggin wrote:
> [..]
>
>> However I think for Oracle and others that use shared memory like
>> this, they are probably not doing linear access, so that would be a
>> net loss. I'm not completely sure (I don't have access to real loads
>> at the moment), but I would have thought those guys would have looked
>> into fault ahead if it were a possibility.
>
>
> i thought those guys used O_DIRECT - in which case, wouldn't the page
> cache not be used?
>
Well I think they do use O_DIRECT for their IO, but they need to
use the Linux pagecache for their shared memory - that shared
memory being the basis for their page cache. I think. Whatever
the setup I believe they have issues with the tree_lock, which is
why it was changed to an rwlock.
--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
2005-06-27 8:02 ` Nick Piggin
2005-06-27 8:15 ` Andrew Morton
2005-06-27 8:56 ` Lincoln Dale
@ 2005-06-27 13:17 ` Benjamin LaHaise
2005-06-28 0:32 ` Nick Piggin
2 siblings, 1 reply; 56+ messages in thread
From: Benjamin LaHaise @ 2005-06-27 13:17 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm
On Mon, Jun 27, 2005 at 06:02:15PM +1000, Nick Piggin wrote:
> However I think for Oracle and others that use shared memory like
> this, they are probably not doing linear access, so that would be a
> net loss. I'm not completely sure (I don't have access to real loads
> at the moment), but I would have thought those guys would have looked
> into fault ahead if it were a possibility.
Shared memory overhead doesn't show up on any of the database benchmarks
I've seen, as they tend to use huge pages that are locked in memory, and
thus don't tend to access the page cache at all after ramp up.
-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler
* Re: [rfc] lockless pagecache
2005-06-27 7:46 ` [rfc] lockless pagecache Andrew Morton
2005-06-27 8:02 ` Nick Piggin
@ 2005-06-27 14:08 ` Martin J. Bligh
2005-06-27 17:49 ` Christoph Lameter
2 siblings, 0 replies; 56+ messages in thread
From: Martin J. Bligh @ 2005-06-27 14:08 UTC (permalink / raw)
To: Andrew Morton, Nick Piggin; +Cc: linux-kernel, linux-mm
--Andrew Morton <akpm@osdl.org> wrote (on Monday, June 27, 2005 00:46:24 -0700):
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>
>> First I'll put up some numbers to get you interested - of a 64-way Altix
>> with 64 processes each read-faulting in their own 512MB part of a 32GB
>> file that is preloaded in pagecache (with the proper NUMA memory
>> allocation).
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.
Maybe true, but when we last tried that, faultahead sucked for performance
in a more general sense. All the extra setup and teardown cost for
unnecessary PTEs kills you, even if it's only 4 pages or so.
M.
* Re: [patch 2] mm: speculative get_page
2005-06-27 6:32 ` [patch 2] mm: speculative get_page Nick Piggin
2005-06-27 6:33 ` [patch 3] radix tree: lookup_slot Nick Piggin
@ 2005-06-27 14:12 ` William Lee Irwin III
2005-06-28 0:03 ` Nick Piggin
2005-06-28 12:45 ` Andy Whitcroft
2 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-27 14:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
> +static inline struct page *page_cache_get_speculative(struct page **pagep)
> +{
> + struct page *page;
> +
> + preempt_disable();
> + page = *pagep;
> + if (!page)
> + goto out_failed;
> +
> + if (unlikely(get_page_testone(page))) {
> + /* Picked up a freed page */
> + __put_page(page);
> + goto out_failed;
> + }
So you pick up 0->1 refcount transitions.
On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
> + /*
> + * preempt can really be enabled here (only needs to be disabled
> + * because page allocation can spin on the elevated refcount, but
> + * we don't want to hold a reference on an unrelated page for too
> + * long, so keep preempt off until we know we have the right page
> + */
> +
> + if (unlikely(PageFreeing(page)) ||
SetPageFreeing is only done in shrink_list(), so other pages in the
buddy bitmaps and/or pagecache pages freed by other methods may not
be found by this. There's also likely trouble with higher-order pages.
On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
> + unlikely(page != *pagep)) {
> + /* Picked up a page being freed, or one that's been reused */
> + put_page(page);
> + goto out_failed;
> + }
> + preempt_enable();
> +
> + return page;
> +
> +out_failed:
> + preempt_enable();
> + return NULL;
> +}
page != *pagep won't be reliably tripped unless the pagecache
modification has the appropriate memory barriers.
The lockless radix tree lookups are a harder problem than this, and
the implementation didn't look promising. I have other problems to deal
with so I'm not going to go very far into this.
While I agree that locklessness is the right direction for the
pagecache to go, this RFC seems to have too far to go to use it to
conclude anything about the subject.
-- wli
* Re: [rfc] lockless pagecache
2005-06-27 7:46 ` [rfc] lockless pagecache Andrew Morton
2005-06-27 8:02 ` Nick Piggin
2005-06-27 14:08 ` Martin J. Bligh
@ 2005-06-27 17:49 ` Christoph Lameter
2 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2005-06-27 17:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, linux-mm
On Mon, 27 Jun 2005, Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > First I'll put up some numbers to get you interested - of a 64-way Altix
> > with 64 processes each read-faulting in their own 512MB part of a 32GB
> > file that is preloaded in pagecache (with the proper NUMA memory
> > allocation).
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.
Could be worked into the prefault patch.... Good idea.
* RE: [rfc] lockless pagecache
2005-06-27 9:04 ` Nick Piggin
@ 2005-06-27 18:14 ` Chen, Kenneth W
2005-06-27 18:50 ` Badari Pulavarty
0 siblings, 1 reply; 56+ messages in thread
From: Chen, Kenneth W @ 2005-06-27 18:14 UTC (permalink / raw)
To: 'Nick Piggin', Lincoln Dale; +Cc: Andrew Morton, linux-kernel, linux-mm
Nick Piggin wrote on Monday, June 27, 2005 2:04 AM
> >> However I think for Oracle and others that use shared memory like
> >> this, they are probably not doing linear access, so that would be a
> >> net loss. I'm not completely sure (I don't have access to real loads
> >> at the moment), but I would have thought those guys would have looked
> >> into fault ahead if it were a possibility.
> >
> >
> > i thought those guys used O_DIRECT - in which case, wouldn't the page
> > cache not be used?
> >
>
> Well I think they do use O_DIRECT for their IO, but they need to
> use the Linux pagecache for their shared memory - that shared
> memory being the basis for their page cache. I think. Whatever
> the setup I believe they have issues with the tree_lock, which is
> why it was changed to an rwlock.
Typically shared memory is used as db buffer cache, and O_DIRECT is
performed on these buffer cache (hence O_DIRECT on the shared memory).
You must be thinking some other workload. Nevertheless, for OLTP type
of db workload, tree_lock hasn't been a problem so far.
- Ken
* RE: [rfc] lockless pagecache
2005-06-27 18:14 ` Chen, Kenneth W
@ 2005-06-27 18:50 ` Badari Pulavarty
2005-06-27 19:05 ` Chen, Kenneth W
0 siblings, 1 reply; 56+ messages in thread
From: Badari Pulavarty @ 2005-06-27 18:50 UTC (permalink / raw)
To: Chen, Kenneth W
Cc: 'Nick Piggin',
Lincoln Dale, Andrew Morton, linux-kernel, linux-mm
On Mon, 2005-06-27 at 11:14 -0700, Chen, Kenneth W wrote:
> Nick Piggin wrote on Monday, June 27, 2005 2:04 AM
> > >> However I think for Oracle and others that use shared memory like
> > >> this, they are probably not doing linear access, so that would be a
> > >> net loss. I'm not completely sure (I don't have access to real loads
> > >> at the moment), but I would have thought those guys would have looked
> > >> into fault ahead if it were a possibility.
> > >
> > >
> > > i thought those guys used O_DIRECT - in which case, wouldn't the page
> > > cache not be used?
> > >
> >
> > Well I think they do use O_DIRECT for their IO, but they need to
> > use the Linux pagecache for their shared memory - that shared
> > memory being the basis for their page cache. I think. Whatever
> > the setup I believe they have issues with the tree_lock, which is
> > why it was changed to an rwlock.
>
> Typically shared memory is used as db buffer cache, and O_DIRECT is
> performed on these buffer cache (hence O_DIRECT on the shared memory).
> You must be thinking some other workload. Nevertheless, for OLTP type
> of db workload, tree_lock hasn't been a problem so far.
What about DSS ? I need to go back and verify some of the profiles
we have.
Thanks,
Badari
* RE: [rfc] lockless pagecache
2005-06-27 18:50 ` Badari Pulavarty
@ 2005-06-27 19:05 ` Chen, Kenneth W
2005-06-27 19:22 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Chen, Kenneth W @ 2005-06-27 19:05 UTC (permalink / raw)
To: 'Badari Pulavarty'
Cc: 'Nick Piggin',
Lincoln Dale, Andrew Morton, linux-kernel, linux-mm
Badari Pulavarty wrote on Monday, June 27, 2005 11:51 AM
> On Mon, 2005-06-27 at 11:14 -0700, Chen, Kenneth W wrote:
> > Typically shared memory is used as db buffer cache, and O_DIRECT is
> > performed on these buffer cache (hence O_DIRECT on the shared memory).
> > You must be thinking some other workload. Nevertheless, for OLTP type
> > of db workload, tree_lock hasn't been a problem so far.
>
> What about DSS ? I need to go back and verify some of the profiles
> we have.
I don't recall seeing tree_lock to be a problem for DSS workload either.
* RE: [rfc] lockless pagecache
2005-06-27 19:05 ` Chen, Kenneth W
@ 2005-06-27 19:22 ` Christoph Lameter
2005-06-27 19:42 ` Chen, Kenneth W
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2005-06-27 19:22 UTC (permalink / raw)
To: Chen, Kenneth W
Cc: 'Badari Pulavarty', 'Nick Piggin',
Lincoln Dale, Andrew Morton, linux-kernel, linux-mm
On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> I don't recall seeing tree_lock to be a problem for DSS workload either.
I have seen the tree_lock being a problem a number of times with large
scale NUMA type workloads.
* RE: [rfc] lockless pagecache
2005-06-27 19:22 ` Christoph Lameter
@ 2005-06-27 19:42 ` Chen, Kenneth W
2005-07-05 15:11 ` Sonny Rao
0 siblings, 1 reply; 56+ messages in thread
From: Chen, Kenneth W @ 2005-06-27 19:42 UTC (permalink / raw)
To: 'Christoph Lameter'
Cc: 'Badari Pulavarty', 'Nick Piggin',
Lincoln Dale, Andrew Morton, linux-kernel, linux-mm
Christoph Lameter wrote on Monday, June 27, 2005 12:23 PM
> On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> > I don't recall seeing tree_lock to be a problem for DSS workload either.
>
> I have seen the tree_lock being a problem a number of times with large
> scale NUMA type workloads.
I totally agree! My earlier posts are strictly referring to industry
standard db workloads (OLTP, DSS). I'm not saying it's not a problem
for everyone :-) Obviously you just outlined a few ....
- Ken
* Re: [patch 2] mm: speculative get_page
2005-06-27 14:12 ` [patch 2] mm: speculative get_page William Lee Irwin III
@ 2005-06-28 0:03 ` Nick Piggin
2005-06-28 0:56 ` Nick Piggin
2005-06-28 1:22 ` William Lee Irwin III
0 siblings, 2 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 0:03 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
> On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
>
>>+static inline struct page *page_cache_get_speculative(struct page **pagep)
>>+{
>>+ struct page *page;
>>+
>>+ preempt_disable();
>>+ page = *pagep;
>>+ if (!page)
>>+ goto out_failed;
>>+
>>+ if (unlikely(get_page_testone(page))) {
>>+ /* Picked up a freed page */
>>+ __put_page(page);
>>+ goto out_failed;
>>+ }
>
>
> So you pick up 0->1 refcount transitions.
>
Yep ie. a page that's freed or being freed.
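A rough userspace C11 analogue of this 0->1 detection (hypothetical names, not the kernel's actual get_page_testone: here a free page's refcount sits at 0, so an unconditional increment that observes an old value of 0 means we raced with the free path and must back off):

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace sketch only -- names are hypothetical. */
struct page { atomic_int refcount; };

/* Returns 1 if we picked up a freed page and must bail out. */
static int get_page_saw_free(struct page *page)
{
    /* atomic_fetch_add returns the value *before* the increment. */
    int old = atomic_fetch_add(&page->refcount, 1);
    if (old == 0) {
        /* Raced with free: drop the reference we just took. */
        atomic_fetch_sub(&page->refcount, 1);
        return 1;
    }
    return 0;
}
```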
>
> On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
>
>>+ /*
>>+ * preempt can really be enabled here (only needs to be disabled
>>+ * because page allocation can spin on the elevated refcount, but
>>+ * we don't want to hold a reference on an unrelated page for too
>>+ * long, so keep preempt off until we know we have the right page
>>+ */
>>+
>>+ if (unlikely(PageFreeing(page)) ||
>
>
> SetPageFreeing is only done in shrink_list(), so other pages in the
> buddy bitmaps and/or pagecache pages freed by other methods may not
It is also done by remove_exclusive_swap_page, although that hunk
leaked into a later patch (#5), sorry.
Other methods (eg truncate) don't seem to have an atomicity guarantee
anyway - ie. it is valid to pick up a reference on a page that is
just about to get truncated. PageFreeing is only used when some code
is making an assumption about the number of users of the page.
> be found by this. There's also likely trouble with higher-order pages.
>
There isn't because higher order pages aren't used for pagecache.
>
> On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
>
>>+ unlikely(page != *pagep)) {
>>+ /* Picked up a page being freed, or one that's been reused */
>>+ put_page(page);
>>+ goto out_failed;
>>+ }
>>+ preempt_enable();
>>+
>>+ return page;
>>+
>>+out_failed:
>>+ preempt_enable();
>>+ return NULL;
>>+}
>
>
> page != *pagep won't be reliably tripped unless the pagecache
> modification has the appropriate memory barriers.
>
There are appropriate memory barriers: the radix tree is
modified under the rwlock/spinlock, and this function has
a memory barrier before testing page != *pagep.
> The lockless radix tree lookups are a harder problem than this, and
> the implementation didn't look promising. I have other problems to deal
> with so I'm not going to go very far into this.
>
What's wrong with the lockless radix tree lookups?
> While I agree that locklessness is the right direction for the
> pagecache to go, this RFC seems to have too far to go to use it to
> conclude anything about the subject.
>
You don't seem to have looked enough to conclude anything about it.
--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
2005-06-27 13:17 ` Benjamin LaHaise
@ 2005-06-28 0:32 ` Nick Piggin
2005-06-28 1:26 ` William Lee Irwin III
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 0:32 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Andrew Morton, linux-kernel, linux-mm
Benjamin LaHaise wrote:
> On Mon, Jun 27, 2005 at 06:02:15PM +1000, Nick Piggin wrote:
>
>>However I think for Oracle and others that use shared memory like
>>this, they are probably not doing linear access, so that would be a
>>net loss. I'm not completely sure (I don't have access to real loads
>>at the moment), but I would have thought those guys would have looked
>>into fault ahead if it were a possibility.
>
>
> Shared memory overhead doesn't show up on any of the database benchmarks
> I've seen, as they tend to use huge pages that are locked in memory, and
> thus don't tend to access the page cache at all after ramp up.
>
To be quite honest I don't have any real workloads here that stress
it, however I was told that it is a problem for oracle database. If
there is anyone else who has problems then I'd be interested to hear
them as well.
--
SUSE Labs, Novell Inc.
* Re: [patch 2] mm: speculative get_page
2005-06-28 0:03 ` Nick Piggin
@ 2005-06-28 0:56 ` Nick Piggin
2005-06-28 1:22 ` William Lee Irwin III
1 sibling, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 0:56 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Linux Memory Management
Nick Piggin wrote:
> William Lee Irwin III wrote:
>
>> On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
>>
>>> +static inline struct page *page_cache_get_speculative(struct page
>>> **pagep)
>>> +{
>>> + struct page *page;
>>> +
>>> + preempt_disable();
>>> + page = *pagep;
>>> + if (!page)
>>> + goto out_failed;
>>> +
>>> + if (unlikely(get_page_testone(page))) {
>>> + /* Picked up a freed page */
>>> + __put_page(page);
>>> + goto out_failed;
>>> + }
>>
>>
>>
>> So you pick up 0->1 refcount transitions.
>>
>
> Yep ie. a page that's freed or being freed.
>
Oh, one thing it does need is a check for PageFree(), so it also
picks up 1->2 and other transitions without freeing the free page
if the put()s are done out of order. Maybe that's what you were
alluding to.
I'll add that.
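A rough userspace sketch of what that extra PageFree() check buys (hypothetical names: when two speculative getters race, the second increment observes 1->2 rather than 0->1, so the refcount test alone misses the free page; a flag set by the freeing path closes that window):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace sketch only -- names are hypothetical. */
enum { PG_free_flag = 0x1 };

struct page {
    atomic_int refcount;
    atomic_int flags;
};

static int PageFree(struct page *page)
{
    return atomic_load(&page->flags) & PG_free_flag;
}

static struct page *try_get(struct page *page)
{
    atomic_fetch_add(&page->refcount, 1);
    /* Catches a free page even when our increment saw 1->2, not 0->1. */
    if (PageFree(page)) {
        atomic_fetch_sub(&page->refcount, 1);
        return NULL;
    }
    return page;
}
```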
--
SUSE Labs, Novell Inc.
* Re: [patch 2] mm: speculative get_page
2005-06-28 0:03 ` Nick Piggin
2005-06-28 0:56 ` Nick Piggin
@ 2005-06-28 1:22 ` William Lee Irwin III
2005-06-28 1:42 ` Nick Piggin
1 sibling, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-28 1:22 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
>> SetPageFreeing is only done in shrink_list(), so other pages in the
>> buddy bitmaps and/or pagecache pages freed by other methods may not
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> It is also done by remove_exclusive_swap_page, although that hunk
> leaked into a later patch (#5), sorry.
> Other methods (eg truncate) don't seem to have an atomicity guarantee
> anyway - ie. it is valid to pick up a reference on a page that is
> just about to get truncated. PageFreeing is only used when some code
> is making an assumption about the number of users of the page.
tmpfs
William Lee Irwin III wrote:
>> be found by this. There's also likely trouble with higher-order pages.
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> There isn't because higher order pages aren't used for pagecache.
hugetlbfs
William Lee Irwin III wrote:
>> page != *pagep won't be reliably tripped unless the pagecache
>> modification has the appropriate memory barriers.
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> There are appropriate memory barriers: the radix tree is
> modified uner the rwlock/spinlock, and this function has
> a memory barrier before testing page != *pagep.
Someone else deal with this (paulus? anton? other arch maintainers?).
William Lee Irwin III wrote:
>> The lockless radix tree lookups are a harder problem than this, and
>> the implementation didn't look promising. I have other problems to deal
>> with so I'm not going to go very far into this.
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> What's wrong with the lockless radix tree lookups?
The above is as much as I wanted to go into it. I need to direct my
capacity for the grunt work of devising adversary arguments elsewhere.
William Lee Irwin III wrote:
>> While I agree that locklessness is the right direction for the
>> pagecache to go, this RFC seems to have too far to go to use it to
>> conclude anything about the subject.
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> You don't seem to have looked enough to conclude anything about it.
You requested comments. I made some.
Anyhow, my review has not been comprehensive. I stopped after the first
few things I found that needed fixing. If others could deal with the
rest of this, I'd be much obliged.
-- wli
* Re: [rfc] lockless pagecache
2005-06-28 0:32 ` Nick Piggin
@ 2005-06-28 1:26 ` William Lee Irwin III
0 siblings, 0 replies; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-28 1:26 UTC (permalink / raw)
To: Nick Piggin; +Cc: Benjamin LaHaise, Andrew Morton, linux-kernel, linux-mm
Benjamin LaHaise wrote:
>> Shared memory overhead doesn't show up on any of the database benchmarks
>> I've seen, as they tend to use huge pages that are locked in memory, and
>> thus don't tend to access the page cache at all after ramp up.
On Tue, Jun 28, 2005 at 10:32:51AM +1000, Nick Piggin wrote:
> To be quite honest I don't have any real workloads here that stress
> it, however I was told that it is a problem for oracle database. If
> there is anyone else who has problems then I'd be interested to hear
> them as well.
It's vlm-specific.
-- wli
* Re: [patch 2] mm: speculative get_page
2005-06-28 1:22 ` William Lee Irwin III
@ 2005-06-28 1:42 ` Nick Piggin
2005-06-28 4:06 ` William Lee Irwin III
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 1:42 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>SetPageFreeing is only done in shrink_list(), so other pages in the
>>>buddy bitmaps and/or pagecache pages freed by other methods may not
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>It is also done by remove_exclusive_swap_page, although that hunk
>>leaked into a later patch (#5), sorry.
>>Other methods (eg truncate) don't seem to have an atomicity guarantee
>>anyway - ie. it is valid to pick up a reference on a page that is
>>just about to get truncated. PageFreeing is only used when some code
>>is making an assumption about the number of users of the page.
>
>
> tmpfs
>
Well it switches between page and swap cache, but it seems to just
use the normal pagecache / swapcache functions for that. It could be
that I've got a big hole somewhere, but so far I don't think you've
pointed one out.
>
> William Lee Irwin III wrote:
>
>>>be found by this. There's also likely trouble with higher-order pages.
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>There isn't because higher order pages aren't used for pagecache.
>
>
> hugetlbfs
>
Well what's the trouble with it?
>
> William Lee Irwin III wrote:
>
>>>page != *pagep won't be reliably tripped unless the pagecache
>>>modification has the appropriate memory barriers.
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>There are appropriate memory barriers: the radix tree is
>>modified under the rwlock/spinlock, and this function has
>>a memory barrier before testing page != *pagep.
>
>
> Someone else deal with this (paulus? anton? other arch maintainers?).
>
I know what a memory barrier is and does. You said the
necessary memory barriers aren't in place, so can you deal
with it?
>
> William Lee Irwin III wrote:
>
>>>The lockless radix tree lookups are a harder problem than this, and
>>>the implementation didn't look promising. I have other problems to deal
>>>with so I'm not going to go very far into this.
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>What's wrong with the lockless radix tree lookups?
>
>
> The above is as much as I wanted to go into it. I need to direct my
> capacity for the grunt work of devising adversary arguments elsewhere.
>
I don't think there is anything wrong with it. I would be very
keen to see real adversary arguments elsewhere though.
>
> William Lee Irwin III wrote:
>
>>>While I agree that locklessness is the right direction for the
>>>pagecache to go, this RFC seems to have too far to go to use it to
>>>conclude anything about the subject.
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>You don't seem to have looked enough to conclude anything about it.
>
>
> You requested comments. I made some.
>
Well yeah thanks, you did point out a thinko I made, and that was very
helpful and I value any time you spend looking at it. But just saying
"this is wrong, that won't work, that's crap, ergo the concept is
useless" without finding anything specifically wrong is not very
constructive.
--
SUSE Labs, Novell Inc.
* Re: [patch 2] mm: speculative get_page
2005-06-28 1:42 ` Nick Piggin
@ 2005-06-28 4:06 ` William Lee Irwin III
2005-06-28 4:50 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-28 4:06 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
>> tmpfs
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> Well it switches between page and swap cache, but it seems to just
> use the normal pagecache / swapcache functions for that. It could be
> that I've got a big hole somewhere, but so far I don't think you've
> pointed one out.
Its radix tree movement bypasses the page allocator.
William Lee Irwin III wrote:
>> hugetlbfs
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> Well what's the trouble with it?
hugetlb reallocation doesn't go through the page allocator either.
William Lee Irwin III wrote:
>> Someone else deal with this (paulus? anton? other arch maintainers?).
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> I know what a memory barrier is and does, so you said the
> necessary memory barriers aren't in place, so can you deal
> with it?
spin_unlock() does not imply a memory barrier.
William Lee Irwin III wrote:
>> The above is as much as I wanted to go into it. I need to direct my
>> capacity for the grunt work of devising adversary arguments elsewhere.
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> I don't think there is anything wrong with it. I would be very
> keen to see real adversary arguments elsewhere though.
They take time to construct.
William Lee Irwin III wrote:
>> You requested comments. I made some.
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> Well yeah thanks, you did point out a thinko I made, and that was very
> helpful and I value any time you spend looking at it. But just saying
> "this is wrong, that won't work, that's crap, ergo the concept is
> useless" without finding anything specifically wrong is not very
> constructive.
I said nothing of that kind, and I did point out specific things.
The limitation of time/effort is directly related to the nature of the
responses.
-- wli
* Re: [patch 2] mm: speculative get_page
2005-06-28 4:06 ` William Lee Irwin III
@ 2005-06-28 4:50 ` Nick Piggin
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 4:50 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>Well it switches between page and swap cache, but it seems to just
>>use the normal pagecache / swapcache functions for that. It could be
>>that I've got a big hole somewhere, but so far I don't think you've
>>pointed one out.
>>
>
>Its radix tree movement bypasses the page allocator.
>
>
That should be fine. Net result is the page has been looked up.
What kind of atomicity did you imagine the locked find_get_page
provides that I haven't?
>
>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>Well what's the trouble with it?
>>
>
>hugetlb reallocation doesn't go through the page allocator either.
>
>
Ditto. Net result is that the page has been looked up. The
speculative get page will recheck that it is in the radix
tree after taking a reference, and if so then it assumes that
reference to be valid.
What is the hangup with the page allocator?
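The speculative-reference scheme described above can be sketched in userspace C with C11 atomics. This is an illustrative analogue, not the actual patch code: `page_like`, `speculative_get`, and `get_ref_unless_zero` are stand-in names for the kernel's page refcounting, and the "radix tree slot" is reduced to a single atomic pointer.

```c
#include <stdatomic.h>
#include <stddef.h>

struct page_like {
    atomic_int count;   /* analogue of the page refcount */
};

/* Take a reference only if the count is already non-zero
 * (userspace analogue of a get_page_unless_zero-style helper). */
static int get_ref_unless_zero(struct page_like *p)
{
    int c = atomic_load(&p->count);
    while (c != 0) {
        if (atomic_compare_exchange_weak(&p->count, &c, c + 1))
            return 1;
    }
    return 0;
}

/* Speculative lookup: read the slot without a lock, pin the object,
 * then recheck that the slot still holds the same object.  If it
 * changed underneath us, drop the reference and report failure so
 * the caller can retry. */
static struct page_like *speculative_get(struct page_like *_Atomic *slot)
{
    struct page_like *p = atomic_load(slot);
    if (p == NULL || !get_ref_unless_zero(p))
        return NULL;
    if (atomic_load(slot) != p) {        /* raced with removal/reuse */
        atomic_fetch_sub(&p->count, 1);
        return NULL;
    }
    return p;                            /* reference is now valid */
}
```

The key property, as in the thread: once the recheck passes, the page was in the tree at some point after the reference was taken, so the reference is treated as valid.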
>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>I know what a memory barrier is and does, so you said the
>>necessary memory barriers aren't in place, so can you deal
>>with it?
>>
>
>spin_unlock() does not imply a memory barrier.
>
>
Intriguing...
>
>William Lee Irwin III wrote:
>
>>>The above is as much as I wanted to go into it. I need to direct my
>>>capacity for the grunt work of devising adversary arguments elsewhere.
>>>
>
>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>I don't think there is anything wrong with it. I would be very
>>keen to see real adversary arguments elsewhere though.
>>
>
>They take time to construct.
>
>
I can imagine. I don't think I've seen one yet.
>
>William Lee Irwin III wrote:
>
>>>You requested comments. I made some.
>>>
>
>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>Well yeah thanks, you did point out a thinko I made, and that was very
>>helpful and I value any time you spend looking at it. But just saying
>>"this is wrong, that won't work, that's crap, ergo the concept is
>>useless" without finding anything specifically wrong is not very
>>constructive.
>>
>
>I said nothing of that kind, and I did point out specific things.
>
>
You said "this RFC seems to have too far to go to use it to
conclude anything about the subject", after failing to find
any holes in the actual implementation.
And (paraphrasing) "this needs memory barriers but I won't say
where or why, somebody else deal with it" doesn't count as a
specific thing.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [patch 2] mm: speculative get_page, Re: [patch 2] mm: speculative get_page
2005-06-28 4:50 ` Nick Piggin
@ 2005-06-28 5:08 ` David S. Miller, Nick Piggin
2005-06-28 5:34 ` Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: David S. Miller, Nick Piggin @ 2005-06-28 5:08 UTC (permalink / raw)
To: nickpiggin; +Cc: wli, linux-kernel, linux-mm
> William Lee Irwin III wrote:
>
> >On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> >
> >spin_unlock() does not imply a memory barrier.
> >
>
> Intriguing...
BTW, I disagree with this assertion. spin_unlock() does imply a
memory barrier.
All memory operations before the release of the lock must execute
before the lock release memory operation is globally visible.
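The rule stated here can be illustrated in userspace with C11 release/acquire semantics (a sketch, not kernel code — the variable names are made up): an unlock is a store to the lock word with release semantics, so every store before it must be visible once that store is seen.

```c
#include <stdatomic.h>

static int payload;            /* data protected by the "lock" */
static atomic_int locked;      /* stands in for the spinlock word */

static void unlock_after_store(int v)
{
    payload = v;               /* plain store inside the critical section */
    /* Release semantics: the payload store above must be visible
     * before this store to the lock word is globally visible. */
    atomic_store_explicit(&locked, 0, memory_order_release);
}

static int observe(void)
{
    /* Acquire pairs with the release above: once we see the lock
     * word cleared, we are guaranteed to see the payload store. */
    if (atomic_load_explicit(&locked, memory_order_acquire) == 0)
        return payload;
    return -1;
}
```

Note that release ordering is one-way: it constrains prior accesses, not later ones — which is exactly the nuance debated later in the thread for ia64.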
* Re: [patch 2] mm: speculative get_page
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
@ 2005-06-28 5:34 ` Nick Piggin
2005-06-28 14:19 ` William Lee Irwin III
2005-06-28 21:32 ` Jesse Barnes
2 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 5:34 UTC (permalink / raw)
To: David S. Miller; +Cc: wli, linux-kernel, linux-mm
David S. Miller wrote:
>From: Nick Piggin <nickpiggin@yahoo.com.au>
>Subject: Re: [patch 2] mm: speculative get_page
>Date: Tue, 28 Jun 2005 14:50:31 +1000
>
>
>>William Lee Irwin III wrote:
>>
>>
>>>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>>>
>>>spin_unlock() does not imply a memory barrier.
>>>
>>>
>>Intriguing...
>>
>
>BTW, I disagree with this assertion. spin_unlock() does imply a
>memory barrier.
>
>All memory operations before the release of the lock must execute
>before the lock release memory operation is globally visible.
>
Yes, it appears that way from looking at a sample set of arch
code too (ie. those without strictly ordered stores put an
explicit barrier there).
I've always understood spin_unlock to imply a barrier.
* Re: [patch 2] mm: speculative get_page
2005-06-27 6:32 ` [patch 2] mm: speculative get_page Nick Piggin
2005-06-27 6:33 ` [patch 3] radix tree: lookup_slot Nick Piggin
2005-06-27 14:12 ` [patch 2] mm: speculative get_page William Lee Irwin III
@ 2005-06-28 12:45 ` Andy Whitcroft
2005-06-28 13:16 ` Nick Piggin
2 siblings, 1 reply; 56+ messages in thread
From: Andy Whitcroft @ 2005-06-28 12:45 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
Nick Piggin wrote:
> #define PG_free 20 /* Page is on the free lists */
> +#define PG_freeing 21 /* PG_refcount about to be freed */
Wow, this needs two new page bits. That might be an ongoing problem.
There are only 24 of these puppies and this takes us to just two
remaining. Do we really need _two_ to track free?
One obvious area of overlap might be the PG_nosave_free which seems to
be set on free pages for software suspend. Perhaps that and PG_free
will be equivalent in intent (though maintained differently) and allow
us to recover a bit?
There are a couple of bits which imply ownership, such as PG_slab,
PG_swapcache and PG_reserved, which to my mind are all exclusive.
Perhaps those plus PG_free could be combined into an owner field. I
am unsure if PG_freeing can be 'backed out'; if not, it may also combine?
Mumble ...
-apw
* Re: [patch 2] mm: speculative get_page
2005-06-28 12:45 ` Andy Whitcroft
@ 2005-06-28 13:16 ` Nick Piggin
2005-06-28 16:02 ` Dave Hansen
2005-06-29 16:31 ` Pavel Machek
0 siblings, 2 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 13:16 UTC (permalink / raw)
To: Andy Whitcroft; +Cc: linux-kernel, Linux Memory Management
Andy Whitcroft wrote:
> Nick Piggin wrote:
>
>
>> #define PG_free 20 /* Page is on the free lists */
>>+#define PG_freeing 21 /* PG_refcount about to be freed */
>
>
> Wow, this needs two new page bits. That might be an ongoing problem.
> There are only 24 of these puppies and this takes us to just two
> remaining. Do we really need _two_ to track free?
>
Yeah they are kind of different. PG_freeing isn't a really good
description for it. Basically it is set to guarantee a page won't
gain any more references (real, not speculative) than what page_count
returns.
I'm in the process of recovering one of those with an earlier set
of patches (PG_reserved).
> One obvious area of overlap might be the PG_nosave_free which seems to
> be set on free pages for software suspend. Perhaps that and PG_free
> will be equivalent in intent (though maintained differently) and allow
> us to recover a bit?
>
PG_free can't be shared with anything else, unfortunately. It doesn't
need to be an atomic flag though, so it can be an "impossible"
combination of flags.
> There are a couple of bits which imply ownership, such as PG_slab,
> PG_swapcache and PG_reserved, which to my mind are all exclusive.
> Perhaps those plus PG_free could be combined into an owner field. I
> am unsure if PG_freeing can be 'backed out'; if not, it may also combine?
>
I think there are a few ways that bits can be reclaimed if we
start digging. swsusp uses 2 which seems excessive though may be
fully justified. Can PG_private be replaced by (!page->private)?
Can filesystems easily stop using PG_checked?
OK, I'll cut the hand-waving: PG_free used to be derived from
PG_private && page_count == 0, so it could instead be
PG_active && !PG_lru quite easily AFAIKS. If this patchset ever
looks like being merged you can take me up on it ;)
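The "impossible combination" trick Nick describes can be sketched in plain C (bit positions here are illustrative, not the real page-flags layout): a state that never occurs in normal operation is reused to mean "free", so no new bit is consumed.

```c
/* Illustrative bit positions only, not the kernel's layout. */
enum {
    PG_lru    = 1u << 5,
    PG_active = 1u << 6,
};

/* In normal operation an active page is always on the LRU, so
 * "active but not LRU" never occurs naturally and can be reused to
 * mean "free".  This need not be an atomic flag, matching Nick's
 * observation that PG_free doesn't require atomicity. */
static int page_is_free(unsigned int flags)
{
    return (flags & PG_active) && !(flags & PG_lru);
}

static unsigned int mark_free(unsigned int flags)
{
    return (flags | PG_active) & ~PG_lru;
}
```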
--
SUSE Labs, Novell Inc.
* Re: [patch 2] mm: speculative get_page
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
2005-06-28 5:34 ` Nick Piggin
@ 2005-06-28 14:19 ` William Lee Irwin III
2005-06-28 15:43 ` Nick Piggin
2005-06-28 21:32 ` Jesse Barnes
2 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-28 14:19 UTC (permalink / raw)
To: David S. Miller; +Cc: nickpiggin, linux-kernel, linux-mm
On Mon, Jun 27, 2005 at 10:08:27PM -0700, David S. Miller wrote:
> BTW, I disagree with this assertion. spin_unlock() does imply a
> memory barrier.
> All memory operations before the release of the lock must execute
> before the lock release memory operation is globally visible.
The affected architectures have only recently changed in this regard.
ppc64 was the most notable case, where it had a barrier for MMIO
(eieio) but not a general memory barrier. PA-RISC likewise formerly had
no such barrier and was a more normal case, with no barrier whatsoever.
Both have since been altered, ppc64 acquiring a heavyweight sync
(arch nomenclature), and PA-RISC acquiring 2 memory barriers.
-- wli
* Re: [patch 2] mm: speculative get_page
2005-06-28 14:19 ` William Lee Irwin III
@ 2005-06-28 15:43 ` Nick Piggin
2005-06-28 17:01 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 15:43 UTC (permalink / raw)
To: William Lee Irwin III
Cc: David S. Miller, linux-kernel, linux-mm, Anton Blanchard
William Lee Irwin III wrote:
> On Mon, Jun 27, 2005 at 10:08:27PM -0700, David S. Miller wrote:
>
>>BTW, I disagree with this assertion. spin_unlock() does imply a
>>memory barrier.
>>All memory operations before the release of the lock must execute
>>before the lock release memory operation is globally visible.
>
>
> The affected architectures have only recently changed in this regard.
> ppc64 was the most notable case, where it had a barrier for MMIO
> (eieio) but not a general memory barrier. PA-RISC likewise formerly had
> no such barrier and was a more normal case, with no barrier whatsoever.
>
> Both have since been altered, ppc64 acquiring a heavyweight sync
> (arch nomenclature), and PA-RISC acquiring 2 memory barriers.
>
Parisc looks like it's doing the extra memory barrier to "be safe" :P
Re the ppc64 changeset: It looks to me like lwsync is the lightweight
sync, and eieio is just referred to as the lightER (than sync) weight
sync. What's more, it looks like eieio does order stores to system
memory and is not just an MMIO barrier.
But nit picking aside, is it true that we need a load barrier before
unlock? (store barrier I agree with) The ppc64 changeset in question
indicates yes, but I can't quite work out why. There are noises in the
archives about this, but I didn't pinpoint a conclusion...
* Re: [patch 2] mm: speculative get_page
2005-06-28 13:16 ` Nick Piggin
@ 2005-06-28 16:02 ` Dave Hansen
2005-06-29 16:31 ` Pavel Machek
2005-06-29 16:31 ` Pavel Machek
1 sibling, 1 reply; 56+ messages in thread
From: Dave Hansen @ 2005-06-28 16:02 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andy Whitcroft, linux-kernel, Linux Memory Management
On Tue, 2005-06-28 at 23:16 +1000, Nick Piggin wrote:
> I think there are a a few ways that bits can be reclaimed if we
> start digging. swsusp uses 2 which seems excessive though may be
> fully justified.
They (swsusp) actually don't need the bits at all until suspend-time.
Somebody coded up a "dynamic page flags" patch that let them kill
the page->flags use, but it didn't really go anywhere. Might be nice if
someone dug it up. I probably have a copy somewhere.
-- Dave
* Re: [patch 2] mm: speculative get_page
2005-06-28 15:43 ` Nick Piggin
@ 2005-06-28 17:01 ` Christoph Lameter
2005-06-28 23:10 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2005-06-28 17:01 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, David S. Miller, linux-kernel, linux-mm,
Anton Blanchard
On Wed, 29 Jun 2005, Nick Piggin wrote:
> But nit picking aside, is it true that we need a load barrier before
> unlock? (store barrier I agree with) The ppc64 changeset in question
> indicates yes, but I can't quite work out why. There are noises in the
> archives about this, but I didn't pinpoint a conclusion...
A spinlock may be used to read a consistent set of variables. If load
operations would be moved below the spin_unlock then one may get values
that have been updated after another process acquired the spinlock.
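Christoph's point can be sketched with a userspace mutex (names are illustrative): both loads must complete before the unlock, otherwise a snapshot taken under the lock could mix values from two different writers.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int a, b;               /* invariant while unlocked: a == b */

static void update_pair(int v)
{
    pthread_mutex_lock(&lock);
    a = v;
    b = v;
    pthread_mutex_unlock(&lock);
}

/* Returns 1 if the snapshot is consistent.  If the loads of a and b
 * were allowed to float below the unlock, another thread could run
 * update_pair() in between and we could read a from one update and
 * b from a later one. */
static int snapshot_pair(int *pa, int *pb)
{
    pthread_mutex_lock(&lock);
    *pa = a;
    *pb = b;
    pthread_mutex_unlock(&lock);
    return *pa == *pb;
}
```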
* Re: [patch 2] mm: speculative get_page
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
2005-06-28 5:34 ` Nick Piggin
2005-06-28 14:19 ` William Lee Irwin III
@ 2005-06-28 21:32 ` Jesse Barnes
2005-06-28 22:17 ` Christoph Lameter
2 siblings, 1 reply; 56+ messages in thread
From: Jesse Barnes @ 2005-06-28 21:32 UTC (permalink / raw)
To: David S. Miller; +Cc: nickpiggin, wli, linux-kernel, linux-mm
On Monday, June 27, 2005 10:08 pm, David S. Miller wrote:
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Subject: Re: [patch 2] mm: speculative get_page
> Date: Tue, 28 Jun 2005 14:50:31 +1000
>
> > William Lee Irwin III wrote:
> > >On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> > >
> > >spin_unlock() does not imply a memory barrier.
> >
> > Intriguing...
>
> BTW, I disagree with this assertion. spin_unlock() does imply a
> memory barrier.
>
> All memory operations before the release of the lock must execute
> before the lock release memory operation is globally visible.
On ia64 at least, the unlock is only a one way barrier. The store to
release the lock uses release semantics (since the lock is declared
volatile), which implies that prior stores are visible before the
unlock occurs, but subsequent accesses can 'float up' above the unlock.
See http://www.gelato.unsw.edu.au/linux-ia64/0304/5122.html for some
more details.
Jesse
* Re: [patch 2] mm: speculative get_page
2005-06-28 21:32 ` Jesse Barnes
@ 2005-06-28 22:17 ` Christoph Lameter
0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2005-06-28 22:17 UTC (permalink / raw)
To: Jesse Barnes; +Cc: David S. Miller, nickpiggin, wli, linux-kernel, linux-mm
On Tue, 28 Jun 2005, Jesse Barnes wrote:
> On ia64 at least, the unlock is only a one way barrier. The store to
> release the lock uses release semantics (since the lock is declared
> volatile), which implies that prior stores are visible before the
> unlock occurs, but subsequent accesses can 'float up' above the unlock.
> See http://www.gelato.unsw.edu.au/linux-ia64/0304/5122.html for some
> more details.
The manual talks about "accesses" not stores. So this applies to loads and
stores. Subsequent accesses can float up but only accesses prior to the
instruction with release semantics (like an unlock) are guaranteed to be
visible.
* Re: [patch 2] mm: speculative get_page
2005-06-28 17:01 ` Christoph Lameter
@ 2005-06-28 23:10 ` Nick Piggin
0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 23:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: William Lee Irwin III, David S. Miller, linux-kernel, linux-mm,
Anton Blanchard
Christoph Lameter wrote:
> On Wed, 29 Jun 2005, Nick Piggin wrote:
>
>
>>But nit picking aside, is it true that we need a load barrier before
>>unlock? (store barrier I agree with) The ppc64 changeset in question
>>indicates yes, but I can't quite work out why. There are noises in the
>>archives about this, but I didn't pinpoint a conclusion...
>
>
> A spinlock may be used to read a consistent set of variables. If load
> operations would be moved below the spin_unlock then one may get values
> that have been updated after another process acquired the spinlock.
>
>
Of course, thanks. I was only thinking of the case where loads
were moved from the unlocked into the locked section.
* Re: [rfc] lockless pagecache
2005-06-27 6:29 [rfc] lockless pagecache Nick Piggin
` (2 preceding siblings ...)
2005-06-27 7:46 ` [rfc] lockless pagecache Andrew Morton
@ 2005-06-29 10:49 ` Hirokazu Takahashi
2005-06-29 11:38 ` Nick Piggin
3 siblings, 1 reply; 56+ messages in thread
From: Hirokazu Takahashi @ 2005-06-29 10:49 UTC (permalink / raw)
To: nickpiggin; +Cc: linux-kernel, linux-mm
Hi Nick,
Your patches improve performance when lots of processes are
accessing the same file at the same time, right?
If so, I think we can introduce multiple radix-trees instead,
enhancing each inode so that it can have two or more radix-trees
in it, to reduce contention when traversing the trees.
Some mechanism is needed to decide which radix-tree each page
should be in, and how many radix-trees should be prepared.
It seems simple and effective.
What do you think?
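One possible "decision mechanism" is simply striping the file across the per-inode trees. A hypothetical sketch of the index-to-tree mapping (NR_TREES and the chunk size are made-up values, not from any patch):

```c
#define NR_TREES    4u          /* hypothetical trees per inode */
#define CHUNK_SHIFT 10          /* 1024 pages (4MB with 4K pages) per chunk */

/* Stripe the file across the trees in large chunks so that
 * neighbouring pages stay in the same tree (gang lookups mostly
 * touch one lock), while accesses spread across a big file still
 * spread their lock traffic over all the trees. */
static unsigned int tree_for_index(unsigned long index)
{
    return (unsigned int)((index >> CHUNK_SHIFT) % NR_TREES);
}
```

A fixed stripe like this avoids per-page heuristics, though — as Nick replies below the quoted text in this thread — any static split has workloads it fits badly.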
> Now the tree_lock was recently(ish) converted to an rwlock, precisely
> for such a workload and that was apparently very successful. However
> an rwlock is significantly heavier, and as machines get faster and
> bigger, rwlocks (and any locks) will tend to use more and more of Paul
> McKenney's toilet paper due to cacheline bouncing.
>
> So in the interest of saving some trees, let's try it without any locks.
>
> First I'll put up some numbers to get you interested - of a 64-way Altix
> with 64 processes each read-faulting in their own 512MB part of a 32GB
> file that is preloaded in pagecache (with the proper NUMA memory
> allocation).
Thanks,
Hirokazu Takahashi.
* Re: [rfc] lockless pagecache
2005-06-29 10:49 ` Hirokazu Takahashi
@ 2005-06-29 11:38 ` Nick Piggin
2005-06-30 3:32 ` Hirokazu Takahashi
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-29 11:38 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: linux-kernel, linux-mm
Hirokazu Takahashi wrote:
> Hi Nick,
>
Hi,
> Your patches improve the performance if lots of processes are
> accessing the same file at the same time, right?
>
Yes.
> If so, I think we can introduce multiple radix-trees instead,
> which enhance each inode to be able to have two or more radix-trees
> in it to avoid the race condition traversing the trees.
> Some decision mechanism is needed which radix-tree each page
> should be in, how many radix-tree should be prepared.
>
> It seems to be simple and effective.
>
> What do you think?
>
Sure it is a possibility.
I don't think you could call it effective like a completely
lockless version is effective. You might take more locks during
gang lookups, you may have a lot of ugly and not-always-working
heuristics (hey, my app goes really fast if it spreads accesses
over a 1GB file, but falls on its face with a 10MB one). You
might get increased cache footprints for common operations.
I mainly did the patches for a bit of fun rather than to address
a particular problem with a real workload and as such I won't be
pushing to get them in the kernel for the time being.
--
SUSE Labs, Novell Inc.
* Re: [patch 2] mm: speculative get_page
2005-06-28 13:16 ` Nick Piggin
2005-06-28 16:02 ` Dave Hansen
@ 2005-06-29 16:31 ` Pavel Machek
1 sibling, 0 replies; 56+ messages in thread
From: Pavel Machek @ 2005-06-29 16:31 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andy Whitcroft, linux-kernel, Linux Memory Management
Hi!
> >There are a couple of bits which imply ownership such as PG_slab,
> >PG_swapcache and PG_reserved which to my mind are all exclusive.
> >Perhaps those plus PG_free could be combined into an owner field. I
> >am unsure if PG_freeing can be 'backed out'; if not, it may also combine?
>
> I think there are a few ways that bits can be reclaimed if we
> start digging. swsusp uses 2 which seems excessive though may be
> fully justified. Can PG_private be replaced by (!page->private)?
> Can filesystems easily stop using PG_checked?
It is possible that swsusp could reduce its bit usage... Current stuff
works, but probably does not need strong atomicity guarantees, and
could use some bit combination...
Pavel
--
teflon -- maybe it is a trademark, but it should not be.
* Re: [patch 2] mm: speculative get_page
2005-06-28 16:02 ` Dave Hansen
@ 2005-06-29 16:31 ` Pavel Machek
2005-06-29 18:43 ` Dave Hansen
0 siblings, 1 reply; 56+ messages in thread
From: Pavel Machek @ 2005-06-29 16:31 UTC (permalink / raw)
To: Dave Hansen
Cc: Nick Piggin, Andy Whitcroft, linux-kernel, Linux Memory Management
Hi!
> > I think there are a few ways that bits can be reclaimed if we
> > start digging. swsusp uses 2 which seems excessive though may be
> > fully justified.
>
> > They (swsusp) actually don't need the bits at all until suspend-time.
> > Somebody coded up a "dynamic page flags" patch that let them kill
> the page->flags use, but it didn't really go anywhere. Might be nice if
> someone dug it up. I probably have a copy somewhere.
Unfortunately that patch was rather ugly :-(.
Pavel
* Re: [patch 2] mm: speculative get_page
2005-06-29 16:31 ` Pavel Machek
@ 2005-06-29 18:43 ` Dave Hansen
2005-06-29 21:22 ` Pavel Machek
0 siblings, 1 reply; 56+ messages in thread
From: Dave Hansen @ 2005-06-29 18:43 UTC (permalink / raw)
To: Pavel Machek
Cc: Nick Piggin, Andy Whitcroft, linux-kernel, Linux Memory Management
On Wed, 2005-06-29 at 18:31 +0200, Pavel Machek wrote:
> > > I think there are a few ways that bits can be reclaimed if we
> > > start digging. swsusp uses 2 which seems excessive though may be
> > > fully justified.
> >
> > > They (swsusp) actually don't need the bits at all until suspend-time.
> > > Somebody coded up a "dynamic page flags" patch that let them kill
> > the page->flags use, but it didn't really go anywhere. Might be nice if
> > someone dug it up. I probably have a copy somewhere.
>
> Unfortunately that patch was rather ugly :-(.
Do you think the idea was ugly, or just the implementation? Is there
something that you'd rather see?
-- Dave
* Re: [patch 2] mm: speculative get_page
2005-06-29 18:43 ` Dave Hansen
@ 2005-06-29 21:22 ` Pavel Machek
0 siblings, 0 replies; 56+ messages in thread
From: Pavel Machek @ 2005-06-29 21:22 UTC (permalink / raw)
To: Dave Hansen
Cc: Nick Piggin, Andy Whitcroft, linux-kernel, Linux Memory Management
Hi!
> > > > I think there are a few ways that bits can be reclaimed if we
> > > > start digging. swsusp uses 2 which seems excessive though may be
> > > > fully justified.
> > >
> > > > They (swsusp) actually don't need the bits at all until suspend-time.
> > > > Somebody coded up a "dynamic page flags" patch that let them kill
> > > the page->flags use, but it didn't really go anywhere. Might be nice if
> > > someone dug it up. I probably have a copy somewhere.
> >
> > Unfortunately that patch was rather ugly :-(.
>
> Do you think the idea was ugly, or just the implementation? Is there
> something that you'd rather see?
Well, the implementation was ugly and the idea was unnecessary because we
still had bits left.
We could spare bits for swsusp by defining "PageReserved | PageLocked
=> PageNosave" etc.... simply by choosing some otherwise unused
combinations. swsusp is not performance critical...
Pavel
--
Boycott Kodak -- for their patent abuse against Java.
* Re: [rfc] lockless pagecache
2005-06-29 11:38 ` Nick Piggin
@ 2005-06-30 3:32 ` Hirokazu Takahashi
0 siblings, 0 replies; 56+ messages in thread
From: Hirokazu Takahashi @ 2005-06-30 3:32 UTC (permalink / raw)
To: nickpiggin; +Cc: linux-kernel, linux-mm
Hi,
> > Your patches improve the performance if lots of processes are
> > accessing the same file at the same time, right?
> >
>
> Yes.
>
> > If so, I think we can introduce multiple radix-trees instead,
> > which enhance each inode to be able to have two or more radix-trees
> > in it to avoid the race condition traversing the trees.
> > Some decision mechanism is needed which radix-tree each page
> > should be in, how many radix-tree should be prepared.
> >
> > It seems to be simple and effective.
> >
> > What do you think?
> >
>
> Sure it is a possibility.
>
> I don't think you could call it effective like a completely
> lockless version is effective. You might take more locks during
> gang lookups, you may have a lot of ugly and not-always-working
> heuristics (hey, my app goes really fast if it spreads accesses
> over a 1GB file, but falls on its face with a 10MB one). You
> might get increased cache footprints for common operations.
I guess in most practical cases it would be enough simply to split
a huge file into equal-sized pieces and put each of them in its
associated radix-tree.
I also find your approach interesting.
> I mainly did the patches for a bit of fun rather than to address
> a particular problem with a real workload and as such I won't be
> pushing to get them in the kernel for the time being.
I see.
If you don't mind, I propose another idea: a seqlock seems to make
your code much simpler, though I'm not sure whether it works well
under heavy load. The code would become stable without the tricks,
which otherwise make the VM hard to enhance in the future.
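For reference, the seqlock idea mentioned above can be sketched in userspace C11 (a minimal illustration; the kernel's seqlock differs in barrier detail): writers bump a counter to odd before updating and back to even after, and readers retry whenever the counter was odd or changed across their reads.

```c
#include <stdatomic.h>

static atomic_uint seq;        /* even: quiescent, odd: write in progress */
static int a, b;               /* data protected by the seqlock */

static void write_pair(int v)
{
    atomic_fetch_add(&seq, 1); /* becomes odd: writer active */
    a = v;
    b = v;
    atomic_fetch_add(&seq, 1); /* even again: data stable */
}

static int read_pair(int *pa, int *pb)
{
    unsigned int s1, s2;
    do {
        s1 = atomic_load(&seq);
        *pa = a;
        *pb = b;
        s2 = atomic_load(&seq);
    } while ((s1 & 1) || s1 != s2);   /* retry if a writer interfered */
    return 1;
}
```

Readers never block writers, which is attractive for lookups, but under heavy write load readers can livelock retrying — the concern raised above.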
Thanks,
Hirokazu Takahashi.
* Re: [rfc] lockless pagecache
2005-06-27 19:42 ` Chen, Kenneth W
@ 2005-07-05 15:11 ` Sonny Rao
2005-07-05 15:31 ` Martin J. Bligh
0 siblings, 1 reply; 56+ messages in thread
From: Sonny Rao @ 2005-07-05 15:11 UTC (permalink / raw)
To: Chen, Kenneth W
Cc: 'Christoph Lameter', 'Badari Pulavarty',
'Nick Piggin',
Lincoln Dale, Andrew Morton, linux-kernel, linux-mm
On Mon, Jun 27, 2005 at 12:42:44PM -0700, Chen, Kenneth W wrote:
> Christoph Lameter wrote on Monday, June 27, 2005 12:23 PM
> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
> >
> > I have seen the tree_lock being a problem a number of times with large
> > scale NUMA type workloads.
>
> I totally agree! My earlier posts are strictly referring to industry
> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
> for everyone :-) Obviously you just outlined a few ....
I'm a bit late to the party here (was gone on vacation), but I do have
profiles from DSS workloads using page-cache rather than O_DIRECT and
I do see spin_lock_irq() in the profiles which I'm pretty certain are
locks spinning for access to the radix_tree. I'll talk about it a bit
more up in Ottawa but here's the top 5 on my profile (sorry don't have
the number of ticks at the moment):
1. dedicated_idle (waiting for I/O)
2. __copy_tofrom_user
3. radix_tree_delete
4. _spin_lock_irq
5. __find_get_block
So, yes, if the page-cache is used in a DSS workload then one will see
the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
NUMA factor.
Sonny
* Re: [rfc] lockless pagecache
2005-07-05 15:11 ` Sonny Rao
@ 2005-07-05 15:31 ` Martin J. Bligh
2005-07-05 15:37 ` Sonny Rao
0 siblings, 1 reply; 56+ messages in thread
From: Martin J. Bligh @ 2005-07-05 15:31 UTC (permalink / raw)
To: Sonny Rao, Chen, Kenneth W
Cc: 'Christoph Lameter', 'Badari Pulavarty',
'Nick Piggin',
Lincoln Dale, Andrew Morton, linux-kernel, linux-mm
>> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
>> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
>> >
>> > I have seen the tree_lock being a problem a number of times with large
>> > scale NUMA type workloads.
>>
>> I totally agree! My earlier posts are strictly referring to industry
>> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
>> for everyone :-) Obviously you just outlined a few ....
>
> I'm a bit late to the party here (was gone on vacation), but I do have
> profiles from DSS workloads using page-cache rather than O_DIRECT and
> I do see spin_lock_irq() in the profiles which I'm pretty certain are
> locks spinning for access to the radix_tree. I'll talk about it a bit
> more up in Ottawa but here's the top 5 on my profile (sorry don't have
> the number of ticks at the moment):
>
> 1. dedicated_idle (waiting for I/O)
> 2. __copy_tofrom_user
> 3. radix_tree_delete
> 4. _spin_lock_irq
> 5. __find_get_block
>
> So, yes, if the page-cache is used in a DSS workload then one will see
> the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
> NUMA factor.
The easiest way to confirm the spin-lock thing is to recompile with
CONFIG_SPINLINE, and take a new profile, then diff the two ...
M.
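[Editor's note: a rough sketch of the comparison Martin describes. The
tick counts and file names below are invented; real data would come from
readprofile(1) runs before and after rebuilding with CONFIG_SPINLINE,
which inlines the spinlocks so spinning time is charged to the callers.]

```shell
# Illustrative only: sample profiles, not real measurements.
cat > profile-nospinline.txt <<'EOF'
1200 __copy_tofrom_user
 900 _spin_lock_irq
 300 radix_tree_delete
EOF
cat > profile-spinline.txt <<'EOF'
1200 __copy_tofrom_user
 700 radix_tree_delete
 500 __find_get_block
EOF
# Symbols that appear or grow on the "+" side inherited the spinning
# time that was previously lumped into _spin_lock_irq.
# (diff exits non-zero when the files differ; that's expected here.)
diff -u profile-nospinline.txt profile-spinline.txt || true
```

Here the _spin_lock_irq ticks vanish and reappear under the callers,
pointing at which code paths were actually contending.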
* Re: [rfc] lockless pagecache
2005-07-05 15:31 ` Martin J. Bligh
@ 2005-07-05 15:37 ` Sonny Rao
0 siblings, 0 replies; 56+ messages in thread
From: Sonny Rao @ 2005-07-05 15:37 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Chen, Kenneth W, 'Christoph Lameter',
'Badari Pulavarty', 'Nick Piggin',
Lincoln Dale, Andrew Morton, linux-kernel, linux-mm
On Tue, Jul 05, 2005 at 08:31:40AM -0700, Martin J. Bligh wrote:
> >> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> >> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
> >> >
> >> > I have seen the tree_lock being a problem a number of times with large
> >> > scale NUMA type workloads.
> >>
> >> I totally agree! My earlier posts are strictly referring to industry
> >> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
> >> for everyone :-) Obviously you just outlined a few ....
> >
> > I'm a bit late to the party here (was gone on vacation), but I do have
> > profiles from DSS workloads using page-cache rather than O_DIRECT and
> > I do see spin_lock_irq() in the profiles which I'm pretty certain are
> > locks spinning for access to the radix_tree. I'll talk about it a bit
> > more up in Ottawa but here's the top 5 on my profile (sorry don't have
> > the number of ticks at the moment):
> >
> > 1. dedicated_idle (waiting for I/O)
> > 2. __copy_tofrom_user
> > 3. radix_tree_delete
> > 4. _spin_lock_irq
> > 5. __find_get_block
> >
> > So, yes, if the page-cache is used in a DSS workload then one will see
> > the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
> > NUMA factor.
>
> The easiest way to confirm the spin-lock thing is to recompile with
> CONFIG_SPINLINE, and take a new profile, then diff the two ...
Yep...
Unfortunately, this has been broken on PPC64 since 2.6.9-rc2 or
thereabouts; I never had a chance to track down exactly what the
issue was. IIRC, there was a lot of churn in the spinlock code
around that time.
Sonny
end of thread, other threads:[~2005-07-05 15:37 UTC | newest]
Thread overview: 56+ messages
-- links below jump to the message on this page --
2005-06-27 6:29 [rfc] lockless pagecache Nick Piggin
2005-06-27 6:32 ` [patch 1] mm: PG_free flag Nick Piggin
2005-06-27 6:32 ` [patch 2] mm: speculative get_page Nick Piggin
2005-06-27 6:33 ` [patch 3] radix tree: lookup_slot Nick Piggin
2005-06-27 6:34 ` [patch 4] radix tree: lockless readside Nick Piggin
2005-06-27 6:34 ` [patch 5] mm: lockless pagecache lookups Nick Piggin
2005-06-27 6:35 ` [patch 6] mm: spinlock tree_lock Nick Piggin
2005-06-27 14:12 ` [patch 2] mm: speculative get_page William Lee Irwin III
2005-06-28 0:03 ` Nick Piggin
2005-06-28 0:56 ` Nick Piggin
2005-06-28 1:22 ` William Lee Irwin III
2005-06-28 1:42 ` Nick Piggin
2005-06-28 4:06 ` William Lee Irwin III
2005-06-28 4:50 ` Nick Piggin
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
2005-06-28 5:34 ` Nick Piggin
2005-06-28 14:19 ` William Lee Irwin III
2005-06-28 15:43 ` Nick Piggin
2005-06-28 17:01 ` Christoph Lameter
2005-06-28 23:10 ` Nick Piggin
2005-06-28 21:32 ` Jesse Barnes
2005-06-28 22:17 ` Christoph Lameter
2005-06-28 12:45 ` Andy Whitcroft
2005-06-28 13:16 ` Nick Piggin
2005-06-28 16:02 ` Dave Hansen
2005-06-29 16:31 ` Pavel Machek
2005-06-29 18:43 ` Dave Hansen
2005-06-29 21:22 ` Pavel Machek
2005-06-29 16:31 ` Pavel Machek
2005-06-27 6:43 ` VFS scalability (was: [rfc] lockless pagecache) Nick Piggin
2005-06-27 7:13 ` Andi Kleen
2005-06-27 7:33 ` VFS scalability Nick Piggin
2005-06-27 7:44 ` Andi Kleen
2005-06-27 8:03 ` Nick Piggin
2005-06-27 7:46 ` [rfc] lockless pagecache Andrew Morton
2005-06-27 8:02 ` Nick Piggin
2005-06-27 8:15 ` Andrew Morton
2005-06-27 8:28 ` Nick Piggin
2005-06-27 8:56 ` Lincoln Dale
2005-06-27 9:04 ` Nick Piggin
2005-06-27 18:14 ` Chen, Kenneth W
2005-06-27 18:50 ` Badari Pulavarty
2005-06-27 19:05 ` Chen, Kenneth W
2005-06-27 19:22 ` Christoph Lameter
2005-06-27 19:42 ` Chen, Kenneth W
2005-07-05 15:11 ` Sonny Rao
2005-07-05 15:31 ` Martin J. Bligh
2005-07-05 15:37 ` Sonny Rao
2005-06-27 13:17 ` Benjamin LaHaise
2005-06-28 0:32 ` Nick Piggin
2005-06-28 1:26 ` William Lee Irwin III
2005-06-27 14:08 ` Martin J. Bligh
2005-06-27 17:49 ` Christoph Lameter
2005-06-29 10:49 ` Hirokazu Takahashi
2005-06-29 11:38 ` Nick Piggin
2005-06-30 3:32 ` Hirokazu Takahashi