linux-mm.kvack.org archive mirror
* A lockless pagecache for Linux
@ 2006-03-10 15:18 Nick Piggin
  2006-03-10 15:18 ` [patch 1/3] radix tree: RCU lockless read-side Nick Piggin
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Nick Piggin @ 2006-03-10 15:18 UTC (permalink / raw)
  To: Linux Kernel, Linux Memory Management; +Cc: Nick Piggin

Hi,

I was waiting for 2.6.16 before releasing my patchset, but that got
boring.

ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/lockless/2.6.16-rc5/

Now I've used some clever subject lines on the subsequent patches
to make you think this isn't a big deal. Actually there are about
36 other "prep" patches before those, and PageReserved removal before
that (which are luckily now mostly in -mm or -linus, respectively).
What's more, there aren't 3 lockless pagecache patches, there are
5 -- but the last two are optimisations.

I'm writing some stuff about these patches, and I've uploaded a
**draft** chapter on the RCU radix-tree, 'radix-intro.pdf' in above
directory (note the bibliography didn't make it -- but thanks Paul
McKenney!)

If anyone would like to test or review it, I would be very happy.
Suggestions on the code or the document would be very welcome... but
I'm still hoping nobody spots a fundamental flaw until after OLS.

Rollup of prep patches (5 posted patches apply to the top of this):
2.6.16-rc5-git14-prep.patch.gz

Rollup of prep+lockless patches (includes the 5 posted patches):
2.6.16-rc5-git14-lockless.patch.gz

Note: anyone interested in benchmarking should, if possible, compare
the prep+lockless rollup against the prep rollup rather than against
mainline, because prep contains various other optimisations.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* [patch 1/3] radix tree: RCU lockless read-side
  2006-03-10 15:18 A lockless pagecache for Linux Nick Piggin
@ 2006-03-10 15:18 ` Nick Piggin
  2006-03-11  8:22   ` Balbir Singh
  2006-03-10 15:18 ` [patch 2/3] mm: speculative get_page Nick Piggin
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2006-03-10 15:18 UTC (permalink / raw)
  To: Linux Kernel, Linux Memory Management; +Cc: Nick Piggin

Make radix tree lookups safe to be performed without locks. Readers
are protected against nodes being deleted by using RCU based freeing.
Readers are protected against new node insertion by using memory
barriers to ensure the node itself will be properly written before it
is visible in the radix tree.

Each radix tree node keeps a record of its height (above the leaf
nodes). This height does not change after insertion -- when the radix
tree is extended, new nodes are only inserted at the top. So a
lookup can take the pointer to what is *now* the root node, and
traverse down from it even if the tree is concurrently extended and
this node becomes a subtree of a new root.

When a reader wants to traverse the next branch, it takes a copy of
the pointer. This pointer will be either NULL (the branch is empty)
or non-NULL (pointing to a valid node).

Also introduce a lockfree gang_lookup_slot which will be used by a
future patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -30,6 +30,7 @@
 #include <linux/gfp.h>
 #include <linux/string.h>
 #include <linux/bitops.h>
+#include <linux/rcupdate.h>
 
 
 #ifdef __KERNEL__
@@ -46,7 +47,9 @@
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
 struct radix_tree_node {
+	unsigned int	height;		/* Height from the bottom */
 	unsigned int	count;
+	struct rcu_head	rcu_head;
 	void		*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
 };
@@ -98,10 +101,17 @@ radix_tree_node_alloc(struct radix_tree_
 	return ret;
 }
 
+static void radix_tree_node_rcu_free(struct rcu_head *head)
+{
+	struct radix_tree_node *node =
+			container_of(head, struct radix_tree_node, rcu_head);
+	kmem_cache_free(radix_tree_node_cachep, node);
+}
+
 static inline void
 radix_tree_node_free(struct radix_tree_node *node)
 {
-	kmem_cache_free(radix_tree_node_cachep, node);
+	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
@@ -204,6 +214,7 @@ static int radix_tree_extend(struct radi
 	}
 
 	do {
+		unsigned int newheight;
 		if (!(node = radix_tree_node_alloc(root)))
 			return -ENOMEM;
 
@@ -216,9 +227,11 @@ static int radix_tree_extend(struct radi
 				tag_set(node, tag, 0);
 		}
 
+		newheight = root->height+1;
+		node->height = newheight;
 		node->count = 1;
-		root->rnode = node;
-		root->height++;
+		rcu_assign_pointer(root->rnode, node);
+		root->height = newheight;
 	} while (height > root->height);
 out:
 	return 0;
@@ -258,11 +271,12 @@ int radix_tree_insert(struct radix_tree_
 			/* Have to add a child node.  */
 			if (!(slot = radix_tree_node_alloc(root)))
 				return -ENOMEM;
+			slot->height = height;
 			if (node) {
-				node->slots[offset] = slot;
+				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
 			} else
-				root->rnode = slot;
+				rcu_assign_pointer(root->rnode, slot);
 		}
 
 		/* Go a level down */
@@ -278,7 +292,7 @@ int radix_tree_insert(struct radix_tree_
 
 	BUG_ON(!node);
 	node->count++;
-	node->slots[offset] = item;
+	rcu_assign_pointer(node->slots[offset], item);
 	BUG_ON(tag_get(node, 0, offset));
 	BUG_ON(tag_get(node, 1, offset));
 
@@ -290,25 +304,29 @@ static inline void **__lookup_slot(struc
 				   unsigned long index)
 {
 	unsigned int height, shift;
-	struct radix_tree_node **slot;
+	struct radix_tree_node *node, **slot;
 
-	height = root->height;
+	/* Must take a copy now because root->rnode may change */
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return NULL;
+
+	height = node->height;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = &root->rnode;
 
-	while (height > 0) {
-		if (*slot == NULL)
+	do {
+		slot = (struct radix_tree_node **)
+			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+		node = rcu_dereference(*slot);
+		if (node == NULL)
 			return NULL;
 
-		slot = (struct radix_tree_node **)
-			((*slot)->slots +
-				((index >> shift) & RADIX_TREE_MAP_MASK));
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
-	}
+	} while (height > 0);
 
 	return (void **)slot;
 }
@@ -339,7 +357,7 @@ void *radix_tree_lookup(struct radix_tre
 	void **slot;
 
 	slot = __lookup_slot(root, index);
-	return slot != NULL ? *slot : NULL;
+	return slot != NULL ? rcu_dereference(*slot) : NULL;
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
@@ -501,26 +519,27 @@ EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
 static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_root *root, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)
 {
 	unsigned int nr_found = 0;
 	unsigned int shift, height;
-	struct radix_tree_node *slot;
+	struct radix_tree_node *slot, *__s;
 	unsigned long i;
 
-	height = root->height;
-	if (height == 0)
+	slot = rcu_dereference(root->rnode);
+	if (!slot || slot->height == 0)
 		goto out;
 
+	height = slot->height;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = root->rnode;
 
 	for ( ; height > 1; height--) {
 
 		for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
 				i < RADIX_TREE_MAP_SIZE; i++) {
-			if (slot->slots[i] != NULL)
+			__s = rcu_dereference(slot->slots[i]);
+			if (__s != NULL)
 				break;
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
@@ -531,14 +550,14 @@ __lookup(struct radix_tree_root *root, v
 			goto out;
 
 		shift -= RADIX_TREE_MAP_SHIFT;
-		slot = slot->slots[i];
+		slot = __s;
 	}
 
 	/* Bottom level: grab some items */
 	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
 		index++;
 		if (slot->slots[i]) {
-			results[nr_found++] = slot->slots[i];
+			results[nr_found++] = &slot->slots[i];
 			if (nr_found == max_items)
 				goto out;
 		}
@@ -570,6 +589,43 @@ radix_tree_gang_lookup(struct radix_tree
 	unsigned int ret = 0;
 
 	while (ret < max_items) {
+		unsigned int nr_found, i;
+		unsigned long next_index;	/* Index of next search */
+
+		if (cur_index > max_index)
+			break;
+		nr_found = __lookup(root, (void ***)results + ret, cur_index,
+					max_items - ret, &next_index);
+		for (i = 0; i < nr_found; i++)
+			results[ret + i] = *(((void ***)results)[ret + i]);
+		ret += nr_found;
+		if (next_index == 0)
+			break;
+		cur_index = next_index;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ *	radix_tree_gang_lookup_slot - perform multiple lookup on a radix tree
+ *	@root:		radix tree root
+ *	@results:	where the results of the lookup are placed
+ *	@first_index:	start the lookup from this key
+ *	@max_items:	place up to this many items at *results
+ *
+ *	Same as radix_tree_gang_lookup, but returns an array of pointers
+ *	(slots) to the stored items instead of the items themselves.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items)
+{
+	const unsigned long max_index = radix_tree_maxindex(root->height);
+	unsigned long cur_index = first_index;
+	unsigned int ret = 0;
+
+	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
 
@@ -584,7 +640,8 @@ radix_tree_gang_lookup(struct radix_tree
 	}
 	return ret;
 }
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
 
 /*
  * FIXME: the two tag_get()s here should use find_next_bit() instead of
@@ -689,6 +746,11 @@ static inline void radix_tree_shrink(str
 			root->rnode->slots[0]) {
 		struct radix_tree_node *to_free = root->rnode;
 
+		/*
+		 * this doesn't need an rcu_assign_pointer, because
+		 * we aren't touching the object that to_free->slots[0]
+		 * points to.
+		 */
 		root->rnode = to_free->slots[0];
 		root->height--;
 		/* must only free zeroed nodes into the slab */
@@ -804,7 +866,7 @@ EXPORT_SYMBOL(radix_tree_delete);
 int radix_tree_tagged(struct radix_tree_root *root, int tag)
 {
   	struct radix_tree_node *rnode;
-  	rnode = root->rnode;
+  	rnode = rcu_dereference(root->rnode);
   	if (!rnode)
   		return 0;
 	return any_tag_set(rnode, tag);
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -52,6 +52,9 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 2/3] mm: speculative get_page
  2006-03-10 15:18 A lockless pagecache for Linux Nick Piggin
  2006-03-10 15:18 ` [patch 1/3] radix tree: RCU lockless read-side Nick Piggin
@ 2006-03-10 15:18 ` Nick Piggin
  2006-03-10 15:18 ` [patch 3/3] mm: lockless pagecache lookups Nick Piggin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-03-10 15:18 UTC (permalink / raw)
  To: Linux Kernel, Linux Memory Management; +Cc: Nick Piggin

If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.

This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).

Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -76,6 +76,9 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
+#define PG_nonewrefs		20	/* Block concurrent pagecache lookups
+					 * while testing refcount */
+
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
  * allowed.
@@ -346,6 +349,11 @@ extern void __mod_page_state_offset(unsi
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageNoNewRefs(page)	test_bit(PG_nonewrefs, &(page)->flags)
+#define SetPageNoNewRefs(page)	set_bit(PG_nonewrefs, &(page)->flags)
+#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
+#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -11,6 +11,8 @@
 #include <linux/compiler.h>
 #include <asm/uaccess.h>
 #include <linux/gfp.h>
+#include <linux/page-flags.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
 
 /*
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +53,91 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+static inline struct page *page_cache_get_speculative(struct page **pagep)
+{
+	struct page *page;
+
+	VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+	page = *pagep;
+	if (unlikely(!page))
+		return NULL;
+
+	VM_BUG_ON(!in_atomic());
+	/*
+	 * Preempt must be disabled here - we rely on rcu_read_lock doing
+	 * this for us.
+	 *
+	 * Pagecache won't be truncated from interrupt context, so if we have
+	 * found a page in the radix tree here, we have pinned its refcount by
+	 * disabling preempt, and hence no need for the "speculative get" that
+	 * SMP requires.
+	 */
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_inc(&page->_count);
+	VM_BUG_ON(page != *pagep);
+
+#else
+ again:
+	page = rcu_dereference(*pagep);
+	if (unlikely(!page))
+		return NULL;
+
+	if (unlikely(!get_page_unless_zero(page)))
+		goto again; /* page has been freed */
+
+	/*
+	 * Note that get_page_unless_zero provides a memory barrier.
+	 * This is needed to ensure PageNoNewRefs is evaluated after the
+	 * page refcount has been raised. See below comment.
+	 */
+
+	/*
+	 * PageNoNewRefs is set in order to prevent new references to the
+	 * page (eg. before it gets removed from pagecache). Wait until it
+	 * becomes clear (and checks below will ensure we still have the
+	 * correct one).
+	 */
+	while (unlikely(PageNoNewRefs(page)))
+		cpu_relax();
+
+	/*
+	 * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
+	 * is performed before the load of *pagep in the below comparison.
+	 *
+	 * Those places that set PageNoNewRefs have the following pattern:
+	 * 	SetPageNoNewRefs(page)
+	 * 	wmb();
+	 * 	if (page_count(page) == X)
+	 * 		remove page from pagecache
+	 * 	wmb();
+	 * 	ClearPageNoNewRefs(page)
+	 *
+	 * So PageNoNewRefs() becomes clear _after_ we've elevated page
+	 * refcount, then either the page will be safely pinned in pagecache,
+	 * or it will have been already removed. In the latter case, *pagep
+	 * will be changed in the below test - provided it is loaded after
+	 * testing PageNoNewRefs() (which is what the smp_rmb is for).
+	 *
+	 * If the load was out of order, *pagep might be loaded before the
+	 * page is removed from pagecache while PageNoNewRefs evaluated after
+	 * the ClearPageNoNewRefs().
+	 */
+	smp_rmb();
+
+	if (unlikely(page != *pagep)) {
+		/* page no longer at *pagep */
+		put_page(page);
+		goto again;
+	}
+
+#endif
+	VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+	return page;
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return alloc_pages(mapping_gfp_mask(x), 0);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -383,6 +383,7 @@ static int remove_mapping(struct address
 	if (!mapping)
 		return 0;		/* truncate got there first */
 
+	SetPageNoNewRefs(page);
 	write_lock_irq(&mapping->tree_lock);
 
 	/*
@@ -401,17 +402,20 @@ static int remove_mapping(struct address
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
-		__put_page(page);	/* The pagecache ref */
-		return 1;
+		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
-	__put_page(page);
+
+free_it:
+	__ClearPageNoNewRefs(page);
+	__put_page(page); /* The pagecache ref */
 	return 1;
 
 cannot_free:
 	write_unlock_irq(&mapping->tree_lock);
+	ClearPageNoNewRefs(page);
 	return 0;
 }
 
@@ -731,6 +735,7 @@ int migrate_page_remove_references(struc
 	if (page_mapcount(page))
 		return 1;
 
+	SetPageNoNewRefs(page);
 	write_lock_irq(&mapping->tree_lock);
 
 	radix_pointer = (struct page **)radix_tree_lookup_slot(
@@ -740,6 +745,7 @@ int migrate_page_remove_references(struc
 	if (!page_mapping(page) || page_count(page) != nr_refs ||
 			*radix_pointer != page) {
 		write_unlock_irq(&mapping->tree_lock);
+		ClearPageNoNewRefs(page);
 		return 1;
 	}
 
@@ -758,10 +764,14 @@ int migrate_page_remove_references(struc
 		SetPageSwapCache(newpage);
 		set_page_private(newpage, page_private(page));
 	}
+	SetPageNoNewRefs(newpage);
+
+	rcu_assign_pointer(*radix_pointer, newpage);
 
-	*radix_pointer = newpage;
-	__put_page(page);
 	write_unlock_irq(&mapping->tree_lock);
+	__put_page(page);
+	ClearPageNoNewRefs(page);
+	ClearPageNoNewRefs(newpage);
 
 	return 0;
 }
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -400,6 +400,7 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		SetPageNoNewRefs(page);
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
@@ -411,6 +412,7 @@ int add_to_page_cache(struct page *page,
 			pagecache_acct(1);
 		}
 		write_unlock_irq(&mapping->tree_lock);
+		ClearPageNoNewRefs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -77,6 +77,7 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		SetPageNoNewRefs(page);
 		write_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
@@ -89,6 +90,7 @@ static int __add_to_swap_cache(struct pa
 			pagecache_acct(1);
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
+		ClearPageNoNewRefs(page);
 		radix_tree_preload_end();
 	}
 	return error;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 3/3] mm: lockless pagecache lookups
  2006-03-10 15:18 A lockless pagecache for Linux Nick Piggin
  2006-03-10 15:18 ` [patch 1/3] radix tree: RCU lockless read-side Nick Piggin
  2006-03-10 15:18 ` [patch 2/3] mm: speculative get_page Nick Piggin
@ 2006-03-10 15:18 ` Nick Piggin
  2006-03-10 15:18 ` [patch 4/3] mm: lockless optimisations Nick Piggin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-03-10 15:18 UTC (permalink / raw)
  To: Linux Kernel, Linux Memory Management; +Cc: Nick Piggin

Use page_cache_get_speculative and lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock).

The only atomicity change this should introduce is the use of a
non-atomic pagevec lookup for truncate; however, whatever atomicity
guarantees there might have been were not relied upon anyway, because
the size of the pagevec is not guaranteed (eg. it might be 1).

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -417,7 +417,6 @@ int add_to_page_cache(struct page *page,
 	}
 	return error;
 }
-
 EXPORT_SYMBOL(add_to_page_cache);
 
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
@@ -518,21 +517,21 @@ void fastcall __lock_page(struct page *p
 EXPORT_SYMBOL(__lock_page);
 
 /*
- * a rather lightweight function, finding and getting a reference to a
- * hashed page atomically.
+ * find_get_page finds and gets a reference to a pagecache page.
  */
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page *find_get_page(struct address_space *mapping, unsigned long offset)
 {
-	struct page *page;
+	struct page **pagep;
+	struct page *page = NULL;
 
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
-	read_unlock_irq(&mapping->tree_lock);
+	rcu_read_lock();
+	pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
+									offset);
+	if (likely(pagep))
+		page = page_cache_get_speculative(pagep);
+	rcu_read_unlock();
 	return page;
 }
-
 EXPORT_SYMBOL(find_get_page);
 
 /*
@@ -540,17 +539,31 @@ EXPORT_SYMBOL(find_get_page);
  */
 struct page *find_trylock_page(struct address_space *mapping, unsigned long offset)
 {
-	struct page *page;
+	struct page **pagep;
+	struct page *page = NULL;
 
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page) {
-		if (unlikely(TestSetPageLocked(page)))
+	rcu_read_lock();
+	pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
+									offset);
+	if (pagep) {
+repeat:
+		page = page_cache_get_speculative(pagep);
+
+		if (unlikely(TestSetPageLocked(page))) {
+			page_cache_release(page);
 			page = NULL;
-		else
-			get_page(page);
+			goto out;
+		}
+
+		/* Has the page been truncated before being locked? */
+		if (unlikely(page != *pagep)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
 	}
-	read_unlock_irq(&mapping->tree_lock);
+out:
+	rcu_read_unlock();
 	return page;
 }
 
@@ -573,25 +586,17 @@ struct page *find_lock_page(struct addre
 	struct page *page;
 
 repeat:
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
+	page = find_get_page(mapping, offset);
 	if (page) {
-		page_cache_get(page);
-		if (TestSetPageLocked(page)) {
-			read_unlock_irq(&mapping->tree_lock);
-			__lock_page(page);
+		lock_page(page);
 
-			/* Has the page been truncated while we slept? */
-			if (unlikely(page->mapping != mapping)) {
-				unlock_page(page);
-				page_cache_release(page);
-				goto repeat;
-			}
-			goto out;
+		/* Has the page been truncated while we slept? */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
 		}
 	}
-	read_unlock_irq(&mapping->tree_lock);
-out:
 	return page;
 }
 
@@ -674,6 +679,32 @@ unsigned find_get_pages(struct address_s
 	return ret;
 }
 
+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+			    unsigned int nr_pages, struct page **pages)
+{
+	unsigned int i;
+	unsigned int nr_found;
+	unsigned int ret;
+
+	/*
+	 * We do some unsightly casting to use the array first for storing
+	 * pointers to the page pointers, and then for the pointers to
+	 * the pages themselves that the caller wants.
+	 */
+	rcu_read_lock();
+	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+				(void ***)pages, start, nr_pages);
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+		page = page_cache_get_speculative(((struct page ***)pages)[i]);
+		if (likely(page))
+			pages[ret++] = page;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 /*
  * Like find_get_pages, except we only return pages which are tagged with
  * `tag'.   We update *index to index the next page for the traversal.
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -275,27 +275,26 @@ __do_page_cache_readahead(struct address
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	read_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
 		
 		if (page_offset > end_index)
 			break;
 
+		/* Don't need mapping->tree_lock - lookup can be racy */
+		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
+		rcu_read_unlock();
 		if (page)
 			continue;
 
-		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -160,6 +160,8 @@ extern struct page * find_or_create_page
 				unsigned long index, gfp_t gfp_mask);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
Index: linux-2.6/include/linux/pagevec.h
===================================================================
--- linux-2.6.orig/include/linux/pagevec.h
+++ linux-2.6/include/linux/pagevec.h
@@ -28,6 +28,8 @@ void __pagevec_lru_add_active(struct pag
 void pagevec_strip(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+	struct address_space *mapping, pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
 		struct address_space *mapping, pgoff_t *index, int tag,
 		unsigned nr_pages);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -428,6 +428,20 @@ unsigned pagevec_lookup(struct pagevec *
 
 EXPORT_SYMBOL(pagevec_lookup);
 
+/**
+ * pagevec_lookup_nonatomic - non atomic pagevec_lookup
+ *
+ * This routine is non-atomic in that it may return pages no longer in pagecache.
+ */
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+		struct address_space *mapping, pgoff_t start, unsigned nr_pages)
+{
+	pvec->nr = find_get_pages_nonatomic(mapping, start,
+					nr_pages, pvec->pages);
+	return pagevec_count(pvec);
+}
+EXPORT_SYMBOL(pagevec_lookup_nonatomic);
+
 unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t *index, int tag, unsigned nr_pages)
 {
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -124,7 +124,7 @@ void truncate_inode_pages_range(struct a
 	pagevec_init(&pvec, 0);
 	next = start;
 	while (next <= end &&
-	       pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	       pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index = page->index;
@@ -163,7 +163,7 @@ void truncate_inode_pages_range(struct a
 	next = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		if (!pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
 			if (next == start)
 				break;
 			next = start;
@@ -227,7 +227,7 @@ unsigned long invalidate_mapping_pages(s
 
 	pagevec_init(&pvec, 0);
 	while (next <= end &&
-			pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -284,7 +284,7 @@ int invalidate_inode_pages2_range(struct
 	pagevec_init(&pvec, 0);
 	next = start;
 	while (next <= end && !ret && !wrapped &&
-		pagevec_lookup(&pvec, mapping, next,
+		pagevec_lookup_nonatomic(&pvec, mapping, next,
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -808,17 +808,15 @@ int test_set_page_writeback(struct page 
 EXPORT_SYMBOL(test_set_page_writeback);
 
 /*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
  * passed tag.
  */
 int mapping_tagged(struct address_space *mapping, int tag)
 {
-	unsigned long flags;
 	int ret;
-
-	read_lock_irqsave(&mapping->tree_lock, flags);
+	rcu_read_lock();
 	ret = radix_tree_tagged(&mapping->page_tree, tag);
-	read_unlock_irqrestore(&mapping->tree_lock, flags);
+	rcu_read_unlock();
 	return ret;
 }
 EXPORT_SYMBOL(mapping_tagged);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 4/3] mm: lockless optimisations
  2006-03-10 15:18 A lockless pagecache for Linux Nick Piggin
                   ` (2 preceding siblings ...)
  2006-03-10 15:18 ` [patch 3/3] mm: lockless pagecache lookups Nick Piggin
@ 2006-03-10 15:18 ` Nick Piggin
  2006-03-10 15:18 ` [patch 5/3] mm: spinlock tree_lock Nick Piggin
  2006-03-13 23:35 ` A lockless pagecache for Linux Christoph Lameter
  5 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-03-10 15:18 UTC (permalink / raw)
  To: Linux Kernel, Linux Memory Management; +Cc: Nick Piggin

add_to_page_cache only deals with newly allocated pages except in the
swap -> shm case. Take advantage of this to optimise add_to_page_cache,
and introduce a new __add_to_page_cache for use on pages other than
newly allocated ones.
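[Editorial sketch, not part of the patch: the ordering argument above can be modelled in plain userspace C. toy_page, toy_tree and friends are made-up stand-ins for the real structures, and step 2 runs under tree_lock in the kernel; this only illustrates the publish-then-unwind control flow.]

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TREE_SIZE 16

struct toy_page {
	int refcount;
	int locked;
	void *mapping;
	unsigned long index;
};

/* Stands in for the radix tree: one slot per offset. */
static struct toy_page *toy_tree[TOY_TREE_SIZE];

static int toy_tree_insert(unsigned long offset, struct toy_page *page)
{
	if (offset >= TOY_TREE_SIZE || toy_tree[offset])
		return -1;	/* stand-in for -EEXIST */
	toy_tree[offset] = page;
	return 0;
}

/* Mirrors the control flow of the patched add_to_page_cache(). */
static int toy_add_to_page_cache(struct toy_page *page, void *mapping,
				 unsigned long offset)
{
	int error;

	/* 1. Set up state first: new reference, locked, owner fields.
	 * Non-atomic ops suffice: nobody else can see the page yet. */
	page->refcount++;		/* page_cache_get() */
	page->locked = 1;		/* __SetPageLocked() */
	page->mapping = mapping;
	page->index = offset;

	/* 2. Only then make the page visible via the tree
	 * (done under tree_lock in the real code). */
	error = toy_tree_insert(offset, page);

	/* 3. On failure, unwind the speculative setup. */
	if (error) {
		page->mapping = NULL;
		page->refcount--;	/* __put_page() */
		page->locked = 0;	/* __ClearPageLocked() */
	}
	return error;
}
```

Because the page is fully initialised before it becomes findable, a lookup that wins the race already sees it locked with mapping and index set, which is why the Set/ClearPageNoNewRefs dance is not needed for newly allocated pages.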

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -400,6 +400,45 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		/*
+		 * Can get away with less atomic ops and without using
+		 * Set/ClearPageNoNewRefs if we order operations correctly.
+		 */
+		page_cache_get(page);
+		__SetPageLocked(page);
+		page->mapping = mapping;
+		page->index = offset;
+
+		write_lock_irq(&mapping->tree_lock);
+		error = radix_tree_insert(&mapping->page_tree, offset, page);
+		if (!error) {
+			mapping->nrpages++;
+			pagecache_acct(1);
+		}
+		write_unlock_irq(&mapping->tree_lock);
+		radix_tree_preload_end();
+
+		if (error) {
+			page->mapping = NULL;
+			__put_page(page);
+			__ClearPageLocked(page);
+		}
+	}
+	return error;
+}
+EXPORT_SYMBOL(add_to_page_cache);
+
+/*
+ * Same as add_to_page_cache, but works on pages that are already in
+ * swapcache and possibly visible to external lookups.
+ * (special case for move_from_swap_cache).
+ */
+int __add_to_page_cache(struct page *page, struct address_space *mapping,
+		pgoff_t offset, gfp_t gfp_mask)
+{
+	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+
+	if (error == 0) {
 		SetPageNoNewRefs(page);
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
@@ -417,7 +456,6 @@ int add_to_page_cache(struct page *page,
 	}
 	return error;
 }
-EXPORT_SYMBOL(add_to_page_cache);
 
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t offset, gfp_t gfp_mask)
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -234,7 +234,7 @@ int move_to_swap_cache(struct page *page
 int move_from_swap_cache(struct page *page, unsigned long index,
 		struct address_space *mapping)
 {
-	int err = add_to_page_cache(page, mapping, index, GFP_ATOMIC);
+	int err = __add_to_page_cache(page, mapping, index, GFP_ATOMIC);
 	if (!err) {
 		delete_from_swap_cache(page);
 		/* shift page from clean_pages to dirty_pages list */
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -183,6 +183,8 @@ extern int read_cache_pages(struct addre
 
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 				unsigned long index, gfp_t gfp_mask);
+int __add_to_page_cache(struct page *page, struct address_space *mapping,
+  				unsigned long index, gfp_t gfp_mask);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				unsigned long index, gfp_t gfp_mask);
 extern void remove_from_page_cache(struct page *page);
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -214,16 +214,13 @@ extern void __mod_page_state_offset(unsi
 /*
  * Manipulation of page state flags
  */
-#define PageLocked(page)		\
-		test_bit(PG_locked, &(page)->flags)
-#define SetPageLocked(page)		\
-		set_bit(PG_locked, &(page)->flags)
-#define TestSetPageLocked(page)		\
-		test_and_set_bit(PG_locked, &(page)->flags)
-#define ClearPageLocked(page)		\
-		clear_bit(PG_locked, &(page)->flags)
-#define TestClearPageLocked(page)	\
-		test_and_clear_bit(PG_locked, &(page)->flags)
+#define PageLocked(page)	test_bit(PG_locked, &(page)->flags)
+#define SetPageLocked(page)	set_bit(PG_locked, &(page)->flags)
+#define __SetPageLocked(page)	__set_bit(PG_locked, &(page)->flags)
+#define TestSetPageLocked(page)	test_and_set_bit(PG_locked, &(page)->flags)
+#define ClearPageLocked(page)	clear_bit(PG_locked, &(page)->flags)
+#define __ClearPageLocked(page)	__clear_bit(PG_locked, &(page)->flags)
+#define TestClearPageLocked(page) test_and_clear_bit(PG_locked, &(page)->flags)
 
 #define PageError(page)		test_bit(PG_error, &(page)->flags)
 #define SetPageError(page)	set_bit(PG_error, &(page)->flags)


* [patch 5/3] mm: spinlock tree_lock
  2006-03-10 15:18 A lockless pagecache for Linux Nick Piggin
                   ` (3 preceding siblings ...)
  2006-03-10 15:18 ` [patch 4/3] mm: lockless optimisations Nick Piggin
@ 2006-03-10 15:18 ` Nick Piggin
  2006-03-13 23:35 ` A lockless pagecache for Linux Christoph Lameter
  5 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-03-10 15:18 UTC (permalink / raw)
  To: Linux Kernel, Linux Memory Management; +Cc: Nick Piggin

With practically all the read locks gone from mapping->tree_lock,
convert the lock from an rwlock back to a spinlock.

The remaining lockers, including the few remaining read-side ones,
mainly deal with IO submission rather than the lookup fastpaths.

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -855,7 +855,7 @@ int __set_page_dirty_buffers(struct page
 	spin_unlock(&mapping->private_lock);
 
 	if (!TestSetPageDirty(page)) {
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		if (page->mapping) {	/* Race with truncate? */
 			if (mapping_cap_account_dirty(mapping))
 				__inc_page_state(nr_dirty);
@@ -863,7 +863,7 @@ int __set_page_dirty_buffers(struct page
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
 	
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -196,7 +196,7 @@ void inode_init_once(struct inode *inode
 	mutex_init(&inode->i_mutex);
 	init_rwsem(&inode->i_alloc_sem);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
-	rwlock_init(&inode->i_data.tree_lock);
+	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -372,7 +372,7 @@ struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
-	rwlock_t		tree_lock;	/* and rwlock protecting it */
+	spinlock_t		tree_lock;	/* and lock protecting it */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -110,7 +110,7 @@ generic_file_direct_IO(int rw, struct ki
 /*
  * Remove a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
- * is safe.  The caller must hold a write_lock on the mapping's tree_lock.
+ * is safe.  The caller must hold the mapping's tree_lock.
  */
 void __remove_from_page_cache(struct page *page)
 {
@@ -128,9 +128,9 @@ void remove_from_page_cache(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 }
 
 static int sync_page(void *word)
@@ -409,13 +409,13 @@ int add_to_page_cache(struct page *page,
 		page->mapping = mapping;
 		page->index = offset;
 
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			mapping->nrpages++;
 			pagecache_acct(1);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();
 
 		if (error) {
@@ -440,7 +440,7 @@ int __add_to_page_cache(struct page *pag
 
 	if (error == 0) {
 		SetPageNoNewRefs(page);
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			page_cache_get(page);
@@ -450,7 +450,7 @@ int __add_to_page_cache(struct page *pag
 			mapping->nrpages++;
 			pagecache_acct(1);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		ClearPageNoNewRefs(page);
 		radix_tree_preload_end();
 	}
@@ -708,12 +708,12 @@ unsigned find_get_pages(struct address_s
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
-	read_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
 
@@ -753,14 +753,14 @@ unsigned find_get_pages_tag(struct addre
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
-	read_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
 
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -37,7 +37,7 @@ static struct backing_dev_info swap_back
 
 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-	.tree_lock	= RW_LOCK_UNLOCKED,
+	.tree_lock	= SPIN_LOCK_UNLOCKED,
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
@@ -78,7 +78,7 @@ static int __add_to_swap_cache(struct pa
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
 		SetPageNoNewRefs(page);
-		write_lock_irq(&swapper_space.tree_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
@@ -89,7 +89,7 @@ static int __add_to_swap_cache(struct pa
 			total_swapcache_pages++;
 			pagecache_acct(1);
 		}
-		write_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 		ClearPageNoNewRefs(page);
 		radix_tree_preload_end();
 	}
@@ -202,9 +202,9 @@ void delete_from_swap_cache(struct page 
 
 	entry.val = page_private(page);
 
-	write_lock_irq(&swapper_space.tree_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	write_unlock_irq(&swapper_space.tree_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);
 
 	swap_free(entry);
 	page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -367,13 +367,13 @@ int remove_exclusive_swap_page(struct pa
 	/* Is the only swap cache user the cache itself? */
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
-		write_lock_irq(&swapper_space.tree_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		if (!PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		write_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 	}
 	spin_unlock(&swap_lock);
 
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -67,15 +67,15 @@ invalidate_complete_page(struct address_
 	if (PagePrivate(page) && !try_to_release_page(page, 0))
 		return 0;
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page)) {
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		return 0;
 	}
 
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -384,7 +384,7 @@ static int remove_mapping(struct address
 		return 0;		/* truncate got there first */
 
 	SetPageNoNewRefs(page);
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 
 	/*
 	 * The non-racy check for busy page.  It is critical to check
@@ -400,13 +400,13 @@ static int remove_mapping(struct address
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
 		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 
 free_it:
 	__ClearPageNoNewRefs(page);
@@ -414,7 +414,7 @@ free_it:
 	return 1;
 
 cannot_free:
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ClearPageNoNewRefs(page);
 	return 0;
 }
@@ -736,7 +736,7 @@ int migrate_page_remove_references(struc
 		return 1;
 
 	SetPageNoNewRefs(page);
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 
 	radix_pointer = (struct page **)radix_tree_lookup_slot(
 						&mapping->page_tree,
@@ -744,7 +744,7 @@ int migrate_page_remove_references(struc
 
 	if (!page_mapping(page) || page_count(page) != nr_refs ||
 			*radix_pointer != page) {
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		ClearPageNoNewRefs(page);
 		return 1;
 	}
@@ -768,7 +768,7 @@ int migrate_page_remove_references(struc
 
 	rcu_assign_pointer(*radix_pointer, newpage);
 
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	__put_page(page);
 	ClearPageNoNewRefs(page);
 	ClearPageNoNewRefs(newpage);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -628,7 +628,7 @@ int __set_page_dirty_nobuffers(struct pa
 		struct address_space *mapping2;
 
 		if (mapping) {
-			write_lock_irq(&mapping->tree_lock);
+			spin_lock_irq(&mapping->tree_lock);
 			mapping2 = page_mapping(page);
 			if (mapping2) { /* Race with truncate? */
 				BUG_ON(mapping2 != mapping);
@@ -637,7 +637,7 @@ int __set_page_dirty_nobuffers(struct pa
 				radix_tree_tag_set(&mapping->page_tree,
 					page_index(page), PAGECACHE_TAG_DIRTY);
 			}
-			write_unlock_irq(&mapping->tree_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			if (mapping->host) {
 				/* !PageAnon && !swapper_space */
 				__mark_inode_dirty(mapping->host,
@@ -709,21 +709,22 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int test_clear_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
-	unsigned long flags;
 
 	if (mapping) {
-		write_lock_irqsave(&mapping->tree_lock, flags);
-		if (TestClearPageDirty(page)) {
+		unsigned long flags;
+		int ret;
+
+		spin_lock_irqsave(&mapping->tree_lock, flags);
+		ret = TestClearPageDirty(page);
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			if (mapping_cap_account_dirty(mapping))
 				__dec_page_state(nr_dirty);
-			write_unlock_irqrestore(&mapping->tree_lock, flags);
-			return 1;
 		}
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
-		return 0;
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
@@ -762,33 +763,32 @@ EXPORT_SYMBOL(clear_page_dirty_for_io);
 int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
-	int ret;
 
 	if (mapping) {
 		unsigned long flags;
+		int ret;
 
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
 		if (ret)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
-	} else {
-		ret = TestClearPageWriteback(page);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		return ret;
 	}
-	return ret;
+	return TestClearPageWriteback(page);
 }
 
 int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
-	int ret;
 
 	if (mapping) {
 		unsigned long flags;
+		int ret;
 
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret)
 			radix_tree_tag_set(&mapping->page_tree,
@@ -798,11 +798,10 @@ int test_set_page_writeback(struct page 
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
-	} else {
-		ret = TestSetPageWriteback(page);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		return ret;
 	}
-	return ret;
+	return TestSetPageWriteback(page);
 
 }
 EXPORT_SYMBOL(test_set_page_writeback);
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -319,9 +319,9 @@ extern void flush_cache_page(struct vm_a
 extern void flush_dcache_page(struct page *);
 
 #define flush_dcache_mmap_lock(mapping) \
-	write_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->tree_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	write_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->tree_lock)
 
 #define flush_icache_user_range(vma,page,addr,len) \
 	flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -58,9 +58,9 @@ flush_user_icache_range(unsigned long st
 extern void flush_dcache_page(struct page *page);
 
 #define flush_dcache_mmap_lock(mapping) \
-	write_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->tree_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	write_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->tree_lock)
 
 #define flush_icache_page(vma,page)	do { flush_kernel_dcache_page(page_address(page)); flush_kernel_icache_page(page_address(page)); } while (0)
 
Index: linux-2.6/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux-2.6.orig/drivers/mtd/devices/block2mtd.c
+++ linux-2.6/drivers/mtd/devices/block2mtd.c
@@ -58,28 +58,27 @@ static void cache_readahead(struct addre
 
 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
-	read_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < PAGE_READAHEAD; i++) {
 		pagei = index + i;
 		if (pagei > end_index) {
 			INFO("Overrun end of disk in cache readahead\n");
 			break;
 		}
+		/* Don't need mapping->tree_lock - lookup can be racy */
+		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, pagei);
+		rcu_read_unlock();
 		if (page && (!i))
 			break;
 		if (page)
 			continue;
-		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = pagei;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
 	if (ret)
 		read_cache_pages(mapping, &page_pool, filler, NULL);
 }


* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-10 15:18 ` [patch 1/3] radix tree: RCU lockless read-side Nick Piggin
@ 2006-03-11  8:22   ` Balbir Singh
  2006-03-11  8:48     ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Balbir Singh @ 2006-03-11  8:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Memory Management

<snip>

>                 if (slot->slots[i]) {
> -                       results[nr_found++] = slot->slots[i];
> +                       results[nr_found++] = &slot->slots[i];
>                         if (nr_found == max_items)
>                                 goto out;
>                 }

A quick clarification - Shouldn't accesses to slot->slots[i] above be
protected using rcu_dereference()?

Warm Regards,
Balbir


* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-11  8:22   ` Balbir Singh
@ 2006-03-11  8:48     ` Nick Piggin
  2006-03-13  3:04       ` Balbir Singh
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2006-03-11  8:48 UTC (permalink / raw)
  To: Balbir Singh; +Cc: Nick Piggin, Linux Kernel, Linux Memory Management

Balbir Singh wrote:
> <snip>
> 
>>                if (slot->slots[i]) {
>>-                       results[nr_found++] = slot->slots[i];
>>+                       results[nr_found++] = &slot->slots[i];
>>                        if (nr_found == max_items)
>>                                goto out;
>>                }
> 
> 
> A quick clarification - Shouldn't accesses to slot->slots[i] above be
> protected using rcu_dereference()?
> 

I think we're safe here -- this is the _address_ of the pointer.
However, when dereferencing this address in _gang_lookup,
I think we do need rcu_dereference indeed.

Note that _gang_lookup_slot doesn't do this for us, however --
the caller must do that when dereferencing the pointer to the
item (eg. see page_cache_get_speculative in 2/3).
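[Editorial sketch: the caller-side pattern described here, modelled in userspace C11. memory_order_consume stands in for the kernel's rcu_dereference(), and toy_load_item is an illustrative name, not a kernel API - the gang lookup hands back the *address* of a slot, and the caller performs the ordered load of the item.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* The gang lookup returns slot addresses; loading the item out of a
 * slot is the caller's job.  kernel: item = rcu_dereference(*slot); */
static void *toy_load_item(_Atomic(void *) *slot)
{
	return atomic_load_explicit(slot, memory_order_consume);
}
```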

That said, I'm not 100% sure I have the rcu memory barriers in
the right places (well I'm sure I don't, given the _gang_lookup
bug you exposed!).

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.


* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-11  8:48     ` Nick Piggin
@ 2006-03-13  3:04       ` Balbir Singh
  2006-03-13  3:11         ` Nick Piggin
  2006-03-13  6:40         ` Nick Piggin
  0 siblings, 2 replies; 25+ messages in thread
From: Balbir Singh @ 2006-03-13  3:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, Linux Kernel, Linux Memory Management

On 3/11/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Balbir Singh wrote:
> > <snip>
> >
> >>                if (slot->slots[i]) {
> >>-                       results[nr_found++] = slot->slots[i];
> >>+                       results[nr_found++] = &slot->slots[i];
> >>                        if (nr_found == max_items)
> >>                                goto out;
> >>                }
> >
> >
> > A quick clarification - Shouldn't accesses to slot->slots[i] above be
> > protected using rcu_dereference()?
> >
>
> I think we're safe here -- this is the _address_ of the pointer.
> However, when dereferencing this address in _gang_lookup,
> I think we do need rcu_dereference indeed.
>

Yes, I saw the address operator, but we still dereference "slots" to get
the address.

> Note that _gang_lookup_slot doesn't do this for us, however --
> the caller must do that when dereferencing the pointer to the
> item (eg. see page_cache_get_speculative in 2/3).

Oh! I did not get that far. Will look at the rest of the series.

>
> That said, I'm not 100% sure I have the rcu memory barriers in
> the right places (well I'm sure I don't, given the _gang_lookup
> bug you exposed!).

Hmm... Let me look at rcu_torture module and see if I can figure it
out or read the documentation again.

>
> Thanks,
> Nick
>

Warm Regards,
Balbir


* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-13  3:04       ` Balbir Singh
@ 2006-03-13  3:11         ` Nick Piggin
  2006-03-13 15:24           ` Balbir Singh
  2006-03-13  6:40         ` Nick Piggin
  1 sibling, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2006-03-13  3:11 UTC (permalink / raw)
  To: Balbir Singh; +Cc: Nick Piggin, Linux Kernel, Linux Memory Management

Balbir Singh wrote:
> On 3/11/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Balbir Singh wrote:
>>
>>><snip>
>>>
>>>>               if (slot->slots[i]) {
>>>>-                       results[nr_found++] = slot->slots[i];
>>>>+                       results[nr_found++] = &slot->slots[i];
>>>>                       if (nr_found == max_items)
>>>>                               goto out;
>>>>               }
>>>
>>>
>>>A quick clarification - Shouldn't accesses to slot->slots[i] above be
>>>protected using rcu_dereference()?
>>>
>>
>>I think we're safe here -- this is the _address_ of the pointer.
>>However, when dereferencing this address in _gang_lookup,
>>I think we do need rcu_dereference indeed.
>>
> 
> 
> Yes, I saw the address operator, but we still derefence "slots" to get
> the address.
> 

But we should have already rcu_dereference()ed "slot", right
(in the loop above this one)? That means we are now able to
dereference it, and the data at the other end will be valid.

-- 
SUSE Labs, Novell Inc.


* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-13  3:04       ` Balbir Singh
  2006-03-13  3:11         ` Nick Piggin
@ 2006-03-13  6:40         ` Nick Piggin
  1 sibling, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-03-13  6:40 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Nick Piggin, Nick Piggin, Linux Kernel, Linux Memory Management

On Mon, Mar 13, 2006 at 08:34:53AM +0530, Balbir Singh wrote:
> On 3/11/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Balbir Singh wrote:
> > > <snip>
> > >
> > >>                if (slot->slots[i]) {
> > >>-                       results[nr_found++] = slot->slots[i];
> > >>+                       results[nr_found++] = &slot->slots[i];
> > >>                        if (nr_found == max_items)
> > >>                                goto out;
> > >>                }
> > >
> > >
> > > A quick clarification - Shouldn't accesses to slot->slots[i] above be
> > > protected using rcu_dereference()?
> > >
> >
> > I think we're safe here -- this is the _address_ of the pointer.
> > However, when dereferencing this address in _gang_lookup,
> > I think we do need rcu_dereference indeed.
> >
> 
> Yes, I saw the address operator, but we still derefence "slots" to get
> the address.
> 

OK, I reread what you wrote and I guess I misunderstood you earlier.
slot->slots[i] does load the pointer at the ith entry of slots, but
&slot->slots[i] does not; it returns the same thing as slot->slots + i,
which only dereferences 'slot' (which we've established to be safe).

Even if &slot->slots[i] did, for some silly compiler, dereference the
pointer, we never actually see it or use it so it should be harmless.
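[Editorial sketch: the point is easy to check in plain C. toy_node is a hypothetical stand-in for the radix tree node; taking the address of an entry is pure pointer arithmetic and never loads the entry's contents.]

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAP_SIZE 64

struct toy_node {
	void *slots[TOY_MAP_SIZE];
};

/* Computes the address of entry i.  Equivalent to slot->slots + i:
 * only 'slot' itself is dereferenced; the entry is never read. */
static void **toy_slot_address(struct toy_node *slot, int i)
{
	return &slot->slots[i];
}
```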

Nick


* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-13  3:11         ` Nick Piggin
@ 2006-03-13 15:24           ` Balbir Singh
  2006-03-13 22:37             ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Balbir Singh @ 2006-03-13 15:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, Linux Kernel, Linux Memory Management

<snip>

>
> But we should have already rcu_dereference()ed "slot", right
> (in the loop above this one)? That means we are now able to
> dereference it, and the data at the other end will be valid.
>

Yes, but my confusion is about the following piece of code

<begin code>

       for ( ; height > 1; height--) {

               for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
                               i < RADIX_TREE_MAP_SIZE; i++) {
-                       if (slot->slots[i] != NULL)
+                       __s = rcu_dereference(slot->slots[i]);
+                       if (__s != NULL)
                               break;
                       index &= ~((1UL << shift) - 1);
                       index += 1UL << shift;
@@ -531,14 +550,14 @@ __lookup(struct radix_tree_root *root, v
                       goto out;

               shift -= RADIX_TREE_MAP_SHIFT;
-               slot = slot->slots[i];
+               slot = __s;
       }

       /* Bottom level: grab some items */
       for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
               index++;
               if (slot->slots[i]) {
-                       results[nr_found++] = slot->slots[i];
+                       results[nr_found++] = &slot->slots[i];
                       if (nr_found == max_items)
                               goto out;
               }
<end code>

In the for loop, let's say __s is *not* NULL, so we break from the loop.
In the loop below, slot->slots[i] is dereferenced without
rcu_dereference() and __s is not used. Is that not inconsistent?


* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-13 15:24           ` Balbir Singh
@ 2006-03-13 22:37             ` Nick Piggin
  2006-03-14  3:32               ` Balbir Singh
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2006-03-13 22:37 UTC (permalink / raw)
  To: Balbir Singh; +Cc: Nick Piggin, Linux Kernel, Linux Memory Management

Balbir Singh wrote:

><snip>
>
>>But we should have already rcu_dereference()ed "slot", right
>>(in the loop above this one)? That means we are now able to
>>dereference it, and the data at the other end will be valid.
>>
>>
>
>Yes, but my confusion is about the following piece of code
>
><begin code>
>
>       for ( ; height > 1; height--) {
>
>               for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
>                               i < RADIX_TREE_MAP_SIZE; i++) {
>-                       if (slot->slots[i] != NULL)
>+                       __s = rcu_dereference(slot->slots[i]);
>+                       if (__s != NULL)
>                               break;
>                       index &= ~((1UL << shift) - 1);
>                       index += 1UL << shift;
>@@ -531,14 +550,14 @@ __lookup(struct radix_tree_root *root, v
>                       goto out;
>
>               shift -= RADIX_TREE_MAP_SHIFT;
>-               slot = slot->slots[i];
>+               slot = __s;
>       }
>
>       /* Bottom level: grab some items */
>       for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
>               index++;
>               if (slot->slots[i]) {
>-                       results[nr_found++] = slot->slots[i];
>+                       results[nr_found++] = &slot->slots[i];
>                       if (nr_found == max_items)
>                               goto out;
>               }
><end code>
>
>In the for loop, let's say __s is *not* NULL and we break from the loop.
>In the loop below,
>slot->slots[i] is dereferenced without rcu and __s is not used. Is that not
>inconsistent?
>
>

The "slots" member is an array, not an RCU assigned pointer. As such, after
doing rcu_dereference(slot), you can access slot->slots[i] without further
memory barriers I think?

But I agree that code now is a bit inconsistent. I've cleaned things up a
bit in my tree now... but perhaps it is easier if you send a patch to show
what you mean (because sometimes I'm a bit dense, I'm afraid).

Thanks,
Nick

--

Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: A lockless pagecache for Linux
  2006-03-10 15:18 A lockless pagecache for Linux Nick Piggin
                   ` (4 preceding siblings ...)
  2006-03-10 15:18 ` [patch 5/3] mm: spinlock tree_lock Nick Piggin
@ 2006-03-13 23:35 ` Christoph Lameter
  2006-03-14  4:14   ` Nick Piggin
  5 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2006-03-13 23:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Memory Management

On Fri, 10 Mar 2006, Nick Piggin wrote:

> I'm writing some stuff about these patches, and I've uploaded a
> **draft** chapter on the RCU radix-tree, 'radix-intro.pdf' in above
> directory (note the bibliography didn't make it -- but thanks Paul
> McKenney!)

Ah thanks. I had a look at it. Note that the problem with the radix tree 
tags is that these are inherited from the lower layer. How is the 
consistency of these guaranteed? Also you may want to add a more elaborate 
intro and conclusion. Typically these summarize other sections of the 
paper.

What you are proposing is to allow lockless read operations right? No 
lockless write? The concurrency issue that we currently have is multiple 
processes faulting in pages in different ranges from the same file. I 
think this is a rather typical usage scenario. Faulting in a page from a 
file for reading requires a write operation on the radix tree. The 
approach with a lockless read path does not help us. This proposed scheme 
would only help if pages are already faulted in and another process starts
using the same pages as an earlier process.

Would it not be better to handle the radix tree in the same way as a page 
table? Have a lock at the lowest layer so that different sections of the 
radix tree can be locked by different processes? That would enable 
concurrent writes.


* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-13 22:37             ` Nick Piggin
@ 2006-03-14  3:32               ` Balbir Singh
  2006-03-14  5:16                 ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Balbir Singh @ 2006-03-14  3:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, Linux Kernel, Linux Memory Management

> The "slots" member is an array, not an RCU assigned pointer. As such, after
> doing rcu_dereference(slot), you can access slot->slots[i] without further
> memory barriers I think?
>
> But I agree that code now is a bit inconsistent. I've cleaned things up a
> bit in my tree now... but perhaps it is easier if you send a patch to show
> what you mean (because sometimes I'm a bit dense, I'm afraid).
>

For starters, I do not think you're dense at all.

Hmm... slot/slots are quite confusing names. I was referring to slot and
ended up calling it slots. The point I am contending is that
rcu_dereference(slot->slots[i]) should happen.

<snippet>
+                       __s = rcu_dereference(slot->slots[i]);
+                       if (__s != NULL)
                              break;
</snippet>

If we break from the loop because __s != NULL, then in the snippet below

<snippet>
       /* Bottom level: grab some items */
      for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
              index++;
              if (slot->slots[i]) {
-                       results[nr_found++] = slot->slots[i];
+                       results[nr_found++] = &slot->slots[i];
                      if (nr_found == max_items)
                              goto out;
              }
</snippet>

We do not use __s above. "slot->slots[i]" is not rcu_dereferenced() in
this case because we broke out of the loop above with __s being not
NULL. Another issue is - is it good enough to rcu_dereference() slot
once? Shouldn't all uses of *slot->* be rcu_dereferenced()?

<suggestion (WARNING: patch has spaces and it's not compiled)>
       /* Bottom level: grab some items */
      for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
              index++;
-              if (slot->slots[i]) {
-                       results[nr_found++] = slot->slots[i];
+             __s = rcu_dereference(slot->slots[i]);
+             if (__s) {
+                    /* This is tricky, cannot take the address of __s
or rcu_dereference() */
+                    results[nr_found++] = &slot->slots[i];
                      if (nr_found == max_items)
                              goto out;
              }
</suggestion>

I hope I am making sense.

Warm Regards,
Balbir


* Re: A lockless pagecache for Linux
  2006-03-13 23:35 ` A lockless pagecache for Linux Christoph Lameter
@ 2006-03-14  4:14   ` Nick Piggin
  2006-03-14 12:59     ` Wu Fengguang
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2006-03-14  4:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, Linux Kernel, Linux Memory Management

Christoph Lameter wrote:
> On Fri, 10 Mar 2006, Nick Piggin wrote:
> 
> 
>>I'm writing some stuff about these patches, and I've uploaded a
>>**draft** chapter on the RCU radix-tree, 'radix-intro.pdf' in above
>>directory (note the bibliography didn't make it -- but thanks Paul
>>McKenney!)
> 
> 
> Ah thanks. I had a look at it. Note that the problem with the radix tree 
> tags is that these are inherited from the lower layer. How is the 
> consistency of these guaranteed? Also you may want to add a more elaborate 
> intro and conclusion. Typically these summarize other sections of the 
> paper.
> 

Thanks for looking at it. Yeah in the intro I say that I'm considering
a simplified radix-tree (without tags or gang lookups) to start with.
At the end I say how tags are handled... it isn't quite clear enough
for my liking yet though.

Intro and conclusion - yes they should be better. It _is_ a chapter from
a larger document, however I want it to still stand alone as a good
document.

What happens is: read-side tag operations (ie. tag lookups etc) are done
under lock. Ie. they are not made lockless.

> What you are proposing is to allow lockless read operations right? No 
> lockless write? The concurrency issue that we currently have is multiple 
> processes faulting in pages in different ranges from the same file. I 
> think this is a rather typical usage scenario. Faulting in a page from a 
> file for reading requires a write operation on the radix tree. The 
> approach with a lockless read path does not help us. This proposed scheme 
> would only help if pages are already faulted in and another process starts
> using the same pages as an earlier process.
> 

Yep, lockless reads only to start with. I think you'll see some benefit
because the read(2) and ->nopage paths also take read-side locks, so your
write side will no longer have to contend with them.

It won't be a huge improvement in scalability though, maybe just a constant
factor.

> Would it not be better to handle the radix tree in the same way as a page 
> table? Have a lock at the lowest layer so that different sections of the 
> radix tree can be locked by different processes? That would enable 
> concurrent writes.
> 

Yeah this is the next step. Note that it is not the first step because I
actually want to _speed up_ single threaded lookup paths, rather than
slowing them down, otherwise it will never get accepted.

It also might add quite a large amount of complexity to the radix tree, so
it may no longer be suitable for a generic data structure anymore (depends
how it is implemented). But the write side should be easier than the
read-side so I don't think there is too much to worry about. I already have
some bits and pieces to make it fine-grained.

-- 
SUSE Labs, Novell Inc.

* Re: [patch 1/3] radix tree: RCU lockless read-side
  2006-03-14  3:32               ` Balbir Singh
@ 2006-03-14  5:16                 ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-03-14  5:16 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Nick Piggin, Linux Kernel, Linux Memory Management, Paul McKenney

Balbir Singh wrote:
>>The "slots" member is an array, not an RCU assigned pointer. As such, after
>>doing rcu_dereference(slot), you can access slot->slots[i] without further
>>memory barriers I think?
>>
>>But I agree that code now is a bit inconsistent. I've cleaned things up a
>>bit in my tree now... but perhaps it is easier if you send a patch to show
>>what you mean (because sometimes I'm a bit dense, I'm afraid).
>>
> 
> 
> For starters, I do not think you're dense at all.
> 

I'm glad you think so ;)

> Hmm... slot/slots are quite confusing names. I was referring to slot and
> ended up calling it slots. The point I am contending is that
> rcu_dereference(slot->slots[i]) should happen.
> 
> <snippet>
> +                       __s = rcu_dereference(slot->slots[i]);
> +                       if (__s != NULL)
>                               break;
> </snippet>
> 
> If we break from the loop because __s != NULL, then in the snippet below
> 

This is a little confusing, because technically I don't think it needs to
rcu_dereference here either, only when we actually want to dereference
__s and read the data at the other end.

rcu_dereference is perhaps an awkward interface for optimising this case.
#define rcu_make_pointer_safe_to_follow(ptr) smp_read_barrier_depends()
or
#define rcu_follow_pointer(ptr) ({ smp_read_barrier_depends(); *ptr; })
might be better

     __s = slot->slots[i];
     if (__s != NULL) {
         rcu_make_pointer_safe_to_follow(__s);
         ...

Would allow the barrier to be skipped if we're not going to follow the
pointer.

> <snippet>
>        /* Bottom level: grab some items */
>       for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
>               index++;
>               if (slot->slots[i]) {
> -                       results[nr_found++] = slot->slots[i];
> +                       results[nr_found++] = &slot->slots[i];
>                       if (nr_found == max_items)
>                               goto out;
>               }
> </snippet>
> 
> We do not use __s above. "slot->slots[i]" is not rcu_dereferenced() in
> this case because we broke out of the loop above with __s being not
> NULL. Another issue is - is it good enough to rcu_dereference() slot
> once? Shouldn't all uses of *slot->* be rcu_dereferenced()?
> 

Yes, slot is a local pointer, the address it points to will not change,
so once we have loaded it and issued the read barrier, we can follow it
from then on and reads will not find stale values.

> <suggestion (WARNING: patch has spaces and it's not compiled)>
>        /* Bottom level: grab some items */
>       for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
>               index++;
> -              if (slot->slots[i]) {
> -                       results[nr_found++] = slot->slots[i];
> +             __s = rcu_dereference(slot->slots[i]);
> +             if (__s) {
> +                    /* This is tricky, cannot take the address of __s
> or rcu_dereference() */
> +                    results[nr_found++] = &slot->slots[i];
>                       if (nr_found == max_items)
>                               goto out;
>               }
> </suggestion>
> 
> I hope I am making sense.
> 

In this case I think we do not need a barrier because we are only looking
at the pointer (whether it is NULL or not), rather than following it down.
Paul may be able to jump in at this point.

I'll release a new patchset in the next couple of days and try to be more
consistent with the barriers, which will hopefully make things less confusing.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

* Re: A lockless pagecache for Linux
  2006-03-14  4:14   ` Nick Piggin
@ 2006-03-14 12:59     ` Wu Fengguang
  0 siblings, 0 replies; 25+ messages in thread
From: Wu Fengguang @ 2006-03-14 12:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Nick Piggin, Linux Kernel, Linux Memory Management

On Tue, Mar 14, 2006 at 03:14:38PM +1100, Nick Piggin wrote:
> Christoph Lameter wrote:
> >What you are proposing is to allow lockless read operations right? No 
> >lockless write? The concurrency issue that we currently have is multiple 
> >processes faulting in pages in different ranges from the same file. I 
> >think this is a rather typical usage scenario. Faulting in a page from a 
> >file for reading requires a write operation on the radix tree. The 
> >approach with a lockless read path does not help us. This proposed scheme 
> >would only help if pages are already faulted in and another process starts
> >using the same pages as an earlier process.
> >
> 
> Yep, lockless reads only to start with. I think you'll see some benefit
> because the read(2) and ->nopage paths also take read-side locks, so your
> write side will no longer have to contend with them.
> 
> It won't be a huge improvement in scalability though, maybe just a constant
> factor.
> 
> >Would it not be better to handle the radix tree in the same way as a page 
> >table? Have a lock at the lowest layer so that different sections of the 
> >radix tree can be locked by different processes? That would enable 
> >concurrent writes.
> >
> 
> Yeah this is the next step. Note that it is not the first step because I
> actually want to _speed up_ single threaded lookup paths, rather than
> slowing them down, otherwise it will never get accepted.
> 
> It also might add quite a large amount of complexity to the radix tree, so
> it may no longer be suitable for a generic data structure anymore (depends
> how it is implemented). But the write side should be easier than the
> read-side so I don't think there is too much to worry about. I already have
> some bits and pieces to make it fine-grained.

Maybe we can try another way to reduce the concurrent radix tree
writers problem: coordinating and serializing writers at high level.

Since kswapd is the major radix tree deleter, and readahead is the
major radix tree inserter, putting parts of them together in a loop
might reduce the contention noticeably.

The following pseudo-code shows the basic idea:
(Integrating kprefetchd is also possible; omitted just for simplicity.)

PER_NODE(ra_queue);

kswapd()
{
        loop {
                loop {
                        free enough pages for top(ra_queue)
                }
                submit(pop(ra_queue));
                wait();
        }
}

readahead()
{
        assemble one ra_req

        if (ra_req immediately needed)
                submit(ra_req);
        else {
                push(ra_queue, ra_req);
                wakeup_kswapd();
        }
}

This scheme might reduce
        - direct page reclaim pressure
        - radix tree write lock contention
        - lru lock contention

Regards,
Wu


* Re: [patch 2/3] mm: speculative get_page
  2006-04-04 15:21   ` Christoph Lameter
@ 2006-04-05  0:27     ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-04-05  0:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Andrew Morton, Linux Kernel, Linux Memory Management

Christoph Lameter wrote:
> On Tue, 4 Apr 2006, Nick Piggin wrote:
> 
> 
>>+	/*
>>+	 * PageNoNewRefs is set in order to prevent new references to the
>>+	 * page (eg. before it gets removed from pagecache). Wait until it
>>+	 * becomes clear (and checks below will ensure we still have the
>>+	 * correct one).
>>+	 */
>>+	while (unlikely(PageNoNewRefs(page)))
>>+		cpu_relax();
> 
> 
> That part looks suspiciously like we need some sort of lock here.
> 

It's very light-weight now. A lock of course would only be page local,
so it wouldn't really harm scalability, however it would slow down the
single threaded case. At the moment, single threaded performance of
find_get_page is anywhere from about 15-100% faster than before the
lockless patches.

I don't see why you think there needs to be a lock? Before the write
side clears PageNoNewRefs, they will have moved 'page' out of pagecache,
so when this loop breaks, the subsequent test will fail and this
function will be repeated.

-- 
SUSE Labs, Novell Inc.

* Re: [patch 2/3] mm: speculative get_page
  2006-04-04 15:20   ` Christoph Lameter
@ 2006-04-05  0:22     ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-04-05  0:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Andrew Morton, Linux Kernel, Linux Memory Management

Christoph Lameter wrote:
> Looks like the NoNewRefs flag is mostly == 
> spin_is_locked(mapping->tree_lock)? Would it not be better to check the 
> tree_lock?
> 

Well there are other uses for the tree_lock (eg. tag operations)
which do not need the "no new references" guarantee.

> 
> 
>>--- linux-2.6.orig/mm/migrate.c
>>+++ linux-2.6/mm/migrate.c
>> 
>>+	SetPageNoNewRefs(page);
>> 	write_lock_irq(&mapping->tree_lock);
> 
> 
> A dream come true! If this is really working as it sounds then we can 
> move the SetPageNoNewRefs up and avoid the final check under 
> mapping->tree_lock. Then keep SetPageNoNewRefs until the page has been 
> copied. It would basically play the same role as locking the page.
> 

Yes we could do that, but at this stage I wouldn't like to separate
SetPageNoNewRefs from tree_lock, as it is replacing a traditional
guarantee that tree_lock no longer provides.

-- 
SUSE Labs, Novell Inc.

* Re: [patch 2/3] mm: speculative get_page
  2006-04-04  9:32 ` [patch 2/3] mm: speculative get_page Nick Piggin
  2006-04-04  9:47   ` Andrew Morton
  2006-04-04 15:20   ` Christoph Lameter
@ 2006-04-04 15:21   ` Christoph Lameter
  2006-04-05  0:27     ` Nick Piggin
  2 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2006-04-04 15:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Linux Kernel, Linux Memory Management

On Tue, 4 Apr 2006, Nick Piggin wrote:

> +	/*
> +	 * PageNoNewRefs is set in order to prevent new references to the
> +	 * page (eg. before it gets removed from pagecache). Wait until it
> +	 * becomes clear (and checks below will ensure we still have the
> +	 * correct one).
> +	 */
> +	while (unlikely(PageNoNewRefs(page)))
> +		cpu_relax();

That part looks suspiciously like we need some sort of lock here.


* Re: [patch 2/3] mm: speculative get_page
  2006-04-04  9:32 ` [patch 2/3] mm: speculative get_page Nick Piggin
  2006-04-04  9:47   ` Andrew Morton
@ 2006-04-04 15:20   ` Christoph Lameter
  2006-04-05  0:22     ` Nick Piggin
  2006-04-04 15:21   ` Christoph Lameter
  2 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2006-04-04 15:20 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Linux Kernel, Linux Memory Management

Looks like the NoNewRefs flag is mostly == 
spin_is_locked(mapping->tree_lock)? Would it not be better to check the 
tree_lock?


> --- linux-2.6.orig/mm/migrate.c
> +++ linux-2.6/mm/migrate.c
>  
> +	SetPageNoNewRefs(page);
>  	write_lock_irq(&mapping->tree_lock);

A dream come true! If this is really working as it sounds then we can 
move the SetPageNoNewRefs up and avoid the final check under 
mapping->tree_lock. Then keep SetPageNoNewRefs until the page has been 
copied. It would basically play the same role as locking the page.


* Re: [patch 2/3] mm: speculative get_page
  2006-04-04  9:47   ` Andrew Morton
@ 2006-04-04 10:21     ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2006-04-04 10:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, linux-mm

On Tue, Apr 04, 2006 at 02:47:15AM -0700, Andrew Morton wrote:
> Nick Piggin <npiggin@suse.de> wrote:
> >
> > +static inline struct page *page_cache_get_speculative(struct page **pagep)
> 
> Seems rather large to inline.
> 

Possibly... with all the debugging turned off, it is only atomic_inc
on UP, and atomic_inc_not_zero + several branches and barriers on SMP.

With only two callsites, I figure it is probably OK to be inline. It
probably looks bigger than it is...

> >  +{
> >  +	struct page *page;
> >  +
> >  +	VM_BUG_ON(in_interrupt());
> >  +
> >  +#ifndef CONFIG_SMP
> >  +	page = *pagep;
> >  +	if (unlikely(!page))
> >  +		return NULL;
> >  +
> >  +	VM_BUG_ON(!in_atomic());
> 
> This will go blam if !CONFIG_PREEMPT.

Hmm yes. Is there a safe way to do that? I guess it is pretty trivially
safe under rcu_read_lock, so that can probably just be removed.


* Re: [patch 2/3] mm: speculative get_page
  2006-04-04  9:32 ` [patch 2/3] mm: speculative get_page Nick Piggin
@ 2006-04-04  9:47   ` Andrew Morton
  2006-04-04 10:21     ` Nick Piggin
  2006-04-04 15:20   ` Christoph Lameter
  2006-04-04 15:21   ` Christoph Lameter
  2 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-04-04  9:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-mm

Nick Piggin <npiggin@suse.de> wrote:
>
> +static inline struct page *page_cache_get_speculative(struct page **pagep)

Seems rather large to inline.

>  +{
>  +	struct page *page;
>  +
>  +	VM_BUG_ON(in_interrupt());
>  +
>  +#ifndef CONFIG_SMP
>  +	page = *pagep;
>  +	if (unlikely(!page))
>  +		return NULL;
>  +
>  +	VM_BUG_ON(!in_atomic());

This will go blam if !CONFIG_PREEMPT.


* [patch 2/3] mm: speculative get_page
  2006-04-04  9:31 [patch 0/3] lockless pagecache Nick Piggin
@ 2006-04-04  9:32 ` Nick Piggin
  2006-04-04  9:47   ` Andrew Morton
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Nick Piggin @ 2006-04-04  9:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Nick Piggin, Linux Memory Management

If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.

This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).

Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -76,6 +76,9 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
+#define PG_nonewrefs		20	/* Block concurrent pagecache lookups
+					 * while testing refcount */
+
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
  * allowed.
@@ -346,6 +349,11 @@ extern void __mod_page_state_offset(unsi
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageNoNewRefs(page)	test_bit(PG_nonewrefs, &(page)->flags)
+#define SetPageNoNewRefs(page)	set_bit(PG_nonewrefs, &(page)->flags)
+#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
+#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -11,6 +11,8 @@
 #include <linux/compiler.h>
 #include <asm/uaccess.h>
 #include <linux/gfp.h>
+#include <linux/page-flags.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
 
 /*
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +53,91 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+static inline struct page *page_cache_get_speculative(struct page **pagep)
+{
+	struct page *page;
+
+	VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+	page = *pagep;
+	if (unlikely(!page))
+		return NULL;
+
+	VM_BUG_ON(!in_atomic());
+	/*
+	 * Preempt must be disabled here - we rely on rcu_read_lock doing
+	 * this for us.
+	 *
+	 * Pagecache won't be truncated from interrupt context, so if we have
+	 * found a page in the radix tree here, we have pinned its refcount by
+	 * disabling preempt, and hence no need for the "speculative get" that
+	 * SMP requires.
+	 */
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_inc(&page->_count);
+	VM_BUG_ON(page != *pagep);
+
+#else
+ again:
+	page = rcu_dereference(*pagep);
+	if (unlikely(!page))
+		return NULL;
+
+	if (unlikely(!get_page_unless_zero(page)))
+		goto again; /* page has been freed */
+
+	/*
+	 * Note that get_page_unless_zero provides a memory barrier.
+	 * This is needed to ensure PageNoNewRefs is evaluated after the
+	 * page refcount has been raised. See below comment.
+	 */
+
+	/*
+	 * PageNoNewRefs is set in order to prevent new references to the
+	 * page (eg. before it gets removed from pagecache). Wait until it
+	 * becomes clear (and checks below will ensure we still have the
+	 * correct one).
+	 */
+	while (unlikely(PageNoNewRefs(page)))
+		cpu_relax();
+
+	/*
+	 * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
+	 * is performed before the load of *pagep in the below comparison.
+	 *
+	 * Those places that set PageNoNewRefs have the following pattern:
+	 * 	SetPageNoNewRefs(page)
+	 * 	wmb();
+	 * 	if (page_count(page) == X)
+	 * 		remove page from pagecache
+	 * 	wmb();
+	 * 	ClearPageNoNewRefs(page)
+	 *
+	 * So PageNoNewRefs() becomes clear _after_ we've elevated page
+	 * refcount, then either the page will be safely pinned in pagecache,
+	 * or it will have been already removed. In the latter case, *pagep
+	 * will be changed in the below test - provided it is loaded after
+	 * testing PageNoNewRefs() (which is what the smp_rmb is for).
+	 *
+	 * If the loads were reordered, *pagep might be loaded before the
+	 * page is removed from pagecache, while PageNoNewRefs() is
+	 * evaluated only after the ClearPageNoNewRefs().
+	 */
+	smp_rmb();
+
+	if (unlikely(page != *pagep)) {
+		/* page no longer at *pagep */
+		put_page(page);
+		goto again;
+	}
+
+#endif
+	VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+	return page;
+}
+
 #ifdef CONFIG_NUMA
 extern struct page *page_cache_alloc(struct address_space *x);
 extern struct page *page_cache_alloc_cold(struct address_space *x);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -365,6 +365,7 @@ int remove_mapping(struct address_space 
 	if (!mapping)
 		return 0;		/* truncate got there first */
 
+	SetPageNoNewRefs(page);
 	write_lock_irq(&mapping->tree_lock);
 
 	/*
@@ -383,17 +384,20 @@ int remove_mapping(struct address_space 
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
-		__put_page(page);	/* The pagecache ref */
-		return 1;
+		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
-	__put_page(page);
+
+free_it:
+	__ClearPageNoNewRefs(page);
+	__put_page(page); /* The pagecache ref */
 	return 1;
 
 cannot_free:
 	write_unlock_irq(&mapping->tree_lock);
+	ClearPageNoNewRefs(page);
 	return 0;
 }
 
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -407,6 +407,7 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		SetPageNoNewRefs(page);
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
@@ -418,6 +419,7 @@ int add_to_page_cache(struct page *page,
 			pagecache_acct(1);
 		}
 		write_unlock_irq(&mapping->tree_lock);
+		ClearPageNoNewRefs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -78,6 +78,7 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		SetPageNoNewRefs(page);
 		write_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
@@ -90,6 +91,7 @@ static int __add_to_swap_cache(struct pa
 			pagecache_acct(1);
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
+		ClearPageNoNewRefs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -28,8 +28,6 @@
 
 #include "internal.h"
 
-#include "internal.h"
-
 /* The maximum number of pages to take off the LRU for migration */
 #define MIGRATE_CHUNK_SIZE 256
 
@@ -225,6 +223,7 @@ int migrate_page_remove_references(struc
 	if (page_mapcount(page))
 		return -EAGAIN;
 
+	SetPageNoNewRefs(page);
 	write_lock_irq(&mapping->tree_lock);
 
 	radix_pointer = (struct page **)radix_tree_lookup_slot(
@@ -234,6 +233,7 @@ int migrate_page_remove_references(struc
 	if (!page_mapping(page) || page_count(page) != nr_refs ||
 			*radix_pointer != page) {
 		write_unlock_irq(&mapping->tree_lock);
+		ClearPageNoNewRefs(page);
 		return 1;
 	}
 
@@ -253,9 +253,13 @@ int migrate_page_remove_references(struc
 		set_page_private(newpage, page_private(page));
 	}
 
-	*radix_pointer = newpage;
+	SetPageNoNewRefs(newpage);
+	rcu_assign_pointer(*radix_pointer, newpage);
+
+	write_unlock_irq(&mapping->tree_lock);
 	__put_page(page);
-	write_unlock_irq(&mapping->tree_lock);
+	ClearPageNoNewRefs(page);
+	ClearPageNoNewRefs(newpage);
 
 	return 0;
 }




Thread overview: 25+ messages
2006-03-10 15:18 A lockless pagecache for Linux Nick Piggin
2006-03-10 15:18 ` [patch 1/3] radix tree: RCU lockless read-side Nick Piggin
2006-03-11  8:22   ` Balbir Singh
2006-03-11  8:48     ` Nick Piggin
2006-03-13  3:04       ` Balbir Singh
2006-03-13  3:11         ` Nick Piggin
2006-03-13 15:24           ` Balbir Singh
2006-03-13 22:37             ` Nick Piggin
2006-03-14  3:32               ` Balbir Singh
2006-03-14  5:16                 ` Nick Piggin
2006-03-13  6:40         ` Nick Piggin
2006-03-10 15:18 ` [patch 2/3] mm: speculative get_page Nick Piggin
2006-03-10 15:18 ` [patch 3/3] mm: lockless pagecache lookups Nick Piggin
2006-03-10 15:18 ` [patch 4/3] mm: lockless optimisations Nick Piggin
2006-03-10 15:18 ` [patch 5/3] mm: spinlock tree_lock Nick Piggin
2006-03-13 23:35 ` A lockless pagecache for Linux Christoph Lameter
2006-03-14  4:14   ` Nick Piggin
2006-03-14 12:59     ` Wu Fengguang
2006-04-04  9:31 [patch 0/3] lockless pagecache Nick Piggin
2006-04-04  9:32 ` [patch 2/3] mm: speculative get_page Nick Piggin
2006-04-04  9:47   ` Andrew Morton
2006-04-04 10:21     ` Nick Piggin
2006-04-04 15:20   ` Christoph Lameter
2006-04-05  0:22     ` Nick Piggin
2006-04-04 15:21   ` Christoph Lameter
2006-04-05  0:27     ` Nick Piggin
