* [patch 2/4] radix-tree: use indirect bit
2006-09-22 19:22 [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin
@ 2006-09-22 19:22 ` Nick Piggin
2006-09-22 19:22 ` [patch 2/9] radix-tree: gang_lookup_slot Nick Piggin
` (3 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-09-22 19:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, Linux Memory Management
Rather than sign direct radix-tree pointers with a special bit, sign
the indirect one that hangs off the root. This means that, given a
lookup_slot operation, the invalid result will be differentiated from
the valid (previously, valid results could have the bit either set or
clear).
This does not affect slot lookups which occur under lock -- they
can never return an invalid result. It is needed in the future for the
lockless pagecache.
Signed-off-by: Nick Piggin <npiggin@suse.de>
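(Not part of the patch: a minimal sketch, assuming the kernel's radix-tree
and RCU headers, of the lookup_slot pattern this change enables under RCU;
the caller name is made up for illustration. If the slot we dereference
turns out to hold an indirect pointer, the tree has been reshaped
underneath us and radix_tree_deref_slot() reports RADIX_TREE_RETRY, so the
caller simply retries. Lookups done under the tree lock never see this.)

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

static void *rcu_lookup_sketch(struct radix_tree_root *root,
				unsigned long index)
{
	void **slot;
	void *item;

	rcu_read_lock();
repeat:
	item = NULL;
	slot = radix_tree_lookup_slot(root, index);
	if (slot) {
		item = radix_tree_deref_slot(slot);
		/* slot went indirect under us: tree changed shape, retry */
		if (unlikely(item == RADIX_TREE_RETRY))
			goto repeat;
	}
	rcu_read_unlock();
	return item;
}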
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -27,28 +27,31 @@
#include <linux/rcupdate.h>
/*
- * A direct pointer (root->rnode pointing directly to a data item,
- * rather than another radix_tree_node) is signalled by the low bit
- * set in the root->rnode pointer.
- *
- * In this case root->height is also NULL, but the direct pointer tests are
- * needed for RCU lookups when root->height is unreliable.
+ * An indirect pointer (root->rnode pointing to a radix_tree_node, rather
+ * than a data item) is signalled by the low bit set in the root->rnode
+ * pointer.
+ *
+ * In this case root->height is > 0, but the indirect pointer tests are
+ * needed for RCU lookups (because root->height is unreliable). The only
+ * time callers need worry about this is when doing a lookup_slot under
+ * RCU.
*/
-#define RADIX_TREE_DIRECT_PTR 1
+#define RADIX_TREE_INDIRECT_PTR 1
+#define RADIX_TREE_RETRY ((void *)-1UL)
-static inline void *radix_tree_ptr_to_direct(void *ptr)
+static inline void *radix_tree_ptr_to_indirect(void *ptr)
{
- return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
+ return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
}
-static inline void *radix_tree_direct_to_ptr(void *ptr)
+static inline void *radix_tree_indirect_to_ptr(void *ptr)
{
- return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
+ return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
}
-static inline int radix_tree_is_direct_ptr(void *ptr)
+static inline int radix_tree_is_indirect_ptr(void *ptr)
{
- return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
+ return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
}
/*** radix-tree API starts here ***/
@@ -131,7 +134,10 @@ do { \
*/
static inline void *radix_tree_deref_slot(void **pslot)
{
- return radix_tree_direct_to_ptr(*pslot);
+ void *ret = *pslot;
+ if (unlikely(radix_tree_is_indirect_ptr(ret)))
+ ret = RADIX_TREE_RETRY;
+ return ret;
}
/**
* radix_tree_replace_slot - replace item in a slot
@@ -143,10 +149,8 @@ static inline void *radix_tree_deref_slo
*/
static inline void radix_tree_replace_slot(void **pslot, void *item)
{
- BUG_ON(radix_tree_is_direct_ptr(item));
- rcu_assign_pointer(*pslot,
- (void *)((unsigned long)item |
- ((unsigned long)*pslot & RADIX_TREE_DIRECT_PTR)));
+ BUG_ON(radix_tree_is_indirect_ptr(item));
+ rcu_assign_pointer(*pslot, item);
}
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -104,7 +104,7 @@ radix_tree_node_alloc(struct radix_tree_
rtp->nr--;
}
}
- BUG_ON(radix_tree_is_direct_ptr(ret));
+ BUG_ON(radix_tree_is_indirect_ptr(ret));
return ret;
}
@@ -240,7 +240,7 @@ static int radix_tree_extend(struct radi
return -ENOMEM;
/* Increase the height. */
- node->slots[0] = radix_tree_direct_to_ptr(root->rnode);
+ node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
/* Propagate the aggregated tag info into the new root */
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
@@ -251,6 +251,7 @@ static int radix_tree_extend(struct radi
newheight = root->height+1;
node->height = newheight;
node->count = 1;
+ node = radix_tree_ptr_to_indirect(node);
rcu_assign_pointer(root->rnode, node);
root->height = newheight;
} while (height > root->height);
@@ -274,7 +275,7 @@ int radix_tree_insert(struct radix_tree_
int offset;
int error;
- BUG_ON(radix_tree_is_direct_ptr(item));
+ BUG_ON(radix_tree_is_indirect_ptr(item));
/* Make sure the tree is high enough. */
if (index > radix_tree_maxindex(root->height)) {
@@ -283,7 +284,8 @@ int radix_tree_insert(struct radix_tree_
return error;
}
- slot = root->rnode;
+ slot = radix_tree_indirect_to_ptr(root->rnode);
+
height = root->height;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
@@ -298,7 +300,8 @@ int radix_tree_insert(struct radix_tree_
rcu_assign_pointer(node->slots[offset], slot);
node->count++;
} else
- rcu_assign_pointer(root->rnode, slot);
+ rcu_assign_pointer(root->rnode,
+ radix_tree_ptr_to_indirect(slot));
}
/* Go a level down */
@@ -318,7 +321,7 @@ int radix_tree_insert(struct radix_tree_
BUG_ON(tag_get(node, 0, offset));
BUG_ON(tag_get(node, 1, offset));
} else {
- rcu_assign_pointer(root->rnode, radix_tree_ptr_to_direct(item));
+ rcu_assign_pointer(root->rnode, item);
BUG_ON(root_tag_get(root, 0));
BUG_ON(root_tag_get(root, 1));
}
@@ -350,11 +353,12 @@ void **radix_tree_lookup_slot(struct rad
if (node == NULL)
return NULL;
- if (radix_tree_is_direct_ptr(node)) {
+ if (!radix_tree_is_indirect_ptr(node)) {
if (index > 0)
return NULL;
return (void **)&root->rnode;
}
+ node = radix_tree_indirect_to_ptr(node);
height = node->height;
if (index > radix_tree_maxindex(height))
@@ -398,11 +402,12 @@ void *radix_tree_lookup(struct radix_tre
if (node == NULL)
return NULL;
- if (radix_tree_is_direct_ptr(node)) {
+ if (!radix_tree_is_indirect_ptr(node)) {
if (index > 0)
return NULL;
- return radix_tree_direct_to_ptr(node);
+ return node;
}
+ node = radix_tree_indirect_to_ptr(node);
height = node->height;
if (index > radix_tree_maxindex(height))
@@ -447,7 +452,7 @@ void *radix_tree_tag_set(struct radix_tr
height = root->height;
BUG_ON(index > radix_tree_maxindex(height));
- slot = root->rnode;
+ slot = radix_tree_indirect_to_ptr(root->rnode);
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
while (height > 0) {
@@ -497,7 +502,7 @@ void *radix_tree_tag_clear(struct radix_
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
pathp->node = NULL;
- slot = root->rnode;
+ slot = radix_tree_indirect_to_ptr(root->rnode);
while (height > 0) {
int offset;
@@ -562,8 +567,9 @@ int radix_tree_tag_get(struct radix_tree
if (node == NULL)
return 0;
- if (radix_tree_is_direct_ptr(node))
+ if (!radix_tree_is_indirect_ptr(node))
return (index == 0);
+ node = radix_tree_indirect_to_ptr(node);
height = node->height;
if (index > radix_tree_maxindex(height))
@@ -751,13 +757,13 @@ radix_tree_gang_lookup(struct radix_tree
if (!node)
return 0;
- if (radix_tree_is_direct_ptr(node)) {
+ if (!radix_tree_is_indirect_ptr(node)) {
if (first_index > 0)
return 0;
- node = radix_tree_direct_to_ptr(node);
results[0] = rcu_dereference(node);
return 1;
}
+ node = radix_tree_indirect_to_ptr(node);
max_index = radix_tree_maxindex(node->height);
@@ -879,13 +885,13 @@ radix_tree_gang_lookup_tag(struct radix_
if (!node)
return 0;
- if (radix_tree_is_direct_ptr(node)) {
+ if (!radix_tree_is_indirect_ptr(node)) {
if (first_index > 0)
return 0;
- node = radix_tree_direct_to_ptr(node);
results[0] = rcu_dereference(node);
return 1;
}
+ node = radix_tree_indirect_to_ptr(node);
max_index = radix_tree_maxindex(node->height);
@@ -915,12 +921,22 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag
static inline void radix_tree_shrink(struct radix_tree_root *root)
{
/* try to shrink tree height */
- while (root->height > 0 &&
- root->rnode->count == 1 &&
- root->rnode->slots[0]) {
+ while (root->height > 0) {
struct radix_tree_node *to_free = root->rnode;
void *newptr;
+ BUG_ON(!radix_tree_is_indirect_ptr(to_free));
+ to_free = radix_tree_indirect_to_ptr(to_free);
+
+ /*
+ * If the candidate node has more than one child, or its child
+ * is not at the leftmost slot, we cannot shrink.
+ */
+ if (to_free->count != 1)
+ break;
+ if (!to_free->slots[0])
+ break;
+
/*
* We don't need rcu_assign_pointer(), since we are simply
* moving the node from one part of the tree to another. If
@@ -929,8 +945,8 @@ static inline void radix_tree_shrink(str
* one (root->rnode).
*/
newptr = to_free->slots[0];
- if (root->height == 1)
- newptr = radix_tree_ptr_to_direct(newptr);
+ if (root->height > 1)
+ newptr = radix_tree_ptr_to_indirect(newptr);
root->rnode = newptr;
root->height--;
/* must only free zeroed nodes into the slab */
@@ -965,12 +981,12 @@ void *radix_tree_delete(struct radix_tre
goto out;
slot = root->rnode;
- if (height == 0 && root->rnode) {
- slot = radix_tree_direct_to_ptr(slot);
+ if (height == 0 /* XXX: bugfix? */) {
root_tag_clear_all(root);
root->rnode = NULL;
goto out;
}
+ slot = radix_tree_indirect_to_ptr(slot);
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
pathp->node = NULL;
@@ -1012,7 +1028,8 @@ void *radix_tree_delete(struct radix_tre
radix_tree_node_free(to_free);
if (pathp->node->count) {
- if (pathp->node == root->rnode)
+ if (pathp->node ==
+ radix_tree_indirect_to_ptr(root->rnode))
radix_tree_shrink(root);
goto out;
}
* [patch 2/9] radix-tree: gang_lookup_slot
2006-09-22 19:22 [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin
2006-09-22 19:22 ` [patch 2/4] radix-tree: use indirect bit Nick Piggin
@ 2006-09-22 19:22 ` Nick Piggin
2006-09-22 19:22 ` [patch 3/9] mm: speculative get page Nick Piggin
` (2 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-09-22 19:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, Linux Memory Management
Introduce a gang_lookup_slot function, which is used by the lockless pagecache.
Signed-off-by: Nick Piggin <npiggin@suse.de>
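(Illustration only, not part of the patch: a hypothetical caller showing
how the slot-based gang lookup is meant to be used under RCU. The results
array now holds slot pointers, so each one must be dereferenced with
radix_tree_deref_slot() and may need rechecking; RADIX_TREE_RETRY comes
from the previous patch in the series, and a real user such as
find_get_pages restarts the scan rather than skipping when it sees it.)

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

#define SLOT_BATCH 16

static unsigned int count_present_sketch(struct radix_tree_root *root,
					 unsigned long first_index)
{
	void **slots[SLOT_BATCH];
	unsigned int i, nr, present = 0;

	rcu_read_lock();
	nr = radix_tree_gang_lookup_slot(root, slots, first_index, SLOT_BATCH);
	for (i = 0; i < nr; i++) {
		void *item = radix_tree_deref_slot(slots[i]);

		/* the slot may have been emptied or gone indirect meanwhile */
		if (!item || item == RADIX_TREE_RETRY)
			continue;
		present++;
	}
	rcu_read_unlock();
	return present;
}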
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -100,12 +100,14 @@ do { \
*
* The notable exceptions to this rule are the following functions:
* radix_tree_lookup
+ * radix_tree_lookup_slot
* radix_tree_tag_get
* radix_tree_gang_lookup
+ * radix_tree_gang_lookup_slot
* radix_tree_gang_lookup_tag
* radix_tree_tagged
*
- * The first 4 functions are able to be called locklessly, using RCU. The
+ * The first 6 functions are able to be called locklessly, using RCU. The
* caller must ensure calls to these functions are made within rcu_read_lock()
* regions. Other readers (lock-free or otherwise) and modifications may be
* running concurrently.
@@ -160,6 +162,9 @@ void *radix_tree_delete(struct radix_tre
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items);
unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -349,7 +349,7 @@ void **radix_tree_lookup_slot(struct rad
unsigned int height, shift;
struct radix_tree_node *node, **slot;
- node = root->rnode;
+ node = rcu_dereference(root->rnode);
if (node == NULL)
return NULL;
@@ -369,7 +369,7 @@ void **radix_tree_lookup_slot(struct rad
do {
slot = (struct radix_tree_node **)
(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
- node = *slot;
+ node = rcu_dereference(*slot);
if (node == NULL)
return NULL;
@@ -677,7 +677,7 @@ unsigned long radix_tree_scan_hole_backw
EXPORT_SYMBOL(radix_tree_scan_hole_backward);
static unsigned int
-__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
unsigned int nr_found = 0;
@@ -715,7 +715,7 @@ __lookup(struct radix_tree_node *slot, v
index++;
node = slot->slots[i];
if (node) {
- results[nr_found++] = rcu_dereference(node);
+ results[nr_found++] = &(slot->slots[i]);
if (nr_found == max_items)
goto out;
}
@@ -760,7 +760,74 @@ radix_tree_gang_lookup(struct radix_tree
if (!radix_tree_is_indirect_ptr(node)) {
if (first_index > 0)
return 0;
- results[0] = rcu_dereference(node);
+ results[0] = node;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
+ unsigned int nr_found, i, j;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ nr_found = __lookup(node, (void ***)results + ret, cur_index,
+ max_items - ret, &next_index);
+ for (i = j = 0; i < nr_found; i++) {
+ struct radix_tree_node *slot;
+ slot = rcu_dereference(*(((void ***)results)[ret + i]));
+ if (!slot)
+ continue;
+ results[ret + j] = slot;
+ j++;
+ }
+ ret += j;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ * radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ *
+ * Performs an index-ascending scan of the tree for present items. Places
+ * their slots at *@results and returns the number of items which were
+ * placed at *@results.
+ *
+ * The implementation is naive.
+ *
+ * Like radix_tree_gang_lookup as far as RCU and locking goes. Slots must
+ * be dereferenced with radix_tree_deref_slot, and if using only RCU
+ * protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items)
+{
+ unsigned long max_index;
+ struct radix_tree_node *node;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = (void **)&root->rnode;
return 1;
}
node = radix_tree_indirect_to_ptr(node);
@@ -784,7 +851,7 @@ radix_tree_gang_lookup(struct radix_tree
return ret;
}
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
/*
* FIXME: the two tag_get()s here should use find_next_bit() instead of
* [patch 3/9] mm: speculative get page
2006-09-22 19:22 [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin
2006-09-22 19:22 ` [patch 2/4] radix-tree: use indirect bit Nick Piggin
2006-09-22 19:22 ` [patch 2/9] radix-tree: gang_lookup_slot Nick Piggin
@ 2006-09-22 19:22 ` Nick Piggin
2006-09-23 10:01 ` Peter Zijlstra
2006-09-24 18:01 ` Hugh Dickins
2006-09-22 19:22 ` [patch 4/9] mm: lockless pagecache lookups Nick Piggin
2006-09-22 19:24 ` [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin
4 siblings, 2 replies; 14+ messages in thread
From: Nick Piggin @ 2006-09-22 19:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, Linux Memory Management
If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.
This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).
Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.
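(A sketch of what that operation amounts to, using the existing
get_page_unless_zero()/atomic_inc_not_zero() helpers; the wrapper name
here is made up. If _count has already reached zero -- a free page, or a
constituent page of a compound page -- it is left untouched and the caller
must redo the lookup.)

#include <linux/mm.h>

static inline int speculative_get_sketch(struct page *page)
{
	/* increment page->_count unless it is zero; returns 0 on failure */
	return atomic_inc_not_zero(&page->_count);
}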
This patch introduces the core locking protocol to the pagecache
(ie. adds page_cache_get_speculative, and tweaks some update-side
code to make it work).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -86,6 +86,8 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define PG_nonewrefs 20 /* Block concurrent pagecache lookups
+ * while testing refcount */
#if (BITS_PER_LONG > 32)
/*
@@ -247,6 +249,11 @@
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageNoNewRefs(page) test_bit(PG_nonewrefs, &(page)->flags)
+#define SetPageNoNewRefs(page) set_bit(PG_nonewrefs, &(page)->flags)
+#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
+#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -11,6 +11,8 @@
#include <linux/compiler.h>
#include <asm/uaccess.h>
#include <linux/gfp.h>
+#include <linux/page-flags.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +53,109 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be run in the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree: this allows
+ * allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with (because put_page is what is used to drop an
+ * invalid speculative reference).
+ *
+ * After incrementing the refcount, this function spins until PageNoNewRefs
+ * is clear, then a read memory barrier is issued.
+ *
+ * This forms the core of the lockless pagecache locking protocol, where
+ * the lookup-side (eg. find_get_page) has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. wait for PageNoNewRefs
+ * 4. check the page is still in pagecache
+ *
+ * Remove-side (that cares about _count, eg. reclaim) has the following:
+ * A. SetPageNoNewRefs
+ * B. check refcount is correct
+ * C. remove page
+ * D. ClearPageNoNewRefs
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before B: in this case, B sees elevated refcount and bails out
+ * - B runs before 2: in this case, 3 ensures 4 will not run until *after* C
+ * (after D, even). In which case, 4 will notice C and lookup side can retry
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * the write-side, depending on timing.
+ *
+ * Pagecache insertion isn't a big problem: either 1 will find the page or
+ * it will not. Likewise, the old find_get_page could run either before the
+ * insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
+#else
+ if (unlikely(!get_page_unless_zero(page)))
+ return 0; /* page has been freed */
+
+ /*
+ * Note that get_page_unless_zero provides a memory barrier.
+ * This is needed to ensure PageNoNewRefs is evaluated after the
+ * page refcount has been raised. See below comment.
+ */
+
+ while (unlikely(PageNoNewRefs(page)))
+ cpu_relax();
+
+ /*
+ * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
+ * is performed before a future load used to ensure the page is
+ * the correct one (usually: page->mapping and page->index).
+ *
+ * Those places that set PageNoNewRefs have the following pattern:
+ * SetPageNoNewRefs(page)
+ * wmb();
+ * if (page_count(page) == X)
+ * remove page from pagecache
+ * wmb();
+ * ClearPageNoNewRefs(page)
+ *
+ * If the load was out of order, page->mapping might be loaded before
+ * the page is removed from pagecache but PageNoNewRefs evaluated
+ * after the ClearPageNoNewRefs().
+ */
+ smp_rmb();
+
+#endif
+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+ return 1;
+}
+
#ifdef CONFIG_NUMA
extern struct page *page_cache_alloc(struct address_space *x);
extern struct page *page_cache_alloc_cold(struct address_space *x);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -380,6 +380,8 @@ int remove_mapping(struct address_space
if (!mapping)
return 0; /* truncate got there first */
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
/*
@@ -398,17 +400,21 @@ int remove_mapping(struct address_space
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
- __put_page(page); /* The pagecache ref */
- return 1;
+ goto free_it;
}
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
+
+free_it:
+ smp_wmb();
+ __ClearPageNoNewRefs(page);
+ __put_page(page); /* The pagecache ref */
return 1;
cannot_free:
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
return 0;
}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -440,6 +440,8 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
@@ -451,6 +453,8 @@ int add_to_page_cache(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
+ smp_wmb();
+ ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -78,6 +78,8 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
@@ -90,6 +92,8 @@ static int __add_to_swap_cache(struct pa
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&swapper_space.tree_lock);
+ smp_wmb();
+ ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -303,6 +303,8 @@ static int migrate_page_move_mapping(str
return 0;
}
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
@@ -311,6 +313,7 @@ static int migrate_page_move_mapping(str
if (page_count(page) != 2 + !!PagePrivate(page) ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
return -EAGAIN;
}
@@ -325,16 +328,25 @@ static int migrate_page_move_mapping(str
}
#endif
+ SetPageNoNewRefs(newpage);
radix_tree_replace_slot(pslot, newpage);
+ write_unlock_irq(&mapping->tree_lock);
+
+ page->mapping = NULL;
+
+ write_unlock_irq(&mapping->tree_lock);
+
+ smp_wmb();
+ ClearPageNoNewRefs(page);
+ ClearPageNoNewRefs(newpage);
+
/*
* Drop cache reference from old page.
* We know this isn't the last reference.
*/
__put_page(page);
- write_unlock_irq(&mapping->tree_lock);
-
return 0;
}
* Re: [patch 3/9] mm: speculative get page
2006-09-22 19:22 ` [patch 3/9] mm: speculative get page Nick Piggin
@ 2006-09-23 10:01 ` Peter Zijlstra
2006-09-24 18:01 ` Hugh Dickins
1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-09-23 10:01 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management
>
> @@ -325,16 +328,25 @@ static int migrate_page_move_mapping(str
> }
> #endif
>
> + SetPageNoNewRefs(newpage);
> radix_tree_replace_slot(pslot, newpage);
>
> + write_unlock_irq(&mapping->tree_lock);
> +
> + page->mapping = NULL;
> +
> + write_unlock_irq(&mapping->tree_lock);
> +
> + smp_wmb();
> + ClearPageNoNewRefs(page);
> + ClearPageNoNewRefs(newpage);
> +
> /*
> * Drop cache reference from old page.
> * We know this isn't the last reference.
> */
> __put_page(page);
>
> - write_unlock_irq(&mapping->tree_lock);
> -
> return 0;
> }
2 consecutive write_unlock_irq() calls seem odd ;-)
* Re: [patch 3/9] mm: speculative get page
2006-09-22 19:22 ` [patch 3/9] mm: speculative get page Nick Piggin
2006-09-23 10:01 ` Peter Zijlstra
@ 2006-09-24 18:01 ` Hugh Dickins
2006-09-25 2:00 ` Nick Piggin
1 sibling, 1 reply; 14+ messages in thread
From: Hugh Dickins @ 2006-09-24 18:01 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management
On Fri, 22 Sep 2006, Nick Piggin wrote:
>
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -86,6 +86,8 @@
> #define PG_nosave_free 18 /* Free, should not be written */
> #define PG_buddy 19 /* Page is free, on buddy lists */
>
> +#define PG_nonewrefs 20 /* Block concurrent pagecache lookups
> + * while testing refcount */
Something I didn't get around to mentioning last time: I could well
be mistaken, but it seemed that you could get along without all the
PageNoNewRefs stuff, at cost of using something (too expensive?)
like atomic_cmpxchg(&page->_count, 2, 0) in remove_mapping() and
migrate_page_move_mapping(); compensated by simplification at the
other end in page_cache_get_speculative(), which is already
expected to be the hotter path.
I find it unaesthetic (suspect you do too) to add that adhoc
PageNoNewRefs method of freezing the count, when you're already
demanding that count 0 must be frozen: why not make use of that?
then since you know it's frozen while 0, you can easily insert
the proper count at the end of the critical region.
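(To make that concrete, a hypothetical helper, not code from any of the
patches, assuming <linux/mm.h> for struct page and the atomic ops: with
the tree lock held for write, the remover atomically verifies the count is
exactly what it expects and drops it to zero in one step. Zero is the
state page_cache_get_speculative() already refuses to take a reference
against, so the count stays frozen until the remover restores the proper
value at the end of its critical region.)

#include <linux/mm.h>

static inline int freeze_page_refs_sketch(struct page *page, int expected)
{
	/* returns 1 and leaves _count at 0 only if it was 'expected' */
	return atomic_cmpxchg(&page->_count, expected, 0) == expected;
}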
I didn't attempt to work out what memory barriers would be needed,
but did test a version working that way on i386 - though I seem
to have tidied those mods away to /dev/null since then.
We disagreed over whether PageNoNewRefs usage in add_to_page_cache
and __add_to_swap_cache was the same as in remove_mapping; but I
think we agreed it could be avoided completely in those, just by
being more careful about the ordering of the updates to struct page
(I think it looked like the SetPageLocked needed to come earlier,
but I forget the logic right now).
> +static inline int page_cache_get_speculative(struct page *page)
> +{
> + VM_BUG_ON(in_interrupt());
> +
> +#ifndef CONFIG_SMP
> +# ifdef CONFIG_PREEMPT
> + VM_BUG_ON(!in_atomic());
> +# endif
> + /*
> + * Preempt must be disabled here - we rely on rcu_read_lock doing
> + * this for us.
> + *
> + * Pagecache won't be truncated from interrupt context, so if we have
> + * found a page in the radix tree here, we have pinned its refcount by
> + * disabling preempt, and hence no need for the "speculative get" that
> + * SMP requires.
> + */
> + VM_BUG_ON(page_count(page) == 0);
> + atomic_inc(&page->_count);
> +
> +#else
> + if (unlikely(!get_page_unless_zero(page)))
> + return 0; /* page has been freed */
This is the test which fails nicely whenever count is set to 0,
whether because the page has been freed or because you wish to
freeze it. But if you do make such a change, callers of
page_cache_get_speculative may need to loop a little differently
when it fails (the page might not be freed).
> + VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
I found that VM_BUG_ON confusing, because it's only catching a tiny
proportion of the cases you're interested in ruling out: most high
order pages aren't PageCompound (but only the PageCompound ones offer
that kind of check). If you really want to keep the check, I think
it needs a comment to explain that; but I'd just delete the line.
Hugh
* Re: [patch 3/9] mm: speculative get page
2006-09-24 18:01 ` Hugh Dickins
@ 2006-09-25 2:00 ` Nick Piggin
2006-09-25 11:47 ` Nick Piggin
0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2006-09-25 2:00 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management
Hugh Dickins wrote:
>On Fri, 22 Sep 2006, Nick Piggin wrote:
>
>>Index: linux-2.6/include/linux/page-flags.h
>>===================================================================
>>--- linux-2.6.orig/include/linux/page-flags.h
>>+++ linux-2.6/include/linux/page-flags.h
>>@@ -86,6 +86,8 @@
>> #define PG_nosave_free 18 /* Free, should not be written */
>> #define PG_buddy 19 /* Page is free, on buddy lists */
>>
>>+#define PG_nonewrefs 20 /* Block concurrent pagecache lookups
>>+ * while testing refcount */
>>
>
>Something I didn't get around to mentioning last time: I could well
>be mistaken, but it seemed that you could get along without all the
>PageNoNewRefs stuff, at cost of using something (too expensive?)
>like atomic_cmpxchg(&page->_count, 2, 0) in remove_mapping() and
>migrate_page_move_mapping(); compensated by simplification at the
>other end in page_cache_get_speculative(), which is already
>expected to be the hotter path.
>
Wow. That's amazing, why didn't I think of that? ;) Now that
we have a PG_buddy, this is going to work nicely.
>I find it unaesthetic (suspect you do too) to add that adhoc
>PageNoNewRefs method of freezing the count, when you're already
>demanding that count 0 must be frozen: why not make use of that?
>then since you know it's frozen while 0, you can easily insert
>the proper count at the end of the critical region.
>
Yes, and without using atomic ops, too.
>
>I didn't attempt to work out what memory barriers would be needed,
>but did test a version working that way on i386 - though I seem
>to have tidied those mods away to /dev/null since then.
>
Memory barriers will be reduced, because we're now only operating
on the single variable, rather than 2 (_count and flags), so we
don't need anything to order them (other than normal cache coherency).
Importantly, this will cut the smp_rmb() out of the speculative get,
which I suspect is why ia64 had slightly worse performance there.
Beautiful. (it will also cut out the smp_wmb()s and one atomic op out
of the write side).
>We disagreed over whether PageNoNewRefs usage in add_to_page_cache
>and __add_to_swap_cache was the same as in remove_mapping; but I
>think we agreed it could be avoided completely in those, just by
>being more careful about the ordering of the updates to struct page
>(I think it looked like the SetPageLocked needed to come earlier,
>but I forget the logic right now).
>
That's right. I have that in a followup patch, but in the interests
of keeping things small, I won't submit it for the first iteration.
>
>>+static inline int page_cache_get_speculative(struct page *page)
>>+{
>>+ VM_BUG_ON(in_interrupt());
>>+
>>+#ifndef CONFIG_SMP
>>+# ifdef CONFIG_PREEMPT
>>+ VM_BUG_ON(!in_atomic());
>>+# endif
>>+ /*
>>+ * Preempt must be disabled here - we rely on rcu_read_lock doing
>>+ * this for us.
>>+ *
>>+ * Pagecache won't be truncated from interrupt context, so if we have
>>+ * found a page in the radix tree here, we have pinned its refcount by
>>+ * disabling preempt, and hence no need for the "speculative get" that
>>+ * SMP requires.
>>+ */
>>+ VM_BUG_ON(page_count(page) == 0);
>>+ atomic_inc(&page->_count);
>>+
>>+#else
>>+ if (unlikely(!get_page_unless_zero(page)))
>>+ return 0; /* page has been freed */
>>
>
>This is the test which fails nicely whenever count is set to 0,
>whether because the page has been freed or because you wish to
>freeze it. But if you do make such a change, callers of
>page_cache_get_speculative may need to loop a little differently
>when it fails (the page might not be freed).
>
Yes, they can just retry -- in the case that the page had been freed,
they'll
find NULL in the radix tree slot on the next iteration: the 'return' here is
just a little shortcut.
>
>>+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
>>
>
>I found that VM_BUG_ON confusing, because it's only catching a tiny
>proportion of the cases you're interested in ruling out: most high
>order pages aren't PageCompound (but only the PageCompound ones offer
>that kind of check). If you really want to keep the check, I think
>it needs a comment to explain that; but I'd just delete the line.
>
But it is OK to take a spec ref to a higher order non compound page,
because they follow the same refcounting rules as if they are individual
order-0 pages.

Compound pages should be OK, because their constituent pages will have a
count of 0 and so will fail the previous test. This check was just for my
own satisfaction but I left it in as a form of commenting, however as you
say I should probably elaborate on that too.
Thank you thank you,
Nick
* Re: [patch 3/9] mm: speculative get page
2006-09-25 2:00 ` Nick Piggin
@ 2006-09-25 11:47 ` Nick Piggin
2006-09-25 13:04 ` Peter Zijlstra
0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2006-09-25 11:47 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, Andrew Morton, Linux Memory Management
On Mon, Sep 25, 2006 at 12:00:14PM +1000, Nick Piggin wrote:
> Hugh Dickins wrote:
>
> >On Fri, 22 Sep 2006, Nick Piggin wrote:
> >
> >>Index: linux-2.6/include/linux/page-flags.h
> >>===================================================================
> >>--- linux-2.6.orig/include/linux/page-flags.h
> >>+++ linux-2.6/include/linux/page-flags.h
> >>@@ -86,6 +86,8 @@
> >>#define PG_nosave_free 18 /* Free, should not be
> >>written */
> >>#define PG_buddy 19 /* Page is free, on buddy lists */
> >>
> >>+#define PG_nonewrefs 20 /* Block concurrent
> >>pagecache lookups
> >>+ * while testing refcount */
> >>
> >
> >Something I didn't get around to mentioning last time: I could well
> >be mistaken, but it seemed that you could get along without all the
> >PageNoNewRefs stuff, at cost of using something (too expensive?)
> >like atomic_cmpxchg(&page->_count, 2, 0) in remove_mapping() and
> >migrate_page_move_mapping(); compensated by simplification at the
> >other end in page_cache_get_speculative(), which is already
> >expected to be the hotter path.
> >
>
> Wow. That's amazing, why didn't I think of that? ;) Now that
> we have a PG_buddy, this is going to work nicely.
OK, so one reason I think is that I was worried about adding pages
to swap/page cache that are not "new" pages.
In which case, I didn't want to be messing around with things like
page->mapping or page flags if the page wasn't actually able to be
added to the cache.
Swapcache is OK, because PageSwapCache seems to be always serialised
by PG_lock. With pagecache, this only happens via shmem but if you
think that's OK, then fine by me.
The result (appended) is incredibly fast on my P4, where it takes
about 70% of the time required by the nonewrefs version to perform
a find_get_page. The cmpxchg takes only 5ns longer than lock ; inc
so I'd say this would have to be about as fast as you could
implement a find_get_page ;)
This is what it looks like (after a small patch to ensure
__add_to_swap_cache is always called with PG_locked, so it does
not need to unconditionally set PG_locked). What do you think?
Things are still open coded ATM, but I'll clean that up.
--
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -11,6 +11,7 @@
#include <linux/compiler.h>
#include <asm/uaccess.h>
#include <linux/gfp.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +52,83 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be run in the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree: this allows
+ * allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with (because put_page is what is used to drop an
+ * invalid speculative reference).
+ *
+ * This forms the core of the lockless pagecache locking protocol, where
+ * the lookup-side (eg. find_get_page) has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. check the page is still in pagecache (if no, goto 1)
+ *
+ * Remove-side that cares about stability of _count (eg. reclaim) has the
+ * following (with tree_lock held for write):
+ * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
+ * B. remove page from pagecache
+ * C. free the page
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before A: in this case, A sees elevated refcount and bails out
+ * - A runs before 2: in this case, 2 sees zero refcount and retries;
+ * subsequently, B will complete and 1 will find no page, causing the
+ * lookup to return NULL.
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * such a re-insertion, depending on order that locks are granted.
+ *
+ * Lookups racing against pagecache insertion isn't a big problem: either 1
+ * will find the page or it will not. Likewise, the old find_get_page could run
+ * either before the insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
+#else
+ if (unlikely(!get_page_unless_zero(page))) {
+ /*
+ * Either the page has been freed, or will be freed.
+ * In either case, retry here and the caller should
+ * do the right thing (see comments above).
+ */
+ return 0;
+ }
+#endif
+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+ return 1;
+}
+
#ifdef CONFIG_NUMA
extern struct page *page_cache_alloc(struct address_space *x);
extern struct page *page_cache_alloc_cold(struct address_space *x);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -387,9 +387,9 @@ int remove_mapping(struct address_space
* PageDirty _after_ making sure that the page is freeable and
* not in use by anybody. (pagecache + us == 2)
*/
- if (unlikely(page_count(page) != 2))
+ if (unlikely(atomic_cmpxchg(&page->_count, 2, 0) != 2))
goto cannot_free;
- smp_rmb();
+ /* note: atomic_cmpxchg provides a barrier */
if (unlikely(PageDirty(page)))
goto cannot_free;
@@ -398,13 +398,14 @@ int remove_mapping(struct address_space
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
- __put_page(page); /* The pagecache ref */
- return 1;
+ goto free_it;
}
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
+
+free_it:
+ set_page_count(page, 1); /* Effectively drop the pagecache ref */
return 1;
cannot_free:
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -440,18 +440,25 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ page_cache_get(page);
+ SetPageLocked(page);
+ page->mapping = mapping;
+ page->index = offset;
+
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (!error) {
- page_cache_get(page);
- SetPageLocked(page);
- page->mapping = mapping;
- page->index = offset;
+ if (likely(!error)) {
mapping->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
+
+ if (unlikely(error)) {
+ page->mapping = NULL;
+ ClearPageLocked(page);
+ page_cache_release(page);
+ }
}
return error;
}
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -79,18 +79,25 @@ static int __add_to_swap_cache(struct pa
BUG_ON(!PageLocked(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ page_cache_get(page);
+ SetPageSwapCache(page);
+ set_page_private(page, entry.val);
+
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
- if (!error) {
- page_cache_get(page);
- SetPageSwapCache(page);
- set_page_private(page, entry.val);
+ if (likely(!error)) {
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
+
+ if (unlikely(error)) {
+ set_page_private(page, 0UL);
+ ClearPageSwapCache(page);
+ page_cache_release(page);
+ }
}
return error;
}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -294,6 +294,7 @@ out:
static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
{
+ int expected_count;
void **pslot;
if (!mapping) {
@@ -308,12 +309,19 @@ static int migrate_page_move_mapping(str
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
- if (page_count(page) != 2 + !!PagePrivate(page) ||
+ expected_count = 2 + !!PagePrivate(page);
+ if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
write_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
+ if (unlikely(atomic_cmpxchg(&page->_count, expected_count, 0)
+ != expected_count)) {
+ write_unlock_irq(&mapping->tree_lock);
+ return -EAGAIN;
+ }
+
/*
* Now we know that no one else is looking at the page.
*/
@@ -326,14 +334,15 @@ static int migrate_page_move_mapping(str
#endif
radix_tree_replace_slot(pslot, newpage);
+ page->mapping = NULL;
+
+ write_unlock_irq(&mapping->tree_lock);
/*
- * Drop cache reference from old page.
+ * Effectively drop the pagecache reference from old page.
* We know this isn't the last reference.
*/
- __put_page(page);
-
- write_unlock_irq(&mapping->tree_lock);
+ set_page_count(page, expected_count-1);
return 0;
}
* Re: [patch 3/9] mm: speculative get page
2006-09-25 11:47 ` Nick Piggin
@ 2006-09-25 13:04 ` Peter Zijlstra
2006-09-25 22:41 ` Nick Piggin
0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2006-09-25 13:04 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, Andrew Morton, Linux Memory Management
On Mon, 2006-09-25 at 13:47 +0200, Nick Piggin wrote:
> +/*
> + * speculatively take a reference to a page.
> + * If the page is free (_count == 0), then _count is untouched, and 0
> + * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
> + *
> + * This function must be run in the same rcu_read_lock() section as has
> + * been used to lookup the page in the pagecache radix-tree: this allows
> + * allocators to use a synchronize_rcu() to stabilize _count.
> + *
> + * Unless an RCU grace period has passed, the count of all pages coming out
> + * of the allocator must be considered unstable. page_count may return higher
> + * than expected, and put_page must be able to do the right thing when the
> + * page has been finished with (because put_page is what is used to drop an
> + * invalid speculative reference).
> + *
> + * This forms the core of the lockless pagecache locking protocol, where
> + * the lookup-side (eg. find_get_page) has the following pattern:
> + * 1. find page in radix tree
> + * 2. conditionally increment refcount
> + * 3. check the page is still in pagecache (if no, goto 1)
> + *
> + * Remove-side that cares about stability of _count (eg. reclaim) has the
^^^^^^^^^^^^^^^^^^^
is that the reason that the following two code paths are good without
change:
truncate_inode_page_range()
truncate_complete_page()
remove_from_page_cache()
radix_tree_delete()
and
zap_pte_range()
free_swap_and_cache() <-- does check page_count()
delete_from_swap_cache()
__delete_from_swap_cache()
radix_tree_delete()
From the comments around the truncate bit it seems to be ok with keeping
the page as anonymous, however the zap_pte_range() thing does seem to
want to have a stable page_count().
> + * following (with tree_lock held for write):
> + * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
> + * B. remove page from pagecache
> + * C. free the page
> + *
> + * There are 2 critical interleavings that matter:
> + * - 2 runs before A: in this case, A sees elevated refcount and bails out
> + * - A runs before 2: in this case, 2 sees zero refcount and retries;
> + * subsequently, B will complete and 1 will find no page, causing the
> + * lookup to return NULL.
> + *
> + * It is possible that between 1 and 2, the page is removed then the exact same
> + * page is inserted into the same position in pagecache. That's OK: the
> + * old find_get_page using tree_lock could equally have run before or after
> + * such a re-insertion, depending on order that locks are granted.
> + *
> + * Lookups racing against pagecache insertion isn't a big problem: either 1
> + * will find the page or it will not. Likewise, the old find_get_page could run
> + * either before the insertion or afterwards, depending on timing.
> + */
Awesome code ;-)
* Re: [patch 3/9] mm: speculative get page
2006-09-25 13:04 ` Peter Zijlstra
@ 2006-09-25 22:41 ` Nick Piggin
0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-09-25 22:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linux Memory Management
Peter Zijlstra wrote:
>On Mon, 2006-09-25 at 13:47 +0200, Nick Piggin wrote:
>
>
>>+/*
>>+ * speculatively take a reference to a page.
>>+ * If the page is free (_count == 0), then _count is untouched, and 0
>>+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
>>+ *
>>+ * This function must be run in the same rcu_read_lock() section as has
>>+ * been used to lookup the page in the pagecache radix-tree: this allows
>>+ * allocators to use a synchronize_rcu() to stabilize _count.
>>+ *
>>+ * Unless an RCU grace period has passed, the count of all pages coming out
>>+ * of the allocator must be considered unstable. page_count may return higher
>>+ * than expected, and put_page must be able to do the right thing when the
>>+ * page has been finished with (because put_page is what is used to drop an
>>+ * invalid speculative reference).
>>+ *
>>+ * This forms the core of the lockless pagecache locking protocol, where
>>+ * the lookup-side (eg. find_get_page) has the following pattern:
>>+ * 1. find page in radix tree
>>+ * 2. conditionally increment refcount
>>+ * 3. check the page is still in pagecache (if no, goto 1)
>>+ *
>>+ * Remove-side that cares about stability of _count (eg. reclaim) has the
>>
> ^^^^^^^^^^^^^^^^^^^
>is that the reason that the following two code paths are good without
>change:
>
> truncate_inode_page_range()
> truncate_complete_page()
> remove_from_page_cache()
> radix_tree_delete()
>
^^^ Yes.
>and
>
> zap_pte_range()
> free_swap_and_cache() <-- does check page_count()
> delete_from_swap_cache()
> __delete_from_swap_cache()
> radix_tree_delete()
>
>From the comments around the truncate bit it seems to be ok with keeping
>the page as anonymous, however the zap_pte_range() thing does seem to
>want to have a stable page_count().
>
However, when I last looked at it, the count can be elevated there for other
reasons (I think it was swap IO or get_user_pages or something). Anyway,
those pages will remain on the LRU and eventually get reclaimed.
I did initially change that code around a little bit, but I remember
working through it with Hugh and we decided that it would be OK as it was.
It should indeed be commented though.
>>+ * following (with tree_lock held for write):
>>+ * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
>>+ * B. remove page from pagecache
>>+ * C. free the page
>>+ *
>>+ * There are 2 critical interleavings that matter:
>>+ * - 2 runs before A: in this case, A sees elevated refcount and bails out
>>+ * - A runs before 2: in this case, 2 sees zero refcount and retries;
>>+ * subsequently, B will complete and 1 will find no page, causing the
>>+ * lookup to return NULL.
>>+ *
>>+ * It is possible that between 1 and 2, the page is removed then the exact same
>>+ * page is inserted into the same position in pagecache. That's OK: the
>>+ * old find_get_page using tree_lock could equally have run before or after
>>+ * such a re-insertion, depending on order that locks are granted.
>>+ *
>>+ * Lookups racing against pagecache insertion isn't a big problem: either 1
>>+ * will find the page or it will not. Likewise, the old find_get_page could run
>>+ * either before the insertion or afterwards, depending on timing.
>>+ */
>>
>
>Awesome code ;-)
>
Thanks :) Well, thank Hugh.
* [patch 4/9] mm: lockless pagecache lookups
2006-09-22 19:22 [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin
` (2 preceding siblings ...)
2006-09-22 19:22 ` [patch 3/9] mm: speculative get page Nick Piggin
@ 2006-09-22 19:22 ` Nick Piggin
2006-09-22 20:01 ` Lee Schermerhorn
2006-09-22 19:24 ` [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin
4 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2006-09-22 19:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, Linux Memory Management
Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on
the read-side).
The only atomicity change this introduces is that the gang pagecache
lookup functions now behave as if they are implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages.
In practice, this atomicity guarantee is not used anyway, and it is
difficult to see how it could be. Gang pagecache lookups are designed
to replace individual lookups, so these semantics are natural.
Swapcache can no longer use find_get_page, because it has a different
method of encoding swapcache position into the page. Introduce a new
find_get_swap_page for it.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -590,9 +590,9 @@ int probe_page(struct address_space *map
{
int exists;
- read_lock_irq(&mapping->tree_lock);
+ rcu_read_lock();
exists = __probe_page(mapping, offset);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_unlock();
return exists;
}
@@ -666,15 +666,31 @@ void fastcall __lock_page_nosync(struct
* Is there a pagecache struct page at the given (mapping, offset) tuple?
* If yes, increment its refcount and return it; if no, return NULL.
*/
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page *find_get_page(struct address_space *mapping, unsigned long offset)
{
+ void **pagep;
struct page *page;
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+repeat:
+ page = NULL;
+ pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
+ if (pagep) {
+ page = radix_tree_deref_slot(pagep);
+ if (unlikely(!page || page == RADIX_TREE_RETRY))
+ goto repeat;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *pagep)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+ }
+ rcu_read_unlock();
+
return page;
}
EXPORT_SYMBOL(find_get_page);
@@ -714,26 +730,19 @@ struct page *find_lock_page(struct addre
{
struct page *page;
- read_lock_irq(&mapping->tree_lock);
repeat:
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = find_get_page(mapping, offset);
if (page) {
- page_cache_get(page);
- if (TestSetPageLocked(page)) {
- read_unlock_irq(&mapping->tree_lock);
- __lock_page(page);
- read_lock_irq(&mapping->tree_lock);
-
- /* Has the page been truncated while we slept? */
- if (unlikely(page->mapping != mapping ||
- page->index != offset)) {
- unlock_page(page);
- page_cache_release(page);
- goto repeat;
- }
+ lock_page(page);
+ /* Has the page been truncated? */
+ if (unlikely(page->mapping != mapping
+ || page->index != offset)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
}
}
- read_unlock_irq(&mapping->tree_lock);
+
return page;
}
EXPORT_SYMBOL(find_lock_page);
@@ -803,13 +812,39 @@ unsigned find_get_pages(struct address_s
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, start, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
return ret;
}
EXPORT_SYMBOL(find_get_pages);
@@ -831,19 +866,44 @@ unsigned find_get_pages_contig(struct ad
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, index, nr_pages);
- for (i = 0; i < ret; i++) {
- if (pages[i]->mapping == NULL || pages[i]->index != index)
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, index, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (page->mapping == NULL || page->index != index)
break;
- page_cache_get(pages[i]);
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
index++;
}
- read_unlock_irq(&mapping->tree_lock);
- return i;
+ rcu_read_unlock();
+ return ret;
}
EXPORT_SYMBOL(find_get_pages_tag);
@@ -865,6 +925,7 @@ unsigned find_get_pages_tag(struct addre
unsigned int ret;
read_lock_irq(&mapping->tree_lock);
+ /* TODO: implement lookup_tag_slot and make this lockless */
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
for (i = 0; i < ret; i++)
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -429,21 +429,20 @@ __do_page_cache_readahead(struct address
/*
* Preallocate as many pages as we will need.
*/
- read_lock_irq(&mapping->tree_lock);
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
pgoff_t page_offset = offset + page_idx;
if (page_offset > end_index)
break;
+ rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
+ rcu_read_unlock();
if (page)
continue;
- read_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
cond_resched();
- read_lock_irq(&mapping->tree_lock);
if (!page)
break;
page->index = page_offset;
@@ -452,7 +451,6 @@ __do_page_cache_readahead(struct address
SetPageReadahead(page);
ret++;
}
- read_unlock_irq(&mapping->tree_lock);
/*
* Now start the IO. We ignore I/O errors - if the page is not
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -967,17 +967,15 @@ void writeback_congestion_end(void)
EXPORT_SYMBOL(writeback_congestion_end);
/*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
* passed tag.
*/
int mapping_tagged(struct address_space *mapping, int tag)
{
- unsigned long flags;
int ret;
-
- read_lock_irqsave(&mapping->tree_lock, flags);
+ rcu_read_lock();
ret = radix_tree_tagged(&mapping->page_tree, tag);
- read_unlock_irqrestore(&mapping->tree_lock, flags);
+ rcu_read_unlock();
return ret;
}
EXPORT_SYMBOL(mapping_tagged);
* Re: [patch 4/9] mm: lockless pagecache lookups
2006-09-22 19:22 ` [patch 4/9] mm: lockless pagecache lookups Nick Piggin
@ 2006-09-22 20:01 ` Lee Schermerhorn
2006-09-23 2:35 ` Nick Piggin
0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2006-09-22 20:01 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management
On Fri, 2006-09-22 at 21:22 +0200, Nick Piggin wrote:
> Combine page_cache_get_speculative with lockless radix tree lookups to
> introduce lockless page cache lookups (ie. no mapping->tree_lock on
> the read-side).
>
> The only atomicity change this introduces is that the gang pagecache
> lookup functions now behave as if they are implemented with multiple
> find_get_page calls, rather than operating on a snapshot of the pages.
> In practice, this atomicity guarantee is not used anyway, and it is
> difficult to see how it could be. Gang pagecache lookups are designed
> to replace individual lookups, so these semantics are natural.
>
vvv - stale comment?
> Swapcache can no longer use find_get_page, because it has a different
> method of encoding swapcache position into the page. Introduce a new
> find_get_swap_page for it.
^^^
>
<snip>
* Re: [patch 4/9] mm: lockless pagecache lookups
2006-09-22 20:01 ` Lee Schermerhorn
@ 2006-09-23 2:35 ` Nick Piggin
0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-09-23 2:35 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andrew Morton, Linux Memory Management
On Fri, Sep 22, 2006 at 04:01:11PM -0400, Lee Schermerhorn wrote:
> On Fri, 2006-09-22 at 21:22 +0200, Nick Piggin wrote:
> > Combine page_cache_get_speculative with lockless radix tree lookups to
> > introduce lockless page cache lookups (ie. no mapping->tree_lock on
> > the read-side).
> >
> > The only atomicity change this introduces is that the gang pagecache
> > lookup functions now behave as if they are implemented with multiple
> > find_get_page calls, rather than operating on a snapshot of the pages.
> > In practice, this atomicity guarantee is not used anyway, and it is
> > difficult to see how it could be. Gang pagecache lookups are designed
> > to replace individual lookups, so these semantics are natural.
> >
>
> vvv - stale comment?
Yep, thanks
* Re: [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1
2006-09-22 19:22 [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin
` (3 preceding siblings ...)
2006-09-22 19:22 ` [patch 4/9] mm: lockless pagecache lookups Nick Piggin
@ 2006-09-22 19:24 ` Nick Piggin
4 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-09-22 19:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Memory Management
On Fri, Sep 22, 2006 at 09:22:10PM +0200, Nick Piggin wrote:
> Hi,
Arrgh, script went wrong :( Should be:
[patch 1/4] radix-tree: use indirect bit
[patch 2/4] radix-tree: gang_lookup_slot
[patch 3/4] mm: speculative get page
[patch 4/4] mm: lockless pagecache lookups