linux-mm.kvack.org archive mirror
* [patch 0/9] lockless pagecache for 2.6.21-rc6
From: Nick Piggin @ 2007-04-12 12:44 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

This is my updated lockless pagecache patchset against 2.6.21-rc6.

I'm going to submit this to Andrew after 2.6.21 is released, so any
reviews, comments, or tests would be welcome at this point.

Nick


* [patch 1/9] mm: prep find_lock_page
From: Nick Piggin @ 2007-04-12 12:44 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

find_lock_page does not need to recheck ->index, because if the page is
in the right mapping then the index must be the same. Also, tree_lock
does not need to be retaken after the page is locked in order to check
that ->mapping has not changed.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -620,26 +620,26 @@ struct page *find_lock_page(struct addre
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
 repeat:
+	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
 			__lock_page(page);
-			read_lock_irq(&mapping->tree_lock);
 
 			/* Has the page been truncated while we slept? */
-			if (unlikely(page->mapping != mapping ||
-				     page->index != offset)) {
+			if (unlikely(page->mapping != mapping)) {
 				unlock_page(page);
 				page_cache_release(page);
 				goto repeat;
 			}
+			goto out;
 		}
 	}
 	read_unlock_irq(&mapping->tree_lock);
+out:
 	return page;
 }
 EXPORT_SYMBOL(find_lock_page);


* [patch 2/9] radix-tree: use indirect bit
From: Nick Piggin @ 2007-04-12 12:45 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

Rather than sign direct radix-tree pointers with a special bit, sign
the indirect one that hangs off the root. This means that, given a
lookup_slot operation, an invalid result can be distinguished from a
valid one (previously, valid results could have the bit either set or
clear).

This does not affect slot lookups which occur under lock -- they can
never return an invalid result. The distinction is needed later by the
lockless pagecache.
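
For illustration, a minimal sketch (not part of the patch) of how an
RCU lookup_slot caller is expected to handle the new invalid result;
find_get_page is converted to exactly this pattern later in the series:

	rcu_read_lock();
repeat:
	slot = radix_tree_lookup_slot(&mapping->page_tree, offset);
	if (slot) {
		page = radix_tree_deref_slot(slot);
		if (unlikely(page == RADIX_TREE_RETRY))
			goto repeat;	/* raced with a tree reshape */
	}
	rcu_read_unlock();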

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -26,28 +26,31 @@
 #include <linux/rcupdate.h>
 
 /*
- * A direct pointer (root->rnode pointing directly to a data item,
- * rather than another radix_tree_node) is signalled by the low bit
- * set in the root->rnode pointer.
- *
- * In this case root->height is also NULL, but the direct pointer tests are
- * needed for RCU lookups when root->height is unreliable.
+ * An indirect pointer (root->rnode pointing to a radix_tree_node, rather
+ * than a data item) is signalled by the low bit set in the root->rnode
+ * pointer.
+ *
+ * In this case root->height is > 0, but the indirect pointer tests are
+ * needed for RCU lookups (because root->height is unreliable). The only
+ * time callers need worry about this is when doing a lookup_slot under
+ * RCU.
  */
-#define RADIX_TREE_DIRECT_PTR	1
+#define RADIX_TREE_INDIRECT_PTR	1
+#define RADIX_TREE_RETRY ((void *)-1UL)
 
-static inline void *radix_tree_ptr_to_direct(void *ptr)
+static inline void *radix_tree_ptr_to_indirect(void *ptr)
 {
-	return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
+	return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
 }
 
-static inline void *radix_tree_direct_to_ptr(void *ptr)
+static inline void *radix_tree_indirect_to_ptr(void *ptr)
 {
-	return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
+	return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
 }
 
-static inline int radix_tree_is_direct_ptr(void *ptr)
+static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
-	return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
+	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
 }
 
 /*** radix-tree API starts here ***/
@@ -130,7 +133,10 @@ do {									\
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	return radix_tree_direct_to_ptr(*pslot);
+	void *ret = *pslot;
+	if (unlikely(radix_tree_is_indirect_ptr(ret)))
+		ret = RADIX_TREE_RETRY;
+	return ret;
 }
 /**
  * radix_tree_replace_slot	- replace item in a slot
@@ -142,10 +148,8 @@ static inline void *radix_tree_deref_slo
  */
 static inline void radix_tree_replace_slot(void **pslot, void *item)
 {
-	BUG_ON(radix_tree_is_direct_ptr(item));
-	rcu_assign_pointer(*pslot,
-		(void *)((unsigned long)item |
-			((unsigned long)*pslot & RADIX_TREE_DIRECT_PTR)));
+	BUG_ON(radix_tree_is_indirect_ptr(item));
+	rcu_assign_pointer(*pslot, item);
 }
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -104,7 +104,7 @@ radix_tree_node_alloc(struct radix_tree_
 			rtp->nr--;
 		}
 	}
-	BUG_ON(radix_tree_is_direct_ptr(ret));
+	BUG_ON(radix_tree_is_indirect_ptr(ret));
 	return ret;
 }
 
@@ -239,7 +239,7 @@ static int radix_tree_extend(struct radi
 			return -ENOMEM;
 
 		/* Increase the height.  */
-		node->slots[0] = radix_tree_direct_to_ptr(root->rnode);
+		node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
 
 		/* Propagate the aggregated tag info into the new root */
 		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
@@ -250,6 +250,7 @@ static int radix_tree_extend(struct radi
 		newheight = root->height+1;
 		node->height = newheight;
 		node->count = 1;
+		node = radix_tree_ptr_to_indirect(node);
 		rcu_assign_pointer(root->rnode, node);
 		root->height = newheight;
 	} while (height > root->height);
@@ -273,7 +274,7 @@ int radix_tree_insert(struct radix_tree_
 	int offset;
 	int error;
 
-	BUG_ON(radix_tree_is_direct_ptr(item));
+	BUG_ON(radix_tree_is_indirect_ptr(item));
 
 	/* Make sure the tree is high enough.  */
 	if (index > radix_tree_maxindex(root->height)) {
@@ -282,7 +283,8 @@ int radix_tree_insert(struct radix_tree_
 			return error;
 	}
 
-	slot = root->rnode;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
+
 	height = root->height;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
@@ -297,7 +299,8 @@ int radix_tree_insert(struct radix_tree_
 				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
 			} else
-				rcu_assign_pointer(root->rnode, slot);
+				rcu_assign_pointer(root->rnode,
+					radix_tree_ptr_to_indirect(slot));
 		}
 
 		/* Go a level down */
@@ -317,7 +320,7 @@ int radix_tree_insert(struct radix_tree_
 		BUG_ON(tag_get(node, 0, offset));
 		BUG_ON(tag_get(node, 1, offset));
 	} else {
-		rcu_assign_pointer(root->rnode, radix_tree_ptr_to_direct(item));
+		rcu_assign_pointer(root->rnode, item);
 		BUG_ON(root_tag_get(root, 0));
 		BUG_ON(root_tag_get(root, 1));
 	}
@@ -349,11 +352,12 @@ void **radix_tree_lookup_slot(struct rad
 	if (node == NULL)
 		return NULL;
 
-	if (radix_tree_is_direct_ptr(node)) {
+	if (!radix_tree_is_indirect_ptr(node)) {
 		if (index > 0)
 			return NULL;
 		return (void **)&root->rnode;
 	}
+	node = radix_tree_indirect_to_ptr(node);
 
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
@@ -397,11 +401,12 @@ void *radix_tree_lookup(struct radix_tre
 	if (node == NULL)
 		return NULL;
 
-	if (radix_tree_is_direct_ptr(node)) {
+	if (!radix_tree_is_indirect_ptr(node)) {
 		if (index > 0)
 			return NULL;
-		return radix_tree_direct_to_ptr(node);
+		return node;
 	}
+	node = radix_tree_indirect_to_ptr(node);
 
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
@@ -446,7 +451,7 @@ void *radix_tree_tag_set(struct radix_tr
 	height = root->height;
 	BUG_ON(index > radix_tree_maxindex(height));
 
-	slot = root->rnode;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 
 	while (height > 0) {
@@ -496,7 +501,7 @@ void *radix_tree_tag_clear(struct radix_
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	pathp->node = NULL;
-	slot = root->rnode;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
 
 	while (height > 0) {
 		int offset;
@@ -561,8 +566,9 @@ int radix_tree_tag_get(struct radix_tree
 	if (node == NULL)
 		return 0;
 
-	if (radix_tree_is_direct_ptr(node))
+	if (!radix_tree_is_indirect_ptr(node))
 		return (index == 0);
+	node = radix_tree_indirect_to_ptr(node);
 
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
@@ -679,13 +685,13 @@ radix_tree_gang_lookup(struct radix_tree
 	if (!node)
 		return 0;
 
-	if (radix_tree_is_direct_ptr(node)) {
+	if (!radix_tree_is_indirect_ptr(node)) {
 		if (first_index > 0)
 			return 0;
-		node = radix_tree_direct_to_ptr(node);
-		results[0] = rcu_dereference(node);
+		results[0] = node;
 		return 1;
 	}
+	node = radix_tree_indirect_to_ptr(node);
 
 	max_index = radix_tree_maxindex(node->height);
 
@@ -807,13 +813,13 @@ radix_tree_gang_lookup_tag(struct radix_
 	if (!node)
 		return 0;
 
-	if (radix_tree_is_direct_ptr(node)) {
+	if (!radix_tree_is_indirect_ptr(node)) {
 		if (first_index > 0)
 			return 0;
-		node = radix_tree_direct_to_ptr(node);
-		results[0] = rcu_dereference(node);
+		results[0] = node;
 		return 1;
 	}
+	node = radix_tree_indirect_to_ptr(node);
 
 	max_index = radix_tree_maxindex(node->height);
 
@@ -843,12 +849,22 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag
 static inline void radix_tree_shrink(struct radix_tree_root *root)
 {
 	/* try to shrink tree height */
-	while (root->height > 0 &&
-			root->rnode->count == 1 &&
-			root->rnode->slots[0]) {
+	while (root->height > 0) {
 		struct radix_tree_node *to_free = root->rnode;
 		void *newptr;
 
+		BUG_ON(!radix_tree_is_indirect_ptr(to_free));
+		to_free = radix_tree_indirect_to_ptr(to_free);
+
+		/*
+		 * If the candidate node has more than one child, or its child
+		 * is not at the leftmost slot, we cannot shrink.
+		 */
+		if (to_free->count != 1)
+			break;
+		if (!to_free->slots[0])
+			break;
+
 		/*
 		 * We don't need rcu_assign_pointer(), since we are simply
 		 * moving the node from one part of the tree to another. If
@@ -857,8 +873,8 @@ static inline void radix_tree_shrink(str
 		 * one (root->rnode).
 		 */
 		newptr = to_free->slots[0];
-		if (root->height == 1)
-			newptr = radix_tree_ptr_to_direct(newptr);
+		if (root->height > 1)
+			newptr = radix_tree_ptr_to_indirect(newptr);
 		root->rnode = newptr;
 		root->height--;
 		/* must only free zeroed nodes into the slab */
@@ -893,12 +909,12 @@ void *radix_tree_delete(struct radix_tre
 		goto out;
 
 	slot = root->rnode;
-	if (height == 0 && root->rnode) {
-		slot = radix_tree_direct_to_ptr(slot);
+	if (height == 0 /* XXX: bugfix? */) {
 		root_tag_clear_all(root);
 		root->rnode = NULL;
 		goto out;
 	}
+	slot = radix_tree_indirect_to_ptr(slot);
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	pathp->node = NULL;
@@ -940,7 +956,8 @@ void *radix_tree_delete(struct radix_tre
 			radix_tree_node_free(to_free);
 
 		if (pathp->node->count) {
-			if (pathp->node == root->rnode)
+			if (pathp->node ==
+					radix_tree_indirect_to_ptr(root->rnode))
 				radix_tree_shrink(root);
 			goto out;
 		}


* [patch 3/9] radix-tree: gang slot lookups
From: Nick Piggin @ 2007-04-12 12:45 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

Introduce radix_tree_gang_lookup_slot() and radix_tree_gang_lookup_tag_slot(),
which are used by the lockless pagecache.
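
As a usage sketch (hypothetical caller, not part of the patch), a gang
slot lookup under RCU starting at a given index looks like this; the
speculative reference and recheck that belong in the loop arrive with
patches 6 and 7:

	void **slots[16];
	struct page *page;
	unsigned int i, nr;

	rcu_read_lock();
restart:
	nr = radix_tree_gang_lookup_slot(&mapping->page_tree, slots,
					 index, 16);
	for (i = 0; i < nr; i++) {
		page = radix_tree_deref_slot(slots[i]);
		if (!page)
			continue;	/* slot was emptied under us */
		if (page == RADIX_TREE_RETRY)
			goto restart;	/* tree was reshaped under us */
		/* ... speculative get, then recheck *slots[i] ... */
	}
	rcu_read_unlock();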

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -99,12 +99,15 @@ do {									\
  *
  * The notable exceptions to this rule are the following functions:
  * radix_tree_lookup
+ * radix_tree_lookup_slot
  * radix_tree_tag_get
  * radix_tree_gang_lookup
+ * radix_tree_gang_lookup_slot
  * radix_tree_gang_lookup_tag
+ * radix_tree_gang_lookup_tag_slot
  * radix_tree_tagged
  *
- * The first 4 functions are able to be called locklessly, using RCU. The
+ * The first 7 functions are able to be called locklessly, using RCU. The
  * caller must ensure calls to these functions are made within rcu_read_lock()
  * regions. Other readers (lock-free or otherwise) and modifications may be
  * running concurrently.
@@ -159,6 +162,9 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
@@ -171,6 +177,10 @@ unsigned int
 radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+		unsigned long first_index, unsigned int max_items,
+		unsigned int tag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
 
 static inline void radix_tree_preload_end(void)
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -337,18 +337,17 @@ EXPORT_SYMBOL(radix_tree_insert);
  *	Returns:  the slot corresponding to the position @index in the
  *	radix tree @root. This is useful for update-if-exists operations.
  *
- *	This function cannot be called under rcu_read_lock, it must be
- *	excluded from writers, as must the returned slot for subsequent
- *	use by radix_tree_deref_slot() and radix_tree_replace slot.
- *	Caller must hold tree write locked across slot lookup and
- *	replace.
+ *	This function can be called under rcu_read_lock iff the slot is not
+ *	modified by radix_tree_replace_slot, otherwise it must be called
+ *	exclusive from other writers. Any dereference of the slot must be done
+ *	using radix_tree_deref_slot.
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
 
-	node = root->rnode;
+	node = rcu_dereference(root->rnode);
 	if (node == NULL)
 		return NULL;
 
@@ -368,7 +367,7 @@ void **radix_tree_lookup_slot(struct rad
 	do {
 		slot = (struct radix_tree_node **)
 			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
-		node = *slot;
+		node = rcu_dereference(*slot);
 		if (node == NULL)
 			return NULL;
 
@@ -605,7 +604,7 @@ EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
 static unsigned int
-__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)
 {
 	unsigned int nr_found = 0;
@@ -639,11 +638,9 @@ __lookup(struct radix_tree_node *slot, v
 
 	/* Bottom level: grab some items */
 	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
-		struct radix_tree_node *node;
 		index++;
-		node = slot->slots[i];
-		if (node) {
-			results[nr_found++] = rcu_dereference(node);
+		if (slot->slots[i]) {
+			results[nr_found++] = &(slot->slots[i]);
 			if (nr_found == max_items)
 				goto out;
 		}
@@ -697,13 +694,22 @@ radix_tree_gang_lookup(struct radix_tree
 
 	ret = 0;
 	while (ret < max_items) {
-		unsigned int nr_found;
+		unsigned int nr_found, slots_found, i;
 		unsigned long next_index;	/* Index of next search */
 
 		if (cur_index > max_index)
 			break;
-		nr_found = __lookup(node, results + ret, cur_index,
+		slots_found = __lookup(node, (void ***)results + ret, cur_index,
 					max_items - ret, &next_index);
+		nr_found = 0;
+		for (i = 0; i < slots_found; i++) {
+			struct radix_tree_node *slot;
+			slot = *(((void ***)results)[ret + i]);
+			if (!slot)
+				continue;
+			results[ret + nr_found] = rcu_dereference(slot);
+			nr_found++;
+		}
 		ret += nr_found;
 		if (next_index == 0)
 			break;
@@ -714,12 +720,71 @@ radix_tree_gang_lookup(struct radix_tree
 }
 EXPORT_SYMBOL(radix_tree_gang_lookup);
 
+/**
+ *	radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree
+ *	@root:		radix tree root
+ *	@results:	where the results of the lookup are placed
+ *	@first_index:	start the lookup from this key
+ *	@max_items:	place up to this many items at *results
+ *
+ *	Performs an index-ascending scan of the tree for present items.  Places
+ *	their slots at *@results and returns the number of items which were
+ *	placed at *@results.
+ *
+ *	The implementation is naive.
+ *
+ *	Like radix_tree_gang_lookup as far as RCU and locking goes. Slots must
+ *	be dereferenced with radix_tree_deref_slot, and if using only RCU
+ *	protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items)
+{
+	unsigned long max_index;
+	struct radix_tree_node *node;
+	unsigned long cur_index = first_index;
+	unsigned int ret;
+
+	node = rcu_dereference(root->rnode);
+	if (!node)
+		return 0;
+
+	if (!radix_tree_is_indirect_ptr(node)) {
+		if (first_index > 0)
+			return 0;
+		results[0] = (void **)&root->rnode;
+		return 1;
+	}
+	node = radix_tree_indirect_to_ptr(node);
+
+	max_index = radix_tree_maxindex(node->height);
+
+	ret = 0;
+	while (ret < max_items) {
+		unsigned int slots_found;
+		unsigned long next_index;	/* Index of next search */
+
+		if (cur_index > max_index)
+			break;
+		slots_found = __lookup(node, results + ret, cur_index,
+					max_items - ret, &next_index);
+		ret += slots_found;
+		if (next_index == 0)
+			break;
+		cur_index = next_index;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
 /*
  * FIXME: the two tag_get()s here should use find_next_bit() instead of
  * open-coding the search.
  */
 static unsigned int
-__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index, unsigned int tag)
 {
 	unsigned int nr_found = 0;
@@ -764,9 +829,8 @@ __lookup_tag(struct radix_tree_node *slo
 				 * lookup ->slots[x] without a lock (ie. can't
 				 * rely on its value remaining the same).
 				 */
-				if (node) {
-					node = rcu_dereference(node);
-					results[nr_found++] = node;
+				if (slot->slots[j]) {
+					results[nr_found++] = &slot->slots[j];
 					if (nr_found == max_items)
 						goto out;
 				}
@@ -825,13 +889,21 @@ radix_tree_gang_lookup_tag(struct radix_
 
 	ret = 0;
 	while (ret < max_items) {
-		unsigned int nr_found;
+		unsigned int slots_found, nr_found, i;
 		unsigned long next_index;	/* Index of next search */
 
 		if (cur_index > max_index)
 			break;
-		nr_found = __lookup_tag(node, results + ret, cur_index,
-					max_items - ret, &next_index, tag);
+		slots_found = __lookup_tag(node, (void ***)results + ret,
+				cur_index, max_items - ret, &next_index, tag);
+		nr_found = 0;
+		for (i = 0; i < slots_found; i++) {
+			node = *((void ***)results)[ret + i];
+			if (!node)
+				continue;
+			results[ret + nr_found] = rcu_dereference(node);
+			nr_found++;
+		}
 		ret += nr_found;
 		if (next_index == 0)
 			break;
@@ -843,6 +915,67 @@ radix_tree_gang_lookup_tag(struct radix_
 EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
 
 /**
+ *	radix_tree_gang_lookup_tag_slot - perform multiple slot lookup on a
+ *					  radix tree based on a tag
+ *	@root:		radix tree root
+ *	@results:	where the results of the lookup are placed
+ *	@first_index:	start the lookup from this key
+ *	@max_items:	place up to this many items at *results
+ *	@tag:		the tag index (< RADIX_TREE_MAX_TAGS)
+ *
+ *	Performs an index-ascending scan of the tree for present items which
+ *	have the tag indexed by @tag set.  Places the slots at *@results and
+ *	returns the number of slots which were placed at *@results.
+ */
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+		unsigned long first_index, unsigned int max_items,
+		unsigned int tag)
+{
+	struct radix_tree_node *node;
+	unsigned long max_index;
+	unsigned long cur_index = first_index;
+	unsigned int ret;
+
+	/* check the root's tag bit */
+	if (!root_tag_get(root, tag))
+		return 0;
+
+	node = rcu_dereference(root->rnode);
+	if (!node)
+		return 0;
+
+	if (!radix_tree_is_indirect_ptr(node)) {
+		if (first_index > 0)
+			return 0;
+		results[0] = (void **)&root->rnode;
+		return 1;
+	}
+	node = radix_tree_indirect_to_ptr(node);
+
+	max_index = radix_tree_maxindex(node->height);
+
+	ret = 0;
+	while (ret < max_items) {
+		unsigned int slots_found;
+		unsigned long next_index;	/* Index of next search */
+
+		if (cur_index > max_index)
+			break;
+		slots_found = __lookup_tag(node, results + ret,
+				cur_index, max_items - ret, &next_index, tag);
+		ret += slots_found;
+		if (next_index == 0)
+			break;
+		cur_index = next_index;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
+
+
+/**
  *	radix_tree_shrink    -    shrink height of a radix tree to minimal
  *	@root		radix tree root
  */


* [patch 4/9] mm: __add_to_swap_cache stuff
From: Nick Piggin @ 2007-04-12 12:45 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

__add_to_swap_cache unconditionally sets the page locked. Instead,
require that the page is already locked on entry (holding the page lock
is the usual invariant for manipulating swapcache), and assert that
with a BUG_ON.
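
The resulting caller contract, simplified from read_swap_cache_async
below (sketch only; entry and vma come from the caller, error paths
are elided):

	new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
	SetPageLocked(new_page);	/* page not yet visible to others */
	if (add_to_swap_cache(new_page, entry)) {
		/* insertion failed: unlock before dropping the reference */
		ClearPageLocked(new_page);
		page_cache_release(new_page);
	}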

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -74,6 +74,7 @@ static int __add_to_swap_cache(struct pa
 {
 	int error;
 
+	BUG_ON(!PageLocked(page));
 	BUG_ON(PageSwapCache(page));
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
@@ -83,7 +84,6 @@ static int __add_to_swap_cache(struct pa
 						entry.val, page);
 		if (!error) {
 			page_cache_get(page);
-			SetPageLocked(page);
 			SetPageSwapCache(page);
 			set_page_private(page, entry.val);
 			total_swapcache_pages++;
@@ -337,6 +337,7 @@ struct page *read_swap_cache_async(swp_e
 			new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
+			SetPageLocked(new_page);	/* could be non-atomic op */
 		}
 
 		/*
@@ -360,7 +361,9 @@ struct page *read_swap_cache_async(swp_e
 		}
 	} while (err != -ENOENT && err != -ENOMEM);
 
-	if (new_page)
+	if (new_page) {
+		ClearPageLocked(new_page);
 		page_cache_release(new_page);
+	}
 	return found_page;
 }


* [patch 5/9] mm: lockless probe
From: Nick Piggin @ 2007-04-12 12:45 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

Probing pages and radix_tree_tagged are lockless operations with the
lockless radix-tree. Convert these users from taking tree_lock to RCU
read locking.
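
Each conversion replaces the tree_lock'd probe with the following
pattern (sketch; without a page reference the result is only a hint,
because the page can be truncated at any moment):

	rcu_read_lock();
	page = radix_tree_lookup(&mapping->page_tree, index);
	rcu_read_unlock();
	if (!page) {
		/* probably not cached: allocate and read it in */
	}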

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux-2.6.orig/drivers/mtd/devices/block2mtd.c
+++ linux-2.6/drivers/mtd/devices/block2mtd.c
@@ -59,28 +59,27 @@ static void cache_readahead(struct addre
 
 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
-	read_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < PAGE_READAHEAD; i++) {
 		pagei = index + i;
 		if (pagei > end_index) {
 			INFO("Overrun end of disk in cache readahead\n");
 			break;
 		}
+		/* Don't need mapping->tree_lock - lookup can be racy */
+		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, pagei);
+		rcu_read_unlock();
 		if (page && (!i))
 			break;
 		if (page)
 			continue;
-		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = pagei;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
 	if (ret)
 		read_cache_pages(mapping, &page_pool, filler, NULL);
 }
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -965,17 +965,15 @@ int test_set_page_writeback(struct page 
 EXPORT_SYMBOL(test_set_page_writeback);
 
 /*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
  * passed tag.
  */
 int mapping_tagged(struct address_space *mapping, int tag)
 {
-	unsigned long flags;
 	int ret;
-
-	read_lock_irqsave(&mapping->tree_lock, flags);
+	rcu_read_lock();
 	ret = radix_tree_tagged(&mapping->page_tree, tag);
-	read_unlock_irqrestore(&mapping->tree_lock, flags);
+	rcu_read_unlock();
 	return ret;
 }
 EXPORT_SYMBOL(mapping_tagged);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -281,27 +281,25 @@ __do_page_cache_readahead(struct address
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	read_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
 		
 		if (page_offset > end_index)
 			break;
 
+		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
+		rcu_read_unlock();
 		if (page)
 			continue;
 
-		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not


* [patch 6/9] mm: speculative get page
From: Nick Piggin @ 2007-04-12 12:45 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.

This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).

Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.

This patch introduces the core locking protocol to the pagecache
(ie. adds page_cache_get_speculative, and tweaks some update-side
code to make it work).


[Hugh notices that PG_nonewrefs might be dispensed with entirely if the
 current set_page_nonewrefs instead atomically saved the page count and
 temporarily set it to zero.

 This is a nice idea, and simplifies find_get_page very much, but it
 cannot be applied to all current SetPageNoNewRefs sites. Need to
 verify that add_to_page_cache and add_to_swap_cache can cope without
 it, or make do some other way. Also, a migration page with PagePrivate
 set means that the filesystem has a ref on the page, so it might muck
 with the page count, which is a big problem.  ]
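
Condensed, the remove-side ordering looks as follows (sketch only;
labels A-D match the protocol comment added to pagemap.h below, and
the surrounding reclaim details are elided):

	set_page_nonewrefs(page);		/* A */
	write_lock_irq(&mapping->tree_lock);
	if (page_count(page) != 2) {		/* B: ours + pagecache's? */
		write_unlock_irq(&mapping->tree_lock);
		clear_page_nonewrefs(page);
		return 0;			/* raced: cannot free */
	}
	__remove_from_page_cache(page);		/* C */
	write_unlock_irq(&mapping->tree_lock);
	clear_page_nonewrefs(page);		/* D */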

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -91,6 +91,9 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_nonewrefs		20	/* Block concurrent pagecache lookups
+					 * while testing refcount */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 
@@ -253,6 +256,11 @@ static inline void SetPageUptodate(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageNoNewRefs(page)	test_bit(PG_nonewrefs, &(page)->flags)
+#define SetPageNoNewRefs(page)	set_bit(PG_nonewrefs, &(page)->flags)
+#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
+#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
@@ -265,4 +273,25 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
+static inline void set_page_nonewrefs(struct page *page)
+{
+	preempt_disable();
+	SetPageNoNewRefs(page);
+	smp_wmb();
+}
+
+static inline void __clear_page_nonewrefs(struct page *page)
+{
+	smp_wmb();
+	__ClearPageNoNewRefs(page);
+	preempt_enable();
+}
+
+static inline void clear_page_nonewrefs(struct page *page)
+{
+	smp_wmb();
+	ClearPageNoNewRefs(page);
+	preempt_enable();
+}
+
 #endif	/* PAGE_FLAGS_H */
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -11,6 +11,8 @@
 #include <linux/compiler.h>
 #include <asm/uaccess.h>
 #include <linux/gfp.h>
+#include <linux/page-flags.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
 
 /*
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +53,109 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be run in the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree: this allows
+ * allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with (because put_page is what is used to drop an
+ * invalid speculative reference).
+ *
+ * After incrementing the refcount, this function spins until PageNoNewRefs
+ * is clear, then a read memory barrier is issued.
+ *
+ * This forms the core of the lockless pagecache locking protocol, where
+ * the lookup-side (eg. find_get_page) has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. wait for PageNoNewRefs
+ * 4. check the page is still in pagecache
+ *
+ * Remove-side (that cares about _count, eg. reclaim) has the following:
+ * A. SetPageNoNewRefs
+ * B. check refcount is correct
+ * C. remove page
+ * D. ClearPageNoNewRefs
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before B: in this case, B sees elevated refcount and bails out
+ * - B runs before 2: in this case, 3 ensures 4 will not run until *after* C
+ *   (after D, even). In which case, 4 will notice C and lookup side can retry
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * the write-side, depending on timing.
+ *
+ * Pagecache insertion isn't a big problem: either 1 will find the page or
+ * it will not. Likewise, the old find_get_page could run either before the
+ * insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+	VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+	VM_BUG_ON(!in_atomic());
+# endif
+	/*
+	 * Preempt must be disabled here - we rely on rcu_read_lock doing
+	 * this for us.
+	 *
+	 * Pagecache won't be truncated from interrupt context, so if we have
+	 * found a page in the radix tree here, we have pinned its refcount by
+	 * disabling preempt, and hence no need for the "speculative get" that
+	 * SMP requires.
+	 */
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_inc(&page->_count);
+
+#else
+	if (unlikely(!get_page_unless_zero(page)))
+		return 0; /* page has been freed */
+
+	/*
+	 * Note that get_page_unless_zero provides a memory barrier.
+	 * This is needed to ensure PageNoNewRefs is evaluated after the
+	 * page refcount has been raised. See below comment.
+	 */
+
+	while (unlikely(PageNoNewRefs(page)))
+		cpu_relax();
+
+	/*
+	 * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
+	 * is performed before a future load used to ensure the page is
+	 * the correct one (usually: page->mapping and page->index).
+	 *
+	 * Those places that set PageNoNewRefs have the following pattern:
+	 * 	SetPageNoNewRefs(page)
+	 * 	wmb();
+	 * 	if (page_count(page) == X)
+	 * 		remove page from pagecache
+	 * 	wmb();
+	 * 	ClearPageNoNewRefs(page)
+	 *
+	 * If the load was out of order, page->mapping might be loaded before
+	 * the page is removed from pagecache but PageNoNewRefs evaluated
+	 * after the ClearPageNoNewRefs().
+	 */
+	smp_rmb();
+
+#endif
+	VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+	return 1;
+}
+
 #ifdef CONFIG_NUMA
 extern struct page *__page_cache_alloc(gfp_t gfp);
 #else
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -390,6 +390,7 @@ int remove_mapping(struct address_space 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
+	set_page_nonewrefs(page);
 	write_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
@@ -427,17 +428,20 @@ int remove_mapping(struct address_space 
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
-		__put_page(page);	/* The pagecache ref */
-		return 1;
+		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
-	__put_page(page);
+
+free_it:
+	__clear_page_nonewrefs(page);
+	__put_page(page); /* The pagecache ref */
 	return 1;
 
 cannot_free:
 	write_unlock_irq(&mapping->tree_lock);
+	clear_page_nonewrefs(page);
 	return 0;
 }
 
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -440,6 +440,7 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		set_page_nonewrefs(page);
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
@@ -451,6 +452,7 @@ int add_to_page_cache(struct page *page,
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&mapping->tree_lock);
+		clear_page_nonewrefs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -79,6 +79,7 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		set_page_nonewrefs(page);
 		write_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
@@ -90,6 +91,7 @@ static int __add_to_swap_cache(struct pa
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
+		clear_page_nonewrefs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -303,6 +303,7 @@ static int migrate_page_move_mapping(str
 		return 0;
 	}
 
+	set_page_nonewrefs(page);
 	write_lock_irq(&mapping->tree_lock);
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
@@ -311,6 +312,7 @@ static int migrate_page_move_mapping(str
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
 		write_unlock_irq(&mapping->tree_lock);
+		clear_page_nonewrefs(page);
 		return -EAGAIN;
 	}
 
@@ -326,6 +328,9 @@ static int migrate_page_move_mapping(str
 #endif
 
 	radix_tree_replace_slot(pslot, newpage);
+	page->mapping = NULL;
+	write_unlock_irq(&mapping->tree_lock);
+	clear_page_nonewrefs(page);
 
 	/*
 	 * Drop cache reference from old page.
@@ -333,8 +338,6 @@ static int migrate_page_move_mapping(str
 	 */
 	__put_page(page);
 
-	write_unlock_irq(&mapping->tree_lock);
-
 	return 0;
 }
 


* [patch 7/9] mm: lockless pagecache lookups
From: Nick Piggin @ 2007-04-12 12:45 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on
the read-side).

The only atomicity change this introduces is that the gang pagecache
lookup functions now behave as if they are implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages.
In practice, this atomicity guarantee is not used anyway, and it is
difficult to see how it could be. Gang pagecache lookups are designed
to replace individual lookups, so these semantics are natural.
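
Each converted lookup repeats the same core pattern, sketched here as
a hypothetical helper (get_page_from_slot is not in the patch), given
a slot obtained under rcu_read_lock():

	static struct page *get_page_from_slot(void **slot)
	{
		struct page *page;
	repeat:
		page = radix_tree_deref_slot(slot);
		if (!page || page == RADIX_TREE_RETRY)
			return NULL;		/* caller skips or restarts */
		if (!page_cache_get_speculative(page))
			goto repeat;		/* page was being freed */
		if (unlikely(page != *slot)) {	/* has the page moved? */
			page_cache_release(page);
			goto repeat;
		}
		return page;
	}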

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -594,15 +594,35 @@ void fastcall __lock_page_nosync(struct 
  * Is there a pagecache struct page at the given (mapping, offset) tuple?
  * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page *find_get_page(struct address_space *mapping, unsigned long offset)
 {
+	void **pagep;
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
-	read_unlock_irq(&mapping->tree_lock);
+	rcu_read_lock();
+repeat:
+	page = NULL;
+	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
+	if (pagep) {
+		page = radix_tree_deref_slot(pagep);
+		if (unlikely(!page || page == RADIX_TREE_RETRY))
+			goto repeat;
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/*
+		 * Has the page moved?
+		 * This is part of the lockless pagecache protocol. See
+		 * include/linux/pagemap.h for details.
+		 */
+		if (unlikely(page != *pagep)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+	}
+	rcu_read_unlock();
+
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
@@ -623,25 +643,16 @@ struct page *find_lock_page(struct addre
 	struct page *page;
 
 repeat:
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
+	page = find_get_page(mapping, offset);
 	if (page) {
-		page_cache_get(page);
-		if (TestSetPageLocked(page)) {
-			read_unlock_irq(&mapping->tree_lock);
-			__lock_page(page);
-
-			/* Has the page been truncated while we slept? */
-			if (unlikely(page->mapping != mapping)) {
-				unlock_page(page);
-				page_cache_release(page);
-				goto repeat;
-			}
-			goto out;
+		lock_page(page);
+		/* Has the page been truncated? */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
 		}
 	}
-	read_unlock_irq(&mapping->tree_lock);
-out:
 	return page;
 }
 EXPORT_SYMBOL(find_lock_page);
@@ -711,13 +722,39 @@ unsigned find_get_pages(struct address_s
 {
 	unsigned int i;
 	unsigned int ret;
+	unsigned int nr_found;
+
+	rcu_read_lock();
+restart:
+	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+				(void ***)pages, start, nr_pages);
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot((void **)pages[i]);
+		if (unlikely(!page))
+			continue;
+		/*
+		 * this can only trigger if nr_found == 1, making livelock
+		 * a non issue.
+		 */
+		if (unlikely(page == RADIX_TREE_RETRY))
+			goto restart;
 
-	read_lock_irq(&mapping->tree_lock);
-	ret = radix_tree_gang_lookup(&mapping->page_tree,
-				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
-	read_unlock_irq(&mapping->tree_lock);
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *((void **)pages[i]))) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		pages[ret] = page;
+		ret++;
+	}
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -738,19 +775,44 @@ unsigned find_get_pages_contig(struct ad
 {
 	unsigned int i;
 	unsigned int ret;
+	unsigned int nr_found;
+
+	rcu_read_lock();
+restart:
+	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+				(void ***)pages, index, nr_pages);
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot((void **)pages[i]);
+		if (unlikely(!page))
+			continue;
+		/*
+		 * this can only trigger if nr_found == 1, making livelock
+		 * a non issue.
+		 */
+		if (unlikely(page == RADIX_TREE_RETRY))
+			goto restart;
 
-	read_lock_irq(&mapping->tree_lock);
-	ret = radix_tree_gang_lookup(&mapping->page_tree,
-				(void **)pages, index, nr_pages);
-	for (i = 0; i < ret; i++) {
-		if (pages[i]->mapping == NULL || pages[i]->index != index)
+		if (page->mapping == NULL || page->index != index)
 			break;
 
-		page_cache_get(pages[i]);
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *((void **)pages[i]))) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		pages[ret] = page;
+		ret++;
 		index++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
-	return i;
+	rcu_read_unlock();
+	return ret;
 }
 
 /**
@@ -769,15 +831,43 @@ unsigned find_get_pages_tag(struct addre
 {
 	unsigned int i;
 	unsigned int ret;
+	unsigned int nr_found;
+
+	rcu_read_lock();
+restart:
+	nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree,
+				(void ***)pages, *index, nr_pages, tag);
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot((void **)pages[i]);
+		if (unlikely(!page))
+			continue;
+		/*
+		 * this can only trigger if nr_found == 1, making livelock
+		 * a non issue.
+		 */
+		if (unlikely(page == RADIX_TREE_RETRY))
+			goto restart;
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *((void **)pages[i]))) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		pages[ret] = page;
+		ret++;
+	}
+	rcu_read_unlock();
 
-	read_lock_irq(&mapping->tree_lock);
-	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
-				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
-	read_unlock_irq(&mapping->tree_lock);
+
 	return ret;
 }
 


* [patch 8/9] mm: spinlock tree_lock
From: Nick Piggin @ 2007-04-12 12:46 UTC
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

mapping->tree_lock has no read lockers any more. Convert the lock from
an rwlock to a spinlock.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -729,7 +729,7 @@ int __set_page_dirty_buffers(struct page
 	if (TestSetPageDirty(page))
 		return 0;
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
@@ -738,7 +738,7 @@ int __set_page_dirty_buffers(struct page
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return 1;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -193,7 +193,7 @@ void inode_init_once(struct inode *inode
 	mutex_init(&inode->i_mutex);
 	init_rwsem(&inode->i_alloc_sem);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
-	rwlock_init(&inode->i_data.tree_lock);
+	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -434,7 +434,7 @@ struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
-	rwlock_t		tree_lock;	/* and rwlock protecting it */
+	spinlock_t		tree_lock;	/* and lock protecting it */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -110,7 +110,7 @@ generic_file_direct_IO(int rw, struct ki
 /*
  * Remove a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
- * is safe.  The caller must hold a write_lock on the mapping's tree_lock.
+ * is safe.  The caller must hold the mapping's tree_lock.
  */
 void __remove_from_page_cache(struct page *page)
 {
@@ -128,9 +128,9 @@ void remove_from_page_cache(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 }
 
 static int sync_page(void *word)
@@ -441,7 +441,7 @@ int add_to_page_cache(struct page *page,
 
 	if (error == 0) {
 		set_page_nonewrefs(page);
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			page_cache_get(page);
@@ -451,7 +451,7 @@ int add_to_page_cache(struct page *page,
 			mapping->nrpages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		clear_page_nonewrefs(page);
 		radix_tree_preload_end();
 	}
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -38,7 +38,7 @@ static struct backing_dev_info swap_back
 
 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-	.tree_lock	= __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
+	.tree_lock	= __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
@@ -80,7 +80,7 @@ static int __add_to_swap_cache(struct pa
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
 		set_page_nonewrefs(page);
-		write_lock_irq(&swapper_space.tree_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
@@ -90,7 +90,7 @@ static int __add_to_swap_cache(struct pa
 			total_swapcache_pages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		write_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 		clear_page_nonewrefs(page);
 		radix_tree_preload_end();
 	}
@@ -202,9 +202,9 @@ void delete_from_swap_cache(struct page 
 
 	entry.val = page_private(page);
 
-	write_lock_irq(&swapper_space.tree_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	write_unlock_irq(&swapper_space.tree_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);
 
 	swap_free(entry);
 	page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -367,13 +367,13 @@ int remove_exclusive_swap_page(struct pa
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
-		write_lock_irq(&swapper_space.tree_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		write_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 	}
 	spin_unlock(&swap_lock);
 
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -328,18 +328,18 @@ invalidate_complete_page2(struct address
 	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
 
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
 failed:
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return 0;
 }
 
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -391,7 +391,7 @@ int remove_mapping(struct address_space 
 	BUG_ON(mapping != page_mapping(page));
 
 	set_page_nonewrefs(page);
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -426,13 +426,13 @@ int remove_mapping(struct address_space 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
 		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 
 free_it:
 	__clear_page_nonewrefs(page);
@@ -440,7 +440,7 @@ free_it:
 	return 1;
 
 cannot_free:
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	clear_page_nonewrefs(page);
 	return 0;
 }
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -765,7 +765,7 @@ int __set_page_dirty_no_writeback(struct
  * mapping is pinned by the vma's ->vm_file reference.
  *
  * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() insode tree_lock.
+ * mapping by re-checking page_mapping() inside tree_lock.
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
@@ -776,7 +776,7 @@ int __set_page_dirty_nobuffers(struct pa
 		if (!mapping)
 			return 1;
 
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
@@ -787,7 +787,7 @@ int __set_page_dirty_nobuffers(struct pa
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -919,33 +919,32 @@ EXPORT_SYMBOL(clear_page_dirty_for_io);
 int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
-	int ret;
 
 	if (mapping) {
 		unsigned long flags;
+		int ret;
 
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
 		if (ret)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
-	} else {
-		ret = TestClearPageWriteback(page);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		return ret;
 	}
-	return ret;
+	return TestClearPageWriteback(page);
 }
 
 int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
-	int ret;
 
 	if (mapping) {
 		unsigned long flags;
+		int ret;
 
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret)
 			radix_tree_tag_set(&mapping->page_tree,
@@ -955,11 +954,10 @@ int test_set_page_writeback(struct page 
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
-	} else {
-		ret = TestSetPageWriteback(page);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		return ret;
 	}
-	return ret;
+	return TestSetPageWriteback(page);
 
 }
 EXPORT_SYMBOL(test_set_page_writeback);
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -405,9 +405,9 @@ static inline void flush_anon_page(struc
 }
 
 #define flush_dcache_mmap_lock(mapping) \
-	write_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->tree_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	write_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->tree_lock)
 
 #define flush_icache_user_range(vma,page,addr,len) \
 	flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -45,9 +45,9 @@ void flush_cache_mm(struct mm_struct *mm
 extern void flush_dcache_page(struct page *page);
 
 #define flush_dcache_mmap_lock(mapping) \
-	write_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->tree_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	write_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->tree_lock)
 
 #define flush_icache_page(vma,page)	do { 		\
 	flush_kernel_dcache_page(page);			\
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -304,14 +304,14 @@ static int migrate_page_move_mapping(str
 	}
 
 	set_page_nonewrefs(page);
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		clear_page_nonewrefs(page);
 		return -EAGAIN;
 	}
@@ -329,7 +329,7 @@ static int migrate_page_move_mapping(str
 
 	radix_tree_replace_slot(pslot, newpage);
 	page->mapping = NULL;
-  	write_unlock_irq(&mapping->tree_lock);
+  	spin_unlock_irq(&mapping->tree_lock);
 	clear_page_nonewrefs(page);
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 9/9] mm: lockless test threads
  2007-04-12 12:44 [patch 0/9] lockless pagecache for 2.6.21-rc6 Nick Piggin
                   ` (7 preceding siblings ...)
  2007-04-12 12:46 ` [patch 8/9] mm: spinlock tree_lock Nick Piggin
@ 2007-04-12 12:46 ` Nick Piggin
  2007-04-13 16:54   ` Hugh Dickins
  2007-04-12 12:46 ` [rfc] rename page_count for lockless pagecache Nick Piggin
  9 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2007-04-12 12:46 UTC (permalink / raw)
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

Introduce a basic lockless pagecache test harness. I don't know what value
this has, because it hasn't caught a bug yet, but it might help with testing.

Signed-off-by: Nick Piggin <npiggin@suse.de>

 lib/Kconfig.debug |   12 +++
 mm/Makefile       |    1 
 mm/lpctest.c      |  194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 207 insertions(+)

Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -379,6 +379,18 @@ config FORCED_INLINING
 	  become the default in the future, until then this option is there to
 	  test gcc for this.
 
+config LPC_TEST
+	tristate "Background tests for lockless pagecache"
+	depends on DEBUG_KERNEL
+	default n
+	help
+	  This option provides a kernel module that runs some background
+	  threads that exercise lockless pagecache races more than usual.
+
+	  Say Y here if you want the LPC test threads to start automatically
+	  at boot. Say M to be able to start them by inserting the module.
+	  Say N if you are unsure.
+
 config RCU_TORTURE_TEST
 	tristate "torture tests for RCU"
 	depends on DEBUG_KERNEL
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
+obj-$(CONFIG_LPC_TEST) += lpctest.o
Index: linux-2.6/mm/lpctest.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/lpctest.c
@@ -0,0 +1,194 @@
+/*
+ * Lockless pagecache test thread
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Based on kernel/rcutorture.c, which is
+ * Copyright (C) IBM Corporation, 2005, 2006
+ *
+ * Copyright (C) Nick Piggin, SUSE Labs, Novell Inc, 2007
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/err.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/module.h>
+#include <linux/completion.h>
+#include <linux/moduleparam.h>
+#include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/random.h>
+#include <linux/delay.h>
+#include <linux/byteorder/swabb.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Nick Piggin <npiggin@suse.de");
+
+static int random_threads = 2;	/* # random pfn threads */
+static int verbose;		/* Print more debug info. */
+
+module_param(random_threads, int, 0444);
+MODULE_PARM_DESC(random_threads, "Number of random pfn threads");
+module_param(verbose, bool, 0444);
+MODULE_PARM_DESC(verbose, "Enable verbose debugging printk()s");
+
+static struct task_struct **random_tasks;
+
+struct lpc_random_state {
+	unsigned long rrs_state;
+	long rrs_count;
+};
+
+#define LPC_RANDOM_MULT 39916801  /* prime */
+#define LPC_RANDOM_ADD	479001701 /* prime */
+#define LPC_RANDOM_REFRESH 10000
+
+#define DEFINE_LPC_RANDOM(name) struct lpc_random_state name = { 0, 0 }
+
+/*
+ * Crude but fast random-number generator.  Uses a linear congruential
+ * generator, with occasional help from get_random_bytes().
+ */
+static unsigned long lpc_random(struct lpc_random_state *rrsp)
+{
+	long refresh;
+
+	if (--rrsp->rrs_count < 0) {
+		get_random_bytes(&refresh, sizeof(refresh));
+		rrsp->rrs_state += refresh;
+		rrsp->rrs_count = LPC_RANDOM_REFRESH;
+	}
+	rrsp->rrs_state = rrsp->rrs_state * LPC_RANDOM_MULT + LPC_RANDOM_ADD;
+	return swahw32(rrsp->rrs_state);
+}
+
+/*
+ * Definitions for lpc testing.
+ */
+static void lpc_page_delay(struct lpc_random_state *rrsp)
+{
+	long rnd;
+
+	/* We want there to be long-held pages, but not all the time. */
+
+	rnd = lpc_random(rrsp);
+	if (rnd % 200 == 0)
+		udelay(20);
+	else if (rnd % 300 == 0)
+		schedule_timeout_uninterruptible(2);
+
+}
+
+/*
+ * LPC test random kthread.  Repeatedly takes speculative references on
+ * random pages, then drops them (possibly after a delay).
+ *
+ * This will not work properly if things start using synchronize_rcu to
+ * ensure a page will not be touched by speculative references, but so
+ * far we have avoided that.
+ */
+static int lpc_random_thread(void *arg)
+{
+	DEFINE_LPC_RANDOM(rand);
+
+	set_user_nice(current, 19);
+	current->flags |= PF_NOFREEZE;
+
+	do {
+		struct zone *zone;
+		for_each_zone(zone) {
+			unsigned long pfn;
+			unsigned int times;
+			struct page *page;
+
+			pfn = zone->zone_start_pfn +
+				lpc_random(&rand) % zone->spanned_pages;
+			if (!pfn_valid(pfn))
+				continue;
+
+			page = pfn_to_page(pfn);
+
+			for (times = 1+lpc_random(&rand)%100; times; times--) {
+				int ret;
+				rcu_read_lock();
+				ret = page_cache_get_speculative(page);
+				rcu_read_unlock();
+				if (ret) {
+					lpc_page_delay(&rand);
+					page_cache_release(page);
+				}
+				lpc_page_delay(&rand);
+			}
+		}
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+static void lpc_test_cleanup(void)
+{
+	int i;
+
+	if (random_tasks != NULL) {
+		for (i = 0; i < random_threads; i++) {
+			if (random_tasks[i] != NULL) {
+				kthread_stop(random_tasks[i]);
+				random_tasks[i] = NULL;
+			}
+		}
+		kfree(random_tasks);
+		random_tasks = NULL;
+	}
+}
+
+static int lpc_test_init(void)
+{
+	int i;
+	int err = 0;
+
+	random_tasks = kzalloc(random_threads * sizeof(random_tasks[0]),
+			       GFP_KERNEL);
+	if (random_tasks == NULL) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < random_threads; i++) {
+		random_tasks[i] = kthread_run(lpc_random_thread, NULL,
+					      "lpc_random_thread");
+		if (IS_ERR(random_tasks[i])) {
+			err = PTR_ERR(random_tasks[i]);
+			random_tasks[i] = NULL;
+			goto out;
+		}
+	}
+
+out:
+	if (err)
+		lpc_test_cleanup();
+	return err;
+}
+
+module_init(lpc_test_init);
+module_exit(lpc_test_cleanup);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [rfc] rename page_count for lockless pagecache
  2007-04-12 12:44 [patch 0/9] lockless pagecache for 2.6.21-rc6 Nick Piggin
                   ` (8 preceding siblings ...)
  2007-04-12 12:46 ` [patch 9/9] mm: lockless test threads Nick Piggin
@ 2007-04-12 12:46 ` Nick Piggin
  2007-04-12 17:03   ` Peter Zijlstra
  2007-04-13 11:53   ` Hugh Dickins
  9 siblings, 2 replies; 23+ messages in thread
From: Nick Piggin @ 2007-04-12 12:46 UTC (permalink / raw)
  To: Linux Memory Management; +Cc: Nick Piggin, Andrew Morton

In order to force an audit of page_count users (which I have already done
for in-tree users), and to ensure people think about page_count correctly
in future, I propose this (incomplete, RFC) patch to rename page_count.

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -267,13 +267,32 @@ static inline int get_page_unless_zero(s
 	return atomic_inc_not_zero(&page->_count);
 }
 
-static inline int page_count(struct page *page)
+static inline int unstable_page_count(struct page *page)
 {
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
 	return atomic_read(&page->_count);
 }
 
+/*
+ * Returns true if the page count is zero. The page count is
+ * actually stable if it is zero.
+ */
+static inline int page_count_is_zero(struct page *page)
+{
+	return unstable_page_count(page) == 0;
+}
+
+/*
+ * Returns true if page count of page is less than or equal to c.
+ * PageNoNewRefs must be set on page.
+ */
+static inline int page_count_lessequal(struct page *page, int c)
+{
+	VM_BUG_ON(!PageNoNewRefs(page));
+	return unstable_page_count(page) <= c;
+}
+
 static inline void get_page(struct page *page)
 {
 	if (unlikely(PageCompound(page)))
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -90,7 +90,7 @@ static struct page *dequeue_huge_page(st
 
 static void free_huge_page(struct page *page)
 {
-	BUG_ON(page_count(page));
+	BUG_ON(!page_count_is_zero(page));
 
 	INIT_LIST_HEAD(&page->lru);
 
@@ -429,7 +429,7 @@ static int hugetlb_cow(struct mm_struct 
 
 	/* If no-one else is actually using this page, avoid the copy
 	 * and just make the page writable */
-	avoidcopy = (page_count(old_page) == 1);
+	avoidcopy = (unstable_page_count(old_page) == 1);
 	if (avoidcopy) {
 		set_huge_ptep_writable(vma, address, ptep);
 		return VM_FAULT_MINOR;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1270,7 +1270,7 @@ int vm_insert_page(struct vm_area_struct
 {
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
-	if (!page_count(page))
+	if (page_count_is_zero(page))
 		return -EINVAL;
 	vma->vm_flags |= VM_INSERTPAGE;
 	return insert_page(vma->vm_mm, addr, page, vma->vm_page_prot);
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -298,7 +298,7 @@ static int migrate_page_move_mapping(str
 
 	if (!mapping) {
 		/* Anonymous page */
-		if (page_count(page) != 1)
+		if (unstable_page_count(page) != 1)
 			return -EAGAIN;
 		return 0;
 	}
@@ -309,7 +309,7 @@ static int migrate_page_move_mapping(str
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
-	if (page_count(page) != 2 + !!PagePrivate(page) ||
+	if (!page_count_lessequal(page, 2 + !!PagePrivate(page)) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
 		clear_page_nonewrefs(page);
@@ -606,7 +606,7 @@ static int unmap_and_move(new_page_t get
 	if (!newpage)
 		return -ENOMEM;
 
-	if (page_count(page) == 1)
+	if (unstable_page_count(page) == 1)
 		/* page was freed from under us. So we are done. */
 		goto move_newpage;
 
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -192,7 +192,7 @@ static void bad_page(struct page *page)
 		KERN_EMERG "Backtrace:\n",
 		current->comm, page, (int)(2*sizeof(unsigned long)),
 		(unsigned long)page->flags, page->mapping,
-		page_mapcount(page), page_count(page));
+		page_mapcount(page), unstable_page_count(page));
 	dump_stack();
 	page->flags &= ~(1 << PG_lru	|
 			1 << PG_private |
@@ -355,7 +355,7 @@ static inline int page_is_buddy(struct p
 		return 0;
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
-		BUG_ON(page_count(buddy) != 0);
+		BUG_ON(!page_count_is_zero(buddy));
 		return 1;
 	}
 	return 0;
@@ -427,7 +427,7 @@ static inline int free_pages_check(struc
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(!page_count_is_zero(page)) |
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private |
@@ -576,7 +576,7 @@ static int prep_new_page(struct page *pa
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(!page_count_is_zero(page)) |
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private	|
@@ -854,7 +854,7 @@ void split_page(struct page *page, unsig
 	int i;
 
 	VM_BUG_ON(PageCompound(page));
-	VM_BUG_ON(!page_count(page));
+	VM_BUG_ON(page_count_is_zero(page));
 	for (i = 1; i < (1 << order); i++)
 		set_page_refcounted(page + i);
 }
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -586,7 +586,7 @@ void page_remove_rmap(struct page *page,
 			printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
 			printk (KERN_EMERG "  page pfn = %lx\n", page_to_pfn(page));
 			printk (KERN_EMERG "  page->flags = %lx\n", page->flags);
-			printk (KERN_EMERG "  page->count = %x\n", page_count(page));
+			printk (KERN_EMERG "  page->count = %x\n", unstable_page_count(page));
 			printk (KERN_EMERG "  page->mapping = %p\n", page->mapping);
 			print_symbol (KERN_EMERG "  vma->vm_ops = %s\n", (unsigned long)vma->vm_ops);
 			if (vma->vm_ops)
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -73,7 +73,7 @@ void swap_unplug_io_fn(struct backing_de
 		 * condition and it's harmless. However if it triggers without
 		 * swapoff it signals a problem.
 		 */
-		WARN_ON(page_count(page) <= 1);
+		WARN_ON(unstable_page_count(page) <= 1);
 
 		bdi = bdev->bd_inode->i_mapping->backing_dev_info;
 		blk_run_backing_dev(bdi, page);
@@ -355,7 +355,7 @@ int remove_exclusive_swap_page(struct pa
 		return 0;
 	if (PageWriteback(page))
 		return 0;
-	if (page_count(page) != 2) /* 2: us + cache */
+	if (unstable_page_count(page) != 2) /* 2: us + cache */
 		return 0;
 
 	entry.val = page_private(page);
@@ -366,14 +366,14 @@ int remove_exclusive_swap_page(struct pa
 	/* Is the only swap cache user the cache itself? */
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
-		/* Recheck the page count with the swapcache lock held.. */
-		spin_lock_irq(&swapper_space.tree_lock);
-		if ((page_count(page) == 2) && !PageWriteback(page)) {
+		/* XXX: BUG_ON(PageWriteback(page)) ?? */
+		if ((unstable_page_count(page) == 2) && !PageWriteback(page)) {
+			spin_lock_irq(&swapper_space.tree_lock);
 			__delete_from_swap_cache(page);
+			spin_unlock_irq(&swapper_space.tree_lock);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		spin_unlock_irq(&swapper_space.tree_lock);
 	}
 	spin_unlock(&swap_lock);
 
@@ -412,7 +412,7 @@ void free_swap_and_cache(swp_entry_t ent
 		int one_user;
 
 		BUG_ON(PagePrivate(page));
-		one_user = (page_count(page) == 2);
+		one_user = (unstable_page_count(page) == 2);
 		/* Only cache user (+us), or swap space full? Free it! */
 		/* Also recheck PageSwapCache after page is locked (above) */
 		if (PageSwapCache(page) && !PageWriteback(page) &&
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -254,7 +254,7 @@ static inline int page_mapping_inuse(str
 
 static inline int is_page_cache_freeable(struct page *page)
 {
-	return page_count(page) - !!PagePrivate(page) == 2;
+	return unstable_page_count(page) - !!PagePrivate(page) == 2;
 }
 
 static int may_write_to_queue(struct backing_dev_info *bdi)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [rfc] rename page_count for lockless pagecache
  2007-04-12 12:46 ` [rfc] rename page_count for lockless pagecache Nick Piggin
@ 2007-04-12 17:03   ` Peter Zijlstra
  2007-04-12 23:27     ` Nick Piggin
  2007-04-13 11:53   ` Hugh Dickins
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2007-04-12 17:03 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management, Andrew Morton

On Thu, 2007-04-12 at 14:46 +0200, Nick Piggin wrote:
> In order to force an audit of page_count users (which I have already done
> for in-tree users), and to ensure people think about page_count correctly
> in future, I propose this (incomplete, RFC) patch to rename page_count.
> 

My compiler suggests you did a very terse job, but given that it's an
RFC... :-)

Anyway, I like it, but I'll leave it out for now.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [rfc] rename page_count for lockless pagecache
  2007-04-12 17:03   ` Peter Zijlstra
@ 2007-04-12 23:27     ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2007-04-12 23:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Linux Memory Management, Andrew Morton

On Thu, Apr 12, 2007 at 07:03:39PM +0200, Peter Zijlstra wrote:
> On Thu, 2007-04-12 at 14:46 +0200, Nick Piggin wrote:
> > In order to force an audit of page_count users (which I have already done
> > for in-tree users), and to ensure people think about page_count correctly
> > in future, I propose this (incomplete, RFC) patch to rename page_count.
> > 
> 
> My compiler suggests you did a very terse job, but given that it's an
> RFC... :-)

Yeah, just mm/, to see how it looks.

I have no problem converting the whole tree (there isn't much left),
but the question is whether out of tree drivers use it.

I guess if they do use it, then they need to be checked anyway, so
there isn't really a way we can provide interim compatibility.

> 
> Anyway, I like it, but I'll leave it out for now.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [rfc] rename page_count for lockless pagecache
  2007-04-12 12:46 ` [rfc] rename page_count for lockless pagecache Nick Piggin
  2007-04-12 17:03   ` Peter Zijlstra
@ 2007-04-13 11:53   ` Hugh Dickins
  2007-04-13 12:13     ` Nick Piggin
  1 sibling, 1 reply; 23+ messages in thread
From: Hugh Dickins @ 2007-04-13 11:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management, Andrew Morton

On Thu, 12 Apr 2007, Nick Piggin wrote:
> In order to force an audit of page_count users (which I have already done
> for in-tree users), and to ensure people think about page_count correctly
> in future, I propose this (incomplete, RFC) patch to rename page_count.

I see your point, it's a concern worth raising; but it grieves me that
we first lost page->count, and now you propose we lose page_count().

I don't care for the patch (especially page_count_lessequal).
I rather think it will cause more noise and nuisance than anything
else.  All the arches would need to be updated too.  Out of tree
people, won't they just #define anew without comprehending?

Might it be more profitable for a DEBUG mode to inject random
variations into page_count?

What did your audit show?  Was anything in the tree actually using
page_count() in a manner safe before but unsafe after your changes?
What you found outside of /mm should be a fair guide to what might
be there out of tree.

Hugh

> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -267,13 +267,32 @@ static inline int get_page_unless_zero(s
>  	return atomic_inc_not_zero(&page->_count);
>  }
>  
> -static inline int page_count(struct page *page)
> +static inline int unstable_page_count(struct page *page)
>  {
>  	if (unlikely(PageCompound(page)))
>  		page = (struct page *)page_private(page);
>  	return atomic_read(&page->_count);
>  }
>  
> +/*
> + * Returns true if the page count is zero. The page count is
> + * actually stable if it is zero.
> + */
> +static inline int page_count_is_zero(struct page *page)
> +{
> +	return unstable_page_count(page) == 0;
> +}
> +
> +/*
> + * Returns true if page count of page is less than or equal to c.
> + * PageNoNewRefs must be set on page.
> + */
> +static inline int page_count_lessequal(struct page *page, int c)
> +{
> +	VM_BUG_ON(!PageNoNewRefs(page));
> +	return unstable_page_count(page) <= c;
> +}
> +
>  static inline void get_page(struct page *page)
>  {
>  	if (unlikely(PageCompound(page)))
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -90,7 +90,7 @@ static struct page *dequeue_huge_page(st
>  
>  static void free_huge_page(struct page *page)
>  {
> -	BUG_ON(page_count(page));
> +	BUG_ON(!page_count_is_zero(page));
>  
>  	INIT_LIST_HEAD(&page->lru);
>  
> @@ -429,7 +429,7 @@ static int hugetlb_cow(struct mm_struct 
>  
>  	/* If no-one else is actually using this page, avoid the copy
>  	 * and just make the page writable */
> -	avoidcopy = (page_count(old_page) == 1);
> +	avoidcopy = (unstable_page_count(old_page) == 1);
>  	if (avoidcopy) {
>  		set_huge_ptep_writable(vma, address, ptep);
>  		return VM_FAULT_MINOR;
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -1270,7 +1270,7 @@ int vm_insert_page(struct vm_area_struct
>  {
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
>  		return -EFAULT;
> -	if (!page_count(page))
> +	if (page_count_is_zero(page))
>  		return -EINVAL;
>  	vma->vm_flags |= VM_INSERTPAGE;
>  	return insert_page(vma->vm_mm, addr, page, vma->vm_page_prot);
> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c
> +++ linux-2.6/mm/migrate.c
> @@ -298,7 +298,7 @@ static int migrate_page_move_mapping(str
>  
>  	if (!mapping) {
>  		/* Anonymous page */
> -		if (page_count(page) != 1)
> +		if (unstable_page_count(page) != 1)
>  			return -EAGAIN;
>  		return 0;
>  	}
> @@ -309,7 +309,7 @@ static int migrate_page_move_mapping(str
>  	pslot = radix_tree_lookup_slot(&mapping->page_tree,
>   					page_index(page));
>  
> -	if (page_count(page) != 2 + !!PagePrivate(page) ||
> +	if (!page_count_lessequal(page, 2 + !!PagePrivate(page)) ||
>  			(struct page *)radix_tree_deref_slot(pslot) != page) {
>  		spin_unlock_irq(&mapping->tree_lock);
>  		clear_page_nonewrefs(page);
> @@ -606,7 +606,7 @@ static int unmap_and_move(new_page_t get
>  	if (!newpage)
>  		return -ENOMEM;
>  
> -	if (page_count(page) == 1)
> +	if (unstable_page_count(page) == 1)
>  		/* page was freed from under us. So we are done. */
>  		goto move_newpage;
>  
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -192,7 +192,7 @@ static void bad_page(struct page *page)
>  		KERN_EMERG "Backtrace:\n",
>  		current->comm, page, (int)(2*sizeof(unsigned long)),
>  		(unsigned long)page->flags, page->mapping,
> -		page_mapcount(page), page_count(page));
> +		page_mapcount(page), unstable_page_count(page));
>  	dump_stack();
>  	page->flags &= ~(1 << PG_lru	|
>  			1 << PG_private |
> @@ -355,7 +355,7 @@ static inline int page_is_buddy(struct p
>  		return 0;
>  
>  	if (PageBuddy(buddy) && page_order(buddy) == order) {
> -		BUG_ON(page_count(buddy) != 0);
> +		BUG_ON(!page_count_is_zero(buddy));
>  		return 1;
>  	}
>  	return 0;
> @@ -427,7 +427,7 @@ static inline int free_pages_check(struc
>  {
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
> -		(page_count(page) != 0)  |
> +		(!page_count_is_zero(page)) |
>  		(page->flags & (
>  			1 << PG_lru	|
>  			1 << PG_private |
> @@ -576,7 +576,7 @@ static int prep_new_page(struct page *pa
>  {
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
> -		(page_count(page) != 0)  |
> +		(!page_count_is_zero(page)) |
>  		(page->flags & (
>  			1 << PG_lru	|
>  			1 << PG_private	|
> @@ -854,7 +854,7 @@ void split_page(struct page *page, unsig
>  	int i;
>  
>  	VM_BUG_ON(PageCompound(page));
> -	VM_BUG_ON(!page_count(page));
> +	VM_BUG_ON(page_count_is_zero(page));
>  	for (i = 1; i < (1 << order); i++)
>  		set_page_refcounted(page + i);
>  }
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -586,7 +586,7 @@ void page_remove_rmap(struct page *page,
>  			printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
>  			printk (KERN_EMERG "  page pfn = %lx\n", page_to_pfn(page));
>  			printk (KERN_EMERG "  page->flags = %lx\n", page->flags);
> -			printk (KERN_EMERG "  page->count = %x\n", page_count(page));
> +			printk (KERN_EMERG "  page->count = %x\n", unstable_page_count(page));
>  			printk (KERN_EMERG "  page->mapping = %p\n", page->mapping);
>  			print_symbol (KERN_EMERG "  vma->vm_ops = %s\n", (unsigned long)vma->vm_ops);
>  			if (vma->vm_ops)
> Index: linux-2.6/mm/swapfile.c
> ===================================================================
> --- linux-2.6.orig/mm/swapfile.c
> +++ linux-2.6/mm/swapfile.c
> @@ -73,7 +73,7 @@ void swap_unplug_io_fn(struct backing_de
>  		 * condition and it's harmless. However if it triggers without
>  		 * swapoff it signals a problem.
>  		 */
> -		WARN_ON(page_count(page) <= 1);
> +		WARN_ON(unstable_page_count(page) <= 1);
>  
>  		bdi = bdev->bd_inode->i_mapping->backing_dev_info;
>  		blk_run_backing_dev(bdi, page);
> @@ -355,7 +355,7 @@ int remove_exclusive_swap_page(struct pa
>  		return 0;
>  	if (PageWriteback(page))
>  		return 0;
> -	if (page_count(page) != 2) /* 2: us + cache */
> +	if (unstable_page_count(page) != 2) /* 2: us + cache */
>  		return 0;
>  
>  	entry.val = page_private(page);
> @@ -366,14 +366,14 @@ int remove_exclusive_swap_page(struct pa
>  	/* Is the only swap cache user the cache itself? */
>  	retval = 0;
>  	if (p->swap_map[swp_offset(entry)] == 1) {
> -		/* Recheck the page count with the swapcache lock held.. */
> -		spin_lock_irq(&swapper_space.tree_lock);
> -		if ((page_count(page) == 2) && !PageWriteback(page)) {
> +		/* XXX: BUG_ON(PageWriteback(page)) ?? */
> +		if ((unstable_page_count(page) == 2) && !PageWriteback(page)) {
> +			spin_lock_irq(&swapper_space.tree_lock);
>  			__delete_from_swap_cache(page);
> +			spin_unlock_irq(&swapper_space.tree_lock);
>  			SetPageDirty(page);
>  			retval = 1;
>  		}
> -		spin_unlock_irq(&swapper_space.tree_lock);
>  	}
>  	spin_unlock(&swap_lock);
>  
> @@ -412,7 +412,7 @@ void free_swap_and_cache(swp_entry_t ent
>  		int one_user;
>  
>  		BUG_ON(PagePrivate(page));
> -		one_user = (page_count(page) == 2);
> +		one_user = (unstable_page_count(page) == 2);
>  		/* Only cache user (+us), or swap space full? Free it! */
>  		/* Also recheck PageSwapCache after page is locked (above) */
>  		if (PageSwapCache(page) && !PageWriteback(page) &&
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -254,7 +254,7 @@ static inline int page_mapping_inuse(str
>  
>  static inline int is_page_cache_freeable(struct page *page)
>  {
> -	return page_count(page) - !!PagePrivate(page) == 2;
> +	return unstable_page_count(page) - !!PagePrivate(page) == 2;
>  }
>  
>  static int may_write_to_queue(struct backing_dev_info *bdi)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [rfc] rename page_count for lockless pagecache
  2007-04-13 11:53   ` Hugh Dickins
@ 2007-04-13 12:13     ` Nick Piggin
  2007-04-14  2:24       ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2007-04-13 12:13 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Memory Management, Andrew Morton

On Fri, Apr 13, 2007 at 12:53:05PM +0100, Hugh Dickins wrote:
> On Thu, 12 Apr 2007, Nick Piggin wrote:
> > In order to force an audit of page_count users (which I have already done
> > for in-tree users), and to ensure people think about page_count correctly
> > in future, I propose this (incomplete, RFC) patch to rename page_count.
> 
> I see your point, it's a concern worth raising; but it grieves me that
> we first lost page->count, and now you propose we lose page_count().
> 
> I don't care for the patch (especially page_count_lessequal).
> I rather think it will cause more noise and nuisance than anything
> else.  All the arches would need to be updated too.  Out of tree
> people, won't they just #define anew without comprehending?

Yeah you may have a point. (lessequal is silly I agree, because it
doesn't convey the fact that the count is still unstable even with
nonewrefs).

On the other hand, I think it probably would get people to think a
little bit more.


> Might it be more profitable for a DEBUG mode to inject random
> variations into page_count?

I think that's a very fine idea, and much more suitable for an
everyday kernel than my test threads. Doesn't help if they use the
field somehow without the accessors, but we must discourage that.
Thanks, I'll add such a debug mode.


> What did your audit show?  Was anything in the tree actually using
> page_count() in a manner safe before but unsafe after your changes?
> What you found outside of /mm should be a fair guide to what might
> be there out of tree.

A couple of things... a network driver was using it as a non-atomic
field IIRC (or at least in an unsafe manner), and x86-64 kernel tlb
flushing was using it unsafely. I think that might have been all,
but that was a while ago... So yeah, basically, not much was wrong.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 9/9] mm: lockless test threads
  2007-04-12 12:46 ` [patch 9/9] mm: lockless test threads Nick Piggin
@ 2007-04-13 16:54   ` Hugh Dickins
  2007-04-13 23:31     ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Hugh Dickins @ 2007-04-13 16:54 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management, Andrew Morton

On Thu, 12 Apr 2007, Nick Piggin wrote:

> Introduce a basic lockless pagecache test harness. I don't know what value
> this has, because it hasn't caught a bug yet, but it might help with testing.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

A couple of fixes to fold in: the modular build needs two exports;
and I got divide-by-0 with mem=512M to a HIGHMEM kernel.

Signed-off-by: Hugh Dickins <hugh@veritas.com>

--- 2.6.21-rc6-np/mm/lpctest.c	2007-04-13 15:25:41.000000000 +0100
+++ linux/mm/lpctest.c	2007-04-13 17:36:22.000000000 +0100
@@ -122,6 +122,8 @@ static int lpc_random_thread(void *arg)
 			unsigned int times;
 			struct page *page;
 
+			if (!zone->spanned_pages)
+				continue;
 			pfn = zone->zone_start_pfn +
 				lpc_random(&rand) % zone->spanned_pages;
 			if (!pfn_valid(pfn))
--- 2.6.21-rc6-np/mm/mmzone.c	2007-02-04 18:44:54.000000000 +0000
+++ linux/mm/mmzone.c	2007-04-13 16:10:06.000000000 +0100
@@ -42,3 +42,7 @@ struct zone *next_zone(struct zone *zone
 	return zone;
 }
 
+#ifdef CONFIG_LPC_TEST_MODULE
+EXPORT_SYMBOL_GPL(first_online_pgdat);
+EXPORT_SYMBOL_GPL(next_zone);
+#endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 9/9] mm: lockless test threads
  2007-04-13 16:54   ` Hugh Dickins
@ 2007-04-13 23:31     ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2007-04-13 23:31 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Memory Management, Andrew Morton

On Fri, Apr 13, 2007 at 05:54:37PM +0100, Hugh Dickins wrote:
> On Thu, 12 Apr 2007, Nick Piggin wrote:
> 
> > Introduce a basic lockless pagecache test harness. I don't know what value
> > this has, because it hasn't caught a bug yet, but it might help with testing.
> > 
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> 
> A couple of fixes to fold in: the modular build needs two exports;
> and I got divide-by-0 with mem=512M to a HIGHMEM kernel.

Thanks!

> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> 
> --- 2.6.21-rc6-np/mm/lpctest.c	2007-04-13 15:25:41.000000000 +0100
> +++ linux/mm/lpctest.c	2007-04-13 17:36:22.000000000 +0100
> @@ -122,6 +122,8 @@ static int lpc_random_thread(void *arg)
>  			unsigned int times;
>  			struct page *page;
>  
> +			if (!zone->spanned_pages)
> +				continue;
>  			pfn = zone->zone_start_pfn +
>  				lpc_random(&rand) % zone->spanned_pages;
>  			if (!pfn_valid(pfn))
> --- 2.6.21-rc6-np/mm/mmzone.c	2007-02-04 18:44:54.000000000 +0000
> +++ linux/mm/mmzone.c	2007-04-13 16:10:06.000000000 +0100
> @@ -42,3 +42,7 @@ struct zone *next_zone(struct zone *zone
>  	return zone;
>  }
>  
> +#ifdef CONFIG_LPC_TEST_MODULE
> +EXPORT_SYMBOL_GPL(first_online_pgdat);
> +EXPORT_SYMBOL_GPL(next_zone);
> +#endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [rfc] rename page_count for lockless pagecache
  2007-04-13 12:13     ` Nick Piggin
@ 2007-04-14  2:24       ` Nick Piggin
  2007-04-16 18:28         ` Hugh Dickins
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2007-04-14  2:24 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Memory Management, Andrew Morton

On Fri, Apr 13, 2007 at 02:13:47PM +0200, Nick Piggin wrote:
> On Fri, Apr 13, 2007 at 12:53:05PM +0100, Hugh Dickins wrote:
> > Might it be more profitable for a DEBUG mode to inject random
> > variations into page_count?
> 
> I think that's a very fine idea, and much more suitable for an
> everyday kernel than my test threads. Doesn't help if they use the
> field somehow without the accessors, but we must discourage that.
> Thanks, I'll add such a debug mode.

Something like this boots and survives some stress testing here.

I guess it should be under something other than CONFIG_DEBUG_VM,
because it could harm performance and scalability significantly on
bigger boxes... or maybe it should use per-cpu counters? ;)
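
(For illustration, a per-cpu flavour might look something like the
following -- untested sketch, helper name made up here:

DEFINE_PER_CPU(int, ll_counter);

/* bump this cpu's counter, returning the pre-increment value */
static int ll_counter_next(void)
{
	int c;

	c = get_cpu_var(ll_counter)++;	/* get_cpu_var disables preemption */
	put_cpu_var(ll_counter);
	return c;
}

and page_count() would then fuzz its return value from ll_counter_next()
instead of the global int below.)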

--
Add some debugging for lockless pagecache as suggested by Hugh.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -267,10 +267,29 @@ static inline int get_page_unless_zero(s
 	return atomic_inc_not_zero(&page->_count);
 }
 
+#ifdef CONFIG_DEBUG_VM
+extern int ll_counter;
+#endif
 static inline int page_count(struct page *page)
 {
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * debug testing for lockless pagecache. add a random value to
+	 * page_count every now and then, to simulate speculative references
+	 * to it.
+	 */
+	{
+		int count = atomic_read(&page->_count);
+		if (count) {
+			ll_counter++;
+			if (ll_counter % 5 == 0 || ll_counter % 7 == 0)
+				count += ll_counter % 11;
+		}
+		return count;
+	}
+#endif
 	return atomic_read(&page->_count);
 }
 
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -137,6 +137,8 @@ static unsigned long __initdata dma_rese
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
 #ifdef CONFIG_DEBUG_VM
+int ll_counter; /* used in include/linux/mm.h, for lockless pagecache */
+EXPORT_SYMBOL(ll_counter);
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
 	int ret = 0;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [rfc] rename page_count for lockless pagecache
  2007-04-14  2:24       ` Nick Piggin
@ 2007-04-16 18:28         ` Hugh Dickins
  2007-04-17  3:07           ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Hugh Dickins @ 2007-04-16 18:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management, Andrew Morton

On Sat, 14 Apr 2007, Nick Piggin wrote:
> On Fri, Apr 13, 2007 at 02:13:47PM +0200, Nick Piggin wrote:
> > On Fri, Apr 13, 2007 at 12:53:05PM +0100, Hugh Dickins wrote:
> > > Might it be more profitable for a DEBUG mode to inject random
> > > variations into page_count?
> > 
> > I think that's a very fine idea, and much more suitable for an
> > everyday kernel than my test threads. Doesn't help if they use the
> > field somehow without the accessors, but we must discourage that.
> > Thanks, I'll add such a debug mode.
> 
> Something like this boots and survives some stress testing here.
> 
> I guess it should be under something other than CONFIG_DEBUG_VM,
> because it could harm performance and scalability significantly on
> bigger boxes... or maybe it should use per-cpu counters? ;)
> 
> --
> Add some debugging for lockless pagecache as suggested by Hugh.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

Hmm, maybe.  Would be rather cleaner if in this case page_count()
were not inlined but EXPORTed, with the ll_count static within it.
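
(i.e. something like this, untested, hoisting your fuzzing out of line:

int page_count(struct page *page)
{
	static int ll_counter;
	int count;

	if (unlikely(PageCompound(page)))
		page = (struct page *)page_private(page);
	count = atomic_read(&page->_count);
	if (count) {
		/* simulate speculative references now and then, as below */
		ll_counter++;
		if (ll_counter % 5 == 0 || ll_counter % 7 == 0)
			count += ll_counter % 11;
	}
	return count;
}
EXPORT_SYMBOL(page_count);

with just an extern declaration left in mm.h for the DEBUG_VM case.)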

But I'm not terribly proud of the idea, and wonder whether we just
forget it?  How are we going to recognize it if this (or your
lpctest) ever does cause a problem?  Seems like a good thing for
you or I to try when developing, but whether it should go on into
the tree I'm less sure.

Hugh

> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -267,10 +267,29 @@ static inline int get_page_unless_zero(s
>  	return atomic_inc_not_zero(&page->_count);
>  }
>  
> +#ifdef CONFIG_DEBUG_VM
> +extern int ll_counter;
> +#endif
>  static inline int page_count(struct page *page)
>  {
>  	if (unlikely(PageCompound(page)))
>  		page = (struct page *)page_private(page);
> +#ifdef CONFIG_DEBUG_VM
> +	/*
> +	 * debug testing for lockless pagecache. add a random value to
> +	 * page_count every now and then, to simulate speculative references
> +	 * to it.
> +	 */
> +	{
> +		int count = atomic_read(&page->_count);
> +		if (count) {
> +			ll_counter++;
> +			if (ll_counter % 5 == 0 || ll_counter % 7 == 0)
> +				count += ll_counter % 11;
> +		}
> +		return count;
> +	}
> +#endif
>  	return atomic_read(&page->_count);
>  }
>  
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -137,6 +137,8 @@ static unsigned long __initdata dma_rese
>  #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
>  
>  #ifdef CONFIG_DEBUG_VM
> +int ll_counter; /* used in include/linux/mm.h, for lockless pagecache */
> +EXPORT_SYMBOL(ll_counter);
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
>  {
>  	int ret = 0;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 6/9] mm: speculative get page
  2007-04-12 12:45 ` [patch 6/9] mm: speculative get page Nick Piggin
@ 2007-04-16 18:54   ` Hugh Dickins
  2007-04-17  3:09     ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Hugh Dickins @ 2007-04-16 18:54 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management, Andrew Morton

> --- linux-2.6.orig/include/linux/pagemap.h
> +++ linux-2.6/include/linux/pagemap.h
> ...
> +static inline int page_cache_get_speculative(struct page *page)
> +{
> +	VM_BUG_ON(in_interrupt());
> +
> +#ifndef CONFIG_SMP
> +# ifdef CONFIG_PREEMPT
> +	VM_BUG_ON(!in_atomic());
> +# endif
> +	/*
> +	 * Preempt must be disabled here - we rely on rcu_read_lock doing
> +	 * this for us.
> +	 *
> +	 * Pagecache won't be truncated from interrupt context, so if we have
> +	 * found a page in the radix tree here, we have pinned its refcount by
> +	 * disabling preempt, and hence no need for the "speculative get" that
> +	 * SMP requires.
> +	 */
> +	VM_BUG_ON(page_count(page) == 0);
> +	atomic_inc(&page->_count);
> +
> +#else
> +	if (unlikely(!get_page_unless_zero(page)))
> +		return 0; /* page has been freed */

Now you're using get_page_unless_zero() here, you need to remove its
	VM_BUG_ON(PageCompound(page));
since hugetlb_nopage() uses find_lock_page() on huge compound pages
and so comes here (and you have a superior VM_BUG_ON further down).

You could move that VM_BUG_ON to its original caller isolate_lru_pages(),
or you could replace it by your superior check in get_page_unless_zero();
but I'd be inclined to do the easiest and just cut it out now.
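
i.e. cutting it down to just this (untested):

static inline int get_page_unless_zero(struct page *page)
{
	return atomic_inc_not_zero(&page->_count);
}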

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [rfc] rename page_count for lockless pagecache
  2007-04-16 18:28         ` Hugh Dickins
@ 2007-04-17  3:07           ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2007-04-17  3:07 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Memory Management, Andrew Morton

On Mon, Apr 16, 2007 at 07:28:33PM +0100, Hugh Dickins wrote:
> On Sat, 14 Apr 2007, Nick Piggin wrote:
> > On Fri, Apr 13, 2007 at 02:13:47PM +0200, Nick Piggin wrote:
> > > On Fri, Apr 13, 2007 at 12:53:05PM +0100, Hugh Dickins wrote:
> > > > Might it be more profitable for a DEBUG mode to inject random
> > > > variations into page_count?
> > > 
> > > I think that's a very fine idea, and much more suitable for an
> > > everyday kernel than my test threads. Doesn't help if they use the
> > > field somehow without the accessors, but we must discourage that.
> > > Thanks, I'll add such a debug mode.
> > 
> > Something like this boots and survives some stress testing here.
> > 
> > I guess it should be under something other than CONFIG_DEBUG_VM,
> > because it could harm performance and scalability significantly on
> > bigger boxes... or maybe it should use per-cpu counters? ;)
> > 
> > --
> > Add some debugging for lockless pagecache as suggested by Hugh.
> > 
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> 
> Hmm, maybe.  Would be rather cleaner if in this case page_count()
> were not inlined but EXPORTed, with the ll_count static within it.
> 
> But I'm not terribly proud of the idea, and wonder whether we just
> forget it?  How are we going to recognize it if this (or your
> lpctest) ever does cause a problem?  Seems like a good thing for
> you or I to try when developing, but whether it should go on into
> the tree I'm less sure.

Yeah, good question. I guess we could hope that if it is used in
drivers or things for something other than page refcounting, they
might stop working in short order.

Otherwise maybe we'd get page leaks... I think this is going to cause
some swapcache pages to "leak" more than usual, but of course they
should end up getting cleaned off the LRU. I guess anything more
obscure (eg. leaks in some driver private pages) might not actually
be noticable unless it was running for a long time. So that's a bit
sad for a debugging option. But it could be useful to catch an out of
tree driver using page_count privately...

I wonder if we could do sweeps over pages that are not free in the
allocator, and BUG if any of them have a zero page count? Hmm, no I
don't know if that can really happen unless someone is abusing
init_page_count.
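
For concreteness, such a sweep might look something like this untested
sketch (telling genuinely free pages apart is the awkward bit):

static void sweep_zero_counts(void)
{
	struct zone *zone;

	for_each_zone(zone) {
		unsigned long pfn;

		for (pfn = zone->zone_start_pfn;
		     pfn < zone->zone_start_pfn + zone->spanned_pages;
		     pfn++) {
			struct page *page;

			if (!pfn_valid(pfn))
				continue;
			page = pfn_to_page(pfn);
			/*
			 * Heads of free blocks are PageBuddy, but their
			 * tail pages are not marked, so this would still
			 * trip on legitimately free memory.
			 */
			if (PageBuddy(page))
				continue;
			WARN_ON(page_count(page) == 0);
		}
	}
}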

I don't know whether it should go into the tree. I think I'd like to
put it in -mm and turn it on for a few releases though.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 6/9] mm: speculative get page
  2007-04-16 18:54   ` Hugh Dickins
@ 2007-04-17  3:09     ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2007-04-17  3:09 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Memory Management, Andrew Morton

On Mon, Apr 16, 2007 at 07:54:03PM +0100, Hugh Dickins wrote:
> > --- linux-2.6.orig/include/linux/pagemap.h
> > +++ linux-2.6/include/linux/pagemap.h
> > ...
> > +static inline int page_cache_get_speculative(struct page *page)
> > +{
> > +	VM_BUG_ON(in_interrupt());
> > +
> > +#ifndef CONFIG_SMP
> > +# ifdef CONFIG_PREEMPT
> > +	VM_BUG_ON(!in_atomic());
> > +# endif
> > +	/*
> > +	 * Preempt must be disabled here - we rely on rcu_read_lock doing
> > +	 * this for us.
> > +	 *
> > +	 * Pagecache won't be truncated from interrupt context, so if we have
> > +	 * found a page in the radix tree here, we have pinned its refcount by
> > +	 * disabling preempt, and hence no need for the "speculative get" that
> > +	 * SMP requires.
> > +	 */
> > +	VM_BUG_ON(page_count(page) == 0);
> > +	atomic_inc(&page->_count);
> > +
> > +#else
> > +	if (unlikely(!get_page_unless_zero(page)))
> > +		return 0; /* page has been freed */
> 
> Now you're using get_page_unless_zero() here, you need to remove its
> 	VM_BUG_ON(PageCompound(page));
> since hugetlb_nopage() uses find_lock_page() on huge compound pages
> and so comes here (and you have a superior VM_BUG_ON further down).

Ah yes, I forgot about that assertion in there. Thanks!

> You could move that VM_BUG_ON to its original caller isolate_lru_pages(),
> or you could replace it by your superior check in get_page_unless_zero();
> but I'd be inclined to do the easiest and just cut it out now.

OK I'll do that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/9] radix-tree: gang slot lookups
  2007-04-12 12:45 ` [patch 3/9] radix-tree: gang slot lookups Nick Piggin
@ 2007-06-01 10:31   ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2007-06-01 10:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management, Andrew Morton

On Thu, 2007-04-12 at 14:45 +0200, Nick Piggin wrote:

> @@ -825,13 +889,21 @@ radix_tree_gang_lookup_tag(struct radix_
>  
>  	ret = 0;
>  	while (ret < max_items) {
> -		unsigned int nr_found;
> +		unsigned int slots_found, nr_found, i;
>  		unsigned long next_index;	/* Index of next search */
>  
>  		if (cur_index > max_index)
>  			break;
> -		nr_found = __lookup_tag(node, results + ret, cur_index,
> -					max_items - ret, &next_index, tag);
> +		slots_found = __lookup_tag(node, (void ***)results + ret,
> +				cur_index, max_items - ret, &next_index, tag);
> +		nr_found = 0;
> +		for (i = 0; i < slots_found; i++) {
> +			node = *((void ***)results)[ret + i];
> +			if (!node)
> +				continue;
> +			results[ret + nr_found] = rcu_dereference(node);

I think this should read (as you correctly did in
radix_tree_gang_lookup):

			struct radix_tree_node *slot;
			slot = *((void ***)results)[ret + i];
			if (!slot)
				continue;
			results[ret + nr_found] = rcu_dereference(slot);

By overwriting node, a subsequent __lookup_tag() will do the darnedest
things.


> +			nr_found++;
> +		}
>  		ret += nr_found;
>  		if (next_index == 0)
>  			break;



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2007-06-01 10:31 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-12 12:44 [patch 0/9] lockless pagecache for 2.6.21-rc6 Nick Piggin
2007-04-12 12:44 ` [patch 1/9] mm: prep find_lock_page Nick Piggin
2007-04-12 12:45 ` [patch 2/9] radix-tree: use indirect bit Nick Piggin
2007-04-12 12:45 ` [patch 3/9] radix-tree: gang slot lookups Nick Piggin
2007-06-01 10:31   ` Peter Zijlstra
2007-04-12 12:45 ` [patch 4/9] mm: __add_to_swap_cache stuff Nick Piggin
2007-04-12 12:45 ` [patch 5/9] mm: lockless probe Nick Piggin
2007-04-12 12:45 ` [patch 6/9] mm: speculative get page Nick Piggin
2007-04-16 18:54   ` Hugh Dickins
2007-04-17  3:09     ` Nick Piggin
2007-04-12 12:45 ` [patch 7/9] mm: lockless pagecache lookups Nick Piggin
2007-04-12 12:46 ` [patch 8/9] mm: spinlock tree_lock Nick Piggin
2007-04-12 12:46 ` [patch 9/9] mm: lockless test threads Nick Piggin
2007-04-13 16:54   ` Hugh Dickins
2007-04-13 23:31     ` Nick Piggin
2007-04-12 12:46 ` [rfc] rename page_count for lockless pagecache Nick Piggin
2007-04-12 17:03   ` Peter Zijlstra
2007-04-12 23:27     ` Nick Piggin
2007-04-13 11:53   ` Hugh Dickins
2007-04-13 12:13     ` Nick Piggin
2007-04-14  2:24       ` Nick Piggin
2007-04-16 18:28         ` Hugh Dickins
2007-04-17  3:07           ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox