* [patch 1/7] mm: readahead scan lockless
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-05 9:43 ` [patch 2/7] radix-tree: add gang_lookup_slot, gang_lookup_slot_tag npiggin
` (8 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: mm-readahead-scan-lockless.patch --]
[-- Type: text/plain, Size: 997 bytes --]
radix_tree_next_hole is implemented as a series of radix_tree_lookup()s. So
it can be called locklessly, under rcu_read_lock().
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -382,9 +382,9 @@ ondemand_readahead(struct address_space
if (hit_readahead_marker) {
pgoff_t start;
- read_lock_irq(&mapping->tree_lock);
- start = radix_tree_next_hole(&mapping->page_tree, offset, max+1);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+ start = radix_tree_next_hole(&mapping->page_tree, offset, max+1);
+ rcu_read_unlock();
if (!start || start - offset > max)
return 0;
--
--
* [patch 2/7] radix-tree: add gang_lookup_slot, gang_lookup_slot_tag
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
2008-06-05 9:43 ` [patch 1/7] mm: readahead scan lockless npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
` (7 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: radix-tree-gang-lookup-slot.patch --]
[-- Type: text/plain, Size: 10807 bytes --]
Introduce gang_lookup_slot and gang_lookup_slot_tag functions, which are used
by lockless pagecache.
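
For illustration, a lookup through the new slot interface under RCU looks
roughly like the sketch below (not part of this patch; the function and
variable names are made up, and the speculative refcount step it alludes to
is only added later in the series):

        /* sketch only: count pages present from 'index', locklessly */
        static unsigned int example_count_present(struct address_space *mapping,
                                                  pgoff_t index)
        {
                void **slots[16];
                unsigned int i, nr, present = 0;

                rcu_read_lock();
                nr = radix_tree_gang_lookup_slot(&mapping->page_tree, slots,
                                                 index, 16);
                for (i = 0; i < nr; i++) {
                        struct page *page = radix_tree_deref_slot(slots[i]);

                        if (!page)
                                continue;  /* slot emptied after the gang lookup */
                        /* a real user would take a speculative reference here
                         * and recheck *slots[i] == page before trusting it */
                        present++;
                }
                rcu_read_unlock();
                return present;
        }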
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -99,12 +99,15 @@ do { \
*
* The notable exceptions to this rule are the following functions:
* radix_tree_lookup
+ * radix_tree_lookup_slot
* radix_tree_tag_get
* radix_tree_gang_lookup
+ * radix_tree_gang_lookup_slot
* radix_tree_gang_lookup_tag
+ * radix_tree_gang_lookup_tag_slot
* radix_tree_tagged
*
- * The first 4 functions are able to be called locklessly, using RCU. The
+ * The first 7 functions are able to be called locklessly, using RCU. The
* caller must ensure calls to these functions are made within rcu_read_lock()
* regions. Other readers (lock-free or otherwise) and modifications may be
* running concurrently.
@@ -159,6 +162,9 @@ void *radix_tree_delete(struct radix_tre
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items);
unsigned long radix_tree_next_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
@@ -173,6 +179,10 @@ unsigned int
radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items,
unsigned int tag);
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items,
+ unsigned int tag);
int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
static inline void radix_tree_preload_end(void)
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -350,18 +350,17 @@ EXPORT_SYMBOL(radix_tree_insert);
* Returns: the slot corresponding to the position @index in the
* radix tree @root. This is useful for update-if-exists operations.
*
- * This function cannot be called under rcu_read_lock, it must be
- * excluded from writers, as must the returned slot for subsequent
- * use by radix_tree_deref_slot() and radix_tree_replace slot.
- * Caller must hold tree write locked across slot lookup and
- * replace.
+ * This function can be called under rcu_read_lock iff the slot is not
+ * modified by radix_tree_replace_slot, otherwise it must be called
+ * exclusive from other writers. Any dereference of the slot must be done
+ * using radix_tree_deref_slot.
*/
void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
unsigned int height, shift;
struct radix_tree_node *node, **slot;
- node = root->rnode;
+ node = rcu_dereference(root->rnode);
if (node == NULL)
return NULL;
@@ -381,7 +380,7 @@ void **radix_tree_lookup_slot(struct rad
do {
slot = (struct radix_tree_node **)
(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
- node = *slot;
+ node = rcu_dereference(*slot);
if (node == NULL)
return NULL;
@@ -658,7 +657,7 @@ unsigned long radix_tree_next_hole(struc
EXPORT_SYMBOL(radix_tree_next_hole);
static unsigned int
-__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
unsigned int nr_found = 0;
@@ -692,11 +691,9 @@ __lookup(struct radix_tree_node *slot, v
/* Bottom level: grab some items */
for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
- struct radix_tree_node *node;
index++;
- node = slot->slots[i];
- if (node) {
- results[nr_found++] = rcu_dereference(node);
+ if (slot->slots[i]) {
+ results[nr_found++] = &(slot->slots[i]);
if (nr_found == max_items)
goto out;
}
@@ -750,13 +747,22 @@ radix_tree_gang_lookup(struct radix_tree
ret = 0;
while (ret < max_items) {
- unsigned int nr_found;
+ unsigned int nr_found, slots_found, i;
unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
- nr_found = __lookup(node, results + ret, cur_index,
+ slots_found = __lookup(node, (void ***)results + ret, cur_index,
max_items - ret, &next_index);
+ nr_found = 0;
+ for (i = 0; i < slots_found; i++) {
+ struct radix_tree_node *slot;
+ slot = *(((void ***)results)[ret + i]);
+ if (!slot)
+ continue;
+ results[ret + nr_found] = rcu_dereference(slot);
+ nr_found++;
+ }
ret += nr_found;
if (next_index == 0)
break;
@@ -767,12 +773,71 @@ radix_tree_gang_lookup(struct radix_tree
}
EXPORT_SYMBOL(radix_tree_gang_lookup);
+/**
+ * radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ *
+ * Performs an index-ascending scan of the tree for present items. Places
+ * their slots at *@results and returns the number of items which were
+ * placed at *@results.
+ *
+ * The implementation is naive.
+ *
+ * Like radix_tree_gang_lookup as far as RCU and locking goes. Slots must
+ * be dereferenced with radix_tree_deref_slot, and if using only RCU
+ * protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items)
+{
+ unsigned long max_index;
+ struct radix_tree_node *node;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = (void **)&root->rnode;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
+ unsigned int slots_found;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ slots_found = __lookup(node, results + ret, cur_index,
+ max_items - ret, &next_index);
+ ret += slots_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
/*
* FIXME: the two tag_get()s here should use find_next_bit() instead of
* open-coding the search.
*/
static unsigned int
-__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index, unsigned int tag)
{
unsigned int nr_found = 0;
@@ -802,11 +867,9 @@ __lookup_tag(struct radix_tree_node *slo
unsigned long j = index & RADIX_TREE_MAP_MASK;
for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
- struct radix_tree_node *node;
index++;
if (!tag_get(slot, tag, j))
continue;
- node = slot->slots[j];
/*
* Even though the tag was found set, we need to
* recheck that we have a non-NULL node, because
@@ -817,9 +880,8 @@ __lookup_tag(struct radix_tree_node *slo
* lookup ->slots[x] without a lock (ie. can't
* rely on its value remaining the same).
*/
- if (node) {
- node = rcu_dereference(node);
- results[nr_found++] = node;
+ if (slot->slots[j]) {
+ results[nr_found++] = &(slot->slots[j]);
if (nr_found == max_items)
goto out;
}
@@ -878,13 +940,22 @@ radix_tree_gang_lookup_tag(struct radix_
ret = 0;
while (ret < max_items) {
- unsigned int nr_found;
+ unsigned int nr_found, slots_found, i;
unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
- nr_found = __lookup_tag(node, results + ret, cur_index,
- max_items - ret, &next_index, tag);
+ slots_found = __lookup_tag(node, (void ***)results + ret,
+ cur_index, max_items - ret, &next_index, tag);
+ nr_found = 0;
+ for (i = 0; i < slots_found; i++) {
+ struct radix_tree_node *slot;
+ slot = *(((void ***)results)[ret + i]);
+ if (!slot)
+ continue;
+ results[ret + nr_found] = rcu_dereference(slot);
+ nr_found++;
+ }
ret += nr_found;
if (next_index == 0)
break;
@@ -896,6 +967,67 @@ radix_tree_gang_lookup_tag(struct radix_
EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
/**
+ * radix_tree_gang_lookup_tag_slot - perform multiple slot lookup on a
+ * radix tree based on a tag
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ * @tag: the tag index (< RADIX_TREE_MAX_TAGS)
+ *
+ * Performs an index-ascending scan of the tree for present items which
+ * have the tag indexed by @tag set. Places the slots at *@results and
+ * returns the number of slots which were placed at *@results.
+ */
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items,
+ unsigned int tag)
+{
+ struct radix_tree_node *node;
+ unsigned long max_index;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ /* check the root's tag bit */
+ if (!root_tag_get(root, tag))
+ return 0;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = (void **)&root->rnode;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
+ unsigned int slots_found;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ slots_found = __lookup_tag(node, results + ret,
+ cur_index, max_items - ret, &next_index, tag);
+ ret += slots_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
+
+
+/**
* radix_tree_shrink - shrink height of a radix tree to minimal
* @root radix tree root
*/
--
--
* [patch 3/7] mm: speculative page references
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
2008-06-05 9:43 ` [patch 1/7] mm: readahead scan lockless npiggin
2008-06-05 9:43 ` [patch 2/7] radix-tree: add gang_lookup_slot, gang_lookup_slot_tag npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-06 14:20 ` Peter Zijlstra
` (2 more replies)
2008-06-05 9:43 ` [patch 4/7] mm: lockless pagecache npiggin
` (6 subsequent siblings)
9 siblings, 3 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: mm-speculative-get_page-hugh.patch --]
[-- Type: text/plain, Size: 13116 bytes --]
If we can be sure that elevating the page_count on a pagecache page will pin
it, we can speculatively run this operation, and subsequently check to see if
we hit the right page rather than relying on holding a lock or otherwise
pinning a reference to the page.
This can be done if get_page/put_page behaves consistently throughout the whole
tree (ie. if we "get" the page after it has been used for something else, we
must be able to free it with a put_page).
Actually, there is a period where the count behaves differently: when the page
is free or if it is a constituent page of a compound page. We need an
atomic_inc_not_zero operation to ensure we don't try to grab the page in either
case.
This patch introduces the core locking protocol to the pagecache (ie. adds
page_cache_get_speculative, and tweaks some update-side code to make it work).
Thanks to Hugh for pointing out an improvement to the algorithm setting
page_count to zero when we have control of all references, in order to hold off
speculative getters.
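
As a rough sketch of the lookup-side protocol (illustration only, with
made-up names; the real find_get_page conversion in a later patch works on
radix tree slots rather than repeating the whole lookup):

        static struct page *example_get_page(struct address_space *mapping,
                                             pgoff_t offset)
        {
                struct page *page;

                rcu_read_lock();
        again:
                page = radix_tree_lookup(&mapping->page_tree, offset);  /* step 1 */
                if (page) {
                        if (!page_cache_get_speculative(page))          /* step 2 */
                                goto again;     /* page was being freed; retry */
                        /* step 3: check the page is still at this offset */
                        if (page != radix_tree_lookup(&mapping->page_tree, offset)) {
                                page_cache_release(page);
                                goto again;
                        }
                }
                rcu_read_unlock();
                return page;
        }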
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -12,6 +12,7 @@
#include <asm/uaccess.h>
#include <linux/gfp.h>
#include <linux/bitops.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -62,6 +63,98 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be called inside the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree (or page table):
+ * this allows allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with, no matter what it is subsequently allocated
+ * for (because put_page is what is used here to drop an invalid speculative
+ * reference).
+ *
+ * This is the interesting part of the lockless pagecache (and lockless
+ * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
+ * has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. check the page is still in pagecache (if no, goto 1)
+ *
+ * Remove-side that cares about stability of _count (eg. reclaim) has the
+ * following (with tree_lock held for write):
+ * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
+ * B. remove page from pagecache
+ * C. free the page
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before A: in this case, A sees elevated refcount and bails out
+ * - A runs before 2: in this case, 2 sees zero refcount and retries;
+ * subsequently, B will complete and 1 will find no page, causing the
+ * lookup to return NULL.
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * such a re-insertion, depending on order that locks are granted.
+ *
+ * Lookups racing against pagecache insertion isn't a big problem: either 1
+ * will find the page or it will not. Likewise, the old find_get_page could run
+ * either before the insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
+#else
+ if (unlikely(!get_page_unless_zero(page))) {
+ /*
+ * Either the page has been freed, or will be freed.
+ * In either case, retry here and the caller should
+ * do the right thing (see comments above).
+ */
+ return 0;
+ }
+#endif
+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+ return 1;
+}
+
+static inline int page_freeze_refs(struct page *page, int count)
+{
+ return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
+}
+
+static inline void page_unfreeze_refs(struct page *page, int count)
+{
+ VM_BUG_ON(page_count(page) != 0);
+ VM_BUG_ON(count == 0);
+
+ atomic_set(&page->_count, count);
+}
+
#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
#else
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -390,12 +390,10 @@ static pageout_t pageout(struct page *pa
}
/*
- * Attempt to detach a locked page from its ->mapping. If it is dirty or if
- * someone else has a ref on the page, abort and return 0. If it was
- * successfully detached, return 1. Assumes the caller has a single ref on
- * this page.
+ * Save as remove_mapping, but if the page is removed from the mapping, it
+ * gets returned with a refcount of 0.
*/
-int remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page)
{
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -426,9 +424,9 @@ int remove_mapping(struct address_space
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
- if (unlikely(page_count(page) != 2))
+ if (!page_freeze_refs(page, 2))
goto cannot_free;
- smp_rmb();
+ /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page)))
goto cannot_free;
@@ -437,13 +435,11 @@ int remove_mapping(struct address_space
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
- __put_page(page); /* The pagecache ref */
- return 1;
+ } else {
+ __remove_from_page_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
}
- __remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
return 1;
cannot_free:
@@ -452,6 +448,26 @@ cannot_free:
}
/*
+ * Attempt to detach a locked page from its ->mapping. If it is dirty or if
+ * someone else has a ref on the page, abort and return 0. If it was
+ * successfully detached, return 1. Assumes the caller has a single ref on
+ * this page.
+ */
+int remove_mapping(struct address_space *mapping, struct page *page)
+{
+ if (__remove_mapping(mapping, page)) {
+ /*
+ * Unfreezing the refcount with 1 rather than 2 effectively
+ * drops the pagecache ref for us without requiring another
+ * atomic operation.
+ */
+ page_unfreeze_refs(page, 1);
+ return 1;
+ }
+ return 0;
+}
+
+/*
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
@@ -597,18 +613,27 @@ static unsigned long shrink_page_list(st
if (PagePrivate(page)) {
if (!try_to_release_page(page, sc->gfp_mask))
goto activate_locked;
- if (!mapping && page_count(page) == 1)
- goto free_it;
+ if (!mapping && page_count(page) == 1) {
+ unlock_page(page);
+ if (put_page_testzero(page))
+ goto free_it;
+ else {
+ nr_reclaimed++;
+ continue;
+ }
+ }
}
- if (!mapping || !remove_mapping(mapping, page))
+ if (!mapping || !__remove_mapping(mapping, page))
goto keep_locked;
free_it:
unlock_page(page);
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page))
- __pagevec_release_nonlru(&freed_pvec);
+ if (!pagevec_add(&freed_pvec, page)) {
+ __pagevec_free(&freed_pvec);
+ pagevec_reinit(&freed_pvec);
+ }
continue;
activate_locked:
@@ -622,7 +647,7 @@ keep:
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
- __pagevec_release_nonlru(&freed_pvec);
+ __pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -466,17 +466,22 @@ int add_to_page_cache(struct page *page,
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ page_cache_get(page);
+ SetPageLocked(page);
+ page->mapping = mapping;
+ page->index = offset;
+
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (!error) {
- page_cache_get(page);
- SetPageLocked(page);
- page->mapping = mapping;
- page->index = offset;
+ if (likely(!error)) {
mapping->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
- } else
+ } else {
+ page->mapping = NULL;
+ ClearPageLocked(page);
mem_cgroup_uncharge_page(page);
+ page_cache_release(page);
+ }
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -76,19 +76,26 @@ int add_to_swap_cache(struct page *page,
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ page_cache_get(page);
+ SetPageSwapCache(page);
+ set_page_private(page, entry.val);
+
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
- if (!error) {
- page_cache_get(page);
- SetPageSwapCache(page);
- set_page_private(page, entry.val);
+ if (likely(!error)) {
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(add_total);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
+
+ if (unlikely(error)) {
+ set_page_private(page, 0UL);
+ ClearPageSwapCache(page);
+ page_cache_release(page);
+ }
}
return error;
}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -304,6 +304,7 @@ out:
static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
{
+ int expected_count;
void **pslot;
if (!mapping) {
@@ -318,12 +319,18 @@ static int migrate_page_move_mapping(str
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
- if (page_count(page) != 2 + !!PagePrivate(page) ||
+ expected_count = 2 + !!PagePrivate(page);
+ if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
write_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
+ if (!page_freeze_refs(page, expected_count)) {
+ write_unlock_irq(&mapping->tree_lock);
+ return -EAGAIN;
+ }
+
/*
* Now we know that no one else is looking at the page.
*/
@@ -337,6 +344,7 @@ static int migrate_page_move_mapping(str
radix_tree_replace_slot(pslot, newpage);
+ page_unfreeze_refs(page, expected_count);
/*
* Drop cache reference from old page.
* We know this isn't the last reference.
Index: linux-2.6/drivers/net/cassini.c
===================================================================
--- linux-2.6.orig/drivers/net/cassini.c
+++ linux-2.6/drivers/net/cassini.c
@@ -576,6 +576,18 @@ static void cas_spare_recover(struct cas
list_for_each_safe(elem, tmp, &list) {
cas_page_t *page = list_entry(elem, cas_page_t, list);
+ /*
+ * With the lockless pagecache, cassini buffering scheme gets
+ * slightly less accurate: we might find that a page has an
+ * elevated reference count here, due to a speculative ref,
+ * and skip it as in-use. Ideally we would be able to reclaim
+ * it. However this would be such a rare case, it doesn't
+ * matter too much as we should pick it up the next time round.
+ *
+ * Importantly, if we find that the page has a refcount of 1
+ * here (our refcount), then we know it is definitely not inuse
+ * so we can reuse it.
+ */
if (page_count(page->buffer) > 1)
continue;
--
--
* Re: [patch 3/7] mm: speculative page references
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
@ 2008-06-06 14:20 ` Peter Zijlstra
2008-06-06 16:26 ` Nick Piggin
2008-06-06 16:27 ` Nick Piggin
2008-06-09 4:48 ` Tim Pepper
2008-06-10 19:08 ` Christoph Lameter
2 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2008-06-06 14:20 UTC (permalink / raw)
To: npiggin
Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus, Paul E McKenney
On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> plain text document attachment (mm-speculative-get_page-hugh.patch)
> +static inline int page_cache_get_speculative(struct page *page)
> +{
> + VM_BUG_ON(in_interrupt());
> +
> +#ifndef CONFIG_SMP
> +# ifdef CONFIG_PREEMPT
> + VM_BUG_ON(!in_atomic());
> +# endif
> + /*
> + * Preempt must be disabled here - we rely on rcu_read_lock doing
> + * this for us.
Preemptible RCU is already in the tree, so I guess you'll have to
explicitly disable preemption if you require it.
--
* Re: [patch 3/7] mm: speculative page references
2008-06-06 14:20 ` Peter Zijlstra
@ 2008-06-06 16:26 ` Nick Piggin
2008-06-06 16:27 ` Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-06 16:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus, Paul E McKenney
On Fri, Jun 06, 2008 at 04:20:04PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> > plain text document attachment (mm-speculative-get_page-hugh.patch)
>
> > +static inline int page_cache_get_speculative(struct page *page)
> > +{
> > + VM_BUG_ON(in_interrupt());
> > +
> > +#ifndef CONFIG_SMP
> > +# ifdef CONFIG_PREEMPT
> > + VM_BUG_ON(!in_atomic());
> > +# endif
> > + /*
> > + * Preempt must be disabled here - we rely on rcu_read_lock doing
> > + * this for us.
>
> Preemptible RCU is already in the tree, so I guess you'll have to
> explicitly disable preemption if you require it.
Oh, of course, I forget about preempt RCU, lucky for the comment.
Good spotting.
--
As per the comment here, we can only use that shortcut if rcu_read_lock
disabled preemption. It would be somewhat annoying to have to put
preempt_disable/preempt_enable around all callers in order to support
this, but preempt RCU isn't going to be hugely performance critical
anyway (and actually it actively trades performance for fewer preempt off
sections), so it can use the slightly slower path quite happily.
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -111,7 +111,7 @@ static inline int page_cache_get_specula
{
VM_BUG_ON(in_interrupt());
-#ifndef CONFIG_SMP
+#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
# ifdef CONFIG_PREEMPT
VM_BUG_ON(!in_atomic());
# endif
--
* Re: [patch 3/7] mm: speculative page references
2008-06-06 14:20 ` Peter Zijlstra
2008-06-06 16:26 ` Nick Piggin
@ 2008-06-06 16:27 ` Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-06 16:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus, Paul E McKenney
On Fri, Jun 06, 2008 at 04:20:04PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> > plain text document attachment (mm-speculative-get_page-hugh.patch)
>
> > +static inline int page_cache_get_speculative(struct page *page)
> > +{
> > + VM_BUG_ON(in_interrupt());
> > +
> > +#ifndef CONFIG_SMP
> > +# ifdef CONFIG_PREEMPT
> > + VM_BUG_ON(!in_atomic());
> > +# endif
> > + /*
> > + * Preempt must be disabled here - we rely on rcu_read_lock doing
> > + * this for us.
>
> Preemptible RCU is already in the tree, so I guess you'll have to
> explicitly disable preemption if you require it.
>
And here is the fix for patch 7/7
--
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -149,7 +149,7 @@ static inline int page_cache_add_specula
{
VM_BUG_ON(in_interrupt());
-#ifndef CONFIG_SMP
+#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
# ifdef CONFIG_PREEMPT
VM_BUG_ON(!in_atomic());
# endif
--
* Re: [patch 3/7] mm: speculative page references
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
2008-06-06 14:20 ` Peter Zijlstra
@ 2008-06-09 4:48 ` Tim Pepper
2008-06-10 19:08 ` Christoph Lameter
2 siblings, 0 replies; 31+ messages in thread
From: Tim Pepper @ 2008-06-09 4:48 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, Jun 5, 2008 at 2:43 AM, <npiggin@suse.de> wrote:
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -390,12 +390,10 @@ static pageout_t pageout(struct page *pa
> }
>
> /*
> - * Attempt to detach a locked page from its ->mapping. If it is dirty or if
> - * someone else has a ref on the page, abort and return 0. If it was
> - * successfully detached, return 1. Assumes the caller has a single ref on
> - * this page.
> + * Save as remove_mapping, but if the page is removed from the mapping, it
> + * gets returned with a refcount of 0.
^^^^^^
Same as?
--
* Re: [patch 3/7] mm: speculative page references
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
2008-06-06 14:20 ` Peter Zijlstra
2008-06-09 4:48 ` Tim Pepper
@ 2008-06-10 19:08 ` Christoph Lameter
2008-06-11 3:19 ` Nick Piggin
2 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-06-10 19:08 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, 5 Jun 2008, npiggin@suse.de wrote:
> + * do the right thing (see comments above).
> + */
> + return 0;
> + }
> +#endif
> + VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
This is easier written as:
== VM_BUG_ON(PageTail(page)
And its also slightly incorrect since page_private(page) is not pointing
to the head page for PageHead(page).
--
* Re: [patch 3/7] mm: speculative page references
2008-06-10 19:08 ` Christoph Lameter
@ 2008-06-11 3:19 ` Nick Piggin
0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 3:19 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 12:08:27PM -0700, Christoph Lameter wrote:
> On Thu, 5 Jun 2008, npiggin@suse.de wrote:
>
> > + * do the right thing (see comments above).
> > + */
> > + return 0;
> > + }
> > +#endif
> > + VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
>
> This is easier written as:
>
> == VM_BUG_ON(PageTail(page)
Yeah that would be nicer.
> And its also slightly incorrect since page_private(page) is not pointing
> to the head page for PageHead(page).
I see. Thanks.
--
* [patch 4/7] mm: lockless pagecache
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (2 preceding siblings ...)
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-05 9:43 ` [patch 5/7] mm: spinlock tree_lock npiggin
` (5 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: mm-lockless-pagecache-lookups.patch --]
[-- Type: text/plain, Size: 6874 bytes --]
Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on
the read-side).
The only atomicity changes this introduces is that the gang pagecache
lookup functions now behave as if they are implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages.
In practice, this atomicity guarantee is not used anyway, and it is
difficult to see how it could be. Gang pagecache lookups are designed
to replace individual lookups, so these semantics are natural.
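
In caller terms (hypothetical caller, for illustration only): each returned
page individually carries the find_get_page guarantee, but the array as a
whole is not an atomic snapshot of the pagecache.

        static void example_walk(struct address_space *mapping, pgoff_t start)
        {
                struct page *pages[16];
                unsigned int i, nr;

                nr = find_get_pages(mapping, start, 16, pages);
                for (i = 0; i < nr; i++) {
                        /* each pages[i] holds its own reference; callers that
                         * need the page to still be in this mapping must
                         * recheck page->mapping under the page lock, exactly
                         * as they did with the locked lookup */
                        page_cache_release(pages[i]);
                }
        }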
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -640,15 +640,35 @@ void __lock_page_nosync(struct page *pag
* Is there a pagecache struct page at the given (mapping, offset) tuple?
* If yes, increment its refcount and return it; if no, return NULL.
*/
-struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
{
+ void **pagep;
struct page *page;
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+repeat:
+ page = NULL;
+ pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
+ if (pagep) {
+ page = radix_tree_deref_slot(pagep);
+ if (unlikely(!page || page == RADIX_TREE_RETRY))
+ goto repeat;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /*
+ * Has the page moved?
+ * This is part of the lockless pagecache protocol. See
+ * include/linux/pagemap.h for details.
+ */
+ if (unlikely(page != *pagep)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+ }
+ rcu_read_unlock();
+
return page;
}
EXPORT_SYMBOL(find_get_page);
@@ -663,32 +683,22 @@ EXPORT_SYMBOL(find_get_page);
*
* Returns zero if the page was not present. find_lock_page() may sleep.
*/
-struct page *find_lock_page(struct address_space *mapping,
- pgoff_t offset)
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
{
struct page *page;
repeat:
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = find_get_page(mapping, offset);
if (page) {
- page_cache_get(page);
- if (TestSetPageLocked(page)) {
- read_unlock_irq(&mapping->tree_lock);
- __lock_page(page);
-
- /* Has the page been truncated while we slept? */
- if (unlikely(page->mapping != mapping)) {
- unlock_page(page);
- page_cache_release(page);
- goto repeat;
- }
- VM_BUG_ON(page->index != offset);
- goto out;
+ lock_page(page);
+ /* Has the page been truncated? */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
}
+ VM_BUG_ON(page->index != offset);
}
- read_unlock_irq(&mapping->tree_lock);
-out:
return page;
}
EXPORT_SYMBOL(find_lock_page);
@@ -754,13 +764,39 @@ unsigned find_get_pages(struct address_s
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, start, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
- read_unlock_irq(&mapping->tree_lock);
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
return ret;
}
@@ -781,19 +817,44 @@ unsigned find_get_pages_contig(struct ad
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, index, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, index, nr_pages);
- for (i = 0; i < ret; i++) {
- if (pages[i]->mapping == NULL || pages[i]->index != index)
+ if (page->mapping == NULL || page->index != index)
break;
- page_cache_get(pages[i]);
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
index++;
}
- read_unlock_irq(&mapping->tree_lock);
- return i;
+ rcu_read_unlock();
+ return ret;
}
EXPORT_SYMBOL(find_get_pages_contig);
@@ -813,15 +874,43 @@ unsigned find_get_pages_tag(struct addre
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree,
+ (void ***)pages, *index, nr_pages, tag);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
- (void **)pages, *index, nr_pages, tag);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
if (ret)
*index = pages[ret - 1]->index + 1;
- read_unlock_irq(&mapping->tree_lock);
+
return ret;
}
EXPORT_SYMBOL(find_get_pages_tag);
--
--
* [patch 5/7] mm: spinlock tree_lock
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (3 preceding siblings ...)
2008-06-05 9:43 ` [patch 4/7] mm: lockless pagecache npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-05 9:43 ` [patch 6/7] powerpc: implement pte_special npiggin
` (4 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: mm-spinlock-tree_lock.patch --]
[-- Type: text/plain, Size: 12281 bytes --]
mapping->tree_lock has no read lockers. Convert the lock from an rwlock
to a spinlock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -706,7 +706,7 @@ static int __set_page_dirty(struct page
if (TestSetPageDirty(page))
return 0;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -719,7 +719,7 @@ static int __set_page_dirty(struct page
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return 1;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -209,7 +209,7 @@ void inode_init_once(struct inode *inode
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
- rwlock_init(&inode->i_data.tree_lock);
+ spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -498,7 +498,7 @@ struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
- rwlock_t tree_lock; /* and rwlock protecting it */
+ spinlock_t tree_lock; /* and lock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -112,7 +112,7 @@ generic_file_direct_IO(int rw, struct ki
/*
* Remove a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
- * is safe. The caller must hold a write_lock on the mapping's tree_lock.
+ * is safe. The caller must hold the mapping's tree_lock.
*/
void __remove_from_page_cache(struct page *page)
{
@@ -144,9 +144,9 @@ void remove_from_page_cache(struct page
BUG_ON(!PageLocked(page));
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}
static int sync_page(void *word)
@@ -471,7 +471,7 @@ int add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = offset;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (likely(!error)) {
mapping->nrpages++;
@@ -483,7 +483,7 @@ int add_to_page_cache(struct page *page,
page_cache_release(page);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
} else
mem_cgroup_uncharge_page(page);
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -39,7 +39,7 @@ static struct backing_dev_info swap_back
struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
+ .tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
@@ -80,7 +80,7 @@ int add_to_swap_cache(struct page *page,
SetPageSwapCache(page);
set_page_private(page, entry.val);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (likely(!error)) {
@@ -88,7 +88,7 @@ int add_to_swap_cache(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(add_total);
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
if (unlikely(error)) {
@@ -182,9 +182,9 @@ void delete_from_swap_cache(struct page
entry.val = page_private(page);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
swap_free(entry);
page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -368,13 +368,13 @@ int remove_exclusive_swap_page(struct pa
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
}
spin_unlock(&swap_lock);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -349,18 +349,18 @@ invalidate_complete_page2(struct address
if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page))
goto failed;
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -398,7 +398,7 @@ static int __remove_mapping(struct addre
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
*
@@ -433,17 +433,17 @@ static int __remove_mapping(struct addre
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
swap_free(swap);
} else {
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}
return 1;
cannot_free:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -1081,7 +1081,7 @@ int __set_page_dirty_nobuffers(struct pa
if (!mapping)
return 1;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -1095,7 +1095,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -1251,7 +1251,7 @@ int test_clear_page_writeback(struct pag
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
if (ret) {
radix_tree_tag_clear(&mapping->page_tree,
@@ -1262,7 +1262,7 @@ int test_clear_page_writeback(struct pag
__bdi_writeout_inc(bdi);
}
}
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
}
@@ -1280,7 +1280,7 @@ int test_set_page_writeback(struct page
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret) {
radix_tree_tag_set(&mapping->page_tree,
@@ -1293,7 +1293,7 @@ int test_set_page_writeback(struct page
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestSetPageWriteback(page);
}
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -421,9 +421,9 @@ static inline void flush_anon_page(struc
}
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_user_range(vma,page,addr,len) \
flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -45,9 +45,9 @@ void flush_cache_mm(struct mm_struct *mm
extern void flush_dcache_page(struct page *page);
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_page(vma,page) do { \
flush_kernel_dcache_page(page); \
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -314,7 +314,7 @@ static int migrate_page_move_mapping(str
return 0;
}
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
@@ -322,12 +322,12 @@ static int migrate_page_move_mapping(str
expected_count = 2 + !!PagePrivate(page);
if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
if (!page_freeze_refs(page, expected_count)) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -364,7 +364,7 @@ static int migrate_page_move_mapping(str
__dec_zone_page_state(page, NR_FILE_PAGES);
__inc_zone_page_state(newpage, NR_FILE_PAGES);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
--
--
* [patch 6/7] powerpc: implement pte_special
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (4 preceding siblings ...)
2008-06-05 9:43 ` [patch 5/7] mm: spinlock tree_lock npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-06 4:04 ` Benjamin Herrenschmidt
2008-06-05 9:43 ` [patch 7/7] powerpc: lockless get_user_pages_fast npiggin
` (3 subsequent siblings)
9 siblings, 1 reply; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: powerpc-implement-pte_special.patch --]
[-- Type: text/plain, Size: 2929 bytes --]
Implement PTE_SPECIAL for powerpc. At the moment I only have a spare bit for
the 4k pages config, but Ben has freed up another one for 64k pages that I
can use, so this patch should include that before it goes upstream.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-powerpc/pgtable-ppc64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-ppc64.h
+++ linux-2.6/include/asm-powerpc/pgtable-ppc64.h
@@ -239,7 +239,7 @@ static inline int pte_write(pte_t pte) {
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
-static inline int pte_special(pte_t pte) { return 0; }
+static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; }
static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -259,7 +259,7 @@ static inline pte_t pte_mkyoung(pte_t pt
static inline pte_t pte_mkhuge(pte_t pte) {
return pte; }
static inline pte_t pte_mkspecial(pte_t pte) {
- return pte; }
+ pte_val(pte) |= _PAGE_SPECIAL; return pte; }
/* Atomic PTE updates */
static inline unsigned long pte_update(struct mm_struct *mm,
Index: linux-2.6/include/asm-powerpc/pgtable-4k.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-4k.h
+++ linux-2.6/include/asm-powerpc/pgtable-4k.h
@@ -45,6 +45,8 @@
#define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */
#define _PAGE_F_SECOND _PAGE_SECONDARY
#define _PAGE_F_GIX _PAGE_GROUP_IX
+#define _PAGE_SPECIAL 0x10000 /* software: special page */
+#define __HAVE_ARCH_PTE_SPECIAL
/* PTE flags to conserve for HPTE identification */
#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | \
Index: linux-2.6/include/asm-powerpc/pgtable-64k.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-64k.h
+++ linux-2.6/include/asm-powerpc/pgtable-64k.h
@@ -74,6 +74,7 @@ static inline struct subpage_prot_table
#define _PAGE_HPTE_SUB0 0x08000000 /* combo only: first sub page */
#define _PAGE_COMBO 0x10000000 /* this is a combo 4k page */
#define _PAGE_4K_PFN 0x20000000 /* PFN is for a single 4k page */
+#define _PAGE_SPECIAL 0x0 /* don't have enough room for this yet */
/* Note the full page bits must be in the same location as for normal
* 4k pages as the same asssembly will be used to insert 64K pages
--
* Re: [patch 6/7] powerpc: implement pte_special
2008-06-05 9:43 ` [patch 6/7] powerpc: implement pte_special npiggin
@ 2008-06-06 4:04 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-06 4:04 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, paulus
On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> plain text document attachment (powerpc-implement-pte_special.patch)
> Implement PTE_SPECIAL for powerpc. At the moment I only have a spare bit for
> the 4k pages config, but Ben has freed up another one for 64k pages that I
> can use, so this patch should include that before it goes upstream.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Ack that bit. _PAGE_SPECIAL will replace _PAGE_HASHPTE on 64K (ie.
0x400). The patch that frees that bit should get into powerpc.git (and
from there -mm) as soon as paulus catches up with his backlog :-)
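In concrete terms -- as an illustrative sketch only, not part of the
posted patch -- once that bit is freed the pgtable-64k.h side would end
up with something like the following (0x400 per the note above; carrying
__HAVE_ARCH_PTE_SPECIAL over to the 64k header is an assumption here):

#define _PAGE_SPECIAL	0x00000400 /* software: special page */
#define __HAVE_ARCH_PTE_SPECIAL

at which point pte_special()/pte_mkspecial() from pgtable-ppc64.h would
work unchanged for both page size configurations.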
Cheers,
Ben.
> ---
> Index: linux-2.6/include/asm-powerpc/pgtable-ppc64.h
> ===================================================================
> --- linux-2.6.orig/include/asm-powerpc/pgtable-ppc64.h
> +++ linux-2.6/include/asm-powerpc/pgtable-ppc64.h
> @@ -239,7 +239,7 @@ static inline int pte_write(pte_t pte) {
> static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
> static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
> static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
> -static inline int pte_special(pte_t pte) { return 0; }
> +static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; }
>
> static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
> static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
> @@ -259,7 +259,7 @@ static inline pte_t pte_mkyoung(pte_t pt
> static inline pte_t pte_mkhuge(pte_t pte) {
> return pte; }
> static inline pte_t pte_mkspecial(pte_t pte) {
> - return pte; }
> + pte_val(pte) |= _PAGE_SPECIAL; return pte; }
>
> /* Atomic PTE updates */
> static inline unsigned long pte_update(struct mm_struct *mm,
> Index: linux-2.6/include/asm-powerpc/pgtable-4k.h
> ===================================================================
> --- linux-2.6.orig/include/asm-powerpc/pgtable-4k.h
> +++ linux-2.6/include/asm-powerpc/pgtable-4k.h
> @@ -45,6 +45,8 @@
> #define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */
> #define _PAGE_F_SECOND _PAGE_SECONDARY
> #define _PAGE_F_GIX _PAGE_GROUP_IX
> +#define _PAGE_SPECIAL 0x10000 /* software: special page */
> +#define __HAVE_ARCH_PTE_SPECIAL
>
> /* PTE flags to conserve for HPTE identification */
> #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | \
> Index: linux-2.6/include/asm-powerpc/pgtable-64k.h
> ===================================================================
> --- linux-2.6.orig/include/asm-powerpc/pgtable-64k.h
> +++ linux-2.6/include/asm-powerpc/pgtable-64k.h
> @@ -74,6 +74,7 @@ static inline struct subpage_prot_table
> #define _PAGE_HPTE_SUB0 0x08000000 /* combo only: first sub page */
> #define _PAGE_COMBO 0x10000000 /* this is a combo 4k page */
> #define _PAGE_4K_PFN 0x20000000 /* PFN is for a single 4k page */
> +#define _PAGE_SPECIAL 0x0 /* don't have enough room for this yet */
>
> /* Note the full page bits must be in the same location as for normal
> * 4k pages as the same asssembly will be used to insert 64K pages
>
* [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (5 preceding siblings ...)
2008-06-05 9:43 ` [patch 6/7] powerpc: implement pte_special npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-09 8:32 ` Andrew Morton
2008-06-10 19:00 ` Christoph Lameter
2008-06-05 11:53 ` [patch 0/7] speculative page references, lockless pagecache, lockless gup Nick Piggin
` (2 subsequent siblings)
9 siblings, 2 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: powerpc-fast_gup.patch --]
[-- Type: text/plain, Size: 8461 bytes --]
Implement lockless get_user_pages_fast for powerpc. Page table existence is
guaranteed with RCU, and speculative page references are used to take a
reference to the pages without having a prior existence guarantee on them.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-powerpc/uaccess.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/uaccess.h
+++ linux-2.6/include/asm-powerpc/uaccess.h
@@ -493,6 +493,12 @@ static inline int strnlen_user(const cha
#define strlen_user(str) strnlen_user((str), 0x7ffffffe)
+#ifdef __powerpc64__
+#define __HAVE_ARCH_GET_USER_PAGES_FAST
+struct page;
+int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages);
+#endif
+
#endif /* __ASSEMBLY__ */
#endif /* __KERNEL__ */
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -244,7 +244,7 @@ static inline int put_page_testzero(stru
*/
static inline int get_page_unless_zero(struct page *page)
{
- VM_BUG_ON(PageTail(page));
+ VM_BUG_ON(PageCompound(page));
return atomic_inc_not_zero(&page->_count);
}
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -142,6 +142,29 @@ static inline int page_cache_get_specula
return 1;
}
+/*
+ * Same as above, but add instead of inc (could just be merged)
+ */
+static inline int page_cache_add_speculative(struct page *page, int count)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_add(count, &page->_count);
+
+#else
+ if (unlikely(!atomic_add_unless(&page->_count, count, 0)))
+ return 0;
+#endif
+ VM_BUG_ON(PageCompound(page) && page != compound_head(page));
+
+ return 1;
+}
+
static inline int page_freeze_refs(struct page *page, int count)
{
return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
Index: linux-2.6/arch/powerpc/mm/Makefile
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/Makefile
+++ linux-2.6/arch/powerpc/mm/Makefile
@@ -6,7 +6,7 @@ ifeq ($(CONFIG_PPC64),y)
EXTRA_CFLAGS += -mno-minimal-toc
endif
-obj-y := fault.o mem.o \
+obj-y := fault.o mem.o gup.o \
init_$(CONFIG_WORD_SIZE).o \
pgtable_$(CONFIG_WORD_SIZE).o \
mmu_context_$(CONFIG_WORD_SIZE).o
Index: linux-2.6/arch/powerpc/mm/gup.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/powerpc/mm/gup.c
@@ -0,0 +1,230 @@
+/*
+ * Lockless get_user_pages_fast for powerpc
+ *
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/pagemap.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
+ unsigned long end, int write, struct page **pages, int *nr)
+{
+ unsigned long mask, result;
+ pte_t *ptep;
+
+ result = _PAGE_PRESENT|_PAGE_USER;
+ if (write)
+ result |= _PAGE_RW;
+ mask = result | _PAGE_SPECIAL;
+
+ ptep = pte_offset_kernel(&pmd, addr);
+ do {
+ pte_t pte = *ptep;
+ struct page *page;
+
+ if ((pte_val(pte) & mask) != result)
+ return 0;
+ VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ page = pte_page(pte);
+ if (!page_cache_get_speculative(page))
+ return 0;
+ if (unlikely(pte != *ptep)) {
+ put_page(page);
+ return 0;
+ }
+ pages[*nr] = page;
+ (*nr)++;
+
+ } while (ptep++, addr += PAGE_SIZE, addr != end);
+
+ return 1;
+}
+
+static noinline int gup_huge_pte(pte_t *ptep, unsigned long *addr,
+ unsigned long end, int write, struct page **pages, int *nr)
+{
+ unsigned long mask;
+ unsigned long pte_end;
+ struct page *head, *page;
+ pte_t pte;
+ int refs;
+
+ pte_end = (*addr + HPAGE_SIZE) & HPAGE_MASK;
+ if (pte_end < end)
+ end = pte_end;
+
+ pte = *ptep;
+ mask = _PAGE_PRESENT|_PAGE_USER;
+ if (write)
+ mask |= _PAGE_RW;
+ if ((pte_val(pte) & mask) != mask)
+ return 0;
+ /* hugepages are never "special" */
+ VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+ refs = 0;
+ head = pte_page(pte);
+ page = head + ((*addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+ do {
+ VM_BUG_ON(compound_head(page) != head);
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ refs++;
+ } while (*addr += PAGE_SIZE, *addr != end);
+
+ if (!page_cache_add_speculative(head, refs)) {
+ *nr -= refs;
+ return 0;
+ }
+ if (unlikely(pte != *ptep)) {
+ /* Could be optimized better */
+ while (*nr) {
+ put_page(page);
+ (*nr)--;
+ }
+ }
+
+ return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ pmd_t *pmdp;
+
+ pmdp = pmd_offset(&pud, addr);
+ do {
+ pmd_t pmd = *pmdp;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_none(pmd))
+ return 0;
+ if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+ return 0;
+ } while (pmdp++, addr = next, addr != end);
+
+ return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ pud_t *pudp;
+
+ pudp = pud_offset(&pgd, addr);
+ do {
+ pud_t pud = *pudp;
+
+ next = pud_addr_end(addr, end);
+ if (pud_none(pud))
+ return 0;
+ if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+ return 0;
+ } while (pudp++, addr = next, addr != end);
+
+ return 1;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long end = start + (nr_pages << PAGE_SHIFT);
+ unsigned long addr = start;
+ unsigned long next;
+ pgd_t *pgdp;
+ int nr = 0;
+
+
+ if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+ start, nr_pages*PAGE_SIZE)))
+ goto slow_irqon;
+
+ /* Cross a slice boundary? */
+ if (unlikely(addr < SLICE_LOW_TOP && end >= SLICE_LOW_TOP))
+ goto slow_irqon;
+
+ /*
+ * XXX: batch / limit 'nr', to avoid large irq off latency
+ * needs some instrumenting to determine the common sizes used by
+ * important workloads (eg. DB2), and whether limiting the batch size
+ * will decrease performance.
+ *
+ * It seems like we're in the clear for the moment. Direct-IO is
+ * the main guy that batches up lots of get_user_pages, and even
+ * they are limited to 64-at-a-time which is not so many.
+ */
+ /*
+ * This doesn't prevent pagetable teardown, but does prevent
+ * the pagetables from being freed on powerpc.
+ *
+ * So long as we atomically load page table pointers versus teardown,
+ * we can follow the address down to the page and take a ref on it.
+ */
+ local_irq_disable();
+
+ if (get_slice_psize(mm, addr) == mmu_huge_psize) {
+ pte_t *ptep;
+ unsigned long a = addr;
+
+ ptep = huge_pte_offset(mm, a);
+ do {
+ if (!gup_huge_pte(ptep, &a, end, write, pages, &nr))
+ goto slow;
+ ptep++;
+ } while (a != end);
+ } else {
+ pgdp = pgd_offset(mm, addr);
+ do {
+ pgd_t pgd = *pgdp;
+
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(pgd))
+ goto slow;
+ if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ goto slow;
+ } while (pgdp++, addr = next, addr != end);
+ }
+ local_irq_enable();
+
+ VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+ return nr;
+
+ {
+ int ret;
+
+slow:
+ local_irq_enable();
+slow_irqon:
+ /* Try to get the remaining pages with get_user_pages */
+ start += nr << PAGE_SHIFT;
+ pages += nr;
+
+ down_read(&mm->mmap_sem);
+ ret = get_user_pages(current, mm, start,
+ (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+ up_read(&mm->mmap_sem);
+
+ /* Have to be a bit careful with return values */
+ if (nr > 0) {
+ if (ret < 0)
+ ret = nr;
+ else
+ ret += nr;
+ }
+
+ return ret;
+ }
+}
--
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-05 9:43 ` [patch 7/7] powerpc: lockless get_user_pages_fast npiggin
@ 2008-06-09 8:32 ` Andrew Morton
2008-06-10 3:15 ` Nick Piggin
2008-06-10 19:00 ` Christoph Lameter
1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2008-06-09 8:32 UTC (permalink / raw)
To: npiggin; +Cc: torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, 05 Jun 2008 19:43:07 +1000 npiggin@suse.de wrote:
> Implement lockless get_user_pages_fast for powerpc. Page table existence is
> guaranteed with RCU, and speculative page references are used to take a
> reference to the pages without having a prior existence guarantee on them.
>
arch/powerpc/mm/gup.c: In function `get_user_pages_fast':
arch/powerpc/mm/gup.c:156: error: `SLICE_LOW_TOP' undeclared (first use in this function)
arch/powerpc/mm/gup.c:156: error: (Each undeclared identifier is reported only once
arch/powerpc/mm/gup.c:156: error: for each function it appears in.)
arch/powerpc/mm/gup.c:178: error: implicit declaration of function `get_slice_psize'
arch/powerpc/mm/gup.c:178: error: `mmu_huge_psize' undeclared (first use in this function)
arch/powerpc/mm/gup.c:182: error: implicit declaration of function `huge_pte_offset'
arch/powerpc/mm/gup.c:182: warning: assignment makes pointer from integer without a cast
with
http://userweb.kernel.org/~akpm/config-g5.txt
I don't immediately know why - adding asm/page.h to gup.c doesn't help.
I'm suspecting a recursive include problem somewhere.
I'll drop it, sorry - too much other stuff to fix over here.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-09 8:32 ` Andrew Morton
@ 2008-06-10 3:15 ` Nick Piggin
0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-10 3:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: torvalds, linux-mm, linux-kernel, benh, paulus
On Mon, Jun 09, 2008 at 01:32:04AM -0700, Andrew Morton wrote:
> On Thu, 05 Jun 2008 19:43:07 +1000 npiggin@suse.de wrote:
>
> > Implement lockless get_user_pages_fast for powerpc. Page table existence is
> > guaranteed with RCU, and speculative page references are used to take a
> > reference to the pages without having a prior existence guarantee on them.
> >
>
> arch/powerpc/mm/gup.c: In function `get_user_pages_fast':
> arch/powerpc/mm/gup.c:156: error: `SLICE_LOW_TOP' undeclared (first use in this function)
> arch/powerpc/mm/gup.c:156: error: (Each undeclared identifier is reported only once
> arch/powerpc/mm/gup.c:156: error: for each function it appears in.)
> arch/powerpc/mm/gup.c:178: error: implicit declaration of function `get_slice_psize'
> arch/powerpc/mm/gup.c:178: error: `mmu_huge_psize' undeclared (first use in this function)
> arch/powerpc/mm/gup.c:182: error: implicit declaration of function `huge_pte_offset'
> arch/powerpc/mm/gup.c:182: warning: assignment makes pointer from integer without a cast
>
> with
>
> http://userweb.kernel.org/~akpm/config-g5.txt
>
> I don't immediately know why - adding asm/page.h to gup.c doesn't help.
> I'm suspecting a recursive include problem somewhere.
>
> I'll drop it, sorry - too much other stuff to fix over here.
No problem. Likely a clash with the hugepage patches.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-05 9:43 ` [patch 7/7] powerpc: lockless get_user_pages_fast npiggin
2008-06-09 8:32 ` Andrew Morton
@ 2008-06-10 19:00 ` Christoph Lameter
2008-06-11 3:18 ` Nick Piggin
1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-06-10 19:00 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, 5 Jun 2008, npiggin@suse.de wrote:
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -244,7 +244,7 @@ static inline int put_page_testzero(stru
> */
> static inline int get_page_unless_zero(struct page *page)
> {
> - VM_BUG_ON(PageTail(page));
> + VM_BUG_ON(PageCompound(page));
> return atomic_inc_not_zero(&page->_count);
> }
This is reversing the modification to make get_page_unless_zero() usable
with compound page heads. Will break the slab defrag patchset.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-10 19:00 ` Christoph Lameter
@ 2008-06-11 3:18 ` Nick Piggin
2008-06-11 4:40 ` Christoph Lameter
0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 3:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 12:00:48PM -0700, Christoph Lameter wrote:
> On Thu, 5 Jun 2008, npiggin@suse.de wrote:
>
> > Index: linux-2.6/include/linux/mm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/mm.h
> > +++ linux-2.6/include/linux/mm.h
> > @@ -244,7 +244,7 @@ static inline int put_page_testzero(stru
> > */
> > static inline int get_page_unless_zero(struct page *page)
> > {
> > - VM_BUG_ON(PageTail(page));
> > + VM_BUG_ON(PageCompound(page));
> > return atomic_inc_not_zero(&page->_count);
> > }
>
> This is reversing the modification to make get_page_unless_zero() usable
> with compound page heads. Will break the slab defrag patchset.
Is the slab defrag patchset in -mm? Because you ignored my comment about
this change that assertions should not be weakened until required by the
actual patchset. I wanted to have these assertions be as strong as
possible for the lockless pagecache patchset.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 3:18 ` Nick Piggin
@ 2008-06-11 4:40 ` Christoph Lameter
2008-06-11 4:41 ` Christoph Lameter
2008-06-11 4:47 ` Nick Piggin
0 siblings, 2 replies; 31+ messages in thread
From: Christoph Lameter @ 2008-06-11 4:40 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Wed, 11 Jun 2008, Nick Piggin wrote:
> > This is reversing the modification to make get_page_unless_zero() usable
> > with compound page heads. Will break the slab defrag patchset.
>
> Is the slab defrag patchset in -mm? Because you ignored my comment about
> this change that assertions should not be weakened until required by the
> actual patchset. I wanted to have these assertions be as strong as
> possible for the lockless pagecache patchset.
So you are worried about accidentally using get_page_unless_zero on a
compound page? What would be wrong about that?
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 4:40 ` Christoph Lameter
@ 2008-06-11 4:41 ` Christoph Lameter
2008-06-11 4:49 ` Nick Piggin
2008-06-11 4:47 ` Nick Piggin
1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-06-11 4:41 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
And yes slab defrag is part of linux-next. So it would break.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 4:41 ` Christoph Lameter
@ 2008-06-11 4:49 ` Nick Piggin
2008-06-11 6:06 ` Andrew Morton
0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 4:49 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 09:41:33PM -0700, Christoph Lameter wrote:
> And yes slab defrag is part of linux-next. So it would break.
Can memory management patches go through mm/? I dislike the cowboy
method of merging things that some other subsystems have adopted :)
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 4:49 ` Nick Piggin
@ 2008-06-11 6:06 ` Andrew Morton
2008-06-11 6:24 ` Nick Piggin
2008-06-11 23:20 ` Christoph Lameter
0 siblings, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2008-06-11 6:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, torvalds, linux-mm, linux-kernel, benh, paulus
On Wed, 11 Jun 2008 06:49:02 +0200 Nick Piggin <npiggin@suse.de> wrote:
> On Tue, Jun 10, 2008 at 09:41:33PM -0700, Christoph Lameter wrote:
> > And yes slab defrag is part of linux-next. So it would break.
No, slab defrag[*] isn't in linux-next.
y:/usr/src/25> diffstat patches/linux-next.patch| grep mm/slub.c
mm/slub.c | 4
That's two spelling fixes in comments.
I have git-pekka in -mm too. Here it is:
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2765,6 +2765,7 @@ void kfree(const void *x)
page = virt_to_head_page(x);
if (unlikely(!PageSlab(page))) {
+ BUG_ON(!PageCompound(page));
put_page(page);
return;
}
> Can memory management patches go through mm/? I dislike the cowboy
> method of merging things that some other subsystems have adopted :)
I think I'd prefer that. I may be a bit slow, but we're shoving at
least 100 MM patches through each kernel release and I think I review
things more closely than others choose to. At least, I find problems
and I've seen some pretty wild acked-bys...
[*] It _isn't_ "slab defrag". Or at least, it wasn't last time I saw
it. It's "slub defrag". And IMO it is bad to be adding slub-only
features because afaik slub still isn't as fast as slab on some things
and so some people might want to run slab rather than slub. And
because of this the decision whether to retain slab or slub STILL
hasn't been made. Carrying both versions was supposed to be a
short-term transitional thing :(
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 6:06 ` Andrew Morton
@ 2008-06-11 6:24 ` Nick Piggin
2008-06-11 6:50 ` Andrew Morton
2008-06-11 23:20 ` Christoph Lameter
1 sibling, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 6:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 11:06:22PM -0700, Andrew Morton wrote:
> On Wed, 11 Jun 2008 06:49:02 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > Can memory management patches go through mm/? I dislike the cowboy
^^^
That should read -mm, of course.
> > method of merging things that some other subsystems have adopted :)
>
> I think I'd prefer that. I may be a bit slow, but we're shoving at
> least 100 MM patches through each kernel release and I think I review
> things more closely than others choose to. At least, I find problems
> and I've seen some pretty wild acked-bys...
I wouldn't say you're too slow. You're as close to mm and mm/fs
maintainer as we're likely to get and I think it would be much worse
to have things merged out-of-band. Even the more peripheral parts like
slab or hugetlb.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 6:24 ` Nick Piggin
@ 2008-06-11 6:50 ` Andrew Morton
0 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2008-06-11 6:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, torvalds, linux-mm, linux-kernel, benh, paulus
On Wed, 11 Jun 2008 08:24:04 +0200 Nick Piggin <npiggin@suse.de> wrote:
> On Tue, Jun 10, 2008 at 11:06:22PM -0700, Andrew Morton wrote:
> > On Wed, 11 Jun 2008 06:49:02 +0200 Nick Piggin <npiggin@suse.de> wrote:
> >
> > > Can memory management patches go through mm/? I dislike the cowboy
> ^^^
> That should read -mm, of course.
>
-mm is looking awfully peripheral nowadays. I really need to get my
linux-next act together. Instead I'll be taking all next week off.
nyer nyer.
>
> > > method of merging things that some other subsystems have adopted :)
> >
> > I think I'd prefer that. I may be a bit slow, but we're shoving at
> > least 100 MM patches through each kernel release and I think I review
> > things more closely than others choose to. At least, I find problems
> > and I've seen some pretty wild acked-bys...
>
> I wouldn't say you're too slow. You're as close to mm and mm/fs
> maintainer as we're likely to get and I think it would be much worse
> to have things merged out-of-band. Even the more peripheral parts like
> slab or hugetlb.
Sigh. I feel guilty when spending time merging (for example)
random-usb-patches-in-case-greg-misses-them, but such is life. Some
help reviewing things would be nice.
A lot of my review is now of the "how the heck is anyone to understand
this in a year's time if I can't understand it now" variety, but I hope
that's useful...
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 6:06 ` Andrew Morton
2008-06-11 6:24 ` Nick Piggin
@ 2008-06-11 23:20 ` Christoph Lameter
1 sibling, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2008-06-11 23:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, 10 Jun 2008, Andrew Morton wrote:
> hasn't been made. Carrying both versions was supposed to be a
> short-term transitional thing :(
The whatever defrag patchset includes a patch to make SLAB
experimental. So one step further.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 4:40 ` Christoph Lameter
2008-06-11 4:41 ` Christoph Lameter
@ 2008-06-11 4:47 ` Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 4:47 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 09:40:25PM -0700, Christoph Lameter wrote:
> On Wed, 11 Jun 2008, Nick Piggin wrote:
>
> > > This is reversing the modification to make get_page_unless_zero() usable
> > > with compound page heads. Will break the slab defrag patchset.
> >
> > Is the slab defrag patchset in -mm? Because you ignored my comment about
> > this change that assertions should not be weakened until required by the
> > actual patchset. I wanted to have these assertions be as strong as
> > possible for the lockless pagecache patchset.
>
> So you are worried about accidentally using get_page_unless_zero on a
> compound page? What would be wrong about that?
Unexpected. Compound pages should have none of the races that require
get_page_unless_zero, which we use very carefully in page reclaim.
If you don't actually know whether you have a reference to the
thing or not before trying to operate on it, then you've almost
certainly got the refcounting wrong. How does slab defrag use it?
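For reference, here is the pattern being argued about, written out as a
standalone userspace sketch rather than kernel code (the names and the
simplified refcount are made up for illustration; the kernel
counterparts are atomic_inc_not_zero()/get_page_unless_zero() plus the
re-check that callers of page_cache_get_speculative() do, as in the gup
patch above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_int refcount;		/* analogous to page->_count */
};

/* analogous to atomic_inc_not_zero() / get_page_unless_zero() */
static bool get_obj_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;	/* pinned: freeing is now blocked */
	}
	return false;			/* count already hit zero: don't resurrect it */
}

static void put_obj(struct obj *o)
{
	atomic_fetch_sub(&o->refcount, 1);	/* actual freeing path elided */
}

/* analogous to a lockless find_get_page(): pin speculatively, then re-validate */
static struct obj *lookup_speculative(struct obj *_Atomic *slot)
{
	struct obj *o = atomic_load(slot);

	if (o == NULL || !get_obj_unless_zero(o))
		return NULL;
	if (atomic_load(slot) != o) {	/* raced with removal or reuse */
		put_obj(o);
		return NULL;
	}
	return o;
}

The re-check against the slot is what makes the speculative get safe;
the point above is that compound pages are not expected to be looked up
this way at all, because whoever operates on them should already hold a
reference.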
* Re: [patch 0/7] speculative page references, lockless pagecache, lockless gup
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (6 preceding siblings ...)
2008-06-05 9:43 ` [patch 7/7] powerpc: lockless get_user_pages_fast npiggin
@ 2008-06-05 11:53 ` Nick Piggin
2008-06-05 17:33 ` Linus Torvalds
2008-06-06 21:32 ` Peter Zijlstra
9 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-05 11:53 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thursday 05 June 2008 19:43, npiggin@suse.de wrote:
> Hi,
>
> I've decided to submit the speculative page references patch to get merged.
> I think I've now got enough reasons to get it merged. Well... I always
> thought I did, I just didn't think anyone else thought I did. If you know
> what I mean.
>
> cc'ing the powerpc guys specifically because everyone else who probably
> cares should be on linux-mm...
>
> So speculative page references are required to support lockless pagecache
> and lockless get_user_pages (on architectures that can't use the x86
> trick). Other uses for speculative page references could also pop up, it is
> a pretty useful concept. Doesn't need to be pagecache pages either.
>
> Anyway,
>
> lockless pagecache:
> - speeds up single threaded pagecache lookup operations significantly, by
> avoiding atomic operations, memory barriers, and interrupts-off sections.
> I just measured again on a few CPUs I have lying around here, and the
> speedup is over 2x reduction in cycles on them all, closer to 3x in some
> cases.
>
> find_get_page takes (cycles):
>              ppc970 (g5)    K10    P4 Nocona    Core2
> vanilla          275         85       315        143
> lockless         125         40       127         61
>
> - speeds up single threaded pagecache modification operations, by using
> regular spinlocks rather than rwlocks and avoiding an atomic operation
> on x86 for one. Also, most real paths which involve pagecache
> modification also involve pagecache lookups, so it is hard not to get a net
> speedup.
>
> - solves the rwlock starvation problem for pagecache operations. This is
> being noticed on big SGI systems, but theoretically could happen on
> relatively small systems (dozens of CPUs) due to the really nasty
> writer starvation problem of rwlocks -- not even hardware fairness can
> solve that.
>
> - improves pagecache scalability to operations on a single file. I
> demonstrated page faults to a single file were improved in throughput
> by 250x on a 64-way Altix several years ago. We now have systems with
> thousands of CPUs in them.
Oh that's actually another thing I remember now that I posted the scalable
vmap code...
The lock I ended up hitting next in the XFS large directory workload that
improved so much with the vmap patches was tree_lock of the buffer cache.
So lockless pagecache gave a reasonable improvement there too IIRC :)
* Re: [patch 0/7] speculative page references, lockless pagecache, lockless gup
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (7 preceding siblings ...)
2008-06-05 11:53 ` [patch 0/7] speculative page references, lockless pagecache, lockless gup Nick Piggin
@ 2008-06-05 17:33 ` Linus Torvalds
2008-06-06 0:08 ` Nick Piggin
2008-06-06 21:32 ` Peter Zijlstra
9 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-06-05 17:33 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, linux-kernel, benh, paulus
On Thu, 5 Jun 2008, npiggin@suse.de wrote:
>
> I've decided to submit the speculative page references patch to get merged.
> I think I've now got enough reasons to get it merged. Well... I always
> thought I did, I just didn't think anyone else thought I did. If you know
> what I mean.
So I'd certainly like to see these early in the 2.6.27 series.
Nick, will you just re-send them once 2.6.26 is out? Or do they cause
problems for Andrew and he wants to be part of the chain? I'm fine with
either.
Linus
* Re: [patch 0/7] speculative page references, lockless pagecache, lockless gup
2008-06-05 17:33 ` Linus Torvalds
@ 2008-06-06 0:08 ` Nick Piggin
0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-06 0:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: akpm, linux-mm, linux-kernel, benh, paulus
On Thu, Jun 05, 2008 at 10:33:15AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 5 Jun 2008, npiggin@suse.de wrote:
> >
> > I've decided to submit the speculative page references patch to get merged.
> > I think I've now got enough reasons to get it merged. Well... I always
> > thought I did, I just didn't think anyone else thought I did. If you know
> > what I mean.
>
> So I'd certainly like to see these early in the 2.6.27 series.
Oh good ;) So would I!
> Nick, will you just re-send them once 2.6.26 is out? Or do they cause
> problems for Andrew and he wants to be part of the chain? I'm fine with
> either.
Andrew has picked them up by the looks, and he's my favoured channel to
get mm work merged. Let's see how things go between now and 2.6.26,
which I assume should be a few weeks away?
* Re: [patch 0/7] speculative page references, lockless pagecache, lockless gup
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (8 preceding siblings ...)
2008-06-05 17:33 ` Linus Torvalds
@ 2008-06-06 21:32 ` Peter Zijlstra
9 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2008-06-06 21:32 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> Hi,
>
> I've decided to submit the speculative page references patch to get merged.
> I think I've now got enough reasons to get it merged. Well... I always
> thought I did, I just didn't think anyone else thought I did. If you know
> what I mean.
>
> cc'ing the powerpc guys specifically because everyone else who probably
> cares should be on linux-mm...
>
> So speculative page references are required to support lockless pagecache and
> lockless get_user_pages (on architectures that can't use the x86 trick). Other
> uses for speculative page references could also pop up, it is a pretty useful
> concept. Doesn't need to be pagecache pages either.
For patches 1-5
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>