* [patch 1/7] mm: readahead scan lockless
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-05 9:43 ` [patch 2/7] radix-tree: add gang_lookup_slot, gang_lookup_slot_tag npiggin
` (8 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: mm-readahead-scan-lockless.patch --]
[-- Type: text/plain, Size: 997 bytes --]
radix_tree_next_hole is implemented as a series of radix_tree_lookup()s. So
it can be called locklessly, under rcu_read_lock().
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -382,9 +382,9 @@ ondemand_readahead(struct address_space
if (hit_readahead_marker) {
pgoff_t start;
- read_lock_irq(&mapping->tree_lock);
- start = radix_tree_next_hole(&mapping->page_tree, offset, max+1);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+ start = radix_tree_next_hole(&mapping->page_tree, offset, max+1);
+ rcu_read_unlock();
if (!start || start - offset > max)
return 0;
--
--
* [patch 2/7] radix-tree: add gang_lookup_slot, gang_lookup_slot_tag
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
2008-06-05 9:43 ` [patch 1/7] mm: readahead scan lockless npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
` (7 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: radix-tree-gang-lookup-slot.patch --]
[-- Type: text/plain, Size: 10807 bytes --]
Introduce gang_lookup_slot and gang_lookup_slot_tag functions, which are used
by lockless pagecache.
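
For illustration, a lookup through the new slot interface under RCU looks
roughly like the sketch below (not part of this patch; the function and
variable names are made up, and the speculative refcount step it alludes to
is only added later in the series):

        /* sketch only: count pages present from 'index', locklessly */
        static unsigned int example_count_present(struct address_space *mapping,
                                                  pgoff_t index)
        {
                void **slots[16];
                unsigned int i, nr, present = 0;

                rcu_read_lock();
                nr = radix_tree_gang_lookup_slot(&mapping->page_tree, slots,
                                                 index, 16);
                for (i = 0; i < nr; i++) {
                        struct page *page = radix_tree_deref_slot(slots[i]);

                        if (!page)
                                continue;  /* slot emptied after the gang lookup */
                        /* a real user would take a speculative reference here
                         * and recheck *slots[i] == page before trusting it */
                        present++;
                }
                rcu_read_unlock();
                return present;
        }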
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -99,12 +99,15 @@ do { \
*
* The notable exceptions to this rule are the following functions:
* radix_tree_lookup
+ * radix_tree_lookup_slot
* radix_tree_tag_get
* radix_tree_gang_lookup
+ * radix_tree_gang_lookup_slot
* radix_tree_gang_lookup_tag
+ * radix_tree_gang_lookup_tag_slot
* radix_tree_tagged
*
- * The first 4 functions are able to be called locklessly, using RCU. The
+ * The first 7 functions are able to be called locklessly, using RCU. The
* caller must ensure calls to these functions are made within rcu_read_lock()
* regions. Other readers (lock-free or otherwise) and modifications may be
* running concurrently.
@@ -159,6 +162,9 @@ void *radix_tree_delete(struct radix_tre
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items);
unsigned long radix_tree_next_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
@@ -173,6 +179,10 @@ unsigned int
radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items,
unsigned int tag);
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items,
+ unsigned int tag);
int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
static inline void radix_tree_preload_end(void)
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -350,18 +350,17 @@ EXPORT_SYMBOL(radix_tree_insert);
* Returns: the slot corresponding to the position @index in the
* radix tree @root. This is useful for update-if-exists operations.
*
- * This function cannot be called under rcu_read_lock, it must be
- * excluded from writers, as must the returned slot for subsequent
- * use by radix_tree_deref_slot() and radix_tree_replace slot.
- * Caller must hold tree write locked across slot lookup and
- * replace.
+ * This function can be called under rcu_read_lock iff the slot is not
+ * modified by radix_tree_replace_slot, otherwise it must be called
+ * exclusive from other writers. Any dereference of the slot must be done
+ * using radix_tree_deref_slot.
*/
void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
unsigned int height, shift;
struct radix_tree_node *node, **slot;
- node = root->rnode;
+ node = rcu_dereference(root->rnode);
if (node == NULL)
return NULL;
@@ -381,7 +380,7 @@ void **radix_tree_lookup_slot(struct rad
do {
slot = (struct radix_tree_node **)
(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
- node = *slot;
+ node = rcu_dereference(*slot);
if (node == NULL)
return NULL;
@@ -658,7 +657,7 @@ unsigned long radix_tree_next_hole(struc
EXPORT_SYMBOL(radix_tree_next_hole);
static unsigned int
-__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
unsigned int nr_found = 0;
@@ -692,11 +691,9 @@ __lookup(struct radix_tree_node *slot, v
/* Bottom level: grab some items */
for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
- struct radix_tree_node *node;
index++;
- node = slot->slots[i];
- if (node) {
- results[nr_found++] = rcu_dereference(node);
+ if (slot->slots[i]) {
+ results[nr_found++] = &(slot->slots[i]);
if (nr_found == max_items)
goto out;
}
@@ -750,13 +747,22 @@ radix_tree_gang_lookup(struct radix_tree
ret = 0;
while (ret < max_items) {
- unsigned int nr_found;
+ unsigned int nr_found, slots_found, i;
unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
- nr_found = __lookup(node, results + ret, cur_index,
+ slots_found = __lookup(node, (void ***)results + ret, cur_index,
max_items - ret, &next_index);
+ nr_found = 0;
+ for (i = 0; i < slots_found; i++) {
+ struct radix_tree_node *slot;
+ slot = *(((void ***)results)[ret + i]);
+ if (!slot)
+ continue;
+ results[ret + nr_found] = rcu_dereference(slot);
+ nr_found++;
+ }
ret += nr_found;
if (next_index == 0)
break;
@@ -767,12 +773,71 @@ radix_tree_gang_lookup(struct radix_tree
}
EXPORT_SYMBOL(radix_tree_gang_lookup);
+/**
+ * radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ *
+ * Performs an index-ascending scan of the tree for present items. Places
+ * their slots at *@results and returns the number of items which were
+ * placed at *@results.
+ *
+ * The implementation is naive.
+ *
+ * Like radix_tree_gang_lookup as far as RCU and locking goes. Slots must
+ * be dereferenced with radix_tree_deref_slot, and if using only RCU
+ * protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items)
+{
+ unsigned long max_index;
+ struct radix_tree_node *node;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = (void **)&root->rnode;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
+ unsigned int slots_found;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ slots_found = __lookup(node, results + ret, cur_index,
+ max_items - ret, &next_index);
+ ret += slots_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
/*
* FIXME: the two tag_get()s here should use find_next_bit() instead of
* open-coding the search.
*/
static unsigned int
-__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index, unsigned int tag)
{
unsigned int nr_found = 0;
@@ -802,11 +867,9 @@ __lookup_tag(struct radix_tree_node *slo
unsigned long j = index & RADIX_TREE_MAP_MASK;
for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
- struct radix_tree_node *node;
index++;
if (!tag_get(slot, tag, j))
continue;
- node = slot->slots[j];
/*
* Even though the tag was found set, we need to
* recheck that we have a non-NULL node, because
@@ -817,9 +880,8 @@ __lookup_tag(struct radix_tree_node *slo
* lookup ->slots[x] without a lock (ie. can't
* rely on its value remaining the same).
*/
- if (node) {
- node = rcu_dereference(node);
- results[nr_found++] = node;
+ if (slot->slots[j]) {
+ results[nr_found++] = &(slot->slots[j]);
if (nr_found == max_items)
goto out;
}
@@ -878,13 +940,22 @@ radix_tree_gang_lookup_tag(struct radix_
ret = 0;
while (ret < max_items) {
- unsigned int nr_found;
+ unsigned int nr_found, slots_found, i;
unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
- nr_found = __lookup_tag(node, results + ret, cur_index,
- max_items - ret, &next_index, tag);
+ slots_found = __lookup_tag(node, (void ***)results + ret,
+ cur_index, max_items - ret, &next_index, tag);
+ nr_found = 0;
+ for (i = 0; i < slots_found; i++) {
+ struct radix_tree_node *slot;
+ slot = *(((void ***)results)[ret + i]);
+ if (!slot)
+ continue;
+ results[ret + nr_found] = rcu_dereference(slot);
+ nr_found++;
+ }
ret += nr_found;
if (next_index == 0)
break;
@@ -896,6 +967,67 @@ radix_tree_gang_lookup_tag(struct radix_
EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
/**
+ * radix_tree_gang_lookup_tag_slot - perform multiple slot lookup on a
+ * radix tree based on a tag
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ * @tag: the tag index (< RADIX_TREE_MAX_TAGS)
+ *
+ * Performs an index-ascending scan of the tree for present items which
+ * have the tag indexed by @tag set. Places the slots at *@results and
+ * returns the number of slots which were placed at *@results.
+ */
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items,
+ unsigned int tag)
+{
+ struct radix_tree_node *node;
+ unsigned long max_index;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ /* check the root's tag bit */
+ if (!root_tag_get(root, tag))
+ return 0;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = (void **)&root->rnode;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
+ unsigned int slots_found;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ slots_found = __lookup_tag(node, results + ret,
+ cur_index, max_items - ret, &next_index, tag);
+ ret += slots_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
+
+
+/**
* radix_tree_shrink - shrink height of a radix tree to minimal
* @root radix tree root
*/
--
--
* [patch 3/7] mm: speculative page references
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
2008-06-05 9:43 ` [patch 1/7] mm: readahead scan lockless npiggin
2008-06-05 9:43 ` [patch 2/7] radix-tree: add gang_lookup_slot, gang_lookup_slot_tag npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-06 14:20 ` Peter Zijlstra
` (2 more replies)
2008-06-05 9:43 ` [patch 4/7] mm: lockless pagecache npiggin
` (6 subsequent siblings)
9 siblings, 3 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: mm-speculative-get_page-hugh.patch --]
[-- Type: text/plain, Size: 13116 bytes --]
If we can be sure that elevating the page_count on a pagecache page will pin
it, we can speculatively run this operation, and subsequently check to see if
we hit the right page rather than relying on holding a lock or otherwise
pinning a reference to the page.
This can be done if get_page/put_page behaves consistently throughout the whole
tree (ie. if we "get" the page after it has been used for something else, we
must be able to free it with a put_page).
Actually, there is a period where the count behaves differently: when the page
is free or if it is a constituent page of a compound page. We need an
atomic_inc_not_zero operation to ensure we don't try to grab the page in either
case.
This patch introduces the core locking protocol to the pagecache (ie. adds
page_cache_get_speculative, and tweaks some update-side code to make it work).
Thanks to Hugh for pointing out an improvement to the algorithm setting
page_count to zero when we have control of all references, in order to hold off
speculative getters.
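
As a rough sketch of the lookup-side protocol (illustration only, with
made-up names; the real find_get_page conversion in a later patch works on
radix tree slots rather than repeating the whole lookup):

        static struct page *example_get_page(struct address_space *mapping,
                                             pgoff_t offset)
        {
                struct page *page;

                rcu_read_lock();
        again:
                page = radix_tree_lookup(&mapping->page_tree, offset);  /* step 1 */
                if (page) {
                        if (!page_cache_get_speculative(page))          /* step 2 */
                                goto again;     /* page was being freed; retry */
                        /* step 3: check the page is still at this offset */
                        if (page != radix_tree_lookup(&mapping->page_tree, offset)) {
                                page_cache_release(page);
                                goto again;
                        }
                }
                rcu_read_unlock();
                return page;
        }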
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -12,6 +12,7 @@
#include <asm/uaccess.h>
#include <linux/gfp.h>
#include <linux/bitops.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -62,6 +63,98 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be called inside the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree (or page table):
+ * this allows allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with, no matter what it is subsequently allocated
+ * for (because put_page is what is used here to drop an invalid speculative
+ * reference).
+ *
+ * This is the interesting part of the lockless pagecache (and lockless
+ * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
+ * has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. check the page is still in pagecache (if no, goto 1)
+ *
+ * Remove-side that cares about stability of _count (eg. reclaim) has the
+ * following (with tree_lock held for write):
+ * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
+ * B. remove page from pagecache
+ * C. free the page
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before A: in this case, A sees elevated refcount and bails out
+ * - A runs before 2: in this case, 2 sees zero refcount and retries;
+ * subsequently, B will complete and 1 will find no page, causing the
+ * lookup to return NULL.
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * such a re-insertion, depending on order that locks are granted.
+ *
+ * Lookups racing against pagecache insertion isn't a big problem: either 1
+ * will find the page or it will not. Likewise, the old find_get_page could run
+ * either before the insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
+#else
+ if (unlikely(!get_page_unless_zero(page))) {
+ /*
+ * Either the page has been freed, or will be freed.
+ * In either case, retry here and the caller should
+ * do the right thing (see comments above).
+ */
+ return 0;
+ }
+#endif
+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+ return 1;
+}
+
+static inline int page_freeze_refs(struct page *page, int count)
+{
+ return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
+}
+
+static inline void page_unfreeze_refs(struct page *page, int count)
+{
+ VM_BUG_ON(page_count(page) != 0);
+ VM_BUG_ON(count == 0);
+
+ atomic_set(&page->_count, count);
+}
+
#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
#else
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -390,12 +390,10 @@ static pageout_t pageout(struct page *pa
}
/*
- * Attempt to detach a locked page from its ->mapping. If it is dirty or if
- * someone else has a ref on the page, abort and return 0. If it was
- * successfully detached, return 1. Assumes the caller has a single ref on
- * this page.
+ * Save as remove_mapping, but if the page is removed from the mapping, it
+ * gets returned with a refcount of 0.
*/
-int remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page)
{
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -426,9 +424,9 @@ int remove_mapping(struct address_space
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
- if (unlikely(page_count(page) != 2))
+ if (!page_freeze_refs(page, 2))
goto cannot_free;
- smp_rmb();
+ /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page)))
goto cannot_free;
@@ -437,13 +435,11 @@ int remove_mapping(struct address_space
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
- __put_page(page); /* The pagecache ref */
- return 1;
+ } else {
+ __remove_from_page_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
}
- __remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
return 1;
cannot_free:
@@ -452,6 +448,26 @@ cannot_free:
}
/*
+ * Attempt to detach a locked page from its ->mapping. If it is dirty or if
+ * someone else has a ref on the page, abort and return 0. If it was
+ * successfully detached, return 1. Assumes the caller has a single ref on
+ * this page.
+ */
+int remove_mapping(struct address_space *mapping, struct page *page)
+{
+ if (__remove_mapping(mapping, page)) {
+ /*
+ * Unfreezing the refcount with 1 rather than 2 effectively
+ * drops the pagecache ref for us without requiring another
+ * atomic operation.
+ */
+ page_unfreeze_refs(page, 1);
+ return 1;
+ }
+ return 0;
+}
+
+/*
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
@@ -597,18 +613,27 @@ static unsigned long shrink_page_list(st
if (PagePrivate(page)) {
if (!try_to_release_page(page, sc->gfp_mask))
goto activate_locked;
- if (!mapping && page_count(page) == 1)
- goto free_it;
+ if (!mapping && page_count(page) == 1) {
+ unlock_page(page);
+ if (put_page_testzero(page))
+ goto free_it;
+ else {
+ nr_reclaimed++;
+ continue;
+ }
+ }
}
- if (!mapping || !remove_mapping(mapping, page))
+ if (!mapping || !__remove_mapping(mapping, page))
goto keep_locked;
free_it:
unlock_page(page);
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page))
- __pagevec_release_nonlru(&freed_pvec);
+ if (!pagevec_add(&freed_pvec, page)) {
+ __pagevec_free(&freed_pvec);
+ pagevec_reinit(&freed_pvec);
+ }
continue;
activate_locked:
@@ -622,7 +647,7 @@ keep:
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
- __pagevec_release_nonlru(&freed_pvec);
+ __pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -466,17 +466,22 @@ int add_to_page_cache(struct page *page,
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ page_cache_get(page);
+ SetPageLocked(page);
+ page->mapping = mapping;
+ page->index = offset;
+
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (!error) {
- page_cache_get(page);
- SetPageLocked(page);
- page->mapping = mapping;
- page->index = offset;
+ if (likely(!error)) {
mapping->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
- } else
+ } else {
+ page->mapping = NULL;
+ ClearPageLocked(page);
mem_cgroup_uncharge_page(page);
+ page_cache_release(page);
+ }
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -76,19 +76,26 @@ int add_to_swap_cache(struct page *page,
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ page_cache_get(page);
+ SetPageSwapCache(page);
+ set_page_private(page, entry.val);
+
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
- if (!error) {
- page_cache_get(page);
- SetPageSwapCache(page);
- set_page_private(page, entry.val);
+ if (likely(!error)) {
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(add_total);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
+
+ if (unlikely(error)) {
+ set_page_private(page, 0UL);
+ ClearPageSwapCache(page);
+ page_cache_release(page);
+ }
}
return error;
}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -304,6 +304,7 @@ out:
static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
{
+ int expected_count;
void **pslot;
if (!mapping) {
@@ -318,12 +319,18 @@ static int migrate_page_move_mapping(str
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
- if (page_count(page) != 2 + !!PagePrivate(page) ||
+ expected_count = 2 + !!PagePrivate(page);
+ if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
write_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
+ if (!page_freeze_refs(page, expected_count)) {
+ write_unlock_irq(&mapping->tree_lock);
+ return -EAGAIN;
+ }
+
/*
* Now we know that no one else is looking at the page.
*/
@@ -337,6 +344,7 @@ static int migrate_page_move_mapping(str
radix_tree_replace_slot(pslot, newpage);
+ page_unfreeze_refs(page, expected_count);
/*
* Drop cache reference from old page.
* We know this isn't the last reference.
Index: linux-2.6/drivers/net/cassini.c
===================================================================
--- linux-2.6.orig/drivers/net/cassini.c
+++ linux-2.6/drivers/net/cassini.c
@@ -576,6 +576,18 @@ static void cas_spare_recover(struct cas
list_for_each_safe(elem, tmp, &list) {
cas_page_t *page = list_entry(elem, cas_page_t, list);
+ /*
+ * With the lockless pagecache, cassini buffering scheme gets
+ * slightly less accurate: we might find that a page has an
+ * elevated reference count here, due to a speculative ref,
+ * and skip it as in-use. Ideally we would be able to reclaim
+ * it. However this would be such a rare case, it doesn't
+ * matter too much as we should pick it up the next time round.
+ *
+ * Importantly, if we find that the page has a refcount of 1
+ * here (our refcount), then we know it is definitely not inuse
+ * so we can reuse it.
+ */
if (page_count(page->buffer) > 1)
continue;
--
--
* Re: [patch 3/7] mm: speculative page references
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
@ 2008-06-06 14:20 ` Peter Zijlstra
2008-06-06 16:26 ` Nick Piggin
2008-06-06 16:27 ` Nick Piggin
2008-06-09 4:48 ` Tim Pepper
2008-06-10 19:08 ` Christoph Lameter
2 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2008-06-06 14:20 UTC (permalink / raw)
To: npiggin
Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus, Paul E McKenney
On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> plain text document attachment (mm-speculative-get_page-hugh.patch)
> +static inline int page_cache_get_speculative(struct page *page)
> +{
> + VM_BUG_ON(in_interrupt());
> +
> +#ifndef CONFIG_SMP
> +# ifdef CONFIG_PREEMPT
> + VM_BUG_ON(!in_atomic());
> +# endif
> + /*
> + * Preempt must be disabled here - we rely on rcu_read_lock doing
> + * this for us.
Preemptible RCU is already in the tree, so I guess you'll have to
explicitly disable preemption if you require it.
--
* Re: [patch 3/7] mm: speculative page references
2008-06-06 14:20 ` Peter Zijlstra
@ 2008-06-06 16:26 ` Nick Piggin
2008-06-06 16:27 ` Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-06 16:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus, Paul E McKenney
On Fri, Jun 06, 2008 at 04:20:04PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> > plain text document attachment (mm-speculative-get_page-hugh.patch)
>
> > +static inline int page_cache_get_speculative(struct page *page)
> > +{
> > + VM_BUG_ON(in_interrupt());
> > +
> > +#ifndef CONFIG_SMP
> > +# ifdef CONFIG_PREEMPT
> > + VM_BUG_ON(!in_atomic());
> > +# endif
> > + /*
> > + * Preempt must be disabled here - we rely on rcu_read_lock doing
> > + * this for us.
>
> Preemptible RCU is already in the tree, so I guess you'll have to
> explicitly disable preemption if you require it.
Oh, of course, I forget about preempt RCU, lucky for the comment.
Good spotting.
--
As per the comment here, we can only use that shortcut if rcu_read_lock
disabled preemption. It would be somewhat annoying to have to put
preempt_disable/preempt_enable around all callers in order to support
this, but preempt RCU isn't going to be hugely performance critical
anyway (and actually it actively trades performance for fewer preempt off
sections), so it can use the slightly slower path quite happily.
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -111,7 +111,7 @@ static inline int page_cache_get_specula
{
VM_BUG_ON(in_interrupt());
-#ifndef CONFIG_SMP
+#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
# ifdef CONFIG_PREEMPT
VM_BUG_ON(!in_atomic());
# endif
--
* Re: [patch 3/7] mm: speculative page references
2008-06-06 14:20 ` Peter Zijlstra
2008-06-06 16:26 ` Nick Piggin
@ 2008-06-06 16:27 ` Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-06 16:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus, Paul E McKenney
On Fri, Jun 06, 2008 at 04:20:04PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> > plain text document attachment (mm-speculative-get_page-hugh.patch)
>
> > +static inline int page_cache_get_speculative(struct page *page)
> > +{
> > + VM_BUG_ON(in_interrupt());
> > +
> > +#ifndef CONFIG_SMP
> > +# ifdef CONFIG_PREEMPT
> > + VM_BUG_ON(!in_atomic());
> > +# endif
> > + /*
> > + * Preempt must be disabled here - we rely on rcu_read_lock doing
> > + * this for us.
>
> Preemptible RCU is already in the tree, so I guess you'll have to
> explicitly disable preemption if you require it.
>
And here is the fix for patch 7/7
--
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -149,7 +149,7 @@ static inline int page_cache_add_specula
{
VM_BUG_ON(in_interrupt());
-#ifndef CONFIG_SMP
+#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
# ifdef CONFIG_PREEMPT
VM_BUG_ON(!in_atomic());
# endif
--
* Re: [patch 3/7] mm: speculative page references
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
2008-06-06 14:20 ` Peter Zijlstra
@ 2008-06-09 4:48 ` Tim Pepper
2008-06-10 19:08 ` Christoph Lameter
2 siblings, 0 replies; 31+ messages in thread
From: Tim Pepper @ 2008-06-09 4:48 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, Jun 5, 2008 at 2:43 AM, <npiggin@suse.de> wrote:
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -390,12 +390,10 @@ static pageout_t pageout(struct page *pa
> }
>
> /*
> - * Attempt to detach a locked page from its ->mapping. If it is dirty or if
> - * someone else has a ref on the page, abort and return 0. If it was
> - * successfully detached, return 1. Assumes the caller has a single ref on
> - * this page.
> + * Save as remove_mapping, but if the page is removed from the mapping, it
> + * gets returned with a refcount of 0.
^^^^^^
Same as?
--
* Re: [patch 3/7] mm: speculative page references
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
2008-06-06 14:20 ` Peter Zijlstra
2008-06-09 4:48 ` Tim Pepper
@ 2008-06-10 19:08 ` Christoph Lameter
2008-06-11 3:19 ` Nick Piggin
2 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-06-10 19:08 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, 5 Jun 2008, npiggin@suse.de wrote:
> + * do the right thing (see comments above).
> + */
> + return 0;
> + }
> +#endif
> + VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
This is easier written as:
== VM_BUG_ON(PageTail(page)
And its also slightly incorrect since page_private(page) is not pointing
to the head page for PageHead(page).
--
* Re: [patch 3/7] mm: speculative page references
2008-06-10 19:08 ` Christoph Lameter
@ 2008-06-11 3:19 ` Nick Piggin
0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 3:19 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 12:08:27PM -0700, Christoph Lameter wrote:
> On Thu, 5 Jun 2008, npiggin@suse.de wrote:
>
> > + * do the right thing (see comments above).
> > + */
> > + return 0;
> > + }
> > +#endif
> > + VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
>
> This is easier written as:
>
> == VM_BUG_ON(PageTail(page)
Yeah that would be nicer.
> And its also slightly incorrect since page_private(page) is not pointing
> to the head page for PageHead(page).
I see. Thanks.
--
* [patch 4/7] mm: lockless pagecache
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (2 preceding siblings ...)
2008-06-05 9:43 ` [patch 3/7] mm: speculative page references npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-05 9:43 ` [patch 5/7] mm: spinlock tree_lock npiggin
` (5 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: mm-lockless-pagecache-lookups.patch --]
[-- Type: text/plain, Size: 6874 bytes --]
Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on
the read-side).
The only atomicity changes this introduces is that the gang pagecache
lookup functions now behave as if they are implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages.
In practice, this atomicity guarantee is not used anyway, and it is
difficult to see how it could be. Gang pagecache lookups are designed
to replace individual lookups, so these semantics are natural.
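
In caller terms (hypothetical caller, for illustration only): each returned
page individually carries the find_get_page guarantee, but the array as a
whole is not an atomic snapshot of the pagecache.

        static void example_walk(struct address_space *mapping, pgoff_t start)
        {
                struct page *pages[16];
                unsigned int i, nr;

                nr = find_get_pages(mapping, start, 16, pages);
                for (i = 0; i < nr; i++) {
                        /* each pages[i] holds its own reference; callers that
                         * need the page to still be in this mapping must
                         * recheck page->mapping under the page lock, exactly
                         * as they did with the locked lookup */
                        page_cache_release(pages[i]);
                }
        }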
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -640,15 +640,35 @@ void __lock_page_nosync(struct page *pag
* Is there a pagecache struct page at the given (mapping, offset) tuple?
* If yes, increment its refcount and return it; if no, return NULL.
*/
-struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
{
+ void **pagep;
struct page *page;
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+repeat:
+ page = NULL;
+ pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
+ if (pagep) {
+ page = radix_tree_deref_slot(pagep);
+ if (unlikely(!page || page == RADIX_TREE_RETRY))
+ goto repeat;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /*
+ * Has the page moved?
+ * This is part of the lockless pagecache protocol. See
+ * include/linux/pagemap.h for details.
+ */
+ if (unlikely(page != *pagep)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+ }
+ rcu_read_unlock();
+
return page;
}
EXPORT_SYMBOL(find_get_page);
@@ -663,32 +683,22 @@ EXPORT_SYMBOL(find_get_page);
*
* Returns zero if the page was not present. find_lock_page() may sleep.
*/
-struct page *find_lock_page(struct address_space *mapping,
- pgoff_t offset)
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
{
struct page *page;
repeat:
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = find_get_page(mapping, offset);
if (page) {
- page_cache_get(page);
- if (TestSetPageLocked(page)) {
- read_unlock_irq(&mapping->tree_lock);
- __lock_page(page);
-
- /* Has the page been truncated while we slept? */
- if (unlikely(page->mapping != mapping)) {
- unlock_page(page);
- page_cache_release(page);
- goto repeat;
- }
- VM_BUG_ON(page->index != offset);
- goto out;
+ lock_page(page);
+ /* Has the page been truncated? */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
}
+ VM_BUG_ON(page->index != offset);
}
- read_unlock_irq(&mapping->tree_lock);
-out:
return page;
}
EXPORT_SYMBOL(find_lock_page);
@@ -754,13 +764,39 @@ unsigned find_get_pages(struct address_s
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, start, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
- read_unlock_irq(&mapping->tree_lock);
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
return ret;
}
@@ -781,19 +817,44 @@ unsigned find_get_pages_contig(struct ad
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, index, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, index, nr_pages);
- for (i = 0; i < ret; i++) {
- if (pages[i]->mapping == NULL || pages[i]->index != index)
+ if (page->mapping == NULL || page->index != index)
break;
- page_cache_get(pages[i]);
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
index++;
}
- read_unlock_irq(&mapping->tree_lock);
- return i;
+ rcu_read_unlock();
+ return ret;
}
EXPORT_SYMBOL(find_get_pages_contig);
@@ -813,15 +874,43 @@ unsigned find_get_pages_tag(struct addre
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree,
+ (void ***)pages, *index, nr_pages, tag);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
- (void **)pages, *index, nr_pages, tag);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
if (ret)
*index = pages[ret - 1]->index + 1;
- read_unlock_irq(&mapping->tree_lock);
+
return ret;
}
EXPORT_SYMBOL(find_get_pages_tag);
--
--
* [patch 5/7] mm: spinlock tree_lock
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (3 preceding siblings ...)
2008-06-05 9:43 ` [patch 4/7] mm: lockless pagecache npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-05 9:43 ` [patch 6/7] powerpc: implement pte_special npiggin
` (4 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: mm-spinlock-tree_lock.patch --]
[-- Type: text/plain, Size: 12281 bytes --]
mapping->tree_lock has no read lockers. Convert the lock from an rwlock
to a spinlock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -706,7 +706,7 @@ static int __set_page_dirty(struct page
if (TestSetPageDirty(page))
return 0;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -719,7 +719,7 @@ static int __set_page_dirty(struct page
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return 1;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -209,7 +209,7 @@ void inode_init_once(struct inode *inode
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
- rwlock_init(&inode->i_data.tree_lock);
+ spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -498,7 +498,7 @@ struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
- rwlock_t tree_lock; /* and rwlock protecting it */
+ spinlock_t tree_lock; /* and lock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -112,7 +112,7 @@ generic_file_direct_IO(int rw, struct ki
/*
* Remove a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
- * is safe. The caller must hold a write_lock on the mapping's tree_lock.
+ * is safe. The caller must hold the mapping's tree_lock.
*/
void __remove_from_page_cache(struct page *page)
{
@@ -144,9 +144,9 @@ void remove_from_page_cache(struct page
BUG_ON(!PageLocked(page));
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}
static int sync_page(void *word)
@@ -471,7 +471,7 @@ int add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = offset;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (likely(!error)) {
mapping->nrpages++;
@@ -483,7 +483,7 @@ int add_to_page_cache(struct page *page,
page_cache_release(page);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
} else
mem_cgroup_uncharge_page(page);
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -39,7 +39,7 @@ static struct backing_dev_info swap_back
struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
+ .tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
@@ -80,7 +80,7 @@ int add_to_swap_cache(struct page *page,
SetPageSwapCache(page);
set_page_private(page, entry.val);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (likely(!error)) {
@@ -88,7 +88,7 @@ int add_to_swap_cache(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(add_total);
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
if (unlikely(error)) {
@@ -182,9 +182,9 @@ void delete_from_swap_cache(struct page
entry.val = page_private(page);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
swap_free(entry);
page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -368,13 +368,13 @@ int remove_exclusive_swap_page(struct pa
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
}
spin_unlock(&swap_lock);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -349,18 +349,18 @@ invalidate_complete_page2(struct address
if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page))
goto failed;
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -398,7 +398,7 @@ static int __remove_mapping(struct addre
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
*
@@ -433,17 +433,17 @@ static int __remove_mapping(struct addre
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
swap_free(swap);
} else {
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}
return 1;
cannot_free:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -1081,7 +1081,7 @@ int __set_page_dirty_nobuffers(struct pa
if (!mapping)
return 1;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -1095,7 +1095,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -1251,7 +1251,7 @@ int test_clear_page_writeback(struct pag
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
if (ret) {
radix_tree_tag_clear(&mapping->page_tree,
@@ -1262,7 +1262,7 @@ int test_clear_page_writeback(struct pag
__bdi_writeout_inc(bdi);
}
}
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
}
@@ -1280,7 +1280,7 @@ int test_set_page_writeback(struct page
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret) {
radix_tree_tag_set(&mapping->page_tree,
@@ -1293,7 +1293,7 @@ int test_set_page_writeback(struct page
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestSetPageWriteback(page);
}
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -421,9 +421,9 @@ static inline void flush_anon_page(struc
}
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_user_range(vma,page,addr,len) \
flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -45,9 +45,9 @@ void flush_cache_mm(struct mm_struct *mm
extern void flush_dcache_page(struct page *page);
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_page(vma,page) do { \
flush_kernel_dcache_page(page); \
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -314,7 +314,7 @@ static int migrate_page_move_mapping(str
return 0;
}
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
@@ -322,12 +322,12 @@ static int migrate_page_move_mapping(str
expected_count = 2 + !!PagePrivate(page);
if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
if (!page_freeze_refs(page, expected_count)) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -364,7 +364,7 @@ static int migrate_page_move_mapping(str
__dec_zone_page_state(page, NR_FILE_PAGES);
__inc_zone_page_state(newpage, NR_FILE_PAGES);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
--
--
* [patch 6/7] powerpc: implement pte_special
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (4 preceding siblings ...)
2008-06-05 9:43 ` [patch 5/7] mm: spinlock tree_lock npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-06 4:04 ` Benjamin Herrenschmidt
2008-06-05 9:43 ` [patch 7/7] powerpc: lockless get_user_pages_fast npiggin
` (3 subsequent siblings)
9 siblings, 1 reply; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: powerpc-implement-pte_special.patch --]
[-- Type: text/plain, Size: 2929 bytes --]
Implement PTE_SPECIAL for powerpc. At the moment I only have a spare bit for
the 4k pages config, but Ben has freed up another one for 64k pages that I
can use, so this patch should include that before it goes upstream.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-powerpc/pgtable-ppc64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-ppc64.h
+++ linux-2.6/include/asm-powerpc/pgtable-ppc64.h
@@ -239,7 +239,7 @@ static inline int pte_write(pte_t pte) {
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
-static inline int pte_special(pte_t pte) { return 0; }
+static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; }
static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -259,7 +259,7 @@ static inline pte_t pte_mkyoung(pte_t pt
static inline pte_t pte_mkhuge(pte_t pte) {
return pte; }
static inline pte_t pte_mkspecial(pte_t pte) {
- return pte; }
+ pte_val(pte) |= _PAGE_SPECIAL; return pte; }
/* Atomic PTE updates */
static inline unsigned long pte_update(struct mm_struct *mm,
Index: linux-2.6/include/asm-powerpc/pgtable-4k.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-4k.h
+++ linux-2.6/include/asm-powerpc/pgtable-4k.h
@@ -45,6 +45,8 @@
#define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */
#define _PAGE_F_SECOND _PAGE_SECONDARY
#define _PAGE_F_GIX _PAGE_GROUP_IX
+#define _PAGE_SPECIAL 0x10000 /* software: special page */
+#define __HAVE_ARCH_PTE_SPECIAL
/* PTE flags to conserve for HPTE identification */
#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | \
Index: linux-2.6/include/asm-powerpc/pgtable-64k.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-64k.h
+++ linux-2.6/include/asm-powerpc/pgtable-64k.h
@@ -74,6 +74,7 @@ static inline struct subpage_prot_table
#define _PAGE_HPTE_SUB0 0x08000000 /* combo only: first sub page */
#define _PAGE_COMBO 0x10000000 /* this is a combo 4k page */
#define _PAGE_4K_PFN 0x20000000 /* PFN is for a single 4k page */
+#define _PAGE_SPECIAL 0x0 /* don't have enough room for this yet */
/* Note the full page bits must be in the same location as for normal
* 4k pages as the same asssembly will be used to insert 64K pages
--
* Re: [patch 6/7] powerpc: implement pte_special
2008-06-05 9:43 ` [patch 6/7] powerpc: implement pte_special npiggin
@ 2008-06-06 4:04 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-06 4:04 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, paulus
On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> plain text document attachment (powerpc-implement-pte_special.patch)
> Implement PTE_SPECIAL for powerpc. At the moment I only have a spare bit for
> the 4k pages config, but Ben has freed up another one for 64k pages that I
> can use, so this patch should include that before it goes upstream.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Ack that bit. _PAGE_SPECIAL will replace _PAGE_HASHPTE on 64K (ie.
0x400). The patch that frees that bit should get into powerpc.git (and
from there -mm) as soon as paulus catches up with his backlog :-)
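In concrete terms -- as an illustrative sketch only, not part of the
posted patch -- once that bit is freed the pgtable-64k.h side would end
up with something like the following (0x400 per the note above; carrying
__HAVE_ARCH_PTE_SPECIAL over to the 64k header is an assumption here):

#define _PAGE_SPECIAL	0x00000400 /* software: special page */
#define __HAVE_ARCH_PTE_SPECIAL

at which point pte_special()/pte_mkspecial() from pgtable-ppc64.h would
work unchanged for both page size configurations.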
Cheers,
Ben.
> ---
> Index: linux-2.6/include/asm-powerpc/pgtable-ppc64.h
> ===================================================================
> --- linux-2.6.orig/include/asm-powerpc/pgtable-ppc64.h
> +++ linux-2.6/include/asm-powerpc/pgtable-ppc64.h
> @@ -239,7 +239,7 @@ static inline int pte_write(pte_t pte) {
> static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
> static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
> static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
> -static inline int pte_special(pte_t pte) { return 0; }
> +static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; }
>
> static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
> static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
> @@ -259,7 +259,7 @@ static inline pte_t pte_mkyoung(pte_t pt
> static inline pte_t pte_mkhuge(pte_t pte) {
> return pte; }
> static inline pte_t pte_mkspecial(pte_t pte) {
> - return pte; }
> + pte_val(pte) |= _PAGE_SPECIAL; return pte; }
>
> /* Atomic PTE updates */
> static inline unsigned long pte_update(struct mm_struct *mm,
> Index: linux-2.6/include/asm-powerpc/pgtable-4k.h
> ===================================================================
> --- linux-2.6.orig/include/asm-powerpc/pgtable-4k.h
> +++ linux-2.6/include/asm-powerpc/pgtable-4k.h
> @@ -45,6 +45,8 @@
> #define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */
> #define _PAGE_F_SECOND _PAGE_SECONDARY
> #define _PAGE_F_GIX _PAGE_GROUP_IX
> +#define _PAGE_SPECIAL 0x10000 /* software: special page */
> +#define __HAVE_ARCH_PTE_SPECIAL
>
> /* PTE flags to conserve for HPTE identification */
> #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | \
> Index: linux-2.6/include/asm-powerpc/pgtable-64k.h
> ===================================================================
> --- linux-2.6.orig/include/asm-powerpc/pgtable-64k.h
> +++ linux-2.6/include/asm-powerpc/pgtable-64k.h
> @@ -74,6 +74,7 @@ static inline struct subpage_prot_table
> #define _PAGE_HPTE_SUB0 0x08000000 /* combo only: first sub page */
> #define _PAGE_COMBO 0x10000000 /* this is a combo 4k page */
> #define _PAGE_4K_PFN 0x20000000 /* PFN is for a single 4k page */
> +#define _PAGE_SPECIAL 0x0 /* don't have enough room for this yet */
>
> /* Note the full page bits must be in the same location as for normal
> * 4k pages as the same asssembly will be used to insert 64K pages
>
* [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (5 preceding siblings ...)
2008-06-05 9:43 ` [patch 6/7] powerpc: implement pte_special npiggin
@ 2008-06-05 9:43 ` npiggin
2008-06-09 8:32 ` Andrew Morton
2008-06-10 19:00 ` Christoph Lameter
2008-06-05 11:53 ` [patch 0/7] speculative page references, lockless pagecache, lockless gup Nick Piggin
` (2 subsequent siblings)
9 siblings, 2 replies; 31+ messages in thread
From: npiggin @ 2008-06-05 9:43 UTC (permalink / raw)
To: akpm, torvalds; +Cc: linux-mm, linux-kernel, benh, paulus
[-- Attachment #1: powerpc-fast_gup.patch --]
[-- Type: text/plain, Size: 8461 bytes --]
Implement lockless get_user_pages_fast for powerpc. Page table existence is
guaranteed with RCU, and speculative page references are used to take a
reference to the pages without having a prior existence guarantee on them.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-powerpc/uaccess.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/uaccess.h
+++ linux-2.6/include/asm-powerpc/uaccess.h
@@ -493,6 +493,12 @@ static inline int strnlen_user(const cha
#define strlen_user(str) strnlen_user((str), 0x7ffffffe)
+#ifdef __powerpc64__
+#define __HAVE_ARCH_GET_USER_PAGES_FAST
+struct page;
+int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages);
+#endif
+
#endif /* __ASSEMBLY__ */
#endif /* __KERNEL__ */
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -244,7 +244,7 @@ static inline int put_page_testzero(stru
*/
static inline int get_page_unless_zero(struct page *page)
{
- VM_BUG_ON(PageTail(page));
+ VM_BUG_ON(PageCompound(page));
return atomic_inc_not_zero(&page->_count);
}
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -142,6 +142,29 @@ static inline int page_cache_get_specula
return 1;
}
+/*
+ * Same as above, but add instead of inc (could just be merged)
+ */
+static inline int page_cache_add_speculative(struct page *page, int count)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_add(count, &page->_count);
+
+#else
+ if (unlikely(!atomic_add_unless(&page->_count, count, 0)))
+ return 0;
+#endif
+ VM_BUG_ON(PageCompound(page) && page != compound_head(page));
+
+ return 1;
+}
+
static inline int page_freeze_refs(struct page *page, int count)
{
return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
Index: linux-2.6/arch/powerpc/mm/Makefile
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/Makefile
+++ linux-2.6/arch/powerpc/mm/Makefile
@@ -6,7 +6,7 @@ ifeq ($(CONFIG_PPC64),y)
EXTRA_CFLAGS += -mno-minimal-toc
endif
-obj-y := fault.o mem.o \
+obj-y := fault.o mem.o gup.o \
init_$(CONFIG_WORD_SIZE).o \
pgtable_$(CONFIG_WORD_SIZE).o \
mmu_context_$(CONFIG_WORD_SIZE).o
Index: linux-2.6/arch/powerpc/mm/gup.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/powerpc/mm/gup.c
@@ -0,0 +1,230 @@
+/*
+ * Lockless get_user_pages_fast for powerpc
+ *
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/pagemap.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
+ unsigned long end, int write, struct page **pages, int *nr)
+{
+ unsigned long mask, result;
+ pte_t *ptep;
+
+ result = _PAGE_PRESENT|_PAGE_USER;
+ if (write)
+ result |= _PAGE_RW;
+ mask = result | _PAGE_SPECIAL;
+
+ ptep = pte_offset_kernel(&pmd, addr);
+ do {
+ pte_t pte = *ptep;
+ struct page *page;
+
+ if ((pte_val(pte) & mask) != result)
+ return 0;
+ VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ page = pte_page(pte);
+ if (!page_cache_get_speculative(page))
+ return 0;
+ if (unlikely(pte != *ptep)) {
+ put_page(page);
+ return 0;
+ }
+ pages[*nr] = page;
+ (*nr)++;
+
+ } while (ptep++, addr += PAGE_SIZE, addr != end);
+
+ return 1;
+}
+
+static noinline int gup_huge_pte(pte_t *ptep, unsigned long *addr,
+ unsigned long end, int write, struct page **pages, int *nr)
+{
+ unsigned long mask;
+ unsigned long pte_end;
+ struct page *head, *page;
+ pte_t pte;
+ int refs;
+
+ pte_end = (*addr + HPAGE_SIZE) & HPAGE_MASK;
+ if (pte_end < end)
+ end = pte_end;
+
+ pte = *ptep;
+ mask = _PAGE_PRESENT|_PAGE_USER;
+ if (write)
+ mask |= _PAGE_RW;
+ if ((pte_val(pte) & mask) != mask)
+ return 0;
+ /* hugepages are never "special" */
+ VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+ refs = 0;
+ head = pte_page(pte);
+ page = head + ((*addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+ do {
+ VM_BUG_ON(compound_head(page) != head);
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ refs++;
+ } while (*addr += PAGE_SIZE, *addr != end);
+
+ if (!page_cache_add_speculative(head, refs)) {
+ *nr -= refs;
+ return 0;
+ }
+ if (unlikely(pte != *ptep)) {
+ /* Could be optimized better */
+ while (*nr) {
+ put_page(page);
+ (*nr)--;
+ }
+ }
+
+ return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ pmd_t *pmdp;
+
+ pmdp = pmd_offset(&pud, addr);
+ do {
+ pmd_t pmd = *pmdp;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_none(pmd))
+ return 0;
+ if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+ return 0;
+ } while (pmdp++, addr = next, addr != end);
+
+ return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ pud_t *pudp;
+
+ pudp = pud_offset(&pgd, addr);
+ do {
+ pud_t pud = *pudp;
+
+ next = pud_addr_end(addr, end);
+ if (pud_none(pud))
+ return 0;
+ if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+ return 0;
+ } while (pudp++, addr = next, addr != end);
+
+ return 1;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long end = start + (nr_pages << PAGE_SHIFT);
+ unsigned long addr = start;
+ unsigned long next;
+ pgd_t *pgdp;
+ int nr = 0;
+
+
+ if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+ start, nr_pages*PAGE_SIZE)))
+ goto slow_irqon;
+
+ /* Cross a slice boundary? */
+ if (unlikely(addr < SLICE_LOW_TOP && end >= SLICE_LOW_TOP))
+ goto slow_irqon;
+
+ /*
+ * XXX: batch / limit 'nr', to avoid large irq off latency
+ * needs some instrumenting to determine the common sizes used by
+ * important workloads (eg. DB2), and whether limiting the batch size
+ * will decrease performance.
+ *
+ * It seems like we're in the clear for the moment. Direct-IO is
+ * the main guy that batches up lots of get_user_pages, and even
+ * they are limited to 64-at-a-time which is not so many.
+ */
+ /*
+ * This doesn't prevent pagetable teardown, but does prevent
+ * the pagetables from being freed on powerpc.
+ *
+ * So long as we atomically load page table pointers versus teardown,
+ * we can follow the address down to the page and take a ref on it.
+ */
+ local_irq_disable();
+
+ if (get_slice_psize(mm, addr) == mmu_huge_psize) {
+ pte_t *ptep;
+ unsigned long a = addr;
+
+ ptep = huge_pte_offset(mm, a);
+ do {
+ if (!gup_huge_pte(ptep, &a, end, write, pages, &nr))
+ goto slow;
+ ptep++;
+ } while (a != end);
+ } else {
+ pgdp = pgd_offset(mm, addr);
+ do {
+ pgd_t pgd = *pgdp;
+
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(pgd))
+ goto slow;
+ if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ goto slow;
+ } while (pgdp++, addr = next, addr != end);
+ }
+ local_irq_enable();
+
+ VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+ return nr;
+
+ {
+ int ret;
+
+slow:
+ local_irq_enable();
+slow_irqon:
+ /* Try to get the remaining pages with get_user_pages */
+ start += nr << PAGE_SHIFT;
+ pages += nr;
+
+ down_read(&mm->mmap_sem);
+ ret = get_user_pages(current, mm, start,
+ (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+ up_read(&mm->mmap_sem);
+
+ /* Have to be a bit careful with return values */
+ if (nr > 0) {
+ if (ret < 0)
+ ret = nr;
+ else
+ ret += nr;
+ }
+
+ return ret;
+ }
+}
--
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-05 9:43 ` [patch 7/7] powerpc: lockless get_user_pages_fast npiggin
@ 2008-06-09 8:32 ` Andrew Morton
2008-06-10 3:15 ` Nick Piggin
2008-06-10 19:00 ` Christoph Lameter
1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2008-06-09 8:32 UTC (permalink / raw)
To: npiggin; +Cc: torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, 05 Jun 2008 19:43:07 +1000 npiggin@suse.de wrote:
> Implement lockless get_user_pages_fast for powerpc. Page table existence is
> guaranteed with RCU, and speculative page references are used to take a
> reference to the pages without having a prior existence guarantee on them.
>
arch/powerpc/mm/gup.c: In function `get_user_pages_fast':
arch/powerpc/mm/gup.c:156: error: `SLICE_LOW_TOP' undeclared (first use in this function)
arch/powerpc/mm/gup.c:156: error: (Each undeclared identifier is reported only once
arch/powerpc/mm/gup.c:156: error: for each function it appears in.)
arch/powerpc/mm/gup.c:178: error: implicit declaration of function `get_slice_psize'
arch/powerpc/mm/gup.c:178: error: `mmu_huge_psize' undeclared (first use in this function)
arch/powerpc/mm/gup.c:182: error: implicit declaration of function `huge_pte_offset'
arch/powerpc/mm/gup.c:182: warning: assignment makes pointer from integer without a cast
with
http://userweb.kernel.org/~akpm/config-g5.txt
I don't immediately know why - adding asm/page.h to gup.c doesn't help.
I'm suspecting a recursive include problem somewhere.
I'll drop it, sorry - too much other stuff to fix over here.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-09 8:32 ` Andrew Morton
@ 2008-06-10 3:15 ` Nick Piggin
0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-10 3:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: torvalds, linux-mm, linux-kernel, benh, paulus
On Mon, Jun 09, 2008 at 01:32:04AM -0700, Andrew Morton wrote:
> On Thu, 05 Jun 2008 19:43:07 +1000 npiggin@suse.de wrote:
>
> > Implement lockless get_user_pages_fast for powerpc. Page table existence is
> > guaranteed with RCU, and speculative page references are used to take a
> > reference to the pages without having a prior existence guarantee on them.
> >
>
> arch/powerpc/mm/gup.c: In function `get_user_pages_fast':
> arch/powerpc/mm/gup.c:156: error: `SLICE_LOW_TOP' undeclared (first use in this function)
> arch/powerpc/mm/gup.c:156: error: (Each undeclared identifier is reported only once
> arch/powerpc/mm/gup.c:156: error: for each function it appears in.)
> arch/powerpc/mm/gup.c:178: error: implicit declaration of function `get_slice_psize'
> arch/powerpc/mm/gup.c:178: error: `mmu_huge_psize' undeclared (first use in this function)
> arch/powerpc/mm/gup.c:182: error: implicit declaration of function `huge_pte_offset'
> arch/powerpc/mm/gup.c:182: warning: assignment makes pointer from integer without a cast
>
> with
>
> http://userweb.kernel.org/~akpm/config-g5.txt
>
> I don't immediately know why - adding asm/page.h to gup.c doesn't help.
> I'm suspecting a recursive include problem somewhere.
>
> I'll drop it, sorry - too much other stuff to fix over here.
No problem. Likely a clash with the hugepage patches.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-05 9:43 ` [patch 7/7] powerpc: lockless get_user_pages_fast npiggin
2008-06-09 8:32 ` Andrew Morton
@ 2008-06-10 19:00 ` Christoph Lameter
2008-06-11 3:18 ` Nick Piggin
1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-06-10 19:00 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, 5 Jun 2008, npiggin@suse.de wrote:
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -244,7 +244,7 @@ static inline int put_page_testzero(stru
> */
> static inline int get_page_unless_zero(struct page *page)
> {
> - VM_BUG_ON(PageTail(page));
> + VM_BUG_ON(PageCompound(page));
> return atomic_inc_not_zero(&page->_count);
> }
This is reversing the modification to make get_page_unless_zero() usable
with compound page heads. Will break the slab defrag patchset.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-10 19:00 ` Christoph Lameter
@ 2008-06-11 3:18 ` Nick Piggin
2008-06-11 4:40 ` Christoph Lameter
0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 3:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 12:00:48PM -0700, Christoph Lameter wrote:
> On Thu, 5 Jun 2008, npiggin@suse.de wrote:
>
> > Index: linux-2.6/include/linux/mm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/mm.h
> > +++ linux-2.6/include/linux/mm.h
> > @@ -244,7 +244,7 @@ static inline int put_page_testzero(stru
> > */
> > static inline int get_page_unless_zero(struct page *page)
> > {
> > - VM_BUG_ON(PageTail(page));
> > + VM_BUG_ON(PageCompound(page));
> > return atomic_inc_not_zero(&page->_count);
> > }
>
> This is reversing the modification to make get_page_unless_zero() usable
> with compound page heads. Will break the slab defrag patchset.
Is the slab defrag patchset in -mm? Because you ignored my comment about
this change that assertions should not be weakened until required by the
actual patchset. I wanted to have these assertions be as strong as
possible for the lockless pagecache patchset.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 3:18 ` Nick Piggin
@ 2008-06-11 4:40 ` Christoph Lameter
2008-06-11 4:41 ` Christoph Lameter
2008-06-11 4:47 ` Nick Piggin
0 siblings, 2 replies; 31+ messages in thread
From: Christoph Lameter @ 2008-06-11 4:40 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Wed, 11 Jun 2008, Nick Piggin wrote:
> > This is reversing the modification to make get_page_unless_zero() usable
> > with compound page heads. Will break the slab defrag patchset.
>
> Is the slab defrag patchset in -mm? Because you ignored my comment about
> this change that assertions should not be weakened until required by the
> actual patchset. I wanted to have these assertions be as strong as
> possible for the lockless pagecache patchset.
So you are worried about accidentally using get_page_unless_zero on a
compound page? What would be wrong about that?
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 4:40 ` Christoph Lameter
@ 2008-06-11 4:41 ` Christoph Lameter
2008-06-11 4:49 ` Nick Piggin
2008-06-11 4:47 ` Nick Piggin
1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-06-11 4:41 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
And yes slab defrag is part of linux-next. So it would break.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 4:41 ` Christoph Lameter
@ 2008-06-11 4:49 ` Nick Piggin
2008-06-11 6:06 ` Andrew Morton
0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 4:49 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 09:41:33PM -0700, Christoph Lameter wrote:
> And yes slab defrag is part of linux-next. So it would break.
Can memory management patches go through mm/? I dislike the cowboy
method of merging things that some other subsystems have adopted :)
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 4:49 ` Nick Piggin
@ 2008-06-11 6:06 ` Andrew Morton
2008-06-11 6:24 ` Nick Piggin
2008-06-11 23:20 ` Christoph Lameter
0 siblings, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2008-06-11 6:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, torvalds, linux-mm, linux-kernel, benh, paulus
On Wed, 11 Jun 2008 06:49:02 +0200 Nick Piggin <npiggin@suse.de> wrote:
> On Tue, Jun 10, 2008 at 09:41:33PM -0700, Christoph Lameter wrote:
> > And yes slab defrag is part of linux-next. So it would break.
No, slab defrag[*] isn't in linux-next.
y:/usr/src/25> diffstat patches/linux-next.patch| grep mm/slub.c
mm/slub.c | 4
That's two spelling fixes in comments.
I have git-pekka in -mm too. Here it is:
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2765,6 +2765,7 @@ void kfree(const void *x)
page = virt_to_head_page(x);
if (unlikely(!PageSlab(page))) {
+ BUG_ON(!PageCompound(page));
put_page(page);
return;
}
> Can memory management patches go through mm/? I dislike the cowboy
> method of merging things that some other subsystems have adopted :)
I think I'd prefer that. I may be a bit slow, but we're shoving at
least 100 MM patches through each kernel release and I think I review
things more closely than others choose to. At least, I find problems
and I've seen some pretty wild acked-bys...
[*] It _isn't_ "slab defrag". Or at least, it wasn't last time I saw
it. It's "slub defrag". And IMO it is bad to be adding slub-only
features because afaik slub still isn't as fast as slab on some things
and so some people might want to run slab rather than slub. And
because of this the decision whether to retain slab or slub STILL
hasn't been made. Carrying both versions was supposed to be a
short-term transitional thing :(
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 6:06 ` Andrew Morton
@ 2008-06-11 6:24 ` Nick Piggin
2008-06-11 6:50 ` Andrew Morton
2008-06-11 23:20 ` Christoph Lameter
1 sibling, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 6:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 11:06:22PM -0700, Andrew Morton wrote:
> On Wed, 11 Jun 2008 06:49:02 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > Can memory management patches go through mm/? I dislike the cowboy
^^^
That should read -mm, of course.
> > method of merging things that some other subsystems have adopted :)
>
> I think I'd prefer that. I may be a bit slow, but we're shoving at
> least 100 MM patches through each kernel release and I think I review
> things more closely than others choose to. At least, I find problems
> and I've seen some pretty wild acked-bys...
I wouldn't say you're too slow. You're as close to mm and mm/fs
maintainer as we're likely to get and I think it would be much worse
to have things merged out-of-band. Even the more peripheral parts like
slab or hugetlb.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 6:24 ` Nick Piggin
@ 2008-06-11 6:50 ` Andrew Morton
0 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2008-06-11 6:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, torvalds, linux-mm, linux-kernel, benh, paulus
On Wed, 11 Jun 2008 08:24:04 +0200 Nick Piggin <npiggin@suse.de> wrote:
> On Tue, Jun 10, 2008 at 11:06:22PM -0700, Andrew Morton wrote:
> > On Wed, 11 Jun 2008 06:49:02 +0200 Nick Piggin <npiggin@suse.de> wrote:
> >
> > > Can memory management patches go through mm/? I dislike the cowboy
> ^^^
> That should read -mm, of course.
>
-mm is looking awfully peripheral nowadays. I really need to get my
linux-next act together. Instead I'll be taking all next week off.
nyer nyer.
>
> > > method of merging things that some other subsystems have adopted :)
> >
> > I think I'd prefer that. I may be a bit slow, but we're shoving at
> > least 100 MM patches through each kernel release and I think I review
> > things more closely than others choose to. At least, I find problems
> > and I've seen some pretty wild acked-bys...
>
> I wouldn't say you're too slow. You're as close to mm and mm/fs
> maintainer as we're likely to get and I think it would be much worse
> to have things merged out-of-band. Even the more peripheral parts like
> slab or hugetlb.
Sigh. I feel guilty when spending time merging (for example)
random-usb-patches-in-case-greg-misses-them, but such is life. Some
help reviewing things would be nice.
A lot of my review is now of the "how the heck is anyone to understand
this in a year's time if I can't understand it now" variety, but I hope
that's useful...
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 6:06 ` Andrew Morton
2008-06-11 6:24 ` Nick Piggin
@ 2008-06-11 23:20 ` Christoph Lameter
1 sibling, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2008-06-11 23:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, 10 Jun 2008, Andrew Morton wrote:
> hasn't been made. Carrying both versions was supposed to be a
> short-term transitional thing :(
The whatever defrag patchset includes a patch to make SLAB
experimental. So one step further.
* Re: [patch 7/7] powerpc: lockless get_user_pages_fast
2008-06-11 4:40 ` Christoph Lameter
2008-06-11 4:41 ` Christoph Lameter
@ 2008-06-11 4:47 ` Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-11 4:47 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Tue, Jun 10, 2008 at 09:40:25PM -0700, Christoph Lameter wrote:
> On Wed, 11 Jun 2008, Nick Piggin wrote:
>
> > > This is reversing the modification to make get_page_unless_zero() usable
> > > with compound page heads. Will break the slab defrag patchset.
> >
> > Is the slab defrag patchset in -mm? Because you ignored my comment about
> > this change that assertions should not be weakened until required by the
> > actual patchset. I wanted to have these assertions be as strong as
> > possible for the lockless pagecache patchset.
>
> So you are worried about accidentally using get_page_unless_zero on a
> compound page? What would be wrong about that?
Unexpected. Compound pages should have none of the races that require
get_page_unless_zero, which we use very carefully in page reclaim.
If you don't actually know whether you have a reference to the
thing or not before trying to operate on it, then you've almost
certainly got the refcounting wrong. How does slab defrag use it?
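For reference, here is the pattern being argued about, written out as a
standalone userspace sketch rather than kernel code (the names and the
simplified refcount are made up for illustration; the kernel
counterparts are atomic_inc_not_zero()/get_page_unless_zero() plus the
re-check that callers of page_cache_get_speculative() do, as in the gup
patch above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_int refcount;		/* analogous to page->_count */
};

/* analogous to atomic_inc_not_zero() / get_page_unless_zero() */
static bool get_obj_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;	/* pinned: freeing is now blocked */
	}
	return false;			/* count already hit zero: don't resurrect it */
}

static void put_obj(struct obj *o)
{
	atomic_fetch_sub(&o->refcount, 1);	/* actual freeing path elided */
}

/* analogous to a lockless find_get_page(): pin speculatively, then re-validate */
static struct obj *lookup_speculative(struct obj *_Atomic *slot)
{
	struct obj *o = atomic_load(slot);

	if (o == NULL || !get_obj_unless_zero(o))
		return NULL;
	if (atomic_load(slot) != o) {	/* raced with removal or reuse */
		put_obj(o);
		return NULL;
	}
	return o;
}

The re-check against the slot is what makes the speculative get safe;
the point above is that compound pages are not expected to be looked up
this way at all, because whoever operates on them should already hold a
reference.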
* Re: [patch 0/7] speculative page references, lockless pagecache, lockless gup
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (6 preceding siblings ...)
2008-06-05 9:43 ` [patch 7/7] powerpc: lockless get_user_pages_fast npiggin
@ 2008-06-05 11:53 ` Nick Piggin
2008-06-05 17:33 ` Linus Torvalds
2008-06-06 21:32 ` Peter Zijlstra
9 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-05 11:53 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thursday 05 June 2008 19:43, npiggin@suse.de wrote:
> Hi,
>
> I've decided to submit the speculative page references patch to get merged.
> I think I've now got enough reasons to get it merged. Well... I always
> thought I did, I just didn't think anyone else thought I did. If you know
> what I mean.
>
> cc'ing the powerpc guys specifically because everyone else who probably
> cares should be on linux-mm...
>
> So speculative page references are required to support lockless pagecache
> and lockless get_user_pages (on architectures that can't use the x86
> trick). Other uses for speculative page references could also pop up, it is
> a pretty useful concept. Doesn't need to be pagecache pages either.
>
> Anyway,
>
> lockless pagecache:
> - speeds up single threaded pagecache lookup operations significantly, by
> avoiding atomic operations, memory barriers, and interrupts-off sections.
> I just measured again on a few CPUs I have lying around here, and the
> speedup is over 2x reduction in cycles on them all, closer to 3x in some
> cases.
>
> find_get_page takes (cycles):
>              ppc970 (g5)    K10    P4 Nocona    Core2
> vanilla          275         85       315        143
> lockless         125         40       127         61
>
> - speeds up single threaded pagecache modification operations, by using
> regular spinlocks rather than rwlocks and avoiding an atomic operation
> on x86 for one. Also, most real paths which involve pagecache
> modification also involve pagecache lookups, so it is hard not to get a net
> speedup.
>
> - solves the rwlock starvation problem for pagecache operations. This is
> being noticed on big SGI systems, but theoretically could happen on
> relatively small systems (dozens of CPUs) due to the really nasty
> writer starvation problem of rwlocks -- not even hardware fairness can
> solve that.
>
> - improves pagecache scalability to operations on a single file. I
> demonstrated page faults to a single file were improved in throughput
> by 250x on a 64-way Altix several years ago. We now have systems with
> thousands of CPUs in them.
Oh that's actually another thing I remember now that I posted the scalable
vmap code...
The lock I ended up hitting next in the XFS large directory workload that
improved so much with the vmap patches was tree_lock of the buffer cache.
So lockless pagecache gave a reasonable improvement there too IIRC :)
* Re: [patch 0/7] speculative page references, lockless pagecache, lockless gup
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (7 preceding siblings ...)
2008-06-05 11:53 ` [patch 0/7] speculative page references, lockless pagecache, lockless gup Nick Piggin
@ 2008-06-05 17:33 ` Linus Torvalds
2008-06-06 0:08 ` Nick Piggin
2008-06-06 21:32 ` Peter Zijlstra
9 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-06-05 17:33 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, linux-kernel, benh, paulus
On Thu, 5 Jun 2008, npiggin@suse.de wrote:
>
> I've decided to submit the speculative page references patch to get merged.
> I think I've now got enough reasons to get it merged. Well... I always
> thought I did, I just didn't think anyone else thought I did. If you know
> what I mean.
So I'd certainly like to see these early in the 2.6.27 series.
Nick, will you just re-send them once 2.6.26 is out? Or do they cause
problems for Andrew and he wants to be part of the chain? I'm fine with
either.
Linus
* Re: [patch 0/7] speculative page references, lockless pagecache, lockless gup
2008-06-05 17:33 ` Linus Torvalds
@ 2008-06-06 0:08 ` Nick Piggin
0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-06-06 0:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: akpm, linux-mm, linux-kernel, benh, paulus
On Thu, Jun 05, 2008 at 10:33:15AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 5 Jun 2008, npiggin@suse.de wrote:
> >
> > I've decided to submit the speculative page references patch to get merged.
> > I think I've now got enough reasons to get it merged. Well... I always
> > thought I did, I just didn't think anyone else thought I did. If you know
> > what I mean.
>
> So I'd certainly like to see these early in the 2.6.27 series.
Oh good ;) So would I!
> Nick, will you just re-send them once 2.6.26 is out? Or do they cause
> problems for Andrew and he wants to be part of the chain? I'm fine with
> either.
Andrew has picked them up by the looks, and he's my favoured channel to
get mm work merged. Let's see how things go between now and 2.6.26,
which I assume should be a few weeks away?
* Re: [patch 0/7] speculative page references, lockless pagecache, lockless gup
2008-06-05 9:43 [patch 0/7] speculative page references, lockless pagecache, lockless gup npiggin
` (8 preceding siblings ...)
2008-06-05 17:33 ` Linus Torvalds
@ 2008-06-06 21:32 ` Peter Zijlstra
9 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2008-06-06 21:32 UTC (permalink / raw)
To: npiggin; +Cc: akpm, torvalds, linux-mm, linux-kernel, benh, paulus
On Thu, 2008-06-05 at 19:43 +1000, npiggin@suse.de wrote:
> Hi,
>
> I've decided to submit the speculative page references patch to get merged.
> I think I've now got enough reasons to get it merged. Well... I always
> thought I did, I just didn't think anyone else thought I did. If you know
> what I mean.
>
> cc'ing the powerpc guys specifically because everyone else who probably
> cares should be on linux-mm...
>
> So speculative page references are required to support lockless pagecache and
> lockless get_user_pages (on architectures that can't use the x86 trick). Other
> uses for speculative page references could also pop up, it is a pretty useful
> concept. Doesn't need to be pagecache pages either.
For patches 1-5
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>