* [patch 1/6] mm: readahead scan lockless
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
@ 2007-11-11 8:47 ` Nick Piggin
2007-11-11 8:49 ` [patch 2/6] radix-tree: gang_lookup_slot Nick Piggin
` (7 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2007-11-11 8:47 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins
Cc: Linux Memory Management List
This can actually go upstream now...
--
radix_tree_next_hole() is implemented as a series of radix_tree_lookup()s,
so it can be called locklessly, under rcu_read_lock().
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -376,9 +376,9 @@ ondemand_readahead(struct address_space
if (hit_readahead_marker) {
pgoff_t start;
- read_lock_irq(&mapping->tree_lock);
- start = radix_tree_next_hole(&mapping->page_tree, offset, max+1);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+ start = radix_tree_next_hole(&mapping->page_tree, offset, max+1);
+ rcu_read_unlock();
if (!start || start - offset > max)
return 0;
--
* [patch 2/6] radix-tree: gang_lookup_slot
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
2007-11-11 8:47 ` [patch 1/6] mm: readahead scan lockless Nick Piggin
@ 2007-11-11 8:49 ` Nick Piggin
2007-11-11 8:50 ` [patch 3/6] mm: speculative get page Nick Piggin
` (6 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2007-11-11 8:49 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins
Cc: Linux Memory Management List
Introduce gang_lookup_slot() and gang_lookup_tag_slot() functions, which are
used by the lockless pagecache.
Signed-off-by: Nick Piggin <npiggin@suse.de>
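Not part of the patch, just a rough sketch of the intended lockless usage
('mapping' and 'start' are assumed to be in scope; the real callers added in
patch 4/6 additionally handle RADIX_TREE_RETRY and take a speculative page
reference before trusting the result):

	void **slots[16];
	struct page *page;
	unsigned int i, nr;

	rcu_read_lock();
	nr = radix_tree_gang_lookup_slot(&mapping->page_tree, slots, start, 16);
	for (i = 0; i < nr; i++) {
		/* the slot may have been emptied since __lookup() saw it */
		page = radix_tree_deref_slot(slots[i]);
		if (!page)
			continue;
		/* ... page_cache_get_speculative(page), then recheck *slots[i] ... */
	}
	rcu_read_unlock();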
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -99,12 +99,15 @@ do { \
*
* The notable exceptions to this rule are the following functions:
* radix_tree_lookup
+ * radix_tree_lookup_slot
* radix_tree_tag_get
* radix_tree_gang_lookup
+ * radix_tree_gang_lookup_slot
* radix_tree_gang_lookup_tag
+ * radix_tree_gang_lookup_tag_slot
* radix_tree_tagged
*
- * The first 4 functions are able to be called locklessly, using RCU. The
+ * The first 7 functions are able to be called locklessly, using RCU. The
* caller must ensure calls to these functions are made within rcu_read_lock()
* regions. Other readers (lock-free or otherwise) and modifications may be
* running concurrently.
@@ -159,6 +162,9 @@ void *radix_tree_delete(struct radix_tre
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items);
unsigned long radix_tree_next_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
@@ -173,6 +179,10 @@ unsigned int
radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items,
unsigned int tag);
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items,
+ unsigned int tag);
int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
static inline void radix_tree_preload_end(void)
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -345,18 +345,17 @@ EXPORT_SYMBOL(radix_tree_insert);
* Returns: the slot corresponding to the position @index in the
* radix tree @root. This is useful for update-if-exists operations.
*
- * This function cannot be called under rcu_read_lock, it must be
- * excluded from writers, as must the returned slot for subsequent
- * use by radix_tree_deref_slot() and radix_tree_replace slot.
- * Caller must hold tree write locked across slot lookup and
- * replace.
+ * This function can be called under rcu_read_lock iff the slot is not
+ * modified by radix_tree_replace_slot, otherwise it must be called
+ * exclusive from other writers. Any dereference of the slot must be done
+ * using radix_tree_deref_slot.
*/
void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
unsigned int height, shift;
struct radix_tree_node *node, **slot;
- node = root->rnode;
+ node = rcu_dereference(root->rnode);
if (node == NULL)
return NULL;
@@ -376,7 +375,7 @@ void **radix_tree_lookup_slot(struct rad
do {
slot = (struct radix_tree_node **)
(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
- node = *slot;
+ node = rcu_dereference(*slot);
if (node == NULL)
return NULL;
@@ -653,7 +652,7 @@ unsigned long radix_tree_next_hole(struc
EXPORT_SYMBOL(radix_tree_next_hole);
static unsigned int
-__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
unsigned int nr_found = 0;
@@ -687,11 +686,9 @@ __lookup(struct radix_tree_node *slot, v
/* Bottom level: grab some items */
for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
- struct radix_tree_node *node;
index++;
- node = slot->slots[i];
- if (node) {
- results[nr_found++] = rcu_dereference(node);
+ if (slot->slots[i]) {
+ results[nr_found++] = &(slot->slots[i]);
if (nr_found == max_items)
goto out;
}
@@ -745,13 +742,22 @@ radix_tree_gang_lookup(struct radix_tree
ret = 0;
while (ret < max_items) {
- unsigned int nr_found;
+ unsigned int nr_found, slots_found, i;
unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
- nr_found = __lookup(node, results + ret, cur_index,
+ slots_found = __lookup(node, (void ***)results + ret, cur_index,
max_items - ret, &next_index);
+ nr_found = 0;
+ for (i = 0; i < slots_found; i++) {
+ struct radix_tree_node *slot;
+ slot = *(((void ***)results)[ret + i]);
+ if (!slot)
+ continue;
+ results[ret + nr_found] = rcu_dereference(slot);
+ nr_found++;
+ }
ret += nr_found;
if (next_index == 0)
break;
@@ -762,12 +768,71 @@ radix_tree_gang_lookup(struct radix_tree
}
EXPORT_SYMBOL(radix_tree_gang_lookup);
+/**
+ * radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ *
+ * Performs an index-ascending scan of the tree for present items. Places
+ * their slots at *@results and returns the number of items which were
+ * placed at *@results.
+ *
+ * The implementation is naive.
+ *
+ * Like radix_tree_gang_lookup as far as RCU and locking goes. Slots must
+ * be dereferenced with radix_tree_deref_slot, and if using only RCU
+ * protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items)
+{
+ unsigned long max_index;
+ struct radix_tree_node *node;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = (void **)&root->rnode;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
+ unsigned int slots_found;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ slots_found = __lookup(node, results + ret, cur_index,
+ max_items - ret, &next_index);
+ ret += slots_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
/*
* FIXME: the two tag_get()s here should use find_next_bit() instead of
* open-coding the search.
*/
static unsigned int
-__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index, unsigned int tag)
{
unsigned int nr_found = 0;
@@ -797,11 +862,9 @@ __lookup_tag(struct radix_tree_node *slo
unsigned long j = index & RADIX_TREE_MAP_MASK;
for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
- struct radix_tree_node *node;
index++;
if (!tag_get(slot, tag, j))
continue;
- node = slot->slots[j];
/*
* Even though the tag was found set, we need to
* recheck that we have a non-NULL node, because
@@ -812,9 +875,8 @@ __lookup_tag(struct radix_tree_node *slo
* lookup ->slots[x] without a lock (ie. can't
* rely on its value remaining the same).
*/
- if (node) {
- node = rcu_dereference(node);
- results[nr_found++] = node;
+ if (slot->slots[j]) {
+ results[nr_found++] = &(slot->slots[j]);
if (nr_found == max_items)
goto out;
}
@@ -873,13 +935,22 @@ radix_tree_gang_lookup_tag(struct radix_
ret = 0;
while (ret < max_items) {
- unsigned int nr_found;
+ unsigned int nr_found, slots_found, i;
unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
- nr_found = __lookup_tag(node, results + ret, cur_index,
- max_items - ret, &next_index, tag);
+ slots_found = __lookup_tag(node, (void ***)results + ret,
+ cur_index, max_items - ret, &next_index, tag);
+ nr_found = 0;
+ for (i = 0; i < slots_found; i++) {
+ struct radix_tree_node *slot;
+ slot = *(((void ***)results)[ret + i]);
+ if (!slot)
+ continue;
+ results[ret + nr_found] = rcu_dereference(slot);
+ nr_found++;
+ }
ret += nr_found;
if (next_index == 0)
break;
@@ -891,6 +962,67 @@ radix_tree_gang_lookup_tag(struct radix_
EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
/**
+ * radix_tree_gang_lookup_tag_slot - perform multiple slot lookup on a
+ * radix tree based on a tag
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ * @tag: the tag index (< RADIX_TREE_MAX_TAGS)
+ *
+ * Performs an index-ascending scan of the tree for present items which
+ * have the tag indexed by @tag set. Places the slots at *@results and
+ * returns the number of slots which were placed at *@results.
+ */
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items,
+ unsigned int tag)
+{
+ struct radix_tree_node *node;
+ unsigned long max_index;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ /* check the root's tag bit */
+ if (!root_tag_get(root, tag))
+ return 0;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = (void **)&root->rnode;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
+ unsigned int slots_found;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ slots_found = __lookup_tag(node, results + ret,
+ cur_index, max_items - ret, &next_index, tag);
+ ret += slots_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
+
+
+/**
* radix_tree_shrink - shrink height of a radix tree to minimal
* @root radix tree root
*/
--
* [patch 3/6] mm: speculative get page
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
2007-11-11 8:47 ` [patch 1/6] mm: readahead scan lockless Nick Piggin
2007-11-11 8:49 ` [patch 2/6] radix-tree: gang_lookup_slot Nick Piggin
@ 2007-11-11 8:50 ` Nick Piggin
2007-11-12 20:21 ` Christoph Lameter
2007-11-11 8:51 ` [patch 4/6] mm: lockless pagecache lookups Nick Piggin
` (5 subsequent siblings)
8 siblings, 1 reply; 15+ messages in thread
From: Nick Piggin @ 2007-11-11 8:50 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins
Cc: Linux Memory Management List
If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.
This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).
Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.
This patch introduces the core locking protocol to the pagecache
(ie. adds page_cache_get_speculative, and tweaks some update-side
code to make it work).
Signed-off-by: Nick Piggin <npiggin@suse.de>
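Not part of the patch, but a rough sketch of the read-side protocol described
above (patch 4/6 converts find_get_page() along these lines; 'mapping' and
'offset' are assumed to be in scope):

	struct page *page;

	rcu_read_lock();
repeat:
	page = radix_tree_lookup(&mapping->page_tree, offset);	/* 1. find page */
	if (page) {
		if (!page_cache_get_speculative(page))		/* 2. conditional ref */
			goto repeat;
		/* 3. check the page is still in pagecache at this offset */
		if (page != radix_tree_lookup(&mapping->page_tree, offset)) {
			page_cache_release(page);
			goto repeat;
		}
	}
	rcu_read_unlock();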
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -12,6 +12,7 @@
#include <asm/uaccess.h>
#include <linux/gfp.h>
#include <linux/bitops.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -62,6 +63,96 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be run in the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree: this allows
+ * allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with (because put_page is what is used to drop an
+ * invalid speculative reference).
+ *
+ * This forms the core of the lockless pagecache locking protocol, where
+ * the lookup-side (eg. find_get_page) has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. check the page is still in pagecache (if no, goto 1)
+ *
+ * Remove-side that cares about stability of _count (eg. reclaim) has the
+ * following (with tree_lock held for write):
+ * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
+ * B. remove page from pagecache
+ * C. free the page
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before A: in this case, A sees elevated refcount and bails out
+ * - A runs before 2: in this case, 2 sees zero refcount and retries;
+ * subsequently, B will complete and 1 will find no page, causing the
+ * lookup to return NULL.
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * such a re-insertion, depending on order that locks are granted.
+ *
+ * Lookups racing against pagecache insertion isn't a big problem: either 1
+ * will find the page or it will not. Likewise, the old find_get_page could run
+ * either before the insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
+#else
+ if (unlikely(!get_page_unless_zero(page))) {
+ /*
+ * Either the page has been freed, or will be freed.
+ * In either case, retry here and the caller should
+ * do the right thing (see comments above).
+ */
+ return 0;
+ }
+#endif
+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+ return 1;
+}
+
+static inline int page_freeze_refs(struct page *page, int count)
+{
+ return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
+}
+
+static inline void page_unfreeze_refs(struct page *page, int count)
+{
+ VM_BUG_ON(page_count(page) != 0);
+ VM_BUG_ON(count == 0);
+
+ atomic_set(&page->_count, count);
+}
+
#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
#else
@@ -155,14 +246,11 @@ extern void FASTCALL(unlock_page(struct
static inline void __set_page_locked(struct page *page)
{
- /* concurrent access would cause data loss with non-atomic bitop */
- VM_BUG_ON(page_count(page) != 1);
__set_bit(PG_locked, &page->flags);
}
static inline void __clear_page_locked(struct page *page)
{
- VM_BUG_ON(page_count(page) != 1);
__clear_bit(PG_locked, &page->flags);
}
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -374,12 +374,10 @@ static pageout_t pageout(struct page *pa
}
/*
- * Attempt to detach a locked page from its ->mapping. If it is dirty or if
- * someone else has a ref on the page, abort and return 0. If it was
- * successfully detached, return 1. Assumes the caller has a single ref on
- * this page.
+ * Same as remove_mapping, but if the page is removed from the mapping, it
+ * gets returned with a refcount of 0.
*/
-int remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page)
{
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -410,9 +408,9 @@ int remove_mapping(struct address_space
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
- if (unlikely(page_count(page) != 2))
+ if (!page_freeze_refs(page, 2))
goto cannot_free;
- smp_rmb();
+ /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page)))
goto cannot_free;
@@ -421,13 +419,12 @@ int remove_mapping(struct address_space
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
- __put_page(page); /* The pagecache ref */
- return 1;
+ } else {
+ __remove_from_page_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
}
- __remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
+free_it:
return 1;
cannot_free:
@@ -436,6 +433,26 @@ cannot_free:
}
/*
+ * Attempt to detach a locked page from its ->mapping. If it is dirty or if
+ * someone else has a ref on the page, abort and return 0. If it was
+ * successfully detached, return 1. Assumes the caller has a single ref on
+ * this page.
+ */
+int remove_mapping(struct address_space *mapping, struct page *page)
+{
+ if (__remove_mapping(mapping, page)) {
+ /*
+ * Unfreezing the refcount with 1 rather than 2 effectively
+ * drops the pagecache ref for us without requiring another
+ * atomic operation.
+ */
+ page_unfreeze_refs(page, 1);
+ return 1;
+ }
+ return 0;
+}
+
+/*
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
@@ -581,11 +598,18 @@ static unsigned long shrink_page_list(st
if (PagePrivate(page)) {
if (!try_to_release_page(page, sc->gfp_mask))
goto activate_locked;
- if (!mapping && page_count(page) == 1)
- goto free_it;
+ if (!mapping && page_count(page) == 1) {
+ unlock_page(page);
+ if (put_page_testzero(page))
+ goto free_it;
+ else {
+ nr_reclaimed++;
+ continue;
+ }
+ }
}
- if (!mapping || !remove_mapping(mapping, page))
+ if (!mapping || !__remove_mapping(mapping, page))
goto keep_locked;
free_it:
@@ -598,8 +622,10 @@ free_it:
*/
__clear_page_locked(page);
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page))
- __pagevec_release_nonlru(&freed_pvec);
+ if (!pagevec_add(&freed_pvec, page)) {
+ __pagevec_free(&freed_pvec);
+ pagevec_reinit(&freed_pvec);
+ }
continue;
activate_locked:
@@ -613,7 +639,7 @@ keep:
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
- __pagevec_release_nonlru(&freed_pvec);
+ __pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -444,17 +444,23 @@ int add_to_page_cache_locked(struct page
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ page_cache_get(page);
+ page->mapping = mapping;
+ page->index = offset;
+
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (!error) {
- page_cache_get(page);
- page->mapping = mapping;
- page->index = offset;
+ if (likely(!error)) {
mapping->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
+
+ if (unlikely(error)) {
+ page->mapping = NULL;
+ page_cache_release(page);
+ }
}
return error;
}
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -79,18 +79,25 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ page_cache_get(page);
+ SetPageSwapCache(page);
+ set_page_private(page, entry.val);
+
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
- if (!error) {
- page_cache_get(page);
- SetPageSwapCache(page);
- set_page_private(page, entry.val);
+ if (likely(!error)) {
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
+
+ if (unlikely(error)) {
+ set_page_private(page, 0UL);
+ ClearPageSwapCache(page);
+ page_cache_release(page);
+ }
}
return error;
}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -294,6 +294,7 @@ out:
static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
{
+ int expected_count;
void **pslot;
if (!mapping) {
@@ -308,12 +309,18 @@ static int migrate_page_move_mapping(str
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
- if (page_count(page) != 2 + !!PagePrivate(page) ||
+ expected_count = 2 + !!PagePrivate(page);
+ if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
write_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
+ if (!page_freeze_refs(page, expected_count)) {
+ write_unlock_irq(&mapping->tree_lock);
+ return -EAGAIN;
+ }
+
/*
* Now we know that no one else is looking at the page.
*/
@@ -327,6 +334,7 @@ static int migrate_page_move_mapping(str
radix_tree_replace_slot(pslot, newpage);
+ page_unfreeze_refs(page, expected_count);
/*
* Drop cache reference from old page.
* We know this isn't the last reference.
--
* Re: [patch 3/6] mm: speculative get page
2007-11-11 8:50 ` [patch 3/6] mm: speculative get page Nick Piggin
@ 2007-11-12 20:21 ` Christoph Lameter
2007-11-13 0:35 ` Nick Piggin
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2007-11-12 20:21 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins,
Linux Memory Management List
On Sun, 11 Nov 2007, Nick Piggin wrote:
> +static inline int page_freeze_refs(struct page *page, int count)
> +{
> + return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
> +}
> +
> +static inline void page_unfreeze_refs(struct page *page, int count)
> +{
> + VM_BUG_ON(page_count(page) != 0);
> + VM_BUG_ON(count == 0);
> +
> + atomic_set(&page->_count, count);
> +}
Good idea. That avoids another page bit.
> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c
> +++ linux-2.6/mm/migrate.c
> @@ -294,6 +294,7 @@ out:
> static int migrate_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page)
> {
> + int expected_count;
> void **pslot;
>
> if (!mapping) {
> @@ -308,12 +309,18 @@ static int migrate_page_move_mapping(str
> pslot = radix_tree_lookup_slot(&mapping->page_tree,
> page_index(page));
>
> - if (page_count(page) != 2 + !!PagePrivate(page) ||
> + expected_count = 2 + !!PagePrivate(page);
> + if (page_count(page) != expected_count ||
> (struct page *)radix_tree_deref_slot(pslot) != page) {
> write_unlock_irq(&mapping->tree_lock);
> return -EAGAIN;
> }
>
> + if (!page_freeze_refs(page, expected_count)) {
> + write_unlock_irq(&mapping->tree_lock);
> + return -EAGAIN;
> + }
> +
Looks okay but I think you could remove the earlier performance check. We
already modified the page struct by obtaining the page lock so we hold it
exclusively. And the failure rate here is typically very low.
--
* Re: [patch 3/6] mm: speculative get page
2007-11-12 20:21 ` Christoph Lameter
@ 2007-11-13 0:35 ` Nick Piggin
2007-11-13 0:37 ` Christoph Lameter
0 siblings, 1 reply; 15+ messages in thread
From: Nick Piggin @ 2007-11-13 0:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins,
Linux Memory Management List
On Mon, Nov 12, 2007 at 12:21:10PM -0800, Christoph Lameter wrote:
> On Sun, 11 Nov 2007, Nick Piggin wrote:
>
> > +static inline int page_freeze_refs(struct page *page, int count)
> > +{
> > + return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
> > +}
> > +
> > +static inline void page_unfreeze_refs(struct page *page, int count)
> > +{
> > + VM_BUG_ON(page_count(page) != 0);
> > + VM_BUG_ON(count == 0);
> > +
> > + atomic_set(&page->_count, count);
> > +}
>
> Good idea. That avoids another page bit.
Yeah, it's Hugh's good idea. It avoids smp_rmb() in the find_get_page
path as well, which will be helpful at least for things like powerpc
and ia64, if not x86. At one single atomic operation to lookup and take
a reference on a pagecache page, I think it is approaching the fastest
possible implementation ;)
> > Index: linux-2.6/mm/migrate.c
> > ===================================================================
> > --- linux-2.6.orig/mm/migrate.c
> > +++ linux-2.6/mm/migrate.c
> > @@ -294,6 +294,7 @@ out:
> > static int migrate_page_move_mapping(struct address_space *mapping,
> > struct page *newpage, struct page *page)
> > {
> > + int expected_count;
> > void **pslot;
> >
> > if (!mapping) {
> > @@ -308,12 +309,18 @@ static int migrate_page_move_mapping(str
> > pslot = radix_tree_lookup_slot(&mapping->page_tree,
> > page_index(page));
> >
> > - if (page_count(page) != 2 + !!PagePrivate(page) ||
> > + expected_count = 2 + !!PagePrivate(page);
> > + if (page_count(page) != expected_count ||
> > (struct page *)radix_tree_deref_slot(pslot) != page) {
> > write_unlock_irq(&mapping->tree_lock);
> > return -EAGAIN;
> > }
> >
> > + if (!page_freeze_refs(page, expected_count)) {
> > + write_unlock_irq(&mapping->tree_lock);
> > + return -EAGAIN;
> > + }
> > +
>
> Looks okay but I think you could remove the earlier performance check. We
> already modified the page struct by obtaining the page lock so we hold it
> exclusively. And the failure rate here is typically very low.
It's up to you. Honestly, I don't have good test facilities for page
migration. If it's all the same to you, do you mind if we leave it like
this, and then you can change it in future?
I expect the earlier check won't hurt too much either, even if it doesn't
trigger for 99.9% of pages...
--
* Re: [patch 3/6] mm: speculative get page
2007-11-13 0:35 ` Nick Piggin
@ 2007-11-13 0:37 ` Christoph Lameter
2007-11-13 0:51 ` Nick Piggin
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2007-11-13 0:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins,
Linux Memory Management List
On Tue, 13 Nov 2007, Nick Piggin wrote:
> > Good idea. That avoids another page bit.
>
> Yeah, it's Hugh's good idea. It avoids smp_rmb() in the find_get_page
> path as well, which will be helpful at least for things like powerpc
> and ia64, if not x86. At one single atomic operation to lookup and take
> a reference on a pagecache page, I think it is approaching the fastest
> possible implementation ;)
Well I hope all locations that do get_page_unless_zero are aware that a
failure does not mean that the page is being freed.
> > Looks okay but I think you could remove the earlier performance check. We
> > already modified the page struct by obtaining the page lock so we hold it
> > exclusively. And the failure rate here is typically very low.
>
> It's up to you. Honestly, I don't have good test facilities for page
> migration. If it's all the same to you, do you mind if we leave it like
> this, and then you can change it in future?
Ok. Lee will likely get to that ;-).
--
* Re: [patch 3/6] mm: speculative get page
2007-11-13 0:37 ` Christoph Lameter
@ 2007-11-13 0:51 ` Nick Piggin
0 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2007-11-13 0:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins,
Linux Memory Management List
On Mon, Nov 12, 2007 at 04:37:28PM -0800, Christoph Lameter wrote:
> On Tue, 13 Nov 2007, Nick Piggin wrote:
>
> > > Good idea. That avoids another page bit.
> >
> > Yeah, it's Hugh's good idea. It avoids smp_rmb() in the find_get_page
> > path as well, which will be helpful at least for things like powerpc
> > and ia64, if not x86. At one single atomic operation to lookup and take
> > a reference on a pagecache page, I think it is approaching the fastest
> > possible implementation ;)
>
> Well I hope all locations that do get_page_unless_zero are aware that a
> failure does not mean that the page is being freed.
Often it does mean the page will be freed soon anyway, so it fits quite well.
But no, nothing assumes the page will be freed. It is hard to imagine what
the use of such knowledge would be (outside the page allocator).
You can't use it to assume the page is going to be free at any point T+now,
nor can you assume the pagecache coordinate which used to hold the page is
empty.
> > > Looks okay but I think you could remove the earlier performance check. We
> > > already modified the page struct by obtaining the page lock so we hold it
> > > exclusively. And the failure rate here is typically very low.
> >
> > It's up to you. Honestly, I don't have good test facilities for page
> > migration. If it's all the same to you, do you mind if we leave it like
> > this, and then you can change it in future?
>
> Ok. Lee will likely get to that ;-).
OK good, thanks ;)
--
* [patch 4/6] mm: lockless pagecache lookups
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
` (2 preceding siblings ...)
2007-11-11 8:50 ` [patch 3/6] mm: speculative get page Nick Piggin
@ 2007-11-11 8:51 ` Nick Piggin
2007-11-11 8:51 ` [patch 5/6] mm: spinlock tree_lock Nick Piggin
` (4 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2007-11-11 8:51 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins
Cc: Linux Memory Management List
Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on
the read-side).
The only atomicity change this introduces is that the gang pagecache
lookup functions now behave as if they are implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages.
In practice, this atomicity guarantee is not used anyway, and it is
difficult to see how it could be. Gang pagecache lookups are designed
to replace individual lookups, so these semantics are natural.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -601,15 +601,35 @@ void fastcall __lock_page_nosync(struct
* Is there a pagecache struct page at the given (mapping, offset) tuple?
* If yes, increment its refcount and return it; if no, return NULL.
*/
-struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
{
+ void **pagep;
struct page *page;
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+repeat:
+ page = NULL;
+ pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
+ if (pagep) {
+ page = radix_tree_deref_slot(pagep);
+ if (unlikely(!page || page == RADIX_TREE_RETRY))
+ goto repeat;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /*
+ * Has the page moved?
+ * This is part of the lockless pagecache protocol. See
+ * include/linux/pagemap.h for details.
+ */
+ if (unlikely(page != *pagep)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+ }
+ rcu_read_unlock();
+
return page;
}
EXPORT_SYMBOL(find_get_page);
@@ -624,32 +644,22 @@ EXPORT_SYMBOL(find_get_page);
*
* Returns zero if the page was not present. find_lock_page() may sleep.
*/
-struct page *find_lock_page(struct address_space *mapping,
- pgoff_t offset)
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
{
struct page *page;
repeat:
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = find_get_page(mapping, offset);
if (page) {
- page_cache_get(page);
- if (!trylock_page(page)) {
- read_unlock_irq(&mapping->tree_lock);
- __lock_page(page);
-
- /* Has the page been truncated while we slept? */
- if (unlikely(page->mapping != mapping)) {
- unlock_page(page);
- page_cache_release(page);
- goto repeat;
- }
- VM_BUG_ON(page->index != offset);
- goto out;
+ lock_page(page);
+ /* Has the page been truncated? */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
}
+ VM_BUG_ON(page->index != offset);
}
- read_unlock_irq(&mapping->tree_lock);
-out:
return page;
}
EXPORT_SYMBOL(find_lock_page);
@@ -715,13 +725,39 @@ unsigned find_get_pages(struct address_s
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, start, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
- read_unlock_irq(&mapping->tree_lock);
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
return ret;
}
@@ -742,19 +778,44 @@ unsigned find_get_pages_contig(struct ad
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, index, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, index, nr_pages);
- for (i = 0; i < ret; i++) {
- if (pages[i]->mapping == NULL || pages[i]->index != index)
+ if (page->mapping == NULL || page->index != index)
break;
- page_cache_get(pages[i]);
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
index++;
}
- read_unlock_irq(&mapping->tree_lock);
- return i;
+ rcu_read_unlock();
+ return ret;
}
EXPORT_SYMBOL(find_get_pages_contig);
@@ -774,15 +835,43 @@ unsigned find_get_pages_tag(struct addre
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
+
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree,
+ (void ***)pages, *index, nr_pages, tag);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
- (void **)pages, *index, nr_pages, tag);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
if (ret)
*index = pages[ret - 1]->index + 1;
- read_unlock_irq(&mapping->tree_lock);
+
return ret;
}
EXPORT_SYMBOL(find_get_pages_tag);
--
* [patch 5/6] mm: spinlock tree_lock
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
` (3 preceding siblings ...)
2007-11-11 8:51 ` [patch 4/6] mm: lockless pagecache lookups Nick Piggin
@ 2007-11-11 8:51 ` Nick Piggin
2007-11-11 8:52 ` [patch 6/6] mm: speculative refcount debug Nick Piggin
` (3 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2007-11-11 8:51 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins
Cc: Linux Memory Management List
With the lockless lookups in place, mapping->tree_lock no longer has any read
lockers. Convert the lock from an rwlock to a spinlock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -704,7 +704,7 @@ static int __set_page_dirty(struct page
if (TestSetPageDirty(page))
return 0;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -717,7 +717,7 @@ static int __set_page_dirty(struct page
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return 1;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -209,7 +209,7 @@ void inode_init_once(struct inode *inode
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
- rwlock_init(&inode->i_data.tree_lock);
+ spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -497,7 +497,7 @@ struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
- rwlock_t tree_lock; /* and rwlock protecting it */
+ spinlock_t tree_lock; /* and lock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -113,7 +113,7 @@ generic_file_direct_IO(int rw, struct ki
/*
* Remove a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
- * is safe. The caller must hold a write_lock on the mapping's tree_lock.
+ * is safe. The caller must hold the mapping's tree_lock.
*/
void __remove_from_page_cache(struct page *page)
{
@@ -132,9 +132,9 @@ void remove_from_page_cache(struct page
BUG_ON(!PageLocked(page));
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}
static int sync_page(void *word)
@@ -448,13 +448,13 @@ int add_to_page_cache_locked(struct page
page->mapping = mapping;
page->index = offset;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (likely(!error)) {
mapping->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
if (unlikely(error)) {
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -38,7 +38,7 @@ static struct backing_dev_info swap_back
struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
+ .tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
@@ -83,14 +83,14 @@ static int __add_to_swap_cache(struct pa
SetPageSwapCache(page);
set_page_private(page, entry.val);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (likely(!error)) {
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
if (unlikely(error)) {
@@ -209,9 +209,9 @@ void delete_from_swap_cache(struct page
entry.val = page_private(page);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
swap_free(entry);
page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -367,13 +367,13 @@ int remove_exclusive_swap_page(struct pa
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
}
spin_unlock(&swap_lock);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -350,18 +350,18 @@ invalidate_complete_page2(struct address
if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page))
goto failed;
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -382,7 +382,7 @@ static int __remove_mapping(struct addre
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
*
@@ -417,18 +417,18 @@ static int __remove_mapping(struct addre
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
swap_free(swap);
} else {
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}
free_it:
return 1;
cannot_free:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -1000,7 +1000,7 @@ int __set_page_dirty_nobuffers(struct pa
if (!mapping)
return 1;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -1014,7 +1014,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -1170,7 +1170,7 @@ int test_clear_page_writeback(struct pag
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
if (ret) {
radix_tree_tag_clear(&mapping->page_tree,
@@ -1181,7 +1181,7 @@ int test_clear_page_writeback(struct pag
__bdi_writeout_inc(bdi);
}
}
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
}
@@ -1199,7 +1199,7 @@ int test_set_page_writeback(struct page
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret) {
radix_tree_tag_set(&mapping->page_tree,
@@ -1212,7 +1212,7 @@ int test_set_page_writeback(struct page
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestSetPageWriteback(page);
}
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -413,9 +413,9 @@ static inline void flush_anon_page(struc
}
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_user_range(vma,page,addr,len) \
flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -45,9 +45,9 @@ void flush_cache_mm(struct mm_struct *mm
extern void flush_dcache_page(struct page *page);
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_page(vma,page) do { \
flush_kernel_dcache_page(page); \
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -304,7 +304,7 @@ static int migrate_page_move_mapping(str
return 0;
}
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
@@ -312,12 +312,12 @@ static int migrate_page_move_mapping(str
expected_count = 2 + !!PagePrivate(page);
if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
if (!page_freeze_refs(page, expected_count)) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -354,7 +354,7 @@ static int migrate_page_move_mapping(str
__dec_zone_page_state(page, NR_FILE_PAGES);
__inc_zone_page_state(newpage, NR_FILE_PAGES);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
--
* [patch 6/6] mm: speculative refcount debug
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
` (4 preceding siblings ...)
2007-11-11 8:51 ` [patch 5/6] mm: spinlock tree_lock Nick Piggin
@ 2007-11-11 8:52 ` Nick Piggin
2007-11-17 9:48 ` [patch 0/6] lockless pagecache Peter Zijlstra
` (2 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2007-11-11 8:52 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins
Cc: Linux Memory Management List
Add some debugging for lockless pagecache.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -236,9 +236,30 @@ static inline struct page *compound_head
return page;
}
+#ifdef CONFIG_DEBUG_VM
+extern int ll_counter;
+#endif
+
static inline int page_count(struct page *page)
{
+#ifndef CONFIG_DEBUG_VM
return atomic_read(&compound_head(page)->_count);
+#else
+
+ int count;
+ /*
+ * debug testing for lockless pagecache. add a random value to
+ * page_count every now and then, to simulate speculative references
+ * to it.
+ */
+ count = atomic_read(&compound_head(page)->_count);
+ if (count) {
+ ll_counter++;
+ if (ll_counter % 5 == 0 || ll_counter % 7 == 0)
+ count += ll_counter % 11;
+ }
+ return count;
+#endif
}
static inline void get_page(struct page *page)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -173,6 +173,8 @@ static void set_pageblock_migratetype(st
}
#ifdef CONFIG_DEBUG_VM
+int ll_counter; /* used in include/linux/mm.h, for lockless pagecache */
+EXPORT_SYMBOL(ll_counter);
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
int ret = 0;
--
* Re: [patch 0/6] lockless pagecache
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
` (5 preceding siblings ...)
2007-11-11 8:52 ` [patch 6/6] mm: speculative refcount debug Nick Piggin
@ 2007-11-17 9:48 ` Peter Zijlstra
2007-11-17 20:16 ` Hugh Dickins
2008-01-09 15:45 ` Peter Zijlstra
8 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2007-11-17 9:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Hugh Dickins,
Linux Memory Management List
On Sun, 2007-11-11 at 09:45 +0100, Nick Piggin wrote:
> Hi,
>
> I wonder what everyone thinks about getting the lockless pagecache patch
> into -mm? This version uses Hugh's suggestion to avoid a smp_rmb and a load
> and branch in the lockless lookup side, and avoids some atomic ops in the
> reclaim path, and avoids using a page flag! The coolest thing about it is
> that it speeds up single-threaded pagecache lookups...
>
> Patches are against latest git for RFC.
Full set
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
* Re: [patch 0/6] lockless pagecache
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
` (6 preceding siblings ...)
2007-11-17 9:48 ` [patch 0/6] lockless pagecache Peter Zijlstra
@ 2007-11-17 20:16 ` Hugh Dickins
2007-11-19 22:58 ` Nick Piggin
2008-01-09 15:45 ` Peter Zijlstra
8 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2007-11-17 20:16 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra,
Linux Memory Management List
On Sun, 11 Nov 2007, Nick Piggin wrote:
>
> I wonder what everyone thinks about getting the lockless pagecache patch
> into -mm? This version uses Hugh's suggestion to avoid a smp_rmb and a load
> and branch in the lockless lookup side, and avoids some atomic ops in the
> reclaim path, and avoids using a page flag! The coolest thing about it is
> that it speeds up single-threaded pagecache lookups...
I've liked this in the past, with the exception of PageNoNewRefs which
seemed an unnecessary ugliness. Now you've eliminated that, thank you,
I expect I should like it through and through (if I actually found time
to redigest it). A moment came up and I thought I'd give it a spin...
> Patches are against latest git for RFC.
... but they're not. You seem to have descended into sending out
"cleanup" patches at intervals, and recursive dependence upon them.
This set relies on there being something called __set_page_locked()
in include/linux/pagemap.h, but there isn't in latest git (nor mm).
Ah, you posted a patch earlier which introduced that, but it relies on
there being something called set_page_locked() in include/linux/pagemap.h,
but there isn't in latest git (nor mm). Ah, you posted a patch earlier
which introduced that ... I gave up at this point.
We've all got lots of other things to do, please make it easier.
Hugh
--
* Re: [patch 0/6] lockless pagecache
2007-11-17 20:16 ` Hugh Dickins
@ 2007-11-19 22:58 ` Nick Piggin
0 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2007-11-19 22:58 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra,
Linux Memory Management List
On Sat, Nov 17, 2007 at 08:16:18PM +0000, Hugh Dickins wrote:
> On Sun, 11 Nov 2007, Nick Piggin wrote:
> >
> > I wonder what everyone thinks about getting the lockless pagecache patch
> > into -mm? This version uses Hugh's suggestion to avoid a smp_rmb and a load
> > and branch in the lockless lookup side, and avoids some atomic ops in the
> > reclaim path, and avoids using a page flag! The coolest thing about it is
> > that it speeds up single-threaded pagecache lookups...
>
> I've liked this in the past, with the exception of PageNoNewRefs which
> seemed an unnecessary ugliness. Now you've eliminated that, thank you,
> I expect I should like it through and through (if I actually found time
> to redigest it). A moment came up and I thought I'd give it a spin...
Yeah I decided it is actually just as good or better at its job --
neither really protects against an errant get_page() or put_page(),
however at least this scheme will go BUG with CONFIG_DEBUG_VM, as opposed
to the current or old lockless schemes (which I guess will silently allow
it).
> > Patches are against latest git for RFC.
>
> ... but they're not. You seem to have descended into sending out
> "cleanup" patches at intervals, and recursive dependence upon them.
> This set relies on there being something called __set_page_locked()
> in include/linux/pagemap.h, but there isn't in latest git (nor mm).
> Ah, you posted a patch earlier which introduced that, but it relies on
> there being something called set_page_locked() in include/linux/pagemap.h,
> but there isn't in latest git (nor mm). Ah, you posted a patch earlier
> which introduced that ... I gave up at this point.
>
> We've all got lots of other things to do, please make it easier.
Sorry, I honestly didn't pay enough attention there because I didn't
think anybody would run it! I'll update it and resend.
--
* Re: [patch 0/6] lockless pagecache
2007-11-11 8:45 [patch 0/6] lockless pagecache Nick Piggin
` (7 preceding siblings ...)
2007-11-17 20:16 ` Hugh Dickins
@ 2008-01-09 15:45 ` Peter Zijlstra
8 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2008-01-09 15:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Hugh Dickins,
Linux Memory Management List
On Sun, 2007-11-11 at 09:45 +0100, Nick Piggin wrote:
> Hi,
>
> I wonder what everyone thinks about getting the lockless pagecache patch
> into -mm? This version uses Hugh's suggestion to avoid a smp_rmb and a load
> and branch in the lockless lookup side, and avoids some atomic ops in the
> reclaim path, and avoids using a page flag! The coolest thing about it is
> that it speeds up single-threaded pagecache lookups...
>
> Patches are against latest git for RFC.
How are we doing with this?
--