* [PATCH 01/16] radix-tree: RCU lockless readside
From: Nick Piggin @ 2006-12-07 16:18 UTC
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: radix-tree-rcu-lockless-readside.patch --]
[-- Type: text/plain, Size: 25624 bytes --]
Make radix tree lookups safe to be performed without locks. Readers are
protected against nodes being deleted by using RCU based freeing. Readers
are protected against new node insertion by using memory barriers to ensure
the node itself will be properly written before it is visible in the radix
tree.
Each radix tree node keeps a record of its height (above the leaf
nodes). This height does not change after insertion -- when the radix
tree is extended, higher nodes are only inserted at the top. So a lookup can take
the pointer to what is *now* the root node, and traverse down it even if
the tree is concurrently extended and this node becomes a subtree of a new
root.
"Direct" pointers (tree height of 0, where root->rnode points directly to
the data item) are handled by using the low bit of the pointer to signal
whether rnode is a direct pointer or a pointer to a radix tree node.
When a reader wants to traverse the next branch, it takes a copy of the
pointer. This pointer will be either NULL (the branch is empty) or
non-NULL (and will point to a valid node).
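To illustrate the resulting read-side protocol, here is a minimal usage
sketch (not part of the patch; take_ref() is a hypothetical per-item
refcount helper, standing in for whatever stabilisation the caller uses):

	rcu_read_lock();
	item = radix_tree_lookup(&root, index);
	if (item && !take_ref(item))	/* hypothetical: pin the item */
		item = NULL;		/* raced with deletion and free */
	rcu_read_unlock();
	/* if non-NULL, item is now pinned and safe to use */

The lookup itself takes no locks; RCU only guarantees that nodes and
items are not freed while inside the read-side section, so the caller
must pin the item before rcu_read_unlock().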
[akpm@osdl.org: cleanups]
[Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
[clameter@sgi.com: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---
include/linux/radix-tree.h | 101 +++++++++++++
lib/radix-tree.c | 327 +++++++++++++++++++++++++++++++--------------
mm/migrate.c | 19 +-
3 files changed, 340 insertions(+), 107 deletions(-)
Index: linux-2.6-rt/include/linux/radix-tree.h
===================================================================
--- linux-2.6-rt.orig/include/linux/radix-tree.h 2006-11-29 14:20:37.000000000 +0100
+++ linux-2.6-rt/include/linux/radix-tree.h 2006-11-29 14:20:39.000000000 +0100
@@ -1,6 +1,7 @@
/*
* Copyright (C) 2001 Momchil Velikov
* Portions Copyright (C) 2001 Christoph Hellwig
+ * Copyright (C) 2006 Nick Piggin
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
@@ -22,6 +23,35 @@
#include <linux/sched.h>
#include <linux/preempt.h>
#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/rcupdate.h>
+
+/*
+ * A direct pointer (root->rnode pointing directly to a data item,
+ * rather than another radix_tree_node) is signalled by the low bit
+ * set in the root->rnode pointer.
+ *
+ * In this case root->height is also zero, but the direct pointer tests are
+ * needed for RCU lookups when root->height is unreliable.
+ */
+#define RADIX_TREE_DIRECT_PTR 1
+
+static inline void *radix_tree_ptr_to_direct(void *ptr)
+{
+ return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
+}
+
+static inline void *radix_tree_direct_to_ptr(void *ptr)
+{
+ return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
+}
+
+static inline int radix_tree_is_direct_ptr(void *ptr)
+{
+ return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
+}
+
+/*** radix-tree API starts here ***/
#define RADIX_TREE_MAX_TAGS 2
@@ -48,6 +78,77 @@ do { \
(root)->rnode = NULL; \
} while (0)
+/**
+ * Radix-tree synchronization
+ *
+ * The radix-tree API requires that users provide all synchronisation (with
+ * specific exceptions, noted below).
+ *
+ * Synchronization of access to the data items being stored in the tree, and
+ * management of their lifetimes must be completely managed by API users.
+ *
+ * For API usage, in general,
+ * - any function _modifying_ the tree or tags (inserting or deleting
+ * items, setting or clearing tags) must exclude other modifications, and
+ * exclude any functions reading the tree.
+ * - any function _reading_ the tree or tags (looking up items or tags,
+ * gang lookups) must exclude modifications to the tree, but may occur
+ * concurrently with other readers.
+ *
+ * The notable exceptions to this rule are the following functions:
+ * radix_tree_lookup
+ * radix_tree_tag_get
+ * radix_tree_gang_lookup
+ * radix_tree_gang_lookup_tag
+ * radix_tree_tagged
+ *
+ * The first 4 functions are able to be called locklessly, using RCU. The
+ * caller must ensure calls to these functions are made within rcu_read_lock()
+ * regions. Other readers (lock-free or otherwise) and modifications may be
+ * running concurrently.
+ *
+ * It is still required that the caller manage the synchronization and lifetimes
+ * of the items. So if RCU lock-free lookups are used, typically this would mean
+ * that the items have their own locks, or are amenable to lock-free access; and
+ * that the items are freed by RCU (or only freed after having been deleted from
+ * the radix tree *and* a synchronize_rcu() grace period).
+ *
+ * (Note, rcu_assign_pointer and rcu_dereference are not needed to control
+ * access to data items when inserting into or looking up from the radix tree)
+ *
+ * radix_tree_tagged is able to be called without locking or RCU.
+ */
+
+/**
+ * radix_tree_deref_slot - dereference a slot
+ * @pslot: pointer to slot, returned by radix_tree_lookup_slot
+ * Returns: item that was stored in that slot with any direct pointer flag
+ * removed.
+ *
+ * For use with radix_tree_lookup_slot(). Caller must hold tree at least read
+ * locked across slot lookup and dereference. More likely, will be used with
+ * radix_tree_replace_slot(), as well, so caller will hold tree write locked.
+ */
+static inline void *radix_tree_deref_slot(void **pslot)
+{
+ return radix_tree_direct_to_ptr(*pslot);
+}
+/**
+ * radix_tree_replace_slot - replace item in a slot
+ * @pslot: pointer to slot, returned by radix_tree_lookup_slot
+ * @item: new item to store in the slot.
+ *
+ * For use with radix_tree_lookup_slot(). Caller must hold tree write locked
+ * across slot lookup and replacement.
+ */
+static inline void radix_tree_replace_slot(void **pslot, void *item)
+{
+ BUG_ON(radix_tree_is_direct_ptr(item));
+ rcu_assign_pointer(*pslot,
+ (void *)((unsigned long)item |
+ ((unsigned long)*pslot & RADIX_TREE_DIRECT_PTR)));
+}
+
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
Index: linux-2.6-rt/lib/radix-tree.c
===================================================================
--- linux-2.6-rt.orig/lib/radix-tree.c 2006-11-29 14:20:37.000000000 +0100
+++ linux-2.6-rt/lib/radix-tree.c 2006-11-29 14:20:39.000000000 +0100
@@ -2,6 +2,7 @@
* Copyright (C) 2001 Momchil Velikov
* Portions Copyright (C) 2001 Christoph Hellwig
* Copyright (C) 2005 SGI, Christoph Lameter <clameter@sgi.com>
+ * Copyright (C) 2006 Nick Piggin
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
@@ -30,6 +31,7 @@
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/bitops.h>
+#include <linux/rcupdate.h>
#ifdef __KERNEL__
@@ -45,7 +47,9 @@
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
struct radix_tree_node {
+ unsigned int height; /* Height from the bottom */
unsigned int count;
+ struct rcu_head rcu_head;
void *slots[RADIX_TREE_MAP_SIZE];
unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};
@@ -100,13 +104,21 @@ radix_tree_node_alloc(struct radix_tree_
rtp->nr--;
}
}
+ BUG_ON(radix_tree_is_direct_ptr(ret));
return ret;
}
+static void radix_tree_node_rcu_free(struct rcu_head *head)
+{
+ struct radix_tree_node *node =
+ container_of(head, struct radix_tree_node, rcu_head);
+ kmem_cache_free(radix_tree_node_cachep, node);
+}
+
static inline void
radix_tree_node_free(struct radix_tree_node *node)
{
- kmem_cache_free(radix_tree_node_cachep, node);
+ call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
}
#ifndef CONFIG_PREEMPT_RT
@@ -225,11 +237,12 @@ static int radix_tree_extend(struct radi
}
do {
+ unsigned int newheight;
if (!(node = radix_tree_node_alloc(root)))
return -ENOMEM;
/* Increase the height. */
- node->slots[0] = root->rnode;
+ node->slots[0] = radix_tree_direct_to_ptr(root->rnode);
/* Propagate the aggregated tag info into the new root */
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
@@ -237,9 +250,11 @@ static int radix_tree_extend(struct radi
tag_set(node, tag, 0);
}
+ newheight = root->height+1;
+ node->height = newheight;
node->count = 1;
- root->rnode = node;
- root->height++;
+ rcu_assign_pointer(root->rnode, node);
+ root->height = newheight;
} while (height > root->height);
out:
return 0;
@@ -261,6 +276,8 @@ int radix_tree_insert(struct radix_tree_
int offset;
int error;
+ BUG_ON(radix_tree_is_direct_ptr(item));
+
/* Make sure the tree is high enough. */
if (index > radix_tree_maxindex(root->height)) {
error = radix_tree_extend(root, index);
@@ -278,11 +295,12 @@ int radix_tree_insert(struct radix_tree_
/* Have to add a child node. */
if (!(slot = radix_tree_node_alloc(root)))
return -ENOMEM;
+ slot->height = height;
if (node) {
- node->slots[offset] = slot;
+ rcu_assign_pointer(node->slots[offset], slot);
node->count++;
} else
- root->rnode = slot;
+ rcu_assign_pointer(root->rnode, slot);
}
/* Go a level down */
@@ -298,11 +316,11 @@ int radix_tree_insert(struct radix_tree_
if (node) {
node->count++;
- node->slots[offset] = item;
+ rcu_assign_pointer(node->slots[offset], item);
BUG_ON(tag_get(node, 0, offset));
BUG_ON(tag_get(node, 1, offset));
} else {
- root->rnode = item;
+ rcu_assign_pointer(root->rnode, radix_tree_ptr_to_direct(item));
BUG_ON(root_tag_get(root, 0));
BUG_ON(root_tag_get(root, 1));
}
@@ -311,49 +329,54 @@ int radix_tree_insert(struct radix_tree_
}
EXPORT_SYMBOL(radix_tree_insert);
-static inline void **__lookup_slot(struct radix_tree_root *root,
- unsigned long index)
+/**
+ * radix_tree_lookup_slot - lookup a slot in a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Returns: the slot corresponding to the position @index in the
+ * radix tree @root. This is useful for update-if-exists operations.
+ *
+ * This function cannot be called under rcu_read_lock, it must be
+ * excluded from writers, as must the returned slot for subsequent
+ * use by radix_tree_deref_slot() and radix_tree_replace_slot().
+ * Caller must hold tree write locked across slot lookup and
+ * replace.
+ */
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
unsigned int height, shift;
- struct radix_tree_node **slot;
-
- height = root->height;
+ struct radix_tree_node *node, **slot;
- if (index > radix_tree_maxindex(height))
+ node = root->rnode;
+ if (node == NULL)
return NULL;
- if (height == 0 && root->rnode)
+ if (radix_tree_is_direct_ptr(node)) {
+ if (index > 0)
+ return NULL;
return (void **)&root->rnode;
+ }
+
+ height = node->height;
+ if (index > radix_tree_maxindex(height))
+ return NULL;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = &root->rnode;
- while (height > 0) {
- if (*slot == NULL)
+ do {
+ slot = (struct radix_tree_node **)
+ (node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+ node = *slot;
+ if (node == NULL)
return NULL;
- slot = (struct radix_tree_node **)
- ((*slot)->slots +
- ((index >> shift) & RADIX_TREE_MAP_MASK));
shift -= RADIX_TREE_MAP_SHIFT;
height--;
- }
+ } while (height > 0);
return (void **)slot;
}
-
-/**
- * radix_tree_lookup_slot - lookup a slot in a radix tree
- * @root: radix tree root
- * @index: index key
- *
- * Lookup the slot corresponding to the position @index in the radix tree
- * @root. This is useful for update-if-exists operations.
- */
-void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
-{
- return __lookup_slot(root, index);
-}
EXPORT_SYMBOL(radix_tree_lookup_slot);
/**
@@ -362,13 +385,45 @@ EXPORT_SYMBOL(radix_tree_lookup_slot);
* @index: index key
*
* Lookup the item at the position @index in the radix tree @root.
+ *
+ * This function can be called under rcu_read_lock; however, the caller
+ * must manage the lifetimes of leaf nodes (eg. RCU may also be used to
+ * free them safely). No RCU barriers are required to access or modify
+ * the returned item.
*/
void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
{
- void **slot;
+ unsigned int height, shift;
+ struct radix_tree_node *node, **slot;
+
+ node = rcu_dereference(root->rnode);
+ if (node == NULL)
+ return NULL;
+
+ if (radix_tree_is_direct_ptr(node)) {
+ if (index > 0)
+ return NULL;
+ return radix_tree_direct_to_ptr(node);
+ }
+
+ height = node->height;
+ if (index > radix_tree_maxindex(height))
+ return NULL;
+
+ shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = __lookup_slot(root, index);
- return slot != NULL ? *slot : NULL;
+ do {
+ slot = (struct radix_tree_node **)
+ (node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+ node = rcu_dereference(*slot);
+ if (node == NULL)
+ return NULL;
+
+ shift -= RADIX_TREE_MAP_SHIFT;
+ height--;
+ } while (height > 0);
+
+ return node;
}
EXPORT_SYMBOL(radix_tree_lookup);
@@ -498,27 +553,30 @@ int radix_tree_tag_get(struct radix_tree
unsigned long index, unsigned int tag)
{
unsigned int height, shift;
- struct radix_tree_node *slot;
+ struct radix_tree_node *node;
int saw_unset_tag = 0;
- height = root->height;
- if (index > radix_tree_maxindex(height))
- return 0;
-
/* check the root's tag bit */
if (!root_tag_get(root, tag))
return 0;
- if (height == 0)
- return 1;
+ node = rcu_dereference(root->rnode);
+ if (node == NULL)
+ return 0;
+
+ if (radix_tree_is_direct_ptr(node))
+ return (index == 0);
+
+ height = node->height;
+ if (index > radix_tree_maxindex(height))
+ return 0;
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
- slot = root->rnode;
for ( ; ; ) {
int offset;
- if (slot == NULL)
+ if (node == NULL)
return 0;
offset = (index >> shift) & RADIX_TREE_MAP_MASK;
@@ -527,15 +585,15 @@ int radix_tree_tag_get(struct radix_tree
* This is just a debug check. Later, we can bale as soon as
* we see an unset tag.
*/
- if (!tag_get(slot, tag, offset))
+ if (!tag_get(node, tag, offset))
saw_unset_tag = 1;
if (height == 1) {
- int ret = tag_get(slot, tag, offset);
+ int ret = tag_get(node, tag, offset);
BUG_ON(ret && saw_unset_tag);
return !!ret;
}
- slot = slot->slots[offset];
+ node = rcu_dereference(node->slots[offset]);
shift -= RADIX_TREE_MAP_SHIFT;
height--;
}
@@ -544,47 +602,45 @@ EXPORT_SYMBOL(radix_tree_tag_get);
#endif
static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
unsigned int nr_found = 0;
unsigned int shift, height;
- struct radix_tree_node *slot;
unsigned long i;
- height = root->height;
- if (height == 0) {
- if (root->rnode && index == 0)
- results[nr_found++] = root->rnode;
+ height = slot->height;
+ if (height == 0)
goto out;
- }
-
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = root->rnode;
for ( ; height > 1; height--) {
-
- for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
- i < RADIX_TREE_MAP_SIZE; i++) {
+ i = (index >> shift) & RADIX_TREE_MAP_MASK;
+ for (;;) {
if (slot->slots[i] != NULL)
break;
index &= ~((1UL << shift) - 1);
index += 1UL << shift;
if (index == 0)
goto out; /* 32-bit wraparound */
+ i++;
+ if (i == RADIX_TREE_MAP_SIZE)
+ goto out;
}
- if (i == RADIX_TREE_MAP_SIZE)
- goto out;
shift -= RADIX_TREE_MAP_SHIFT;
- slot = slot->slots[i];
+ slot = rcu_dereference(slot->slots[i]);
+ if (slot == NULL)
+ goto out;
}
/* Bottom level: grab some items */
for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
+ struct radix_tree_node *node;
index++;
- if (slot->slots[i]) {
- results[nr_found++] = slot->slots[i];
+ node = slot->slots[i];
+ if (node) {
+ results[nr_found++] = rcu_dereference(node);
if (nr_found == max_items)
goto out;
}
@@ -606,28 +662,51 @@ out:
* *@results.
*
* The implementation is naive.
+ *
+ * Like radix_tree_lookup, radix_tree_gang_lookup may be called under
+ * rcu_read_lock. In this case, rather than the returned results being
+ * an atomic snapshot of the tree at a single point in time, the semantics
+ * of an RCU protected gang lookup are as though multiple radix_tree_lookups
+ * have been issued individually, with the results stored in 'results'.
*/
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items)
{
- const unsigned long max_index = radix_tree_maxindex(root->height);
+ unsigned long max_index;
+ struct radix_tree_node *node;
unsigned long cur_index = first_index;
- unsigned int ret = 0;
+ unsigned int ret;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+ if (radix_tree_is_direct_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ node = radix_tree_direct_to_ptr(node);
+ results[0] = rcu_dereference(node);
+ return 1;
+ }
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
while (ret < max_items) {
unsigned int nr_found;
unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
- nr_found = __lookup(root, results + ret, cur_index,
+ nr_found = __lookup(node, results + ret, cur_index,
max_items - ret, &next_index);
ret += nr_found;
if (next_index == 0)
break;
cur_index = next_index;
}
+
return ret;
}
EXPORT_SYMBOL(radix_tree_gang_lookup);
@@ -637,55 +716,64 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
* open-coding the search.
*/
static unsigned int
-__lookup_tag(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
unsigned int max_items, unsigned long *next_index, unsigned int tag)
{
unsigned int nr_found = 0;
- unsigned int shift;
- unsigned int height = root->height;
- struct radix_tree_node *slot;
+ unsigned int shift, height;
- if (height == 0) {
- if (root->rnode && index == 0)
- results[nr_found++] = root->rnode;
+ height = slot->height;
+ if (height == 0)
goto out;
- }
-
- shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
- slot = root->rnode;
+ shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- do {
- unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
+ while (height > 0) {
+ unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ;
- for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
- if (tag_get(slot, tag, i)) {
- BUG_ON(slot->slots[i] == NULL);
+ for (;;) {
+ if (tag_get(slot, tag, i))
break;
- }
index &= ~((1UL << shift) - 1);
index += 1UL << shift;
if (index == 0)
goto out; /* 32-bit wraparound */
+ i++;
+ if (i == RADIX_TREE_MAP_SIZE)
+ goto out;
}
- if (i == RADIX_TREE_MAP_SIZE)
- goto out;
height--;
if (height == 0) { /* Bottom level: grab some items */
unsigned long j = index & RADIX_TREE_MAP_MASK;
for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
+ struct radix_tree_node *node;
index++;
- if (tag_get(slot, tag, j)) {
- BUG_ON(slot->slots[j] == NULL);
- results[nr_found++] = slot->slots[j];
+ if (!tag_get(slot, tag, j))
+ continue;
+ node = slot->slots[j];
+ /*
+ * Even though the tag was found set, we need to
+ * recheck that we have a non-NULL node, because
+ * if this lookup is lockless, it may have been
+ * subsequently deleted.
+ *
+ * Similar care must be taken in any place that
+ * looks up ->slots[x] without a lock (ie. it can't
+ * rely on its value remaining the same).
+ */
+ if (node) {
+ node = rcu_dereference(node);
+ results[nr_found++] = node;
if (nr_found == max_items)
goto out;
}
}
}
shift -= RADIX_TREE_MAP_SHIFT;
- slot = slot->slots[i];
- } while (height > 0);
+ slot = rcu_dereference(slot->slots[i]);
+ if (slot == NULL)
+ break;
+ }
out:
*next_index = index;
return nr_found;
@@ -709,27 +797,44 @@ radix_tree_gang_lookup_tag(struct radix_
unsigned long first_index, unsigned int max_items,
unsigned int tag)
{
- const unsigned long max_index = radix_tree_maxindex(root->height);
+ struct radix_tree_node *node;
+ unsigned long max_index;
unsigned long cur_index = first_index;
- unsigned int ret = 0;
+ unsigned int ret;
/* check the root's tag bit */
if (!root_tag_get(root, tag))
return 0;
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (radix_tree_is_direct_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ node = radix_tree_direct_to_ptr(node);
+ results[0] = rcu_dereference(node);
+ return 1;
+ }
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
while (ret < max_items) {
unsigned int nr_found;
unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
- nr_found = __lookup_tag(root, results + ret, cur_index,
+ nr_found = __lookup_tag(node, results + ret, cur_index,
max_items - ret, &next_index, tag);
ret += nr_found;
if (next_index == 0)
break;
cur_index = next_index;
}
+
return ret;
}
EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
@@ -745,8 +850,19 @@ static inline void radix_tree_shrink(str
root->rnode->count == 1 &&
root->rnode->slots[0]) {
struct radix_tree_node *to_free = root->rnode;
+ void *newptr;
- root->rnode = to_free->slots[0];
+ /*
+ * We don't need rcu_assign_pointer(), since we are simply
+ * moving the node from one part of the tree to another. If
+ * it was safe to dereference the old pointer to it
+ * (to_free->slots[0]), it will be safe to dereference the new
+ * one (root->rnode).
+ */
+ newptr = to_free->slots[0];
+ if (root->height == 1)
+ newptr = radix_tree_ptr_to_direct(newptr);
+ root->rnode = newptr;
root->height--;
/* must only free zeroed nodes into the slab */
tag_clear(to_free, 0, 0);
@@ -770,6 +886,7 @@ void *radix_tree_delete(struct radix_tre
{
struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
struct radix_tree_node *slot = NULL;
+ struct radix_tree_node *to_free;
unsigned int height, shift;
int tag;
int offset;
@@ -780,6 +897,7 @@ void *radix_tree_delete(struct radix_tre
slot = root->rnode;
if (height == 0 && root->rnode) {
+ slot = radix_tree_direct_to_ptr(slot);
root_tag_clear_all(root);
root->rnode = NULL;
goto out;
@@ -812,10 +930,17 @@ void *radix_tree_delete(struct radix_tre
radix_tree_tag_clear(root, index, tag);
}
+ to_free = NULL;
/* Now free the nodes we do not need anymore */
while (pathp->node) {
pathp->node->slots[pathp->offset] = NULL;
pathp->node->count--;
+ /*
+ * Queue the node for deferred freeing after the
+ * last reference to it disappears (set NULL, above).
+ */
+ if (to_free)
+ radix_tree_node_free(to_free);
if (pathp->node->count) {
if (pathp->node == root->rnode)
@@ -824,13 +949,15 @@ void *radix_tree_delete(struct radix_tre
}
/* Node with zero slots in use so free it */
- radix_tree_node_free(pathp->node);
-
+ to_free = pathp->node;
pathp--;
+
}
root_tag_clear_all(root);
root->height = 0;
root->rnode = NULL;
+ if (to_free)
+ radix_tree_node_free(to_free);
out:
return slot;
Index: linux-2.6-rt/mm/migrate.c
===================================================================
--- linux-2.6-rt.orig/mm/migrate.c 2006-11-29 14:20:37.000000000 +0100
+++ linux-2.6-rt/mm/migrate.c 2006-11-29 14:20:39.000000000 +0100
@@ -294,7 +294,7 @@ out:
static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
{
- struct page **radix_pointer;
+ void **pslot;
if (!mapping) {
/* Anonymous page */
@@ -305,12 +305,11 @@ static int migrate_page_move_mapping(str
write_lock_irq(&mapping->tree_lock);
- radix_pointer = (struct page **)radix_tree_lookup_slot(
- &mapping->page_tree,
- page_index(page));
+ pslot = radix_tree_lookup_slot(&mapping->page_tree,
+ page_index(page));
if (page_count(page) != 2 + !!PagePrivate(page) ||
- *radix_pointer != page) {
+ (struct page *)radix_tree_deref_slot(pslot) != page) {
write_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -318,7 +317,7 @@ static int migrate_page_move_mapping(str
/*
* Now we know that no one else is looking at the page.
*/
- get_page(newpage);
+ get_page(newpage); /* add cache reference */
#ifdef CONFIG_SWAP
if (PageSwapCache(page)) {
SetPageSwapCache(newpage);
@@ -326,8 +325,14 @@ static int migrate_page_move_mapping(str
}
#endif
- *radix_pointer = newpage;
+ radix_tree_replace_slot(pslot, newpage);
+
+ /*
+ * Drop cache reference from old page.
+ * We know this isn't the last reference.
+ */
__put_page(page);
+
write_unlock_irq(&mapping->tree_lock);
return 0;
* [PATCH 02/16] radix-tree: use indirect bit
From: Nick Piggin @ 2006-12-07 16:18 UTC
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: radix-tree-use-indirect-bit.patch --]
[-- Type: text/plain, Size: 10133 bytes --]
Rather than sign direct radix-tree pointers with a special bit, sign
the indirect one that hangs off the root. This means that, given a
lookup_slot operation, the invalid result will be differentiated from
the valid (previously, valid results could have the bit either set or
clear).
This does not affect slot lookups which occur under lock -- they
can never return an invalid result. It is needed by the lockless
pagecache patches later in the series.
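Concretely, the encoding now looks as follows (a sketch using the
helpers this patch introduces; item and node are placeholders):

	/* height == 0: root->rnode holds the bare item pointer,
	 * low bit clear -- always a valid result for slot users */
	rcu_assign_pointer(root->rnode, item);

	/* height > 0: root->rnode points to the top radix_tree_node,
	 * low bit set to mark it as indirect */
	rcu_assign_pointer(root->rnode, radix_tree_ptr_to_indirect(node));

	/* readers strip the bit before descending */
	node = radix_tree_indirect_to_ptr(rcu_dereference(root->rnode));

A slot therefore always holds either NULL or a bare item pointer, and
radix_tree_deref_slot() returns RADIX_TREE_RETRY only when the slot has
since been reused for an indirect pointer, which lockless callers (in
later patches) treat as a signal to retry the lookup.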
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/radix-tree.h | 40 ++++++++++++++------------
lib/radix-tree.c | 69 ++++++++++++++++++++++++++++-----------------
2 files changed, 65 insertions(+), 44 deletions(-)
Index: linux-2.6-rt/include/linux/radix-tree.h
===================================================================
--- linux-2.6-rt.orig/include/linux/radix-tree.h 2006-11-29 14:20:39.000000000 +0100
+++ linux-2.6-rt/include/linux/radix-tree.h 2006-11-29 14:20:42.000000000 +0100
@@ -27,28 +27,31 @@
#include <linux/rcupdate.h>
/*
- * A direct pointer (root->rnode pointing directly to a data item,
- * rather than another radix_tree_node) is signalled by the low bit
- * set in the root->rnode pointer.
- *
- * In this case root->height is also zero, but the direct pointer tests are
- * needed for RCU lookups when root->height is unreliable.
+ * An indirect pointer (root->rnode pointing to a radix_tree_node, rather
+ * than a data item) is signalled by the low bit set in the root->rnode
+ * pointer.
+ *
+ * In this case root->height is > 0, but the indirect pointer tests are
+ * needed for RCU lookups (because root->height is unreliable). The only
+ * time callers need worry about this is when doing a lookup_slot under
+ * RCU.
*/
-#define RADIX_TREE_DIRECT_PTR 1
+#define RADIX_TREE_INDIRECT_PTR 1
+#define RADIX_TREE_RETRY ((void *)-1UL)
-static inline void *radix_tree_ptr_to_direct(void *ptr)
+static inline void *radix_tree_ptr_to_indirect(void *ptr)
{
- return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
+ return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
}
-static inline void *radix_tree_direct_to_ptr(void *ptr)
+static inline void *radix_tree_indirect_to_ptr(void *ptr)
{
- return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
+ return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
}
-static inline int radix_tree_is_direct_ptr(void *ptr)
+static inline int radix_tree_is_indirect_ptr(void *ptr)
{
- return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
+ return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
}
/*** radix-tree API starts here ***/
@@ -131,7 +134,10 @@ do { \
*/
static inline void *radix_tree_deref_slot(void **pslot)
{
- return radix_tree_direct_to_ptr(*pslot);
+ void *ret = *pslot;
+ if (unlikely(radix_tree_is_indirect_ptr(ret)))
+ ret = RADIX_TREE_RETRY;
+ return ret;
}
/**
* radix_tree_replace_slot - replace item in a slot
@@ -143,10 +149,8 @@ static inline void *radix_tree_deref_slo
*/
static inline void radix_tree_replace_slot(void **pslot, void *item)
{
- BUG_ON(radix_tree_is_direct_ptr(item));
- rcu_assign_pointer(*pslot,
- (void *)((unsigned long)item |
- ((unsigned long)*pslot & RADIX_TREE_DIRECT_PTR)));
+ BUG_ON(radix_tree_is_indirect_ptr(item));
+ rcu_assign_pointer(*pslot, item);
}
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
Index: linux-2.6-rt/lib/radix-tree.c
===================================================================
--- linux-2.6-rt.orig/lib/radix-tree.c 2006-11-29 14:20:39.000000000 +0100
+++ linux-2.6-rt/lib/radix-tree.c 2006-11-29 14:20:42.000000000 +0100
@@ -104,7 +104,7 @@ radix_tree_node_alloc(struct radix_tree_
rtp->nr--;
}
}
- BUG_ON(radix_tree_is_direct_ptr(ret));
+ BUG_ON(radix_tree_is_indirect_ptr(ret));
return ret;
}
@@ -242,7 +242,7 @@ static int radix_tree_extend(struct radi
return -ENOMEM;
/* Increase the height. */
- node->slots[0] = radix_tree_direct_to_ptr(root->rnode);
+ node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
/* Propagate the aggregated tag info into the new root */
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
@@ -253,6 +253,7 @@ static int radix_tree_extend(struct radi
newheight = root->height+1;
node->height = newheight;
node->count = 1;
+ node = radix_tree_ptr_to_indirect(node);
rcu_assign_pointer(root->rnode, node);
root->height = newheight;
} while (height > root->height);
@@ -276,7 +277,7 @@ int radix_tree_insert(struct radix_tree_
int offset;
int error;
- BUG_ON(radix_tree_is_direct_ptr(item));
+ BUG_ON(radix_tree_is_indirect_ptr(item));
/* Make sure the tree is high enough. */
if (index > radix_tree_maxindex(root->height)) {
@@ -285,7 +286,8 @@ int radix_tree_insert(struct radix_tree_
return error;
}
- slot = root->rnode;
+ slot = radix_tree_indirect_to_ptr(root->rnode);
+
height = root->height;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
@@ -300,7 +302,8 @@ int radix_tree_insert(struct radix_tree_
rcu_assign_pointer(node->slots[offset], slot);
node->count++;
} else
- rcu_assign_pointer(root->rnode, slot);
+ rcu_assign_pointer(root->rnode,
+ radix_tree_ptr_to_indirect(slot));
}
/* Go a level down */
@@ -320,7 +323,7 @@ int radix_tree_insert(struct radix_tree_
BUG_ON(tag_get(node, 0, offset));
BUG_ON(tag_get(node, 1, offset));
} else {
- rcu_assign_pointer(root->rnode, radix_tree_ptr_to_direct(item));
+ rcu_assign_pointer(root->rnode, item);
BUG_ON(root_tag_get(root, 0));
BUG_ON(root_tag_get(root, 1));
}
@@ -352,11 +355,12 @@ void **radix_tree_lookup_slot(struct rad
if (node == NULL)
return NULL;
- if (radix_tree_is_direct_ptr(node)) {
+ if (!radix_tree_is_indirect_ptr(node)) {
if (index > 0)
return NULL;
return (void **)&root->rnode;
}
+ node = radix_tree_indirect_to_ptr(node);
height = node->height;
if (index > radix_tree_maxindex(height))
@@ -400,11 +404,12 @@ void *radix_tree_lookup(struct radix_tre
if (node == NULL)
return NULL;
- if (radix_tree_is_direct_ptr(node)) {
+ if (!radix_tree_is_indirect_ptr(node)) {
if (index > 0)
return NULL;
- return radix_tree_direct_to_ptr(node);
+ return node;
}
+ node = radix_tree_indirect_to_ptr(node);
height = node->height;
if (index > radix_tree_maxindex(height))
@@ -449,7 +454,7 @@ void *radix_tree_tag_set(struct radix_tr
height = root->height;
BUG_ON(index > radix_tree_maxindex(height));
- slot = root->rnode;
+ slot = radix_tree_indirect_to_ptr(root->rnode);
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
while (height > 0) {
@@ -499,7 +504,7 @@ void *radix_tree_tag_clear(struct radix_
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
pathp->node = NULL;
- slot = root->rnode;
+ slot = radix_tree_indirect_to_ptr(root->rnode);
while (height > 0) {
int offset;
@@ -564,8 +569,9 @@ int radix_tree_tag_get(struct radix_tree
if (node == NULL)
return 0;
- if (radix_tree_is_direct_ptr(node))
+ if (!radix_tree_is_indirect_ptr(node))
return (index == 0);
+ node = radix_tree_indirect_to_ptr(node);
height = node->height;
if (index > radix_tree_maxindex(height))
@@ -682,13 +688,13 @@ radix_tree_gang_lookup(struct radix_tree
if (!node)
return 0;
- if (radix_tree_is_direct_ptr(node)) {
+ if (!radix_tree_is_indirect_ptr(node)) {
if (first_index > 0)
return 0;
- node = radix_tree_direct_to_ptr(node);
- results[0] = rcu_dereference(node);
+ results[0] = node;
return 1;
}
+ node = radix_tree_indirect_to_ptr(node);
max_index = radix_tree_maxindex(node->height);
@@ -810,13 +816,13 @@ radix_tree_gang_lookup_tag(struct radix_
if (!node)
return 0;
- if (radix_tree_is_direct_ptr(node)) {
+ if (!radix_tree_is_indirect_ptr(node)) {
if (first_index > 0)
return 0;
- node = radix_tree_direct_to_ptr(node);
- results[0] = rcu_dereference(node);
+ results[0] = node;
return 1;
}
+ node = radix_tree_indirect_to_ptr(node);
max_index = radix_tree_maxindex(node->height);
@@ -846,12 +852,22 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag
static inline void radix_tree_shrink(struct radix_tree_root *root)
{
/* try to shrink tree height */
- while (root->height > 0 &&
- root->rnode->count == 1 &&
- root->rnode->slots[0]) {
+ while (root->height > 0) {
struct radix_tree_node *to_free = root->rnode;
void *newptr;
+ BUG_ON(!radix_tree_is_indirect_ptr(to_free));
+ to_free = radix_tree_indirect_to_ptr(to_free);
+
+ /*
+ * If the candidate node has more than one child, or its child
+ * is not at the leftmost slot, we cannot shrink.
+ */
+ if (to_free->count != 1)
+ break;
+ if (!to_free->slots[0])
+ break;
+
/*
* We don't need rcu_assign_pointer(), since we are simply
* moving the node from one part of the tree to another. If
@@ -860,8 +876,8 @@ static inline void radix_tree_shrink(str
* one (root->rnode).
*/
newptr = to_free->slots[0];
- if (root->height == 1)
- newptr = radix_tree_ptr_to_direct(newptr);
+ if (root->height > 1)
+ newptr = radix_tree_ptr_to_indirect(newptr);
root->rnode = newptr;
root->height--;
/* must only free zeroed nodes into the slab */
@@ -896,12 +912,12 @@ void *radix_tree_delete(struct radix_tre
goto out;
slot = root->rnode;
- if (height == 0 && root->rnode) {
- slot = radix_tree_direct_to_ptr(slot);
+ if (height == 0 /* XXX: bugfix? */) {
root_tag_clear_all(root);
root->rnode = NULL;
goto out;
}
+ slot = radix_tree_indirect_to_ptr(slot);
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
pathp->node = NULL;
@@ -943,7 +959,8 @@ void *radix_tree_delete(struct radix_tre
radix_tree_node_free(to_free);
if (pathp->node->count) {
- if (pathp->node == root->rnode)
+ if (pathp->node ==
+ radix_tree_indirect_to_ptr(root->rnode))
radix_tree_shrink(root);
goto out;
}
* [PATCH 03/16] radix-tree: gang_lookup_slot
From: Nick Piggin @ 2006-12-07 16:18 UTC
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: radix-tree-gang_lookup_slot.patch --]
[-- Type: text/plain, Size: 6269 bytes --]
Introduce a gang_lookup_slot function which is used by lockless pagecache.
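A sketch of the intended lockless calling pattern (illustrative only;
the function and RADIX_TREE_RETRY come from this series, while the
surrounding loop is an assumed usage):

	void **slots[16];
	void *item;
	unsigned int i, nr;

	rcu_read_lock();
	nr = radix_tree_gang_lookup_slot(&root, slots, 0, 16);
	for (i = 0; i < nr; i++) {
		item = radix_tree_deref_slot(slots[i]);
		if (!item)
			continue;	/* deleted after the gang lookup */
		if (item == RADIX_TREE_RETRY)
			break;		/* tree changed shape: redo lookup */
		/* ... stabilise item (eg. take a reference) ... */
	}
	rcu_read_unlock();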
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/radix-tree.h | 7 +++
lib/radix-tree.c | 86 +++++++++++++++++++++++++++++++++++++++------
2 files changed, 82 insertions(+), 11 deletions(-)
Index: linux-2.6-rt/include/linux/radix-tree.h
===================================================================
--- linux-2.6-rt.orig/include/linux/radix-tree.h 2006-11-29 14:20:42.000000000 +0100
+++ linux-2.6-rt/include/linux/radix-tree.h 2006-11-29 14:20:45.000000000 +0100
@@ -100,12 +100,14 @@ do { \
*
* The notable exceptions to this rule are the following functions:
* radix_tree_lookup
+ * radix_tree_lookup_slot
* radix_tree_tag_get
* radix_tree_gang_lookup
+ * radix_tree_gang_lookup_slot
* radix_tree_gang_lookup_tag
* radix_tree_tagged
*
- * The first 4 functions are able to be called locklessly, using RCU. The
+ * The first 6 functions are able to be called locklessly, using RCU. The
* caller must ensure calls to these functions are made within rcu_read_lock()
* regions. Other readers (lock-free or otherwise) and modifications may be
* running concurrently.
@@ -160,6 +162,9 @@ void *radix_tree_delete(struct radix_tre
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items);
/*
* On a mutex based kernel we can freely schedule within the radix code:
*/
Index: linux-2.6-rt/lib/radix-tree.c
===================================================================
--- linux-2.6-rt.orig/lib/radix-tree.c 2006-11-29 14:20:42.000000000 +0100
+++ linux-2.6-rt/lib/radix-tree.c 2006-11-29 14:20:45.000000000 +0100
@@ -340,18 +340,17 @@ EXPORT_SYMBOL(radix_tree_insert);
* Returns: the slot corresponding to the position @index in the
* radix tree @root. This is useful for update-if-exists operations.
*
- * This function cannot be called under rcu_read_lock, it must be
- * excluded from writers, as must the returned slot for subsequent
- * use by radix_tree_deref_slot() and radix_tree_replace_slot().
- * Caller must hold tree write locked across slot lookup and
- * replace.
+ * This function can be called under rcu_read_lock iff the slot is not
+ * modified by radix_tree_replace_slot; otherwise it must exclude
+ * concurrent writers. Any dereference of the slot must be done
+ * using radix_tree_deref_slot.
*/
void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
unsigned int height, shift;
struct radix_tree_node *node, **slot;
- node = root->rnode;
+ node = rcu_dereference(root->rnode);
if (node == NULL)
return NULL;
@@ -371,7 +370,7 @@ void **radix_tree_lookup_slot(struct rad
do {
slot = (struct radix_tree_node **)
(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
- node = *slot;
+ node = rcu_dereference(*slot);
if (node == NULL)
return NULL;
@@ -608,7 +607,7 @@ EXPORT_SYMBOL(radix_tree_tag_get);
#endif
static unsigned int
-__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
unsigned int nr_found = 0;
@@ -646,7 +645,7 @@ __lookup(struct radix_tree_node *slot, v
index++;
node = slot->slots[i];
if (node) {
- results[nr_found++] = rcu_dereference(node);
+ results[nr_found++] = &(slot->slots[i]);
if (nr_found == max_items)
goto out;
}
@@ -700,6 +699,73 @@ radix_tree_gang_lookup(struct radix_tree
ret = 0;
while (ret < max_items) {
+ unsigned int nr_found, i, j;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ nr_found = __lookup(node, (void ***)results + ret, cur_index,
+ max_items - ret, &next_index);
+ for (i = j = 0; i < nr_found; i++) {
+ struct radix_tree_node *slot;
+ slot = rcu_dereference(*(((void ***)results)[ret + i]));
+ if (!slot)
+ continue;
+ results[ret + j] = slot;
+ j++;
+ }
+ ret += j;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ * radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ *
+ * Performs an index-ascending scan of the tree for present items. Places
+ * their slots at *@results and returns the number of items which were
+ * placed at *@results.
+ *
+ * The implementation is naive.
+ *
+ * Like radix_tree_gang_lookup as far as RCU and locking goes. Slots must
+ * be dereferenced with radix_tree_deref_slot, and if using only RCU
+ * protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items)
+{
+ unsigned long max_index;
+ struct radix_tree_node *node;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = (void **)&root->rnode;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
unsigned int nr_found;
unsigned long next_index; /* Index of next search */
@@ -715,7 +781,7 @@ radix_tree_gang_lookup(struct radix_tree
return ret;
}
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
/*
* FIXME: the two tag_get()s here should use find_next_bit() instead of
* [PATCH 04/16] radix-tree: gang_lookup_tag_slot
From: Peter Zijlstra @ 2006-12-07 16:18 UTC
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: radix-tree-gang_lookup_tag_slot.patch --]
[-- Type: text/plain, Size: 5063 bytes --]
Simple implementation of radix_tree_gang_lookup_tag_slot()
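Usage mirrors radix_tree_gang_lookup_slot(), with the tag to scan for
as the extra argument (an illustrative call; tag index 0 is just an
example, and each returned slot must again be dereferenced with
radix_tree_deref_slot()):

	void **slots[16];
	unsigned int nr;

	rcu_read_lock();
	/* collect slots of present items that have tag 0 set */
	nr = radix_tree_gang_lookup_tag_slot(&root, slots, 0, 16, 0);
	/* ... deref each slot, skipping NULL / RADIX_TREE_RETRY ... */
	rcu_read_unlock();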
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/radix-tree.h | 5 ++
lib/radix-tree.c | 81 ++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 82 insertions(+), 4 deletions(-)
Index: linux-2.6-rt/include/linux/radix-tree.h
===================================================================
--- linux-2.6-rt.orig/include/linux/radix-tree.h 2006-11-30 13:47:17.000000000 +0100
+++ linux-2.6-rt/include/linux/radix-tree.h 2006-11-30 16:56:44.000000000 +0100
@@ -105,6 +105,7 @@ do { \
* radix_tree_gang_lookup
* radix_tree_gang_lookup_slot
* radix_tree_gang_lookup_tag
+ * radix_tree_gang_lookup_tag_slot
* radix_tree_tagged
*
* The first 6 functions are able to be called locklessly, using RCU. The
@@ -188,6 +189,10 @@ unsigned int
radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items,
unsigned int tag);
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items,
+ unsigned int tag);
int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
static inline void radix_tree_preload_end(void)
Index: linux-2.6-rt/lib/radix-tree.c
===================================================================
--- linux-2.6-rt.orig/lib/radix-tree.c 2006-11-30 13:47:16.000000000 +0100
+++ linux-2.6-rt/lib/radix-tree.c 2006-11-30 16:55:57.000000000 +0100
@@ -788,7 +788,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_slo
* open-coding the search.
*/
static unsigned int
-__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index, unsigned int tag)
{
unsigned int nr_found = 0;
@@ -834,8 +834,7 @@ __lookup_tag(struct radix_tree_node *slo
* rely on its value remaining the same).
*/
if (node) {
- node = rcu_dereference(node);
- results[nr_found++] = node;
+ results[nr_found++] = &(slot->slots[j]);
if (nr_found == max_items)
goto out;
}
@@ -894,6 +893,80 @@ radix_tree_gang_lookup_tag(struct radix_
ret = 0;
while (ret < max_items) {
+ unsigned int nr_found, i, j;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ nr_found = __lookup_tag(node, (void ***)results + ret, cur_index,
+ max_items - ret, &next_index, tag);
+ for (i = j = 0; i < nr_found; i++) {
+ struct radix_tree_node *slot;
+ slot = rcu_dereference(*(((void ***)results)[ret + i]));
+ if (!slot)
+ continue;
+ results[ret + j] = slot;
+ j++;
+ }
+ ret += j;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
+
+/**
+ * radix_tree_gang_lookup_tag_slot - perform multiple slot lookup on a
+ * radix tree based on a tag
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ * @tag: the tag index (< RADIX_TREE_MAX_TAGS)
+ *
+ * Performs an index-ascending scan of the tree for present items which
+ * have the tag indexed by @tag set. Places their slots at *@results and
+ * returns the number of items which were placed at *@results.
+ *
+ * The implementation is naive.
+ *
+ * Like radix_tree_gang_lookup_tag as far as RCU and locking goes. Slots
+ * must be dereferenced with radix_tree_deref_slot, and if using only RCU
+ * protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items,
+ unsigned int tag)
+{
+ struct radix_tree_node *node;
+ unsigned long max_index;
+ unsigned long cur_index = first_index;
+ unsigned int ret;
+
+ /* check the root's tag bit */
+ if (!root_tag_get(root, tag))
+ return 0;
+
+ node = rcu_dereference(root->rnode);
+ if (!node)
+ return 0;
+
+ if (!radix_tree_is_indirect_ptr(node)) {
+ if (first_index > 0)
+ return 0;
+ results[0] = node;
+ return 1;
+ }
+ node = radix_tree_indirect_to_ptr(node);
+
+ max_index = radix_tree_maxindex(node->height);
+
+ ret = 0;
+ while (ret < max_items) {
unsigned int nr_found;
unsigned long next_index; /* Index of next search */
@@ -909,7 +982,7 @@ radix_tree_gang_lookup_tag(struct radix_
return ret;
}
-EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
+EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
/**
* radix_tree_shrink - shrink height of a radix tree to minimal
* [PATCH 05/16] mm: speculative get page
From: Nick Piggin @ 2006-12-07 16:18 UTC
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: mm-speculative-get-page.patch --]
[-- Type: text/plain, Size: 11501 bytes --]
If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.
This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).
Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.
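The primitive this builds on is a conditional refcount increment;
roughly (a sketch of get_page_unless_zero() as used by this patch,
with debug checks omitted):

	static inline int get_page_unless_zero(struct page *page)
	{
		/* take a reference only if the page is not free */
		return atomic_inc_not_zero(&page->_count);
	}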
This patch introduces the core locking protocol to the pagecache
(ie. adds page_cache_get_speculative, and tweaks some update-side
code to make it work).
[Hugh notes that PG_nonewrefs might be dispensed with entirely
if the current SetPageNoNewRefs sites instead atomically saved the
page count and temporarily set it to zero.
This is a nice idea, and simplifies find_get_page very much, but
cannot be applied to all current SetPageNoNewRefs sites. Need to
verify that add_to_page_cache and add_to_swap_cache can cope
without it or make do some other way.
In the meantime, this version is a slightly more mechanical
replacement.]
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/page-flags.h | 8 +++
include/linux/pagemap.h | 105 +++++++++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 4 +
mm/migrate.c | 9 +++
mm/swap_state.c | 4 +
mm/vmscan.c | 12 +++--
6 files changed, 136 insertions(+), 6 deletions(-)
Index: linux-2.6-rt/include/linux/page-flags.h
===================================================================
--- linux-2.6-rt.orig/include/linux/page-flags.h 2006-11-29 14:20:36.000000000 +0100
+++ linux-2.6-rt/include/linux/page-flags.h 2006-11-29 14:20:48.000000000 +0100
@@ -91,7 +91,8 @@
#define PG_nosave_free 18 /* Used for system suspend/resume */
#define PG_buddy 19 /* Page is free, on buddy lists */
-
+#define PG_nonewrefs 20 /* Block concurrent pagecache lookups
+ * while testing refcount */
#if (BITS_PER_LONG > 32)
/*
* 64-bit-only flags build down from bit 31
@@ -251,6 +252,11 @@ static inline void SetPageUptodate(struc
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageNoNewRefs(page) test_bit(PG_nonewrefs, &(page)->flags)
+#define SetPageNoNewRefs(page) set_bit(PG_nonewrefs, &(page)->flags)
+#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
+#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
Index: linux-2.6-rt/include/linux/pagemap.h
===================================================================
--- linux-2.6-rt.orig/include/linux/pagemap.h 2006-11-29 14:20:36.000000000 +0100
+++ linux-2.6-rt/include/linux/pagemap.h 2006-11-29 14:20:48.000000000 +0100
@@ -11,6 +11,8 @@
#include <linux/compiler.h>
#include <asm/uaccess.h>
#include <linux/gfp.h>
+#include <linux/page-flags.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +53,109 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be run in the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree: this allows
+ * allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with (because put_page is what is used to drop an
+ * invalid speculative reference).
+ *
+ * After incrementing the refcount, this function spins until PageNoNewRefs
+ * is clear, then a read memory barrier is issued.
+ *
+ * This forms the core of the lockless pagecache locking protocol, where
+ * the lookup-side (eg. find_get_page) has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. wait for PageNoNewRefs
+ * 4. check the page is still in pagecache
+ *
+ * Remove-side (that cares about _count, eg. reclaim) has the following:
+ * A. SetPageNoNewRefs
+ * B. check refcount is correct
+ * C. remove page
+ * D. ClearPageNoNewRefs
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before B: in this case, B sees elevated refcount and bails out
+ * - B runs before 2: in this case, 3 ensures 4 will not run until *after* C
+ * (after D, even). In which case, 4 will notice C and lookup side can retry
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * the write-side, depending on timing.
+ *
+ * Pagecache insertion isn't a big problem: either 1 will find the page or
+ * it will not. Likewise, the old find_get_page could run either before the
+ * insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
+#else
+ if (unlikely(!get_page_unless_zero(page)))
+ return 0; /* page has been freed */
+
+ /*
+ * Note that get_page_unless_zero provides a memory barrier.
+ * This is needed to ensure PageNoNewRefs is evaluated after the
+ * page refcount has been raised. See below comment.
+ */
+
+ while (unlikely(PageNoNewRefs(page)))
+ cpu_relax();
+
+ /*
+ * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
+ * is performed before a future load used to ensure the page is
+ * the correct one (usually: page->mapping and page->index).
+ *
+ * Those places that set PageNoNewRefs have the following pattern:
+ * SetPageNoNewRefs(page)
+ * wmb();
+ * if (page_count(page) == X)
+ * remove page from pagecache
+ * wmb();
+ * ClearPageNoNewRefs(page)
+ *
+ * If the load was out of order, page->mapping might be loaded before
+ * the page is removed from pagecache but PageNoNewRefs evaluated
+ * after the ClearPageNoNewRefs().
+ */
+ smp_rmb();
+
+#endif
+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+ return 1;
+}
+
#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
#else
Index: linux-2.6-rt/mm/vmscan.c
===================================================================
--- linux-2.6-rt.orig/mm/vmscan.c 2006-11-29 14:20:36.000000000 +0100
+++ linux-2.6-rt/mm/vmscan.c 2006-11-29 14:20:48.000000000 +0100
@@ -390,6 +390,8 @@ int remove_mapping(struct address_space
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
@@ -427,17 +429,21 @@ int remove_mapping(struct address_space
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
- __put_page(page); /* The pagecache ref */
- return 1;
+ goto free_it;
}
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
+
+free_it:
+ smp_wmb();
+ __ClearPageNoNewRefs(page);
+ __put_page(page); /* The pagecache ref */
return 1;
cannot_free:
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
return 0;
}
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-11-29 14:20:36.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-11-29 14:20:48.000000000 +0100
@@ -440,6 +440,8 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
@@ -451,6 +453,8 @@ int add_to_page_cache(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
+ smp_wmb();
+ ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6-rt/mm/swap_state.c
===================================================================
--- linux-2.6-rt.orig/mm/swap_state.c 2006-11-29 14:20:36.000000000 +0100
+++ linux-2.6-rt/mm/swap_state.c 2006-11-29 14:20:48.000000000 +0100
@@ -78,6 +78,8 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
@@ -90,6 +92,8 @@ static int __add_to_swap_cache(struct pa
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&swapper_space.tree_lock);
+ smp_wmb();
+ ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6-rt/mm/migrate.c
===================================================================
--- linux-2.6-rt.orig/mm/migrate.c 2006-11-29 14:20:39.000000000 +0100
+++ linux-2.6-rt/mm/migrate.c 2006-11-29 14:20:48.000000000 +0100
@@ -303,6 +303,8 @@ static int migrate_page_move_mapping(str
return 0;
}
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
@@ -311,6 +313,7 @@ static int migrate_page_move_mapping(str
if (page_count(page) != 2 + !!PagePrivate(page) ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
return -EAGAIN;
}
@@ -326,6 +329,10 @@ static int migrate_page_move_mapping(str
#endif
radix_tree_replace_slot(pslot, newpage);
+ page->mapping = NULL;
+ write_unlock_irq(&mapping->tree_lock);
+ smp_wmb();
+ ClearPageNoNewRefs(page);
/*
* Drop cache reference from old page.
@@ -333,8 +340,6 @@ static int migrate_page_move_mapping(str
*/
__put_page(page);
- write_unlock_irq(&mapping->tree_lock);
-
return 0;
}
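Taken together, the hunks above all follow one handshake between the pagecache
write side and the speculative get. A condensed sketch of both halves (error
paths and tree details elided; the expected page count depends on the caller):

	/* write side, cf. remove_mapping() above */
	SetPageNoNewRefs(page);
	smp_wmb();                      /* flag visible before the count test */
	write_lock_irq(&mapping->tree_lock);
	if (page_count(page) == 2) {    /* caller + pagecache, nobody else */
		__remove_from_page_cache(page);
		write_unlock_irq(&mapping->tree_lock);
		smp_wmb();              /* removal visible before the clear */
		ClearPageNoNewRefs(page);
		__put_page(page);       /* the pagecache ref */
	} else {
		write_unlock_irq(&mapping->tree_lock);
		ClearPageNoNewRefs(page);
	}

	/* read side, page_cache_get_speculative() in outline */
	if (!get_page_unless_zero(page))        /* acts as a full barrier */
		return 0;                       /* page was already freed */
	while (PageNoNewRefs(page))             /* a writer is deciding */
		cpu_relax();
	smp_rmb();              /* flags before page->mapping/page->index */
	/* the caller re-checks that this is still the page it wanted */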
* [PATCH 06/16] mm: lockless pagecache lookups
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (4 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 05/16] mm: speculative get page Nick Piggin
@ 2006-12-07 16:18 ` Nick Piggin
2006-12-07 16:18 ` [PATCH 07/16] mm: fix speculative page get preemption bug Peter Zijlstra
` (10 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: mm-lockless-pagecache-lookups.patch --]
[-- Type: text/plain, Size: 7831 bytes --]
Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on
the read-side).
The only atomicity change this introduces is that the gang pagecache
lookup functions now behave as if they were implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages.
In practice, this atomicity guarantee is not used anyway, and it is
difficult to see how it could be. Gang pagecache lookups are designed
to replace individual lookups, so these semantics are natural.
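All converted lookups share a single retry pattern; condensed from the
find_get_page() conversion below (radix_tree_deref_slot() and
RADIX_TREE_RETRY come from the earlier radix-tree patches):

	rcu_read_lock();
repeat:
	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
	page = pagep ? radix_tree_deref_slot(pagep) : NULL;
	if (pagep && (!page || page == RADIX_TREE_RETRY))
		goto repeat;            /* slot is in flux, look again */
	if (page) {
		if (!page_cache_get_speculative(page))
			goto repeat;    /* raced with the final put */
		if (page != *pagep) {   /* moved after we pinned it? */
			page_cache_release(page);
			goto repeat;
		}
	}
	rcu_read_unlock();              /* page is NULL or safely pinned */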
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/filemap.c | 133 +++++++++++++++++++++++++++++++++++++---------------
mm/page-writeback.c | 8 +--
mm/readahead.c | 6 --
3 files changed, 102 insertions(+), 45 deletions(-)
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-11-29 14:20:52.000000000 +0100
@@ -596,15 +596,31 @@ void fastcall __lock_page_nosync(struct
* Is there a pagecache struct page at the given (mapping, offset) tuple?
* If yes, increment its refcount and return it; if no, return NULL.
*/
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page *find_get_page(struct address_space *mapping, unsigned long offset)
{
+ void **pagep;
struct page *page;
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+repeat:
+ page = NULL;
+ pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
+ if (pagep) {
+ page = radix_tree_deref_slot(pagep);
+ if (unlikely(!page || page == RADIX_TREE_RETRY))
+ goto repeat;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *pagep)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+ }
+ rcu_read_unlock();
+
return page;
}
EXPORT_SYMBOL(find_get_page);
@@ -644,26 +660,19 @@ struct page *find_lock_page(struct addre
{
struct page *page;
- read_lock_irq(&mapping->tree_lock);
repeat:
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = find_get_page(mapping, offset);
if (page) {
- page_cache_get(page);
- if (TestSetPageLocked(page)) {
- read_unlock_irq(&mapping->tree_lock);
- __lock_page(page);
- read_lock_irq(&mapping->tree_lock);
-
- /* Has the page been truncated while we slept? */
- if (unlikely(page->mapping != mapping ||
- page->index != offset)) {
- unlock_page(page);
- page_cache_release(page);
- goto repeat;
- }
+ lock_page(page);
+ /* Has the page been truncated? */
+ if (unlikely(page->mapping != mapping
+ || page->index != offset)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
}
}
- read_unlock_irq(&mapping->tree_lock);
+
return page;
}
EXPORT_SYMBOL(find_lock_page);
@@ -733,13 +742,39 @@ unsigned find_get_pages(struct address_s
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, start, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non-issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
return ret;
}
@@ -760,19 +795,44 @@ unsigned find_get_pages_contig(struct ad
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
- read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup(&mapping->page_tree,
- (void **)pages, index, nr_pages);
- for (i = 0; i < ret; i++) {
- if (pages[i]->mapping == NULL || pages[i]->index != index)
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, index, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void **)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non-issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (page->mapping == NULL || page->index != index)
break;
- page_cache_get(pages[i]);
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
index++;
}
- read_unlock_irq(&mapping->tree_lock);
- return i;
+ rcu_read_unlock();
+ return ret;
}
/**
@@ -793,6 +853,7 @@ unsigned find_get_pages_tag(struct addre
unsigned int ret;
read_lock_irq(&mapping->tree_lock);
+ /* TODO: implement lookup_tag_slot and make this lockless */
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
for (i = 0; i < ret; i++)
Index: linux-2.6-rt/mm/readahead.c
===================================================================
--- linux-2.6-rt.orig/mm/readahead.c 2006-11-29 14:20:36.000000000 +0100
+++ linux-2.6-rt/mm/readahead.c 2006-11-29 14:20:52.000000000 +0100
@@ -285,27 +285,25 @@ __do_page_cache_readahead(struct address
/*
* Preallocate as many pages as we will need.
*/
- read_lock_irq(&mapping->tree_lock);
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
pgoff_t page_offset = offset + page_idx;
if (page_offset > end_index)
break;
+ rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
+ rcu_read_unlock();
if (page)
continue;
- read_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
- read_lock_irq(&mapping->tree_lock);
if (!page)
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
ret++;
}
- read_unlock_irq(&mapping->tree_lock);
/*
* Now start the IO. We ignore I/O errors - if the page is not
Index: linux-2.6-rt/mm/page-writeback.c
===================================================================
--- linux-2.6-rt.orig/mm/page-writeback.c 2006-11-29 14:20:36.000000000 +0100
+++ linux-2.6-rt/mm/page-writeback.c 2006-11-29 14:20:52.000000000 +0100
@@ -956,17 +956,15 @@ int test_set_page_writeback(struct page
EXPORT_SYMBOL(test_set_page_writeback);
/*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
* passed tag.
*/
int mapping_tagged(struct address_space *mapping, int tag)
{
- unsigned long flags;
int ret;
-
- read_lock_irqsave(&mapping->tree_lock, flags);
+ rcu_read_lock();
ret = radix_tree_tagged(&mapping->page_tree, tag);
- read_unlock_irqrestore(&mapping->tree_lock, flags);
+ rcu_read_unlock();
return ret;
}
EXPORT_SYMBOL(mapping_tagged);
* [PATCH 07/16] mm: fix speculative page get preemption bug
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (5 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 06/16] mm: lockless pagecache lookups Nick Piggin
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 08/16] mm: speculative page get for PREEMPT_RT Peter Zijlstra
` (9 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: mm-lockless-preempt-fixup.patch --]
[-- Type: text/plain, Size: 5736 bytes --]
Livelock scenario pointed out by Nick:

	SetPageNoNewRefs(page);

	*** preempted here ***

	page_cache_get_speculative() {
		while (PageNoNewRefs(page)) /* livelock */
	}
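Because the spinning reader never sleeps, a writer preempted inside its
PG_nonewrefs window may never run again on that CPU. The fix, in outline,
is to make the window non-preemptible (full helpers in the patch below):

	static inline void set_page_no_new_refs(struct page *page)
	{
		preempt_disable();      /* writer cannot be scheduled away */
		SetPageNoNewRefs(page);
		smp_wmb();
	}

	static inline void end_page_no_new_refs(struct page *page)
	{
		smp_wmb();
		ClearPageNoNewRefs(page);
		preempt_enable();       /* spinning readers run again */
	}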
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/pagemap.h | 25 +++++++++++++++++++++++--
mm/filemap.c | 6 ++----
mm/migrate.c | 8 +++-----
mm/swap_state.c | 6 ++----
mm/vmscan.c | 8 +++-----
5 files changed, 33 insertions(+), 20 deletions(-)
Index: linux-2.6-rt/include/linux/pagemap.h
===================================================================
--- linux-2.6-rt.orig/include/linux/pagemap.h 2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/include/linux/pagemap.h 2006-11-29 14:20:55.000000000 +0100
@@ -53,6 +53,28 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+static inline void set_page_no_new_refs(struct page *page)
+{
+ VM_BUG_ON(PageNoNewRefs(page));
+ preempt_disable();
+ SetPageNoNewRefs(page);
+ smp_wmb();
+}
+
+static inline void end_page_no_new_refs(struct page *page)
+{
+ VM_BUG_ON(!PageNoNewRefs(page));
+ smp_wmb();
+ ClearPageNoNewRefs(page);
+ preempt_enable();
+}
+
+static inline void wait_on_new_refs(struct page *page)
+{
+ while (unlikely(PageNoNewRefs(page)))
+ cpu_relax();
+}
+
/*
* speculatively take a reference to a page.
* If the page is free (_count == 0), then _count is untouched, and 0
@@ -128,8 +150,7 @@ static inline int page_cache_get_specula
* page refcount has been raised. See below comment.
*/
- while (unlikely(PageNoNewRefs(page)))
- cpu_relax();
+ wait_on_new_refs(page);
/*
* smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-11-29 14:20:52.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-11-29 14:20:55.000000000 +0100
@@ -440,8 +440,7 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
- SetPageNoNewRefs(page);
- smp_wmb();
+ set_page_no_new_refs(page);
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
@@ -453,8 +452,7 @@ int add_to_page_cache(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
- smp_wmb();
- ClearPageNoNewRefs(page);
+ end_page_no_new_refs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6-rt/mm/migrate.c
===================================================================
--- linux-2.6-rt.orig/mm/migrate.c 2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/mm/migrate.c 2006-11-29 14:20:55.000000000 +0100
@@ -303,8 +303,7 @@ static int migrate_page_move_mapping(str
return 0;
}
- SetPageNoNewRefs(page);
- smp_wmb();
+ set_page_no_new_refs(page);
write_lock_irq(&mapping->tree_lock);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
@@ -313,7 +312,7 @@ static int migrate_page_move_mapping(str
if (page_count(page) != 2 + !!PagePrivate(page) ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
write_unlock_irq(&mapping->tree_lock);
- ClearPageNoNewRefs(page);
+ end_page_no_new_refs(page);
return -EAGAIN;
}
@@ -331,8 +330,7 @@ static int migrate_page_move_mapping(str
radix_tree_replace_slot(pslot, newpage);
page->mapping = NULL;
write_unlock_irq(&mapping->tree_lock);
- smp_wmb();
- ClearPageNoNewRefs(page);
+ end_page_no_new_refs(page);
/*
* Drop cache reference from old page.
Index: linux-2.6-rt/mm/swap_state.c
===================================================================
--- linux-2.6-rt.orig/mm/swap_state.c 2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/mm/swap_state.c 2006-11-29 14:20:55.000000000 +0100
@@ -78,8 +78,7 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
- SetPageNoNewRefs(page);
- smp_wmb();
+ set_page_no_new_refs(page);
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
@@ -92,8 +91,7 @@ static int __add_to_swap_cache(struct pa
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&swapper_space.tree_lock);
- smp_wmb();
- ClearPageNoNewRefs(page);
+ end_page_no_new_refs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6-rt/mm/vmscan.c
===================================================================
--- linux-2.6-rt.orig/mm/vmscan.c 2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/mm/vmscan.c 2006-11-29 14:20:55.000000000 +0100
@@ -390,8 +390,7 @@ int remove_mapping(struct address_space
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
- SetPageNoNewRefs(page);
- smp_wmb();
+ set_page_no_new_refs(page);
write_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
@@ -436,14 +435,13 @@ int remove_mapping(struct address_space
write_unlock_irq(&mapping->tree_lock);
free_it:
- smp_wmb();
- __ClearPageNoNewRefs(page);
+ end_page_no_new_refs(page);
__put_page(page); /* The pagecache ref */
return 1;
cannot_free:
write_unlock_irq(&mapping->tree_lock);
- ClearPageNoNewRefs(page);
+ end_page_no_new_refs(page);
return 0;
}
* [PATCH 08/16] mm: speculative page get for PREEMPT_RT
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (6 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 07/16] mm: fix speculative page get preemption bug Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 09/16] mm: speculative find_get_pages_tag Peter Zijlstra
` (8 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: mm-lockless-preempt-rt-fixup.patch --]
[-- Type: text/plain, Size: 4403 bytes --]
Since most of the locks are sleeping locks with PREEMPT_RT, provide
a sleeping implementation of wait_on_new_refs(). This also solves
the preempt livelock, so the preempt_disable()/preempt_enable() pair
can be removed from the PG_nonewrefs functions.
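The wake side must order the bit clear against the waitqueue scan, just
like the PG_locked wakeup code this borrows from; the pairing, in outline:

	/* waker, in end_page_no_new_refs() */
	ClearPageNoNewRefs(page);
	smp_mb__after_clear_bit();      /* clear visible before the scan */
	__wake_up_bit(page_waitqueue(page), &page->flags, PG_nonewrefs);

	/* waiter, in wait_on_new_refs(): __wait_on_bit() re-tests the bit
	 * after queueing itself, so a wakeup arriving between the
	 * PageNoNewRefs() test and the sleep cannot be lost. */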
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/pagemap.h | 51 +++++++++++++++++++++++++++++++++++++++++++++++-
mm/filemap.c | 17 ++--------------
2 files changed, 53 insertions(+), 15 deletions(-)
Index: linux-2.6-rt/include/linux/pagemap.h
===================================================================
--- linux-2.6-rt.orig/include/linux/pagemap.h 2006-11-29 14:20:55.000000000 +0100
+++ linux-2.6-rt/include/linux/pagemap.h 2006-11-29 14:20:58.000000000 +0100
@@ -13,6 +13,8 @@
#include <linux/gfp.h>
#include <linux/page-flags.h>
#include <linux/hardirq.h> /* for in_interrupt() */
+#include <linux/wait.h>
+#include <linux/hash.h>
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -53,6 +55,26 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+/*
+ * In order to wait for pages to become available there must be
+ * waitqueues associated with pages. By using a hash table of
+ * waitqueues where the bucket discipline is to maintain all
+ * waiters on the same queue and wake all when any of the pages
+ * become available, and for the woken contexts to check to be
+ * sure the appropriate page became available, this saves space
+ * at a cost of "thundering herd" phenomena during rare hash
+ * collisions.
+ */
+static inline wait_queue_head_t *page_waitqueue(struct page *page)
+{
+ const struct zone *zone = page_zone(page);
+
+ return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
+}
+
+extern int __sleep_on_page(void *);
+
+#ifndef CONFIG_PREEMPT_RT
static inline void set_page_no_new_refs(struct page *page)
{
VM_BUG_ON(PageNoNewRefs(page));
@@ -74,6 +96,33 @@ static inline void wait_on_new_refs(stru
while (unlikely(PageNoNewRefs(page)))
cpu_relax();
}
+#else
+static inline void set_page_no_new_refs(struct page *page)
+{
+ VM_BUG_ON(PageNoNewRefs(page));
+ SetPageNoNewRefs(page);
+ smp_wmb();
+}
+
+static inline void end_page_no_new_refs(struct page *page)
+{
+ VM_BUG_ON(!PageNoNewRefs(page));
+ smp_wmb();
+ ClearPageNoNewRefs(page);
+ smp_mb__after_clear_bit();
+ __wake_up_bit(page_waitqueue(page), &page->flags, PG_nonewrefs);
+}
+
+static inline void wait_on_new_refs(struct page *page)
+{
+ might_sleep();
+ if (unlikely(PageNoNewRefs(page))) {
+ DEFINE_WAIT_BIT(wait, &page->flags, PG_nonewrefs);
+ __wait_on_bit(page_waitqueue(page), &wait, __sleep_on_page,
+ TASK_UNINTERRUPTIBLE);
+ }
+}
+#endif
/*
* speculatively take a reference to a page.
@@ -124,7 +173,7 @@ static inline int page_cache_get_specula
{
VM_BUG_ON(in_interrupt());
-#ifndef CONFIG_SMP
+#if !defined(CONFIG_SMP) && !defined(CONFIG_PREEMPT_RT)
# ifdef CONFIG_PREEMPT
VM_BUG_ON(!in_atomic());
# endif
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-11-29 14:20:55.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-11-29 14:20:58.000000000 +0100
@@ -486,21 +486,10 @@ static int __sleep_on_page_lock(void *wo
return 0;
}
-/*
- * In order to wait for pages to become available there must be
- * waitqueues associated with pages. By using a hash table of
- * waitqueues where the bucket discipline is to maintain all
- * waiters on the same queue and wake all when any of the pages
- * become available, and for the woken contexts to check to be
- * sure the appropriate page became available, this saves space
- * at a cost of "thundering herd" phenomena during rare hash
- * collisions.
- */
-static wait_queue_head_t *page_waitqueue(struct page *page)
+int __sleep_on_page(void *word)
{
- const struct zone *zone = page_zone(page);
-
- return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
+ schedule();
+ return 0;
}
static inline void wake_up_page(struct page *page, int bit)
* [PATCH 09/16] mm: speculative find_get_pages_tag
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (7 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 08/16] mm: speculative page get for PREEMPT_RT Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 10/16] mm: remove find_trylock_page Peter Zijlstra
` (7 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: mm-lockless-find_get_pages_tag.patch --]
[-- Type: text/plain, Size: 1904 bytes --]
Implement the speculative find_get_pages_tag().
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/filemap.c | 42 +++++++++++++++++++++++++++++++++---------
1 file changed, 33 insertions(+), 9 deletions(-)
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-11-30 16:59:17.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-11-30 17:45:12.000000000 +0100
@@ -838,16 +838,40 @@ unsigned find_get_pages_tag(struct addre
{
unsigned int i;
unsigned int ret;
+ unsigned int nr_found;
- read_lock_irq(&mapping->tree_lock);
- /* TODO: implement lookup_tag_slot and make this lockless */
- ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
- (void **)pages, *index, nr_pages, tag);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
- if (ret)
- *index = pages[ret - 1]->index + 1;
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+restart:
+ nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree,
+ (void ***)pages, *index, nr_pages, tag);
+
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot((void**)pages[i]);
+ if (unlikely(!page))
+ continue;
+ /*
+ * this can only trigger if nr_found == 1, making livelock
+ * a non-issue.
+ */
+ if (unlikely(page == RADIX_TREE_RETRY))
+ goto restart;
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *((void **)pages[i]))) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ pages[ret] = page;
+ ret++;
+ }
+ rcu_read_unlock();
+
+ /* advance *index past the last page, as the locked version did */
+ if (ret)
+ *index = pages[ret - 1]->index + 1;
return ret;
}
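Callers are unchanged by the conversion; a typical sweep over dirty pages
still looks like this (a sketch assuming the existing prototype; the pagevec
machinery real callers use is elided):

	pgoff_t index = 0;
	struct page *pages[16];
	unsigned int nr, i;

	while ((nr = find_get_pages_tag(mapping, &index,
					PAGECACHE_TAG_DIRTY, 16, pages))) {
		for (i = 0; i < nr; i++) {
			/* each page carries a speculative reference */
			page_cache_release(pages[i]);
		}
	}

Since *index is advanced past the last page returned, the loop terminates
once the tagged range is exhausted.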
* [PATCH 10/16] mm: remove find_trylock_page
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (8 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 09/16] mm: speculative find_get_pages_tag Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 11/16] mm: change tree_lock into a spinlock Peter Zijlstra
` (6 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: kill-find_trylock_page.patch --]
[-- Type: text/plain, Size: 2272 bytes --]
It's the last read_lock user of tree_lock, and since it's unused, remove
it rather than convert it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/pagemap.h | 2 --
mm/filemap.c | 20 --------------------
2 files changed, 22 deletions(-)
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-11-30 17:24:49.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-11-30 17:25:29.000000000 +0100
@@ -613,26 +613,6 @@ repeat:
EXPORT_SYMBOL(find_get_page);
/**
- * find_trylock_page - find and lock a page
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Same as find_get_page(), but trylock it instead of incrementing the count.
- */
-struct page *find_trylock_page(struct address_space *mapping, unsigned long offset)
-{
- struct page *page;
-
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page && TestSetPageLocked(page))
- page = NULL;
- read_unlock_irq(&mapping->tree_lock);
- return page;
-}
-EXPORT_SYMBOL(find_trylock_page);
-
-/**
* find_lock_page - locate, pin and lock a pagecache page
* @mapping: the address_space to search
* @offset: the page index
Index: linux-2.6-rt/include/linux/pagemap.h
===================================================================
--- linux-2.6-rt.orig/include/linux/pagemap.h 2006-11-30 17:24:59.000000000 +0100
+++ linux-2.6-rt/include/linux/pagemap.h 2006-11-30 17:27:38.000000000 +0100
@@ -251,8 +251,6 @@ extern struct page * find_get_page(struc
unsigned long index);
extern struct page * find_lock_page(struct address_space *mapping,
unsigned long index);
-extern __deprecated_for_modules struct page * find_trylock_page(
- struct address_space *mapping, unsigned long index);
extern struct page * find_or_create_page(struct address_space *mapping,
unsigned long index, gfp_t gfp_mask);
unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
* [PATCH 11/16] mm: change tree_lock into a spinlock
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (9 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 10/16] mm: remove find_trylock_page Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 12/16] radix-tree: concurrent write side support Peter Zijlstra
` (5 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: tree_lock-to-spinlock.patch --]
[-- Type: text/plain, Size: 14238 bytes --]
With all the read_lock uses of the tree_lock gone, change it into
a spinlock.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/buffer.c | 4 ++--
fs/inode.c | 2 +-
include/asm-arm/cacheflush.h | 4 ++--
include/asm-parisc/cacheflush.h | 4 ++--
include/linux/fs.h | 2 +-
mm/filemap.c | 8 ++++----
mm/migrate.c | 6 +++---
mm/page-writeback.c | 18 +++++++++---------
mm/swap_state.c | 10 +++++-----
mm/swapfile.c | 4 ++--
mm/truncate.c | 6 +++---
mm/vmscan.c | 8 ++++----
12 files changed, 38 insertions(+), 38 deletions(-)
Index: linux-2.6-rt/fs/buffer.c
===================================================================
--- linux-2.6-rt.orig/fs/buffer.c 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/fs/buffer.c 2006-11-30 17:45:17.000000000 +0100
@@ -719,7 +719,7 @@ int __set_page_dirty_buffers(struct page
spin_unlock(&mapping->private_lock);
if (!TestSetPageDirty(page)) {
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping))
__inc_zone_page_state(page, NR_FILE_DIRTY);
@@ -727,7 +727,7 @@ int __set_page_dirty_buffers(struct page
page_index(page),
PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return 1;
}
Index: linux-2.6-rt/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6-rt.orig/include/asm-arm/cacheflush.h 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/include/asm-arm/cacheflush.h 2006-11-30 17:45:17.000000000 +0100
@@ -354,9 +354,9 @@ extern void flush_ptrace_access(struct v
extern void flush_dcache_page(struct page *);
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_user_range(vma,page,addr,len) \
flush_dcache_page(page)
Index: linux-2.6-rt/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6-rt.orig/include/asm-parisc/cacheflush.h 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/include/asm-parisc/cacheflush.h 2006-11-30 17:45:17.000000000 +0100
@@ -57,9 +57,9 @@ flush_user_icache_range(unsigned long st
extern void flush_dcache_page(struct page *page);
#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)
#define flush_icache_page(vma,page) do { flush_kernel_dcache_page(page); flush_kernel_icache_page(page_address(page)); } while (0)
Index: linux-2.6-rt/include/linux/fs.h
===================================================================
--- linux-2.6-rt.orig/include/linux/fs.h 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/include/linux/fs.h 2006-11-30 17:45:17.000000000 +0100
@@ -430,7 +430,7 @@ struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
- rwlock_t tree_lock; /* and rwlock protecting it */
+ spinlock_t tree_lock; /* and spinlock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-11-30 17:45:17.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-11-30 17:45:17.000000000 +0100
@@ -128,9 +128,9 @@ void remove_from_page_cache(struct page
BUG_ON(!PageLocked(page));
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}
static int sync_page(void *word)
@@ -441,7 +441,7 @@ int add_to_page_cache(struct page *page,
if (error == 0) {
set_page_no_new_refs(page);
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
page_cache_get(page);
@@ -451,7 +451,7 @@ int add_to_page_cache(struct page *page,
mapping->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
end_page_no_new_refs(page);
radix_tree_preload_end();
}
Index: linux-2.6-rt/mm/migrate.c
===================================================================
--- linux-2.6-rt.orig/mm/migrate.c 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/mm/migrate.c 2006-11-30 17:45:17.000000000 +0100
@@ -304,14 +304,14 @@ static int migrate_page_move_mapping(str
}
set_page_no_new_refs(page);
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
if (page_count(page) != 2 + !!PagePrivate(page) ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
end_page_no_new_refs(page);
return -EAGAIN;
}
@@ -329,7 +329,7 @@ static int migrate_page_move_mapping(str
radix_tree_replace_slot(pslot, newpage);
page->mapping = NULL;
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
end_page_no_new_refs(page);
/*
Index: linux-2.6-rt/mm/page-writeback.c
===================================================================
--- linux-2.6-rt.orig/mm/page-writeback.c 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/mm/page-writeback.c 2006-11-30 17:45:17.000000000 +0100
@@ -762,7 +762,7 @@ int __set_page_dirty_nobuffers(struct pa
struct address_space *mapping2;
if (mapping) {
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -772,7 +772,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host,
@@ -852,12 +852,12 @@ int test_clear_page_dirty(struct page *p
unsigned long flags;
if (mapping) {
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
if (TestClearPageDirty(page)) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
/*
* We can continue to use `mapping' here because the
* page is locked, which pins the address_space
@@ -868,7 +868,7 @@ int test_clear_page_dirty(struct page *p
}
return 1;
}
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
}
return TestClearPageDirty(page);
@@ -915,13 +915,13 @@ int test_clear_page_writeback(struct pag
if (mapping) {
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
if (ret)
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
}
@@ -936,7 +936,7 @@ int test_set_page_writeback(struct page
if (mapping) {
unsigned long flags;
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret)
radix_tree_tag_set(&mapping->page_tree,
@@ -946,7 +946,7 @@ int test_set_page_writeback(struct page
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestSetPageWriteback(page);
}
Index: linux-2.6-rt/mm/swap_state.c
===================================================================
--- linux-2.6-rt.orig/mm/swap_state.c 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/mm/swap_state.c 2006-11-30 17:45:17.000000000 +0100
@@ -38,7 +38,7 @@ static struct backing_dev_info swap_back
struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
+ .tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
@@ -79,7 +79,7 @@ static int __add_to_swap_cache(struct pa
error = radix_tree_preload(gfp_mask);
if (!error) {
set_page_no_new_refs(page);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
@@ -90,7 +90,7 @@ static int __add_to_swap_cache(struct pa
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
end_page_no_new_refs(page);
radix_tree_preload_end();
}
@@ -202,9 +202,9 @@ void delete_from_swap_cache(struct page
entry.val = page_private(page);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
swap_free(entry);
page_cache_release(page);
Index: linux-2.6-rt/mm/swapfile.c
===================================================================
--- linux-2.6-rt.orig/mm/swapfile.c 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/mm/swapfile.c 2006-11-30 17:45:17.000000000 +0100
@@ -367,13 +367,13 @@ int remove_exclusive_swap_page(struct pa
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
}
spin_unlock(&swap_lock);
Index: linux-2.6-rt/mm/truncate.c
===================================================================
--- linux-2.6-rt.orig/mm/truncate.c 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/mm/truncate.c 2006-11-30 17:45:17.000000000 +0100
@@ -304,18 +304,18 @@ invalidate_complete_page2(struct address
if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page))
goto failed;
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}
Index: linux-2.6-rt/mm/vmscan.c
===================================================================
--- linux-2.6-rt.orig/mm/vmscan.c 2006-11-30 17:44:58.000000000 +0100
+++ linux-2.6-rt/mm/vmscan.c 2006-11-30 17:45:17.000000000 +0100
@@ -391,7 +391,7 @@ int remove_mapping(struct address_space
BUG_ON(mapping != page_mapping(page));
set_page_no_new_refs(page);
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
*
@@ -426,13 +426,13 @@ int remove_mapping(struct address_space
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
swap_free(swap);
goto free_it;
}
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
free_it:
end_page_no_new_refs(page);
@@ -440,7 +440,7 @@ free_it:
return 1;
cannot_free:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
end_page_no_new_refs(page);
return 0;
}
Index: linux-2.6-rt/fs/inode.c
===================================================================
--- linux-2.6-rt.orig/fs/inode.c 2006-11-30 18:00:49.000000000 +0100
+++ linux-2.6-rt/fs/inode.c 2006-11-30 18:01:08.000000000 +0100
@@ -193,7 +193,7 @@ void inode_init_once(struct inode *inode
mutex_init(&inode->i_mutex);
init_rwsem(&inode->i_alloc_sem);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
- rwlock_init(&inode->i_data.tree_lock);
+ spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
* [PATCH 12/16] radix-tree: concurrent write side support
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (10 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 11/16] mm: change tree_lock into a spinlock Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 13/16] atomic_ulong_t Peter Zijlstra
` (4 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: radix-tree-concurrent.patch --]
[-- Type: text/plain, Size: 20546 bytes --]
Provide support for concurrent write side operations without changing the API
for existing users.
Concurrency is realized by means of two locking models; the simple one is
ladder locking, the more complex one is path locking.
Ladder locking is like climbing down a ladder: you place your foot on a rung
below the one supporting your other foot before letting go, and so on. There
is never a moment with both feet in the air. Likewise when walking a tree, you
lock a node below the current node before releasing the current node's lock.
This allows other modifying operations to start as soon as you release the
lock on the root node, and even to complete before you do if they walk another
path downward.
The modifying operations (insert, lookup_slot and set_tag) use this simple
method.
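Each downward step of the ladder then looks like this (cf. radix_ladder_lock()
in the patch below; `child' stands for the node being descended into):

	spin_lock(&child->lock);        /* foot on the rung below... */
	spin_unlock(context->locked);   /* ...before the other lets go */
	context->locked = &child->lock;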
The more complex path locking method is needed for operations that must walk
upwards again after they have walked down; those are tag_clear and delete.
These lock their whole path downwards and release whole sections at points
where it can be determined the walk upwards will stop, thus also allowing
concurrency.
Finding, during the downward walk, the conditions under which the upward walk
will terminate is the 'interesting' part of this approach.
The remaining - unmodified - operations will have exclusive locking (since
they're unmodified, they never move the lock downwards from the root node).
The API for this looks like:

	DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree)

	radix_tree_lock(&ctx)
	... do _1_ modifying operation ...
	radix_tree_unlock(&ctx)
Note that before the radix operation the root node lock is held, providing
exclusive access; after the operation, the lock still held might only be
enough to protect a single item.
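For example, the insertion path of add_to_page_cache() becomes something like
this (a sketch; `offset', `page' and `error' as in that function -- the
operation itself picks the context up from the root, so its call signature is
unchanged):

	DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);

	radix_tree_lock(&ctx);
	/* the root lock is held: exclusive, until the insert walks down
	 * and hands the lock over to interior nodes */
	error = radix_tree_insert(&mapping->page_tree, offset, page);
	/* whatever lock the walk ended up holding covers our slot */
	radix_tree_unlock(&ctx);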
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/radix-tree.h | 87 +++++++++++++-
init/Kconfig | 4
lib/radix-tree.c | 279 +++++++++++++++++++++++++++++++++++----------
3 files changed, 306 insertions(+), 64 deletions(-)
Index: linux-2.6-rt/include/linux/radix-tree.h
===================================================================
--- linux-2.6-rt.orig/include/linux/radix-tree.h 2006-12-02 23:18:37.000000000 +0100
+++ linux-2.6-rt/include/linux/radix-tree.h 2006-12-03 00:52:55.000000000 +0100
@@ -63,23 +63,73 @@ struct radix_tree_root {
unsigned int height;
gfp_t gfp_mask;
struct radix_tree_node *rnode;
+ spinlock_t lock;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ struct radix_tree_context *context;
+ struct task_struct *owner;
+#endif
};
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+#define __RADIX_TREE_CONCURRENT_INIT \
+ .context = NULL, \
+ .owner = NULL,
+#else
+#define __RADIX_TREE_CONCURRENT_INIT
+#endif
+
#define RADIX_TREE_INIT(mask) { \
.height = 0, \
.gfp_mask = (mask), \
.rnode = NULL, \
+ .lock = __SPIN_LOCK_UNLOCKED(radix_tree_root.lock), \
+ __RADIX_TREE_CONCURRENT_INIT \
}
#define RADIX_TREE(name, mask) \
struct radix_tree_root name = RADIX_TREE_INIT(mask)
-#define INIT_RADIX_TREE(root, mask) \
-do { \
- (root)->height = 0; \
- (root)->gfp_mask = (mask); \
- (root)->rnode = NULL; \
-} while (0)
+static inline void INIT_RADIX_TREE(struct radix_tree_root *root, gfp_t gfp_mask)
+{
+ root->height = 0;
+ root->gfp_mask = gfp_mask;
+ root->rnode = NULL;
+ spin_lock_init(&root->lock);
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ root->context = NULL;
+ root->owner = NULL;
+#endif
+}
+
+struct radix_tree_context {
+ struct radix_tree_root *root;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ spinlock_t *locked;
+#endif
+};
+
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+#define __RADIX_TREE_CONTEXT_INIT \
+ .locked = NULL,
+#else
+#define __RADIX_TREE_CONTEXT_INIT
+#endif
+
+#define DECLARE_RADIX_TREE_CONTEXT(context, tree) \
+ struct radix_tree_context context = { \
+ .root = (tree), \
+ __RADIX_TREE_CONTEXT_INIT \
+ }
+
+static inline void
+init_radix_tree_context(struct radix_tree_context *wctx,
+ struct radix_tree_root *root)
+{
+ wctx->root = root;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ wctx->locked = NULL;
+#endif
+}
/**
* Radix-tree synchronization
@@ -156,6 +206,31 @@ static inline void radix_tree_replace_sl
rcu_assign_pointer(*pslot, item);
}
+static inline void radix_tree_lock(struct radix_tree_context *context)
+{
+ struct radix_tree_root *root = context->root;
+ rcu_read_lock();
+ spin_lock(&root->lock);
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ BUG_ON(context->locked);
+ context->locked = &root->lock;
+ root->owner = current;
+ root->context = context;
+#endif
+}
+
+static inline void radix_tree_unlock(struct radix_tree_context *context)
+{
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ BUG_ON(!context->locked);
+ spin_unlock(context->locked);
+ context->locked = NULL;
+#else
+ spin_unlock(&context->root->lock);
+#endif
+ rcu_read_unlock();
+}
+
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
Index: linux-2.6-rt/lib/radix-tree.c
===================================================================
--- linux-2.6-rt.orig/lib/radix-tree.c 2006-12-02 23:18:37.000000000 +0100
+++ linux-2.6-rt/lib/radix-tree.c 2006-12-03 00:52:57.000000000 +0100
@@ -32,6 +32,7 @@
#include <linux/string.h>
#include <linux/bitops.h>
#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
#ifdef __KERNEL__
@@ -52,11 +53,17 @@ struct radix_tree_node {
struct rcu_head rcu_head;
void *slots[RADIX_TREE_MAP_SIZE];
unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ spinlock_t lock;
+#endif
};
struct radix_tree_path {
struct radix_tree_node *node;
int offset;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ spinlock_t *locked;
+#endif
};
#define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
@@ -64,6 +71,10 @@ struct radix_tree_path {
static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH] __read_mostly;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH];
+#endif
+
/*
* Radix tree node cache.
*/
@@ -88,7 +99,7 @@ static inline gfp_t root_gfp_mask(struct
* that the caller has pinned this thread of control to the current CPU.
*/
static struct radix_tree_node *
-radix_tree_node_alloc(struct radix_tree_root *root)
+radix_tree_node_alloc(struct radix_tree_root *root, int height)
{
struct radix_tree_node *ret;
gfp_t gfp_mask = root_gfp_mask(root);
@@ -105,6 +116,11 @@ radix_tree_node_alloc(struct radix_tree_
}
}
BUG_ON(radix_tree_is_indirect_ptr(ret));
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+ spin_lock_init(&ret->lock);
+ lockdep_set_class(&ret->lock, &radix_node_class[height]);
+#endif
+ ret->height = height;
return ret;
}
@@ -208,6 +224,22 @@ static inline int any_tag_set(struct rad
return 0;
}
+static inline int any_tag_set_but(struct radix_tree_node *node,
+ unsigned int tag, int offset)
+{
+ int idx;
+ int offset_idx = offset / BITS_PER_LONG;
+ unsigned long offset_mask = ~(1 << (offset % BITS_PER_LONG));
+ for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) {
+ unsigned long mask = ~0UL;
+ if (idx == offset_idx)
+ mask = offset_mask;
+ if (node->tags[tag][idx] & mask)
+ return 1;
+ }
+ return 0;
+}
+
/*
* Return the maximum key which can be stored into a
* radix tree with height HEIGHT.
@@ -237,8 +269,8 @@ static int radix_tree_extend(struct radi
}
do {
- unsigned int newheight;
- if (!(node = radix_tree_node_alloc(root)))
+ unsigned int newheight = root->height + 1;
+ if (!(node = radix_tree_node_alloc(root, newheight)))
return -ENOMEM;
/* Increase the height. */
@@ -250,8 +282,6 @@ static int radix_tree_extend(struct radi
tag_set(node, tag, 0);
}
- newheight = root->height+1;
- node->height = newheight;
node->count = 1;
node = radix_tree_ptr_to_indirect(node);
rcu_assign_pointer(root->rnode, node);
@@ -261,6 +291,78 @@ out:
return 0;
}
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+static inline struct radix_tree_context *
+radix_tree_get_context(struct radix_tree_root *root)
+{
+ struct radix_tree_context *context = NULL;
+
+ if (root->owner == current && root->context) {
+ context = root->context;
+ root->context = NULL;
+ }
+ return context;
+}
+
+#define RADIX_TREE_CONTEXT(context, root) \
+ struct radix_tree_context *context = radix_tree_get_context(root)
+
+static inline spinlock_t *radix_node_lock(struct radix_tree_root *root,
+ struct radix_tree_node *node, int height)
+{
+ spinlock_t *locked = &node->lock;
+ spin_lock(locked);
+ return locked;
+}
+
+static inline void radix_ladder_lock(struct radix_tree_context *context,
+ struct radix_tree_node *node, int height)
+{
+ if (context) {
+ struct radix_tree_root *root = context->root;
+ spinlock_t *locked = radix_node_lock(root, node, height);
+ if (locked) {
+ spin_unlock(context->locked);
+ context->locked = locked;
+ }
+ }
+}
+
+static inline void radix_path_init(struct radix_tree_context *context,
+ struct radix_tree_path *pathp)
+{
+ pathp->locked = context ? context->locked : NULL;
+}
+
+static inline void radix_path_lock(struct radix_tree_context *context,
+ struct radix_tree_path *pathp, struct radix_tree_node *node,
+ int height)
+{
+ if (context) {
+ struct radix_tree_root *root = context->root;
+ spinlock_t *locked = radix_node_lock(root, node, height);
+ if (locked)
+ context->locked = locked;
+ pathp->locked = locked;
+ } else
+ pathp->locked = NULL;
+}
+
+static inline void radix_path_unlock(struct radix_tree_context *context,
+ struct radix_tree_path *punlock)
+{
+ if (context && punlock->locked &&
+ context->locked != punlock->locked)
+ spin_unlock(punlock->locked);
+}
+#else
+#define RADIX_TREE_CONTEXT(context, root) do { } while (0)
+#define radix_ladder_lock(context, node, height) do { } while (0)
+#define radix_path_init(context, pathp) do { } while (0)
+#define radix_path_lock(context, pathp, node, height) do { } while (0)
+#define radix_path_unlock(context, punlock) do { } while (0)
+#endif
+
/**
* radix_tree_insert - insert into a radix tree
* @root: radix tree root
@@ -276,6 +378,8 @@ int radix_tree_insert(struct radix_tree_
unsigned int height, shift;
int offset;
int error;
+ int tag;
+ RADIX_TREE_CONTEXT(context, root);
BUG_ON(radix_tree_is_indirect_ptr(item));
@@ -295,9 +399,8 @@ int radix_tree_insert(struct radix_tree_
while (height > 0) {
if (slot == NULL) {
/* Have to add a child node. */
- if (!(slot = radix_tree_node_alloc(root)))
+ if (!(slot = radix_tree_node_alloc(root, height)))
return -ENOMEM;
- slot->height = height;
if (node) {
rcu_assign_pointer(node->slots[offset], slot);
node->count++;
@@ -309,6 +412,9 @@ int radix_tree_insert(struct radix_tree_
/* Go a level down */
offset = (index >> shift) & RADIX_TREE_MAP_MASK;
node = slot;
+
+ radix_ladder_lock(context, node, height);
+
slot = node->slots[offset];
shift -= RADIX_TREE_MAP_SHIFT;
height--;
@@ -320,12 +426,12 @@ int radix_tree_insert(struct radix_tree_
if (node) {
node->count++;
rcu_assign_pointer(node->slots[offset], item);
- BUG_ON(tag_get(node, 0, offset));
- BUG_ON(tag_get(node, 1, offset));
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+ BUG_ON(tag_get(node, tag, offset));
} else {
rcu_assign_pointer(root->rnode, item);
- BUG_ON(root_tag_get(root, 0));
- BUG_ON(root_tag_get(root, 1));
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+ BUG_ON(root_tag_get(root, tag));
}
return 0;
@@ -349,6 +455,7 @@ void **radix_tree_lookup_slot(struct rad
{
unsigned int height, shift;
struct radix_tree_node *node, **slot;
+ RADIX_TREE_CONTEXT(context, root);
node = rcu_dereference(root->rnode);
if (node == NULL)
@@ -374,6 +481,8 @@ void **radix_tree_lookup_slot(struct rad
if (node == NULL)
return NULL;
+ radix_ladder_lock(context, node, height);
+
shift -= RADIX_TREE_MAP_SHIFT;
height--;
} while (height > 0);
@@ -449,6 +558,7 @@ void *radix_tree_tag_set(struct radix_tr
{
unsigned int height, shift;
struct radix_tree_node *slot;
+ RADIX_TREE_CONTEXT(context, root);
height = root->height;
BUG_ON(index > radix_tree_maxindex(height));
@@ -456,9 +566,15 @@ void *radix_tree_tag_set(struct radix_tr
slot = radix_tree_indirect_to_ptr(root->rnode);
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+ /* set the root's tag bit */
+ if (slot && !root_tag_get(root, tag))
+ root_tag_set(root, tag);
+
while (height > 0) {
int offset;
+ radix_ladder_lock(context, slot, height);
+
offset = (index >> shift) & RADIX_TREE_MAP_MASK;
if (!tag_get(slot, tag, offset))
tag_set(slot, tag, offset);
@@ -468,10 +584,6 @@ void *radix_tree_tag_set(struct radix_tr
height--;
}
- /* set the root's tag bit */
- if (slot && !root_tag_get(root, tag))
- root_tag_set(root, tag);
-
return slot;
}
EXPORT_SYMBOL(radix_tree_tag_set);
@@ -494,28 +606,51 @@ void *radix_tree_tag_clear(struct radix_
unsigned long index, unsigned int tag)
{
struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
+ struct radix_tree_path *punlock = path, *piter;
struct radix_tree_node *slot = NULL;
unsigned int height, shift;
+ RADIX_TREE_CONTEXT(context, root);
+
+ pathp->node = NULL;
+ radix_path_init(context, pathp);
height = root->height;
if (index > radix_tree_maxindex(height))
goto out;
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
- pathp->node = NULL;
slot = radix_tree_indirect_to_ptr(root->rnode);
while (height > 0) {
int offset;
+ int parent_tag = 0;
if (slot == NULL)
goto out;
+ if (pathp->node)
+ parent_tag = tag_get(pathp->node, tag, pathp->offset);
+ else
+ parent_tag = root_tag_get(root, tag);
+
offset = (index >> shift) & RADIX_TREE_MAP_MASK;
- pathp[1].offset = offset;
- pathp[1].node = slot;
- slot = slot->slots[offset];
pathp++;
+ pathp->offset = offset;
+ pathp->node = slot;
+ radix_path_lock(context, pathp, slot, height);
+
+ /*
+ * If the parent node does not have the tag set or the current
+ * node has slots with the tag set other than the one we're
+ * potentially clearing, the change can never propagate upwards
+ * from here.
+ */
+ if (!parent_tag || any_tag_set_but(slot, tag, offset)) {
+ for (; punlock < pathp; punlock++)
+ radix_path_unlock(context, punlock);
+ }
+
+ slot = slot->slots[offset];
shift -= RADIX_TREE_MAP_SHIFT;
height--;
}
@@ -523,20 +658,16 @@ void *radix_tree_tag_clear(struct radix_
if (slot == NULL)
goto out;
- while (pathp->node) {
- if (!tag_get(pathp->node, tag, pathp->offset))
- goto out;
- tag_clear(pathp->node, tag, pathp->offset);
- if (any_tag_set(pathp->node, tag))
- goto out;
- pathp--;
+ for (piter = pathp; piter >= punlock; piter--) {
+ if (piter->node)
+ tag_clear(piter->node, tag, piter->offset);
+ else
+ root_tag_clear(root, tag);
}
- /* clear the root's tag bit */
- if (root_tag_get(root, tag))
- root_tag_clear(root, tag);
-
out:
+ for (; punlock < pathp; punlock++)
+ radix_path_unlock(context, punlock);
return slot;
}
EXPORT_SYMBOL(radix_tree_tag_clear);
@@ -958,7 +1089,7 @@ radix_tree_gang_lookup_tag_slot(struct r
if (!radix_tree_is_indirect_ptr(node)) {
if (first_index > 0)
return 0;
- results[0] = node;
+ results[0] = (void **)&root->rnode;
return 1;
}
node = radix_tree_indirect_to_ptr(node);
@@ -994,6 +1125,7 @@ static inline void radix_tree_shrink(str
while (root->height > 0) {
struct radix_tree_node *to_free = root->rnode;
void *newptr;
+ int tag;
BUG_ON(!radix_tree_is_indirect_ptr(to_free));
to_free = radix_tree_indirect_to_ptr(to_free);
@@ -1020,11 +1152,10 @@ static inline void radix_tree_shrink(str
root->rnode = newptr;
root->height--;
/* must only free zeroed nodes into the slab */
- tag_clear(to_free, 0, 0);
- tag_clear(to_free, 1, 0);
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+ tag_clear(to_free, tag, 0);
to_free->slots[0] = NULL;
to_free->count = 0;
- radix_tree_node_free(to_free);
}
}
@@ -1040,11 +1171,15 @@ static inline void radix_tree_shrink(str
void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
{
struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
+ struct radix_tree_path *punlock = path, *piter;
struct radix_tree_node *slot = NULL;
- struct radix_tree_node *to_free;
unsigned int height, shift;
int tag;
int offset;
+ RADIX_TREE_CONTEXT(context, root);
+
+ pathp->node = NULL;
+ radix_path_init(context, pathp);
height = root->height;
if (index > radix_tree_maxindex(height))
@@ -1059,16 +1194,42 @@ void *radix_tree_delete(struct radix_tre
slot = radix_tree_indirect_to_ptr(slot);
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
- pathp->node = NULL;
do {
+ int parent_tags = 0;
+ int no_tags = 0;
+
if (slot == NULL)
goto out;
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+ if (pathp->node) {
+ parent_tags |=
+ tag_get(pathp->node, tag, pathp->offset);
+ } else
+ parent_tags |= root_tag_get(root, tag);
+ }
+
pathp++;
offset = (index >> shift) & RADIX_TREE_MAP_MASK;
pathp->offset = offset;
pathp->node = slot;
+ radix_path_lock(context, pathp, slot, height);
+
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+ no_tags |= !any_tag_set_but(slot, tag, offset);
+
+ /*
+ * If the parent node does not have any tag set, or the current
+ * node has slots other than the one we're potentially clearing
+ * with tags still set, AND there are more children than just
+ * us, the changes can never propagate upwards from here.
+ */
+ if ((!parent_tags || !no_tags) && slot->count > 2) {
+ for (; punlock < pathp; punlock++)
+ radix_path_unlock(context, punlock);
+ }
+
slot = slot->slots[offset];
shift -= RADIX_TREE_MAP_SHIFT;
height--;
@@ -1081,41 +1242,43 @@ void *radix_tree_delete(struct radix_tre
* Clear all tags associated with the just-deleted item
*/
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
- if (tag_get(pathp->node, tag, pathp->offset))
- radix_tree_tag_clear(root, index, tag);
+ for (piter = pathp; piter >= punlock; piter--) {
+ if (piter->node) {
+ if (!tag_get(piter->node, tag, piter->offset))
+ break;
+ tag_clear(piter->node, tag, piter->offset);
+ if (any_tag_set(piter->node, tag))
+ break;
+ } else {
+ if (root_tag_get(root, tag))
+ root_tag_clear(root, tag);
+ }
+ }
}
- to_free = NULL;
- /* Now free the nodes we do not need anymore */
- while (pathp->node) {
- pathp->node->slots[pathp->offset] = NULL;
- pathp->node->count--;
- /*
- * Queue the node for deferred freeing after the
- * last reference to it disappears (set NULL, above).
- */
- if (to_free)
- radix_tree_node_free(to_free);
+ /* Now unhook the nodes we do not need anymore */
+ for (piter = pathp; piter >= punlock && piter->node; piter--) {
+ piter->node->slots[piter->offset] = NULL;
+ piter->node->count--;
- if (pathp->node->count) {
- if (pathp->node ==
+ if (piter->node->count) {
+ if (piter->node ==
radix_tree_indirect_to_ptr(root->rnode))
radix_tree_shrink(root);
goto out;
}
-
- /* Node with zero slots in use so free it */
- to_free = pathp->node;
- pathp--;
-
}
+
root_tag_clear_all(root);
root->height = 0;
root->rnode = NULL;
- if (to_free)
- radix_tree_node_free(to_free);
out:
+ for (; punlock <= pathp; punlock++) {
+ radix_path_unlock(context, punlock);
+ if (punlock->node && punlock->node->count == 0)
+ radix_tree_node_free(punlock->node);
+ }
return slot;
}
EXPORT_SYMBOL(radix_tree_delete);
Index: linux-2.6-rt/init/Kconfig
===================================================================
--- linux-2.6-rt.orig/init/Kconfig 2006-12-02 23:18:16.000000000 +0100
+++ linux-2.6-rt/init/Kconfig 2006-12-02 23:18:37.000000000 +0100
@@ -287,6 +287,10 @@ config TASK_XACCT
config SYSCTL
bool
+config RADIX_TREE_CONCURRENT
+ bool "Enable concurrent radix tree operations (EXPERIMENTAL)"
+ default y if SMP
+
menuconfig EMBEDDED
bool "Configure standard kernel features (for small systems)"
help
* [PATCH 13/16] atomic_ulong_t
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (11 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 12/16] radix-tree: concurrent write side support Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 14/16] mm/fs: abstract address_space::nrpages Peter Zijlstra
` (3 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: atomic_ulong_t.patch --]
[-- Type: text/plain, Size: 1771 bytes --]
Provide an unsigned long atomic type.
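For illustration, a minimal usage sketch (the counter and helpers here are
hypothetical callers, not part of the patch; they rely only on the wrappers
added below):

	/* A simple unsigned long event counter built on atomic_ulong_t. */
	static atomic_ulong_t nr_events = ATOMIC_ULONG_INIT(0);

	static void note_event(void)
	{
		/* atomic increment, no external locking required */
		atomic_ulong_inc(&nr_events);
	}

	static unsigned long events_seen(void)
	{
		/* lock-free snapshot of the counter */
		return atomic_ulong_read(&nr_events);
	}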
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/asm-generic/atomic.h | 45 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
Index: linux-2.6-rt/include/asm-generic/atomic.h
===================================================================
--- linux-2.6-rt.orig/include/asm-generic/atomic.h 2006-12-02 17:42:16.000000000 +0100
+++ linux-2.6-rt/include/asm-generic/atomic.h 2006-12-02 17:46:09.000000000 +0100
@@ -114,4 +114,49 @@ static inline void atomic_long_sub(long
}
#endif
+
+typedef atomic_long_t atomic_ulong_t;
+
+#define ATOMIC_ULONG_INIT(i) ATOMIC_LONG_INIT(i)
+static inline unsigned long atomic_ulong_read(atomic_ulong_t *l)
+{
+ atomic_long_t *v = (atomic_long_t *)l;
+
+ return (unsigned long)atomic_long_read(v);
+}
+
+static inline void atomic_ulong_set(atomic_ulong_t *l, unsigned long i)
+{
+ atomic_long_t *v = (atomic_long_t *)l;
+
+ atomic_long_set(v, i);
+}
+
+static inline void atomic_ulong_inc(atomic_ulong_t *l)
+{
+ atomic_long_t *v = (atomic_long_t *)l;
+
+ atomic_long_inc(v);
+}
+
+static inline void atomic_ulong_dec(atomic_ulong_t *l)
+{
+ atomic_long_t *v = (atomic_long_t *)l;
+
+ atomic_long_dec(v);
+}
+
+static inline void atomic_ulong_add(unsigned long i, atomic_ulong_t *l)
+{
+ atomic_long_t *v = (atomic_long_t *)l;
+
+ atomic_long_add(i, v);
+}
+
+static inline void atomic_ulong_sub(unsigned long i, atomic_ulong_t *l)
+{
+ atomic_long_t *v = (atomic_long_t *)l;
+
+ atomic_long_sub(i, v);
+}
#endif
* [PATCH 14/16] mm/fs: abstract address_space::nrpages
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (12 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 13/16] atomic_ulong_t Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 15/16] mm: lock_page_ref Peter Zijlstra
` (2 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: mapping_nrpages.patch --]
[-- Type: text/plain, Size: 19752 bytes --]
Currently the tree_lock protects mapping->nrpages; this will not be
possible much longer. Hence, abstract access to this variable so that
it can easily be replaced by an atomic_ulong_t.
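To sketch the idea (hypothetical call site, not taken from the patch): once
every access goes through the accessors, only the accessors change when the
backing type does.

	/* before: direct field access, implicitly tied to the tree_lock */
	if (mapping->nrpages == 0)
		return 0;

	/* after: the accessor hides the representation, so a later patch
	 * can swap in an atomic_ulong_t without touching this call site */
	if (mapping_nrpages(mapping) == 0)
		return 0;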
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/sh64/lib/dbg.c | 2 +-
fs/block_dev.c | 2 +-
fs/buffer.c | 2 +-
fs/gfs2/glock.c | 2 +-
fs/gfs2/glops.c | 6 +++---
fs/gfs2/meta_io.c | 2 +-
fs/hugetlbfs/inode.c | 2 +-
fs/inode.c | 10 +++++-----
fs/jffs/inode-v23.c | 2 +-
fs/jffs2/dir.c | 4 ++--
fs/jffs2/fs.c | 2 +-
fs/libfs.c | 2 +-
fs/nfs/inode.c | 6 +++---
fs/xfs/linux-2.6/xfs_vnode.h | 2 +-
include/linux/fs.h | 22 +++++++++++++++++++++-
include/linux/swap.h | 2 +-
ipc/shm.c | 4 ++--
mm/filemap.c | 12 ++++++------
mm/shmem.c | 8 ++++----
mm/swap_state.c | 4 ++--
mm/truncate.c | 2 +-
21 files changed, 60 insertions(+), 40 deletions(-)
Index: linux-2.6-rt/arch/sh64/lib/dbg.c
===================================================================
--- linux-2.6-rt.orig/arch/sh64/lib/dbg.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/arch/sh64/lib/dbg.c 2006-12-02 22:23:42.000000000 +0100
@@ -424,6 +424,6 @@ void print_page(struct page *page)
printk(" page[%p] -> index 0x%lx, count 0x%x, flags 0x%lx\n",
page, page->index, page_count(page), page->flags);
printk(" address_space = %p, pages =%ld\n", page->mapping,
- page->mapping->nrpages);
+ mapping_nrpages(page->mapping));
}
Index: linux-2.6-rt/fs/block_dev.c
===================================================================
--- linux-2.6-rt.orig/fs/block_dev.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/block_dev.c 2006-12-02 22:23:42.000000000 +0100
@@ -398,7 +398,7 @@ long nr_blockdev_pages(void)
list_for_each(p, &all_bdevs) {
struct block_device *bdev;
bdev = list_entry(p, struct block_device, bd_list);
- ret += bdev->bd_inode->i_mapping->nrpages;
+ ret += mapping_nrpages(bdev->bd_inode->i_mapping);
}
spin_unlock(&bdev_lock);
return ret;
Index: linux-2.6-rt/fs/buffer.c
===================================================================
--- linux-2.6-rt.orig/fs/buffer.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/fs/buffer.c 2006-12-02 23:17:12.000000000 +0100
@@ -335,7 +335,7 @@ void invalidate_bdev(struct block_device
{
struct address_space *mapping = bdev->bd_inode->i_mapping;
- if (mapping->nrpages == 0)
+ if (mapping_nrpages(mapping) == 0)
return;
invalidate_bh_lrus();
Index: linux-2.6-rt/fs/gfs2/glock.c
===================================================================
--- linux-2.6-rt.orig/fs/gfs2/glock.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/gfs2/glock.c 2006-12-02 22:23:42.000000000 +0100
@@ -2128,7 +2128,7 @@ static int dump_glock(struct gfs2_glock
(list_empty(&gl->gl_reclaim)) ? "no" : "yes");
if (gl->gl_aspace)
printk(KERN_INFO " aspace = 0x%p nrpages = %lu\n", gl->gl_aspace,
- gl->gl_aspace->i_mapping->nrpages);
+ mapping_nrpages(gl->gl_aspace->i_mapping));
else
printk(KERN_INFO " aspace = no\n");
printk(KERN_INFO " ail = %d\n", atomic_read(&gl->gl_ail_count));
Index: linux-2.6-rt/fs/gfs2/glops.c
===================================================================
--- linux-2.6-rt.orig/fs/gfs2/glops.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/gfs2/glops.c 2006-12-02 22:23:42.000000000 +0100
@@ -123,7 +123,7 @@ static void gfs2_page_inval(struct gfs2_
return;
truncate_inode_pages(inode->i_mapping, 0);
- gfs2_assert_withdraw(GFS2_SB(&ip->i_inode), !inode->i_mapping->nrpages);
+ gfs2_assert_withdraw(GFS2_SB(&ip->i_inode), !mapping_nrpages(inode->i_mapping));
clear_bit(GIF_PAGED, &ip->i_flags);
}
@@ -324,7 +324,7 @@ static int inode_go_demote_ok(struct gfs
struct gfs2_sbd *sdp = gl->gl_sbd;
int demote = 0;
- if (!gl->gl_object && !gl->gl_aspace->i_mapping->nrpages)
+ if (!gl->gl_object && !mapping_nrpages(gl->gl_aspace->i_mapping))
demote = 1;
else if (!sdp->sd_args.ar_localcaching &&
time_after_eq(jiffies, gl->gl_stamp +
@@ -428,7 +428,7 @@ static void inode_greedy(struct gfs2_glo
static int rgrp_go_demote_ok(struct gfs2_glock *gl)
{
- return !gl->gl_aspace->i_mapping->nrpages;
+ return !mapping_nrpages(gl->gl_aspace->i_mapping);
}
/**
Index: linux-2.6-rt/fs/gfs2/meta_io.c
===================================================================
--- linux-2.6-rt.orig/fs/gfs2/meta_io.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/gfs2/meta_io.c 2006-12-02 22:23:42.000000000 +0100
@@ -104,7 +104,7 @@ void gfs2_meta_inval(struct gfs2_glock *
truncate_inode_pages(mapping, 0);
atomic_dec(&aspace->i_writecount);
- gfs2_assert_withdraw(sdp, !mapping->nrpages);
+ gfs2_assert_withdraw(sdp, !mapping_nrpages(mapping));
}
/**
Index: linux-2.6-rt/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6-rt.orig/fs/hugetlbfs/inode.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/hugetlbfs/inode.c 2006-12-02 22:23:42.000000000 +0100
@@ -214,7 +214,7 @@ static void truncate_hugepages(struct in
}
huge_pagevec_release(&pvec);
}
- BUG_ON(!lstart && mapping->nrpages);
+ BUG_ON(!lstart && mapping_nrpages(mapping));
hugetlb_unreserve_pages(inode, start, freed);
}
Index: linux-2.6-rt/fs/jffs/inode-v23.c
===================================================================
--- linux-2.6-rt.orig/fs/jffs/inode-v23.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/jffs/inode-v23.c 2006-12-02 22:23:42.000000000 +0100
@@ -1352,7 +1352,7 @@ jffs_create(struct inode *dir, struct de
inode->i_op = &jffs_file_inode_operations;
inode->i_fop = &jffs_file_operations;
inode->i_mapping->a_ops = &jffs_address_operations;
- inode->i_mapping->nrpages = 0;
+ mapping_nrpages_init(inode->i_mapping);
d_instantiate(dentry, inode);
jffs_create_end:
Index: linux-2.6-rt/fs/jffs2/dir.c
===================================================================
--- linux-2.6-rt.orig/fs/jffs2/dir.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/jffs2/dir.c 2006-12-02 22:23:42.000000000 +0100
@@ -206,7 +206,7 @@ static int jffs2_create(struct inode *di
inode->i_op = &jffs2_file_inode_operations;
inode->i_fop = &jffs2_file_operations;
inode->i_mapping->a_ops = &jffs2_file_address_operations;
- inode->i_mapping->nrpages = 0;
+ mapping_nrpages_init(inode->i_mapping);
f = JFFS2_INODE_INFO(inode);
dir_f = JFFS2_INODE_INFO(dir_i);
@@ -230,7 +230,7 @@ static int jffs2_create(struct inode *di
d_instantiate(dentry, inode);
D1(printk(KERN_DEBUG "jffs2_create: Created ino #%lu with mode %o, nlink %d(%d). nrpages %ld\n",
- inode->i_ino, inode->i_mode, inode->i_nlink, f->inocache->nlink, inode->i_mapping->nrpages));
+ inode->i_ino, inode->i_mode, inode->i_nlink, f->inocache->nlink, mapping_nrpages(inode->i_mapping)));
return 0;
fail:
Index: linux-2.6-rt/fs/jffs2/fs.c
===================================================================
--- linux-2.6-rt.orig/fs/jffs2/fs.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/jffs2/fs.c 2006-12-02 22:23:42.000000000 +0100
@@ -293,7 +293,7 @@ void jffs2_read_inode (struct inode *ino
inode->i_op = &jffs2_file_inode_operations;
inode->i_fop = &jffs2_file_operations;
inode->i_mapping->a_ops = &jffs2_file_address_operations;
- inode->i_mapping->nrpages = 0;
+ mapping_nrpages_init(inode->i_mapping);
break;
case S_IFBLK:
Index: linux-2.6-rt/fs/libfs.c
===================================================================
--- linux-2.6-rt.orig/fs/libfs.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/libfs.c 2006-12-02 22:23:42.000000000 +0100
@@ -16,7 +16,7 @@ int simple_getattr(struct vfsmount *mnt,
{
struct inode *inode = dentry->d_inode;
generic_fillattr(inode, stat);
- stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9);
+ stat->blocks = mapping_nrpages(inode->i_mapping) << (PAGE_CACHE_SHIFT - 9);
return 0;
}
Index: linux-2.6-rt/fs/nfs/inode.c
===================================================================
--- linux-2.6-rt.orig/fs/nfs/inode.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/nfs/inode.c 2006-12-02 22:23:42.000000000 +0100
@@ -93,7 +93,7 @@ int nfs_sync_mapping(struct address_spac
{
int ret;
- if (mapping->nrpages == 0)
+ if (mapping_nrpages(mapping) == 0)
return 0;
unmap_mapping_range(mapping, 0, 0, 0);
ret = filemap_write_and_wait(mapping);
@@ -133,7 +133,7 @@ void nfs_zap_caches(struct inode *inode)
void nfs_zap_mapping(struct inode *inode, struct address_space *mapping)
{
- if (mapping->nrpages != 0) {
+ if (mapping_nrpages(mapping) != 0) {
spin_lock(&inode->i_lock);
NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
spin_unlock(&inode->i_lock);
@@ -684,7 +684,7 @@ int nfs_revalidate_mapping(struct inode
goto out;
if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
- if (mapping->nrpages != 0) {
+ if (mapping_nrpages(mapping) != 0) {
if (S_ISREG(inode->i_mode)) {
ret = nfs_sync_mapping(mapping);
if (ret < 0)
Index: linux-2.6-rt/fs/xfs/linux-2.6/xfs_vnode.h
===================================================================
--- linux-2.6-rt.orig/fs/xfs/linux-2.6/xfs_vnode.h 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/fs/xfs/linux-2.6/xfs_vnode.h 2006-12-02 22:23:42.000000000 +0100
@@ -548,7 +548,7 @@ static inline void vn_atime_to_time_t(bh
* Some useful predicates.
*/
#define VN_MAPPED(vp) mapping_mapped(vn_to_inode(vp)->i_mapping)
-#define VN_CACHED(vp) (vn_to_inode(vp)->i_mapping->nrpages)
+#define VN_CACHED(vp) mapping_nrpages(vn_to_inode(vp)->i_mapping)
#define VN_DIRTY(vp) mapping_tagged(vn_to_inode(vp)->i_mapping, \
PAGECACHE_TAG_DIRTY)
#define VN_TRUNC(vp) ((vp)->v_flag & VTRUNCATED)
Index: linux-2.6-rt/include/linux/fs.h
===================================================================
--- linux-2.6-rt.orig/include/linux/fs.h 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/include/linux/fs.h 2006-12-02 23:17:11.000000000 +0100
@@ -436,7 +436,7 @@ struct address_space {
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
spinlock_t i_mmap_lock; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
- unsigned long nrpages; /* number of total pages */
+ unsigned long __nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
@@ -451,6 +451,26 @@ struct address_space {
* of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
*/
+static inline void mapping_nrpages_init(struct address_space *mapping)
+{
+ mapping->__nrpages = 0;
+}
+
+static inline unsigned long mapping_nrpages(struct address_space *mapping)
+{
+ return mapping->__nrpages;
+}
+
+static inline void mapping_nrpages_inc(struct address_space *mapping)
+{
+ mapping->__nrpages++;
+}
+
+static inline void mapping_nrpages_dec(struct address_space *mapping)
+{
+ mapping->__nrpages--;
+}
+
struct block_device {
dev_t bd_dev; /* not a kdev_t - it's a search key */
struct inode * bd_inode; /* will die */
Index: linux-2.6-rt/ipc/shm.c
===================================================================
--- linux-2.6-rt.orig/ipc/shm.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/ipc/shm.c 2006-12-02 22:23:42.000000000 +0100
@@ -499,11 +499,11 @@ static void shm_get_stat(struct ipc_name
if (is_file_hugepages(shp->shm_file)) {
struct address_space *mapping = inode->i_mapping;
- *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+ *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping_nrpages(mapping);
} else {
struct shmem_inode_info *info = SHMEM_I(inode);
spin_lock(&info->lock);
- *rss += inode->i_mapping->nrpages;
+ *rss += mapping_nrpages(inode->i_mapping);
*swp += info->swapped;
spin_unlock(&info->lock);
}
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-12-02 23:17:12.000000000 +0100
@@ -118,7 +118,7 @@ void __remove_from_page_cache(struct pag
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
- mapping->nrpages--;
+ mapping_nrpages_dec(mapping);
__dec_zone_page_state(page, NR_FILE_PAGES);
}
@@ -190,7 +190,7 @@ int __filemap_fdatawrite_range(struct ad
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
- .nr_to_write = mapping->nrpages * 2,
+ .nr_to_write = mapping_nrpages(mapping) * 2,
.range_start = start,
.range_end = end,
};
@@ -372,7 +372,7 @@ int filemap_write_and_wait(struct addres
{
int err = 0;
- if (mapping->nrpages) {
+ if (mapping_nrpages(mapping)) {
err = filemap_fdatawrite(mapping);
/*
* Even if the above returned error, the pages may be
@@ -406,7 +406,7 @@ int filemap_write_and_wait_range(struct
{
int err = 0;
- if (mapping->nrpages) {
+ if (mapping_nrpages(mapping)) {
err = __filemap_fdatawrite_range(mapping, lstart, lend,
WB_SYNC_ALL);
/* See comment of filemap_write_and_wait() */
@@ -448,7 +448,7 @@ int add_to_page_cache(struct page *page,
SetPageLocked(page);
page->mapping = mapping;
page->index = offset;
- mapping->nrpages++;
+ mapping_nrpages_inc(mapping);
__inc_zone_page_state(page, NR_FILE_PAGES);
}
spin_unlock_irq(&mapping->tree_lock);
@@ -2469,7 +2469,7 @@ generic_file_direct_IO(int rw, struct ki
if (retval == 0) {
retval = mapping->a_ops->direct_IO(rw, iocb, iov,
offset, nr_segs);
- if (rw == WRITE && mapping->nrpages) {
+ if (rw == WRITE && mapping_nrpages(mapping)) {
pgoff_t end = (offset + write_len - 1)
>> PAGE_CACHE_SHIFT;
int err = invalidate_inode_pages2_range(mapping,
Index: linux-2.6-rt/mm/shmem.c
===================================================================
--- linux-2.6-rt.orig/mm/shmem.c 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/mm/shmem.c 2006-12-02 22:23:42.000000000 +0100
@@ -211,8 +211,8 @@ static void shmem_free_blocks(struct ino
* We have to calculate the free blocks since the mm can drop
* undirtied hole pages behind our back.
*
- * But normally info->alloced == inode->i_mapping->nrpages + info->swapped
- * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
+ * But normally info->alloced == mapping_nrpages(inode->i_mapping) + info->swapped
+ * So mm freed is info->alloced - (mapping_nrpages(inode->i_mapping) + info->swapped)
*
* It has to be called with the spinlock held.
*/
@@ -221,7 +221,7 @@ static void shmem_recalc_inode(struct in
struct shmem_inode_info *info = SHMEM_I(inode);
long freed;
- freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
+ freed = info->alloced - info->swapped - mapping_nrpages(inode->i_mapping);
if (freed > 0) {
info->alloced -= freed;
shmem_unacct_blocks(info->flags, freed);
@@ -602,7 +602,7 @@ static void shmem_truncate_range(struct
done1:
shmem_dir_unmap(dir);
done2:
- if (inode->i_mapping->nrpages && (info->flags & SHMEM_PAGEIN)) {
+ if (mapping_nrpages(inode->i_mapping) && (info->flags & SHMEM_PAGEIN)) {
/*
* Call truncate_inode_pages again: racing shmem_unuse_inode
* may have swizzled a page in from swap since vmtruncate or
Index: linux-2.6-rt/mm/swap_state.c
===================================================================
--- linux-2.6-rt.orig/mm/swap_state.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/swap_state.c 2006-12-02 23:17:12.000000000 +0100
@@ -87,7 +87,7 @@ static int __add_to_swap_cache(struct pa
SetPageLocked(page);
SetPageSwapCache(page);
set_page_private(page, entry.val);
- total_swapcache_pages++;
+ mapping_nrpages_inc(&swapper_space);
__inc_zone_page_state(page, NR_FILE_PAGES);
}
spin_unlock_irq(&swapper_space.tree_lock);
@@ -133,7 +133,7 @@ void __delete_from_swap_cache(struct pag
radix_tree_delete(&swapper_space.page_tree, page_private(page));
set_page_private(page, 0);
ClearPageSwapCache(page);
- total_swapcache_pages--;
+ mapping_nrpages_dec(&swapper_space);
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
}
Index: linux-2.6-rt/mm/truncate.c
===================================================================
--- linux-2.6-rt.orig/mm/truncate.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/truncate.c 2006-12-02 23:17:12.000000000 +0100
@@ -134,7 +134,7 @@ void truncate_inode_pages_range(struct a
pgoff_t next;
int i;
- if (mapping->nrpages == 0)
+ if (mapping_nrpages(mapping) == 0)
return;
BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
Index: linux-2.6-rt/include/linux/swap.h
===================================================================
--- linux-2.6-rt.orig/include/linux/swap.h 2006-12-02 22:23:38.000000000 +0100
+++ linux-2.6-rt/include/linux/swap.h 2006-12-02 22:23:42.000000000 +0100
@@ -224,7 +224,7 @@ extern int end_swap_bio_read(struct bio
/* linux/mm/swap_state.c */
extern struct address_space swapper_space;
-#define total_swapcache_pages swapper_space.nrpages
+#define total_swapcache_pages mapping_nrpages(&swapper_space)
extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *, gfp_t);
extern void __delete_from_swap_cache(struct page *);
Index: linux-2.6-rt/fs/inode.c
===================================================================
--- linux-2.6-rt.orig/fs/inode.c 2006-12-02 22:32:32.000000000 +0100
+++ linux-2.6-rt/fs/inode.c 2006-12-02 23:17:11.000000000 +0100
@@ -246,7 +246,7 @@ void clear_inode(struct inode *inode)
might_sleep();
invalidate_inode_buffers(inode);
- BUG_ON(inode->i_data.nrpages);
+ BUG_ON(mapping_nrpages(&inode->i_data));
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
wait_on_inode(inode);
@@ -279,7 +279,7 @@ static void dispose_list(struct list_hea
inode = list_entry(head->next, struct inode, i_list);
list_del(&inode->i_list);
- if (inode->i_data.nrpages)
+ if (mapping_nrpages(&inode->i_data))
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
@@ -371,7 +371,7 @@ static int can_unuse(struct inode *inode
return 0;
if (atomic_read(&inode->i_count))
return 0;
- if (inode->i_data.nrpages)
+ if (mapping_nrpages(&inode->i_data))
return 0;
return 1;
}
@@ -410,7 +410,7 @@ static void prune_icache(int nr_to_scan)
list_move(&inode->i_list, &inode_unused);
continue;
}
- if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ if (inode_has_buffers(inode) || mapping_nrpages(&inode->i_data)) {
__iget(inode);
spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
@@ -1057,7 +1057,7 @@ static void generic_forget_inode(struct
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
- if (inode->i_data.nrpages)
+ if (mapping_nrpages(&inode->i_data))
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
wake_up_inode(inode);
* [PATCH 15/16] mm: lock_page_ref
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (13 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 14/16] mm/fs: abstract address_space::nrpages Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-07 16:18 ` [PATCH 16/16] mm: concurrent pagecache write side Peter Zijlstra
2006-12-11 19:03 ` [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Christoph Lameter
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: lock_page_ref.patch --]
[-- Type: text/plain, Size: 15298 bytes --]
Change the PG_nonewref operations into locking primitives and place them
so that they provide page-level serialization with regard to the page_tree
operations (basically, replace the tree_lock with a per-page lock).
The normal page lock has sufficiently different (and overlapping) scope and
protection rules that this second lock is needed.
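Schematically, every page_tree modification site ends up with this locking
pattern (a sketch only; error paths elided):

	lock_page_ref_irq(page);	/* per-page PG_nonewrefs bit-spinlock */
	spin_lock(&mapping->tree_lock);	/* still taken here; removed in patch 16 */

	/* ... modify mapping->page_tree and the page's cache state ... */

	spin_unlock(&mapping->tree_lock);
	unlock_page_ref_irq(page);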
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/buffer.c | 6 ++-
include/linux/pagemap.h | 76 ++++++++++++++++++++++++++++++++----------------
mm/filemap.c | 14 +++++---
mm/migrate.c | 12 +++----
mm/page-writeback.c | 27 +++++++++++------
mm/swap_state.c | 14 +++++---
mm/swapfile.c | 6 ++-
mm/truncate.c | 9 +++--
mm/vmscan.c | 14 ++++----
9 files changed, 112 insertions(+), 66 deletions(-)
Index: linux-2.6-rt/include/linux/pagemap.h
===================================================================
--- linux-2.6-rt.orig/include/linux/pagemap.h 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/include/linux/pagemap.h 2006-12-02 22:25:18.000000000 +0100
@@ -75,55 +75,81 @@ static inline wait_queue_head_t *page_wa
extern int __sleep_on_page(void *);
#ifndef CONFIG_PREEMPT_RT
-static inline void set_page_no_new_refs(struct page *page)
+static inline void lock_page_ref(struct page *page)
{
- VM_BUG_ON(PageNoNewRefs(page));
preempt_disable();
- SetPageNoNewRefs(page);
+ bit_spin_lock(PG_nonewrefs, &page->flags);
smp_wmb();
}
-static inline void end_page_no_new_refs(struct page *page)
+static inline void unlock_page_ref(struct page *page)
{
- VM_BUG_ON(!PageNoNewRefs(page));
- smp_wmb();
- ClearPageNoNewRefs(page);
+ bit_spin_unlock(PG_nonewrefs, &page->flags);
preempt_enable();
}
-static inline void wait_on_new_refs(struct page *page)
+static inline void wait_on_unlock_page_ref(struct page *page)
{
- while (unlikely(PageNoNewRefs(page)))
+ while (unlikely(test_bit(PG_nonewrefs, &page->flags)))
cpu_relax();
}
#else
-static inline void set_page_no_new_refs(struct page *page)
+/*
+ * open coded sleeping bit lock, yay!
+ */
+static inline void wait_on_unlock_page_ref(struct page *page)
{
- VM_BUG_ON(PageNoNewRefs(page));
- SetPageNoNewRefs(page);
+ might_sleep();
+ if (unlikely(PageNoNewRefs(page))) {
+ DEFINE_WAIT_BIT(wait, &page->flags, PG_nonewrefs);
+ __wait_on_bit(page_waitqueue(page), &wait, __sleep_on_page,
+ TASK_UNINTERRUPTIBLE);
+ }
+}
+
+static inline void lock_page_ref(struct page *page)
+{
+ while (test_and_set_bit(PG_nonewrefs, &page->flags))
+ wait_on_unlock_page_ref(page);
+ __acquire(bitlock);
smp_wmb();
}
-static inline void end_page_no_new_refs(struct page *page)
+static inline void unlock_page_ref(struct page *page)
{
VM_BUG_ON(!PageNoNewRefs(page));
- smp_wmb();
+ smp_mb__before_clear_bit();
ClearPageNoNewRefs(page);
smp_mb__after_clear_bit();
__wake_up_bit(page_waitqueue(page), &page->flags, PG_nonewrefs);
-}
-
-static inline void wait_on_new_refs(struct page *page)
-{
- might_sleep();
- if (unlikely(PageNoNewRefs(page))) {
- DEFINE_WAIT_BIT(wait, &page->flags, PG_nonewrefs);
- __wait_on_bit(page_waitqueue(page), &wait, __sleep_on_page,
- TASK_UNINTERRUPTIBLE);
- }
+ __release(bitlock);
}
#endif
+#define lock_page_ref_irq(page) \
+do { \
+ local_irq_disable_nort(); \
+ lock_page_ref(page); \
+} while (0)
+
+#define unlock_page_ref_irq(page) \
+do { \
+ unlock_page_ref(page); \
+ local_irq_enable_nort(); \
+} while (0)
+
+#define lock_page_ref_irqsave(page, flags) \
+do { \
+ local_irq_save_nort(flags); \
+ lock_page_ref(page); \
+} while (0)
+
+#define unlock_page_ref_irqrestore(page, flags) \
+do { \
+ unlock_page_ref(page); \
+ local_irq_restore_nort(flags); \
+} while (0)
+
/*
* speculatively take a reference to a page.
* If the page is free (_count == 0), then _count is untouched, and 0
@@ -199,7 +225,7 @@ static inline int page_cache_get_specula
* page refcount has been raised. See below comment.
*/
- wait_on_new_refs(page);
+ wait_on_unlock_page_ref(page);
/*
* smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-12-02 22:25:04.000000000 +0100
@@ -128,9 +128,11 @@ void remove_from_page_cache(struct page
BUG_ON(!PageLocked(page));
- spin_lock_irq(&mapping->tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&mapping->tree_lock);
__remove_from_page_cache(page);
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
}
static int sync_page(void *word)
@@ -440,8 +442,8 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
- set_page_no_new_refs(page);
- spin_lock_irq(&mapping->tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
page_cache_get(page);
@@ -451,8 +453,8 @@ int add_to_page_cache(struct page *page,
mapping_nrpages_inc(mapping);
__inc_zone_page_state(page, NR_FILE_PAGES);
}
- spin_unlock_irq(&mapping->tree_lock);
- end_page_no_new_refs(page);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6-rt/mm/migrate.c
===================================================================
--- linux-2.6-rt.orig/mm/migrate.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/migrate.c 2006-12-02 22:25:04.000000000 +0100
@@ -303,16 +303,16 @@ static int migrate_page_move_mapping(str
return 0;
}
- set_page_no_new_refs(page);
- spin_lock_irq(&mapping->tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&mapping->tree_lock);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
if (page_count(page) != 2 + !!PagePrivate(page) ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
- spin_unlock_irq(&mapping->tree_lock);
- end_page_no_new_refs(page);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
return -EAGAIN;
}
@@ -329,8 +329,8 @@ static int migrate_page_move_mapping(str
radix_tree_replace_slot(pslot, newpage);
page->mapping = NULL;
- spin_unlock_irq(&mapping->tree_lock);
- end_page_no_new_refs(page);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
/*
* Drop cache reference from old page.
Index: linux-2.6-rt/mm/swap_state.c
===================================================================
--- linux-2.6-rt.orig/mm/swap_state.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/swap_state.c 2006-12-02 22:25:04.000000000 +0100
@@ -78,8 +78,8 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
- set_page_no_new_refs(page);
- spin_lock_irq(&swapper_space.tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
@@ -90,8 +90,8 @@ static int __add_to_swap_cache(struct pa
mapping_nrpages_inc(&swapper_space);
__inc_zone_page_state(page, NR_FILE_PAGES);
}
- spin_unlock_irq(&swapper_space.tree_lock);
- end_page_no_new_refs(page);
+ spin_unlock(&swapper_space.tree_lock);
+ unlock_page_ref_irq(page);
radix_tree_preload_end();
}
return error;
@@ -202,9 +202,11 @@ void delete_from_swap_cache(struct page
entry.val = page_private(page);
- spin_lock_irq(&swapper_space.tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- spin_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock(&swapper_space.tree_lock);
+ unlock_page_ref_irq(page);
swap_free(entry);
page_cache_release(page);
Index: linux-2.6-rt/mm/vmscan.c
===================================================================
--- linux-2.6-rt.orig/mm/vmscan.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/vmscan.c 2006-12-02 22:25:04.000000000 +0100
@@ -390,8 +390,8 @@ int remove_mapping(struct address_space
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
- set_page_no_new_refs(page);
- spin_lock_irq(&mapping->tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&mapping->tree_lock);
/*
* The non racy check for a busy page.
*
@@ -426,22 +426,22 @@ int remove_mapping(struct address_space
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock(&mapping->tree_lock);
swap_free(swap);
goto free_it;
}
__remove_from_page_cache(page);
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock(&mapping->tree_lock);
free_it:
- end_page_no_new_refs(page);
+ unlock_page_ref_irq(page);
__put_page(page); /* The pagecache ref */
return 1;
cannot_free:
- spin_unlock_irq(&mapping->tree_lock);
- end_page_no_new_refs(page);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
return 0;
}
Index: linux-2.6-rt/fs/buffer.c
===================================================================
--- linux-2.6-rt.orig/fs/buffer.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/fs/buffer.c 2006-12-02 22:25:04.000000000 +0100
@@ -719,7 +719,8 @@ int __set_page_dirty_buffers(struct page
spin_unlock(&mapping->private_lock);
if (!TestSetPageDirty(page)) {
- spin_lock_irq(&mapping->tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping))
__inc_zone_page_state(page, NR_FILE_DIRTY);
@@ -727,7 +728,8 @@ int __set_page_dirty_buffers(struct page
page_index(page),
PAGECACHE_TAG_DIRTY);
}
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return 1;
}
Index: linux-2.6-rt/mm/page-writeback.c
===================================================================
--- linux-2.6-rt.orig/mm/page-writeback.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/page-writeback.c 2006-12-02 22:25:04.000000000 +0100
@@ -762,7 +762,8 @@ int __set_page_dirty_nobuffers(struct pa
struct address_space *mapping2;
if (mapping) {
- spin_lock_irq(&mapping->tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -772,7 +773,8 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host,
@@ -852,12 +854,14 @@ int test_clear_page_dirty(struct page *p
unsigned long flags;
if (mapping) {
- spin_lock_irqsave(&mapping->tree_lock, flags);
+ lock_page_ref_irqsave(page, flags);
+ spin_lock(&mapping->tree_lock);
if (TestClearPageDirty(page)) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irqrestore(page, flags);
/*
* We can continue to use `mapping' here because the
* page is locked, which pins the address_space
@@ -868,7 +872,8 @@ int test_clear_page_dirty(struct page *p
}
return 1;
}
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irqrestore(page, flags);
return 0;
}
return TestClearPageDirty(page);
@@ -915,13 +920,15 @@ int test_clear_page_writeback(struct pag
if (mapping) {
unsigned long flags;
- spin_lock_irqsave(&mapping->tree_lock, flags);
+ lock_page_ref_irqsave(page, flags);
+ spin_lock(&mapping->tree_lock);
ret = TestClearPageWriteback(page);
if (ret)
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irqrestore(page, flags);
} else {
ret = TestClearPageWriteback(page);
}
@@ -936,7 +943,8 @@ int test_set_page_writeback(struct page
if (mapping) {
unsigned long flags;
- spin_lock_irqsave(&mapping->tree_lock, flags);
+ lock_page_ref_irqsave(page, flags);
+ spin_lock(&mapping->tree_lock);
ret = TestSetPageWriteback(page);
if (!ret)
radix_tree_tag_set(&mapping->page_tree,
@@ -946,7 +954,8 @@ int test_set_page_writeback(struct page
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irqrestore(page, flags);
} else {
ret = TestSetPageWriteback(page);
}
Index: linux-2.6-rt/mm/swapfile.c
===================================================================
--- linux-2.6-rt.orig/mm/swapfile.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/swapfile.c 2006-12-02 22:25:04.000000000 +0100
@@ -367,13 +367,15 @@ int remove_exclusive_swap_page(struct pa
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
- spin_lock_irq(&swapper_space.tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- spin_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock(&swapper_space.tree_lock);
+ unlock_page_ref_irq(page);
}
spin_unlock(&swap_lock);
Index: linux-2.6-rt/mm/truncate.c
===================================================================
--- linux-2.6-rt.orig/mm/truncate.c 2006-12-02 22:23:42.000000000 +0100
+++ linux-2.6-rt/mm/truncate.c 2006-12-02 22:25:04.000000000 +0100
@@ -304,18 +304,21 @@ invalidate_complete_page2(struct address
if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;
- spin_lock_irq(&mapping->tree_lock);
+ lock_page_ref_irq(page);
+ spin_lock(&mapping->tree_lock);
if (PageDirty(page))
goto failed;
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock(&mapping->tree_lock);
+ unlock_page_ref_irq(page);
return 0;
}
* [PATCH 16/16] mm: concurrent pagecache write side
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (14 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 15/16] mm: lock_page_ref Peter Zijlstra
@ 2006-12-07 16:18 ` Peter Zijlstra
2006-12-11 19:03 ` [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Christoph Lameter
16 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-07 16:18 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Peter Zijlstra
[-- Attachment #1: mm-concurrent-pagecache.patch --]
[-- Type: text/plain, Size: 14732 bytes --]
Remove the tree_lock, change address_space::nrpages to atomic_ulong_t
because it is no longer protected, and use the concurrent radix tree API
to protect the modifying radix tree operations.
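The resulting write-side pattern, as used throughout this patch (sketch
only; a real site also handles preloading and the insertion error):

	DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);

	lock_page_ref_irq(page);	/* per-page serialization (patch 15) */
	radix_tree_lock(&ctx);		/* takes the concurrent tree's root lock */
	error = radix_tree_insert(&mapping->page_tree, offset, page);
	radix_tree_unlock(&ctx);
	/* ... update page/mapping state under the page ref lock ... */
	unlock_page_ref_irq(page);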
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/buffer.c | 6 ++++--
fs/inode.c | 1 -
include/linux/fs.h | 11 +++++------
mm/filemap.c | 11 +++++++----
mm/migrate.c | 9 +++++----
mm/page-writeback.c | 35 +++++++++++++++++++++++------------
mm/swap_state.c | 13 ++++++++-----
mm/swapfile.c | 2 --
mm/truncate.c | 3 ---
mm/vmscan.c | 4 ----
10 files changed, 52 insertions(+), 43 deletions(-)
Index: linux-2.6-rt/fs/buffer.c
===================================================================
--- linux-2.6-rt.orig/fs/buffer.c 2006-12-02 22:42:24.000000000 +0100
+++ linux-2.6-rt/fs/buffer.c 2006-12-02 22:42:25.000000000 +0100
@@ -720,15 +720,17 @@ int __set_page_dirty_buffers(struct page
if (!TestSetPageDirty(page)) {
lock_page_ref_irq(page);
- spin_lock(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
if (mapping_cap_account_dirty(mapping))
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ radix_tree_lock(&ctx);
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
+ radix_tree_unlock(&ctx);
}
- spin_unlock(&mapping->tree_lock);
unlock_page_ref_irq(page);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return 1;
Index: linux-2.6-rt/fs/inode.c
===================================================================
--- linux-2.6-rt.orig/fs/inode.c 2006-12-02 22:41:33.000000000 +0100
+++ linux-2.6-rt/fs/inode.c 2006-12-02 22:42:25.000000000 +0100
@@ -193,7 +193,6 @@ void inode_init_once(struct inode *inode
mutex_init(&inode->i_mutex);
init_rwsem(&inode->i_alloc_sem);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
- spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6-rt/include/linux/fs.h
===================================================================
--- linux-2.6-rt.orig/include/linux/fs.h 2006-12-02 22:41:23.000000000 +0100
+++ linux-2.6-rt/include/linux/fs.h 2006-12-02 22:42:25.000000000 +0100
@@ -430,13 +430,12 @@ struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
- spinlock_t tree_lock; /* and rwlock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
spinlock_t i_mmap_lock; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
- unsigned long __nrpages; /* number of total pages */
+ atomic_ulong_t __nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
@@ -453,22 +452,22 @@ struct address_space {
static inline void mapping_nrpages_init(struct address_space *mapping)
{
- mapping->__nrpages = 0;
+ mapping->__nrpages = (atomic_ulong_t)ATOMIC_ULONG_INIT(0);
}
static inline unsigned long mapping_nrpages(struct address_space *mapping)
{
- return mapping->__nrpages;
+ return atomic_ulong_read(&mapping->__nrpages);
}
static inline void mapping_nrpages_inc(struct address_space *mapping)
{
- mapping->__nrpages++;
+ atomic_ulong_inc(&mapping->__nrpages);
}
static inline void mapping_nrpages_dec(struct address_space *mapping)
{
- mapping->__nrpages--;
+ atomic_ulong_dec(&mapping->__nrpages);
}
struct block_device {
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c 2006-12-02 22:42:24.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c 2006-12-02 22:42:25.000000000 +0100
@@ -115,8 +115,11 @@ generic_file_direct_IO(int rw, struct ki
void __remove_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+ radix_tree_lock(&ctx);
radix_tree_delete(&mapping->page_tree, page->index);
+ radix_tree_unlock(&ctx);
page->mapping = NULL;
mapping_nrpages_dec(mapping);
__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -129,9 +132,7 @@ void remove_from_page_cache(struct page
BUG_ON(!PageLocked(page));
lock_page_ref_irq(page);
- spin_lock(&mapping->tree_lock);
__remove_from_page_cache(page);
- spin_unlock(&mapping->tree_lock);
unlock_page_ref_irq(page);
}
@@ -442,9 +443,12 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
lock_page_ref_irq(page);
- spin_lock(&mapping->tree_lock);
+ radix_tree_lock(&ctx);
error = radix_tree_insert(&mapping->page_tree, offset, page);
+ radix_tree_unlock(&ctx);
if (!error) {
page_cache_get(page);
SetPageLocked(page);
@@ -453,7 +457,6 @@ int add_to_page_cache(struct page *page,
mapping_nrpages_inc(mapping);
__inc_zone_page_state(page, NR_FILE_PAGES);
}
- spin_unlock(&mapping->tree_lock);
unlock_page_ref_irq(page);
radix_tree_preload_end();
}
Index: linux-2.6-rt/mm/migrate.c
===================================================================
--- linux-2.6-rt.orig/mm/migrate.c 2006-12-02 22:42:24.000000000 +0100
+++ linux-2.6-rt/mm/migrate.c 2006-12-02 22:42:25.000000000 +0100
@@ -295,6 +295,7 @@ static int migrate_page_move_mapping(str
struct page *newpage, struct page *page)
{
void **pslot;
+ struct radix_tree_context ctx;
if (!mapping) {
/* Anonymous page */
@@ -303,15 +304,15 @@ static int migrate_page_move_mapping(str
return 0;
}
+ init_radix_tree_context(&ctx, &mapping->page_tree);
lock_page_ref_irq(page);
- spin_lock(&mapping->tree_lock);
-
+ radix_tree_lock(&ctx);
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
if (page_count(page) != 2 + !!PagePrivate(page) ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
- spin_unlock(&mapping->tree_lock);
+ radix_tree_unlock(&ctx);
unlock_page_ref_irq(page);
return -EAGAIN;
}
@@ -329,7 +330,7 @@ static int migrate_page_move_mapping(str
radix_tree_replace_slot(pslot, newpage);
page->mapping = NULL;
- spin_unlock(&mapping->tree_lock);
+ radix_tree_unlock(&ctx);
unlock_page_ref_irq(page);
/*
Index: linux-2.6-rt/mm/page-writeback.c
===================================================================
--- linux-2.6-rt.orig/mm/page-writeback.c 2006-12-02 22:42:24.000000000 +0100
+++ linux-2.6-rt/mm/page-writeback.c 2006-12-02 22:42:25.000000000 +0100
@@ -763,17 +763,19 @@ int __set_page_dirty_nobuffers(struct pa
if (mapping) {
lock_page_ref_irq(page);
- spin_lock(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
BUG_ON(mapping2 != mapping);
if (mapping_cap_account_dirty(mapping))
__inc_zone_page_state(page,
NR_FILE_DIRTY);
+ radix_tree_lock(&ctx);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
+ radix_tree_unlock(&ctx);
}
- spin_unlock(&mapping->tree_lock);
unlock_page_ref_irq(page);
if (mapping->host) {
/* !PageAnon && !swapper_space */
@@ -855,12 +857,14 @@ int test_clear_page_dirty(struct page *p
if (mapping) {
lock_page_ref_irqsave(page, flags);
- spin_lock(&mapping->tree_lock);
if (TestClearPageDirty(page)) {
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
+ radix_tree_lock(&ctx);
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock(&mapping->tree_lock);
+ radix_tree_unlock(&ctx);
unlock_page_ref_irqrestore(page, flags);
/*
* We can continue to use `mapping' here because the
@@ -872,7 +876,6 @@ int test_clear_page_dirty(struct page *p
}
return 1;
}
- spin_unlock(&mapping->tree_lock);
unlock_page_ref_irqrestore(page, flags);
return 0;
}
@@ -921,13 +924,16 @@ int test_clear_page_writeback(struct pag
unsigned long flags;
lock_page_ref_irqsave(page, flags);
- spin_lock(&mapping->tree_lock);
ret = TestClearPageWriteback(page);
- if (ret)
+ if (ret) {
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
+ radix_tree_lock(&ctx);
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- spin_unlock(&mapping->tree_lock);
+ radix_tree_unlock(&ctx);
+ }
unlock_page_ref_irqrestore(page, flags);
} else {
ret = TestClearPageWriteback(page);
@@ -942,19 +948,24 @@ int test_set_page_writeback(struct page
if (mapping) {
unsigned long flags;
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
lock_page_ref_irqsave(page, flags);
- spin_lock(&mapping->tree_lock);
ret = TestSetPageWriteback(page);
- if (!ret)
+ if (!ret) {
+ radix_tree_lock(&ctx);
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (!PageDirty(page))
+ radix_tree_unlock(&ctx);
+ }
+ if (!PageDirty(page)) {
+ radix_tree_lock(&ctx);
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock(&mapping->tree_lock);
+ radix_tree_unlock(&ctx);
+ }
unlock_page_ref_irqrestore(page, flags);
} else {
ret = TestSetPageWriteback(page);
Index: linux-2.6-rt/mm/swap_state.c
===================================================================
--- linux-2.6-rt.orig/mm/swap_state.c 2006-12-02 22:42:24.000000000 +0100
+++ linux-2.6-rt/mm/swap_state.c 2006-12-02 22:43:25.000000000 +0100
@@ -38,7 +38,6 @@ static struct backing_dev_info swap_back
struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
@@ -78,10 +77,13 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &swapper_space.page_tree);
+
lock_page_ref_irq(page);
- spin_lock(&swapper_space.tree_lock);
+ radix_tree_lock(&ctx);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
+ radix_tree_unlock(&ctx);
if (!error) {
page_cache_get(page);
SetPageLocked(page);
@@ -90,7 +92,6 @@ static int __add_to_swap_cache(struct pa
mapping_nrpages_inc(&swapper_space);
__inc_zone_page_state(page, NR_FILE_PAGES);
}
- spin_unlock(&swapper_space.tree_lock);
unlock_page_ref_irq(page);
radix_tree_preload_end();
}
@@ -125,12 +126,16 @@ static int add_to_swap_cache(struct page
*/
void __delete_from_swap_cache(struct page *page)
{
+ DECLARE_RADIX_TREE_CONTEXT(ctx, &swapper_space.page_tree);
+
BUG_ON(!PageLocked(page));
BUG_ON(!PageSwapCache(page));
BUG_ON(PageWriteback(page));
BUG_ON(PagePrivate(page));
+ radix_tree_lock(&ctx);
radix_tree_delete(&swapper_space.page_tree, page_private(page));
+ radix_tree_unlock(&ctx);
set_page_private(page, 0);
ClearPageSwapCache(page);
mapping_nrpages_dec(&swapper_space);
@@ -203,9 +208,7 @@ void delete_from_swap_cache(struct page
entry.val = page_private(page);
lock_page_ref_irq(page);
- spin_lock(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- spin_unlock(&swapper_space.tree_lock);
unlock_page_ref_irq(page);
swap_free(entry);
Index: linux-2.6-rt/mm/swapfile.c
===================================================================
--- linux-2.6-rt.orig/mm/swapfile.c 2006-12-02 22:42:24.000000000 +0100
+++ linux-2.6-rt/mm/swapfile.c 2006-12-02 22:42:25.000000000 +0100
@@ -368,13 +368,11 @@ int remove_exclusive_swap_page(struct pa
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
lock_page_ref_irq(page);
- spin_lock(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- spin_unlock(&swapper_space.tree_lock);
unlock_page_ref_irq(page);
}
spin_unlock(&swap_lock);
Index: linux-2.6-rt/mm/truncate.c
===================================================================
--- linux-2.6-rt.orig/mm/truncate.c 2006-12-02 22:42:24.000000000 +0100
+++ linux-2.6-rt/mm/truncate.c 2006-12-02 22:42:25.000000000 +0100
@@ -305,19 +305,16 @@ invalidate_complete_page2(struct address
return 0;
lock_page_ref_irq(page);
- spin_lock(&mapping->tree_lock);
if (PageDirty(page))
goto failed;
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- spin_unlock(&mapping->tree_lock);
unlock_page_ref_irq(page);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:
- spin_unlock(&mapping->tree_lock);
unlock_page_ref_irq(page);
return 0;
}
Index: linux-2.6-rt/mm/vmscan.c
===================================================================
--- linux-2.6-rt.orig/mm/vmscan.c 2006-12-02 22:42:24.000000000 +0100
+++ linux-2.6-rt/mm/vmscan.c 2006-12-02 22:42:25.000000000 +0100
@@ -391,7 +391,6 @@ int remove_mapping(struct address_space
BUG_ON(mapping != page_mapping(page));
lock_page_ref_irq(page);
- spin_lock(&mapping->tree_lock);
/*
* The non racy check for a busy page.
*
@@ -426,13 +425,11 @@ int remove_mapping(struct address_space
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
- spin_unlock(&mapping->tree_lock);
swap_free(swap);
goto free_it;
}
__remove_from_page_cache(page);
- spin_unlock(&mapping->tree_lock);
free_it:
unlock_page_ref_irq(page);
@@ -440,7 +437,6 @@ free_it:
return 1;
cannot_free:
- spin_unlock(&mapping->tree_lock);
unlock_page_ref_irq(page);
return 0;
}
* Re: [PATCH 00/16] concurrent pagecache (against 2.6.19-rt)
2006-12-07 16:18 [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Peter Zijlstra
` (15 preceding siblings ...)
2006-12-07 16:18 ` [PATCH 16/16] mm: concurrent pagecache write side Peter Zijlstra
@ 2006-12-11 19:03 ` Christoph Lameter
2006-12-11 19:24 ` Peter Zijlstra
16 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-12-11 19:03 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-mm, Nick Piggin
On Thu, 7 Dec 2006, Peter Zijlstra wrote:
> Based on Nick's lockless (read-side) pagecache patches (included in the series)
> here an attempt to make the write side concurrent.
At first glance it looks quite interesting and very innovative. Removing
the tree_lock completely also reduces cache line usage. The page struct
cacheline is already referenced in most contexts.
> Comment away ;-)
Could you post Nick's patches from your email address and add a From Nick
line in them? It's a bit confusing to have a patchset with different
originating email addresses. Or does this come about by the evil header
mangling of the list processor? Maybe you need to use >From ??
* Re: [PATCH 00/16] concurrent pagecache (against 2.6.19-rt)
2006-12-11 19:03 ` [PATCH 00/16] concurrent pagecache (against 2.6.19-rt) Christoph Lameter
@ 2006-12-11 19:24 ` Peter Zijlstra
0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2006-12-11 19:24 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, Nick Piggin
On Mon, 2006-12-11 at 11:03 -0800, Christoph Lameter wrote:
> On Thu, 7 Dec 2006, Peter Zijlstra wrote:
>
> > Based on Nick's lockless (read-side) pagecache patches (included in the series)
> > here an attempt to make the write side concurrent.
>
> On first glance it looks quite interesting and very innovative. Removing
> the tree_lock completely also reduces cache line usage. The page struct
> cacheline is already references in most contexts.
Thanks, I'm just curious how bouncy the fine-grained radix tree locks
will be.
> > Comment away ;-)
>
> Could you post Nick's patches from your email addres and add a From Nick
> line in them? Its a bit confusing to have a patchset with different
> originating email addresses. Or does this come about by the evil header
> mangling of the list processor? Maybe you need to use >From ??
Nah, that was on purpose; you can grab the patches from here:
http://programming.kicks-ass.net/kernel-patches/concurrent-pagecache-rt/
if you care, or I can post them again.