* [patch 1/2] mm: speculative get_page
@ 2006-07-26 6:39 Nick Piggin
2006-07-31 15:35 ` Andy Whitcroft
2006-08-07 10:11 ` Hugh Dickins
0 siblings, 2 replies; 10+ messages in thread
From: Nick Piggin @ 2006-07-26 6:39 UTC (permalink / raw)
To: Andrew Morton, Linux Memory Management List
If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.
This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).
Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.
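The operation in question is essentially a thin wrapper around
atomic_inc_not_zero. A minimal sketch, assuming the 2.6-era
struct page with an atomic _count field:

	static inline int get_page_unless_zero(struct page *page)
	{
		/*
		 * Take a reference only if at least one is already
		 * held; returns 0 if the page was free (_count == 0).
		 */
		return atomic_inc_not_zero(&page->_count);
	}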
This patch introduces the core locking protocol to the pagecache
(ie. adds page_cache_get_speculative, and tweaks some update-side
code to make it work).
Signed-off-by: Nick Piggin <npiggin@suse.de>
include/linux/page-flags.h | 7 +++
include/linux/pagemap.h | 103 +++++++++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 4 +
mm/migrate.c | 11 ++++
mm/swap_state.c | 4 +
mm/vmscan.c | 12 +++--
6 files changed, 137 insertions(+), 4 deletions(-)
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -86,6 +86,8 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define PG_nonewrefs 20 /* Block concurrent pagecache lookups
+ * while testing refcount */
#if (BITS_PER_LONG > 32)
/*
@@ -247,6 +249,11 @@
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageNoNewRefs(page) test_bit(PG_nonewrefs, &(page)->flags)
+#define SetPageNoNewRefs(page) set_bit(PG_nonewrefs, &(page)->flags)
+#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
+#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -11,6 +11,8 @@
#include <linux/compiler.h>
#include <asm/uaccess.h>
#include <linux/gfp.h>
+#include <linux/page-flags.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +53,107 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and NULL
+ * is returned. Otherwise, _count is incremented by 1 and page is returned.
+ *
+ * This function must be run in the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree: this allows
+ * allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with (because put_page is what is used to drop an
+ * invalid speculative reference).
+ *
+ * After incrementing the refcount, this function spins until PageNoNewRefs
+ * is clear, then a read memory barrier is issued.
+ *
+ * This forms the core of the lockless pagecache locking protocol, where
+ * the lookup-side (eg. find_get_page) has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. wait for PageNoNewRefs
+ * 4. check the page is still in pagecache
+ *
+ * Remove-side (that cares about _count, eg. reclaim) has the following:
+ * A. SetPageNoNewRefs
+ * B. check refcount is correct
+ * C. remove page
+ * D. ClearPageNoNewRefs
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before B: in this case, B sees elevated refcount and bails out
+ * - B runs before 2: in this case, 3 ensures 4 will not run until *after* C
+ * (after D, even). In which case, 4 will notice C and lookup side can retry
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * the write-side, depending on timing.
+ *
+ * Pagecache insertion isn't a big problem: either 1 will find the page or
+ * it will not. Likewise, the old find_get_page could run either before the
+ * insertion or afterwards, depending on timing.
+ */
+static inline struct page *page_cache_get_speculative(struct page *page)
+{
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+ VM_BUG_ON(!in_atomic());
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
+#else
+ if (unlikely(!get_page_unless_zero(page)))
+ return NULL; /* page has been freed */
+
+ /*
+ * Note that get_page_unless_zero provides a memory barrier.
+ * This is needed to ensure PageNoNewRefs is evaluated after the
+ * page refcount has been raised. See below comment.
+ */
+
+ while (unlikely(PageNoNewRefs(page)))
+ cpu_relax();
+
+ /*
+ * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
+ * is performed before a future load used to ensure the page is
+ * the correct on (usually: page->mapping and page->index).
+ *
+ * Those places that set PageNoNewRefs have the following pattern:
+ * SetPageNoNewRefs(page)
+ * wmb();
+ * if (page_count(page) == X)
+ * remove page from pagecache
+ * wmb();
+ * ClearPageNoNewRefs(page)
+ *
+ * If the load was out of order, page->mapping might be loaded before
+ * the page is removed from pagecache but PageNoNewRefs evaluated
+ * after the ClearPageNoNewRefs().
+ */
+ smp_rmb();
+
+#endif
+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+ return page;
+}
+
#ifdef CONFIG_NUMA
extern struct page *page_cache_alloc(struct address_space *x);
extern struct page *page_cache_alloc_cold(struct address_space *x);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -380,6 +380,8 @@ int remove_mapping(struct address_space
if (!mapping)
return 0; /* truncate got there first */
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
/*
@@ -398,17 +400,21 @@ int remove_mapping(struct address_space
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
- __put_page(page); /* The pagecache ref */
- return 1;
+ goto free_it;
}
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
+
+free_it:
+ smp_wmb();
+ __ClearPageNoNewRefs(page);
+ __put_page(page); /* The pagecache ref */
return 1;
cannot_free:
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
return 0;
}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -440,6 +440,8 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
@@ -451,6 +453,8 @@ int add_to_page_cache(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
+ smp_wmb();
+ ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -78,6 +78,8 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
@@ -90,6 +92,8 @@ static int __add_to_swap_cache(struct pa
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&swapper_space.tree_lock);
+ smp_wmb();
+ ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -303,6 +303,8 @@ static int migrate_page_move_mapping(str
return 0;
}
+ SetPageNoNewRefs(page);
+ smp_wmb();
write_lock_irq(&mapping->tree_lock);
radix_pointer = (struct page **)radix_tree_lookup_slot(
@@ -312,6 +314,7 @@ static int migrate_page_move_mapping(str
if (page_count(page) != 2 + !!PagePrivate(page) ||
radix_tree_deref_slot(radix_pointer) != page) {
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
return -EAGAIN;
}
@@ -326,9 +329,15 @@ static int migrate_page_move_mapping(str
}
#endif
+ SetPageNoNewRefs(newpage);
radix_tree_replace_slot(radix_pointer, newpage);
+ page->mapping = NULL;
+
+ write_unlock_irq(&mapping->tree_lock);
__put_page(page);
- write_unlock_irq(&mapping->tree_lock);
+ smp_wmb();
+ ClearPageNoNewRefs(page);
+ ClearPageNoNewRefs(newpage);
return 0;
}
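For reference, the lookup side described in the pagemap.h comment
would follow roughly this pattern (a sketch only: the actual
find_get_page conversion, and the RCU-safe radix-tree lookup it
depends on, are assumed to come with the follow-up patch):

	struct page *find_get_page(struct address_space *mapping,
				   unsigned long offset)
	{
		struct page *page;

		rcu_read_lock();
	repeat:
		/* 1. find page in radix tree */
		page = radix_tree_lookup(&mapping->page_tree, offset);
		if (page) {
			/* 2+3. conditionally take ref, wait for PG_nonewrefs */
			page = page_cache_get_speculative(page);
			if (!page)
				goto repeat;	/* freed under us */
			/* 4. check the page is still in pagecache */
			if (unlikely(page->mapping != mapping ||
				     page->index != offset)) {
				page_cache_release(page);
				goto repeat;
			}
		}
		rcu_read_unlock();
		return page;
	}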
* Re: [patch 1/2] mm: speculative get_page
2006-07-26 6:39 [patch 1/2] mm: speculative get_page Nick Piggin
@ 2006-07-31 15:35 ` Andy Whitcroft
2006-08-01 8:45 ` Nick Piggin
2006-08-07 10:11 ` Hugh Dickins
1 sibling, 1 reply; 10+ messages in thread
From: Andy Whitcroft @ 2006-07-31 15:35 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management List
Nick Piggin wrote:
> If we can be sure that elevating the page_count on a pagecache
> page will pin it, we can speculatively run this operation, and
> subsequently check to see if we hit the right page rather than
> relying on holding a lock or otherwise pinning a reference to the
> page.
>
> This can be done if get_page/put_page behaves consistently
> throughout the whole tree (ie. if we "get" the page after it has
> been used for something else, we must be able to free it with a
> put_page).
>
> Actually, there is a period where the count behaves differently:
> when the page is free or if it is a constituent page of a compound
> page. We need an atomic_inc_not_zero operation to ensure we don't
> try to grab the page in either case.
>
> This patch introduces the core locking protocol to the pagecache
> (ie. adds page_cache_get_speculative, and tweaks some update-side
> code to make it work).
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Ok, this one is a bit scary but here goes.
First question is about performance. I seem to remember from your OLS
paper that there were good scaling improvements with this. Was there any
benefit to simple cases (one process on SMP)? There seems to be a good
deal less locking in here, well without preempt etc anyhow.
> include/linux/page-flags.h | 7 +++
> include/linux/pagemap.h | 103 +++++++++++++++++++++++++++++++++++++++++++++
> mm/filemap.c | 4 +
> mm/migrate.c | 11 ++++
> mm/swap_state.c | 4 +
> mm/vmscan.c | 12 +++--
> 6 files changed, 137 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -86,6 +86,8 @@
> #define PG_nosave_free 18 /* Free, should not be written */
> #define PG_buddy 19 /* Page is free, on buddy lists */
>
> +#define PG_nonewrefs 20 /* Block concurrent pagecache lookups
> + * while testing refcount */
As always ... page flags :(. It seems pretty key to the stabilisation
of _count; however, are we really relying on that? (See next comment ...)
>
> #if (BITS_PER_LONG > 32)
> /*
> @@ -247,6 +249,11 @@
> #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
> #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
>
> +#define PageNoNewRefs(page) test_bit(PG_nonewrefs, &(page)->flags)
> +#define SetPageNoNewRefs(page) set_bit(PG_nonewrefs, &(page)->flags)
> +#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
> +#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
> +
> struct page; /* forward declaration */
>
> int test_clear_page_dirty(struct page *page);
> Index: linux-2.6/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pagemap.h
> +++ linux-2.6/include/linux/pagemap.h
> @@ -11,6 +11,8 @@
> #include <linux/compiler.h>
> #include <asm/uaccess.h>
> #include <linux/gfp.h>
> +#include <linux/page-flags.h>
> +#include <linux/hardirq.h> /* for in_interrupt() */
>
> /*
> * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
> @@ -51,6 +53,107 @@ static inline void mapping_set_gfp_mask(
> #define page_cache_release(page) put_page(page)
> void release_pages(struct page **pages, int nr, int cold);
>
> +/*
> + * speculatively take a reference to a page.
> + * If the page is free (_count == 0), then _count is untouched, and NULL
> + * is returned. Otherwise, _count is incremented by 1 and page is returned.
> + *
> + * This function must be run in the same rcu_read_lock() section as has
> + * been used to lookup the page in the pagecache radix-tree: this allows
> + * allocators to use a synchronize_rcu() to stabilize _count.
Ok, so that makes sense from the algorithm as we take an additional
reference somewhere within the 'rcu read lock'. To get a stable count
we have to ensure that no-one is in the read side. However, the
commentary says we can use synchronize_rcu to get a stable count. Is
that correct? All that synchronize_rcu() guarantees is that all
concurrent readers at the start of the call will have finished when it
returns; there is no guarantee that there will be no new readers since
the start of the call, nor in parallel with its completion. Setting
PageNoNewRefs will not prevent a new reader upping the reference count
either, as they wait after they have bumped it. So do we really have a
way to stabilise _count here? I am likely missing something, educate me :).
Now I cannot see any users of this effect in either of the patches in
this set so perhaps we do not care?
> + *
> + * Unless an RCU grace period has passed, the count of all pages coming out
> + * of the allocator must be considered unstable. page_count may return higher
> + * than expected, and put_page must be able to do the right thing when the
> + * page has been finished with (because put_page is what is used to drop an
> + * invalid speculative reference).
> + *
> + * After incrementing the refcount, this function spins until PageNoNewRefs
> + * is clear, then a read memory barrier is issued.
> + *
> + * This forms the core of the lockless pagecache locking protocol, where
> + * the lookup-side (eg. find_get_page) has the following pattern:
> + * 1. find page in radix tree
> + * 2. conditionally increment refcount
> + * 3. wait for PageNoNewRefs
> + * 4. check the page is still in pagecache
> + *
> + * Remove-side (that cares about _count, eg. reclaim) has the following:
> + * A. SetPageNoNewRefs
> + * B. check refcount is correct
> + * C. remove page
> + * D. ClearPageNoNewRefs
> + *
> + * There are 2 critical interleavings that matter:
> + * - 2 runs before B: in this case, B sees elevated refcount and bails out
> + * - B runs before 2: in this case, 3 ensures 4 will not run until *after* C
> + * (after D, even). In which case, 4 will notice C and lookup side can retry
> + *
> + * It is possible that between 1 and 2, the page is removed then the exact same
> + * page is inserted into the same position in pagecache. That's OK: the
> + * old find_get_page using tree_lock could equally have run before or after
> + * the write-side, depending on timing.
> + *
> + * Pagecache insertion isn't a big problem: either 1 will find the page or
> + * it will not. Likewise, the old find_get_page could run either before the
> + * insertion or afterwards, depending on timing.
> + */
> +static inline struct page *page_cache_get_speculative(struct page *page)
> +{
> + VM_BUG_ON(in_interrupt());
> +
> +#ifndef CONFIG_SMP
> + VM_BUG_ON(!in_atomic());
> + /*
> + * Preempt must be disabled here - we rely on rcu_read_lock doing
> + * this for us.
> + *
> + * Pagecache won't be truncated from interrupt context, so if we have
> + * found a page in the radix tree here, we have pinned its refcount by
> + * disabling preempt, and hence no need for the "speculative get" that
> + * SMP requires.
> + */
> + VM_BUG_ON(page_count(page) == 0);
> + atomic_inc(&page->_count);
> +
> +#else
> + if (unlikely(!get_page_unless_zero(page)))
> + return NULL; /* page has been freed */
> +
> + /*
> + * Note that get_page_unless_zero provides a memory barrier.
> + * This is needed to ensure PageNoNewRefs is evaluated after the
> + * page refcount has been raised. See below comment.
> + */
> +
> + while (unlikely(PageNoNewRefs(page)))
> + cpu_relax();
> +
> + /*
> + * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
> + * is performed before a future load used to ensure the page is
> + * the correct on (usually: page->mapping and page->index).
"the correct on[e]"
Ok, this is a little confusing, mostly I think because you don't provide
a corresponding read side example. Or it should read: "smp_rmb is
required to ensure the load ...., provided within get_page_unless_zero()."
Also, I do wonder if there should be some way to indicate that we need a
barrier, and that we're stealing the one before or after which we get
for free.
	if (unlikely(!get_page_unless_zero(page)))
		return NULL; /* page has been freed */
	/* smp_rmb() */

	SetPageNoNewRefs(page);
	...
	SetPageNoNewRefs(page);
	/* smp_wmb() */
> + *
> + * Those places that set PageNoNewRefs have the following pattern:
> + * SetPageNoNewRefs(page)
> + * wmb();
> + * if (page_count(page) == X)
> + * remove page from pagecache
> + * wmb();
> + * ClearPageNoNewRefs(page)
> + *
> + * If the load was out of order, page->mapping might be loaded before
> + * the page is removed from pagecache but PageNoNewRefs evaluated
> + * after the ClearPageNoNewRefs().
> + */
> + smp_rmb();
> +
> +#endif
> + VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
> +
> + return page;
> +}
> +
> #ifdef CONFIG_NUMA
> extern struct page *page_cache_alloc(struct address_space *x);
> extern struct page *page_cache_alloc_cold(struct address_space *x);
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -380,6 +380,8 @@ int remove_mapping(struct address_space
> if (!mapping)
> return 0; /* truncate got there first */
>
> + SetPageNoNewRefs(page);
> + smp_wmb();
> write_lock_irq(&mapping->tree_lock);
Ok. Do we need the smp_wmb() here? Would not the write_lock_irq()
provide a full barrier already?
> /*
> @@ -398,17 +400,21 @@ int remove_mapping(struct address_space
> __delete_from_swap_cache(page);
> write_unlock_irq(&mapping->tree_lock);
> swap_free(swap);
> - __put_page(page); /* The pagecache ref */
> - return 1;
> + goto free_it;
> }
>
> __remove_from_page_cache(page);
> write_unlock_irq(&mapping->tree_lock);
> - __put_page(page);
> +
> +free_it:
> + smp_wmb();
> + __ClearPageNoNewRefs(page);
> + __put_page(page); /* The pagecache ref */
> return 1;
>
> cannot_free:
> write_unlock_irq(&mapping->tree_lock);
> + ClearPageNoNewRefs(page);
> return 0;
> }
>
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -440,6 +440,8 @@ int add_to_page_cache(struct page *page,
> int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
>
> if (error == 0) {
> + SetPageNoNewRefs(page);
> + smp_wmb();
> write_lock_irq(&mapping->tree_lock);
Again, do we not have an implicit barrier in write_lock_irq()?
> error = radix_tree_insert(&mapping->page_tree, offset, page);
> if (!error) {
> @@ -451,6 +453,8 @@ int add_to_page_cache(struct page *page,
> __inc_zone_page_state(page, NR_FILE_PAGES);
> }
> write_unlock_irq(&mapping->tree_lock);
> + smp_wmb();
> + ClearPageNoNewRefs(page);
Again, do we not have an implicit barrier in the unlock?
> radix_tree_preload_end();
> }
> return error;
> Index: linux-2.6/mm/swap_state.c
> ===================================================================
> --- linux-2.6.orig/mm/swap_state.c
> +++ linux-2.6/mm/swap_state.c
> @@ -78,6 +78,8 @@ static int __add_to_swap_cache(struct pa
> BUG_ON(PagePrivate(page));
> error = radix_tree_preload(gfp_mask);
> if (!error) {
> + SetPageNoNewRefs(page);
> + smp_wmb();
> write_lock_irq(&swapper_space.tree_lock);
> error = radix_tree_insert(&swapper_space.page_tree,
> entry.val, page);
> @@ -90,6 +92,8 @@ static int __add_to_swap_cache(struct pa
> __inc_zone_page_state(page, NR_FILE_PAGES);
> }
> write_unlock_irq(&swapper_space.tree_lock);
> + smp_wmb();
> + ClearPageNoNewRefs(page);
> radix_tree_preload_end();
> }
> return error;
> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c
> +++ linux-2.6/mm/migrate.c
> @@ -303,6 +303,8 @@ static int migrate_page_move_mapping(str
> return 0;
> }
>
> + SetPageNoNewRefs(page);
> + smp_wmb();
> write_lock_irq(&mapping->tree_lock);
>
> radix_pointer = (struct page **)radix_tree_lookup_slot(
> @@ -312,6 +314,7 @@ static int migrate_page_move_mapping(str
> if (page_count(page) != 2 + !!PagePrivate(page) ||
> radix_tree_deref_slot(radix_pointer) != page) {
> write_unlock_irq(&mapping->tree_lock);
> + ClearPageNoNewRefs(page);
> return -EAGAIN;
> }
>
> @@ -326,9 +329,15 @@ static int migrate_page_move_mapping(str
> }
> #endif
>
> + SetPageNoNewRefs(newpage);
> radix_tree_replace_slot(radix_pointer, newpage);
> + page->mapping = NULL;
> +
> + write_unlock_irq(&mapping->tree_lock);
> __put_page(page);
> - write_unlock_irq(&mapping->tree_lock);
> + smp_wmb();
> + ClearPageNoNewRefs(page);
> + ClearPageNoNewRefs(newpage);
>
> return 0;
> }
>
-apw
* Re: [patch 1/2] mm: speculative get_page
2006-07-31 15:35 ` Andy Whitcroft
@ 2006-08-01 8:45 ` Nick Piggin
0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2006-08-01 8:45 UTC (permalink / raw)
To: Andy Whitcroft; +Cc: Andrew Morton, Linux Memory Management List
On Mon, Jul 31, 2006 at 04:35:38PM +0100, Andy Whitcroft wrote:
> Nick Piggin wrote:
> >
> >This patch introduces the core locking protocol to the pagecache
> >(ie. adds page_cache_get_speculative, and tweaks some update-side
> >code to make it work).
> >
> >Signed-off-by: Nick Piggin <npiggin@suse.de>
>
> Ok, this one is a bit scary but here goes.
Thanks for reviewing!
Will send out an incremental patch...
>
> First question is about performance. I seem to remember from your OLS
> paper that there were good scaling improvements with this. Was there any
> benefit to simple cases (one process on SMP)? There seems to be a good
> deal less locking in here, well without preempt etc anyhow.
The single thread find_get_page numbers were improved, yes. Highlights:
on a UP compiled kernel a P4 was about 3x faster, and on an SMP kernel
a G5 was about 2x faster, working on a cache hot struct page. Cache cold
numbers were improved too.
Of course, this is a very tiny function anyway, performed within a
larger context... but at least we don't regress here.
Gang lookups I still haven't instrumented fully - the difference would
be much less there, I expect.
>
> > include/linux/page-flags.h | 7 +++
> > include/linux/pagemap.h | 103
> > +++++++++++++++++++++++++++++++++++++++++++++
> > mm/filemap.c | 4 +
> > mm/migrate.c | 11 ++++
> > mm/swap_state.c | 4 +
> > mm/vmscan.c | 12 +++--
> > 6 files changed, 137 insertions(+), 4 deletions(-)
> >
> >Index: linux-2.6/include/linux/page-flags.h
> >===================================================================
> >--- linux-2.6.orig/include/linux/page-flags.h
> >+++ linux-2.6/include/linux/page-flags.h
> >@@ -86,6 +86,8 @@
> > #define PG_nosave_free 18 /* Free, should not be
> > written */
> > #define PG_buddy 19 /* Page is free, on buddy lists */
> >
> >+#define PG_nonewrefs 20 /* Block concurrent pagecache lookups
> >+ * while testing refcount */
>
> As always ... page flags :(. It seems pretty key to the stabilisation
> of _count; however, are we really relying on that? (See next comment ...)
Yeah it is a page flag. I think we do need it. Am I allowed to trade
PG_reserved for it? ;)
> >+/*
> >+ * speculatively take a reference to a page.
> >+ * If the page is free (_count == 0), then _count is untouched, and NULL
> >+ * is returned. Otherwise, _count is incremented by 1 and page is
> >returned.
> >+ *
> >+ * This function must be run in the same rcu_read_lock() section as has
> >+ * been used to lookup the page in the pagecache radix-tree: this allows
> >+ * allocators to use a synchronize_rcu() to stabilize _count.
>
> Ok, so that makes sense from the algorithm as we take an additional
> reference somewhere within the 'rcu read lock'. To get a stable count
> we have to ensure that no-one is in the read side. However, the
> commentary says we can use synchronize_rcu to get a stable count. Is
> that correct? All that synchronize_rcu() guarantees is that all
> concurrent readers at the start of the call will have finished when it
> returns; there is no guarantee that there will be no new readers since
> the start of the call, nor in parallel with its completion. Setting
There will be no new readers, because if you have newly allocated this
page, lookups can no longer find it in pagecache after a synchronize_rcu.
The important word is "allocators" (ie. not pagecache) -- but I don't
think I have made that clear: will fix.
> PageNoNewRefs will not prevent a new reader upping the reference count
> either, as they wait after they have bumped it. So do we really have a
> way to stabilise _count here? I am likely missing something, educate me :).
We can't stabilise _count for pagecache pages. What we can do is prevent
any new *references* from being handed out via the pagecache (although
they may indeed increment _count, we don't give them the pointer).
>
> Now I cannot see any users of this effect in either of the patches in
> this set so perhaps we do not care?
synchronize_rcu(), no. I imagine it will become needed for memory hot
unplug if we're freeing up mem_map[]s. Other users might just find it
convenient, but so far I think I converted all users to something else
which tended to be cleaner anyway.
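Roughly what such an allocator-side user would look like (hypothetical,
since as noted there is no in-tree user yet):

	/*
	 * The page has been freed and can no longer be found via any
	 * pagecache radix tree.
	 */
	synchronize_rcu();
	/*
	 * Any lookup that could still have seen the stale radix-tree
	 * entry has now finished: it either failed get_page_unless_zero(),
	 * or took a reference and dropped it again with put_page() when
	 * its re-check of page->mapping failed. From here on, _count
	 * can no longer be raised speculatively and is stable.
	 */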
> >+ if (unlikely(!get_page_unless_zero(page)))
> >+ return NULL; /* page has been freed */
> >+
> >+ /*
> >+ * Note that get_page_unless_zero provides a memory barrier.
> >+ * This is needed to ensure PageNoNewRefs is evaluated after the
> >+ * page refcount has been raised. See below comment.
> >+ */
> >+
> >+ while (unlikely(PageNoNewRefs(page)))
> >+ cpu_relax();
> >+
> >+ /*
> >+ * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
> >+ * is performed before a future load used to ensure the page is
> >+ * the correct on (usually: page->mapping and page->index).
>
> "the correct on[e]"
Yep.
>
> Ok, this is a little confusing mostly I think because you don't provide
> a corresponding read side example. Or it should read. "smp_rmb is
> required to ensure the load ...., provided within get_page_unless_zero()."
This is the read-side example (only the page->mapping test is done by callers).
>
> Also, I do wonder if there should be some way to indicate that we need a
> barrier, and that we're stealing the one before or after which we get
> for free.
>
> 	if (unlikely(!get_page_unless_zero(page)))
> 		return NULL; /* page has been freed */
> 	/* smp_rmb() */
But you really need the commenting to show which accesses you are
interested in ordering, and who else cares.
> >Index: linux-2.6/mm/vmscan.c
> >===================================================================
> >--- linux-2.6.orig/mm/vmscan.c
> >+++ linux-2.6/mm/vmscan.c
> >@@ -380,6 +380,8 @@ int remove_mapping(struct address_space
> > if (!mapping)
> > return 0; /* truncate got there first */
> >
> >+ SetPageNoNewRefs(page);
> >+ smp_wmb();
> > write_lock_irq(&mapping->tree_lock);
>
> Ok. Do we need the smp_wmb() here? Would not the write_lock_irq()
> provide a full barrier already?
No, only an acquire barrier (in this case, the store to page->flags
may leak as far as the write_unlock_irq at the end of the crit section).
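To make that concrete, here is roughly the reordering the smp_wmb()
guards against (a sketch; A/B/C refer to the remove-side steps in the
pagemap.h comment):

	write_lock_irq(&mapping->tree_lock);	/* acquire only */
	if (page_count(page) == X)		/* B */
		remove page from pagecache;	/* C */
	SetPageNoNewRefs(page);			/* A, deferred into the section */

With A deferred past B, a concurrent page_cache_get_speculative() can
raise the refcount after B has sampled it, yet still observe
PG_nonewrefs clear, so neither side notices the other.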
> >Index: linux-2.6/mm/filemap.c
> >===================================================================
> >--- linux-2.6.orig/mm/filemap.c
> >+++ linux-2.6/mm/filemap.c
> >@@ -440,6 +440,8 @@ int add_to_page_cache(struct page *page,
> > int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> >
> > if (error == 0) {
> >+ SetPageNoNewRefs(page);
> >+ smp_wmb();
> > write_lock_irq(&mapping->tree_lock);
>
> Again, do we not have an implicit barrier in write_lock_irq()?
ditto
>
> > error = radix_tree_insert(&mapping->page_tree, offset, page);
> > if (!error) {
> >@@ -451,6 +453,8 @@ int add_to_page_cache(struct page *page,
> > __inc_zone_page_state(page, NR_FILE_PAGES);
> > }
> > write_unlock_irq(&mapping->tree_lock);
> >+ smp_wmb();
> >+ ClearPageNoNewRefs(page);
>
> Again, do we not have an implicit barrier in the unlock?
Only release: the store can go as far up as the write_lock_irq.
* Re: [patch 1/2] mm: speculative get_page
2006-07-26 6:39 [patch 1/2] mm: speculative get_page Nick Piggin
2006-07-31 15:35 ` Andy Whitcroft
@ 2006-08-07 10:11 ` Hugh Dickins
[not found] ` <20060807132633.GD4433@wotan.suse.de>
1 sibling, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2006-08-07 10:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management List
A basic question I need to understand before going further...
On Wed, 26 Jul 2006, Nick Piggin wrote:
> + *
> + * This forms the core of the lockless pagecache locking protocol, where
> + * the lookup-side (eg. find_get_page) has the following pattern:
> + * 1. find page in radix tree
> + * 2. conditionally increment refcount
> + * 3. wait for PageNoNewRefs
(Better say
	wait while PageNoNewRefs
)
> + * 4. check the page is still in pagecache
> + *
> + * Remove-side (that cares about _count, eg. reclaim) has the following:
> + * A. SetPageNoNewRefs
> + * B. check refcount is correct
> + * C. remove page
> + * D. ClearPageNoNewRefs
Yes, I understand why remove_mapping and migrate_page_move_mapping
(on page) do the PageNoNewRefs business; but why do add_to_page_cache,
__add_to_swap_cache and migrate_page_move_mapping (on newpage) do it?
Hugh
* Re: [patch 1/2] mm: speculative get_page
[not found] ` <20060807132633.GD4433@wotan.suse.de>
@ 2006-08-07 14:37 ` Hugh Dickins
2006-08-07 14:51 ` Nick Piggin
0 siblings, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2006-08-07 14:37 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-mm
On Mon, 7 Aug 2006, Nick Piggin wrote:
> On Mon, Aug 07, 2006 at 11:11:15AM +0100, Hugh Dickins wrote:
> >
> > Yes, I understand why remove_mapping and migrate_page_move_mapping
> > (on page) do the PageNoNewRefs business; but why do add_to_page_cache,
> > __add_to_swap_cache and migrate_page_move_mapping (on newpage) do it?
>
> add_to_*_cache(), because they insert the page *then* set up fields
> in the page. Without the bit set, the page is visible to pagecache
> as soon as it hits the radix tree.
Aha, thank you.
> In the page_cache case, I have a subsequent patch to rearrange this a bit,
> and reduce the number of atomic ops. I thought it would just add too much
> to review for now, though.
Well, it's a slightly different use for PageNoNewRefs, and would need
to be commented if it stays: I'd recommend avoiding the need for that
comment and the unnecessary atomics, doing your rearrangement in a
preceding patch.
Though maybe cleaner to have mapping/index/SwapCache/private properly
set before inserting page into radix tree, page_cache_get_speculative
callers all have to check afterwards and repeat if wrong; so the only
thing that's essential to do earlier is the SetPageLocked, isn't it?
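For concreteness, the insertion ordering under discussion, roughly as
it stands with this patch applied (the rearrangement suggested above
would move the field setup ahead of the radix_tree_insert):

	SetPageNoNewRefs(page);
	smp_wmb();
	write_lock_irq(&mapping->tree_lock);
	radix_tree_insert(&mapping->page_tree, offset, page);
	/* page now visible to RCU lookups, fields not yet valid */
	page_cache_get(page);
	SetPageLocked(page);
	page->mapping = mapping;
	page->index = offset;
	write_unlock_irq(&mapping->tree_lock);
	smp_wmb();
	ClearPageNoNewRefs(page);	/* lookups may now proceed */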
On the subject of mapping/index, I think there's potential for a very
very unlikely race you're ignoring, a race you can blame on me and my
passion for squeezing in alternative uses of struct page fields:
Isn't it conceivable that a page_cache_get_speculative finds a page
in the radix tree, but by the time its callers do those mapping/index
checks, that page is reused for some other purpose completely, which
happens to set the field formerly known as page->mapping to something
(perhaps a sequence of 4 or 8 random bytes) identical to what was
there before (and leaves index untouched, or changes it to the same)?
I'm thinking particularly of the per-pagetable page spinlock, where
what goes into page->mapping depends on CONFIG_DEBUG_SPINLOCK du jour.
I think we can probably (but I've not tried) satisfy ourselves that
there's currently no way that can happen; but how shall we prevent
ourselves from later making a change which opens up the possibility?
(By passing my address to a hitman, perhaps.)
An alternative would be to go more the radix_tree_lookup_slot way,
and the checks be on page remaining in slot; but I think you comment
that it cannot be used for RCU lookups, I didn't investigate further.
This is not a grave concern: but (unless I'm plain wrong) we do need
to be aware of it.
Hugh
* Re: [patch 1/2] mm: speculative get_page
2006-08-07 14:37 ` Hugh Dickins
@ 2006-08-07 14:51 ` Nick Piggin
0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2006-08-07 14:51 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-mm
On Mon, Aug 07, 2006 at 03:37:12PM +0100, Hugh Dickins wrote:
> On Mon, 7 Aug 2006, Nick Piggin wrote:
> > On Mon, Aug 07, 2006 at 11:11:15AM +0100, Hugh Dickins wrote:
> > >
> > > Yes, I understand why remove_mapping and migrate_page_move_mapping
> > > (on page) do the PageNoNewRefs business; but why do add_to_page_cache,
> > > __add_to_swap_cache and migrate_page_move_mapping (on newpage) do it?
> >
> > add_to_*_cache(), because they insert the page *then* set up fields
> > in the page. Without the bit set, the page is visible to pagecache
> > as soon as it hits the radix tree.
>
> Aha, thank you.
>
> > In the page_cache case, I have a subsequent patch to rearrange this a bit,
> > and reduce the number of atomic ops. I thought it would just add too much
> > to review for now, though.
>
> Well, it's a slightly different use for PageNoNewRefs, and would need
> to be commented if it stays: I'd recommend avoiding the need for that
> comment and the unnecessary atomics, doing your rearrangement in a
> preceding patch.
It's the same use in that when you combine tree_lock with the page
bit, you get the same semantics as the old write_lock(&tree_lock).
What I mean is: if the current slightly different uses of tree_lock
don't warrant different comments, then I don't see that PG_nnr
does either.
>
> Though maybe cleaner to have mapping/index/SwapCache/private properly
> set before inserting page into radix tree, page_cache_get_speculative
> callers all have to check afterwards and repeat if wrong; so the only
> thing that's essential to do earlier is the SetPageLocked, isn't it?
Something like that, yes.
>
> On the subject of mapping/index, I think there's potential for a very
> very unlikely race you're ignoring, a race you can blame on me and my
> passion for squeezing in alternative uses of struct page fields:
>
> Isn't it conceivable that a page_cache_get_speculative finds a page
> in the radix tree, but by the time its callers do those mapping/index
> checks, that page is reused for some other purpose completely, which
> happens to set the field formerly known as page->mapping to something
> (perhaps a sequence of 4 or 8 random bytes) identical to what was
> there before (and leaves index untouched, or changes it to the same)?
>
> I'm thinking particularly of the per-pagetable page spinlock, where
> what goes into page->mapping depends on CONFIG_DEBUG_SPINLOCK du jour.
>
> I think we can probably (but I've not tried) satisfy ourselves that
> there's currently no way that can happen; but how shall we prevent
> ourselves from later making a change which opens up the possibility?
> (By passing my address to a hitman, perhaps.)
>
> An alternative would be to go more the radix_tree_lookup_slot way,
> and the checks be on page remaining in slot; but I think you comment
> that it cannot be used for RCU lookups, I didn't investigate further.
>
> This is not a grave concern: but (unless I'm plain wrong) we do need
> to be aware of it.
No, it is something I'm worried about too. And definitely the lookup_slot
approach would solve it. I'm inclined to go back to the lookup_slot
method which would solve the weird gang lookup problems that come about
with this approach. And as another bonus we don't need find_get_swap_page.
The problem is not so much with RCU lookups as with the direct-data
patch: one can take the address of a slot in a radix-tree node with the
knowledge that, under RCU lock, it will always dereference to either
NULL or a valid item.
However, direct data stores a 0-height tree's item at ->rnode, but this
can also be switched to point to a radix-tree node, or vice versa, at
any time.
The solution is just a little bit more API to do the dereferencing work
for us. Shouldn't be a big problem.
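Presumably something along these lines (a sketch of what the extra
dereferencing API could look like; the real implementation would also
have to handle the direct-data encoding mentioned above):

	/*
	 * Dereference a slot obtained via radix_tree_lookup_slot(),
	 * under rcu_read_lock(), hiding the tree's internal encoding
	 * of the slot contents.
	 */
	static inline void *radix_tree_deref_slot(void **pslot)
	{
		return rcu_dereference(*pslot);
	}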
* Re: [patch 1/2] mm: speculative get_page
2006-08-01 20:42 ` Oleg Nesterov
@ 2006-08-01 23:53 ` Nick Piggin
0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2006-08-01 23:53 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Dave Kleikamp, Nick Piggin, Hugh Dickins, Andrew Morton,
Andy Whitcroft, linux-mm, linux-kernel
Oleg Nesterov wrote:
> On 08/01, Dave Kleikamp wrote:
>>Isn't the page locked when calling remove_mapping()? It looks like
>>SetPageNoNewRefs & ClearPageNoNewRefs are called in safe places. Either
>>the page is locked, or it's newly allocated. I could have missed
>>something, though.
>
>
> No, I think it is I who missed something, thanks.
Yeah, SetPageNoNewRefs is indeed called only under PageLocked or for
newly allocated pages. I should make a note about that, as it isn't
immediately clear.
Thanks
--
SUSE Labs, Novell Inc.
* Re: [patch 1/2] mm: speculative get_page
2006-08-01 15:55 ` Dave Kleikamp
@ 2006-08-01 20:42 ` Oleg Nesterov
2006-08-01 23:53 ` Nick Piggin
0 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2006-08-01 20:42 UTC (permalink / raw)
To: Dave Kleikamp
Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Andy Whitcroft,
linux-mm, linux-kernel
On 08/01, Dave Kleikamp wrote:
>
> On Tue, 2006-08-01 at 23:32 +0400, Oleg Nesterov wrote:
> > Nick Piggin wrote:
> > >
> > > --- linux-2.6.orig/mm/vmscan.c
> > > +++ linux-2.6/mm/vmscan.c
> > > @@ -380,6 +380,8 @@ int remove_mapping(struct address_space
> > > if (!mapping)
> > > return 0; /* truncate got there first */
> > >
> > > + SetPageNoNewRefs(page);
> > > + smp_wmb();
> > > write_lock_irq(&mapping->tree_lock);
> > >
> >
> > Is it enough?
> >
> > PG_nonewrefs could be already set by another add_to_page_cache()/remove_mapping(),
> > and it will be cleared when we take ->tree_lock.
>
> Isn't the page locked when calling remove_mapping()? It looks like
> SetPageNoNewRefs & ClearPageNoNewRefs are called in safe places. Either
> the page is locked, or it's newly allocated. I could have missed
> something, though.
No, I think it is I who missed something, thanks.
Oleg.
* Re: [patch 1/2] mm: speculative get_page
@ 2006-08-01 19:32 Oleg Nesterov
2006-08-01 15:55 ` Dave Kleikamp
0 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2006-08-01 19:32 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Andrew Morton, Andy Whitcroft, linux-mm, linux-kernel
Nick Piggin wrote:
>
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -380,6 +380,8 @@ int remove_mapping(struct address_space
> if (!mapping)
> return 0; /* truncate got there first */
>
> + SetPageNoNewRefs(page);
> + smp_wmb();
> write_lock_irq(&mapping->tree_lock);
>
Is it enough?
PG_nonewrefs could be already set by another add_to_page_cache()/remove_mapping(),
and it will be cleared when we take ->tree_lock. For example:
CPU_0					CPU_1					CPU_3

add_to_page_cache:
  SetPageNoNewRefs();
  write_lock_irq(->tree_lock);
  ...
  write_unlock_irq(->tree_lock);

					remove_mapping:
					  SetPageNoNewRefs();

  ClearPageNoNewRefs();
					  write_lock_irq(->tree_lock);

					  check page_count()

										page_cache_get_speculative:
										  increment page_count()
										  no PG_nonewrefs => return
Oleg.
* Re: [patch 1/2] mm: speculative get_page
2006-08-01 19:32 Oleg Nesterov
@ 2006-08-01 15:55 ` Dave Kleikamp
2006-08-01 20:42 ` Oleg Nesterov
0 siblings, 1 reply; 10+ messages in thread
From: Dave Kleikamp @ 2006-08-01 15:55 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Andy Whitcroft,
linux-mm, linux-kernel
On Tue, 2006-08-01 at 23:32 +0400, Oleg Nesterov wrote:
> Nick Piggin wrote:
> >
> > --- linux-2.6.orig/mm/vmscan.c
> > +++ linux-2.6/mm/vmscan.c
> > @@ -380,6 +380,8 @@ int remove_mapping(struct address_space
> > if (!mapping)
> > return 0; /* truncate got there first */
> >
> > + SetPageNoNewRefs(page);
> > + smp_wmb();
> > write_lock_irq(&mapping->tree_lock);
> >
>
> Is it enough?
>
> PG_nonewrefs could be already set by another add_to_page_cache()/remove_mapping(),
> and it will be cleared when we take ->tree_lock.
Isn't the page locked when calling remove_mapping()? It looks like
SetPageNoNewRefs & ClearPageNoNewRefs are called in safe places. Either
the page is locked, or it's newly allocated. I could have missed
something, though.
> For example:
>
> CPU_0					CPU_1					CPU_3
>
> add_to_page_cache:
>
>   SetPageNoNewRefs();
>   write_lock_irq(->tree_lock);
    SetPageLocked(page);
>   ...
>   write_unlock_irq(->tree_lock);
>
>					remove_mapping:
>
>					  SetPageNoNewRefs();
>
>   ClearPageNoNewRefs();
>					  write_lock_irq(->tree_lock);
>
>					  check page_count()
>
>										page_cache_get_speculative:
>
>										  increment page_count()
>
>										  no PG_nonewrefs => return
>
> Oleg.
Shaggy
--
David Kleikamp
IBM Linux Technology Center