* [rfc] lockless pagecache
From: Nick Piggin @ 2005-06-27 6:29 UTC
To: linux-kernel, Linux Memory Management
Hi,

This is going to be a fairly long and probably incoherent post. The
idea and implementation are not completely analysed for holes, and
I wouldn't be surprised if some (even fatal ones) exist.

That said, I wanted something to talk about at Ottawa and I think
this is a promising idea - it is at the stage where it would be good
to have interested parties pick it apart. BTW, this is my main reason
for the PageReserved removal patches, so if this falls apart then
some good will have come from it! :)

OK, so my aim is to remove the requirement to take mapping->tree_lock
when looking up pagecache pages (eg. for a read/write or nopage fault).
Note that this does not deal with insertion and removal of pages from
pagecache mappings - that is usually a slower path operation associated
with IO or page reclaim or truncate. However, if there was interest in
making these paths more scalable, there are possibilities for that too.

What for? Well there are probably lots of reasons, but suppose you have
a big app with lots of processes all mmapping and playing around with
various parts of the same big file (say, a shared memory file), then
you might start seeing problems if you want to scale this workload up
to say 32+ CPUs.

Now the tree_lock was recently(ish) converted to an rwlock, precisely
for such a workload and that was apparently very successful. However
an rwlock is significantly heavier, and as machines get faster and
bigger, rwlocks (and any locks) will tend to use more and more of Paul
McKenney's toilet paper due to cacheline bouncing.

So in the interest of saving some trees, let's try it without any locks.

First I'll put up some numbers to get you interested - from a 64-way Altix
with 64 processes each read-faulting in their own 512MB part of a 32GB
file that is preloaded in pagecache (with the proper NUMA memory
allocation).

[best of 5 runs]

plain 2.6.12-git4:
 1 proc   0.65u   1.43s   2.09e   99%CPU
64 proc   0.75u 291.30s   4.92e 5927%CPU

64 proc prof:
3242763 total                        0.5366
1269413 _read_unlock_irq         19834.5781
 842042 do_no_page                 355.5921
 779373 cond_resched              3479.3438
 100667 ia64_pal_call_static       524.3073
  96469 _spin_lock                1004.8854
  92857 default_idle               241.8151
  25572 filemap_nopage              15.6691
  11981 ia64_load_scratch_fpregs   187.2031
  11671 ia64_save_scratch_fpregs   182.3594
   2566 page_fault                   2.5867

Elapsed time has grown by a factor of about 2.4 (2.09s to 4.92s) when
going from serial to 64-way, and it is due to mapping->tree_lock. Serial
is even at the disadvantage of reading from remote memory 62 times out
of 64.

2.6.12-git4-lockless:
 1 proc   0.66u   1.38s   2.04e   99%CPU
64 proc   0.68u   1.42s   0.12e 1686%CPU

64 proc prof:
  81934 total                        0.0136
  31108 ia64_pal_call_static       162.0208
  28394 default_idle                73.9427
   3796 ia64_save_scratch_fpregs    59.3125
   3736 ia64_load_scratch_fpregs    58.3750
   2208 page_fault                   2.2258
   1380 unmap_vmas                   0.3292
   1298 __mod_page_state             8.1125
   1089 do_no_page                   0.4599
    830 find_get_page                2.5938
    781 ia64_do_page_fault           0.2805

So we have increased performance 17x when going from 1 to 64 way (2.04s
to 0.12s elapsed). However, if you look at the CPU utilisation figure and
the elapsed time, you'll see my test didn't provide enough work to keep
all CPUs busy, and for the amount of CPU time used, we appear to have
perfect scalability. In fact, it is slightly superlinear, probably due to
remote memory access on the serial run.
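
As a quick check of the arithmetic, the sketch below recomputes the speedup and the CPU time from the timings quoted above. The struct and helper names are made up for illustration; only the numbers come from this post.

```c
#include <assert.h>

/* One line of time(1)-style output from the tables above, in seconds. */
struct run {
	double user;
	double sys;
	double elapsed;
};

/* CPU time actually consumed: user + system. */
static double cpu_time(struct run r)
{
	return r.user + r.sys;
}

/* How many times faster, in wall-clock terms, run b finished than run a. */
static double speedup(struct run a, struct run b)
{
	return a.elapsed / b.elapsed;
}

/* Timings copied from the lockless and plain results above. */
static const struct run lockless_serial = { 0.66, 1.38, 2.04 };
static const struct run lockless_64     = { 0.68, 1.42, 0.12 };
static const struct run plain_64        = { 0.75, 291.30, 4.92 };
```

With these figures, speedup(lockless_serial, lockless_64) is 17, and the CPU time of the two lockless runs differs by only about 3% (2.04s against 2.10s), whereas the plain kernel spends roughly 292s of CPU on the same 64-process job.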

I'll reply to this post with the series of commented patches which is
probably the best way to explain how it is done. They are against
2.6.12-git4 + some future iteration of the PageReserved patches. I
can provide the complete rollup privately on request.

Comments, flames, laughing me out of town, etc. are all very welcome.

Nick

--
SUSE Labs, Novell Inc.

* [patch 1] mm: PG_free flag
From: Nick Piggin @ 2005-06-27 6:32 UTC
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 28 bytes --]

--
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-PG_free-flag.patch --]
[-- Type: text/plain, Size: 2886 bytes --]

In a future patch we can no longer rely on page_count being stable at any
time, so we can no longer overload PagePrivate && page_count == 0 to mean
the page is free and on the buddy lists.

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -76,6 +76,8 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
+#define PG_free			20	/* Page is on the free lists */
+
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
  * allowed.
@@ -306,6 +308,10 @@ extern void __mod_page_state(unsigned lo
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageFree(page)		test_bit(PG_free, &(page)->flags)
+#define __SetPageFree(page)	__set_bit(PG_free, &(page)->flags)
+#define __ClearPageFree(page)	__clear_bit(PG_free, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -114,7 +114,8 @@ static void bad_page(const char *functio
 			1 << PG_slab	|
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_reserved );
+			1 << PG_reserved |
+			1 << PG_free );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
@@ -191,12 +192,12 @@ static inline unsigned long page_order(s
 
 static inline void set_page_order(struct page *page, int order) {
 	page->private = order;
-	__SetPagePrivate(page);
+	__SetPageFree(page);
 }
 
 static inline void rmv_page_order(struct page *page)
 {
-	__ClearPagePrivate(page);
+	__ClearPageFree(page);
 	page->private = 0;
 }
 
@@ -242,9 +243,7 @@ __find_combined_index(unsigned long page
  */
 static inline int page_is_buddy(struct page *page, int order)
 {
-	if (PagePrivate(page) &&
-	    (page_order(page) == order) &&
-	    page_count(page) == 0)
+	if (PageFree(page) && (page_order(page) == order))
 		return 1;
 	return 0;
 }
@@ -327,7 +326,8 @@ static inline void free_pages_check(cons
 			1 << PG_slab	|
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_reserved )))
+			1 << PG_reserved |
+			1 << PG_free )))
 		bad_page(function, page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
@@ -456,7 +456,8 @@ static void prep_new_page(struct page *p
 			1 << PG_slab	|
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_reserved )))
+			1 << PG_reserved |
+			1 << PG_free )))
 		bad_page(__FUNCTION__, page);
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
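
To illustrate the point of patch 1 outside the kernel, here is a minimal user-space sketch in which "is this page on the free lists?" is answered from a dedicated flag bit alone, with no reference to a reference count. The toy_page type and the toy_* helpers are hypothetical stand-ins, not kernel code; only the bit number 20 is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PG_FREE 20UL	/* same bit number the patch assigns to PG_free */

/* Hypothetical, cut-down stand-in for struct page. */
struct toy_page {
	unsigned long flags;
	unsigned long private;	/* buddy order while on the free lists */
};

/* Mark the page as free at the given order (cf. set_page_order). */
static void toy_set_page_order(struct toy_page *page, unsigned long order)
{
	page->private = order;
	page->flags |= 1UL << TOY_PG_FREE;
}

/* Take the page off the free lists (cf. rmv_page_order). */
static void toy_rmv_page_order(struct toy_page *page)
{
	page->flags &= ~(1UL << TOY_PG_FREE);
	page->private = 0;
}

/* Buddy test that no longer looks at any refcount (cf. page_is_buddy). */
static bool toy_page_is_buddy(const struct toy_page *page, unsigned long order)
{
	return (page->flags & (1UL << TOY_PG_FREE)) && page->private == order;
}
```

Because the test depends only on the flag and the stored order, a concurrent speculative reference that briefly raises the count cannot make a free page look allocated to the buddy merge code.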

* [patch 2] mm: speculative get_page
From: Nick Piggin @ 2005-06-27 6:32 UTC
To: linux-kernel, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 28 bytes --]

--
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-speculative-get_page.patch --]
[-- Type: text/plain, Size: 6511 bytes --]

If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to
the page.

This can be done if get_page/put_page behaves in the same manner
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).

There needs to be some careful logic for freed pages so they aren't
freed again, and also some careful logic for pages in the process
of being removed from pagecache.

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -77,6 +77,7 @@
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
 #define PG_free			20	/* Page is on the free lists */
+#define PG_freeing		21	/* PG_refcount about to be freed */
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -312,6 +313,11 @@ extern void __mod_page_state(unsigned lo
 #define __SetPageFree(page)	__set_bit(PG_free, &(page)->flags)
 #define __ClearPageFree(page)	__clear_bit(PG_free, &(page)->flags)
 
+#define PageFreeing(page)	test_bit(PG_freeing, &(page)->flags)
+#define SetPageFreeing(page)	set_bit(PG_freeing, &(page)->flags)
+#define ClearPageFreeing(page)	clear_bit(PG_freeing, &(page)->flags)
+#define __ClearPageFreeing(page) __clear_bit(PG_freeing, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -50,6 +50,42 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+static inline struct page *page_cache_get_speculative(struct page **pagep)
+{
+	struct page *page;
+
+	preempt_disable();
+	page = *pagep;
+	if (!page)
+		goto out_failed;
+
+	if (unlikely(get_page_testone(page))) {
+		/* Picked up a freed page */
+		__put_page(page);
+		goto out_failed;
+	}
+	/*
+	 * preempt can really be enabled here (only needs to be disabled
+	 * because page allocation can spin on the elevated refcount, but
+	 * we don't want to hold a reference on an unrelated page for too
+	 * long, so keep preempt off until we know we have the right page
+	 */
+
+	if (unlikely(PageFreeing(page)) ||
+			unlikely(page != *pagep)) {
+		/* Picked up a page being freed, or one that's been reused */
+		put_page(page);
+		goto out_failed;
+	}
+	preempt_enable();
+
+	return page;
+
+out_failed:
+	preempt_enable();
+	return NULL;
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return alloc_pages(mapping_gfp_mask(x)|__GFP_NORECLAIM, 0);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,7 +116,6 @@ static void bad_page(const char *functio
 			1 << PG_writeback |
 			1 << PG_reserved |
 			1 << PG_free );
-	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
 	tainted |= TAINT_BAD_PAGE;
@@ -316,7 +315,6 @@ static inline void free_pages_check(cons
 {
 	if (	page_mapcount(page) ||
 		page->mapping != NULL ||
-		page_count(page) != 0 ||
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private |
@@ -424,7 +422,7 @@ expand(struct zone *zone, struct page *p
 void set_page_refs(struct page *page, int order)
 {
 #ifdef CONFIG_MMU
-	set_page_count(page, 1);
+	get_page(page);
 #else
 	int i;
 
@@ -434,7 +432,7 @@ void set_page_refs(struct page *page, in
 	 * - eg: access_process_vm()
 	 */
 	for (i = 0; i < (1 << order); i++)
-		set_page_count(page + i, 1);
+		get_page(page + i);
 #endif /* CONFIG_MMU */
 }
 
@@ -445,7 +443,6 @@ static void prep_new_page(struct page *p
 {
 	if (	page_mapcount(page) ||
 		page->mapping != NULL ||
-		page_count(page) != 0 ||
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private |
@@ -464,7 +461,13 @@ static void prep_new_page(struct page *p
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_checked | 1 << PG_mappedtodisk);
 	page->private = 0;
+
 	set_page_refs(page, order);
+	smp_mb();
+	/* Wait for speculative get_page after count has been elevated. */
+	while (unlikely(page_count(page) > 1))
+		cpu_relax();
+
 	kernel_map_pages(page, 1 << order, 1);
 }
 
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -504,6 +504,7 @@ static int shrink_list(struct list_head
 		if (!mapping)
 			goto keep_locked;	/* truncate got there first */
 
+		SetPageFreeing(page);
 		write_lock_irq(&mapping->tree_lock);
 
 		/*
@@ -513,6 +514,7 @@ static int shrink_list(struct list_head
 		 */
 		if (page_count(page) != 2 || PageDirty(page)) {
 			write_unlock_irq(&mapping->tree_lock);
+			ClearPageFreeing(page);
 			goto keep_locked;
 		}
 
@@ -533,6 +535,7 @@ static int shrink_list(struct list_head
 
 free_it:
 		unlock_page(page);
+		__ClearPageFreeing(page);
 		reclaimed++;
 		if (!pagevec_add(&freed_pvec, page))
 			__pagevec_release_nonlru(&freed_pvec);
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -278,17 +278,19 @@ static unsigned long __init free_all_boo
 			if (gofast && v == ~0UL) {
 				int j, order;
 
+				prefetchw(page);
 				count += BITS_PER_LONG;
-				__ClearPageReserved(page);
+
 				order = ffs(BITS_PER_LONG) - 1;
-				set_page_refs(page, order);
-				for (j = 1; j < BITS_PER_LONG; j++) {
-					if (j + 16 < BITS_PER_LONG)
-						prefetchw(page + j + 16);
+				for (j = 0; j < BITS_PER_LONG; j++) {
+					if (j + 1 < BITS_PER_LONG)
+						prefetchw(page + j + 1);
 					__ClearPageReserved(page + j);
 					set_page_count(page + j, 0);
 				}
+				set_page_refs(page, order);
 				__free_pages(page, order);
+
 				i += BITS_PER_LONG;
 				page += BITS_PER_LONG;
 			} else if (v) {
@@ -297,6 +299,7 @@ static unsigned long __init free_all_boo
 					if (v & m) {
 						count++;
 						__ClearPageReserved(page);
+						set_page_count(page, 0);
 						set_page_refs(page, 0);
 						__free_page(page);
 					}
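
The control flow of page_cache_get_speculative() can be modelled in user space with C11 atomics. This is only a sketch under stated assumptions: toy_page and toy_get_speculative are hypothetical names, a refcount of 0 is taken to mean "free", and get_page_testone() is approximated by a fetch-and-add that reports whether the old count was zero. Preemption control and the real freeing path are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct toy_page {
	atomic_int refcount;	/* assumption: 0 means the page is free */
	atomic_bool freeing;	/* models PG_freeing */
};

/*
 * Try to take a reference on whatever page the slot currently points to.
 * Returns the page with its refcount raised by one, or NULL if the slot was
 * empty, the page was free, the page is being freed, or the slot changed.
 */
static struct toy_page *toy_get_speculative(struct toy_page *_Atomic *slot)
{
	struct toy_page *page = atomic_load(slot);

	if (page == NULL)
		return NULL;

	/* Take the reference first; only afterwards learn what was pinned. */
	if (atomic_fetch_add(&page->refcount, 1) == 0) {
		/* Picked up a free page: undo without freeing it again. */
		atomic_fetch_sub(&page->refcount, 1);
		return NULL;
	}

	/* The reference now pins the page, so these checks are meaningful. */
	if (atomic_load(&page->freeing) || atomic_load(slot) != page) {
		/* A real put_page() would free the page on the last put. */
		atomic_fetch_sub(&page->refcount, 1);
		return NULL;
	}
	return page;
}
```

The ordering is the whole trick: the count is raised before anything is known about the page, and the identity check (slot still points at this page, page not marked freeing) happens only once the raised count guarantees the page cannot be recycled underneath the caller.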

* [patch 3] radix tree: lookup_slot
From: Nick Piggin @ 2005-06-27 6:33 UTC
To: linux-kernel, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 28 bytes --]

--
SUSE Labs, Novell Inc.

[-- Attachment #2: radix-tree-lookup_slot.patch --]
[-- Type: text/plain, Size: 2673 bytes --]

From: Hans Reiser <reiser@namesys.com>

Reiser4 uses radix trees to solve a trouble reiser4_readdir has serving nfs
requests.

Unfortunately, radix tree api lacks an operation suitable for modifying
existing entry.  This patch adds radix_tree_lookup_slot which returns pointer
to found item within the tree.  That location can be then updated.

Signed-off-by: Andrew Morton <akpm@osdl.org>

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -46,6 +46,7 @@ do {									\
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
+void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -276,14 +276,8 @@ int radix_tree_insert(struct radix_tree_
 }
 EXPORT_SYMBOL(radix_tree_insert);
 
-/**
- *	radix_tree_lookup    -    perform lookup operation on a radix tree
- *	@root:		radix tree root
- *	@index:		index key
- *
- *	Lookup the item at the position @index in the radix tree @root.
- */
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+static inline void **__lookup_slot(struct radix_tree_root *root,
+				   unsigned long index)
 {
 	unsigned int height, shift;
 	struct radix_tree_node **slot;
@@ -306,7 +300,36 @@ void *radix_tree_lookup(struct radix_tre
 		height--;
 	}
 
-	return *slot;
+	return (void **)slot;
+}
+
+/**
+ *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the slot corresponding to the position @index in the radix tree
+ *	@root. This is useful for update-if-exists operations.
+ */
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
+{
+	return __lookup_slot(root, index);
+}
+EXPORT_SYMBOL(radix_tree_lookup_slot);
+
+/**
+ *	radix_tree_lookup    -    perform lookup operation on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root.
+ */
+void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+{
+	void **slot;
+
+	slot = __lookup_slot(root, index);
+	return slot != NULL ? *slot : NULL;
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
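
The "update-if-exists" use mentioned in the kernel-doc comment can be shown with a deliberately tiny model: a single-level, fixed-size table standing in for the radix tree. Everything prefixed toy_ is hypothetical; the sketch only mirrors the relationship between a slot lookup and an item lookup that the patch sets up.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 64	/* hypothetical fixed size; the real tree grows */

struct toy_root {
	void *slots[TOY_SLOTS];
};

/* Return the address of the slot for index, or NULL if out of range. */
static void **toy_lookup_slot(struct toy_root *root, unsigned long index)
{
	if (index >= TOY_SLOTS)
		return NULL;
	return &root->slots[index];
}

/* Item lookup layered on the slot lookup, as radix_tree_lookup now is. */
static void *toy_lookup(struct toy_root *root, unsigned long index)
{
	void **slot = toy_lookup_slot(root, index);

	return slot != NULL ? *slot : NULL;
}

/* Update-if-exists: swap the item in place, with no delete + insert. */
static int toy_replace(struct toy_root *root, unsigned long index, void *item)
{
	void **slot = toy_lookup_slot(root, index);

	if (slot == NULL || *slot == NULL)
		return -1;	/* nothing stored at this index */
	*slot = item;
	return 0;
}
```

Returning the slot rather than the item is also what lets a later patch in this series re-read the slot after taking a speculative reference, to confirm it still points at the same page.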

* [patch 4] radix tree: lockless readside
From: Nick Piggin @ 2005-06-27 6:34 UTC
To: linux-kernel, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 28 bytes --]

--
SUSE Labs, Novell Inc.

[-- Attachment #2: radix-tree-lockless-readside.patch --]
[-- Type: text/plain, Size: 5957 bytes --]

Make radix tree lookups safe to be performed without locks.

Also introduce a lockfree gang_lookup_slot which will be used
by a future patch.

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -45,6 +45,7 @@
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
 struct radix_tree_node {
+	unsigned int	height;		/* Height from the bottom */
 	unsigned int	count;
 	void		*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
@@ -196,6 +197,7 @@ static int radix_tree_extend(struct radi
 	}
 
 	do {
+		unsigned int newheight;
 		if (!(node = radix_tree_node_alloc(root)))
 			return -ENOMEM;
 
@@ -208,9 +210,13 @@ static int radix_tree_extend(struct radi
 				tag_set(node, tag, 0);
 		}
 
+		newheight = root->height+1;
+		node->height = newheight;
 		node->count = 1;
+		/* Make ->height visible before node visible via ->rnode */
+		smp_wmb();
 		root->rnode = node;
-		root->height++;
+		root->height = newheight;
 	} while (height > root->height);
 out:
 	return 0;
@@ -250,6 +256,9 @@ int radix_tree_insert(struct radix_tree_
 			/* Have to add a child node.  */
 			if (!(tmp = radix_tree_node_alloc(root)))
 				return -ENOMEM;
+			tmp->height = height;
+			/* Make ->height visible before node visible via slot */
+			smp_wmb();
 			*slot = tmp;
 			if (node)
 				node->count++;
@@ -282,12 +291,14 @@ static inline void **__lookup_slot(struc
 	unsigned int height, shift;
 	struct radix_tree_node **slot;
 
-	height = root->height;
+	if (root->rnode == NULL)
+		return NULL;
+	slot = &root->rnode;
+	height = (*slot)->height;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = &root->rnode;
 
 	while (height > 0) {
 		if (*slot == NULL)
@@ -491,21 +502,24 @@ EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
 static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_root *root, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)
 {
+	unsigned long i;
 	unsigned int nr_found = 0;
 	unsigned int shift;
-	unsigned int height = root->height;
+	unsigned int height;
 	struct radix_tree_node *slot;
 
-	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
+	if (!slot)
+		goto out;
+	height = slot->height;
+	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
-	while (height > 0) {
-		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
-
-		for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
+	for (;;) {
+		for (i = (index >> shift) & RADIX_TREE_MAP_MASK;
+				i < RADIX_TREE_MAP_SIZE; i++) {
 			if (slot->slots[i] != NULL)
 				break;
 			index &= ~((1UL << shift) - 1);
@@ -516,21 +530,23 @@ __lookup(struct radix_tree_root *root, v
 		if (i == RADIX_TREE_MAP_SIZE)
 			goto out;
 		height--;
-		if (height == 0) {	/* Bottom level: grab some items */
-			unsigned long j = index & RADIX_TREE_MAP_MASK;
-
-			for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
-				index++;
-				if (slot->slots[j]) {
-					results[nr_found++] = slot->slots[j];
-					if (nr_found == max_items)
-						goto out;
-				}
-			}
+		if (height == 0) {
+			/* Bottom level: grab some items */
+			break;
 		}
 		shift -= RADIX_TREE_MAP_SHIFT;
 		slot = slot->slots[i];
 	}
+
+	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
+		index++;
+		if (slot->slots[i]) {
+			results[nr_found++] = &(slot->slots[i]);
+			if (nr_found == max_items)
+				goto out;
+		}
+	}
+
 out:
 	*next_index = index;
 	return nr_found;
@@ -558,6 +574,43 @@ radix_tree_gang_lookup(struct radix_tree
 	unsigned int ret = 0;
 
 	while (ret < max_items) {
+		unsigned int nr_found, i;
+		unsigned long next_index;	/* Index of next search */
+
+		if (cur_index > max_index)
+			break;
+		nr_found = __lookup(root, (void ***)results + ret, cur_index,
+					max_items - ret, &next_index);
+		for (i = 0; i < nr_found; i++)
+			results[ret + i] = *(((void ***)results)[ret + i]);
+		ret += nr_found;
+		if (next_index == 0)
+			break;
+		cur_index = next_index;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ *	radix_tree_gang_lookup_slot - perform multiple lookup on a radix tree
+ *	@root:		radix tree root
+ *	@results:	where the results of the lookup are placed
+ *	@first_index:	start the lookup from this key
+ *	@max_items:	place up to this many items at *results
+ *
+ *	Same as radix_tree_gang_lookup, but returns an array of pointers
+ *	(slots) to the stored items instead of the items themselves.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items)
+{
+	const unsigned long max_index = radix_tree_maxindex(root->height);
+	unsigned long cur_index = first_index;
+	unsigned int ret = 0;
+
+	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
 
@@ -572,7 +625,8 @@ radix_tree_gang_lookup(struct radix_tree
 	}
 	return ret;
 }
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
 
 /*
  * FIXME: the two tag_get()s here should use find_next_bit() instead of
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -51,6 +51,9 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(int gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
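
The "make ->height visible before the node becomes visible" rule in patch 4 is an instance of the usual initialise-then-publish pattern. A user-space sketch with C11 atomics follows; the toy_* names are hypothetical, the release store stands in for smp_wmb() followed by the pointer assignment, and the acquire load on the reader side is a conservative stand-in (the patch as posted adds no explicit read-side barrier).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for radix_tree_node and radix_tree_root. */
struct toy_node {
	unsigned int height;
};

struct toy_root {
	struct toy_node *_Atomic rnode;
};

/* Writer: fill in the node completely, then publish the pointer. */
static void toy_publish(struct toy_root *root, struct toy_node *node,
			unsigned int height)
{
	node->height = height;
	/* Release ordering: the height store cannot pass the pointer store. */
	atomic_store_explicit(&root->rnode, node, memory_order_release);
}

/*
 * Lockless reader: take the height from the node that was actually
 * loaded, never from a separate field in the root that may be newer
 * or older. Returns 0 for an empty tree.
 */
static unsigned int toy_read_height(struct toy_root *root)
{
	struct toy_node *node =
		atomic_load_explicit(&root->rnode, memory_order_acquire);

	if (node == NULL)
		return 0;
	return node->height;
}
```

Reading the height out of the node is what keeps the reader self-consistent: a lookup that raced with a tree extension sees either the old root node with its old height or the new one with its new height, never a mix.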
* [patch 5] mm: lockless pagecache lookups 2005-06-27 6:34 ` [patch 4] radix tree: lockless readside Nick Piggin @ 2005-06-27 6:34 ` Nick Piggin 2005-06-27 6:35 ` [patch 6] mm: spinlock tree_lock Nick Piggin 0 siblings, 1 reply; 56+ messages in thread From: Nick Piggin @ 2005-06-27 6:34 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management [-- Attachment #1: Type: text/plain, Size: 28 bytes --] -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-lockless-pagecache-lookups.patch --] [-- Type: text/plain, Size: 11976 bytes --] Use the speculative get_page and the lockless radix tree lookups to introduce lockless page cache lookups (ie. no mapping->tree_lock). The only atomicity changes this should introduce is the use of a non atomic pagevec lookup for truncate, however what atomicity guarantees there were are probably not too useful anyway. Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -378,18 +378,25 @@ int add_to_page_cache(struct page *page, int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); if (error == 0) { + page_cache_get(page); + __SetPageLocked(page); + page->mapping = mapping; + page->index = offset; + write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); if (!error) { - page_cache_get(page); - SetPageLocked(page); - page->mapping = mapping; - page->index = offset; mapping->nrpages++; pagecache_acct(1); } write_unlock_irq(&mapping->tree_lock); radix_tree_preload_end(); + + if (error) { + page->mapping = NULL; + __put_page(page); + __ClearPageLocked(page); + } } return error; } @@ -499,13 +506,13 @@ EXPORT_SYMBOL(__lock_page); */ struct page * find_get_page(struct address_space *mapping, unsigned long offset) { - struct page *page; + struct page **pagep; + struct page *page = NULL; - read_lock_irq(&mapping->tree_lock); - page = radix_tree_lookup(&mapping->page_tree, 
offset); - if (page) - page_cache_get(page); - read_unlock_irq(&mapping->tree_lock); + pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree, + offset); + if (pagep) + page = page_cache_get_speculative(pagep); return page; } @@ -518,12 +525,24 @@ struct page *find_trylock_page(struct ad { struct page *page; - read_lock_irq(&mapping->tree_lock); - page = radix_tree_lookup(&mapping->page_tree, offset); - if (page && TestSetPageLocked(page)) - page = NULL; - read_unlock_irq(&mapping->tree_lock); - return page; + page = find_get_page(mapping, offset); + if (page) { + if (TestSetPageLocked(page)) + goto out_failed; + /* Has the page been truncated before being locked? */ + if (page->mapping != mapping || page->index != offset) { + unlock_page(page); + goto out_failed; + } + + /* Silly interface requires us to drop the refcount */ + __put_page(page); + return page; + +out_failed: + page_cache_release(page); + } + return NULL; } EXPORT_SYMBOL(find_trylock_page); @@ -544,25 +563,17 @@ struct page *find_lock_page(struct addre { struct page *page; - read_lock_irq(&mapping->tree_lock); repeat: - page = radix_tree_lookup(&mapping->page_tree, offset); + page = find_get_page(mapping, offset); if (page) { - page_cache_get(page); - if (TestSetPageLocked(page)) { - read_unlock_irq(&mapping->tree_lock); - lock_page(page); - read_lock_irq(&mapping->tree_lock); - - /* Has the page been truncated while we slept? */ - if (page->mapping != mapping || page->index != offset) { - unlock_page(page); - page_cache_release(page); - goto repeat; - } + lock_page(page); + /* Has the page been truncated before being locked? 
*/ + if (page->mapping != mapping || page->index != offset) { + unlock_page(page); + page_cache_release(page); + goto repeat; } } - read_unlock_irq(&mapping->tree_lock); return page; } @@ -645,6 +656,30 @@ unsigned find_get_pages(struct address_s return ret; } +unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start, + unsigned int nr_pages, struct page **pages) +{ + unsigned int i; + unsigned int ret; + unsigned int ret2; + + /* + * We do some unsightly casting to use the array first for storing + * pointers to the page pointers, and then for the pointers to + * the pages themselves that the caller wants. + */ + ret = radix_tree_gang_lookup_slot(&mapping->page_tree, + (void ***)pages, start, nr_pages); + ret2 = 0; + for (i = 0; i < ret; i++) { + struct page *page; + page = page_cache_get_speculative(((struct page ***)pages)[i]); + if (page) + pages[ret2++] = page; + } + return ret2; +} + /* * Like find_get_pages, except we only return pages which are tagged with * `tag'. We update *index to index the next page for the traversal. Index: linux-2.6/mm/readahead.c =================================================================== --- linux-2.6.orig/mm/readahead.c +++ linux-2.6/mm/readahead.c @@ -272,27 +272,24 @@ __do_page_cache_readahead(struct address /* * Preallocate as many pages as we will need. */ - read_lock_irq(&mapping->tree_lock); for (page_idx = 0; page_idx < nr_to_read; page_idx++) { unsigned long page_offset = offset + page_idx; if (page_offset > end_index) break; + /* Don't need mapping->tree_lock - lookup can be racy */ page = radix_tree_lookup(&mapping->page_tree, page_offset); if (page) continue; - read_unlock_irq(&mapping->tree_lock); page = page_cache_alloc_cold(mapping); - read_lock_irq(&mapping->tree_lock); if (!page) break; page->index = page_offset; list_add(&page->lru, &page_pool); ret++; } - read_unlock_irq(&mapping->tree_lock); /* * Now start the IO. 
We ignore I/O errors - if the page is not Index: linux-2.6/mm/swap_state.c =================================================================== --- linux-2.6.orig/mm/swap_state.c +++ linux-2.6/mm/swap_state.c @@ -76,19 +76,26 @@ static int __add_to_swap_cache(struct pa BUG_ON(PagePrivate(page)); error = radix_tree_preload(gfp_mask); if (!error) { + page_cache_get(page); + SetPageLocked(page); + SetPageSwapCache(page); + page->private = entry.val; + write_lock_irq(&swapper_space.tree_lock); error = radix_tree_insert(&swapper_space.page_tree, entry.val, page); if (!error) { - page_cache_get(page); - SetPageLocked(page); - SetPageSwapCache(page); - page->private = entry.val; total_swapcache_pages++; pagecache_acct(1); } write_unlock_irq(&swapper_space.tree_lock); radix_tree_preload_end(); + + if (error) { + __put_page(page); + ClearPageLocked(page); + ClearPageSwapCache(page); + } } return error; } Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h +++ linux-2.6/include/linux/page-flags.h @@ -167,16 +167,13 @@ extern void __mod_page_state(unsigned lo /* * Manipulation of page state flags */ -#define PageLocked(page) \ - test_bit(PG_locked, &(page)->flags) -#define SetPageLocked(page) \ - set_bit(PG_locked, &(page)->flags) -#define TestSetPageLocked(page) \ - test_and_set_bit(PG_locked, &(page)->flags) -#define ClearPageLocked(page) \ - clear_bit(PG_locked, &(page)->flags) -#define TestClearPageLocked(page) \ - test_and_clear_bit(PG_locked, &(page)->flags) +#define PageLocked(page) test_bit(PG_locked, &(page)->flags) +#define SetPageLocked(page) set_bit(PG_locked, &(page)->flags) +#define __SetPageLocked(page) __set_bit(PG_locked, &(page)->flags) +#define TestSetPageLocked(page) test_and_set_bit(PG_locked, &(page)->flags) +#define ClearPageLocked(page) clear_bit(PG_locked, &(page)->flags) +#define __ClearPageLocked(page) __clear_bit(PG_locked, &(page)->flags) 
+#define TestClearPageLocked(page) test_and_clear_bit(PG_locked, &(page)->flags) #define PageError(page) test_bit(PG_error, &(page)->flags) #define SetPageError(page) set_bit(PG_error, &(page)->flags) Index: linux-2.6/include/linux/pagemap.h =================================================================== --- linux-2.6.orig/include/linux/pagemap.h +++ linux-2.6/include/linux/pagemap.h @@ -108,6 +108,8 @@ extern struct page * find_or_create_page unsigned long index, unsigned int gfp_mask); unsigned find_get_pages(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages); +unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start, + unsigned int nr_pages, struct page **pages); unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, int tag, unsigned int nr_pages, struct page **pages); Index: linux-2.6/include/linux/pagevec.h =================================================================== --- linux-2.6.orig/include/linux/pagevec.h +++ linux-2.6/include/linux/pagevec.h @@ -25,6 +25,8 @@ void __pagevec_lru_add_active(struct pag void pagevec_strip(struct pagevec *pvec); unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping, pgoff_t start, unsigned nr_pages); +unsigned pagevec_lookup_nonatomic(struct pagevec *pvec, + struct address_space *mapping, pgoff_t start, unsigned nr_pages); unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping, pgoff_t *index, int tag, unsigned nr_pages); Index: linux-2.6/mm/swap.c =================================================================== --- linux-2.6.orig/mm/swap.c +++ linux-2.6/mm/swap.c @@ -380,6 +380,19 @@ unsigned pagevec_lookup(struct pagevec * return pagevec_count(pvec); } +/** + * pagevec_lookup_nonatomic - non atomic pagevec_lookup + * + * This routine is non-atomic in that it may return blah. 
+ */ +unsigned pagevec_lookup_nonatomic(struct pagevec *pvec, + struct address_space *mapping, pgoff_t start, unsigned nr_pages) +{ + pvec->nr = find_get_pages_nonatomic(mapping, start, + nr_pages, pvec->pages); + return pagevec_count(pvec); +} + unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping, pgoff_t *index, int tag, unsigned nr_pages) { Index: linux-2.6/mm/truncate.c =================================================================== --- linux-2.6.orig/mm/truncate.c +++ linux-2.6/mm/truncate.c @@ -126,7 +126,7 @@ void truncate_inode_pages(struct address pagevec_init(&pvec, 0); next = start; - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + while (pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) { for (i = 0; i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; pgoff_t page_index = page->index; @@ -160,7 +160,7 @@ void truncate_inode_pages(struct address next = start; for ( ; ; ) { cond_resched(); - if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + if (!pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) { if (next == start) break; next = start; @@ -206,7 +206,7 @@ unsigned long invalidate_mapping_pages(s pagevec_init(&pvec, 0); while (next <= end && - pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) { for (i = 0; i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -811,6 +811,7 @@ int mapping_tagged(struct address_space unsigned long flags; int ret; + /* XXX: radix_tree_tagged is safe to run without the lock */ read_lock_irqsave(&mapping->tree_lock, flags); ret = radix_tree_tagged(&mapping->page_tree, tag); read_unlock_irqrestore(&mapping->tree_lock, flags); Index: linux-2.6/mm/swapfile.c 
=================================================================== --- linux-2.6.orig/mm/swapfile.c +++ linux-2.6/mm/swapfile.c @@ -338,6 +338,7 @@ int remove_exclusive_swap_page(struct pa retval = 0; if (p->swap_map[swp_offset(entry)] == 1) { /* Recheck the page count with the swapcache lock held.. */ + SetPageFreeing(page); write_lock_irq(&swapper_space.tree_lock); if ((page_count(page) == 2) && !PageWriteback(page)) { __delete_from_swap_cache(page); @@ -345,6 +346,7 @@ int remove_exclusive_swap_page(struct pa retval = 1; } write_unlock_irq(&swapper_space.tree_lock); + ClearPageFreeing(page); } swap_info_put(p); ^ permalink raw reply [flat|nested] 56+ messages in thread
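The __add_to_swap_cache() hunk in patch 5 moves the page setup (reference, PG_locked, PG_swapcache, page->private) ahead of the radix tree insertion and undoes it if the insertion fails, presumably so that a lockless reader can never find a page in the tree that is only half set up. A minimal userspace sketch of that ordering follows, assuming C11 atomics. The names struct entry, table, cache_insert and demo_insert are hypothetical, and a compare-and-swap on a small array stands in for tree_lock plus radix_tree_insert(); it is not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 4

/* Hypothetical stand-in for struct page. */
struct entry {
	atomic_int refcount;
	int locked;
	unsigned long private;
};

/* Hypothetical stand-in for the radix tree: one published pointer per key. */
static struct entry *_Atomic table[NSLOTS];

/* Returns 0 on success, -1 if the key is out of range or already present. */
static int cache_insert(struct entry *e, unsigned long key)
{
	struct entry *expected = NULL;

	if (key >= NSLOTS)
		return -1;

	/* Set the entry up completely before any reader can find it. */
	atomic_fetch_add(&e->refcount, 1);	/* the cache's own reference */
	e->locked = 1;
	e->private = key;

	/* Publish. The CAS replaces tree_lock + radix_tree_insert() here. */
	if (!atomic_compare_exchange_strong(&table[key], &expected, e)) {
		/* Insertion failed: undo the setup, as the patch does. */
		e->locked = 0;
		atomic_fetch_sub(&e->refcount, 1);
		return -1;
	}
	return 0;
}

/* Usage: the second insert at the same key fails and is rolled back. */
static int demo_insert(void)
{
	static struct entry a, b;
	int first = cache_insert(&a, 1);
	int second = cache_insert(&b, 1);

	return first == 0 && second == -1 &&
	       atomic_load(&a.refcount) == 1 &&
	       atomic_load(&b.refcount) == 0 && b.locked == 0;
}
```

The point of the ordering is that the only window in which the entry is visible but wrong is removed; the cost is the rollback path when the insertion fails.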
* [patch 6] mm: spinlock tree_lock 2005-06-27 6:34 ` [patch 5] mm: lockless pagecache lookups Nick Piggin @ 2005-06-27 6:35 ` Nick Piggin 0 siblings, 0 replies; 56+ messages in thread From: Nick Piggin @ 2005-06-27 6:35 UTC (permalink / raw) To: linux-kernel, Linux Memory Management [-- Attachment #1: Type: text/plain, Size: 28 bytes --] -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-spinlock-tree_lock.patch --] [-- Type: text/plain, Size: 13830 bytes --] With practically all the read locks gone from mapping->tree_lock, convert the lock from an rwlock back to a spinlock. The remaining locks including the read locks mainly deal with IO submission and not the lookup fastpaths. Index: linux-2.6/fs/buffer.c =================================================================== --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -875,7 +875,7 @@ int __set_page_dirty_buffers(struct page spin_unlock(&mapping->private_lock); if (!TestSetPageDirty(page)) { - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); if (page->mapping) { /* Race with truncate?
*/ if (mapping_cap_account_dirty(mapping)) inc_page_state(nr_dirty); @@ -883,7 +883,7 @@ int __set_page_dirty_buffers(struct page page_index(page), PAGECACHE_TAG_DIRTY); } - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } Index: linux-2.6/fs/inode.c =================================================================== --- linux-2.6.orig/fs/inode.c +++ linux-2.6/fs/inode.c @@ -194,7 +194,7 @@ void inode_init_once(struct inode *inode sema_init(&inode->i_sem, 1); init_rwsem(&inode->i_alloc_sem); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); - rwlock_init(&inode->i_data.tree_lock); + spin_lock_init(&inode->i_data.tree_lock); spin_lock_init(&inode->i_data.i_mmap_lock); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); Index: linux-2.6/include/linux/fs.h =================================================================== --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -336,7 +336,7 @@ struct backing_dev_info; struct address_space { struct inode *host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ - rwlock_t tree_lock; /* and rwlock protecting it */ + spinlock_t tree_lock; /* and lock protecting it */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -120,9 +120,9 @@ void remove_from_page_cache(struct page BUG_ON(!PageLocked(page)); - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); __remove_from_page_cache(page); - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); } static int sync_page(void *word) @@ -383,13 
+383,13 @@ int add_to_page_cache(struct page *page, page->mapping = mapping; page->index = offset; - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); if (!error) { mapping->nrpages++; pagecache_acct(1); } - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); radix_tree_preload_end(); if (error) { @@ -647,12 +647,12 @@ unsigned find_get_pages(struct address_s unsigned int i; unsigned int ret; - read_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup(&mapping->page_tree, (void **)pages, start, nr_pages); for (i = 0; i < ret; i++) page_cache_get(pages[i]); - read_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); return ret; } @@ -690,14 +690,14 @@ unsigned find_get_pages_tag(struct addre unsigned int i; unsigned int ret; - read_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)pages, *index, nr_pages, tag); for (i = 0; i < ret; i++) page_cache_get(pages[i]); if (ret) *index = pages[ret - 1]->index + 1; - read_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); return ret; } Index: linux-2.6/mm/swap_state.c =================================================================== --- linux-2.6.orig/mm/swap_state.c +++ linux-2.6/mm/swap_state.c @@ -35,7 +35,7 @@ static struct backing_dev_info swap_back struct address_space swapper_space = { .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN), - .tree_lock = RW_LOCK_UNLOCKED, + .tree_lock = SPIN_LOCK_UNLOCKED, .a_ops = &swap_aops, .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear), .backing_dev_info = &swap_backing_dev_info, @@ -81,14 +81,14 @@ static int __add_to_swap_cache(struct pa SetPageSwapCache(page); page->private = entry.val; - write_lock_irq(&swapper_space.tree_lock); + spin_lock_irq(&swapper_space.tree_lock); error 
= radix_tree_insert(&swapper_space.page_tree, entry.val, page); if (!error) { total_swapcache_pages++; pagecache_acct(1); } - write_unlock_irq(&swapper_space.tree_lock); + spin_unlock_irq(&swapper_space.tree_lock); radix_tree_preload_end(); if (error) { @@ -210,9 +210,9 @@ void delete_from_swap_cache(struct page entry.val = page->private; - write_lock_irq(&swapper_space.tree_lock); + spin_lock_irq(&swapper_space.tree_lock); __delete_from_swap_cache(page); - write_unlock_irq(&swapper_space.tree_lock); + spin_unlock_irq(&swapper_space.tree_lock); swap_free(entry); page_cache_release(page); Index: linux-2.6/mm/swapfile.c =================================================================== --- linux-2.6.orig/mm/swapfile.c +++ linux-2.6/mm/swapfile.c @@ -339,13 +339,13 @@ int remove_exclusive_swap_page(struct pa if (p->swap_map[swp_offset(entry)] == 1) { /* Recheck the page count with the swapcache lock held.. */ SetPageFreeing(page); - write_lock_irq(&swapper_space.tree_lock); + spin_lock_irq(&swapper_space.tree_lock); if ((page_count(page) == 2) && !PageWriteback(page)) { __delete_from_swap_cache(page); SetPageDirty(page); retval = 1; } - write_unlock_irq(&swapper_space.tree_lock); + spin_unlock_irq(&swapper_space.tree_lock); ClearPageFreeing(page); } swap_info_put(p); Index: linux-2.6/mm/truncate.c =================================================================== --- linux-2.6.orig/mm/truncate.c +++ linux-2.6/mm/truncate.c @@ -76,15 +76,15 @@ invalidate_complete_page(struct address_ if (PagePrivate(page) && !try_to_release_page(page, 0)) return 0; - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); if (PageDirty(page)) { - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); return 0; } BUG_ON(PagePrivate(page)); __remove_from_page_cache(page); - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); ClearPageUptodate(page); page_cache_release(page); /* pagecache ref */ return 1; Index: 
linux-2.6/mm/vmscan.c =================================================================== --- linux-2.6.orig/mm/vmscan.c +++ linux-2.6/mm/vmscan.c @@ -505,7 +505,7 @@ static int shrink_list(struct list_head goto keep_locked; /* truncate got there first */ SetPageFreeing(page); - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); /* * The non-racy check for busy page. It is critical to check @@ -513,7 +513,7 @@ static int shrink_list(struct list_head * not in use by anybody. (pagecache + us == 2) */ if (page_count(page) != 2 || PageDirty(page)) { - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); ClearPageFreeing(page); goto keep_locked; } @@ -522,7 +522,7 @@ static int shrink_list(struct list_head if (PageSwapCache(page)) { swp_entry_t swap = { .val = page->private }; __delete_from_swap_cache(page); - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); swap_free(swap); __put_page(page); /* The pagecache ref */ goto free_it; @@ -530,7 +530,7 @@ static int shrink_list(struct list_head #endif /* CONFIG_SWAP */ __remove_from_page_cache(page); - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); __put_page(page); free_it: Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -623,7 +623,7 @@ int __set_page_dirty_nobuffers(struct pa struct address_space *mapping2; if (mapping) { - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); mapping2 = page_mapping(page); if (mapping2) { /* Race with truncate? 
*/ BUG_ON(mapping2 != mapping); @@ -632,7 +632,7 @@ int __set_page_dirty_nobuffers(struct pa radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); if (mapping->host) { /* !PageAnon && !swapper_space */ __mark_inode_dirty(mapping->host, @@ -707,17 +707,17 @@ int test_clear_page_dirty(struct page *p unsigned long flags; if (mapping) { - write_lock_irqsave(&mapping->tree_lock, flags); + spin_lock_irqsave(&mapping->tree_lock, flags); if (TestClearPageDirty(page)) { radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); + spin_unlock_irqrestore(&mapping->tree_lock, flags); if (mapping_cap_account_dirty(mapping)) dec_page_state(nr_dirty); return 1; } - write_unlock_irqrestore(&mapping->tree_lock, flags); + spin_unlock_irqrestore(&mapping->tree_lock, flags); return 0; } return TestClearPageDirty(page); @@ -762,13 +762,13 @@ int test_clear_page_writeback(struct pag if (mapping) { unsigned long flags; - write_lock_irqsave(&mapping->tree_lock, flags); + spin_lock_irqsave(&mapping->tree_lock, flags); ret = TestClearPageWriteback(page); if (ret) radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); - write_unlock_irqrestore(&mapping->tree_lock, flags); + spin_unlock_irqrestore(&mapping->tree_lock, flags); } else { ret = TestClearPageWriteback(page); } @@ -783,7 +783,7 @@ int test_set_page_writeback(struct page if (mapping) { unsigned long flags; - write_lock_irqsave(&mapping->tree_lock, flags); + spin_lock_irqsave(&mapping->tree_lock, flags); ret = TestSetPageWriteback(page); if (!ret) radix_tree_tag_set(&mapping->page_tree, @@ -793,7 +793,7 @@ int test_set_page_writeback(struct page radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); + 
spin_unlock_irqrestore(&mapping->tree_lock, flags); } else { ret = TestSetPageWriteback(page); } @@ -812,9 +812,9 @@ int mapping_tagged(struct address_space int ret; /* XXX: radix_tree_tagged is safe to run without the lock */ - read_lock_irqsave(&mapping->tree_lock, flags); + spin_lock_irqsave(&mapping->tree_lock, flags); ret = radix_tree_tagged(&mapping->page_tree, tag); - read_unlock_irqrestore(&mapping->tree_lock, flags); + spin_unlock_irqrestore(&mapping->tree_lock, flags); return ret; } EXPORT_SYMBOL(mapping_tagged); Index: linux-2.6/drivers/mtd/devices/block2mtd.c =================================================================== --- linux-2.6.orig/drivers/mtd/devices/block2mtd.c +++ linux-2.6/drivers/mtd/devices/block2mtd.c @@ -59,7 +59,7 @@ void cache_readahead(struct address_spac end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); - read_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); for (i = 0; i < PAGE_READAHEAD; i++) { pagei = index + i; if (pagei > end_index) { @@ -71,16 +71,16 @@ void cache_readahead(struct address_spac break; if (page) continue; - read_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); page = page_cache_alloc_cold(mapping); - read_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); if (!page) break; page->index = pagei; list_add(&page->lru, &page_pool); ret++; } - read_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); if (ret) read_cache_pages(mapping, &page_pool, filler, NULL); } Index: linux-2.6/include/asm-arm/cacheflush.h =================================================================== --- linux-2.6.orig/include/asm-arm/cacheflush.h +++ linux-2.6/include/asm-arm/cacheflush.h @@ -315,9 +315,9 @@ flush_cache_page(struct vm_area_struct * extern void flush_dcache_page(struct page *); #define flush_dcache_mmap_lock(mapping) \ - write_lock_irq(&(mapping)->tree_lock) + spin_lock_irq(&(mapping)->tree_lock) #define flush_dcache_mmap_unlock(mapping) \ - 
write_unlock_irq(&(mapping)->tree_lock) + spin_unlock_irq(&(mapping)->tree_lock) #define flush_icache_user_range(vma,page,addr,len) \ flush_dcache_page(page) Index: linux-2.6/include/asm-parisc/cacheflush.h =================================================================== --- linux-2.6.orig/include/asm-parisc/cacheflush.h +++ linux-2.6/include/asm-parisc/cacheflush.h @@ -57,9 +57,9 @@ flush_user_icache_range(unsigned long st extern void flush_dcache_page(struct page *page); #define flush_dcache_mmap_lock(mapping) \ - write_lock_irq(&(mapping)->tree_lock) + spin_lock_irq(&(mapping)->tree_lock) #define flush_dcache_mmap_unlock(mapping) \ - write_unlock_irq(&(mapping)->tree_lock) + spin_unlock_irq(&(mapping)->tree_lock) #define flush_icache_page(vma,page) do { flush_kernel_dcache_page(page_address(page)); flush_kernel_icache_page(page_address(page)); } while (0) ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [patch 2] mm: speculative get_page
2005-06-27 6:32 ` [patch 2] mm: speculative get_page Nick Piggin
2005-06-27 6:33 ` [patch 3] radix tree: lookup_slot Nick Piggin
@ 2005-06-27 14:12 ` William Lee Irwin III
2005-06-28 0:03 ` Nick Piggin
2005-06-28 12:45 ` Andy Whitcroft
2 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-27 14:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
> +static inline struct page *page_cache_get_speculative(struct page **pagep)
> +{
> +	struct page *page;
> +
> +	preempt_disable();
> +	page = *pagep;
> +	if (!page)
> +		goto out_failed;
> +
> +	if (unlikely(get_page_testone(page))) {
> +		/* Picked up a freed page */
> +		__put_page(page);
> +		goto out_failed;
> +	}
So you pick up 0->1 refcount transitions.
On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
> +	/*
> +	 * preempt can really be enabled here (only needs to be disabled
> +	 * because page allocation can spin on the elevated refcount, but
> +	 * we don't want to hold a reference on an unrelated page for too
> +	 * long, so keep preempt off until we know we have the right page
> +	 */
> +
> +	if (unlikely(PageFreeing(page)) ||
SetPageFreeing is only done in shrink_list(), so other pages in the
buddy bitmaps and/or pagecache pages freed by other methods may not
be found by this. There's also likely trouble with higher-order pages.
On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
> +	    unlikely(page != *pagep)) {
> +		/* Picked up a page being freed, or one that's been reused */
> +		put_page(page);
> +		goto out_failed;
> +	}
> +	preempt_enable();
> +
> +	return page;
> +
> +out_failed:
> +	preempt_enable();
> +	return NULL;
> +}
page != *pagep won't be reliably tripped unless the pagecache
modification has the appropriate memory barriers.
The lockless radix tree lookups are a harder problem than this, and
the implementation didn't look promising. I have other problems to deal
with so I'm not going to go very far into this.
While I agree that locklessness is the right direction for the
pagecache to go, this RFC seems to have too far to go to use it to
conclude anything about the subject.
-- wli
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
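The sequence under review (load the slot, raise the refcount, back off if the page was free, recheck the slot) can be sketched in userspace as below. This is a minimal illustration assuming C11 sequentially consistent atomics, not the kernel implementation: struct obj, obj_put, get_speculative and demo_lookup are hypothetical names, the count here starts from 0 for a free object (an assumption of the sketch, not a statement about how struct page stores its count), the PageFreeing test and preemption control are left out, and the object memory is assumed never to be returned to the system, so touching the count of a freed object is harmless.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only a reference count. */
struct obj {
	atomic_int refcount;	/* 0 means free; the cache slot holds one ref */
};

/* Drop one reference. Returns 1 if this was the last reference. */
static int obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refcount, 1) == 1;
}

/*
 * Speculatively take a reference on whatever *slot points to.
 * Returns the object with its refcount raised, or NULL if the caller
 * should retry or fall back to a locked lookup.
 */
static struct obj *get_speculative(struct obj *_Atomic *slot)
{
	struct obj *o = atomic_load(slot);

	if (!o)
		return NULL;

	/* 0 -> 1 transition: the object was already free, so back off. */
	if (atomic_fetch_add(&o->refcount, 1) == 0) {
		atomic_fetch_sub(&o->refcount, 1);
		return NULL;
	}

	/*
	 * Recheck the slot now that a reference is held. The default
	 * (seq_cst) atomics order the increment before this load.
	 */
	if (atomic_load(slot) != o) {
		obj_put(o);
		return NULL;
	}
	return o;
}

/* Usage: look up an object whose count starts at initial_refcount and
 * report the count afterwards. */
static int demo_lookup(int initial_refcount)
{
	struct obj o;
	struct obj *_Atomic slot;

	atomic_init(&o.refcount, initial_refcount);
	atomic_init(&slot, &o);
	get_speculative(&slot);
	return atomic_load(&o.refcount);
}
```

A live object (count 1) ends up with count 2 after a successful lookup; a free object (count 0) is left at 0 because the speculative increment is undone. Whether the recheck is reliable in the kernel depends on the barrier question raised in the review, which this sketch sidesteps by using sequentially consistent operations throughout.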
* Re: [patch 2] mm: speculative get_page
2005-06-27 14:12 ` [patch 2] mm: speculative get_page William Lee Irwin III
@ 2005-06-28 0:03 ` Nick Piggin
2005-06-28 0:56 ` Nick Piggin
2005-06-28 1:22 ` William Lee Irwin III
0 siblings, 2 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 0:03 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
> On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
>
>>+static inline struct page *page_cache_get_speculative(struct page **pagep)
>>+{
>>+	struct page *page;
>>+
>>+	preempt_disable();
>>+	page = *pagep;
>>+	if (!page)
>>+		goto out_failed;
>>+
>>+	if (unlikely(get_page_testone(page))) {
>>+		/* Picked up a freed page */
>>+		__put_page(page);
>>+		goto out_failed;
>>+	}
>
>
> So you pick up 0->1 refcount transitions.
>
Yep ie. a page that's freed or being freed.
>
> On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
>
>>+	/*
>>+	 * preempt can really be enabled here (only needs to be disabled
>>+	 * because page allocation can spin on the elevated refcount, but
>>+	 * we don't want to hold a reference on an unrelated page for too
>>+	 * long, so keep preempt off until we know we have the right page
>>+	 */
>>+
>>+	if (unlikely(PageFreeing(page)) ||
>
>
> SetPageFreeing is only done in shrink_list(), so other pages in the
> buddy bitmaps and/or pagecache pages freed by other methods may not
It is also done by remove_exclusive_swap_page, although that hunk
leaked into a later patch (#5), sorry.
Other methods (eg truncate) don't seem to have an atomicity guarantee
anyway - ie. it is valid to pick up a reference on a page that is
just about to get truncated. PageFreeing is only used when some code
is making an assumption about the number of users of the page.
> be found by this. There's also likely trouble with higher-order pages.
>
There isn't because higher order pages aren't used for pagecache.
>
> On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
>
>>+	    unlikely(page != *pagep)) {
>>+		/* Picked up a page being freed, or one that's been reused */
>>+		put_page(page);
>>+		goto out_failed;
>>+	}
>>+	preempt_enable();
>>+
>>+	return page;
>>+
>>+out_failed:
>>+	preempt_enable();
>>+	return NULL;
>>+}
>
>
> page != *pagep won't be reliably tripped unless the pagecache
> modification has the appropriate memory barriers.
>
There are appropriate memory barriers: the radix tree is
modified under the rwlock/spinlock, and this function has
a memory barrier before testing page != *pagep.
> The lockless radix tree lookups are a harder problem than this, and
> the implementation didn't look promising. I have other problems to deal
> with so I'm not going to go very far into this.
>
What's wrong with the lockless radix tree lookups?
> While I agree that locklessness is the right direction for the
> pagecache to go, this RFC seems to have too far to go to use it to
> conclude anything about the subject.
>
You don't seem to have looked enough to conclude anything about it.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [patch 2] mm: speculative get_page
2005-06-28 0:03 ` Nick Piggin
@ 2005-06-28 0:56 ` Nick Piggin
2005-06-28 1:22 ` William Lee Irwin III
1 sibling, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 0:56 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Linux Memory Management
Nick Piggin wrote:
> William Lee Irwin III wrote:
>
>> On Mon, Jun 27, 2005 at 04:32:38PM +1000, Nick Piggin wrote:
>>
>>> +static inline struct page *page_cache_get_speculative(struct page **pagep)
>>> +{
>>> +	struct page *page;
>>> +
>>> +	preempt_disable();
>>> +	page = *pagep;
>>> +	if (!page)
>>> +		goto out_failed;
>>> +
>>> +	if (unlikely(get_page_testone(page))) {
>>> +		/* Picked up a freed page */
>>> +		__put_page(page);
>>> +		goto out_failed;
>>> +	}
>>
>>
>>
>> So you pick up 0->1 refcount transitions.
>>
>
> Yep ie. a page that's freed or being freed.
>
Oh, one thing it does need is a check for PageFree(), so it also
picks up 1->2 and other transitions without freeing the free page
if the put()s are done out of order. Maybe that's what you were
alluding to.
I'll add that.
--
SUSE Labs, Novell Inc.
* Re: [patch 2] mm: speculative get_page
2005-06-28 0:03 ` Nick Piggin
2005-06-28 0:56 ` Nick Piggin
@ 2005-06-28 1:22 ` William Lee Irwin III
2005-06-28 1:42 ` Nick Piggin
1 sibling, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-28 1:22 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
>> SetPageFreeing is only done in shrink_list(), so other pages in the
>> buddy bitmaps and/or pagecache pages freed by other methods may not
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> It is also done by remove_exclusive_swap_page, although that hunk
> leaked into a later patch (#5), sorry.
> Other methods (eg truncate) don't seem to have an atomicity guarantee
> anyway - ie. it is valid to pick up a reference on a page that is
> just about to get truncated. PageFreeing is only used when some code
> is making an assumption about the number of users of the page.
tmpfs
William Lee Irwin III wrote:
>> be found by this. There's also likely trouble with higher-order pages.
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> There isn't because higher order pages aren't used for pagecache.
hugetlbfs
William Lee Irwin III wrote:
>> page != *pagep won't be reliably tripped unless the pagecache
>> modification has the appropriate memory barriers.
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> There are appropriate memory barriers: the radix tree is
> modified under the rwlock/spinlock, and this function has
> a memory barrier before testing page != *pagep.
Someone else deal with this (paulus? anton? other arch maintainers?).
William Lee Irwin III wrote:
>> The lockless radix tree lookups are a harder problem than this, and
>> the implementation didn't look promising. I have other problems to deal
>> with so I'm not going to go very far into this.
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> What's wrong with the lockless radix tree lookups?
The above is as much as I wanted to go into it. I need to direct my
capacity for the grunt work of devising adversary arguments elsewhere.
William Lee Irwin III wrote:
>> While I agree that locklessness is the right direction for the
>> pagecache to go, this RFC seems to have too far to go to use it to
>> conclude anything about the subject.
On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
> You don't seem to have looked enough to conclude anything about it.
You requested comments. I made some.
Anyhow, my review has not been comprehensive. I stopped after the first
few things I found that needed fixing. If others could deal with the
rest of this, I'd be much obliged.
-- wli
* Re: [patch 2] mm: speculative get_page
2005-06-28 1:22 ` William Lee Irwin III
@ 2005-06-28 1:42 ` Nick Piggin
2005-06-28 4:06 ` William Lee Irwin III
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 1:42 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>SetPageFreeing is only done in shrink_list(), so other pages in the
>>>buddy bitmaps and/or pagecache pages freed by other methods may not
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>It is also done by remove_exclusive_swap_page, although that hunk
>>leaked into a later patch (#5), sorry.
>>Other methods (eg truncate) don't seem to have an atomicity guarantee
>>anyway - ie. it is valid to pick up a reference on a page that is
>>just about to get truncated. PageFreeing is only used when some code
>>is making an assumption about the number of users of the page.
>
>
> tmpfs
>
Well it switches between page and swap cache, but it seems to just
use the normal pagecache / swapcache functions for that. It could be
that I've got a big hole somewhere, but so far I don't think you've
pointed one out.
>
> William Lee Irwin III wrote:
>
>>>be found by this. There's also likely trouble with higher-order pages.
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>There isn't because higher order pages aren't used for pagecache.
>
>
> hugetlbfs
>
Well what's the trouble with it?
>
> William Lee Irwin III wrote:
>
>>>page != *pagep won't be reliably tripped unless the pagecache
>>>modification has the appropriate memory barriers.
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>There are appropriate memory barriers: the radix tree is
>>modified under the rwlock/spinlock, and this function has
>>a memory barrier before testing page != *pagep.
>
>
> Someone else deal with this (paulus? anton? other arch maintainers?).
>
I know what a memory barrier is and does, so you said the
necessary memory barriers aren't in place, so can you deal
with it?
>
> William Lee Irwin III wrote:
>
>>>The lockless radix tree lookups are a harder problem than this, and
>>>the implementation didn't look promising. I have other problems to deal
>>>with so I'm not going to go very far into this.
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>What's wrong with the lockless radix tree lookups?
>
>
> The above is as much as I wanted to go into it. I need to direct my
> capacity for the grunt work of devising adversary arguments elsewhere.
>
I don't think there is anything wrong with it. I would be very
keen to see real adversary arguments elsewhere though.
>
> William Lee Irwin III wrote:
>
>>>While I agree that locklessness is the right direction for the
>>>pagecache to go, this RFC seems to have too far to go to use it to
>>>conclude anything about the subject.
>
>
> On Tue, Jun 28, 2005 at 10:03:00AM +1000, Nick Piggin wrote:
>
>>You don't seem to have looked enough to conclude anything about it.
>
>
> You requested comments. I made some.
>
Well yeah thanks, you did point out a thinko I made, and that was very
helpful and I value any time you spend looking at it. But just saying
"this is wrong, that won't work, that's crap, ergo the concept is
useless" without finding anything specifically wrong is not very
constructive.
--
SUSE Labs, Novell Inc.
* Re: [patch 2] mm: speculative get_page
2005-06-28 1:42 ` Nick Piggin
@ 2005-06-28 4:06 ` William Lee Irwin III
2005-06-28 4:50 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-28 4:06 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management
William Lee Irwin III wrote:
>> tmpfs
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> Well it switches between page and swap cache, but it seems to just
> use the normal pagecache / swapcache functions for that. It could be
> that I've got a big hole somewhere, but so far I don't think you've
> pointed one out.
Its radix tree movement bypasses the page allocator.
William Lee Irwin III wrote:
>> hugetlbfs
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> Well what's the trouble with it?
hugetlb reallocation doesn't go through the page allocator either.
William Lee Irwin III wrote:
>> Someone else deal with this (paulus? anton? other arch maintainers?).
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> I know what a memory barrier is and does, so you said the
> necessary memory barriers aren't in place, so can you deal
> with it?
spin_unlock() does not imply a memory barrier.
William Lee Irwin III wrote:
>> The above is as much as I wanted to go into it. I need to direct my
>> capacity for the grunt work of devising adversary arguments elsewhere.
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> I don't think there is anything wrong with it. I would be very
> keen to see real adversary arguments elsewhere though.
They take time to construct.
William Lee Irwin III wrote:
>> You requested comments. I made some.
On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> Well yeah thanks, you did point out a thinko I made, and that was very
> helpful and I value any time you spend looking at it. But just saying
> "this is wrong, that won't work, that's crap, ergo the concept is
> useless" without finding anything specifically wrong is not very
> constructive.
I said nothing of that kind, and I did point out specific things.
The limitation of time/effort is directly related to the nature of
the responses.
-- wli
* Re: [patch 2] mm: speculative get_page
2005-06-28 4:06 ` William Lee Irwin III
@ 2005-06-28 4:50 ` Nick Piggin
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 4:50 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Linux Memory Management

William Lee Irwin III wrote:

>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>Well it switches between page and swap cache, but it seems to just
>>use the normal pagecache / swapcache functions for that. It could be
>>that I've got a big hole somewhere, but so far I don't think you've
>>pointed one out.
>
>Its radix tree movement bypasses the page allocator.

That should be fine. Net result is the page has been looked up. What
kind of atomicity did you imagine the locked find_get_page provides
that I haven't?

>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>Well what's the trouble with it?
>
>hugetlb reallocation doesn't go through the page allocator either.

Ditto. Net result is that the page has been looked up. The speculative
get page will recheck that it is in the radix tree after taking a
reference, and if so then it assumes that reference to be valid. What
is the hangup with the page allocator?

>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>I know what a memory barrier is and does, so you said the
>>necessary memory barriers aren't in place, so can you deal
>>with it?
>
>spin_unlock() does not imply a memory barrier.

Intriguing...

>William Lee Irwin III wrote:
>
>>>The above is as much as I wanted to go into it. I need to direct my
>>>capacity for the grunt work of devising adversary arguments elsewhere.
>
>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>I don't think there is anything wrong with it. I would be very
>>keen to see real adversary arguments elsewhere though.
>
>They take time to construct.

I can imagine. I don't think I've seen one yet.

>William Lee Irwin III wrote:
>
>>>You requested comments. I made some.
>
>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>
>>Well yeah thanks, you did point out a thinko I made, and that was very
>>helpful and I value any time you spend looking at it. But just saying
>>"this is wrong, that won't work, that's crap, ergo the concept is
>>useless" without finding anything specifically wrong is not very
>>constructive.
>
>I said nothing of that kind, and I did point out specific things.

You said "this RFC seems to have too far to go to use it to conclude
anything about the subject", after failing to find any holes in the
actual implementation.

And (paraphrasing) "this needs memory barriers but I won't say where
or why, somebody else deal with it" doesn't count as a specific thing.
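The lookup scheme Nick describes above (take a reference, then recheck that the page is still in the tree) can be sketched in userspace roughly as follows. This is an illustration with C11 atomics and not the code from the patch: `struct page`, `page_slot`, `get_page_unless_zero`, `put_page` and `speculative_get_page` are simplified stand-ins, a single atomic pointer plays the role of a radix tree slot, and the page's memory is assumed to stay valid while a lookup may be touching it. The page flags the patch set uses to deal with freed pages are not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct page (assumption). */
struct page {
    atomic_int refcount;    /* 0 means nobody holds the page */
};

/* One atomic pointer stands in for a radix tree slot (assumption). */
typedef _Atomic(struct page *) page_slot;

/* Take a reference only if the count is not already zero. */
static int get_page_unless_zero(struct page *p)
{
    int old = atomic_load(&p->refcount);

    while (old != 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
            return 1;
    }
    return 0;
}

/* Drop a reference taken earlier. */
static void put_page(struct page *p)
{
    int old = atomic_fetch_sub(&p->refcount, 1);

    assert(old > 0);
    (void)old;
}

/*
 * Lockless lookup: read the slot, take a speculative reference, then
 * recheck that the slot still points at the same page. If it does, the
 * reference is treated as valid; otherwise it is dropped and the lookup
 * retries. Note that this loops until the slot changes if it points at
 * a page whose count has already dropped to zero.
 */
static struct page *speculative_get_page(page_slot *slot)
{
    for (;;) {
        struct page *p = atomic_load(slot);

        if (p == NULL)
            return NULL;            /* nothing cached at this index */
        if (!get_page_unless_zero(p))
            continue;               /* page on its way out, look again */
        if (atomic_load(slot) == p)
            return p;               /* still in the "tree": reference holds */
        put_page(p);                /* slot changed under us, retry */
    }
}
```

A caller would pair each successful `speculative_get_page()` with a later `put_page()`, as with a locked lookup. The recheck is what stands in for the atomicity the lock used to provide.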
* Re: [patch 2] mm: speculative get_page, Re: [patch 2] mm: speculative get_page
2005-06-28 4:50 ` Nick Piggin
@ 2005-06-28 5:08 ` David S. Miller, Nick Piggin
2005-06-28 5:34 ` Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: David S. Miller, Nick Piggin @ 2005-06-28 5:08 UTC (permalink / raw)
To: nickpiggin; +Cc: wli, linux-kernel, linux-mm

> William Lee Irwin III wrote:
>
> >On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> >
> >spin_unlock() does not imply a memory barrier.
>
> Intriguing...

BTW, I disagree with this assertion. spin_unlock() does imply a
memory barrier.

All memory operations before the release of the lock must execute
before the lock release memory operation is globally visible.
* Re: [patch 2] mm: speculative get_page
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
@ 2005-06-28 5:34 ` Nick Piggin
2005-06-28 14:19 ` William Lee Irwin III
2005-06-28 21:32 ` Jesse Barnes
2 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 5:34 UTC (permalink / raw)
To: David S. Miller; +Cc: wli, linux-kernel, linux-mm

David S. Miller wrote:

>From: Nick Piggin <nickpiggin@yahoo.com.au>
>Subject: Re: [patch 2] mm: speculative get_page
>Date: Tue, 28 Jun 2005 14:50:31 +1000
>
>>William Lee Irwin III wrote:
>>
>>>On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
>>>
>>>spin_unlock() does not imply a memory barrier.
>>
>>Intriguing...
>
>BTW, I disagree with this assertion. spin_unlock() does imply a
>memory barrier.
>
>All memory operations before the release of the lock must execute
>before the lock release memory operation is globally visible.

Yes, it appears that way from looking at a sample set of arch code
too (ie. those without strictly ordered stores put an explicit
barrier there).

I've always understood spin_unlock to imply a barrier.
* Re: [patch 2] mm: speculative get_page
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
2005-06-28 5:34 ` Nick Piggin
@ 2005-06-28 14:19 ` William Lee Irwin III
2005-06-28 15:43 ` Nick Piggin
2005-06-28 21:32 ` Jesse Barnes
2 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2005-06-28 14:19 UTC (permalink / raw)
To: David S. Miller; +Cc: nickpiggin, linux-kernel, linux-mm

On Mon, Jun 27, 2005 at 10:08:27PM -0700, David S. Miller wrote:
> BTW, I disagree with this assertion. spin_unlock() does imply a
> memory barrier.
> All memory operations before the release of the lock must execute
> before the lock release memory operation is globally visible.

The affected architectures have only recently changed in this regard.
ppc64 was the most notable case, where it had a barrier for MMIO
(eieio) but not a general memory barrier. PA-RISC likewise formerly had
no such barrier and was a more normal case, with no barrier whatsoever.

Both have since been altered, ppc64 acquiring a heavyweight sync
(arch nomenclature), and PA-RISC acquiring 2 memory barriers.

-- wli
* Re: [patch 2] mm: speculative get_page
2005-06-28 14:19 ` William Lee Irwin III
@ 2005-06-28 15:43 ` Nick Piggin
2005-06-28 17:01 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 15:43 UTC (permalink / raw)
To: William Lee Irwin III
Cc: David S. Miller, linux-kernel, linux-mm, Anton Blanchard

William Lee Irwin III wrote:
> On Mon, Jun 27, 2005 at 10:08:27PM -0700, David S. Miller wrote:
>
>>BTW, I disagree with this assertion. spin_unlock() does imply a
>>memory barrier.
>>All memory operations before the release of the lock must execute
>>before the lock release memory operation is globally visible.
>
> The affected architectures have only recently changed in this regard.
> ppc64 was the most notable case, where it had a barrier for MMIO
> (eieio) but not a general memory barrier. PA-RISC likewise formerly had
> no such barrier and was a more normal case, with no barrier whatsoever.
>
> Both have since been altered, ppc64 acquiring a heavyweight sync
> (arch nomenclature), and PA-RISC acquiring 2 memory barriers.

Parisc looks like it's doing the extra memory barrier to "be safe" :P

Re the ppc64 changeset: It looks to me like lwsync is the lightweight
sync, and eieio is just referred to as the lightER (than sync) weight
sync. What's more, it looks like eieio does order stores to system
memory and is not just an MMIO barrier.

But nit picking aside, is it true that we need a load barrier before
unlock? (store barrier I agree with) The ppc64 changeset in question
indicates yes, but I can't quite work out why. There are noises in the
archives about this, but I didn't pinpoint a conclusion...
* Re: [patch 2] mm: speculative get_page
2005-06-28 15:43 ` Nick Piggin
@ 2005-06-28 17:01 ` Christoph Lameter
2005-06-28 23:10 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2005-06-28 17:01 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, David S. Miller, linux-kernel, linux-mm, Anton Blanchard

On Wed, 29 Jun 2005, Nick Piggin wrote:

> But nit picking aside, is it true that we need a load barrier before
> unlock? (store barrier I agree with) The ppc64 changeset in question
> indicates yes, but I can't quite work out why. There are noises in the
> archives about this, but I didn't pinpoint a conclusion...

A spinlock may be used to read a consistent set of variables. If load
operations would be moved below the spin_unlock then one may get values
that have been updated after another process acquired the spinlock.
* Re: [patch 2] mm: speculative get_page
2005-06-28 17:01 ` Christoph Lameter
@ 2005-06-28 23:10 ` Nick Piggin
0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 23:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: William Lee Irwin III, David S. Miller, linux-kernel, linux-mm, Anton Blanchard

Christoph Lameter wrote:
> On Wed, 29 Jun 2005, Nick Piggin wrote:
>
>>But nit picking aside, is it true that we need a load barrier before
>>unlock? (store barrier I agree with) The ppc64 changeset in question
>>indicates yes, but I can't quite work out why. There are noises in the
>>archives about this, but I didn't pinpoint a conclusion...
>
> A spinlock may be used to read a consistent set of variables. If load
> operations would be moved below the spin_unlock then one may get values
> that have been updated after another process acquired the spinlock.

Of course, thanks. I was only thinking of the case where loads were
moved from the unlocked into the locked section.
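Christoph's point about loads can be illustrated outside the kernel. What follows is a minimal userspace sketch using C11 atomics, not kernel code: `spin_lock` and `spin_unlock` here are hypothetical stand-ins built on `atomic_flag`, and `shared`, `pair_set` and `pair_read_consistent` are invented for the example. The release ordering on the unlock store is what keeps the loads of `shared.a` and `shared.b` from being moved below the unlock; the acquire ordering on the lock side keeps them from moving above it.

```c
#include <stdatomic.h>

/* Hypothetical userspace spinlock, standing in for the kernel's. */
struct spinlock {
    atomic_flag locked;
};

static void spin_lock(struct spinlock *l)
{
    /* Acquire: accesses after the lock cannot move above it. */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;   /* spin until the holder clears the flag */
}

static void spin_unlock(struct spinlock *l)
{
    /*
     * Release: no load or store that precedes the unlock in program
     * order may be reordered after this store.
     */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Two variables that are only meaningful when read together. */
struct pair {
    int a;
    int b;
};

static struct spinlock pair_lock = { ATOMIC_FLAG_INIT };
static struct pair shared = { 0, 0 };

/* Writer: update both halves under the lock. */
static void pair_set(int v)
{
    spin_lock(&pair_lock);
    shared.a = v;
    shared.b = v;
    spin_unlock(&pair_lock);
}

/*
 * Reader: if either load could sink below spin_unlock(), it might see
 * a value written by a later lock holder, and a and b could disagree.
 */
static int pair_read_consistent(void)
{
    int a, b;

    spin_lock(&pair_lock);
    a = shared.a;
    b = shared.b;
    spin_unlock(&pair_lock);
    return a == b;
}
```

Release ordering is one-way: accesses that follow the unlock in program order may still be moved up into the critical section, which is harmless for this pattern.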
* Re: [patch 2] mm: speculative get_page
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
2005-06-28 5:34 ` Nick Piggin
2005-06-28 14:19 ` William Lee Irwin III
@ 2005-06-28 21:32 ` Jesse Barnes
2005-06-28 22:17 ` Christoph Lameter
2 siblings, 1 reply; 56+ messages in thread
From: Jesse Barnes @ 2005-06-28 21:32 UTC (permalink / raw)
To: David S. Miller; +Cc: nickpiggin, wli, linux-kernel, linux-mm

On Monday, June 27, 2005 10:08 pm, David S. Miller wrote:
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Subject: Re: [patch 2] mm: speculative get_page
> Date: Tue, 28 Jun 2005 14:50:31 +1000
>
> > William Lee Irwin III wrote:
> > >On Tue, Jun 28, 2005 at 11:42:16AM +1000, Nick Piggin wrote:
> > >
> > >spin_unlock() does not imply a memory barrier.
> >
> > Intriguing...
>
> BTW, I disagree with this assertion. spin_unlock() does imply a
> memory barrier.
>
> All memory operations before the release of the lock must execute
> before the lock release memory operation is globally visible.

On ia64 at least, the unlock is only a one way barrier. The store to
release the lock uses release semantics (since the lock is declared
volatile), which implies that prior stores are visible before the
unlock occurs, but subsequent accesses can 'float up' above the unlock.
See http://www.gelato.unsw.edu.au/linux-ia64/0304/5122.html for some
more details.

Jesse
* Re: [patch 2] mm: speculative get_page
2005-06-28 21:32 ` Jesse Barnes
@ 2005-06-28 22:17 ` Christoph Lameter
0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2005-06-28 22:17 UTC (permalink / raw)
To: Jesse Barnes; +Cc: David S. Miller, nickpiggin, wli, linux-kernel, linux-mm

On Tue, 28 Jun 2005, Jesse Barnes wrote:

> On ia64 at least, the unlock is only a one way barrier. The store to
> release the lock uses release semantics (since the lock is declared
> volatile), which implies that prior stores are visible before the
> unlock occurs, but subsequent accesses can 'float up' above the unlock.
> See http://www.gelato.unsw.edu.au/linux-ia64/0304/5122.html for some
> more details.

The manual talks about "accesses" not stores. So this applies to loads
and stores. Subsequent accesses can float up but only accesses prior to
the instruction with release semantics (like an unlock) are guaranteed
to be visible.
* Re: [patch 2] mm: speculative get_page
2005-06-27 6:32 ` [patch 2] mm: speculative get_page Nick Piggin
2005-06-27 6:33 ` [patch 3] radix tree: lookup_slot Nick Piggin
2005-06-27 14:12 ` [patch 2] mm: speculative get_page William Lee Irwin III
@ 2005-06-28 12:45 ` Andy Whitcroft
2005-06-28 13:16 ` Nick Piggin
2 siblings, 1 reply; 56+ messages in thread
From: Andy Whitcroft @ 2005-06-28 12:45 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Linux Memory Management

Nick Piggin wrote:

>  #define PG_free 20 /* Page is on the free lists */
> +#define PG_freeing 21 /* PG_refcount about to be freed */

Wow this needs two new page bits. That might be a problem ongoing.
There are only 24 of these puppies and this takes us to just two
remaining. Do we really need _two_ to track free?

One obvious area of overlap might be the PG_nosave_free which seems to
be set on free pages for software suspend. Perhaps that and PG_free
will be equivalent in intent (though maintained differently) and allow
us to recover a bit?

There are a couple of bits which imply ownership such as PG_slab,
PG_swapcache and PG_reserved which to my mind are all exclusive.
Perhaps those plus the PG_free could be combined into an owner field. I
am unsure if the PG_freeing can be 'backed out'; if not, it may also
combine?

Mumble ...

-apw
* Re: [patch 2] mm: speculative get_page
2005-06-28 12:45 ` Andy Whitcroft
@ 2005-06-28 13:16 ` Nick Piggin
2005-06-28 16:02 ` Dave Hansen
2005-06-29 16:31 ` Pavel Machek
0 siblings, 2 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-28 13:16 UTC (permalink / raw)
To: Andy Whitcroft; +Cc: linux-kernel, Linux Memory Management

Andy Whitcroft wrote:
> Nick Piggin wrote:
>
>> #define PG_free 20 /* Page is on the free lists */
>>+#define PG_freeing 21 /* PG_refcount about to be freed */
>
> Wow this needs two new page bits. That might be a problem ongoing.
> There are only 24 of these puppies and this takes us to just two
> remaining. Do we really need _two_ to track free?

Yeah they are kind of different. PG_freeing isn't a really good
description for it. Basically it is set to guarantee a page won't
gain any more references (real, not speculative) than what page_count
returns.

I'm in the process of recovering one of those with an earlier set of
patches (PG_reserved).

> One obvious area of overlap might be the PG_nosave_free which seems to
> be set on free pages for software suspend. Perhaps that and PG_free
> will be equivalent in intent (though maintained differently) and allow
> us to recover a bit?

PG_free can't be shared with anything else, unfortunately. It doesn't
need to be an atomic flag though, so it can be an "impossible"
combination of flags.

> There are a couple of bits which imply ownership such as PG_slab,
> PG_swapcache and PG_reserved which to my mind are all exclusive.
> Perhaps those plus the PG_free could be combined into a owner field. I
> am unsure if the PG_freeing can be 'backed out' if not it may also combine?

I think there are a few ways that bits can be reclaimed if we
start digging. swsusp uses 2 which seems excessive though may be
fully justified. Can PG_private be replaced by (!page->private)?
Can filesystems easily stop using PG_checked?

OK, I'll cut the hand-waving: PG_free used to be derived from
PG_private && page_count == 0, so it could instead be
PG_active && !PG_lru quite easily AFAIKS. If this patchset ever looks
like being merged you can take me up on it ;)

--
SUSE Labs, Novell Inc.
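The "impossible combination" idea Nick mentions can be sketched in plain C as below. This is only an illustration: the bit positions, the cut-down `struct page` and the helper names are hypothetical, and it assumes, as Nick's remark implies, that `PG_active` is never otherwise set on a page that is not on the LRU. Because the derived state is read and written non-atomically, this only works for a flag that does not need atomic updates, which is the condition Nick gives for `PG_free`.

```c
#include <assert.h>

/* Hypothetical bit positions, chosen for the example only. */
#define PG_lru      (1UL << 0)
#define PG_active   (1UL << 1)

/* Cut-down stand-in for the kernel's struct page (assumption). */
struct page {
    unsigned long flags;
};

/*
 * "Free" is encoded as PG_active && !PG_lru, a combination assumed to
 * be unused otherwise, so it costs no extra bit in page->flags.
 */
static int page_free(const struct page *p)
{
    return (p->flags & (PG_active | PG_lru)) == PG_active;
}

/* Mark a page free; it must already be off the LRU. */
static void set_page_free(struct page *p)
{
    assert(!(p->flags & PG_lru));
    p->flags |= PG_active;
}

/* Undo set_page_free() when the page is handed out again. */
static void clear_page_free(struct page *p)
{
    assert(page_free(p));
    p->flags &= ~PG_active;
}
```

The cost of the trick is that every test of the derived state has to look at two bits rather than one, and every user of the borrowed bits has to respect the reserved combination.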
* Re: [patch 2] mm: speculative get_page
2005-06-28 13:16 ` Nick Piggin
@ 2005-06-28 16:02 ` Dave Hansen
2005-06-29 16:31 ` Pavel Machek
2005-06-29 16:31 ` Pavel Machek
1 sibling, 1 reply; 56+ messages in thread
From: Dave Hansen @ 2005-06-28 16:02 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andy Whitcroft, linux-kernel, Linux Memory Management

On Tue, 2005-06-28 at 23:16 +1000, Nick Piggin wrote:
> I think there are a few ways that bits can be reclaimed if we
> start digging. swsusp uses 2 which seems excessive though may be
> fully justified.

They (swsusp) actually don't need the bits at all until suspend-time.
Somebody coded up a "dynamic page flags" patch that let them kill
the page->flags use, but it didn't really go anywhere. Might be nice if
someone dug it up. I probably have a copy somewhere.

-- Dave
* Re: [patch 2] mm: speculative get_page
2005-06-28 16:02 ` Dave Hansen
@ 2005-06-29 16:31 ` Pavel Machek
2005-06-29 18:43 ` Dave Hansen
0 siblings, 1 reply; 56+ messages in thread
From: Pavel Machek @ 2005-06-29 16:31 UTC (permalink / raw)
To: Dave Hansen
Cc: Nick Piggin, Andy Whitcroft, linux-kernel, Linux Memory Management

Hi!

> > I think there are a few ways that bits can be reclaimed if we
> > start digging. swsusp uses 2 which seems excessive though may be
> > fully justified.
>
> They (swsusp) actually don't need the bits at all until suspend-time, at
> all. Somebody coded up a "dynamic page flags" patch that let them kill
> the page->flags use, but it didn't really go anywhere. Might be nice if
> someone dug it up. I probably have a copy somewhere.

Unfortunately that patch was rather ugly :-(.

Pavel
--
teflon -- maybe it is a trademark, but it should not be.
* Re: [patch 2] mm: speculative get_page
2005-06-29 16:31 ` Pavel Machek
@ 2005-06-29 18:43 ` Dave Hansen
2005-06-29 21:22 ` Pavel Machek
0 siblings, 1 reply; 56+ messages in thread
From: Dave Hansen @ 2005-06-29 18:43 UTC (permalink / raw)
To: Pavel Machek
Cc: Nick Piggin, Andy Whitcroft, linux-kernel, Linux Memory Management

On Wed, 2005-06-29 at 18:31 +0200, Pavel Machek wrote:
> > > I think there are a few ways that bits can be reclaimed if we
> > > start digging. swsusp uses 2 which seems excessive though may be
> > > fully justified.
> >
> > They (swsusp) actually don't need the bits at all until suspend-time, at
> > all. Somebody coded up a "dynamic page flags" patch that let them kill
> > the page->flags use, but it didn't really go anywhere. Might be nice if
> > someone dug it up. I probably have a copy somewhere.
>
> Unfortunately that patch was rather ugly :-(.

Do you think the idea was ugly, or just the implementation? Is there
something that you'd rather see?

-- Dave
* Re: [patch 2] mm: speculative get_page
2005-06-29 18:43 ` Dave Hansen
@ 2005-06-29 21:22 ` Pavel Machek
0 siblings, 0 replies; 56+ messages in thread
From: Pavel Machek @ 2005-06-29 21:22 UTC (permalink / raw)
To: Dave Hansen
Cc: Nick Piggin, Andy Whitcroft, linux-kernel, Linux Memory Management

Hi!

> > > > I think there are a few ways that bits can be reclaimed if we
> > > > start digging. swsusp uses 2 which seems excessive though may be
> > > > fully justified.
> > >
> > > They (swsusp) actually don't need the bits at all until suspend-time, at
> > > all. Somebody coded up a "dynamic page flags" patch that let them kill
> > > the page->flags use, but it didn't really go anywhere. Might be nice if
> > > someone dug it up. I probably have a copy somewhere.
> >
> > Unfortunately that patch was rather ugly :-(.
>
> Do you think the idea was ugly, or just the implementation? Is there
> something that you'd rather see?

Well, implementation was ugly and idea was unnecessary because we
still had bits left. We could spare bits for swsusp by defining
"PageReserved | PageLocked => PageNosave" etc.... simply by choosing
some otherwise unused combinations. swsusp is not performance
critical...

Pavel
--
Boycott Kodak -- for their patent abuse against Java.
* Re: [patch 2] mm: speculative get_page
2005-06-28 13:16 ` Nick Piggin
2005-06-28 16:02 ` Dave Hansen
@ 2005-06-29 16:31 ` Pavel Machek
1 sibling, 0 replies; 56+ messages in thread
From: Pavel Machek @ 2005-06-29 16:31 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andy Whitcroft, linux-kernel, Linux Memory Management

Hi!

> >There are a couple of bits which imply ownership such as PG_slab,
> >PG_swapcache and PG_reserved which to my mind are all exclusive.
> >Perhaps those plus the PG_free could be combined into a owner field. I
> >am unsure if the PG_freeing can be 'backed out' if not it may also combine?
>
> I think there are a few ways that bits can be reclaimed if we
> start digging. swsusp uses 2 which seems excessive though may be
> fully justified. Can PG_private be replaced by (!page->private)?
> Can filesystems easily stop using PG_checked?

It is possible that swsusp could reduce its bit usage... Current stuff
works, but probably does not need strong atomicity guarantees, and
could use some bit combination...

Pavel
--
teflon -- maybe it is a trademark, but it should not be.
* VFS scalability (was: [rfc] lockless pagecache)
2005-06-27 6:29 [rfc] lockless pagecache Nick Piggin
2005-06-27 6:32 ` [patch 1] mm: PG_free flag Nick Piggin
@ 2005-06-27 6:43 ` Nick Piggin
2005-06-27 7:13 ` Andi Kleen
2005-06-27 7:46 ` [rfc] lockless pagecache Andrew Morton
2005-06-29 10:49 ` Hirokazu Takahashi
3 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 6:43 UTC (permalink / raw)
To: linux-kernel, Linux Memory Management

Just an interesting aside, when first testing the patch I was using
read(2) instead of nopage faults. I ran into some surprising results
there which I don't have the time to follow up at the moment - it
might be worth investigating if someone has the time, regardless the
state of the lockless pagecache work.

For the parallel workload as described in the parent post (but read
instead of fault), the vanilla kernel profile looks like this:

 74453 total                       0.0121
 25839 update_atime               44.8594
 19595 _read_unlock_irq          306.1719
 13025 do_generic_mapping_read     5.5758
  9374 rw_verify_area             29.2937
  1739 ia64_pal_call_static        9.0573
  1567 default_idle                4.0807
  1114 __copy_user                 0.4704
   848 _spin_lock                  8.8333
   786 ia64_spinlock_contention    8.1875
   246 ia64_save_scratch_fpregs    3.8438
   187 ia64_load_scratch_fpregs    2.9219
    16 file_read_actor             0.0263
    15 fsys_bubble_down            0.0586
    12 vfs_read                    0.0170

This is with the filesystem mounted as noatime, so I can't work
out why update_atime is so high on the list. I suspect maybe a
false sharing issue with some other fields.

--
SUSE Labs, Novell Inc.
* Re: VFS scalability (was: [rfc] lockless pagecache)
2005-06-27 6:43 ` VFS scalability (was: [rfc] lockless pagecache) Nick Piggin
@ 2005-06-27 7:13 ` Andi Kleen
2005-06-27 7:33 ` VFS scalability Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2005-06-27 7:13 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, linux-mm

Nick Piggin <nickpiggin@yahoo.com.au> writes:

> This is with the filesystem mounted as noatime, so I can't work
> out why update_atime is so high on the list. I suspect maybe a
> false sharing issue with some other fields.

Did all the 64CPUs write to the same file?

Then update_atime was just the messenger - it is the first function
to read the inode so it eats the cache miss overhead.

Maybe adding a prefetch for it at the beginning of sys_read()
might help, but then with 64CPUs writing to parts of the inode
it will always thrash no matter how many prefetches.

-Andi
* Re: VFS scalability
2005-06-27 7:13 ` Andi Kleen
@ 2005-06-27 7:33 ` Nick Piggin
2005-06-27 7:44 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 7:33 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, linux-mm

Andi Kleen wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> writes:
>
>>This is with the filesystem mounted as noatime, so I can't work
>>out why update_atime is so high on the list. I suspect maybe a
>>false sharing issue with some other fields.
>
> Did all the 64CPUs write to the same file?

Yes.

> Then update_atime was just the messenger - it is the first function
> to read the inode so it eats the cache miss overhead.

I agree.

> Maybe adding a prefetch for it at the beginning of sys_read()
> might help, but then with 64CPUs writing to parts of the inode
> it will always thrash no matter how many prefetches.

True. I'm just not sure what is causing the bouncing - I guess
->f_count due to get_file()?

rw_verify_area is another that is taking a lot of hits - probably
due to the same cacheline(s) as update_atime.

Unless I'm mistaken, the big difference between the read fault and
the read(2) cases is that mmap holds a reference on the file, while
open(2) doesn't? I guess if anyone really cares about that, they
could hack up a flag to tell the file to remain pinned.

--
SUSE Labs, Novell Inc.
* Re: VFS scalability
2005-06-27 7:33 ` VFS scalability Nick Piggin
@ 2005-06-27 7:44 ` Andi Kleen
2005-06-27 8:03 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2005-06-27 7:44 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, linux-kernel, linux-mm

On Mon, Jun 27, 2005 at 05:33:43PM +1000, Nick Piggin wrote:
> >Maybe adding a prefetch for it at the beginning of sys_read()
> >might help, but then with 64CPUs writing to parts of the inode
> >it will always thrash no matter how many prefetches.
>
> True. I'm just not sure what is causing the bouncing - I guess
> ->f_count due to get_file()?

That's in the file, not in the inode. It must be some inode field.
I don't know which one.

There is probably some oprofile/perfmon event that could tell
you which function dirties the cacheline.

-Andi
* Re: VFS scalability
2005-06-27 7:44 ` Andi Kleen
@ 2005-06-27 8:03 ` Nick Piggin
0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2005-06-27 8:03 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, linux-mm

Andi Kleen wrote:
> On Mon, Jun 27, 2005 at 05:33:43PM +1000, Nick Piggin wrote:
>
>>>Maybe adding a prefetch for it at the beginning of sys_read()
>>>might help, but then with 64CPUs writing to parts of the inode
>>>it will always thrash no matter how many prefetches.
>>
>>True. I'm just not sure what is causing the bouncing - I guess
>>->f_count due to get_file()?
>
> That's in the file, not in the inode. It must be some inode field.
> I don't know which one.

Oh yes, my mistake.

> There is probably some oprofile/perfmon event that could tell
> you which function dirties the cacheline.

I'll see if I can work it out. Thanks.

--
SUSE Labs, Novell Inc.
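The false-sharing effect suspected in this sub-thread can be illustrated with a struct layout. The sketch below is hypothetical: `struct inode_like` and its field names are invented, the thread never establishes which inode field is actually being dirtied, and the 64-byte `CACHELINE` value is an assumption (the line size on the machine under test may differ). It only shows the general remedy of keeping a frequently written field on a different cache line from read-mostly fields, so that writers on other CPUs do not invalidate the line readers depend on.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64    /* assumed cache line size in bytes */

/* Hypothetical inode-like structure; field names are made up. */
struct inode_like {
    /* Read-mostly fields, consulted on every read(2). */
    alignas(CACHELINE) unsigned long i_flags;
    long long i_size;

    /* Frequently written field, pushed onto its own cache line. */
    alignas(CACHELINE) long i_hot_counter;
};

/* The hot field must not share a line with the read-mostly fields. */
static_assert(offsetof(struct inode_like, i_hot_counter) / CACHELINE !=
              offsetof(struct inode_like, i_size) / CACHELINE,
              "hot and read-mostly fields share a cache line");
```

Splitting fields this way trades memory for scalability: the structure grows to a multiple of the line size, which matters for something allocated as often as an inode.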
* Re: [rfc] lockless pagecache
From: Andrew Morton @ 2005-06-27 7:46 UTC
To: Nick Piggin; +Cc: linux-kernel, linux-mm

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> First I'll put up some numbers to get you interested - of a 64-way Altix
> with 64 processes each read-faulting in their own 512MB part of a 32GB
> file that is preloaded in pagecache (with the proper NUMA memory
> allocation).

I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
16-page faultahead.
* Re: [rfc] lockless pagecache
From: Nick Piggin @ 2005-06-27 8:02 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-mm

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>First I'll put up some numbers to get you interested - of a 64-way Altix
>> with 64 processes each read-faulting in their own 512MB part of a 32GB
>> file that is preloaded in pagecache (with the proper NUMA memory
>> allocation).
>
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.
>
>

Definitely, for the microbenchmark I was testing with.

However I think for Oracle and others that use shared memory like
this, they are probably not doing linear access, so that would be a
net loss. I'm not completely sure (I don't have access to real loads
at the moment), but I would have thought those guys would have looked
into fault ahead if it were a possibility.

Also, the memory usage regression cases that fault ahead brings makes it
a bit contentious.

I like that the lockless patch completely removes the problem at its
source and even makes the serial path lighter. The other thing is, the
speculative get_page may be useful for more code than just pagecache
lookups. But it is fairly tricky I'll give you that.

Anyway it is obviously not something that can go in tomorrow. At the
very least the PageReserved patches need to go in first, and even they
will need a lot of testing out of tree.

Perhaps it can be discussed at KS and we can think about what to do with
it after that - that kind of time frame. No rush.

Oh yeah, and obviously it would be nice if it provided real improvements
on real workloads too ;)

--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
From: Andrew Morton @ 2005-06-27 8:15 UTC
To: Nick Piggin; +Cc: linux-kernel, linux-mm

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Also, the memory usage regression cases that fault ahead brings makes it
> a bit contentious.

faultahead consumes no more memory: if the page is present then point a pte
at it. It'll make reclaim work a bit harder in some situations.

> I like that the lockless patch completely removes the problem at its
> source and even makes the serial path lighter. The other thing is, the
> speculative get_page may be useful for more code than just pagecache
> lookups. But it is fairly tricky I'll give you that.

Yes, it's scary-looking stuff.

> Anyway it is obviously not something that can go in tomorrow. At the
> very least the PageReserved patches need to go in first, and even they
> will need a lot of testing out of tree.
>
> Perhaps it can be discussed at KS and we can think about what to do with
> it after that - that kind of time frame. No rush.
>
> Oh yeah, and obviously it would be nice if it provided real improvements
> on real workloads too ;)

umm, yes.
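[Editorial sketch] The faultahead argument above can be made concrete with a toy model: every page is already in the pagecache (as in the benchmark), and each fault takes the lock once and installs PTEs for a window of pages. This is a userspace simulation under stated assumptions, not kernel code: `struct toy_mm`, `touch_page()` and `run_pattern()` are hypothetical names, and one "lock acquisition" stands in for one ->tree_lock round trip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 1024  /* size of the toy mapping, in pages */

/* Toy address space: which pages already have a PTE, plus counters. */
struct toy_mm {
    bool pte_present[NPAGES];
    unsigned long lock_acquisitions;  /* stand-in for ->tree_lock traffic */
    unsigned long ptes_installed;     /* setup (and later teardown) work */
};

static void toy_mm_init(struct toy_mm *mm)
{
    memset(mm, 0, sizeof(*mm));
}

/* Access one page. If it has no PTE yet, "fault": take the lock once and
 * map up to `window` consecutive pages starting at `index`. */
static void touch_page(struct toy_mm *mm, size_t index, size_t window)
{
    assert(index < NPAGES && window >= 1);

    if (mm->pte_present[index])
        return;  /* already mapped, no fault and no lock */

    mm->lock_acquisitions++;
    for (size_t i = index; i < index + window && i < NPAGES; i++) {
        if (!mm->pte_present[i]) {
            mm->pte_present[i] = true;
            mm->ptes_installed++;
        }
    }
}

/* Touch every `stride`-th page once and return the number of lock
 * acquisitions that access pattern needed. */
static unsigned long run_pattern(struct toy_mm *mm, size_t stride,
                                 size_t window)
{
    assert(stride >= 1);

    toy_mm_init(mm);
    for (size_t i = 0; i < NPAGES; i += stride)
        touch_page(mm, i, window);
    return mm->lock_acquisitions;
}
```

In this model a linear scan (`stride` 1) needs 1024 lock acquisitions with a window of 1 and 64 with a window of 16. A sparse pattern (`stride` 32, window 16) still takes one lock per touched page and installs 512 PTEs of which only 32 are ever used, which is the setup and teardown cost raised elsewhere in the thread. The model ignores reclaim and real memory placement entirely.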
* Re: [rfc] lockless pagecache
From: Nick Piggin @ 2005-06-27 8:28 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-mm

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>Also, the memory usage regression cases that fault ahead brings makes it
>> a bit contentious.
>
>
> faultahead consumes no more memory: if the page is present then point a pte
> at it. It'll make reclaim work a bit harder in some situations.
>

Oh OK we'll call that faultahead and Christoph's thing prefault then.

I suspect it may still be a net loss for those that are running into
tree_lock contention, but we'll see.

--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
From: Lincoln Dale @ 2005-06-27 8:56 UTC
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm

Nick Piggin wrote:
[..]
> However I think for Oracle and others that use shared memory like
> this, they are probably not doing linear access, so that would be a
> net loss. I'm not completely sure (I don't have access to real loads
> at the moment), but I would have thought those guys would have looked
> into fault ahead if it were a possibility.

i thought those guys used O_DIRECT - in which case, wouldn't the page
cache not be used?

cheers,

lincoln.
* Re: [rfc] lockless pagecache
From: Nick Piggin @ 2005-06-27 9:04 UTC
To: Lincoln Dale; +Cc: Andrew Morton, linux-kernel, linux-mm

Lincoln Dale wrote:
> Nick Piggin wrote:
> [..]
>
>> However I think for Oracle and others that use shared memory like
>> this, they are probably not doing linear access, so that would be a
>> net loss. I'm not completely sure (I don't have access to real loads
>> at the moment), but I would have thought those guys would have looked
>> into fault ahead if it were a possibility.
>
>
> i thought those guys used O_DIRECT - in which case, wouldn't the page
> cache not be used?
>

Well I think they do use O_DIRECT for their IO, but they need to
use the Linux pagecache for their shared memory - that shared
memory being the basis for their page cache. I think. Whatever
the setup I believe they have issues with the tree_lock, which is
why it was changed to an rwlock.

--
SUSE Labs, Novell Inc.
* RE: [rfc] lockless pagecache
From: Chen, Kenneth W @ 2005-06-27 18:14 UTC
To: 'Nick Piggin', Lincoln Dale; +Cc: Andrew Morton, linux-kernel, linux-mm

Nick Piggin wrote on Monday, June 27, 2005 2:04 AM
> >> However I think for Oracle and others that use shared memory like
> >> this, they are probably not doing linear access, so that would be a
> >> net loss. I'm not completely sure (I don't have access to real loads
> >> at the moment), but I would have thought those guys would have looked
> >> into fault ahead if it were a possibility.
> >
> >
> > i thought those guys used O_DIRECT - in which case, wouldn't the page
> > cache not be used?
> >
>
> Well I think they do use O_DIRECT for their IO, but they need to
> use the Linux pagecache for their shared memory - that shared
> memory being the basis for their page cache. I think. Whatever
> the setup I believe they have issues with the tree_lock, which is
> why it was changed to an rwlock.

Typically shared memory is used as db buffer cache, and O_DIRECT is
performed on these buffer cache (hence O_DIRECT on the shared memory).
You must be thinking some other workload. Nevertheless, for OLTP type
of db workload, tree_lock hasn't been a problem so far.

- Ken
* RE: [rfc] lockless pagecache
From: Badari Pulavarty @ 2005-06-27 18:50 UTC
To: Chen, Kenneth W
Cc: 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel, linux-mm

On Mon, 2005-06-27 at 11:14 -0700, Chen, Kenneth W wrote:
> Nick Piggin wrote on Monday, June 27, 2005 2:04 AM
> > >> However I think for Oracle and others that use shared memory like
> > >> this, they are probably not doing linear access, so that would be a
> > >> net loss. I'm not completely sure (I don't have access to real loads
> > >> at the moment), but I would have thought those guys would have looked
> > >> into fault ahead if it were a possibility.
> > >
> > >
> > > i thought those guys used O_DIRECT - in which case, wouldn't the page
> > > cache not be used?
> > >
> >
> > Well I think they do use O_DIRECT for their IO, but they need to
> > use the Linux pagecache for their shared memory - that shared
> > memory being the basis for their page cache. I think. Whatever
> > the setup I believe they have issues with the tree_lock, which is
> > why it was changed to an rwlock.
>
> Typically shared memory is used as db buffer cache, and O_DIRECT is
> performed on these buffer cache (hence O_DIRECT on the shared memory).
> You must be thinking some other workload. Nevertheless, for OLTP type
> of db workload, tree_lock hasn't been a problem so far.

What about DSS ? I need to go back and verify some of the profiles
we have.

Thanks,
Badari
* RE: [rfc] lockless pagecache
From: Chen, Kenneth W @ 2005-06-27 19:05 UTC
To: 'Badari Pulavarty'
Cc: 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel, linux-mm

Badari Pulavarty wrote on Monday, June 27, 2005 11:51 AM
> On Mon, 2005-06-27 at 11:14 -0700, Chen, Kenneth W wrote:
> > Typically shared memory is used as db buffer cache, and O_DIRECT is
> > performed on these buffer cache (hence O_DIRECT on the shared memory).
> > You must be thinking some other workload. Nevertheless, for OLTP type
> > of db workload, tree_lock hasn't been a problem so far.
>
> What about DSS ? I need to go back and verify some of the profiles
> we have.

I don't recall seeing tree_lock to be a problem for DSS workload either.
* RE: [rfc] lockless pagecache
From: Christoph Lameter @ 2005-06-27 19:22 UTC
To: Chen, Kenneth W
Cc: 'Badari Pulavarty', 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel, linux-mm

On Mon, 27 Jun 2005, Chen, Kenneth W wrote:

> I don't recall seeing tree_lock to be a problem for DSS workload either.

I have seen the tree_lock being a problem a number of times with large
scale NUMA type workloads.
* RE: [rfc] lockless pagecache
From: Chen, Kenneth W @ 2005-06-27 19:42 UTC
To: 'Christoph Lameter'
Cc: 'Badari Pulavarty', 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel, linux-mm

Christoph Lameter wrote on Monday, June 27, 2005 12:23 PM
> On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> > I don't recall seeing tree_lock to be a problem for DSS workload either.
>
> I have seen the tree_lock being a problem a number of times with large
> scale NUMA type workloads.

I totally agree! My earlier posts are strictly referring to industry
standard db workloads (OLTP, DSS). I'm not saying it's not a problem
for everyone :-) Obviously you just outlined a few ....

- Ken
* Re: [rfc] lockless pagecache
From: Sonny Rao @ 2005-07-05 15:11 UTC
To: Chen, Kenneth W
Cc: 'Christoph Lameter', 'Badari Pulavarty', 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel, linux-mm

On Mon, Jun 27, 2005 at 12:42:44PM -0700, Chen, Kenneth W wrote:
> Christoph Lameter wrote on Monday, June 27, 2005 12:23 PM
> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
> >
> > I have seen the tree_lock being a problem a number of times with large
> > scale NUMA type workloads.
>
> I totally agree! My earlier posts are strictly referring to industry
> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
> for everyone :-) Obviously you just outlined a few ....

I'm a bit late to the party here (was gone on vacation), but I do have
profiles from DSS workloads using page-cache rather than O_DIRECT and
I do see spin_lock_irq() in the profiles which I'm pretty certain are
locks spinning for access to the radix_tree. I'll talk about it a bit
more up in Ottawa but here's the top 5 on my profile (sorry don't have
the number of ticks at the moment):

1. dedicated_idle (waiting for I/O)
2. __copy_tofrom_user
3. radix_tree_delete
4. _spin_lock_irq
5. __find_get_block

So, yes, if the page-cache is used in a DSS workload then one will see
the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
NUMA factor.

Sonny
* Re: [rfc] lockless pagecache
From: Martin J. Bligh @ 2005-07-05 15:31 UTC
To: Sonny Rao, Chen, Kenneth W
Cc: 'Christoph Lameter', 'Badari Pulavarty', 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel, linux-mm

>> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
>> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
>> >
>> > I have seen the tree_lock being a problem a number of times with large
>> > scale NUMA type workloads.
>>
>> I totally agree! My earlier posts are strictly referring to industry
>> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
>> for everyone :-) Obviously you just outlined a few ....
>
> I'm a bit late to the party here (was gone on vacation), but I do have
> profiles from DSS workloads using page-cache rather than O_DIRECT and
> I do see spin_lock_irq() in the profiles which I'm pretty certain are
> locks spinning for access to the radix_tree. I'll talk about it a bit
> more up in Ottawa but here's the top 5 on my profile (sorry don't have
> the number of ticks at the moment):
>
> 1. dedicated_idle (waiting for I/O)
> 2. __copy_tofrom_user
> 3. radix_tree_delete
> 4. _spin_lock_irq
> 5. __find_get_block
>
> So, yes, if the page-cache is used in a DSS workload then one will see
> the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
> NUMA factor.

The easiest way to confirm the spin-lock thing is to recompile with
CONFIG_SPINLINE, and take a new profile, then diff the two ...

M.
* Re: [rfc] lockless pagecache
From: Sonny Rao @ 2005-07-05 15:37 UTC
To: Martin J. Bligh
Cc: Chen, Kenneth W, 'Christoph Lameter', 'Badari Pulavarty', 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel, linux-mm

On Tue, Jul 05, 2005 at 08:31:40AM -0700, Martin J. Bligh wrote:
> >> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> >> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
> >> >
> >> > I have seen the tree_lock being a problem a number of times with large
> >> > scale NUMA type workloads.
> >>
> >> I totally agree! My earlier posts are strictly referring to industry
> >> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
> >> for everyone :-) Obviously you just outlined a few ....
> >
> > I'm a bit late to the party here (was gone on vacation), but I do have
> > profiles from DSS workloads using page-cache rather than O_DIRECT and
> > I do see spin_lock_irq() in the profiles which I'm pretty certain are
> > locks spinning for access to the radix_tree. I'll talk about it a bit
> > more up in Ottawa but here's the top 5 on my profile (sorry don't have
> > the number of ticks at the moment):
> >
> > 1. dedicated_idle (waiting for I/O)
> > 2. __copy_tofrom_user
> > 3. radix_tree_delete
> > 4. _spin_lock_irq
> > 5. __find_get_block
> >
> > So, yes, if the page-cache is used in a DSS workload then one will see
> > the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
> > NUMA factor.
>
> The easiest way to confirm the spin-lock thing is to recompile with
> CONFIG_SPINLINE, and take a new profile, then diff the two ...

Yep... Unfortunately, this is broken in PPC64 since 2.6.9-rc2 or
something like that, I never had a chance to track down what the issue
was exactly. IIRC, there was a lot of churn in the spinlocking code
around that time.

Sonny
* Re: [rfc] lockless pagecache
From: Benjamin LaHaise @ 2005-06-27 13:17 UTC
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm

On Mon, Jun 27, 2005 at 06:02:15PM +1000, Nick Piggin wrote:
> However I think for Oracle and others that use shared memory like
> this, they are probably not doing linear access, so that would be a
> net loss. I'm not completely sure (I don't have access to real loads
> at the moment), but I would have thought those guys would have looked
> into fault ahead if it were a possibility.

Shared memory overhead doesn't show up on any of the database benchmarks
I've seen, as they tend to use huge pages that are locked in memory, and
thus don't tend to access the page cache at all after ramp up.

-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler
* Re: [rfc] lockless pagecache
From: Nick Piggin @ 2005-06-28 0:32 UTC
To: Benjamin LaHaise; +Cc: Andrew Morton, linux-kernel, linux-mm

Benjamin LaHaise wrote:
> On Mon, Jun 27, 2005 at 06:02:15PM +1000, Nick Piggin wrote:
>
>>However I think for Oracle and others that use shared memory like
>>this, they are probably not doing linear access, so that would be a
>>net loss. I'm not completely sure (I don't have access to real loads
>>at the moment), but I would have thought those guys would have looked
>>into fault ahead if it were a possibility.
>
>
> Shared memory overhead doesn't show up on any of the database benchmarks
> I've seen, as they tend to use huge pages that are locked in memory, and
> thus don't tend to access the page cache at all after ramp up.
>

To be quite honest I don't have any real workloads here that stress
it, however I was told that it is a problem for oracle database. If
there is anyone else who has problems then I'd be interested to hear
them as well.

--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
From: William Lee Irwin III @ 2005-06-28 1:26 UTC
To: Nick Piggin; +Cc: Benjamin LaHaise, Andrew Morton, linux-kernel, linux-mm

Benjamin LaHaise wrote:
>> Shared memory overhead doesn't show up on any of the database benchmarks
>> I've seen, as they tend to use huge pages that are locked in memory, and
>> thus don't tend to access the page cache at all after ramp up.

On Tue, Jun 28, 2005 at 10:32:51AM +1000, Nick Piggin wrote:
> To be quite honest I don't have any real workloads here that stress
> it, however I was told that it is a problem for oracle database. If
> there is anyone else who has problems then I'd be interested to hear
> them as well.

It's vlm-specific.

-- wli
* Re: [rfc] lockless pagecache
From: Martin J. Bligh @ 2005-06-27 14:08 UTC
To: Andrew Morton, Nick Piggin; +Cc: linux-kernel, linux-mm

--Andrew Morton <akpm@osdl.org> wrote (on Monday, June 27, 2005 00:46:24 -0700):

> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>
>> First I'll put up some numbers to get you interested - of a 64-way Altix
>> with 64 processes each read-faulting in their own 512MB part of a 32GB
>> file that is preloaded in pagecache (with the proper NUMA memory
>> allocation).
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.

Maybe true, but when we last tried that, faultahead sucked for performance
in a more general sense. All the extra setup and teardown cost for
unnecessary PTEs kills you, even if it's only 4 pages or so.

M.
* Re: [rfc] lockless pagecache
From: Christoph Lameter @ 2005-06-27 17:49 UTC
To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, linux-mm

On Mon, 27 Jun 2005, Andrew Morton wrote:

> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > First I'll put up some numbers to get you interested - of a 64-way Altix
> > with 64 processes each read-faulting in their own 512MB part of a 32GB
> > file that is preloaded in pagecache (with the proper NUMA memory
> > allocation).
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.

Could be worked into the prefault patch.... Good idea.
* Re: [rfc] lockless pagecache
From: Hirokazu Takahashi @ 2005-06-29 10:49 UTC
To: nickpiggin; +Cc: linux-kernel, linux-mm

Hi Nick,

Your patches improve the performance if lots of processes are
accessing the same file at the same time, right?

If so, I think we can introduce multiple radix-trees instead,
which enhance each inode to be able to have two or more radix-trees
in it to avoid the race condition traversing the trees.
Some decision mechanism is needed which radix-tree each page
should be in, how many radix-tree should be prepared.

It seems to be simple and effective.

What do you think?

> Now the tree_lock was recently(ish) converted to an rwlock, precisely
> for such a workload and that was apparently very successful. However
> an rwlock is significantly heavier, and as machines get faster and
> bigger, rwlocks (and any locks) will tend to use more and more of Paul
> McKenney's toilet paper due to cacheline bouncing.
>
> So in the interest of saving some trees, let's try it without any locks.
>
> First I'll put up some numbers to get you interested - of a 64-way Altix
> with 64 processes each read-faulting in their own 512MB part of a 32GB
> file that is preloaded in pagecache (with the proper NUMA memory
> allocation).

Thanks,
Hirokazu Takahashi.
* Re: [rfc] lockless pagecache
From: Nick Piggin @ 2005-06-29 11:38 UTC
To: Hirokazu Takahashi; +Cc: linux-kernel, linux-mm

Hirokazu Takahashi wrote:
> Hi Nick,
>

Hi,

> Your patches improve the performance if lots of processes are
> accessing the same file at the same time, right?
>

Yes.

> If so, I think we can introduce multiple radix-trees instead,
> which enhance each inode to be able to have two or more radix-trees
> in it to avoid the race condition traversing the trees.
> Some decision mechanism is needed which radix-tree each page
> should be in, how many radix-tree should be prepared.
>
> It seems to be simple and effective.
>
> What do you think?
>

Sure it is a possibility.

I don't think you could call it effective like a completely
lockless version is effective. You might take more locks during
gang lookups, you may have a lot of ugly and not-always-working
heuristics (hey, my app goes really fast if it spreads accesses
over a 1GB file, but falls on its face with a 10MB one). You
might get increased cache footprints for common operations.

I mainly did the patches for a bit of fun rather than to address
a particular problem with a real workload and as such I won't be
pushing to get them in the kernel for the time being.

--
SUSE Labs, Novell Inc.
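[Editorial sketch] The "1GB file versus 10MB file" objection can be illustrated with a toy version of the multiple-tree idea, where each page index is assigned to one of several per-inode trees by fixed-size pieces of the file. Everything here is an assumption made for illustration: the tree count, the 64 MiB piece size, the 4 KiB page size, and the names `tree_for_index()` and `trees_touched()` are hypothetical and do not come from any posted patch.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TREES    4   /* assumed number of radix trees per inode */
#define PAGE_SHIFT  12  /* assumed 4 KiB pages */
#define PIECE_BYTES ((size_t)64 << 20)             /* assumed 64 MiB pieces */
#define PIECE_PAGES (PIECE_BYTES >> PAGE_SHIFT)    /* pages per piece */

static_assert(NR_TREES > 0, "need at least one tree");

/* Pick the tree (and therefore the lock) that covers a page index:
 * consecutive fixed-size pieces of the file rotate through the trees. */
static unsigned int tree_for_index(size_t page_index)
{
    return (unsigned int)((page_index / PIECE_PAGES) % NR_TREES);
}

/* Count how many distinct trees a file of the given size would use,
 * i.e. how far lookups over the whole file can spread across locks. */
static unsigned int trees_touched(size_t file_bytes)
{
    size_t page_bytes = (size_t)1 << PAGE_SHIFT;
    size_t npages = (file_bytes + page_bytes - 1) >> PAGE_SHIFT;
    unsigned int seen[NR_TREES] = { 0 };
    unsigned int count = 0;

    for (size_t i = 0; i < npages; i++) {
        unsigned int t = tree_for_index(i);
        if (!seen[t]) {
            seen[t] = 1;
            count++;
        }
    }
    return count;
}
```

With these numbers a 1 GiB file spreads over all 4 trees, while a 10 MiB file fits inside a single piece and so still funnels every lookup through one tree and one lock. Shrinking the piece size helps the small file but makes range (gang) lookups cross more trees, which is the kind of tuning heuristic the reply above is wary of.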
* Re: [rfc] lockless pagecache
2005-06-29 11:38 ` Nick Piggin
@ 2005-06-30 3:32 ` Hirokazu Takahashi
0 siblings, 0 replies; 56+ messages in thread
From: Hirokazu Takahashi @ 2005-06-30 3:32 UTC (permalink / raw)
To: nickpiggin; +Cc: linux-kernel, linux-mm
Hi,
> > Your patches improve the performance if lots of processes are
> > accessing the same file at the same time, right?
> >
>
> Yes.
>
> > If so, I think we can introduce multiple radix-trees instead,
> > which enhance each inode to be able to have two or more radix-trees
> > in it to avoid the race condition traversing the trees.
> > Some decision mechanism is needed which radix-tree each page
> > should be in, how many radix-tree should be prepared.
> >
> > It seems to be simple and effective.
> >
> > What do you think?
> >
>
> Sure it is a possibility.
>
> I don't think you could call it effective like a completely
> lockless version is effective. You might take more locks during
> gang lookups, you may have a lot of ugly and not-always-working
> heuristics (hey, my app goes really fast if it spreads accesses
> over a 1GB file, but falls on its face with a 10MB one). You
> might get increased cache footprints for common operations.

I guess it would be enough to split a huge file into the same size
pieces simply and put each of them in its associated radix-tree
in most cases for practical use.

And I also feel your approach is interesting.

> I mainly did the patches for a bit of fun rather than to address
> a particular problem with a real workload and as such I won't be
> pushing to get them in the kernel for the time being.

I see.

I propose another idea if you don't mind, seqlock seems to make
your code much simpler though I'm not sure whether it works well
under heavy load. It would become stable without the tricks,
which makes VM hard to be enhanced in the future.

Thanks,
Hirokazu Takahashi.
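The seqlock idea raised in the message above can be sketched roughly as follows. This is a userspace approximation using C11 atomics, not the kernel's seqlock_t API and not code from the patches. The struct seq_slot, its fields, and the slot_*() helpers are hypothetical, and the sketch assumes writers are already serialised by some other lock.

```c
#include <stdatomic.h>

/* Hypothetical slot protected by a sequence counter. */
struct seq_slot {
	atomic_uint  seq;      /* odd while a write is in progress */
	atomic_ulong page_id;  /* stand-in for the page pointer */
	atomic_ulong index;    /* stand-in for the page's file offset */
};

/* Writer side: callers must already exclude other writers. */
static void slot_write(struct seq_slot *s, unsigned long page_id,
		       unsigned long index)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	/* Make the counter odd so readers know an update is under way. */
	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&s->page_id, page_id, memory_order_relaxed);
	atomic_store_explicit(&s->index, index, memory_order_relaxed);

	/* Make the counter even again, publishing the new contents. */
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side: returns 1 on a consistent snapshot, 0 if it must retry. */
static int slot_try_read(struct seq_slot *s, unsigned long *page_id,
			 unsigned long *index)
{
	unsigned int begin = atomic_load_explicit(&s->seq, memory_order_acquire);

	if (begin & 1)
		return 0;	/* a write is in progress */

	*page_id = atomic_load_explicit(&s->page_id, memory_order_relaxed);
	*index = atomic_load_explicit(&s->index, memory_order_relaxed);

	/* Order the data loads before re-checking the counter. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->seq, memory_order_relaxed) == begin;
}

/* Spin until a consistent snapshot has been read. */
static void slot_read(struct seq_slot *s, unsigned long *page_id,
		      unsigned long *index)
{
	while (!slot_try_read(s, page_id, index))
		;
}
```

Readers in this scheme never write to shared memory, so they do not bounce the cacheline among themselves, but they have to retry whenever a write overlaps their read. That retry loop is where the "heavy load" concern in the message would show up: a steady stream of insertions or removals could keep readers spinning.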
end of thread, other threads:[~2005-07-05 15:37 UTC | newest]
Thread overview: 56+ messages
2005-06-27 6:29 [rfc] lockless pagecache Nick Piggin
2005-06-27 6:32 ` [patch 1] mm: PG_free flag Nick Piggin
2005-06-27 6:32 ` [patch 2] mm: speculative get_page Nick Piggin
2005-06-27 6:33 ` [patch 3] radix tree: lookup_slot Nick Piggin
2005-06-27 6:34 ` [patch 4] radix tree: lockless readside Nick Piggin
2005-06-27 6:34 ` [patch 5] mm: lockless pagecache lookups Nick Piggin
2005-06-27 6:35 ` [patch 6] mm: spinlock tree_lock Nick Piggin
2005-06-27 14:12 ` [patch 2] mm: speculative get_page William Lee Irwin III
2005-06-28 0:03 ` Nick Piggin
2005-06-28 0:56 ` Nick Piggin
2005-06-28 1:22 ` William Lee Irwin III
2005-06-28 1:42 ` Nick Piggin
2005-06-28 4:06 ` William Lee Irwin III
2005-06-28 4:50 ` Nick Piggin
2005-06-28 5:08 ` [patch 2] mm: speculative get_page, " David S. Miller, Nick Piggin
2005-06-28 5:34 ` Nick Piggin
2005-06-28 14:19 ` William Lee Irwin III
2005-06-28 15:43 ` Nick Piggin
2005-06-28 17:01 ` Christoph Lameter
2005-06-28 23:10 ` Nick Piggin
2005-06-28 21:32 ` Jesse Barnes
2005-06-28 22:17 ` Christoph Lameter
2005-06-28 12:45 ` Andy Whitcroft
2005-06-28 13:16 ` Nick Piggin
2005-06-28 16:02 ` Dave Hansen
2005-06-29 16:31 ` Pavel Machek
2005-06-29 18:43 ` Dave Hansen
2005-06-29 21:22 ` Pavel Machek
2005-06-29 16:31 ` Pavel Machek
2005-06-27 6:43 ` VFS scalability (was: [rfc] lockless pagecache) Nick Piggin
2005-06-27 7:13 ` Andi Kleen
2005-06-27 7:33 ` VFS scalability Nick Piggin
2005-06-27 7:44 ` Andi Kleen
2005-06-27 8:03 ` Nick Piggin
2005-06-27 7:46 ` [rfc] lockless pagecache Andrew Morton
2005-06-27 8:02 ` Nick Piggin
2005-06-27 8:15 ` Andrew Morton
2005-06-27 8:28 ` Nick Piggin
2005-06-27 8:56 ` Lincoln Dale
2005-06-27 9:04 ` Nick Piggin
2005-06-27 18:14 ` Chen, Kenneth W
2005-06-27 18:50 ` Badari Pulavarty
2005-06-27 19:05 ` Chen, Kenneth W
2005-06-27 19:22 ` Christoph Lameter
2005-06-27 19:42 ` Chen, Kenneth W
2005-07-05 15:11 ` Sonny Rao
2005-07-05 15:31 ` Martin J. Bligh
2005-07-05 15:37 ` Sonny Rao
2005-06-27 13:17 ` Benjamin LaHaise
2005-06-28 0:32 ` Nick Piggin
2005-06-28 1:26 ` William Lee Irwin III
2005-06-27 14:08 ` Martin J. Bligh
2005-06-27 17:49 ` Christoph Lameter
2005-06-29 10:49 ` Hirokazu Takahashi
2005-06-29 11:38 ` Nick Piggin
2005-06-30 3:32 ` Hirokazu Takahashi