[patch] radix-tree: avoid atomic allocations for preloaded insertions

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch] radix-tree: avoid atomic allocations for preloaded insertions
@ 2007-11-08  0:43 Nick Piggin
  2007-11-08  1:09 ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2007-11-08  0:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Memory Management List, David Miller

OK, here's this patch again. This time I come with real failures on real
systems (in this case, David is running some 'dd' pagecache throughput
tests).

I haven't got him to retest it yet, but I think the idea is just a no-brainer.
We significantly reduce maximum tree_lock(W) hold time, and we reduce the
amount of GFP_ATOMIC allocations.

--

Most pagecache (and some other) radix tree insertions have the great
opportunity to preallocate a few nodes with relaxed gfp flags. But
the preallocation is squandered when it comes time to allocate a node,
we default to first attempting a GFP_ATOMIC allocation -- that doesn't
normally fail, but it can eat into atomic memory reserves that we
don't need to be using.

Another upshot of this is that it removes the sometimes highly contended
zone->lock from underneath tree_lock.

David Miller reports seeing this allocation fail on a highly threaded
sparc64 system when running a parallel 'dd' test:

[527319.459981] dd: page allocation failure. order:0, mode:0x20
[527319.460403] Call Trace:
[527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
[527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
[527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
[527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
[527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
[527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
[527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
[527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
[527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
[527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
[527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
[527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
[527319.461361]  [00000000004bb920] sys_read+0x34/0x60
[527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40

The calltrace is significant: __do_page_cache_readahead allocates a number
of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
memory to satisfy GFP_ATOMIC allocations. However after the list of pages
goes to mpage_readpages, there can be significant intervals (including
disk IO) before all the pages are inserted into the radix-tree. So the
reserves can easily be depleted at that point.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -95,12 +95,10 @@ static inline gfp_t root_gfp_mask(struct
 static struct radix_tree_node *
 radix_tree_node_alloc(struct radix_tree_root *root)
 {
-	struct radix_tree_node *ret;
+	struct radix_tree_node *ret = NULL;
 	gfp_t gfp_mask = root_gfp_mask(root);
 
-	ret = kmem_cache_alloc(radix_tree_node_cachep,
-				set_migrateflags(gfp_mask, __GFP_RECLAIMABLE));
-	if (ret == NULL && !(gfp_mask & __GFP_WAIT)) {
+	if (!(gfp_mask & __GFP_WAIT)) {
 		struct radix_tree_preload *rtp;
 
 		rtp = &__get_cpu_var(radix_tree_preloads);
@@ -110,6 +108,10 @@ radix_tree_node_alloc(struct radix_tree_
 			rtp->nr--;
 		}
 	}
+	if (ret == NULL)
+		ret = kmem_cache_alloc(radix_tree_node_cachep,
+				set_migrateflags(gfp_mask, __GFP_RECLAIMABLE));
+
 	BUG_ON(radix_tree_is_indirect_ptr(ret));
 	return ret;
 }
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -65,7 +65,6 @@ generic_file_direct_IO(int rw, struct ki
  *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
- *          ->zone.lock
  *
  *  ->i_mutex
  *    ->i_mmap_lock		(truncate->unmap_mapping_range)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -36,7 +36,6 @@
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within inode_lock in __sync_single_inode)
- *                   zone->lock (within radix tree node alloc)
  */
 
 #include <linux/mm.h>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  0:43 [patch] radix-tree: avoid atomic allocations for preloaded insertions Nick Piggin
@ 2007-11-08  1:09 ` Andrew Morton
  2007-11-08  1:34   ` David Miller, Andrew Morton
  2007-11-08  1:37   ` Nick Piggin
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2007-11-08  1:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, davem

> On Thu, 8 Nov 2007 01:43:04 +0100 Nick Piggin <npiggin@suse.de> wrote:
> OK, here's this patch again. This time I come with real failures on real
> systems (in this case, David is running some 'dd' pagecache throughput
> tests).
> 
> I haven't got him to retest it yet, but I think the idea is just a no-brainer.
> We significantly reduce maximum tree_lock(W) hold time, and we reduce the
> amount of GFP_ATOMIC allocations.
> 
> --
> 
> Most pagecache (and some other) radix tree insertions have the great
> opportunity to preallocate a few nodes with relaxed gfp flags. But
> the preallocation is squandered when it comes time to allocate a node,
> we default to first attempting a GFP_ATOMIC allocation -- that doesn't
> normally fail, but it can eat into atomic memory reserves that we
> don't need to be using.
> 
> Another upshot of this is that it removes the sometimes highly contended
> zone->lock from underneath tree_lock.
> 
> David Miller reports seeing this allocation fail on a highly threaded
> sparc64 system when running a parallel 'dd' test:
> 
> [527319.459981] dd: page allocation failure. order:0, mode:0x20
> [527319.460403] Call Trace:
> [527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
> [527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
> [527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
> [527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
> [527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
> [527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
> [527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
> [527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
> [527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
> [527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
> [527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
> [527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
> [527319.461361]  [00000000004bb920] sys_read+0x34/0x60
> [527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40
> 
> The calltrace is significant: __do_page_cache_readahead allocates a number
> of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
> memory to satisfy GFP_ATOMIC allocations. However after the list of pages
> goes to mpage_readpages, there can be significant intervals (including
> disk IO) before all the pages are inserted into the radix-tree. So the
> reserves can easily be depleted at that point.
> 

So now I've got to re-re-remember why I didn't like this the first time. 
Do you recall?

Why not just stomp the warning with __GFP_NOWARN?

Did you consider turning off __GFP_HIGH?  (Dunno why)

This change will slow things down - has this been quantified?  Probably
it's unmeasurable, but it's still there.

I'd have thought that a superior approach would be to just set
__GFP_NOWARN?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  1:09 ` Andrew Morton
@ 2007-11-08  1:34   ` David Miller, Andrew Morton
  2007-11-08  1:41     ` Andrew Morton
  2007-11-08  1:37   ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: David Miller, Andrew Morton @ 2007-11-08  1:34 UTC (permalink / raw)
  To: akpm; +Cc: npiggin, linux-mm

> Why not just stomp the warning with __GFP_NOWARN?
> 
> Did you consider turning off __GFP_HIGH?  (Dunno why)
> 
> This change will slow things down - has this been quantified?  Probably
> it's unmeasurable, but it's still there.
> 
> I'd have thought that a superior approach would be to just set
> __GFP_NOWARN?

I've rerun my test case which triggers this on Niagara 2
and I no longer get the messages.

For reference I first create N 16GB sparse files with
a script such as:

#!/bin/sh
#
# Usage: create_sparse NUM_FILES

for i in $(seq $1)
do
   dd if=/dev/zero of=sparse_file_$i bs=1MB count=1 seek=$((16 * 1024))
done

And then I fork off N threads, each running dd over one of
those sparse files with a script like:

#!/bin/sh
#
# Usage: thread_sparse NUM_THREADS

for i in $(seq $1)
do
    dd bs=1M if=sparse_file_$i of=/dev/null &
done

wait

On my Niagara 2 box I use '64' for 'N', so I go:

create_sparse 64
thread_sparse 64

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  1:09 ` Andrew Morton
  2007-11-08  1:34   ` David Miller, Andrew Morton
@ 2007-11-08  1:37   ` Nick Piggin
  2007-11-08  3:02     ` Andrew Morton
  1 sibling, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2007-11-08  1:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, davem

On Wed, Nov 07, 2007 at 05:09:23PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2007 01:43:04 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > OK, here's this patch again. This time I come with real failures on real
> > systems (in this case, David is running some 'dd' pagecache throughput
> > tests).
> > 
> > I haven't got him to retest it yet, but I think the idea is just a no-brainer.
> > We significantly reduce maximum tree_lock(W) hold time, and we reduce the
> > amount of GFP_ATOMIC allocations.
> > 
> > --
> > 
> > Most pagecache (and some other) radix tree insertions have the great
> > opportunity to preallocate a few nodes with relaxed gfp flags. But
> > the preallocation is squandered when it comes time to allocate a node,
> > we default to first attempting a GFP_ATOMIC allocation -- that doesn't
> > normally fail, but it can eat into atomic memory reserves that we
> > don't need to be using.
> > 
> > Another upshot of this is that it removes the sometimes highly contended
> > zone->lock from underneath tree_lock.
> > 
> > David Miller reports seeing this allocation fail on a highly threaded
> > sparc64 system when running a parallel 'dd' test:
> > 
> > [527319.459981] dd: page allocation failure. order:0, mode:0x20
> > [527319.460403] Call Trace:
> > [527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
> > [527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
> > [527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
> > [527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
> > [527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
> > [527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
> > [527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
> > [527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
> > [527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
> > [527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
> > [527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
> > [527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
> > [527319.461361]  [00000000004bb920] sys_read+0x34/0x60
> > [527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40
> > 
> > The calltrace is significant: __do_page_cache_readahead allocates a number
> > of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
> > memory to satisfy GFP_ATOMIC allocations. However after the list of pages
> > goes to mpage_readpages, there can be significant intervals (including
> > disk IO) before all the pages are inserted into the radix-tree. So the
> > reserves can easily be depleted at that point.
> > 
> 
> So now I've got to re-re-remember why I didn't like this the first time. 
> Do you recall?

Sorry, can't recall why you didn't like it the first time. Maybe I was
misremembering, and you simply didn't merge it because I didn't present
it as a submission.. I honestly can't find the mail anywhere.

You didn't like it the second time because I didn't offer a realistic
test were it mattered.


> Why not just stomp the warning with __GFP_NOWARN?
 
Yeah, but it's still using up a lot of atomic reserves.


> Did you consider turning off __GFP_HIGH?  (Dunno why)

That would help, although that still allows one to eat a (smaller) amount
of reserves, which would be nice to avoid. 


> This change will slow things down - has this been quantified?  Probably
> it's unmeasurable, but it's still there.
 
I wouldn't have thought it should slow things down _too much_. The radix
tree nodes are those unusual allocations (like pagetables) that don't
really need to be allocated cache-hot. (If that's where you're thinking
the slowdown will come from...)


> I'd have thought that a superior approach would be to just set
> __GFP_NOWARN?

But given that the potential performance loss is so small, I think it is
more important to avoid using reserves that we need for important things
like networking.

Though even if we ignore the question of atomic allocations, I think it
is really nice to be able to turn tree_lock into an innermost lock, and
not transitively pollute it with zone->lock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  1:34   ` David Miller, Andrew Morton
@ 2007-11-08  1:41     ` Andrew Morton
  2007-11-08  1:45       ` David Miller, Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2007-11-08  1:41 UTC (permalink / raw)
  To: David Miller; +Cc: npiggin, linux-mm

> On Wed, 07 Nov 2007 17:34:19 -0800 (PST) David Miller <davem@davemloft.net> wrote:
> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Wed, 7 Nov 2007 17:09:23 -0800
> 
> > Why not just stomp the warning with __GFP_NOWARN?
> > 
> > Did you consider turning off __GFP_HIGH?  (Dunno why)
> > 
> > This change will slow things down - has this been quantified?  Probably
> > it's unmeasurable, but it's still there.
> > 
> > I'd have thought that a superior approach would be to just set
> > __GFP_NOWARN?
> 
> I've rerun my test case which triggers this on Niagara 2
> and I no longer get the messages.

With Nick's patch, I assume?

> For reference I first create N 16GB sparse files with
> a script such as:
> 
> #!/bin/sh
> #
> # Usage: create_sparse NUM_FILES
> 
> for i in $(seq $1)
> do
>    dd if=/dev/zero of=sparse_file_$i bs=1MB count=1 seek=$((16 * 1024))
> done
> 
> And then I fork off N threads, each running dd over one of
> those sparse files with a script like:
> 
> #!/bin/sh
> #
> # Usage: thread_sparse NUM_THREADS
> 
> for i in $(seq $1)
> do
>     dd bs=1M if=sparse_file_$i of=/dev/null &
> done
> 
> wait
> 
> On my Niagara 2 box I use '64' for 'N', so I go:
> 
> create_sparse 64
> thread_sparse 64

Yeah, draining the GFP_ATOMIC reserves is bad.  Setting __GFP_NOWARN and
clearing __GFP_HIGH should plug this, but which appropach is the best? 
Unsure.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  1:41     ` Andrew Morton
@ 2007-11-08  1:45       ` David Miller, Andrew Morton
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller, Andrew Morton @ 2007-11-08  1:45 UTC (permalink / raw)
  To: akpm; +Cc: npiggin, linux-mm

> > On Wed, 07 Nov 2007 17:34:19 -0800 (PST) David Miller <davem@davemloft.net> wrote:
> > From: Andrew Morton <akpm@linux-foundation.org>
> > Date: Wed, 7 Nov 2007 17:09:23 -0800
> > 
> > > Why not just stomp the warning with __GFP_NOWARN?
> > > 
> > > Did you consider turning off __GFP_HIGH?  (Dunno why)
> > > 
> > > This change will slow things down - has this been quantified?  Probably
> > > it's unmeasurable, but it's still there.
> > > 
> > > I'd have thought that a superior approach would be to just set
> > > __GFP_NOWARN?
> > 
> > I've rerun my test case which triggers this on Niagara 2
> > and I no longer get the messages.
> 
> With Nick's patch, I assume?

Yes, that's correct.

> Yeah, draining the GFP_ATOMIC reserves is bad.  Setting __GFP_NOWARN and
> clearing __GFP_HIGH should plug this, but which appropach is the best? 
> Unsure.

I like the locking aspects of Nick's patch personally.

This will allow us to do more interesting things in
the future.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  1:37   ` Nick Piggin
@ 2007-11-08  3:02     ` Andrew Morton
  2007-11-08  3:16       ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2007-11-08  3:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, davem

> On Thu, 8 Nov 2007 02:37:23 +0100 Nick Piggin <npiggin@suse.de> wrote:
> On Wed, Nov 07, 2007 at 05:09:23PM -0800, Andrew Morton wrote:
> > > On Thu, 8 Nov 2007 01:43:04 +0100 Nick Piggin <npiggin@suse.de> wrote:
>  
> I wouldn't have thought it should slow things down _too much_. The radix
> tree nodes are those unusual allocations (like pagetables) that don't
> really need to be allocated cache-hot. (If that's where you're thinking
> the slowdown will come from...)

Well, it's simply more work.  For each ratnode we presently do

	test radix_tree_preloads, do nothing
	kmem_cache_alloc()

now we do

	test radix_tree_preloads
		kmem_cache_alloc()
		store it in radix_tree_preloads()
	retrieve it from radix_tree_preloads()

it's not a _lot_ of work, but it's there.  Mainly the new dirtying of this
cpu's radix_tree_preload all the time.

> 
> > I'd have thought that a superior approach would be to just set
> > __GFP_NOWARN?
> 
> But given that the potential performance loss is so small, I think it is
> more important to avoid using reserves that we need for important things
> like networking.

Spose so.  We'll end up consuming a quarter of the atomic reserve in rare
situations for very short periods.

> Though even if we ignore the question of atomic allocations, I think it
> is really nice to be able to turn tree_lock into an innermost lock, and
> not transitively pollute it with zone->lock.

That would be nice if it were true.  But you still have a
kmem_cache_alloc() in radix_tree_node_alloc()

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  3:02     ` Andrew Morton
@ 2007-11-08  3:16       ` Nick Piggin
  2007-11-08  4:12         ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2007-11-08  3:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, davem

On Wed, Nov 07, 2007 at 07:02:54PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2007 02:37:23 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > On Wed, Nov 07, 2007 at 05:09:23PM -0800, Andrew Morton wrote:
> > > > On Thu, 8 Nov 2007 01:43:04 +0100 Nick Piggin <npiggin@suse.de> wrote:
> >  
> > I wouldn't have thought it should slow things down _too much_. The radix
> > tree nodes are those unusual allocations (like pagetables) that don't
> > really need to be allocated cache-hot. (If that's where you're thinking
> > the slowdown will come from...)
> 
> Well, it's simply more work.  For each ratnode we presently do
> 
> 	test radix_tree_preloads, do nothing
> 	kmem_cache_alloc()
> 
> now we do
> 
> 	test radix_tree_preloads
> 		kmem_cache_alloc()
> 		store it in radix_tree_preloads()
> 	retrieve it from radix_tree_preloads()
> 
> it's not a _lot_ of work, but it's there.  Mainly the new dirtying of this
> cpu's radix_tree_preload all the time.

Oh that. Yeah I suppose it does, but it is a per-cpu var which already
must be read from, so we won't get write misses. I can't see it being
any problem at all.


> > > I'd have thought that a superior approach would be to just set
> > > __GFP_NOWARN?
> > 
> > But given that the potential performance loss is so small, I think it is
> > more important to avoid using reserves that we need for important things
> > like networking.
> 
> Spose so.  We'll end up consuming a quarter of the atomic reserve in rare
> situations for very short periods.

Well that's not insignificant. And you can end up consuming _all_ of the
!__GFP_WAIT reserves that can be used for more useful things (eg. we can
use it in the block layer request allocation to avoid a spin_unlock_irq/
spin_lock_irq pair required when we fall back to __GFP_WAIT).
 

> > Though even if we ignore the question of atomic allocations, I think it
> > is really nice to be able to turn tree_lock into an innermost lock, and
> > not transitively pollute it with zone->lock.
> 
> That would be nice if it were true.  But you still have a
> kmem_cache_alloc() in radix_tree_node_alloc()

But with my patch it is never called so long as the radix_tree_insert is
called within a successful preload (which it always is, for pagecache
AFAIKS).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  3:16       ` Nick Piggin
@ 2007-11-08  4:12         ` Andrew Morton
  2007-11-08  4:54           ` Nick Piggin
  2007-11-08 11:57           ` [patch] radix-tree: avoid atomic allocations for preloaded insertions Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2007-11-08  4:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, davem

> On Thu, 8 Nov 2007 04:16:45 +0100 Nick Piggin <npiggin@suse.de> wrote:
> On Wed, Nov 07, 2007 at 07:02:54PM -0800, Andrew Morton wrote:
> > > On Thu, 8 Nov 2007 02:37:23 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > On Wed, Nov 07, 2007 at 05:09:23PM -0800, Andrew Morton wrote:
> > > > > On Thu, 8 Nov 2007 01:43:04 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > >  
> > > I wouldn't have thought it should slow things down _too much_. The radix
> > > tree nodes are those unusual allocations (like pagetables) that don't
> > > really need to be allocated cache-hot. (If that's where you're thinking
> > > the slowdown will come from...)
> > 
> > Well, it's simply more work.  For each ratnode we presently do
> > 
> > 	test radix_tree_preloads, do nothing
> > 	kmem_cache_alloc()
> > 
> > now we do
> > 
> > 	test radix_tree_preloads
> > 		kmem_cache_alloc()
> > 		store it in radix_tree_preloads()
> > 	retrieve it from radix_tree_preloads()
> > 
> > it's not a _lot_ of work, but it's there.  Mainly the new dirtying of this
> > cpu's radix_tree_preload all the time.
> 
> Oh that. Yeah I suppose it does, but it is a per-cpu var which already
> must be read from, so we won't get write misses. I can't see it being
> any problem at all.
> 

The cacheline now needs to be written back: more bus traffic.

> 
> > > > I'd have thought that a superior approach would be to just set
> > > > __GFP_NOWARN?
> > > 
> > > But given that the potential performance loss is so small, I think it is
> > > more important to avoid using reserves that we need for important things
> > > like networking.
> > 
> > Spose so.  We'll end up consuming a quarter of the atomic reserve in rare
> > situations for very short periods.
> 
> Well that's not insignificant.

Actually I don't think it's true.

> And you can end up consuming _all_ of the
> !__GFP_WAIT reserves that can be used for more useful things (eg. we can
> use it in the block layer request allocation to avoid a spin_unlock_irq/
> spin_lock_irq pair required when we fall back to __GFP_WAIT).

I don't think we can go more than a single page below the zone watermark
because the preceding radix_tree_preload(GFP_KERNEL) filled things up
again.

In which case, why did David hit that allocation failure at all? 
Presumably this cpu's radix_tree_preloads slot was already full, or the
radix_tree_preload() allocation was satisfied from slab cache.

> 
> > > Though even if we ignore the question of atomic allocations, I think it
> > > is really nice to be able to turn tree_lock into an innermost lock, and
> > > not transitively pollute it with zone->lock.
> > 
> > That would be nice if it were true.  But you still have a
> > kmem_cache_alloc() in radix_tree_node_alloc()
> 
> But with my patch it is never called so long as the radix_tree_insert is
> called within a successful preload (which it always is, for pagecache
> AFAIKS).

ug, that was subtle.  Too subtle to be omitted from changelog, code comments
and runtime assertions..

It would be good to simply require that the radix_tree_preloads slot be
full on entry to radix_tree_insert() (ie: all callers correctly use
radix_tree_preload()).  But some callers don't bother.

<looks at arch/powerpc/kernel/irq.c>

It's buggy - doesn't handle GFP_ATOMIC allocation failures.

<looks at drivers/net/mlx4/cq.c>

Well at least it tests for failure, but it could reliably avoid failure if
it used radix_tree_preload().

<looks at fs/nfs/write.c>

again: unreliable, remembers to test for failure, would be better to use
radix_tree_preload().


So I think what we should be doing here is fixing those three callers to
use radix_tree_preload() correctly, then remove that kmem_cache_alloc()
from radix_tree_node_alloc() altogether.  And add a suitable runtime
assertion to the top of radix_tree_insert() to catch regressers.

But I suppose that's a separate work.  For now, please at least comment
your "AFAIKS"?


My bottom line: your change

- is a bit slower

- doesn't solve the problem which it claims to be solving
  (radix_tree_insert() doesn't deplete atomic reserves as long as the
  caller uses radix_tree_preload(GFP_KERNEL))

- is probably desirable as a simplify-the-locking-hierarchy thing, but a)
  should be presented as such and b) needs code comments explaining why it
  is correct and needs a big fat TODO explaining how we should get that
  kmem_cache_alloc() out of there, an how we should do it.

OK?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  4:12         ` Andrew Morton
@ 2007-11-08  4:54           ` Nick Piggin
  2007-11-08  5:02             ` Andrew Morton
  2007-11-08 11:57           ` [patch] radix-tree: avoid atomic allocations for preloaded insertions Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2007-11-08  4:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, davem

On Wed, Nov 07, 2007 at 08:12:42PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2007 04:16:45 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > On Wed, Nov 07, 2007 at 07:02:54PM -0800, Andrew Morton wrote:
> > > 
> > > 	test radix_tree_preloads
> > > 		kmem_cache_alloc()
> > > 		store it in radix_tree_preloads()
> > > 	retrieve it from radix_tree_preloads()
> > > 
> > > it's not a _lot_ of work, but it's there.  Mainly the new dirtying of this
> > > cpu's radix_tree_preload all the time.
> > 
> > Oh that. Yeah I suppose it does, but it is a per-cpu var which already
> > must be read from, so we won't get write misses. I can't see it being
> > any problem at all.
> > 
> 
> The cacheline now needs to be written back: more bus traffic.

Sure, but that's a bandwidth issue rather than a latency one, and the
bandwidth consumed by this single write is in the noise, I can assure
you. We're talking about this event happening at *very most* once per
pagecache page insertion. And considering the pagecache has to be zeroed
out or otherwise initialized, the bandwidth is like 0.1% of the page
clearing bandwidth required.

But if the variable is that hot, it won't get written back to memory
_every_ time it is dirtied in CPU cache anyway.

> > And you can end up consuming _all_ of the
> > !__GFP_WAIT reserves that can be used for more useful things (eg. we can
> > use it in the block layer request allocation to avoid a spin_unlock_irq/
> > spin_lock_irq pair required when we fall back to __GFP_WAIT).
> 
> I don't think we can go more than a single page below the zone watermark
> because the preceding radix_tree_preload(GFP_KERNEL) filled things up
> again.
> 
> In which case, why did David hit that allocation failure at all? 
> Presumably this cpu's radix_tree_preloads slot was already full, or the
> radix_tree_preload() allocation was satisfied from slab cache.

Of course it is already full: as the code stands now it never gets used
up at all until a page allocation failure hits.

> > But with my patch it is never called so long as the radix_tree_insert is
> > called within a successful preload (which it always is, for pagecache
> > AFAIKS).
> 
> ug, that was subtle.  Too subtle to be omitted from changelog, code comments
> and runtime assertions..

I did mention it in the changelog (zone->lock moves from underneath tree_lock).

I don't know what runtime assertions we can have. But it isn't so hard to
see the 2 places that insert pages into the pagecache run under a preload.

> It would be good to simply require that the radix_tree_preloads slot be
> full on entry to radix_tree_insert() (ie: all callers correctly use
> radix_tree_preload()).  But some callers don't bother.
> 
> <looks at arch/powerpc/kernel/irq.c>
> 
> It's buggy - doesn't handle GFP_ATOMIC allocation failures.
> 
> <looks at drivers/net/mlx4/cq.c>
> 
> Well at least it tests for failure, but it could reliably avoid failure if
> it used radix_tree_preload().
> 
> <looks at fs/nfs/write.c>
> 
> again: unreliable, remembers to test for failure, would be better to use
> radix_tree_preload().
> 
> 
> So I think what we should be doing here is fixing those three callers to
> use radix_tree_preload() correctly, then remove that kmem_cache_alloc()
> from radix_tree_node_alloc() altogether.  And add a suitable runtime
> assertion to the top of radix_tree_insert() to catch regressers.

Sure, that's another patch again. I'm just patching the problem I want
to see fixed. Everytime I try to do these peripheral cleanups and fixes,
the patch fixing the main issue gets ignored (along with everything
else).

> But I suppose that's a separate work.  For now, please at least comment
> your "AFAIKS"?
> 
> 
> My bottom line: your change
> 
> - is a bit slower

That's so much less significant than the atomic->non-atomic operations
you hate me doing that I never thought you'd care at all about it. If
the new Andrew is going to count cycles going into core code, I couldn't
be happier though ;)

> - doesn't solve the problem which it claims to be solving
>   (radix_tree_insert() doesn't deplete atomic reserves as long as the
>   caller uses radix_tree_preload(GFP_KERNEL))

I'm pretty sure it does. I don't follow exactly why you say it doesn't.

> - is probably desirable as a simplify-the-locking-hierarchy thing, but a)
>   should be presented as such and

It's primarily to avoid GFP_ATOMIC allocations. Simplify the locking
hierarcy is secondary and I put that in the changelog.

> b) needs code comments explaining why it
>   is correct and needs a big fat TODO explaining how we should get that
>   kmem_cache_alloc() out of there, an how we should do it.
> 
> OK?

I don't really know about getting that kmem_cache_alloc out of there.
For radix trees that are protected by sleeping locks, you don't actually
need to disable preempt and you can do sleeping allocations there.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  4:54           ` Nick Piggin
@ 2007-11-08  5:02             ` Andrew Morton
  2007-11-08  5:44               ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2007-11-08  5:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, davem

> On Thu, 8 Nov 2007 05:54:04 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > - doesn't solve the problem which it claims to be solving
> >   (radix_tree_insert() doesn't deplete atomic reserves as long as the
> >   caller uses radix_tree_preload(GFP_KERNEL))
> 
> I'm pretty sure it does. I don't follow exactly why you say it doesn't.
>  

I was wrong.  The radix_tree_preloads will remain full and we'll just keep
on doing GFP_ATOMIC allocations.

> > - is probably desirable as a simplify-the-locking-hierarchy thing, but a)
> >   should be presented as such and
> 
> It's primarily to avoid GFP_ATOMIC allocations. Simplify the locking
> hierarcy is secondary and I put that in the changelog.
> 
> 
> > b) needs code comments explaining why it
> >   is correct and needs a big fat TODO explaining how we should get that
> >   kmem_cache_alloc() out of there, an how we should do it.
> > 
> > OK?
> 
> I don't really know about getting that kmem_cache_alloc out of there.
> For radix trees that are protected by sleeping locks, you don't actually
> need to disable preempt and you can do sleeping allocations there.

If the radix tree's gfp_mask is GFP_ATOMIC, radix_tree_insert() can require
that the preloads be full.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  5:02             ` Andrew Morton
@ 2007-11-08  5:44               ` Nick Piggin
  2007-11-08  6:02                 ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2007-11-08  5:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, davem

On Wed, Nov 07, 2007 at 09:02:04PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2007 05:54:04 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > - doesn't solve the problem which it claims to be solving
> > >   (radix_tree_insert() doesn't deplete atomic reserves as long as the
> > >   caller uses radix_tree_preload(GFP_KERNEL))
> > 
> > I'm pretty sure it does. I don't follow exactly why you say it doesn't.
> >  
> 
> I was wrong.  The radix_tree_preloads will remain full and we'll just keep
> on doing GFP_ATOMIC allocations.

OK. Normally it's not a huge issue, because you've recently allocated a
page, but actually it can become a problem if: the page comes out of a
different zone than the radix tree node; or you're doing something like
mpage_readpages code that can do a lot of work between allocating the
pages and inserting them.

 
> > > - is probably desirable as a simplify-the-locking-hierarchy thing, but a)
> > >   should be presented as such and
> > 
> > It's primarily to avoid GFP_ATOMIC allocations. Simplify the locking
> > hierarcy is secondary and I put that in the changelog.
> > 
> > 
> > > b) needs code comments explaining why it
> > >   is correct and needs a big fat TODO explaining how we should get that
> > >   kmem_cache_alloc() out of there, an how we should do it.
> > > 
> > > OK?
> > 
> > I don't really know about getting that kmem_cache_alloc out of there.
> > For radix trees that are protected by sleeping locks, you don't actually
> > need to disable preempt and you can do sleeping allocations there.
> 
> If the radix tree's gfp_mask is GFP_ATOMIC, radix_tree_insert() can require
> that the preloads be full.

So we can put that invariant check in radix_tree_insert(), and I could
refactor / comment the radix_tree_node_alloc a bit so that it is clearer?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  5:44               ` Nick Piggin
@ 2007-11-08  6:02                 ` Andrew Morton
  2007-11-08  6:54                   ` Nick Piggin
  2007-11-08  6:56                   ` [patch] nfs: use GFP_NOFS preloads for radix-tree insertion Nick Piggin
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2007-11-08  6:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, davem

> On Thu, 8 Nov 2007 06:44:46 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > I don't really know about getting that kmem_cache_alloc out of there.
> > > For radix trees that are protected by sleeping locks, you don't actually
> > > need to disable preempt and you can do sleeping allocations there.
> > 
> > If the radix tree's gfp_mask is GFP_ATOMIC, radix_tree_insert() can require
> > that the preloads be full.
> 
> So we can put that invariant check in radix_tree_insert(),

Well, ultimately.  If we do that right now the powerpc irq management will
trigger it.  But it deserves to ;)

> and I could
> refactor / comment the radix_tree_node_alloc a bit so that it is clearer?

Please.

I guess we can't actually remove the kmem_cache_alloc() call in there
because of possible future code which uses sleeping locks.  AFAICT all
callers persently use GFP_ATOMIC.   So it's going to end up trickier than
one would like.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  6:02                 ` Andrew Morton
@ 2007-11-08  6:54                   ` Nick Piggin
  2007-11-08  6:56                   ` [patch] nfs: use GFP_NOFS preloads for radix-tree insertion Nick Piggin
  1 sibling, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2007-11-08  6:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, davem

On Wed, Nov 07, 2007 at 10:02:00PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2007 06:44:46 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > > I don't really know about getting that kmem_cache_alloc out of there.
> > > > For radix trees that are protected by sleeping locks, you don't actually
> > > > need to disable preempt and you can do sleeping allocations there.
> > > 
> > > If the radix tree's gfp_mask is GFP_ATOMIC, radix_tree_insert() can require
> > > that the preloads be full.
> > 
> > So we can put that invariant check in radix_tree_insert(),
> 
> Well, ultimately.  If we do that right now the powerpc irq management will
> trigger it.  But it deserves to ;)
> 
> > and I could
> > refactor / comment the radix_tree_node_alloc a bit so that it is clearer?
> 
> Please.

OK, here is this version. It won't be possible to put invariants in until
other code gets cleaned up... I had a shot at NFS...

---

Most pagecache (and some other) radix tree insertions have the great
opportunity to preallocate a few nodes with relaxed gfp flags. But
the preallocation is squandered when it comes time to allocate a node,
we default to first attempting a GFP_ATOMIC allocation -- that doesn't
normally fail, but it can eat into atomic memory reserves that we
don't need to be using.

Another upshot of this is that it removes the sometimes highly contended
zone->lock from underneath tree_lock. Pagecache insertions are always
performed with a radix tree preload, and after this change, such a situation
will never fall back to kmem_cache_alloc within radix_tree_node_alloc.

David Miller reports seeing this allocation fail on a highly threaded
sparc64 system:

[527319.459981] dd: page allocation failure. order:0, mode:0x20
[527319.460403] Call Trace:
[527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
[527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
[527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
[527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
[527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
[527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
[527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
[527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
[527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
[527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
[527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
[527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
[527319.461361]  [00000000004bb920] sys_read+0x34/0x60
[527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40

The calltrace is significant: __do_page_cache_readahead allocates a number
of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
memory to satisfy GFP_ATOMIC allocations. However after the list of pages
goes to mpage_readpages, there can be significant intervals (including
disk IO) before all the pages are inserted into the radix-tree. So the
reserves can easily be depleted at that point. The patch is confirmed to
fix the problem.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -95,14 +95,17 @@ static inline gfp_t root_gfp_mask(struct
 static struct radix_tree_node *
 radix_tree_node_alloc(struct radix_tree_root *root)
 {
-	struct radix_tree_node *ret;
+	struct radix_tree_node *ret = NULL;
 	gfp_t gfp_mask = root_gfp_mask(root);
 
-	ret = kmem_cache_alloc(radix_tree_node_cachep,
-				set_migrateflags(gfp_mask, __GFP_RECLAIMABLE));
-	if (ret == NULL && !(gfp_mask & __GFP_WAIT)) {
+	if (!(gfp_mask & __GFP_WAIT)) {
 		struct radix_tree_preload *rtp;
 
+		/*
+		 * Provided the caller has preloaded here, we will always
+		 * succeed in getting a node here (and never reach
+		 * kmem_cache_alloc)
+		 */
 		rtp = &__get_cpu_var(radix_tree_preloads);
 		if (rtp->nr) {
 			ret = rtp->nodes[rtp->nr - 1];
@@ -110,6 +113,10 @@ radix_tree_node_alloc(struct radix_tree_
 			rtp->nr--;
 		}
 	}
+	if (ret == NULL)
+		ret = kmem_cache_alloc(radix_tree_node_cachep,
+				set_migrateflags(gfp_mask, __GFP_RECLAIMABLE));
+
 	BUG_ON(radix_tree_is_indirect_ptr(ret));
 	return ret;
 }
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -65,7 +65,6 @@ generic_file_direct_IO(int rw, struct ki
  *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
- *          ->zone.lock
  *
  *  ->i_mutex
  *    ->i_mmap_lock		(truncate->unmap_mapping_range)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -36,7 +36,6 @@
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within inode_lock in __sync_single_inode)
- *                   zone->lock (within radix tree node alloc)
  */
 
 #include <linux/mm.h>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch] nfs: use GFP_NOFS preloads for radix-tree insertion
  2007-11-08  6:02                 ` Andrew Morton
  2007-11-08  6:54                   ` Nick Piggin
@ 2007-11-08  6:56                   ` Nick Piggin
  2007-11-13 10:55                     ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2007-11-08  6:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, davem, trond.myklebust

Here is the NFS version. I guess Trond should ack it before you pick it
up.

--

NFS should use GFP_NOFS mode radix tree preloads rather than GFP_ATOMIC
allocations at radix-tree insertion-time. This is important to reduce the
atomic memory requirement.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -363,15 +363,13 @@ int nfs_writepages(struct address_space 
 /*
  * Insert a write request into an inode
  */
-static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
+static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int error;
 
 	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
-	BUG_ON(error == -EEXIST);
-	if (error)
-		return error;
+	BUG_ON(error);
 	if (!nfsi->npages) {
 		igrab(inode);
 		if (nfs_have_delegation(inode, FMODE_WRITE))
@@ -381,7 +379,6 @@ static int nfs_inode_add_request(struct 
 	set_page_private(req->wb_page, (unsigned long)req);
 	nfsi->npages++;
 	kref_get(&req->wb_kref);
-	return 0;
 }
 
 /*
@@ -593,6 +590,13 @@ static struct nfs_page * nfs_update_requ
 		/* Loop over all inode entries and see if we find
 		 * A request for the page we wish to update
 		 */
+		if (new) {
+			if (radix_tree_preload(GFP_NOFS)) {
+				nfs_release_request(new);
+				return ERR_PTR(-ENOMEM);
+			}
+		}
+
 		spin_lock(&inode->i_lock);
 		req = nfs_page_find_request_locked(page);
 		if (req) {
@@ -603,28 +607,27 @@ static struct nfs_page * nfs_update_requ
 				error = nfs_wait_on_request(req);
 				nfs_release_request(req);
 				if (error < 0) {
-					if (new)
+					if (new) {
+						radix_tree_preload_end();
 						nfs_release_request(new);
+					}
 					return ERR_PTR(error);
 				}
 				continue;
 			}
 			spin_unlock(&inode->i_lock);
-			if (new)
+			if (new) {
+				radix_tree_preload_end();
 				nfs_release_request(new);
+			}
 			break;
 		}
 
 		if (new) {
-			int error;
 			nfs_lock_request_dontget(new);
-			error = nfs_inode_add_request(inode, new);
-			if (error) {
-				spin_unlock(&inode->i_lock);
-				nfs_unlock_request(new);
-				return ERR_PTR(error);
-			}
+			nfs_inode_add_request(inode, new);
 			spin_unlock(&inode->i_lock);
+			radix_tree_preload_end();
 			req = new;
 			goto zero_page;
 		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08  4:12         ` Andrew Morton
  2007-11-08  4:54           ` Nick Piggin
@ 2007-11-08 11:57           ` Peter Zijlstra
  2007-11-08 20:37             ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2007-11-08 11:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-mm, davem

On Wed, 2007-11-07 at 20:12 -0800, Andrew Morton wrote:

> <looks at fs/nfs/write.c>
> 
> again: unreliable, remembers to test for failure, would be better to use
> radix_tree_preload().

http://lkml.org/lkml/2007/10/30/271



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08 11:57           ` [patch] radix-tree: avoid atomic allocations for preloaded insertions Peter Zijlstra
@ 2007-11-08 20:37             ` Nick Piggin
  2007-11-08 20:47               ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2007-11-08 20:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-mm, davem

On Thu, Nov 08, 2007 at 12:57:02PM +0100, Peter Zijlstra wrote:
> On Wed, 2007-11-07 at 20:12 -0800, Andrew Morton wrote:
> 
> > <looks at fs/nfs/write.c>
> > 
> > again: unreliable, remembers to test for failure, would be better to use
> > radix_tree_preload().
> 
> http://lkml.org/lkml/2007/10/30/271

Ah, missed that. See my subsequent patch too, slightly different. It only
preloads if a request isn't already found (is this a good idea?, if it wasn't
relatively common, they'd just be checking for -EEXIST in the insertion?).

Anyway we can also simplify the code because the insertion can't fail with a
preload.

NFS can also use GFP_NOFS for the preload (at least, for upstream).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] radix-tree: avoid atomic allocations for preloaded insertions
  2007-11-08 20:37             ` Nick Piggin
@ 2007-11-08 20:47               ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2007-11-08 20:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-mm, davem

On Thu, 2007-11-08 at 21:37 +0100, Nick Piggin wrote:
> On Thu, Nov 08, 2007 at 12:57:02PM +0100, Peter Zijlstra wrote:
> > On Wed, 2007-11-07 at 20:12 -0800, Andrew Morton wrote:
> > 
> > > <looks at fs/nfs/write.c>
> > > 
> > > again: unreliable, remembers to test for failure, would be better to use
> > > radix_tree_preload().
> > 
> > http://lkml.org/lkml/2007/10/30/271
> 
> Ah, missed that. See my subsequent patch too, slightly different. It only
> preloads if a request isn't already found (is this a good idea?, if it wasn't
> relatively common, they'd just be checking for -EEXIST in the insertion?).
> 
> Anyway we can also simplify the code because the insertion can't fail with a
> preload.

Yeah, saw that, didn't get round to verifying the logic. Patch looked
good on first glance.

> NFS can also use GFP_NOFS for the preload (at least, for upstream).

Agreed, the NOIO comes from me swapping over it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] nfs: use GFP_NOFS preloads for radix-tree insertion
  2007-11-08  6:56                   ` [patch] nfs: use GFP_NOFS preloads for radix-tree insertion Nick Piggin
@ 2007-11-13 10:55                     ` Peter Zijlstra
  2007-11-14  4:20                       ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2007-11-13 10:55 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-mm, davem, Trond Myklebust

On Thu, 2007-11-08 at 07:56 +0100, Nick Piggin wrote:
> Here is the NFS version. I guess Trond should ack it before you pick it
> up.
> 
> --
> 
> NFS should use GFP_NOFS mode radix tree preloads rather than GFP_ATOMIC
> allocations at radix-tree insertion-time. This is important to reduce the
> atomic memory requirement.

In another mail you said:

> Anyway we can also simplify the code because the insertion can't fail with a
> preload.

Can we please avoid adding strict dependencies on that as the preload
API is unsupportable in -rt.

> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/fs/nfs/write.c
> ===================================================================
> --- linux-2.6.orig/fs/nfs/write.c
> +++ linux-2.6/fs/nfs/write.c
> @@ -363,15 +363,13 @@ int nfs_writepages(struct address_space 
>  /*
>   * Insert a write request into an inode
>   */
> -static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
> +static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
>  {
>  	struct nfs_inode *nfsi = NFS_I(inode);
>  	int error;
>  
>  	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
> -	BUG_ON(error == -EEXIST);
> -	if (error)
> -		return error;
> +	BUG_ON(error);
>  	if (!nfsi->npages) {
>  		igrab(inode);
>  		if (nfs_have_delegation(inode, FMODE_WRITE))
> @@ -381,7 +379,6 @@ static int nfs_inode_add_request(struct 
>  	set_page_private(req->wb_page, (unsigned long)req);
>  	nfsi->npages++;
>  	kref_get(&req->wb_kref);
> -	return 0;
>  }
>  
>  /*
> @@ -593,6 +590,13 @@ static struct nfs_page * nfs_update_requ
>  		/* Loop over all inode entries and see if we find
>  		 * A request for the page we wish to update
>  		 */
> +		if (new) {
> +			if (radix_tree_preload(GFP_NOFS)) {
> +				nfs_release_request(new);
> +				return ERR_PTR(-ENOMEM);
> +			}
> +		}
> +
>  		spin_lock(&inode->i_lock);
>  		req = nfs_page_find_request_locked(page);
>  		if (req) {
> @@ -603,28 +607,27 @@ static struct nfs_page * nfs_update_requ
>  				error = nfs_wait_on_request(req);
>  				nfs_release_request(req);
>  				if (error < 0) {
> -					if (new)
> +					if (new) {
> +						radix_tree_preload_end();
>  						nfs_release_request(new);
> +					}
>  					return ERR_PTR(error);
>  				}
>  				continue;
>  			}
>  			spin_unlock(&inode->i_lock);
> -			if (new)
> +			if (new) {
> +				radix_tree_preload_end();
>  				nfs_release_request(new);
> +			}
>  			break;
>  		}
>  
>  		if (new) {
> -			int error;
>  			nfs_lock_request_dontget(new);
> -			error = nfs_inode_add_request(inode, new);
> -			if (error) {
> -				spin_unlock(&inode->i_lock);
> -				nfs_unlock_request(new);
> -				return ERR_PTR(error);
> -			}
> +			nfs_inode_add_request(inode, new);
>  			spin_unlock(&inode->i_lock);
> +			radix_tree_preload_end();
>  			req = new;
>  			goto zero_page;
>  		}
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] nfs: use GFP_NOFS preloads for radix-tree insertion
  2007-11-13 10:55                     ` Peter Zijlstra
@ 2007-11-14  4:20                       ` Nick Piggin
  2007-11-14  9:06                         ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2007-11-14  4:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-mm, davem, Trond Myklebust

On Tue, Nov 13, 2007 at 11:55:45AM +0100, Peter Zijlstra wrote:
> 
> On Thu, 2007-11-08 at 07:56 +0100, Nick Piggin wrote:
> > Here is the NFS version. I guess Trond should ack it before you pick it
> > up.
> > 
> > --
> > 
> > NFS should use GFP_NOFS mode radix tree preloads rather than GFP_ATOMIC
> > allocations at radix-tree insertion-time. This is important to reduce the
> > atomic memory requirement.
> 
> In another mail you said:
> 
> > Anyway we can also simplify the code because the insertion can't fail with a
> > preload.
> 
> Can we please avoid adding strict dependencies on that as the preload
> API is unsupportable in -rt.

You can surely support it. You just have to do per-thread preloads if you
want preemption left on.



> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > Index: linux-2.6/fs/nfs/write.c
> > ===================================================================
> > --- linux-2.6.orig/fs/nfs/write.c
> > +++ linux-2.6/fs/nfs/write.c
> > @@ -363,15 +363,13 @@ int nfs_writepages(struct address_space 
> >  /*
> >   * Insert a write request into an inode
> >   */
> > -static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
> > +static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
> >  {
> >  	struct nfs_inode *nfsi = NFS_I(inode);
> >  	int error;
> >  
> >  	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
> > -	BUG_ON(error == -EEXIST);
> > -	if (error)
> > -		return error;
> > +	BUG_ON(error);
> >  	if (!nfsi->npages) {
> >  		igrab(inode);
> >  		if (nfs_have_delegation(inode, FMODE_WRITE))
> > @@ -381,7 +379,6 @@ static int nfs_inode_add_request(struct 
> >  	set_page_private(req->wb_page, (unsigned long)req);
> >  	nfsi->npages++;
> >  	kref_get(&req->wb_kref);
> > -	return 0;
> >  }
> >  
> >  /*
> > @@ -593,6 +590,13 @@ static struct nfs_page * nfs_update_requ
> >  		/* Loop over all inode entries and see if we find
> >  		 * A request for the page we wish to update
> >  		 */
> > +		if (new) {
> > +			if (radix_tree_preload(GFP_NOFS)) {
> > +				nfs_release_request(new);
> > +				return ERR_PTR(-ENOMEM);
> > +			}
> > +		}
> > +
> >  		spin_lock(&inode->i_lock);
> >  		req = nfs_page_find_request_locked(page);
> >  		if (req) {
> > @@ -603,28 +607,27 @@ static struct nfs_page * nfs_update_requ
> >  				error = nfs_wait_on_request(req);
> >  				nfs_release_request(req);
> >  				if (error < 0) {
> > -					if (new)
> > +					if (new) {
> > +						radix_tree_preload_end();
> >  						nfs_release_request(new);
> > +					}
> >  					return ERR_PTR(error);
> >  				}
> >  				continue;
> >  			}
> >  			spin_unlock(&inode->i_lock);
> > -			if (new)
> > +			if (new) {
> > +				radix_tree_preload_end();
> >  				nfs_release_request(new);
> > +			}
> >  			break;
> >  		}
> >  
> >  		if (new) {
> > -			int error;
> >  			nfs_lock_request_dontget(new);
> > -			error = nfs_inode_add_request(inode, new);
> > -			if (error) {
> > -				spin_unlock(&inode->i_lock);
> > -				nfs_unlock_request(new);
> > -				return ERR_PTR(error);
> > -			}
> > +			nfs_inode_add_request(inode, new);
> >  			spin_unlock(&inode->i_lock);
> > +			radix_tree_preload_end();
> >  			req = new;
> >  			goto zero_page;
> >  		}
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] nfs: use GFP_NOFS preloads for radix-tree insertion
  2007-11-14  4:20                       ` Nick Piggin
@ 2007-11-14  9:06                         ` Peter Zijlstra
  2007-11-14 15:39                           ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2007-11-14  9:06 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-mm, davem, Trond Myklebust

On Wed, 2007-11-14 at 05:20 +0100, Nick Piggin wrote:
> On Tue, Nov 13, 2007 at 11:55:45AM +0100, Peter Zijlstra wrote:
> > 
> > On Thu, 2007-11-08 at 07:56 +0100, Nick Piggin wrote:
> > > Here is the NFS version. I guess Trond should ack it before you pick it
> > > up.
> > > 
> > > --
> > > 
> > > NFS should use GFP_NOFS mode radix tree preloads rather than GFP_ATOMIC
> > > allocations at radix-tree insertion-time. This is important to reduce the
> > > atomic memory requirement.
> > 
> > In another mail you said:
> > 
> > > Anyway we can also simplify the code because the insertion can't fail with a
> > > preload.
> > 
> > Can we please avoid adding strict dependencies on that as the preload
> > API is unsupportable in -rt.
> 
> You can surely support it. You just have to do per-thread preloads if you
> want preemption left on.

Well, true, but that would mean adding stuff to task_struct, not the end
of the world I guess.

But as it is leaving the error handling on each individual
radix_tree_insert() allows us to just use GFP_KERNEL for everything.

The other, nicer option, is to do preload on the radix_tree_context
object instead.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] nfs: use GFP_NOFS preloads for radix-tree insertion
  2007-11-14  9:06                         ` Peter Zijlstra
@ 2007-11-14 15:39                           ` Nick Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2007-11-14 15:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-mm, davem, Trond Myklebust

On Wed, Nov 14, 2007 at 10:06:27AM +0100, Peter Zijlstra wrote:
> 
> On Wed, 2007-11-14 at 05:20 +0100, Nick Piggin wrote:
> > On Tue, Nov 13, 2007 at 11:55:45AM +0100, Peter Zijlstra wrote:
> > > 
> > > On Thu, 2007-11-08 at 07:56 +0100, Nick Piggin wrote:
> > > > Here is the NFS version. I guess Trond should ack it before you pick it
> > > > up.
> > > > 
> > > > --
> > > > 
> > > > NFS should use GFP_NOFS mode radix tree preloads rather than GFP_ATOMIC
> > > > allocations at radix-tree insertion-time. This is important to reduce the
> > > > atomic memory requirement.
> > > 
> > > In another mail you said:
> > > 
> > > > Anyway we can also simplify the code because the insertion can't fail with a
> > > > preload.
> > > 
> > > Can we please avoid adding strict dependencies on that as the preload
> > > API is unsupportable in -rt.
> > 
> > You can surely support it. You just have to do per-thread preloads if you
> > want preemption left on.
> 
> Well, true, but that would mean adding stuff to task_struct, not the end
> of the world I guess.
> 
> But as it is leaving the error handling on each individual
> radix_tree_insert() allows us to just use GFP_KERNEL for everything.

Hmm, then you reintroduce the lock ordering which I got rid of. Not that
it's a particularly big deal in this case. I don't think you should noop
fundamental things like this just because they turn preempt off. At any
rate, it's not something that mainline can really be concerned with...


> The other, nicer option, is to do preload on the radix_tree_context
> object instead.

I don't know if you'd call it nicer... at least, not as nice as what's
upstream. So it seems like you'd have to have a custom solution anyway,
given that per-cpu preloads are probably the best we can do upstream.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2007-11-14 15:39 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-08  0:43 [patch] radix-tree: avoid atomic allocations for preloaded insertions Nick Piggin
2007-11-08  1:09 ` Andrew Morton
2007-11-08  1:34   ` David Miller, Andrew Morton
2007-11-08  1:41     ` Andrew Morton
2007-11-08  1:45       ` David Miller, Andrew Morton
2007-11-08  1:37   ` Nick Piggin
2007-11-08  3:02     ` Andrew Morton
2007-11-08  3:16       ` Nick Piggin
2007-11-08  4:12         ` Andrew Morton
2007-11-08  4:54           ` Nick Piggin
2007-11-08  5:02             ` Andrew Morton
2007-11-08  5:44               ` Nick Piggin
2007-11-08  6:02                 ` Andrew Morton
2007-11-08  6:54                   ` Nick Piggin
2007-11-08  6:56                   ` [patch] nfs: use GFP_NOFS preloads for radix-tree insertion Nick Piggin
2007-11-13 10:55                     ` Peter Zijlstra
2007-11-14  4:20                       ` Nick Piggin
2007-11-14  9:06                         ` Peter Zijlstra
2007-11-14 15:39                           ` Nick Piggin
2007-11-08 11:57           ` [patch] radix-tree: avoid atomic allocations for preloaded insertions Peter Zijlstra
2007-11-08 20:37             ` Nick Piggin
2007-11-08 20:47               ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox