* [PATCH] mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust()
@ 2009-09-09 20:10 Lee Schermerhorn
2009-09-13 13:24 ` Hugh Dickins
0 siblings, 1 reply; 3+ messages in thread
From: Lee Schermerhorn @ 2009-09-09 20:10 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Andrew Morton, Nick Piggin, Hugh Dickins, stable, Eric Whitney
Against: 2.6.31-rc6
We noticed very erratic behavior [throughput] with the AIM7 shared
workload running on recent distro [SLES11] and mainline kernels on
an 8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel
[2.6.27.19+] with Barcelona processors, as we increased the load
[10s of thousands of tasks], the throughput would vary between two
"plateaus"--one at ~65K jobs per minute and one at ~130K jpm. The
simple patch below causes the results to smooth out at the ~130k
plateau.
But wait, there's more:
We do not see this behavior on smaller platforms--e.g., 4 socket/8
core. This could be the result of the larger number of cpus on
the larger platform--a scalability issue--or it could be the result
of the larger number of interconnect "hops" between some nodes in
this platform and how the tasks for a given load end up distributed
over the nodes' cpus and memories--a stochastic NUMA effect.
The variability in the results are less pronounced [on the same
platform] with Shanghai processors and with mainline kernels. With
31-rc6 on Shanghai processors and 288 file systems on 288 fibre
attached storage volumes, the curves [jpm vs load] are both quite
flat with the patched kernel consistently producing ~3.9% better
throughput [~80K jpm vs ~77K jpm] than the unpatched kernel.
Profiling indicated that the "slow" runs were incurring high[er]
contention on an anon_vma lock in vma_adjust(), apparently called
from the sbrk() system call.
The patch:
A comment in mm/mmap.c:vma_adjust() suggests that we don't really
need the anon_vma lock when we're only adjusting the end of a vma,
as is the case for brk(). The comment questions whether it's worth
while to optimize for this case. Apparently, on the newer, larger
x86_64 platforms, with interesting NUMA topologies, it is worth
while--especially considering that the patch [if correct!] is
quite simple.
We can detect this condition--no overlap with next vma--by noting
a NULL "importer". The anon_vma pointer will also be NULL in this
case, so simply avoid loading vma->anon_vma to avoid the lock.
However, we apparently DO need to take the anon_vma lock when
we're inserting a vma ['insert' non-NULL] even when we have no
overlap [NULL "importer"], so we need to check for 'insert', as well.
I have tested with and without the 'file || ' test in the patch.
This does not seem to matter for stability nor performance. I
left this check/filter in, so we only optimize away the
anon_vma lock acquisition when adjusting the end of a non-
importing, non-inserting, anon vma.
If accepted, this patch may benefit the stable tree as well.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6.31-rc6/mm/mmap.c
===================================================================
--- linux-2.6.31-rc6.orig/mm/mmap.c 2009-08-19 14:34:13.000000000 -0400
+++ linux-2.6.31-rc6/mm/mmap.c 2009-08-19 14:53:24.000000000 -0400
@@ -573,9 +573,9 @@ again: remove_next = 1 + (end > next->
/*
* When changing only vma->vm_end, we don't really need
- * anon_vma lock: but is that case worth optimizing out?
+ * anon_vma lock.
*/
- if (vma->anon_vma)
+ if ((file || insert || importer) && vma->anon_vma)
anon_vma = vma->anon_vma;
if (anon_vma) {
spin_lock(&anon_vma->lock);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 3+ messages in thread* Re: [PATCH] mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() 2009-09-09 20:10 [PATCH] mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() Lee Schermerhorn @ 2009-09-13 13:24 ` Hugh Dickins 2009-09-14 19:35 ` Lee Schermerhorn 0 siblings, 1 reply; 3+ messages in thread From: Hugh Dickins @ 2009-09-13 13:24 UTC (permalink / raw) To: Lee Schermerhorn Cc: linux-kernel, linux-mm, Andrew Morton, Nick Piggin, Eric Whitney, stable [I've kept stable@kernel.org in the Cc list only to put on record that I don't think that this patch is really suitable for -stable] On Wed, 9 Sep 2009, Lee Schermerhorn wrote: > Thanks for this. Interesting stuff snipped and repeated below. > > A comment in mm/mmap.c:vma_adjust() suggests that we don't really > need the anon_vma lock when we're only adjusting the end of a vma, I feel a warm smugness from having foreseen the possibility that we might want to optimize that away. I didn't do so, because it needs more thought (and more branches) than seemed worthwhile at the time: you've found similar difficulty, but now it does seem worthwhile. > We can detect this condition--no overlap with next vma--by noting > a NULL "importer". The anon_vma pointer will also be NULL in this > case, so simply avoid loading vma->anon_vma to avoid the lock. > However, we apparently DO need to take the anon_vma lock when > we're inserting a vma ['insert' non-NULL] even when we have no > overlap [NULL "importer"], so we need to check for 'insert', as well. Those importer and insert checks are good and relevant, but not quite enough. The anon_vma lock should also be guaranteeing the integrity of the relationship between vm_start and vm_pgoff for all the vmas attached to the anon_vma, so that rmap.c can rely upon vma_address() to work correctly while it holds anon_vma lock. That's a considerably less important consideration than the integrity of the list threading itself. Anything BUGging on a wrong page->index is holding mmap_sem, which would keep vma_adjust off. So it's just a matter of whether rmap.c can be expected to find all instances of a page at all times, which nothing absolutely requires (and in checking this patch, I notice fs/exec.c's shift_arg_pages() use of vma_adjust() a little violatory in that respect). But it is something the anon_vma lock has protected in the past, and it shouldn't affect your sbrk() case at all, so I'd like to check we're not changing vm_start too (vm_pgoff should be changing with it, but shift_arg_pages() deals with that in a different way, keeping vm_pgoff unchanged but shifting the pages). (Compare with how stack's expand_downwards() has anon_vma_lock() when it adjusts vm_start and vm_pgoff - though that's also because it has only down_read of mmap_sem, not the down_write we'd usually require for such adjustments.) > > I have tested with and without the 'file || ' test in the patch. > This does not seem to matter for stability nor performance. I > left this check/filter in, so we only optimize away the > anon_vma lock acquisition when adjusting the end of a non- > importing, non-inserting, anon vma. I dislike that: the "file" test just has no relevance at all (beyond that you've no interest in the case when there is a file), and two years down the line will make people like me worry for hours on end what it's there for. I've removed it below. > > If accepted, this patch may benefit the stable tree as well. We seem to have a different perception of what the stable tree is for! But I'm not against any distro picking up this patch if it chooses. Here's a version with my signoff: [PATCH] mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() From: Lee Schermerhorn <lee.schermerhorn@hp.com> We noticed very erratic behavior [throughput] with the AIM7 shared workload running on recent distro [SLES11] and mainline kernels on an 8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel [2.6.27.19+] with Barcelona processors, as we increased the load [10s of thousands of tasks], the throughput would vary between two "plateaus"--one at ~65K jobs per minute and one at ~130K jpm. The simple patch below causes the results to smooth out at the ~130k plateau. But wait, there's more: We do not see this behavior on smaller platforms--e.g., 4 socket/8 core. This could be the result of the larger number of cpus on the larger platform--a scalability issue--or it could be the result of the larger number of interconnect "hops" between some nodes in this platform and how the tasks for a given load end up distributed over the nodes' cpus and memories--a stochastic NUMA effect. The variability in the results are less pronounced [on the same platform] with Shanghai processors and with mainline kernels. With 31-rc6 on Shanghai processors and 288 file systems on 288 fibre attached storage volumes, the curves [jpm vs load] are both quite flat with the patched kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K jpm] than the unpatched kernel. Profiling indicated that the "slow" runs were incurring high[er] contention on an anon_vma lock in vma_adjust(), apparently called from the sbrk() system call. The patch: A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the anon_vma lock when we're only adjusting the end of a vma, as is the case for brk(). The comment questions whether it's worth while to optimize for this case. Apparently, on the newer, larger x86_64 platforms, with interesting NUMA topologies, it is worth while--especially considering that the patch [if correct!] is quite simple. We can detect this condition--no overlap with next vma--by noting a NULL "importer". The anon_vma pointer will also be NULL in this case, so simply avoid loading vma->anon_vma to avoid the lock. However, we DO need to take the anon_vma lock when we're inserting a vma ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we need to check for 'insert', as well. And Hugh points out that we should also take it when adjusting vm_start (so that rmap.c can rely upon vma_address() while it holds the anon_vma lock). Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Nick Piggin <npiggin@suse.de> Cc: Eric Whitney <eric.whitney@hp.com> --- mm/mmap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- 2.6.31/mm/mmap.c 2009-09-09 23:13:59.000000000 +0100 +++ linux/mm/mmap.c 2009-09-13 13:08:40.000000000 +0100 @@ -570,9 +570,9 @@ again: remove_next = 1 + (end > next-> /* * When changing only vma->vm_end, we don't really need - * anon_vma lock: but is that case worth optimizing out? + * anon_vma lock. */ - if (vma->anon_vma) + if (vma->anon_vma && (insert || importer || start != vma->vm_start)) anon_vma = vma->anon_vma; if (anon_vma) { spin_lock(&anon_vma->lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() 2009-09-13 13:24 ` Hugh Dickins @ 2009-09-14 19:35 ` Lee Schermerhorn 0 siblings, 0 replies; 3+ messages in thread From: Lee Schermerhorn @ 2009-09-14 19:35 UTC (permalink / raw) To: Hugh Dickins, Andrew Morton Cc: linux-kernel, linux-mm, Nick Piggin, Eric Whitney, stable On Sun, 2009-09-13 at 14:24 +0100, Hugh Dickins wrote: > [I've kept stable@kernel.org in the Cc list only to put on record > that I don't think that this patch is really suitable for -stable] > > On Wed, 9 Sep 2009, Lee Schermerhorn wrote: > > > Thanks for this. Interesting stuff snipped and repeated below. Hugh: Thank you for reviewing and correcting the patch. Andrew: Shall I resend? Or, can you grab Hugh's patch from his message? Lee > > > > > A comment in mm/mmap.c:vma_adjust() suggests that we don't really > > need the anon_vma lock when we're only adjusting the end of a vma, > > I feel a warm smugness from having foreseen the possibility that we > might want to optimize that away. I didn't do so, because it needs > more thought (and more branches) than seemed worthwhile at the time: > you've found similar difficulty, but now it does seem worthwhile. > > > We can detect this condition--no overlap with next vma--by noting > > a NULL "importer". The anon_vma pointer will also be NULL in this > > case, so simply avoid loading vma->anon_vma to avoid the lock. > > However, we apparently DO need to take the anon_vma lock when > > we're inserting a vma ['insert' non-NULL] even when we have no > > overlap [NULL "importer"], so we need to check for 'insert', as well. > > Those importer and insert checks are good and relevant, but not > quite enough. The anon_vma lock should also be guaranteeing the > integrity of the relationship between vm_start and vm_pgoff for > all the vmas attached to the anon_vma, so that rmap.c can rely > upon vma_address() to work correctly while it holds anon_vma lock. > > That's a considerably less important consideration than the integrity > of the list threading itself. Anything BUGging on a wrong page->index > is holding mmap_sem, which would keep vma_adjust off. So it's just a > matter of whether rmap.c can be expected to find all instances of a > page at all times, which nothing absolutely requires (and in checking > this patch, I notice fs/exec.c's shift_arg_pages() use of vma_adjust() > a little violatory in that respect). > > But it is something the anon_vma lock has protected in the past, > and it shouldn't affect your sbrk() case at all, so I'd like to > check we're not changing vm_start too (vm_pgoff should be changing > with it, but shift_arg_pages() deals with that in a different way, > keeping vm_pgoff unchanged but shifting the pages). > > (Compare with how stack's expand_downwards() has anon_vma_lock() > when it adjusts vm_start and vm_pgoff - though that's also because > it has only down_read of mmap_sem, not the down_write we'd usually > require for such adjustments.) > > > > > I have tested with and without the 'file || ' test in the patch. > > This does not seem to matter for stability nor performance. I > > left this check/filter in, so we only optimize away the > > anon_vma lock acquisition when adjusting the end of a non- > > importing, non-inserting, anon vma. > > I dislike that: the "file" test just has no relevance at all > (beyond that you've no interest in the case when there is a file), > and two years down the line will make people like me worry for > hours on end what it's there for. I've removed it below. > > > > > If accepted, this patch may benefit the stable tree as well. > > We seem to have a different perception of what the stable tree is for! > But I'm not against any distro picking up this patch if it chooses. > Here's a version with my signoff: > > > [PATCH] mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() > > From: Lee Schermerhorn <lee.schermerhorn@hp.com> > > We noticed very erratic behavior [throughput] with the AIM7 shared > workload running on recent distro [SLES11] and mainline kernels on > an 8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel > [2.6.27.19+] with Barcelona processors, as we increased the load > [10s of thousands of tasks], the throughput would vary between two > "plateaus"--one at ~65K jobs per minute and one at ~130K jpm. The > simple patch below causes the results to smooth out at the ~130k > plateau. > > But wait, there's more: > > We do not see this behavior on smaller platforms--e.g., 4 socket/8 > core. This could be the result of the larger number of cpus on > the larger platform--a scalability issue--or it could be the result > of the larger number of interconnect "hops" between some nodes in > this platform and how the tasks for a given load end up distributed > over the nodes' cpus and memories--a stochastic NUMA effect. > > The variability in the results are less pronounced [on the same > platform] with Shanghai processors and with mainline kernels. With > 31-rc6 on Shanghai processors and 288 file systems on 288 fibre > attached storage volumes, the curves [jpm vs load] are both quite > flat with the patched kernel consistently producing ~3.9% better > throughput [~80K jpm vs ~77K jpm] than the unpatched kernel. > > Profiling indicated that the "slow" runs were incurring high[er] > contention on an anon_vma lock in vma_adjust(), apparently called > from the sbrk() system call. > > The patch: > > A comment in mm/mmap.c:vma_adjust() suggests that we don't really > need the anon_vma lock when we're only adjusting the end of a vma, > as is the case for brk(). The comment questions whether it's worth > while to optimize for this case. Apparently, on the newer, larger > x86_64 platforms, with interesting NUMA topologies, it is worth > while--especially considering that the patch [if correct!] is > quite simple. > > We can detect this condition--no overlap with next vma--by noting > a NULL "importer". The anon_vma pointer will also be NULL in this > case, so simply avoid loading vma->anon_vma to avoid the lock. > > However, we DO need to take the anon_vma lock when we're inserting a > vma ['insert' non-NULL] even when we have no overlap [NULL "importer"], > so we need to check for 'insert', as well. And Hugh points out that > we should also take it when adjusting vm_start (so that rmap.c can > rely upon vma_address() while it holds the anon_vma lock). > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> > Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> > Cc: Nick Piggin <npiggin@suse.de> > Cc: Eric Whitney <eric.whitney@hp.com> > --- > > mm/mmap.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > --- 2.6.31/mm/mmap.c 2009-09-09 23:13:59.000000000 +0100 > +++ linux/mm/mmap.c 2009-09-13 13:08:40.000000000 +0100 > @@ -570,9 +570,9 @@ again: remove_next = 1 + (end > next-> > > /* > * When changing only vma->vm_end, we don't really need > - * anon_vma lock: but is that case worth optimizing out? > + * anon_vma lock. > */ > - if (vma->anon_vma) > + if (vma->anon_vma && (insert || importer || start != vma->vm_start)) > anon_vma = vma->anon_vma; > if (anon_vma) { > spin_lock(&anon_vma->lock); > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2009-09-14 19:36 UTC | newest] Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-09-09 20:10 [PATCH] mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() Lee Schermerhorn 2009-09-13 13:24 ` Hugh Dickins 2009-09-14 19:35 ` Lee Schermerhorn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox