linux-mm.kvack.org archive mirror
* post-3.18 performance regression in TLB flushing code
@ 2014-12-16 21:36 Dave Hansen
  2014-12-17 10:08 ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2014-12-16 21:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Peter Zijlstra, Russell King - ARM Linux,
	Michal Simek, Linus Torvalds, Will Deacon, LKML, Linux-MM


I'm running the 'brk1' test from will-it-scale:

> https://github.com/antonblanchard/will-it-scale/blob/master/tests/brk1.c

on an 8-socket/160-thread system.  It's seeing about a 6% drop in
performance (263M -> 247M ops/sec at 80 threads) from this commit:

	commit fb7332a9fedfd62b1ba6530c86f39f0fa38afd49
	Author: Will Deacon <will.deacon@arm.com>
	Date:   Wed Oct 29 10:03:09 2014 +0000

	 mmu_gather: move minimal range calculations into generic code

tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
applied, but does not show up at all on the commit before.

This isn't a major regression, but it is rather unfortunate for a patch
that is apparently a code cleanup.  It also _looks_ to show up even when
things are single-threaded, although I haven't looked at it in detail.

I suspect the tlb->need_flush logic was serving some role that the
modified code isn't capturing, as in this hunk:

>  void tlb_flush_mmu(struct mmu_gather *tlb)
>  {
> -       if (!tlb->need_flush)
> -               return;
>         tlb_flush_mmu_tlbonly(tlb);
>         tlb_flush_mmu_free(tlb);
>  }

tlb_flush_mmu_tlbonly() has a tlb->end check (which replaces the
->need_flush logic), but tlb_flush_mmu_free() does not.

If we add a !tlb->end check (patch attached) to tlb_flush_mmu(), that gets
us back up to ~258M ops/sec, but that's still ~2% down from where we started.

[-- Attachment #2: fix-old-need_flush-logic.patch --]



---

 b/mm/memory.c |    3 +++
 1 file changed, 3 insertions(+)

diff -puN mm/memory.c~fix-old-need_flush-logic mm/memory.c
--- a/mm/memory.c~fix-old-need_flush-logic	2014-12-16 13:24:27.338557014 -0800
+++ b/mm/memory.c	2014-12-16 13:24:50.412598019 -0800
@@ -258,6 +258,9 @@ static void tlb_flush_mmu_free(struct mm
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
 {
+	if (!tlb->end)
+		return;
+
 	tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
 }
_


* Re: post-3.18 performance regression in TLB flushing code
  2014-12-16 21:36 post-3.18 performance regression in TLB flushing code Dave Hansen
@ 2014-12-17 10:08 ` Will Deacon
  2014-12-17 16:28   ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Will Deacon @ 2014-12-17 10:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Benjamin Herrenschmidt, Peter Zijlstra, Russell King - ARM Linux,
	Michal Simek, Linus Torvalds, LKML, Linux-MM

Hi Dave,

Thanks for reporting this.

On Tue, Dec 16, 2014 at 09:36:56PM +0000, Dave Hansen wrote:
> I'm running the 'brk1' test from will-it-scale:
> 
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/brk1.c
> 
> on an 8-socket/160-thread system.  It's seeing about a 6% drop in
> performance (263M -> 247M ops/sec at 80 threads) from this commit:

This is x86, right?

> 	commit fb7332a9fedfd62b1ba6530c86f39f0fa38afd49
> 	Author: Will Deacon <will.deacon@arm.com>
> 	Date:   Wed Oct 29 10:03:09 2014 +0000
> 
> 	 mmu_gather: move minimal range calculations into generic code
> 
> tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
> tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
> applied, but does not show up at all on the commit before.

Ouch...

> This isn't a major regression, but it is rather unfortunate for a patch
> that is apparently a code cleanup.  It also _looks_ to show up even when
> things are single-threaded, although I haven't looked at it in detail.

Ok, so there are two parts to this patch:

  (1) Fixing an issue where the arch code (arm64 in my case) wanted to
      adjust the range behind the back of the core code, which resulted
      in negative ranges being flushed

  (2) A cleanup removing the need_flush field, since we can now rely
      on tlb->end != 0 to indicate that a flush is needed

> I suspect the tlb->need_flush logic was serving some role that the
> modified code isn't capturing, as in this hunk:
> 
> >  void tlb_flush_mmu(struct mmu_gather *tlb)
> >  {
> > -       if (!tlb->need_flush)
> > -               return;
> >         tlb_flush_mmu_tlbonly(tlb);
> >         tlb_flush_mmu_free(tlb);
> >  }
> 
> tlb_flush_mmu_tlbonly() has a tlb->end check (which replaces the
> ->need_flush logic), but tlb_flush_mmu_free() does not.

Yes, I thought tlb_flush_mmu_free() wouldn't do anything if we hadn't
batched any pages, but free_pages_and_swap_cache() actually does have
work outside of the loop.

> If we add a !tlb->end check (patch attached) to tlb_flush_mmu(), that gets
> us back up to ~258M ops/sec, but that's still ~2% down from where we started.

I think there are a couple of things you could try to see if that 2% comes
back:

  * Revert the patch and try the one here [1] instead (which only does part
    (1) of the above).

-- or --

  * Instead of adding the tlb->end check to tlb_flush_mmu, add it to
    tlb_flush_mmu_free

Could you give that a go, please?

Cheers,

Will

[1] http://www.spinics.net/lists/kernel/msg1855260.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: post-3.18 performance regression in TLB flushing code
  2014-12-17 10:08 ` Will Deacon
@ 2014-12-17 16:28   ` Linus Torvalds
  2014-12-17 16:53     ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2014-12-17 16:28 UTC (permalink / raw)
  To: Will Deacon
  Cc: Dave Hansen, Benjamin Herrenschmidt, Peter Zijlstra,
	Russell King - ARM Linux, Michal Simek, LKML, Linux-MM


On Wed, Dec 17, 2014 at 2:08 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> I think there are a couple of things you could try to see if that 2% comes
> back:
>
>   * Revert the patch and try the one here [1] instead (which only does part
>     (1) of the above).
>
> -- or --
>
>   * Instead of adding the tlb->end check to tlb_flush_mmu, add it to
>     tlb_flush_mmu_free

or just move the check back to tlb_flush_mmu() where it belongs.

I don't see why you moved it to "tlb_flush_mmu_tlbonly()" in the first
place, or why you'd now want to add it to tlb_flush_mmu_free().

Both of those helper functions have two callers:

 - tlb_flush_mmu(). Doing it here (instead of in the helper functions)
is the right thing to do

 - the "force_flush" case: we know we have added at least one page to
the TLB state so checking for it is pointless.

So I'm not seeing why you wanted to do it in tlb_flush_mmu_tlbonly(),
and now add it to tlb_flush_mmu_free(). That seems bogus.

So why not just this trivial patch, to make the logic the same as it
used to be (just using "end > 0" instead of the old "need_flush")?

                           Linus

[-- Attachment #2: patch.diff --]

 mm/memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c3b9097251c5..6efe36a998ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -235,9 +235,6 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 
 static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-	if (!tlb->end)
-		return;
-
 	tlb_flush(tlb);
 	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
@@ -259,6 +256,9 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
 {
+	if (!tlb->end)
+		return;
+
 	tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
 }


* Re: post-3.18 performance regression in TLB flushing code
  2014-12-17 16:28   ` Linus Torvalds
@ 2014-12-17 16:53     ` Will Deacon
  2014-12-17 18:52       ` Dave Hansen
  0 siblings, 1 reply; 6+ messages in thread
From: Will Deacon @ 2014-12-17 16:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Benjamin Herrenschmidt, Peter Zijlstra,
	Russell King - ARM Linux, Michal Simek, LKML, Linux-MM

On Wed, Dec 17, 2014 at 04:28:23PM +0000, Linus Torvalds wrote:
> On Wed, Dec 17, 2014 at 2:08 AM, Will Deacon <will.deacon@arm.com> wrote:
> >
> > I think there are a couple of things you could try to see if that 2% comes
> > back:
> >
> >   * Revert the patch and try the one here [1] instead (which only does part
> >     (1) of the above).
> >
> > -- or --
> >
> >   * Instead of adding the tlb->end check to tlb_flush_mmu, add it to
> >     tlb_flush_mmu_free
> 
> or just move the check back to tlb_flush_mmu() where it belongs.
> 
> I don't see why you moved it to "tlb_flush_mmu_tlbonly()" in the first
> place, or why you'd now want to add it to tlb_flush_mmu_free().
> 
> Both of those helper functions have two callers:
> 
>  - tlb_flush_mmu(). Doing it here (instead of in the helper functions)
> is the right thing to do
> 
>  - the "force_flush" case: we know we have added at least one page to
> the TLB state so checking for it is pointless.
> 
> So I'm not seeing why you wanted to do it in tlb_flush_mmu_tlbonly(),
> and now add it to tlb_flush_mmu_free(). That seems bogus.

I guess I was being overly cautious in case tlb_flush_mmu_tlbonly grows
additional users, but you're right.

> So why not just this trivial patch, to make the logic the same as it
> used to be (just using "end > 0" instead of the old "need_flush")?

Looks fine to me... Dave?

Will

> diff --git a/mm/memory.c b/mm/memory.c
> index c3b9097251c5..6efe36a998ba 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -235,9 +235,6 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
>  
>  static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>  {
> -	if (!tlb->end)
> -		return;
> -
>  	tlb_flush(tlb);
>  	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
>  #ifdef CONFIG_HAVE_RCU_TABLE_FREE
> @@ -259,6 +256,9 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
>  
>  void tlb_flush_mmu(struct mmu_gather *tlb)
>  {
> +	if (!tlb->end)
> +		return;
> +
>  	tlb_flush_mmu_tlbonly(tlb);
>  	tlb_flush_mmu_free(tlb);
>  }


* Re: post-3.18 performance regression in TLB flushing code
  2014-12-17 16:53     ` Will Deacon
@ 2014-12-17 18:52       ` Dave Hansen
  2014-12-17 19:58         ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2014-12-17 18:52 UTC (permalink / raw)
  To: Will Deacon, Linus Torvalds
  Cc: Benjamin Herrenschmidt, Peter Zijlstra, Russell King - ARM Linux,
	Michal Simek, LKML, Linux-MM

On 12/17/2014 08:53 AM, Will Deacon wrote:
> On Wed, Dec 17, 2014 at 04:28:23PM +0000, Linus Torvalds wrote:
>> On Wed, Dec 17, 2014 at 2:08 AM, Will Deacon <will.deacon@arm.com> wrote:
>> So why not just this trivial patch, to make the logic the same as it
>> used to be (just using "end > 0" instead of the old "need_flush")?
> 
> Looks fine to me... Dave?

First of all, this is quite observable when testing single-threaded on a
desktop.  This is a mildly crusty Sandy Bridge CPU from 2011.  I made
three runs with a single thread: ./brk1_processes -s 30 -t 1

	            fb7332a9fed: 4323385
	           fb7332a9fed^: 4503736
	fb7332a9fed+Linus's fix: 4516761

These numbers are also a little bit noisy, so with Linus's fix we're
well within the margin of error.

This also holds up on the large system.


* Re: post-3.18 performance regression in TLB flushing code
  2014-12-17 18:52       ` Dave Hansen
@ 2014-12-17 19:58         ` Linus Torvalds
  0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2014-12-17 19:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Will Deacon, Benjamin Herrenschmidt, Peter Zijlstra,
	Russell King - ARM Linux, Michal Simek, LKML, Linux-MM

On Wed, Dec 17, 2014 at 10:52 AM, Dave Hansen <dave@sr71.net> wrote:
>
> These things are also a little bit noisy, so we're well within the
> margin of error with Linus's fix.
>
> This also holds up on the large system.

Good. I'll commit it asap.

                          Linus

