linux-mm.kvack.org archive mirror
* Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
@ 2024-09-03 12:21 Anders Roxell
  2024-09-03 12:37 ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Anders Roxell @ 2024-09-03 12:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

Hi,

I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
are running about 2% slower on v6.10-rc1 compared to v6.9, and this
slowdown continues with v6.11-rc4. I am focused on identifying any
performance regressions greater than 2% that occur in automated
testing on arm64 HW.

Using git bisect, I traced the issue to commit
f002882ca369 ("mm: merge folio_is_secretmem() and
folio_fast_pin_allowed() into gup_fast_folio_allowed()").

My tests were performed on m7g.large and m7g.metal instances:

* The slowdown is consistent regardless of the number of threads;
   futex1-threads-128 performs similarly to futex1-threads-2, indicating
   there is no scalability issue, just a minor performance overhead.
* The test doesn’t involve actual futex operations, just dummy wake/wait
   on a variable that isn’t accessed by other threads, so the results might
   not be very significant.

Given that this seems to be a minor increase in code path length rather
than a scalability issue, would this be considered a genuine regression?


Cheers,
Anders


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
  2024-09-03 12:21 Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4 Anders Roxell
@ 2024-09-03 12:37 ` David Hildenbrand
  2024-09-04 10:05   ` Anders Roxell
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2024-09-03 12:37 UTC (permalink / raw)
  To: Anders Roxell
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

On 03.09.24 14:21, Anders Roxell wrote:
> Hi,
> 
> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> slowdown continues with v6.11-rc4. I am focused on identifying any
> performance regressions greater than 2% that occur in automated
> testing on arm64 HW.
> 
> Using git bisect, I traced the issue to commit
> f002882ca369 ("mm: merge folio_is_secretmem() and
> folio_fast_pin_allowed() into gup_fast_folio_allowed()").

Thanks for analyzing the (slight) regression!

> 
> My tests were performed on m7g.large and m7g.metal instances:
> 
> * The slowdown is consistent regardless of the number of threads;
>     futex1-threads-128 performs similarly to futex1-threads-2, indicating
>     there is no scalability issue, just a minor performance overhead.
> * The test doesn’t involve actual futex operations, just dummy wake/wait
>     on a variable that isn’t accessed by other threads, so the results might
>     not be very significant.
> 
> Given that this seems to be a minor increase in code path length rather
> than a scalability issue, would this be considered a genuine regression?

Likely not, I've seen these kinds of regressions (for example in my fork
micro-benchmarks) simply because the compiler slightly changes the code
layout, or suddenly decides to not inline a function.

Still it is rather unexpected, so let's find out what's happening.

My first intuition would have been that the compiler now decides to not
inline gup_fast_folio_allowed() anymore, adding a function call.

LLVM seems to inline it for me. GCC not.

Would this return the original behavior for you?

diff --git a/mm/gup.c b/mm/gup.c
index 69c483e2cc32d..6642f09c95881 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2726,7 +2726,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
   * in the fast path, so instead we whitelist known good cases and if in doubt,
   * fall back to the slow path.
   */
-static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
+static __always_inline bool gup_fast_folio_allowed(struct folio *folio,
+               unsigned int flags)
  {
         bool reject_file_backed = false;
         struct address_space *mapping;


-- 
Cheers,

David / dhildenb




* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
  2024-09-03 12:37 ` David Hildenbrand
@ 2024-09-04 10:05   ` Anders Roxell
  2024-09-04 13:47     ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Anders Roxell @ 2024-09-04 10:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
>
> On 03.09.24 14:21, Anders Roxell wrote:
> > Hi,
> >
> > I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> > are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> > slowdown continues with v6.11-rc4. I am focused on identifying any
> > performance regressions greater than 2% that occur in automated
> > testing on arm64 HW.
> >
> > Using git bisect, I traced the issue to commit
> > f002882ca369 ("mm: merge folio_is_secretmem() and
> > folio_fast_pin_allowed() into gup_fast_folio_allowed()").
>
> Thanks for analyzing the (slight) regression!
>
> >
> > My tests were performed on m7g.large and m7g.metal instances:
> >
> > * The slowdown is consistent regardless of the number of threads;
> >     futex1-threads-128 performs similarly to futex1-threads-2, indicating
> >     there is no scalability issue, just a minor performance overhead.
> > * The test doesn’t involve actual futex operations, just dummy wake/wait
> >     on a variable that isn’t accessed by other threads, so the results might
> >     not be very significant.
> >
> > Given that this seems to be a minor increase in code path length rather
> > than a scalability issue, would this be considered a genuine regression?
>
> Likely not, I've seen these kinds of regressions (for example in my fork
> micro-benchmarks) simply because the compiler slightly changes the code
> layout, or suddenly decides to not inline a function.
>
> Still it is rather unexpected, so let's find out what's happening.
>
> My first intuition would have been that the compiler now decides to not
> inline gup_fast_folio_allowed() anymore, adding a function call.
>
> LLVM seems to inline it for me. GCC not.
>
> Would this return the original behavior for you?

David, thank you for the quick patch to try.

This patch helped the original regression on v6.10-rc1, but on current mainline
v6.11-rc6 the patch does nothing and the performance is as expected.


Cheers,
Anders

>
> diff --git a/mm/gup.c b/mm/gup.c
> index 69c483e2cc32d..6642f09c95881 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2726,7 +2726,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>    * in the fast path, so instead we whitelist known good cases and if in doubt,
>    * fall back to the slow path.
>    */
> -static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
> +static __always_inline bool gup_fast_folio_allowed(struct folio *folio,
> +               unsigned int flags)
>   {
>          bool reject_file_backed = false;
>          struct address_space *mapping;
>
>
> --
> Cheers,
>
> David / dhildenb
>



* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
  2024-09-04 10:05   ` Anders Roxell
@ 2024-09-04 13:47     ` David Hildenbrand
  2024-09-04 15:51       ` Anders Roxell
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2024-09-04 13:47 UTC (permalink / raw)
  To: Anders Roxell
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

On 04.09.24 12:05, Anders Roxell wrote:
> On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.09.24 14:21, Anders Roxell wrote:
>>> Hi,
>>>
>>> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
>>> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
>>> slowdown continues with v6.11-rc4. I am focused on identifying any
>>> performance regressions greater than 2% that occur in automated
>>> testing on arm64 HW.
>>>
>>> Using git bisect, I traced the issue to commit
>>> f002882ca369 ("mm: merge folio_is_secretmem() and
>>> folio_fast_pin_allowed() into gup_fast_folio_allowed()").
>>
>> Thanks for analyzing the (slight) regression!
>>
>>>
>>> My tests were performed on m7g.large and m7g.metal instances:
>>>
>>> * The slowdown is consistent regardless of the number of threads;
>>>      futex1-threads-128 performs similarly to futex1-threads-2, indicating
>>>      there is no scalability issue, just a minor performance overhead.
>>> * The test doesn’t involve actual futex operations, just dummy wake/wait
>>>      on a variable that isn’t accessed by other threads, so the results might
>>>      not be very significant.
>>>
>>> Given that this seems to be a minor increase in code path length rather
>>> than a scalability issue, would this be considered a genuine regression?
>>
>> Likely not, I've seen these kinds of regressions (for example in my fork
>> micro-benchmarks) simply because the compiler slightly changes the code
>> layout, or suddenly decides to not inline a function.
>>
>> Still it is rather unexpected, so let's find out what's happening.
>>
>> My first intuition would have been that the compiler now decides to not
>> inline gup_fast_folio_allowed() anymore, adding a function call.
>>
>> LLVM seems to inline it for me. GCC not.
>>
>> Would this return the original behavior for you?
> 
> David, thank you for the quick patch to try.
> 
> This patch helped the original regression on v6.10-rc1, but on current mainline
> v6.11-rc6 the patch does nothing and the performance is as expected.

Just so I understand this correctly:

It fixed itself after v6.11-rc4, but v6.11-rc4 was fixed with my patch?

If that's the case, then it's really the compiler deciding whether to 
inline or not, and on v6.11-rc6 it decides to inline again.

-- 
Cheers,

David / dhildenb




* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
  2024-09-04 13:47     ` David Hildenbrand
@ 2024-09-04 15:51       ` Anders Roxell
  0 siblings, 0 replies; 5+ messages in thread
From: Anders Roxell @ 2024-09-04 15:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

On Wed, 4 Sept 2024 at 15:47, David Hildenbrand <david@redhat.com> wrote:
>
> On 04.09.24 12:05, Anders Roxell wrote:
> > On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 03.09.24 14:21, Anders Roxell wrote:
> >>> Hi,
> >>>
> >>> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> >>> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> >>> slowdown continues with v6.11-rc4. I am focused on identifying any
> >>> performance regressions greater than 2% that occur in automated
> >>> testing on arm64 HW.
> >>>
> >>> Using git bisect, I traced the issue to commit
> >>> f002882ca369 ("mm: merge folio_is_secretmem() and
> >>> folio_fast_pin_allowed() into gup_fast_folio_allowed()").
> >>
> >> Thanks for analyzing the (slight) regression!
> >>
> >>>
> >>> My tests were performed on m7g.large and m7g.metal instances:
> >>>
> >>> * The slowdown is consistent regardless of the number of threads;
> >>>      futex1-threads-128 performs similarly to futex1-threads-2, indicating
> >>>      there is no scalability issue, just a minor performance overhead.
> >>> * The test doesn’t involve actual futex operations, just dummy wake/wait
> >>>      on a variable that isn’t accessed by other threads, so the results might
> >>>      not be very significant.
> >>>
> >>> Given that this seems to be a minor increase in code path length rather
> >>> than a scalability issue, would this be considered a genuine regression?
> >>
> >> Likely not, I've seen these kinds of regressions (for example in my fork
> >> micro-benchmarks) simply because the compiler slightly changes the code
> >> layout, or suddenly decides to not inline a function.
> >>
> >> Still it is rather unexpected, so let's find out what's happening.
> >>
> >> My first intuition would have been that the compiler now decides to not
> >> inline gup_fast_folio_allowed() anymore, adding a function call.
> >>
> >> LLVM seems to inline it for me. GCC not.
> >>
> >> Would this return the original behavior for you?
> >
> > David, thank you for the quick patch to try.
> >
> > This patch helped the original regression on v6.10-rc1, but on current mainline
> > v6.11-rc6 the patch does nothing and the performance is as expected.
>
> Just so I understand this correctly:
>
> It fixed itself after v6.11-rc4, but v6.11-rc4 was fixed with my patch?

I had to double check and no, on v6.11-rc4 with or without your patch
I see the 2% regression.

Cheers,
Anders

>
> If that's the case, then it's really the compiler deciding whether to
> inline or not, and on v6.11-rc6 it decides to inline again.
>
> --
> Cheers,
>
> David / dhildenb
>



end of thread, other threads:[~2024-09-04 15:51 UTC | newest]
