* Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
From: Anders Roxell @ 2024-09-03 12:21 UTC
To: David Hildenbrand
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM
Hi,
I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
are running about 2% slower on v6.10-rc1 compared to v6.9, and this
slowdown continues with v6.11-rc4. I am focused on identifying any
performance regressions greater than 2% that occur in automated
testing on arm64 HW.
Using git bisect, I traced the issue to commit
f002882ca369 ("mm: merge folio_is_secretmem() and
folio_fast_pin_allowed() into gup_fast_folio_allowed()").
My tests were performed on m7g.large and m7g.metal instances:
* The slowdown is consistent regardless of the number of threads;
futex1-threads-128 performs similarly to futex1-threads-2, indicating
there is no scalability issue, just a minor performance overhead.
* The test doesn’t involve actual futex operations, just a dummy wake/wait
on a variable that isn’t accessed by other threads, so the results might
not be very significant (see the sketch below).
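For reference, here is a minimal, hypothetical sketch of the kind of
per-thread loop described above; it is an assumption based on the test
description, not the benchmark's actual source. It may also explain why a
GUP commit shows up in a futex benchmark at all: for a futex operation
without FUTEX_PRIVATE_FLAG, get_futex_key() resolves the key via
get_user_pages_fast(), so every iteration exercises the GUP fast path.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the futex1 per-thread loop: each thread
 * operates on its own futex word, so FUTEX_WAIT returns immediately
 * with EAGAIN (the expected value 1 never matches the word's value 0)
 * and FUTEX_WAKE finds no waiters.  No thread ever blocks; the loop
 * only measures raw futex-syscall throughput.
 */
static void futex_loop(unsigned int *futex_word,
		       unsigned long long *iterations)
{
	for (;;) {
		syscall(SYS_futex, futex_word, FUTEX_WAIT, 1, NULL, NULL, 0);
		syscall(SYS_futex, futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
		(*iterations)++;
	}
}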
Given that this seems to be a minor increase in code path length rather
than a scalability issue, would this be considered a genuine regression?
Cheers,
Anders
* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
From: David Hildenbrand @ 2024-09-03 12:37 UTC
To: Anders Roxell
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM
On 03.09.24 14:21, Anders Roxell wrote:
> Hi,
>
> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> slowdown continues with v6.11-rc4. I am focused on identifying any
> performance regressions greater than 2% that occur in automated
> testing on arm64 HW.
>
> Using git bisect, I traced the issue to commit
> f002882ca369 ("mm: merge folio_is_secretmem() and
> folio_fast_pin_allowed() into gup_fast_folio_allowed()").
Thanks for analyzing the (slight) regression!
>
> My tests were performed on m7g.large and m7g.metal instances:
>
> * The slowdown is consistent regardless of the number of threads;
> futex1-threads-128 performs similarly to futex1-threads-2, indicating
> there is no scalability issue, just a minor performance overhead.
> * The test doesn’t involve actual futex operations, just a dummy wake/wait
> on a variable that isn’t accessed by other threads, so the results might
> not be very significant.
>
> Given that this seems to be a minor increase in code path length rather
> than a scalability issue, would this be considered a genuine regression?
Likely not; I've seen these kinds of regressions (for example, in my fork
micro-benchmarks) simply because the compiler slightly changes the code
layout or suddenly decides not to inline a function.
Still, it is rather unexpected, so let's find out what's happening.
My first intuition would have been that the compiler now decides not to
inline gup_fast_folio_allowed() anymore, adding a function call.
LLVM seems to inline it for me; GCC does not.
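(For reference, a rough sketch of what the kernel's __always_inline
annotation used in the diff below expands to, per
include/linux/compiler_types.h; unlike a plain static function, the
always_inline attribute overrides the compiler's inlining cost heuristics
rather than leaving the decision to them:)

/* Roughly the kernel's definition from include/linux/compiler_types.h: */
#define __always_inline inline __attribute__((__always_inline__))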
Would this restore the original behavior for you?
diff --git a/mm/gup.c b/mm/gup.c
index 69c483e2cc32d..6642f09c95881 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2726,7 +2726,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
* in the fast path, so instead we whitelist known good cases and if in doubt,
* fall back to the slow path.
*/
-static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
+static __always_inline bool gup_fast_folio_allowed(struct folio *folio,
+ unsigned int flags)
{
bool reject_file_backed = false;
struct address_space *mapping;
--
Cheers,
David / dhildenb
* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
From: Anders Roxell @ 2024-09-04 10:05 UTC
To: David Hildenbrand
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM
On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
>
> On 03.09.24 14:21, Anders Roxell wrote:
> > Hi,
> >
> > I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> > are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> > slowdown continues with v6.11-rc4. I am focused on identifying any
> > performance regressions greater than 2% that occur in automated
> > testing on arm64 HW.
> >
> > Using git bisect, I traced the issue to commit
> > f002882ca369 ("mm: merge folio_is_secretmem() and
> > folio_fast_pin_allowed() into gup_fast_folio_allowed()").
>
> Thanks for analyzing the (slight) regression!
>
> >
> > My tests were performed on m7g.large and m7g.metal instances:
> >
> > * The slowdown is consistent regardless of the number of threads;
> > futex1-threads-128 performs similarly to futex1-threads-2, indicating
> > there is no scalability issue, just a minor performance overhead.
> > * The test doesn’t involve actual futex operations, just a dummy wake/wait
> > on a variable that isn’t accessed by other threads, so the results might
> > not be very significant.
> >
> > Given that this seems to be a minor increase in code path length rather
> > than a scalability issue, would this be considered a genuine regression?
>
> Likely not; I've seen these kinds of regressions (for example, in my fork
> micro-benchmarks) simply because the compiler slightly changes the code
> layout or suddenly decides not to inline a function.
>
> Still, it is rather unexpected, so let's find out what's happening.
>
> My first intuition would have been that the compiler now decides not to
> inline gup_fast_folio_allowed() anymore, adding a function call.
>
> LLVM seems to inline it for me; GCC does not.
>
> Would this restore the original behavior for you?
David, thank you for the quick patch to try.
This patch fixed the original regression on v6.10-rc1, but on current mainline
v6.11-rc6 the patch makes no difference and the performance is as expected.
Cheers,
Anders
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 69c483e2cc32d..6642f09c95881 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2726,7 +2726,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
> * in the fast path, so instead we whitelist known good cases and if in doubt,
> * fall back to the slow path.
> */
> -static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
> +static __always_inline bool gup_fast_folio_allowed(struct folio *folio,
> + unsigned int flags)
> {
> bool reject_file_backed = false;
> struct address_space *mapping;
>
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
From: David Hildenbrand @ 2024-09-04 13:47 UTC
To: Anders Roxell
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM
On 04.09.24 12:05, Anders Roxell wrote:
> On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.09.24 14:21, Anders Roxell wrote:
>>> Hi,
>>>
>>> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
>>> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
>>> slowdown continues with v6.11-rc4. I am focused on identifying any
>>> performance regressions greater than 2% that occur in automated
>>> testing on arm64 HW.
>>>
>>> Using git bisect, I traced the issue to commit
>>> f002882ca369 ("mm: merge folio_is_secretmem() and
>>> folio_fast_pin_allowed() into gup_fast_folio_allowed()").
>>
>> Thanks for analyzing the (slight) regression!
>>
>>>
>>> My tests were performed on m7g.large and m7g.metal instances:
>>>
>>> * The slowdown is consistent regardless of the number of threads;
>>> futex1-threads-128 performs similarly to futex1-threads-2, indicating
>>> there is no scalability issue, just a minor performance overhead.
>>> * The test doesn’t involve actual futex operations, just a dummy wake/wait
>>> on a variable that isn’t accessed by other threads, so the results might
>>> not be very significant.
>>>
>>> Given that this seems to be a minor increase in code path length rather
>>> than a scalability issue, would this be considered a genuine regression?
>>
>> Likely not; I've seen these kinds of regressions (for example, in my fork
>> micro-benchmarks) simply because the compiler slightly changes the code
>> layout or suddenly decides not to inline a function.
>>
>> Still, it is rather unexpected, so let's find out what's happening.
>>
>> My first intuition would have been that the compiler now decides not to
>> inline gup_fast_folio_allowed() anymore, adding a function call.
>>
>> LLVM seems to inline it for me; GCC does not.
>>
>> Would this restore the original behavior for you?
>
> David, thank you for the quick patch to try.
>
> This patch fixed the original regression on v6.10-rc1, but on current mainline
> v6.11-rc6 the patch makes no difference and the performance is as expected.
Just so I understand this correctly:
It fixed itself after v6.11-rc4, but v6.11-rc4 was fixed with my patch?
If that's the case, then it's really the compiler deciding whether to
inline or not, and on v6.11-rc6 it decides to inline again.
--
Cheers,
David / dhildenb
* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
From: Anders Roxell @ 2024-09-04 15:51 UTC
To: David Hildenbrand
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM
On Wed, 4 Sept 2024 at 15:47, David Hildenbrand <david@redhat.com> wrote:
>
> On 04.09.24 12:05, Anders Roxell wrote:
> > On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 03.09.24 14:21, Anders Roxell wrote:
> >>> Hi,
> >>>
> >>> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> >>> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> >>> slowdown continues with v6.11-rc4. I am focused on identifying any
> >>> performance regressions greater than 2% that occur in automated
> >>> testing on arm64 HW.
> >>>
> >>> Using git bisect, I traced the issue to commit
> >>> f002882ca369 ("mm: merge folio_is_secretmem() and
> >>> folio_fast_pin_allowed() into gup_fast_folio_allowed()").
> >>
> >> Thanks for analyzing the (slight) regression!
> >>
> >>>
> >>> My tests were performed on m7g.large and m7g.metal instances:
> >>>
> >>> * The slowdown is consistent regardless of the number of threads;
> >>> futex1-threads-128 performs similarly to futex1-threads-2, indicating
> >>> there is no scalability issue, just a minor performance overhead.
> >>> * The test doesn’t involve actual futex operations, just a dummy wake/wait
> >>> on a variable that isn’t accessed by other threads, so the results might
> >>> not be very significant.
> >>>
> >>> Given that this seems to be a minor increase in code path length rather
> >>> than a scalability issue, would this be considered a genuine regression?
> >>
> >> Likely not; I've seen these kinds of regressions (for example, in my fork
> >> micro-benchmarks) simply because the compiler slightly changes the code
> >> layout or suddenly decides not to inline a function.
> >>
> >> Still, it is rather unexpected, so let's find out what's happening.
> >>
> >> My first intuition would have been that the compiler now decides not to
> >> inline gup_fast_folio_allowed() anymore, adding a function call.
> >>
> >> LLVM seems to inline it for me; GCC does not.
> >>
> >> Would this restore the original behavior for you?
> >
> > David, thank you for the quick patch to try.
> >
> > This patch fixed the original regression on v6.10-rc1, but on current mainline
> > v6.11-rc6 the patch makes no difference and the performance is as expected.
>
> Just so I understand this correctly:
>
> It fixed itself after v6.11-rc4, but v6.11-rc4 was fixed with my patch?
I had to double-check, and no: on v6.11-rc4 I see the 2% regression with or
without your patch.
Cheers,
Anders
>
> If that's the case, then it's really the compiler deciding whether to
> inline or not, and on v6.11-rc6 it decides to inline again.
>
> --
> Cheers,
>
> David / dhildenb
>