* Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
@ 2024-09-03 12:21 Anders Roxell
2024-09-03 12:37 ` David Hildenbrand
0 siblings, 1 reply; 5+ messages in thread
From: Anders Roxell @ 2024-09-03 12:21 UTC (permalink / raw)
To: David Hildenbrand
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM
Hi,
I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
are running about 2% slower on v6.10-rc1 compared to v6.9, and this
slowdown continues with v6.11-rc4. I am focused on identifying any
performance regressions greater than 2% that occur in automated
testing on arm64 HW.
Using git bisect, I traced the issue to commit
f002882ca369 ("mm: merge folio_is_secretmem() and
folio_fast_pin_allowed() into gup_fast_folio_allowed()").
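For context, the commit named above folds two separate GUP-fast checks
(the secretmem test and the long-term-pin test for file-backed folios)
into one helper. The sketch below is a toy, compilable model of the
decision that helper makes; the struct fields and the flag value are
invented for illustration and do not match the real kernel code in
mm/gup.c:

#include <stdbool.h>

/* Toy model of gup_fast_folio_allowed(): fast-path pinning is refused
 * whenever falling back to the slow path is the safe choice. */
struct folio {
	bool secretmem;		/* backed by memfd_secret? */
	bool file_backed;	/* belongs to a file mapping? */
};

#define FOLL_LONGTERM 0x100	/* illustrative value only */

static bool gup_fast_folio_allowed(const struct folio *folio,
				   unsigned int flags)
{
	/* secretmem folios may never be pinned on the fast path */
	if (folio->secretmem)
		return false;
	/* long-term pins of file-backed folios need the slow path */
	if ((flags & FOLL_LONGTERM) && folio->file_backed)
		return false;
	return true;
}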
My tests were performed on m7g.large and m7g.metal instances:
* The slowdown is consistent regardless of the number of threads;
futex1-threads-128 performs similarly to futex1-threads-2, indicating
there is no scalability issue, just a minor performance overhead.
* The test doesn’t involve actual futex operations, just dummy wake/wait
on a variable that isn’t accessed by other threads, so the results might
not be very significant (a rough model of this loop is sketched below).
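The hot loop of such a test can be modeled roughly as follows. This is
an illustrative sketch, not the actual will-it-scale source: the futex
word never holds the expected value, so FUTEX_WAIT returns immediately
with EAGAIN. The connection to GUP is that a shared (non-private) futex
wait resolves its key via get_user_pages_fast(), which is presumably how
a change in gup_fast_folio_allowed() shows up in a futex microbenchmark.

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal model of the benchmark's inner loop. The futex word stays 0
 * while we wait for the value 1, so the kernel only resolves the futex
 * key (transiently pinning the page via GUP-fast) and returns -EAGAIN
 * at once. */
static uint32_t futex_word;

static void benchmark_loop(unsigned long long *iterations)
{
	for (;;) {
		/* shared wait: no FUTEX_PRIVATE_FLAG, so the key lookup
		 * goes through the GUP-fast path */
		syscall(SYS_futex, &futex_word, FUTEX_WAIT, 1, NULL, NULL, 0);
		(*iterations)++;
	}
}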
Given that this seems to be a minor increase in code path length rather
than a scalability issue, would this be considered a genuine regression?
Cheers,
Anders
* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
  2024-09-03 12:21 Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4 Anders Roxell
@ 2024-09-03 12:37 ` David Hildenbrand
  2024-09-04 10:05   ` Anders Roxell
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2024-09-03 12:37 UTC (permalink / raw)
To: Anders Roxell
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
    dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

On 03.09.24 14:21, Anders Roxell wrote:
> Hi,
>
> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> slowdown continues with v6.11-rc4. I am focused on identifying any
> performance regressions greater than 2% that occur in automated
> testing on arm64 HW.
>
> Using git bisect, I traced the issue to commit
> f002882ca369 ("mm: merge folio_is_secretmem() and
> folio_fast_pin_allowed() into gup_fast_folio_allowed()").

Thanks for analyzing the (slight) regression!

>
> My tests were performed on m7g.large and m7g.metal instances:
>
> * The slowdown is consistent regardless of the number of threads;
>   futex1-threads-128 performs similarly to futex1-threads-2, indicating
>   there is no scalability issue, just a minor performance overhead.
> * The test doesn’t involve actual futex operations, just dummy wake/wait
>   on a variable that isn’t accessed by other threads, so the results might
>   not be very significant.
>
> Given that this seems to be a minor increase in code path length rather
> than a scalability issue, would this be considered a genuine regression?

Likely not, I've seen these kinds of regressions (for example in my fork
micro-benchmarks) simply because the compiler slightly changes the code
layout, or suddenly decides to not inline a function.

Still, it is rather unexpected, so let's find out what's happening.

My first intuition would have been that the compiler now decides to not
inline gup_fast_folio_allowed() anymore, adding a function call.

LLVM seems to inline it for me. GCC does not.

Would this return the original behavior for you?

diff --git a/mm/gup.c b/mm/gup.c
index 69c483e2cc32d..6642f09c95881 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2726,7 +2726,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * in the fast path, so instead we whitelist known good cases and if in doubt,
  * fall back to the slow path.
  */
-static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
+static __always_inline bool gup_fast_folio_allowed(struct folio *folio,
+						   unsigned int flags)
 {
 	bool reject_file_backed = false;
 	struct address_space *mapping;

--
Cheers,

David / dhildenb
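For reference, the __always_inline used in the patch differs from plain
"inline", which modern compilers treat only as a hint subject to their
inlining cost model. A simplified sketch of the kernel's definition (the
real one lives in include/linux/compiler_types.h):

/* Simplified from include/linux/compiler_types.h: plain inline leaves
 * the decision to the optimizer, while __always_inline adds the GCC and
 * Clang attribute that forces the function to be inlined regardless of
 * the cost model. */
#define __always_inline inline __attribute__((__always_inline__))

This is why the patch is expected to restore the v6.9 code layout even on
compilers and configs where the inlining heuristic flips the other way.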
* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
  2024-09-03 12:37 ` David Hildenbrand
@ 2024-09-04 10:05   ` Anders Roxell
  2024-09-04 13:47     ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Anders Roxell @ 2024-09-04 10:05 UTC (permalink / raw)
To: David Hildenbrand
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
    dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
>
> On 03.09.24 14:21, Anders Roxell wrote:
> > Hi,
> >
> > I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> > are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> > slowdown continues with v6.11-rc4. I am focused on identifying any
> > performance regressions greater than 2% that occur in automated
> > testing on arm64 HW.
> >
> > Using git bisect, I traced the issue to commit
> > f002882ca369 ("mm: merge folio_is_secretmem() and
> > folio_fast_pin_allowed() into gup_fast_folio_allowed()").
>
> Thanks for analyzing the (slight) regression!
>
> >
> > My tests were performed on m7g.large and m7g.metal instances:
> >
> > * The slowdown is consistent regardless of the number of threads;
> >   futex1-threads-128 performs similarly to futex1-threads-2, indicating
> >   there is no scalability issue, just a minor performance overhead.
> > * The test doesn’t involve actual futex operations, just dummy wake/wait
> >   on a variable that isn’t accessed by other threads, so the results might
> >   not be very significant.
> >
> > Given that this seems to be a minor increase in code path length rather
> > than a scalability issue, would this be considered a genuine regression?
>
> Likely not, I've seen these kinds of regressions (for example in my fork
> micro-benchmarks) simply because the compiler slightly changes the code
> layout, or suddenly decides to not inline a function.
>
> Still, it is rather unexpected, so let's find out what's happening.
>
> My first intuition would have been that the compiler now decides to not
> inline gup_fast_folio_allowed() anymore, adding a function call.
>
> LLVM seems to inline it for me. GCC does not.
>
> Would this return the original behavior for you?

David, thank you for the quick patch to try.

This patch helped with the original regression on v6.10-rc1, but on current
mainline v6.11-rc6 the patch makes no difference and the performance is as
expected.

Cheers,
Anders

> diff --git a/mm/gup.c b/mm/gup.c
> index 69c483e2cc32d..6642f09c95881 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2726,7 +2726,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>   * in the fast path, so instead we whitelist known good cases and if in doubt,
>   * fall back to the slow path.
>   */
> -static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
> +static __always_inline bool gup_fast_folio_allowed(struct folio *folio,
> +						   unsigned int flags)
>  {
>  	bool reject_file_backed = false;
>  	struct address_space *mapping;
>
> --
> Cheers,
>
> David / dhildenb
* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
  2024-09-04 10:05   ` Anders Roxell
@ 2024-09-04 13:47     ` David Hildenbrand
  2024-09-04 15:51       ` Anders Roxell
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2024-09-04 13:47 UTC (permalink / raw)
To: Anders Roxell
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
    dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

On 04.09.24 12:05, Anders Roxell wrote:
> On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.09.24 14:21, Anders Roxell wrote:
>>> Hi,
>>>
>>> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
>>> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
>>> slowdown continues with v6.11-rc4. I am focused on identifying any
>>> performance regressions greater than 2% that occur in automated
>>> testing on arm64 HW.
>>>
>>> Using git bisect, I traced the issue to commit
>>> f002882ca369 ("mm: merge folio_is_secretmem() and
>>> folio_fast_pin_allowed() into gup_fast_folio_allowed()").
>>
>> Thanks for analyzing the (slight) regression!
>>
>>>
>>> My tests were performed on m7g.large and m7g.metal instances:
>>>
>>> * The slowdown is consistent regardless of the number of threads;
>>>   futex1-threads-128 performs similarly to futex1-threads-2, indicating
>>>   there is no scalability issue, just a minor performance overhead.
>>> * The test doesn’t involve actual futex operations, just dummy wake/wait
>>>   on a variable that isn’t accessed by other threads, so the results might
>>>   not be very significant.
>>>
>>> Given that this seems to be a minor increase in code path length rather
>>> than a scalability issue, would this be considered a genuine regression?
>>
>> Likely not, I've seen these kinds of regressions (for example in my fork
>> micro-benchmarks) simply because the compiler slightly changes the code
>> layout, or suddenly decides to not inline a function.
>>
>> Still, it is rather unexpected, so let's find out what's happening.
>>
>> My first intuition would have been that the compiler now decides to not
>> inline gup_fast_folio_allowed() anymore, adding a function call.
>>
>> LLVM seems to inline it for me. GCC does not.
>>
>> Would this return the original behavior for you?
>
> David, thank you for the quick patch to try.
>
> This patch helped with the original regression on v6.10-rc1, but on current
> mainline v6.11-rc6 the patch makes no difference and the performance is as
> expected.

Just so I understand this correctly:

It fixed itself after v6.11-rc4, but v6.11-rc4 was fixed with my patch?

If that's the case, then it's really the compiler deciding whether to
inline or not, and on v6.11-rc6 it decides to inline again.

--
Cheers,

David / dhildenb
* Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4
  2024-09-04 13:47     ` David Hildenbrand
@ 2024-09-04 15:51       ` Anders Roxell
  0 siblings, 0 replies; 5+ messages in thread
From: Anders Roxell @ 2024-09-04 15:51 UTC (permalink / raw)
To: David Hildenbrand
Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
    dvhart, dave, andrealmeid, Linux Kernel Mailing List, Linux-MM

On Wed, 4 Sept 2024 at 15:47, David Hildenbrand <david@redhat.com> wrote:
>
> On 04.09.24 12:05, Anders Roxell wrote:
> > On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 03.09.24 14:21, Anders Roxell wrote:
> >>> Hi,
> >>>
> >>> I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
> >>> are running about 2% slower on v6.10-rc1 compared to v6.9, and this
> >>> slowdown continues with v6.11-rc4. I am focused on identifying any
> >>> performance regressions greater than 2% that occur in automated
> >>> testing on arm64 HW.
> >>>
> >>> Using git bisect, I traced the issue to commit
> >>> f002882ca369 ("mm: merge folio_is_secretmem() and
> >>> folio_fast_pin_allowed() into gup_fast_folio_allowed()").
> >>
> >> Thanks for analyzing the (slight) regression!
> >>
> >>>
> >>> My tests were performed on m7g.large and m7g.metal instances:
> >>>
> >>> * The slowdown is consistent regardless of the number of threads;
> >>>   futex1-threads-128 performs similarly to futex1-threads-2, indicating
> >>>   there is no scalability issue, just a minor performance overhead.
> >>> * The test doesn’t involve actual futex operations, just dummy wake/wait
> >>>   on a variable that isn’t accessed by other threads, so the results might
> >>>   not be very significant.
> >>>
> >>> Given that this seems to be a minor increase in code path length rather
> >>> than a scalability issue, would this be considered a genuine regression?
> >>
> >> Likely not, I've seen these kinds of regressions (for example in my fork
> >> micro-benchmarks) simply because the compiler slightly changes the code
> >> layout, or suddenly decides to not inline a function.
> >>
> >> Still, it is rather unexpected, so let's find out what's happening.
> >>
> >> My first intuition would have been that the compiler now decides to not
> >> inline gup_fast_folio_allowed() anymore, adding a function call.
> >>
> >> LLVM seems to inline it for me. GCC does not.
> >>
> >> Would this return the original behavior for you?
> >
> > David, thank you for the quick patch to try.
> >
> > This patch helped with the original regression on v6.10-rc1, but on current
> > mainline v6.11-rc6 the patch makes no difference and the performance is as
> > expected.
>
> Just so I understand this correctly:
>
> It fixed itself after v6.11-rc4, but v6.11-rc4 was fixed with my patch?

I had to double-check, and no: on v6.11-rc4 I see the 2% regression with
or without your patch.

Cheers,
Anders

>
> If that's the case, then it's really the compiler deciding whether to
> inline or not, and on v6.11-rc6 it decides to inline again.
>
> --
> Cheers,
>
> David / dhildenb
End of thread.

Thread overview: 5 messages
2024-09-03 12:21 Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4 Anders Roxell
2024-09-03 12:37 ` David Hildenbrand
2024-09-04 10:05   ` Anders Roxell
2024-09-04 13:47     ` David Hildenbrand
2024-09-04 15:51       ` Anders Roxell