linux-mm.kvack.org archive mirror
* darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
@ 2024-10-24  7:45 Thorsten Leemhuis
  2024-10-24  9:58 ` Vlastimil Babka
  2024-10-24 15:12 ` [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes Vlastimil Babka
  0 siblings, 2 replies; 16+ messages in thread
From: Thorsten Leemhuis @ 2024-10-24  7:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Matthias, Andrew Morton, Linux kernel regressions list, LKML,
	Linux-MM, Yang Shi

Hi, Thorsten here, the Linux kernel's regression tracker.

Rik, I noticed a report about a regression in bugzilla.kernel.org that
appears to be caused by the following change of yours:

efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
[v6.7]

It might be one of those "some things got faster, a few things became
slower" situations. Not sure. Felt odd that the reporter was able to
reproduce it on two AMD systems, but not on an Intel system. Maybe there
is a bug somewhere else that was exposed by this.

So in the end it felt worth forwarding by mail to me. Not tracking this
yet, first waiting for feedback.

To quote from https://bugzilla.kernel.org/show_bug.cgi?id=219366 :

> Matthias 2024-10-09 05:37:51 UTC
> 
> I am using a darktable benchmark and I am finding that RAW-to-JPG
> conversion is about 15-25 % slower with kernels 6.7-6.10. The last
> fast kernel series is 6.6. I also tested kernel series 6.5 and it is
> as fast as 6.6.
> 
> I know this sounds weird. What has darktable to do with the kernel?
> But the numbers are true. And the darktable devs tell me that this
> is a kernel regression. The darktable github issue is: https://
> github.com/darktable-org/darktable/issues/17397  You can find more
> details there.
> 
> What do I do to measure the performance?
> 
> I am executing darktable on the command line. opencl is disabled so
> that all activities are only on the CPU:
> 
> darktable-cli bench.SRW /tmp/test.jpg --core --disable-opencl -d
> perf -d opencl --configdir /tmp
> 
> ( bench.SRW and the sidecar file can be found here: https://
> drive.google.com/drive/folders/1cfV2b893JuobVwGiZXcaNv5-yszH6j-N )
> 
> This will show some debug output. The line to look for is
> 
> 4,2765 [dev_process_export] pixel pipeline processing took 3,811
> secs (81,883 CPU)
> 
> This gives an exact number how much time darktable needed to convert
> the image. The time darktable needs has a clear dependency on the
> kernel version. It is fast with kernel 6.6 and older and slow with
> kernel 6.7 and newer. Something must have happened from 6.6 to 6.7
> which slows down darktable.
> 
> The darktable debug output shows that basically only one module is
> responsible for the slow down: 'atrous'
> 
> with kernel 6.6.47:
> 
> 4,0548 [dev_pixelpipe] took 0,635 secs (14,597 CPU) [export]
> processed 'atrous' on CPU, blended on CPU ... 4,2765
> [dev_process_export] pixel pipeline processing took 3,811 secs
> (81,883 CPU)
> 
> with kernel 6.10.6:
> 
> 4,9645 [dev_pixelpipe] took 1,489 secs (33,736 CPU) [export]
> processed 'atrous' on CPU, blended on CPU ... 5,2151
> [dev_process_export] pixel pipeline processing took 4,773 secs
> (102,452 CPU)
> 
> 
> This is also being discussed here: https://discuss.pixls.us/t/
> darktable-performance-regression-with-kernel-6-7-and-newer/45945/1 
> And other users confirm the performance degradation.

[...]

> This seems to affect AMD only. I reproduced this performance
> degradation on two different Ryzen Desktop PCs (Ryzen 5 and Ryzen
> 9). But I can not reproduce it on my Intel PC (Lenovo X1 Carbon,
> core i5).

[...]

> By the way, there is also a thread in the darktable forum on this topic:
> https://discuss.pixls.us/t/darktable-performance-regression-with-kernel-6-7-and-newer/45945
>  
> Some users reproduced it there as well.

See the ticket for more details. The reporter is CCed. openZFS is in
use, but the problem was reproduced on vanilla kernels.

Ciao, Thorsten


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
  2024-10-24  7:45 darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries" Thorsten Leemhuis
@ 2024-10-24  9:58 ` Vlastimil Babka
  2024-10-24 10:23   ` Vlastimil Babka
  2024-10-24 15:12 ` [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes Vlastimil Babka
  1 sibling, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2024-10-24  9:58 UTC (permalink / raw)
  To: Thorsten Leemhuis, Rik van Riel
  Cc: Matthias, Andrew Morton, Linux kernel regressions list, LKML,
	Linux-MM, Yang Shi, Petr Tesarik

On 10/24/24 09:45, Thorsten Leemhuis wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
> 
> Rik, I noticed a report about a regression in bugzilla.kernel.org that
> appears to be caused by the following change of yours:
> 
> efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
> [v6.7]
> 
> It might be one of those "some things got faster, a few things became
> slower" situations. Not sure. Felt odd that the reporter was able to
> reproduce it on two AMD systems, but not on an Intel system. Maybe there
> is a bug somewhere else that was exposed by this.

It seems very similar to what we've seen with spec benchmarks such as cactus
and bisected to the same commit:

https://bugzilla.suse.com/show_bug.cgi?id=1229012

The exact regression varies per system. Intel regresses too but relatively
less. The theory is that there are many large-ish allocations that don't
have individual sizes aligned to 2MB and would previously have been
merged; commit efa7df3e3bb5da causes them to become separate areas, each
aligning its start to a 2MB boundary, with gaps in between. This (gaps
and vma fragmentation) itself is not great, but most of the problem
seemed to come from the start alignment, which together with the access
pattern causes more TLB or cache misses due to limited associativity.

So maybe darktable has a similar problem. A simple candidate fix could
change commit efa7df3e3bb5da so that the mapping size has to be a multiple
of THP size (2MB) in order to become aligned, right now it's enough if it's
THP sized or larger.

> So in the end it felt worth forwarding by mail to me. Not tracking this
> yet, first waiting for feedback.
> 
> To quote from https://bugzilla.kernel.org/show_bug.cgi?id=219366 :
> 
>> Matthias 2024-10-09 05:37:51 UTC
>> 
>> I am using a darktable benchmark and I am finding that RAW-to-JPG
>> conversion is about 15-25 % slower with kernels 6.7-6.10. The last
>> fast kernel series is 6.6. I also tested kernel series 6.5 and it is
>> as fast as 6.6.
>> 
>> I know this sounds weird. What has darktable to do with the kernel?
>> But the numbers are true. And the darktable devs tell me that this
>> is a kernel regression. The darktable github issue is: https://
>> github.com/darktable-org/darktable/issues/17397  You can find more
>> details there.
>> 
>> What do I do to measure the performance?
>> 
>> I am executing darktable on the command line. opencl is disabled so
>> that all activities are only on the CPU:
>> 
>> darktable-cli bench.SRW /tmp/test.jpg --core --disable-opencl -d
>> perf -d opencl --configdir /tmp
>> 
>> ( bench.SRW and the sidecar file can be found here: https://
>> drive.google.com/drive/folders/1cfV2b893JuobVwGiZXcaNv5-yszH6j-N )
>> 
>> This will show some debug output. The line to look for is
>> 
>> 4,2765 [dev_process_export] pixel pipeline processing took 3,811
>> secs (81,883 CPU)
>> 
>> This gives an exact number how much time darktable needed to convert
>> the image. The time darktable needs has a clear dependency on the
>> kernel version. It is fast with kernel 6.6 and older and slow with
>> kernel 6.7 and newer. Something must have happened from 6.6 to 6.7
>> which slows down darktable.
>> 
>> The darktable debug output shows that basically only one module is
>> responsible for the slow down: 'atrous'
>> 
>> with kernel 6.6.47:
>> 
>> 4,0548 [dev_pixelpipe] took 0,635 secs (14,597 CPU) [export]
>> processed 'atrous' on CPU, blended on CPU ... 4,2765
>> [dev_process_export] pixel pipeline processing took 3,811 secs
>> (81,883 CPU)
>> 
>> with kernel 6.10.6:
>> 
>> 4,9645 [dev_pixelpipe] took 1,489 secs (33,736 CPU) [export]
>> processed 'atrous' on CPU, blended on CPU ... 5,2151
>> [dev_process_export] pixel pipeline processing took 4,773 secs
>> (102,452 CPU)
>> 
>> 
>> This is also being discussed here: https://discuss.pixls.us/t/
>> darktable-performance-regression-with-kernel-6-7-and-newer/45945/1 
>> And other users confirm the performance degradation.
> 
> [...]
> 
>> This seems to affect AMD only. I reproduced this performance
>> degradation on two different Ryzen Desktop PCs (Ryzen 5 and Ryzen
>> 9). But I can not reproduce it on my Intel PC (Lenovo X1 Carbon,
>> core i5).
> 
> [...]
> 
>> By the way, there is also a thread in the darktable forum on this topic:
>> https://discuss.pixls.us/t/darktable-performance-regression-with-kernel-6-7-and-newer/45945
>>  
>> Some users reproduced it there as well.
> 
> See the ticket for more details. The reporter is CCed. openZFS is in
> use, but the problem was reproduced on vanilla kernels.
> 
> Ciao, Thorsten
> 




* Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
  2024-10-24  9:58 ` Vlastimil Babka
@ 2024-10-24 10:23   ` Vlastimil Babka
  2024-10-24 10:49     ` Petr Tesarik
  2024-10-24 11:20     ` Matthias Bodenbinder
  0 siblings, 2 replies; 16+ messages in thread
From: Vlastimil Babka @ 2024-10-24 10:23 UTC (permalink / raw)
  To: Thorsten Leemhuis, Rik van Riel
  Cc: Matthias, Andrew Morton, Linux kernel regressions list, LKML,
	Linux-MM, Yang Shi, Petr Tesarik

On 10/24/24 11:58, Vlastimil Babka wrote:
> On 10/24/24 09:45, Thorsten Leemhuis wrote:
>> Hi, Thorsten here, the Linux kernel's regression tracker.
>> 
>> Rik, I noticed a report about a regression in bugzilla.kernel.org that
>> appears to be caused by the following change of yours:
>> 
>> efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
>> [v6.7]
>> 
>> It might be one of those "some things got faster, a few things became
>> slower" situations. Not sure. Felt odd that the reporter was able to
>> reproduce it on two AMD systems, but not on an Intel system. Maybe there
>> is a bug somewhere else that was exposed by this.
> 
> It seems very similar to what we've seen with spec benchmarks such as cactus
> and bisected to the same commit:
> 
> https://bugzilla.suse.com/show_bug.cgi?id=1229012
> 
> The exact regression varies per system. Intel regresses too but relatively
> less. The theory is that there are many large-ish allocations that don't
> have individual sizes aligned to 2MB and would previously have been
> merged; commit efa7df3e3bb5da causes them to become separate areas, each
> aligning its start to a 2MB boundary, with gaps in between. This (gaps
> and vma fragmentation) itself is not great, but most of the problem
> seemed to come from the start alignment, which together with the access
> pattern causes more TLB or cache misses due to limited associativity.
>
> So maybe darktable has a similar problem. A simple candidate fix could
> change commit efa7df3e3bb5da so that the mapping size has to be a multiple
> of THP size (2MB) in order to become aligned; right now it's enough if it's
> THP sized or larger.

Maybe this could be enough to fix the issue? (on 6.12-rc4)

diff --git a/mm/mmap.c b/mm/mmap.c
index 9c0fb43064b5..a5297cfb1dfc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 	if (get_area) {
 		addr = get_area(file, addr, len, pgoff, flags);
-	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+		   && IS_ALIGNED(len, PMD_SIZE)) {
 		/* Ensures that larger anonymous mappings are THP aligned. */
 		addr = thp_get_unmapped_area_vmflags(file, addr, len,
 						     pgoff, flags, vm_flags);




* Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
  2024-10-24 10:23   ` Vlastimil Babka
@ 2024-10-24 10:49     ` Petr Tesarik
  2024-10-24 10:56       ` Vlastimil Babka
  2024-10-24 11:20     ` Matthias Bodenbinder
  1 sibling, 1 reply; 16+ messages in thread
From: Petr Tesarik @ 2024-10-24 10:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Thorsten Leemhuis, Rik van Riel, Matthias, Andrew Morton,
	Linux kernel regressions list, LKML, Linux-MM, Yang Shi

On Thu, 24 Oct 2024 12:23:48 +0200
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 10/24/24 11:58, Vlastimil Babka wrote:
> > On 10/24/24 09:45, Thorsten Leemhuis wrote:  
> >> Hi, Thorsten here, the Linux kernel's regression tracker.
> >> 
> >> Rik, I noticed a report about a regression in bugzilla.kernel.org that
> >> appears to be caused by the following change of yours:
> >> 
> >> efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
> >> [v6.7]
> >> 
> >> It might be one of those "some things got faster, a few things became
> >> slower" situations. Not sure. Felt odd that the reporter was able to
> >> reproduce it on two AMD systems, but not on an Intel system. Maybe there
> >> is a bug somewhere else that was exposed by this.  
> > 
> > It seems very similar to what we've seen with spec benchmarks such as cactus
> > and bisected to the same commit:
> > 
> > https://bugzilla.suse.com/show_bug.cgi?id=1229012
> > 
> > The exact regression varies per system. Intel regresses too but relatively
> > less. The theory is that there are many large-ish allocations that don't
> > have individual sizes aligned to 2MB and would previously have been
> > merged; commit efa7df3e3bb5da causes them to become separate areas, each
> > aligning its start to a 2MB boundary, with gaps in between. This (gaps
> > and vma fragmentation) itself is not great, but most of the problem
> > seemed to come from the start alignment, which together with the access
> > pattern causes more TLB or cache misses due to limited associativity.
> >
> > So maybe darktable has a similar problem. A simple candidate fix could
> > change commit efa7df3e3bb5da so that the mapping size has to be a multiple
> > of THP size (2MB) in order to become aligned; right now it's enough if it's
> > THP sized or larger.
> 
> Maybe this could be enough to fix the issue? (on 6.12-rc4)


Yes, this should work. I was unsure if thp_get_unmapped_area_vmflags()
differs in other ways from mm_get_unmapped_area_vmflags(), which might
still be relevant. I mean, does mm_get_unmapped_area_vmflags() also
prefer to allocate THPs if the virtual memory block is large enough?

Petr T

> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9c0fb43064b5..a5297cfb1dfc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>  
>  	if (get_area) {
>  		addr = get_area(file, addr, len, pgoff, flags);
> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
> +		   && IS_ALIGNED(len, PMD_SIZE)) {
>  		/* Ensures that larger anonymous mappings are THP aligned. */
>  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>  						     pgoff, flags, vm_flags);
> 




* Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
  2024-10-24 10:49     ` Petr Tesarik
@ 2024-10-24 10:56       ` Vlastimil Babka
  2024-10-24 11:13         ` Petr Tesarik
  0 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2024-10-24 10:56 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Thorsten Leemhuis, Rik van Riel, Matthias, Andrew Morton,
	Linux kernel regressions list, LKML, Linux-MM, Yang Shi

On 10/24/24 12:49, Petr Tesarik wrote:
> On Thu, 24 Oct 2024 12:23:48 +0200
> Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> On 10/24/24 11:58, Vlastimil Babka wrote:
>> > On 10/24/24 09:45, Thorsten Leemhuis wrote:  
>> >> Hi, Thorsten here, the Linux kernel's regression tracker.
>> >> 
>> >> Rik, I noticed a report about a regression in bugzilla.kernel.org that
>> >> appears to be caused by the following change of yours:
>> >> 
>> >> efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
>> >> [v6.7]
>> >> 
>> >> It might be one of those "some things got faster, a few things became
>> >> slower" situations. Not sure. Felt odd that the reporter was able to
>> >> reproduce it on two AMD systems, but not on an Intel system. Maybe there
>> >> is a bug somewhere else that was exposed by this.  
>> > 
>> > It seems very similar to what we've seen with spec benchmarks such as cactus
>> > and bisected to the same commit:
>> > 
>> > https://bugzilla.suse.com/show_bug.cgi?id=1229012
>> > 
>> > The exact regression varies per system. Intel regresses too but relatively
>> > less. The theory is that there are many large-ish allocations that don't
>> > have individual sizes aligned to 2MB and would previously have been
>> > merged; commit efa7df3e3bb5da causes them to become separate areas, each
>> > aligning its start to a 2MB boundary, with gaps in between. This (gaps
>> > and vma fragmentation) itself is not great, but most of the problem
>> > seemed to come from the start alignment, which together with the access
>> > pattern causes more TLB or cache misses due to limited associativity.
>> >
>> > So maybe darktable has a similar problem. A simple candidate fix could
>> > change commit efa7df3e3bb5da so that the mapping size has to be a multiple
>> > of THP size (2MB) in order to become aligned; right now it's enough if it's
>> > THP sized or larger.
>> 
>> Maybe this could be enough to fix the issue? (on 6.12-rc4)
> 
> 
> Yes, this should work. I was unsure if thp_get_unmapped_area_vmflags()
> differs in other ways from mm_get_unmapped_area_vmflags(), which might
> still be relevant. I mean, does mm_get_unmapped_area_vmflags() also
> prefer to allocate THPs if the virtual memory block is large enough?

Well any sufficiently large area spanning a PMD aligned/sized block (either
a result of a single allocation or merging of several allocations) can
become populated by THPs (at least in those aligned blocks), and the
preference depends on system-wide THP settings and madvise(MADV_HUGEPAGE) or
prctl.

But mm_get_unmapped_area_vmflags() will AFAIK not try to align the area to
PMD size like the thp_ version would, even if the request is large enough.

> Petr T
> 
>> 
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 9c0fb43064b5..a5297cfb1dfc 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>>  
>>  	if (get_area) {
>>  		addr = get_area(file, addr, len, pgoff, flags);
>> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
>> +		   && IS_ALIGNED(len, PMD_SIZE)) {
>>  		/* Ensures that larger anonymous mappings are THP aligned. */
>>  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>>  						     pgoff, flags, vm_flags);
>> 
> 




* Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
  2024-10-24 10:56       ` Vlastimil Babka
@ 2024-10-24 11:13         ` Petr Tesarik
  2024-10-24 13:29           ` Vlastimil Babka
  0 siblings, 1 reply; 16+ messages in thread
From: Petr Tesarik @ 2024-10-24 11:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Thorsten Leemhuis, Rik van Riel, Matthias, Andrew Morton,
	Linux kernel regressions list, LKML, Linux-MM, Yang Shi

On Thu, 24 Oct 2024 12:56:27 +0200
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 10/24/24 12:49, Petr Tesarik wrote:
> > On Thu, 24 Oct 2024 12:23:48 +0200
> > Vlastimil Babka <vbabka@suse.cz> wrote:
> >   
> >> On 10/24/24 11:58, Vlastimil Babka wrote:  
> >> > On 10/24/24 09:45, Thorsten Leemhuis wrote:    
> >> >> Hi, Thorsten here, the Linux kernel's regression tracker.
> >> >> 
> >> >> Rik, I noticed a report about a regression in bugzilla.kernel.org that
> >> >> appears to be caused by the following change of yours:
> >> >> 
> >> >> efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
> >> >> [v6.7]
> >> >> 
> >> >> It might be one of those "some things got faster, a few things became
> >> >> slower" situations. Not sure. Felt odd that the reporter was able to
> >> >> reproduce it on two AMD systems, but not on an Intel system. Maybe there
> >> >> is a bug somewhere else that was exposed by this.    
> >> > 
> >> > It seems very similar to what we've seen with spec benchmarks such as cactus
> >> > and bisected to the same commit:
> >> > 
> >> > https://bugzilla.suse.com/show_bug.cgi?id=1229012
> >> > 
> >> > The exact regression varies per system. Intel regresses too but relatively
> >> > less. The theory is that there are many large-ish allocations that don't
> >> > have individual sizes aligned to 2MB and would previously have been
> >> > merged; commit efa7df3e3bb5da causes them to become separate areas, each
> >> > aligning its start to a 2MB boundary, with gaps in between. This (gaps
> >> > and vma fragmentation) itself is not great, but most of the problem
> >> > seemed to come from the start alignment, which together with the access
> >> > pattern causes more TLB or cache misses due to limited associativity.
> >> >
> >> > So maybe darktable has a similar problem. A simple candidate fix could
> >> > change commit efa7df3e3bb5da so that the mapping size has to be a multiple
> >> > of THP size (2MB) in order to become aligned; right now it's enough if it's
> >> > THP sized or larger.
> >> 
> >> Maybe this could be enough to fix the issue? (on 6.12-rc4)  
> > 
> > 
> > Yes, this should work. I was unsure if thp_get_unmapped_area_vmflags()
> > differs in other ways from mm_get_unmapped_area_vmflags(), which might
> > still be relevant. I mean, does mm_get_unmapped_area_vmflags() also
> > prefer to allocate THPs if the virtual memory block is large enough?  
> 
> Well any sufficiently large area spanning a PMD aligned/sized block (either
> a result of a single allocation or merging of several allocations) can
> become populated by THPs (at least in those aligned blocks), and the
> preference depends on system-wide THP settings and madvise(MADV_HUGEPAGE) or
> prctl.
> 
> But mm_get_unmapped_area_vmflags() will AFAIK not try to align the area to
> PMD size like the thp_ version would, even if the request is large enough.

Then it sounds like exactly what we want. But I prefer to move the
check down to __thp_get_unmapped_area() like this:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2fb328880b50..8d80f3af5525 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1082,6 +1082,9 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
 	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
 		return 0;
 
+	if (!IS_ALIGNED(len, size))
+		return 0;
+
 	if (off_end <= off_align || (off_end - off_align) < size)
 		return 0;
 



* Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
  2024-10-24 10:23   ` Vlastimil Babka
  2024-10-24 10:49     ` Petr Tesarik
@ 2024-10-24 11:20     ` Matthias Bodenbinder
  1 sibling, 0 replies; 16+ messages in thread
From: Matthias Bodenbinder @ 2024-10-24 11:20 UTC (permalink / raw)
  To: Vlastimil Babka, Thorsten Leemhuis, Rik van Riel
  Cc: Andrew Morton, Linux kernel regressions list, LKML, Linux-MM,
	Yang Shi, Petr Tesarik

Am Donnerstag, dem 24.10.2024 um 12:23 +0200 schrieb Vlastimil Babka:
> On 10/24/24 11:58, Vlastimil Babka wrote:
> > On 10/24/24 09:45, Thorsten Leemhuis wrote:
> > > Hi, Thorsten here, the Linux kernel's regression tracker.
> > > 
> > > Rik, I noticed a report about a regression in bugzilla.kernel.org that
> > > appears to be caused by the following change of yours:
> > > 
> > > efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
> > > [v6.7]
> > > 
> > > It might be one of those "some things got faster, a few things became
> > > slower" situations. Not sure. Felt odd that the reporter was able to
> > > reproduce it on two AMD systems, but not on an Intel system. Maybe there
> > > is a bug somewhere else that was exposed by this.
> > 
> > It seems very similar to what we've seen with spec benchmarks such as cactus
> > and bisected to the same commit:
> > 
> > https://bugzilla.suse.com/show_bug.cgi?id=1229012
> > 
> > The exact regression varies per system. Intel regresses too but relatively
> > > less. The theory is that there are many large-ish allocations that don't
> > > have individual sizes aligned to 2MB and would previously have been
> > > merged; commit efa7df3e3bb5da causes them to become separate areas, each
> > > aligning its start to a 2MB boundary, with gaps in between. This (gaps
> > > and vma fragmentation) itself is not great, but most of the problem
> > > seemed to come from the start alignment, which together with the access
> > > pattern causes more TLB or cache misses due to limited associativity.
> > >
> > > So maybe darktable has a similar problem. A simple candidate fix could
> > > change commit efa7df3e3bb5da so that the mapping size has to be a multiple
> > > of THP size (2MB) in order to become aligned; right now it's enough if it's
> > > THP sized or larger.
> 
> Maybe this could be enough to fix the issue? (on 6.12-rc4)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9c0fb43064b5..a5297cfb1dfc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned
> long len,
>  
>  	if (get_area) {
>  		addr = get_area(file, addr, len, pgoff, flags);
> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
> +		   && IS_ALIGNED(len, PMD_SIZE)) {
>  		/* Ensures that larger anonymous mappings are THP aligned. */
>  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>  						     pgoff, flags, vm_flags);
> 


Hi, 
here is Matthias, the reporter of the darktable issue
https://bugzilla.kernel.org/show_bug.cgi?id=219366

I applied your patch to kernel 6.11.5 and it works. darktable pixel pipeline goes down to
3.8 s with this patch. Same performance as with kernel 6.6.x. It was 4.7 s without that
patch.





* Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
  2024-10-24 11:13         ` Petr Tesarik
@ 2024-10-24 13:29           ` Vlastimil Babka
  2024-10-24 14:14             ` Petr Tesarik
  0 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2024-10-24 13:29 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Thorsten Leemhuis, Rik van Riel, Matthias, Andrew Morton,
	Linux kernel regressions list, LKML, Linux-MM, Yang Shi,
	Matthew Wilcox

On 10/24/24 13:13, Petr Tesarik wrote:
> On Thu, 24 Oct 2024 12:56:27 +0200
> Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> On 10/24/24 12:49, Petr Tesarik wrote:
>> > On Thu, 24 Oct 2024 12:23:48 +0200
>> > Vlastimil Babka <vbabka@suse.cz> wrote:
>> >   
>> >> On 10/24/24 11:58, Vlastimil Babka wrote:  
>> >> > On 10/24/24 09:45, Thorsten Leemhuis wrote:    
>> >> >> Hi, Thorsten here, the Linux kernel's regression tracker.
>> >> >> 
>> >> >> Rik, I noticed a report about a regression in bugzilla.kernel.org that
>> >> >> appears to be caused by the following change of yours:
>> >> >> 
>> >> >> efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
>> >> >> [v6.7]
>> >> >> 
>> >> >> It might be one of those "some things got faster, a few things became
>> >> >> slower" situations. Not sure. Felt odd that the reporter was able to
>> >> >> reproduce it on two AMD systems, but not on an Intel system. Maybe there
>> >> >> is a bug somewhere else that was exposed by this.    
>> >> > 
>> >> > It seems very similar to what we've seen with spec benchmarks such as cactus
>> >> > and bisected to the same commit:
>> >> > 
>> >> > https://bugzilla.suse.com/show_bug.cgi?id=1229012
>> >> > 
>> >> > The exact regression varies per system. Intel regresses too but relatively
>> >> > less. The theory is that there are many large-ish allocations that don't
>> >> > have individual sizes aligned to 2MB and would previously have been
>> >> > merged; commit efa7df3e3bb5da causes them to become separate areas, each
>> >> > aligning its start to a 2MB boundary, with gaps in between. This (gaps
>> >> > and vma fragmentation) itself is not great, but most of the problem
>> >> > seemed to come from the start alignment, which together with the access
>> >> > pattern causes more TLB or cache misses due to limited associativity.
>> >> >
>> >> > So maybe darktable has a similar problem. A simple candidate fix could
>> >> > change commit efa7df3e3bb5da so that the mapping size has to be a multiple
>> >> > of THP size (2MB) in order to become aligned; right now it's enough if it's
>> >> > THP sized or larger.
>> >> 
>> >> Maybe this could be enough to fix the issue? (on 6.12-rc4)  
>> > 
>> > 
>> > Yes, this should work. I was unsure if thp_get_unmapped_area_vmflags()
>> > differs in other ways from mm_get_unmapped_area_vmflags(), which might
>> > still be relevant. I mean, does mm_get_unmapped_area_vmflags() also
>> > prefer to allocate THPs if the virtual memory block is large enough?  
>> 
>> Well any sufficiently large area spanning a PMD aligned/sized block (either
>> a result of a single allocation or merging of several allocations) can
>> become populated by THPs (at least in those aligned blocks), and the
>> preference depends on system-wide THP settings and madvise(MADV_HUGEPAGE) or
>> prctl.
>> 
>> But mm_get_unmapped_area_vmflags() will AFAIK not try to align the area to
>> PMD size like the thp_ version would, even if the request is large enough.
> 
> Then it sounds like exactly what we want. But I prefer to move the
> check down to __thp_get_unmapped_area() like this:

I wanted to limit the fix to the place commit efa7df3e3bb5da changes, i.e.
anonymous mappings, because there are other callers of
__thp_get_unmapped_area(), namely the filesystems via
thp_get_unmapped_area() and I wasn't sure if that wouldn't regress them. But
since you suggested it, I had a brief look now...

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2fb328880b50..8d80f3af5525 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1082,6 +1082,9 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>  	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
>  		return 0;
>  
> +	if (!IS_ALIGNED(len, size))
> +		return 0;

I think the filesystem might be asked to create a mapping for e.g. a
[1MB, 4MB] range from a file, thus the offset would be 1MB (with anonymous
pages an off=0 is passed) and the current implementation would try to do the
right thing for that (align the [2MB, 4MB] range to THP) but after your
patch it would see len is 3MB and give up, no?

> +
>  	if (off_end <= off_align || (off_end - off_align) < size)
>  		return 0;
>  




* Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"
  2024-10-24 13:29           ` Vlastimil Babka
@ 2024-10-24 14:14             ` Petr Tesarik
  0 siblings, 0 replies; 16+ messages in thread
From: Petr Tesarik @ 2024-10-24 14:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Thorsten Leemhuis, Rik van Riel, Matthias, Andrew Morton,
	Linux kernel regressions list, LKML, Linux-MM, Yang Shi,
	Matthew Wilcox

On Thu, 24 Oct 2024 15:29:35 +0200
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 10/24/24 13:13, Petr Tesarik wrote:
> > On Thu, 24 Oct 2024 12:56:27 +0200
> > Vlastimil Babka <vbabka@suse.cz> wrote:
> >   
> >> On 10/24/24 12:49, Petr Tesarik wrote:  
> >> > On Thu, 24 Oct 2024 12:23:48 +0200
> >> > Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >     
> >> >> On 10/24/24 11:58, Vlastimil Babka wrote:    
> >> >> > On 10/24/24 09:45, Thorsten Leemhuis wrote:      
> >> >> >> Hi, Thorsten here, the Linux kernel's regression tracker.
> >> >> >> 
> >> >> >> Rik, I noticed a report about a regression in bugzilla.kernel.org that
> >> >> >> appears to be caused by the following change of yours:
> >> >> >> 
> >> >> >> efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
> >> >> >> [v6.7]
> >> >> >> 
> >> >> >> It might be one of those "some things got faster, a few things became
> >> >> >> slower" situations. Not sure. Felt odd that the reporter was able to
> >> >> >> reproduce it on two AMD systems, but not on a Intel system. Maybe there
> >> >> >> is a bug somewhere else that was exposed by this.      
> >> >> > 
> >> >> > It seems very similar to what we've seen with spec benchmarks such as cactus
> >> >> > and bisected to the same commit:
> >> >> > 
> >> >> > https://bugzilla.suse.com/show_bug.cgi?id=1229012
> >> >> > 
> >> >> > The exact regression varies per system. Intel regresses too but relatively
> >> >> > less. The theory is that there are many large-ish allocations that don't
> >> >> > have individual sizes aligned to 2MB and would have been merged, commit
> >> >> > efa7df3e3bb5da causes them to become separate areas where each aligns its
> >> >> > start at 2MB boundary and there are gaps between. This (gaps and vma
> >> >> > fragmentation) itself is not great, but most of the problem seemed to be
>> >> > from the start alignment, which together with the access pattern causes
>> >> > more TLB or cache misses due to limited associativity.
> >> >> > 
> >> >> > So maybe darktable has a similar problem. A simple candidate fix could
> >> >> > change commit efa7df3e3bb5da so that the mapping size has to be a multiple
> >> >> > of THP size (2MB) in order to become aligned, right now it's enough if it's
> >> >> > THP sized or larger.      
> >> >> 
> >> >> Maybe this could be enough to fix the issue? (on 6.12-rc4)    
> >> > 
> >> > 
> >> > Yes, this should work. I was unsure if thp_get_unmapped_area_vmflags()
> >> > differs in other ways from mm_get_unmapped_area_vmflags(), which might
> >> > still be relevant. I mean, does mm_get_unmapped_area_vmflags() also
> >> > prefer to allocate THPs if the virtual memory block is large enough?    
> >> 
> >> Well any sufficiently large area spanning a PMD aligned/sized block (either
> >> a result of a single allocation or merging of several allocations) can
> >> become populated by THPs (at least in those aligned blocks), and the
> >> preference depends on system-wide THP settings and madvise(MADV_HUGEPAGE) or
> >> prctl.
> >> 
> >> But mm_get_unmapped_area_vmflags() will AFAIK not try to align the area to
> >> PMD size like the thp_ version would, even if the request is large enough.  
> > 
> > Then it sounds like exactly what we want. But I prefer to move the
> > check down to __thp_get_unmapped_area() like this:  
> 
> I wanted to limit the fix to the place commit efa7df3e3bb5da changes, i.e.
> anonymous mappings, because there are other callers of
> __thp_get_unmapped_area(), namely the filesystems via
> thp_get_unmapped_area() and I wasn't sure if that wouldn't regress them. But
> since you suggested it, I had a brief look now...
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 2fb328880b50..8d80f3af5525 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1082,6 +1082,9 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> >  	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
> >  		return 0;
> >  
> > +	if (!IS_ALIGNED(len, size))
> > +		return 0;  
> 
> I think the filesystem might be asked to create a mapping for e.g. a
> [1MB, 4MB] range from a file, thus the offset would be 1MB (with anonymous
> pages, off=0 is passed) and the current implementation would try to do the
> right thing for that (align the [2MB, 4MB] range to THP) but after your
> patch it would see len is 3MB and give up, no?

I'm probably showing my ignorance, but I didn't know it is somehow
beneficial to align THP boundaries with corresponding file offsets. In
that case you're right, and the check should be limited to anonymous
mappings.
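
To make the exchange above concrete, the bail-out arithmetic under discussion can be sketched in user space (a toy model of the quoted kernel logic, assuming a 2MB THP size as on x86-64; this is not the actual kernel implementation):

```c
#include <stdbool.h>

#define SZ_1M     (1UL << 20)
#define THP_SIZE  (2 * SZ_1M)   /* PMD-sized THP, assumed 2MB */

#define IS_ALIGNED(x, a)  (((x) & ((a) - 1)) == 0)

/* Mirrors the existing __thp_get_unmapped_area() bail-out: succeed only
 * if [off, off + len) contains at least one THP-sized, THP-aligned block. */
static bool old_check_passes(unsigned long off, unsigned long len)
{
	unsigned long off_end = off + len;
	unsigned long off_align = (off + THP_SIZE - 1) & ~(THP_SIZE - 1);

	return !(off_end <= off_align || (off_end - off_align) < THP_SIZE);
}

/* The extra condition from the quoted diff. */
static bool len_check_passes(unsigned long len)
{
	return IS_ALIGNED(len, THP_SIZE);
}
```

For the [1MB, 4MB] file range (off = 1MB, len = 3MB), the old check finds the alignable [2MB, 4MB] sub-range, while the proposed length check would give up, which is exactly the filesystem case being pointed out.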

Petr T

> > +
> >  	if (off_end <= off_align || (off_end - off_align) < size)
> >  		return 0;
> >    
> 




* [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
  2024-10-24  7:45 darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries" Thorsten Leemhuis
  2024-10-24  9:58 ` Vlastimil Babka
@ 2024-10-24 15:12 ` Vlastimil Babka
  2024-10-24 15:47   ` Lorenzo Stoakes
  2024-10-24 18:32   ` Yang Shi
  1 sibling, 2 replies; 16+ messages in thread
From: Vlastimil Babka @ 2024-10-24 15:12 UTC (permalink / raw)
  To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Jann Horn, Thorsten Leemhuis, linux-mm, linux-kernel,
	Petr Tesarik, Vlastimil Babka, Michael Matz,
	Gabriel Krisman Bertazi, Matthias Bodenbinder, stable,
	Rik van Riel, Yang Shi

Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") a mmap() of anonymous memory without a specific address
hint and of at least PMD_SIZE will be aligned to PMD so that it can
benefit from a THP backing page.

However this change has been shown to regress some workloads
significantly. [1] reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have
merged into a large THP-backed area before commit efa7df3e3bb5; now
they are fragmented into multiple areas, each aligned to a PMD boundary
with gaps between them. The regression seems to be caused mainly by the
benchmark's memory access pattern suffering from TLB or cache aliasing
due to the aligned boundaries of the individual areas.

Another known regression bisected to commit efa7df3e3bb5 is darktable
[2] [3] and early testing suggests this patch fixes the regression there
as well.

To fix the regression but still benefit from THP-friendly anonymous
mapping alignment, add a condition that the size of the mapping must be
a multiple of PMD size instead of merely at least PMD size. With many
odd-sized mappings like the ones cactusBSSN creates, those will no
longer be aligned with gaps between them, and will instead naturally
merge again.

Reported-by: Michael Matz <matz@suse.de>
Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Cc: <stable@vger.kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 9c0fb43064b5..a5297cfb1dfc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 	if (get_area) {
 		addr = get_area(file, addr, len, pgoff, flags);
-	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+		   && IS_ALIGNED(len, PMD_SIZE)) {
 		/* Ensures that larger anonymous mappings are THP aligned. */
 		addr = thp_get_unmapped_area_vmflags(file, addr, len,
 						     pgoff, flags, vm_flags);
-- 
2.47.0
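
To illustrate the commit message's arithmetic: when every mapping start is rounded up to a PMD boundary, any mapping whose length is not a PMD_SIZE multiple necessarily leaves a hole before the next aligned mapping. A back-of-envelope sketch (assumed 2MB PMD size, not kernel code):

```c
#define SZ_1K     1024UL
#define PMD_SIZE  (2048 * SZ_1K)   /* assumed 2MB, as on x86-64 */

/* Hole left between two consecutive mappings of the given length when
 * each start is rounded up to a PMD boundary; 0 means they can merge. */
static unsigned long gap_between_aligned(unsigned long len)
{
	unsigned long rem = len % PMD_SIZE;

	return rem ? PMD_SIZE - rem : 0;
}
```

For the 4632kB mappings from the report, each leaves a 1512kB gap; with the patch applied, such lengths skip the alignment entirely and the areas merge as they did before commit efa7df3e3bb5.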




* Re: [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
  2024-10-24 15:12 ` [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes Vlastimil Babka
@ 2024-10-24 15:47   ` Lorenzo Stoakes
  2024-10-24 16:00     ` Lorenzo Stoakes
                       ` (2 more replies)
  2024-10-24 18:32   ` Yang Shi
  1 sibling, 3 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2024-10-24 15:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R. Howlett, Jann Horn, Thorsten Leemhuis,
	linux-mm, linux-kernel, Petr Tesarik, Michael Matz,
	Gabriel Krisman Bertazi, Matthias Bodenbinder, stable,
	Rik van Riel, Yang Shi

On Thu, Oct 24, 2024 at 05:12:29PM +0200, Vlastimil Babka wrote:
> Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> boundaries") a mmap() of anonymous memory without a specific address
> hint and of at least PMD_SIZE will be aligned to PMD so that it can
> benefit from a THP backing page.
>
> However this change has been shown to regress some workloads
> significantly. [1] reports regressions in various spec benchmarks, with
> up to 600% slowdown of the cactusBSSN benchmark on some platforms. The

Ugh god.

> benchmark seems to create many mappings of 4632kB, which would have
> merged into a large THP-backed area before commit efa7df3e3bb5; now
> they are fragmented into multiple areas, each aligned to a PMD boundary
> with gaps between them. The regression seems to be caused mainly by the
> benchmark's memory access pattern suffering from TLB or cache aliasing
> due to the aligned boundaries of the individual areas.

Any more details on precisely why?

>
> Another known regression bisected to commit efa7df3e3bb5 is darktable
> [2] [3] and early testing suggests this patch fixes the regression there
> as well.

Good!

>
> To fix the regression but still benefit from THP-friendly anonymous
> mapping alignment, add a condition that the size of the mapping must be
> a multiple of PMD size instead of merely at least PMD size. With many
> odd-sized mappings like the ones cactusBSSN creates, those will no
> longer be aligned with gaps between them, and will instead naturally
> merge again.
>

Seems like the original logic just padded the length by PMD size and checked
for overflow, assuming that [pgoff << PAGE_SHIFT, pgoff << PAGE_SHIFT +
len) contains at least one PMD-sized block.

Which I guess results in potentially getting mis-sized empty spaces that
now can't be PMD-merged at the bits that 'overhang' the PMD-sized/aligned
bit?

Which is yeah, not great and would explain this (correct me if my
understanding is wrong).

> Reported-by: Michael Matz <matz@suse.de>
> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
> Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
> Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
> Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
> Cc: <stable@vger.kernel.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Yang Shi <yang@os.amperecomputing.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/mmap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9c0fb43064b5..a5297cfb1dfc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>
>  	if (get_area) {
>  		addr = get_area(file, addr, len, pgoff, flags);
> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
> +		   && IS_ALIGNED(len, PMD_SIZE)) {

So doing this feels right but...

Hm this seems like it belongs in __thp_get_unmapped_area() which does a bunch of
checks up front, returning 0 if they fail, which then results in it performing the
normal get unmapped area logic.

That also has a bunch of (offset) alignment checks as well overflow checks
so it would seem the natural place to also check length?

>  		/* Ensures that larger anonymous mappings are THP aligned. */
>  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>  						     pgoff, flags, vm_flags);
> --
> 2.47.0
>



* Re: [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
  2024-10-24 15:47   ` Lorenzo Stoakes
@ 2024-10-24 16:00     ` Lorenzo Stoakes
  2024-10-24 16:04     ` Vlastimil Babka
  2024-10-28 13:45     ` Michael Matz
  2 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2024-10-24 16:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R. Howlett, Jann Horn, Thorsten Leemhuis,
	linux-mm, linux-kernel, Petr Tesarik, Michael Matz,
	Gabriel Krisman Bertazi, Matthias Bodenbinder, stable,
	Rik van Riel, Yang Shi

On Thu, Oct 24, 2024 at 04:47:54PM +0100, Lorenzo Stoakes wrote:
[snip]

> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 9c0fb43064b5..a5297cfb1dfc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
> >
> >  	if (get_area) {
> >  		addr = get_area(file, addr, len, pgoff, flags);
> > -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
> > +		   && IS_ALIGNED(len, PMD_SIZE)) {
>
> So doing this feels right but...
>
> Hm this seems like it belongs in __thp_get_unmapped_area() which does a bunch of
> checks up front, returning 0 if they fail, which then results in it performing the
> normal get unmapped area logic.
>
> That also has a bunch of (offset) alignment checks as well overflow checks
> so it would seem the natural place to also check length?
>

OK having said that, I see this function is referenced from a bunch of fs
stuff we probably don't want to potentially break by enforcing this
requirement there (at least in this fix).

So disregard that and since this looks otherwise good to me, feel free to add:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>


> >  		/* Ensures that larger anonymous mappings are THP aligned. */
> >  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
> >  						     pgoff, flags, vm_flags);
> > --
> > 2.47.0
> >

Thanks!



* Re: [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
  2024-10-24 15:47   ` Lorenzo Stoakes
  2024-10-24 16:00     ` Lorenzo Stoakes
@ 2024-10-24 16:04     ` Vlastimil Babka
  2024-10-24 16:17       ` Lorenzo Stoakes
  2024-10-28 13:45     ` Michael Matz
  2 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2024-10-24 16:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R. Howlett, Jann Horn, Thorsten Leemhuis,
	linux-mm, linux-kernel, Petr Tesarik, Michael Matz,
	Gabriel Krisman Bertazi, Matthias Bodenbinder, stable,
	Rik van Riel, Yang Shi, Matthew Wilcox

On 10/24/24 17:47, Lorenzo Stoakes wrote:
> On Thu, Oct 24, 2024 at 05:12:29PM +0200, Vlastimil Babka wrote:
>> Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
>> boundaries") a mmap() of anonymous memory without a specific address
>> hint and of at least PMD_SIZE will be aligned to PMD so that it can
>> benefit from a THP backing page.
>>
>> However this change has been shown to regress some workloads
>> significantly. [1] reports regressions in various spec benchmarks, with
>> up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
> 
> Ugh god.
> 
>> benchmark seems to create many mappings of 4632kB, which would have
>> merged into a large THP-backed area before commit efa7df3e3bb5; now
>> they are fragmented into multiple areas, each aligned to a PMD boundary
>> with gaps between them. The regression seems to be caused mainly by the
>> benchmark's memory access pattern suffering from TLB or cache aliasing
>> due to the aligned boundaries of the individual areas.
> 
> Any more details on precisely why?

The experiments performed in [1] didn't seem conclusive enough for me to say
that with enough confidence :) Generally speaking if there are multiple
addresses with the same virtual or physical offset accessed rapidly, they
can alias in the TLB or processor caches due to limited associativity and
cause thrashing. Aligning the mappings to the same 2MB boundary can cause
such aliasing.
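
A toy model of that aliasing effect (assuming a cache or TLB whose set index is taken only from address bits below the 2MB boundary; real hardware hash functions are undocumented):

```c
/* Toy set-index: pick the set from virtual address bits 20:12 (512 sets
 * of 4KB pages).  Any such indexing cannot distinguish streams whose
 * start addresses differ only at or above bit 21, which is exactly what
 * 2MB start alignment produces. */
static unsigned int tlb_set_index(unsigned long vaddr)
{
	return (unsigned int)((vaddr >> 12) & 0x1ff);
}
```

Many streams whose starts are all 2MB-aligned land in the same set at every step of a lockstep access pattern, overwhelming the limited associativity, whereas merged (unaligned) mappings spread their starts across sets.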

>>
>> Another known regression bisected to commit efa7df3e3bb5 is darktable
>> [2] [3] and early testing suggests this patch fixes the regression there
>> as well.
> 
> Good!
> 
>>
>> To fix the regression but still benefit from THP-friendly anonymous
>> mapping alignment, add a condition that the size of the mapping must be
>> a multiple of PMD size instead of merely at least PMD size. With many
>> odd-sized mappings like the ones cactusBSSN creates, those will no
>> longer be aligned with gaps between them, and will instead naturally
>> merge again.
>>
> 
> Seems like the original logic just padded the length by PMD size and checks
> for overflow, assuming that [pgoff << PAGE_SHIFT, pgoff << PAGE_SHIFT +
> len) contains at least one PMD-sized block.
> 
> Which I guess results in potentially getting mis-sized empty spaces that
> now can't be PMD-merged at the bits that 'overhang' the PMD-sized/aligned
> bit?
> 
> Which is yeah, not great and would explain this (correct me if my
> understanding is wrong).
> 
>> Reported-by: Michael Matz <matz@suse.de>
>> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
>> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
>> Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
>> Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
>> Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
>> Cc: <stable@vger.kernel.org>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Yang Shi <yang@os.amperecomputing.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/mmap.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 9c0fb43064b5..a5297cfb1dfc 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>>
>>  	if (get_area) {
>>  		addr = get_area(file, addr, len, pgoff, flags);
>> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
>> +		   && IS_ALIGNED(len, PMD_SIZE)) {
> 
> So doing this feels right but...
> 
> Hm this seems like it belongs in __thp_get_unmapped_area() which does a bunch of
> checks up front returning 0 if they fail, which then results in it peforming the
> normal get unmapped area logic.
> 
> That also has a bunch of (offset) alignment checks as well overflow checks
> so it would seem the natural place to also check length?

Petr suggested the same, but changing __thp_get_unmapped_area() affects FS
THPs and the proposed check seemed wrong to me:

https://lore.kernel.org/all/9d7c73f6-1e1a-458b-93c6-3b44959022e0@suse.cz/

While it could be fixed, I'm still not sure if we want to restrict FS THPs
the same as anonymous THPs. AFAIU even small mappings of a range from a file
should be aligned properly to make it possible for a large range from the
same file (that includes the smaller range) mapped elsewhere to be THP
backed? I mean we can investigate it further, but for the regression fix to be
backported to stable kernels it seemed safer to address only the case
that was changed by commit efa7df3e3bb5 specifically, i.e. anonymous mappings.

>>  		/* Ensures that larger anonymous mappings are THP aligned. */
>>  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>>  						     pgoff, flags, vm_flags);
>> --
>> 2.47.0
>>




* Re: [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
  2024-10-24 16:04     ` Vlastimil Babka
@ 2024-10-24 16:17       ` Lorenzo Stoakes
  0 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2024-10-24 16:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R. Howlett, Jann Horn, Thorsten Leemhuis,
	linux-mm, linux-kernel, Petr Tesarik, Michael Matz,
	Gabriel Krisman Bertazi, Matthias Bodenbinder, stable,
	Rik van Riel, Yang Shi, Matthew Wilcox

On Thu, Oct 24, 2024 at 06:04:41PM +0200, Vlastimil Babka wrote:

> Petr suggested the same, but changing  __thp_get_unmapped_area() affects FS
> THP's and the proposed check seemed wrong to me:
>
> https://lore.kernel.org/all/9d7c73f6-1e1a-458b-93c6-3b44959022e0@suse.cz/
>
> While it could be fixed, I'm still not sure if we want to restrict FS THPs
> the same as anonymous THPs. AFAIU even small mappings of a range from a file
> should be aligned properly to make it possible for a large range from the
> same file (that includes the smaller range) mapped elsewhere to be THP
> backed? I mean we can investigate it further, but for the regression fix to
> backported to stable kernels it seemed more safe to address only the case
> that was changed by commit efa7df3e3bb5 specifically, i.e. anonymous mappings.
>

Ack, yeah totally agreed - sorry I missed the fs usage before, see my 2nd
reply. I had wrongly assumed this was only used in 1 place, where it would
be sensible to move the check; however, with fs using it, of course it's not.

Gave an R-b tag on other reply so this patch LGTM! :)

Cheers for finding this utterly critical fix!



* Re: [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
  2024-10-24 15:12 ` [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes Vlastimil Babka
  2024-10-24 15:47   ` Lorenzo Stoakes
@ 2024-10-24 18:32   ` Yang Shi
  1 sibling, 0 replies; 16+ messages in thread
From: Yang Shi @ 2024-10-24 18:32 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Jann Horn, Thorsten Leemhuis, linux-mm, linux-kernel,
	Petr Tesarik, Michael Matz, Gabriel Krisman Bertazi,
	Matthias Bodenbinder, stable, Rik van Riel



On 10/24/24 8:12 AM, Vlastimil Babka wrote:
> Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> boundaries") a mmap() of anonymous memory without a specific address
> hint and of at least PMD_SIZE will be aligned to PMD so that it can
> benefit from a THP backing page.
>
> However this change has been shown to regress some workloads
> significantly. [1] reports regressions in various spec benchmarks, with
> up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
> benchmark seems to create many mappings of 4632kB, which would have
> merged into a large THP-backed area before commit efa7df3e3bb5; now
> they are fragmented into multiple areas, each aligned to a PMD boundary
> with gaps between them. The regression seems to be caused mainly by the
> benchmark's memory access pattern suffering from TLB or cache aliasing
> due to the aligned boundaries of the individual areas.
>
> Another known regression bisected to commit efa7df3e3bb5 is darktable
> [2] [3] and early testing suggests this patch fixes the regression there
> as well.
>
> To fix the regression but still benefit from THP-friendly anonymous
> mapping alignment, add a condition that the size of the mapping must be
> a multiple of PMD size instead of merely at least PMD size. With many
> odd-sized mappings like the ones cactusBSSN creates, those will no
> longer be aligned with gaps between them, and will instead naturally
> merge again.

Thanks for debugging this. The fix makes sense to me.

Reviewed-by: Yang Shi <yang@os.amperecomputing.com>

>
> Reported-by: Michael Matz <matz@suse.de>
> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
> Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
> Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
> Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
> Cc: <stable@vger.kernel.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Yang Shi <yang@os.amperecomputing.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>   mm/mmap.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9c0fb43064b5..a5297cfb1dfc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>   
>   	if (get_area) {
>   		addr = get_area(file, addr, len, pgoff, flags);
> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
> +		   && IS_ALIGNED(len, PMD_SIZE)) {
>   		/* Ensures that larger anonymous mappings are THP aligned. */
>   		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>   						     pgoff, flags, vm_flags);




* Re: [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
  2024-10-24 15:47   ` Lorenzo Stoakes
  2024-10-24 16:00     ` Lorenzo Stoakes
  2024-10-24 16:04     ` Vlastimil Babka
@ 2024-10-28 13:45     ` Michael Matz
  2 siblings, 0 replies; 16+ messages in thread
From: Michael Matz @ 2024-10-28 13:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett, Jann Horn,
	Thorsten Leemhuis, linux-mm, linux-kernel, Petr Tesarik,
	Gabriel Krisman Bertazi, Matthias Bodenbinder, stable,
	Rik van Riel, Yang Shi

Hello,

On Thu, 24 Oct 2024, Lorenzo Stoakes wrote:

> > benchmark seems to create many mappings of 4632kB, which would have
> > merged to a large THP-backed area before commit efa7df3e3bb5 and now
> > they are fragmented to multiple areas each aligned to PMD boundary with
> > gaps between. The regression then seems to be caused mainly due to the
> > benchmark's memory access pattern suffering from TLB or cache aliasing
> > due to the aligned boundaries of the individual areas.
> 
> Any more details on precisely why?

Anything we found out and theorized about is in the SUSE bug report.  I 
think the best theory is TLB aliasing when the mixing^Whash function in 
the given hardware uses too few bits, and most of them in the low 21-12 
bits of an address.  Of course that then still depends on the particular 
access pattern.  cactusBSSN has about 20 memory streams in the hot loops, 
and the accesses are fairly regular from step to step (plus/minus certain 
strides in 3D arrays).  When their start addresses all differ only in the 
upper bits, you will hit TLB aliasing from time to time, and when the 
dimensions/strides are just right it occurs often, the N-way associativity 
doesn't save you anymore and you will hit it very very hard.

It was interesting to see how broad the range of CPUs and vendors was that 
exhibited the problem (in various degrees of severity, from 50% to 600% 
slowdown), and how more recent CPUs don't show the symptom anymore.  I 
guess the micro-arch guys eventually convinced P&R management that hashing 
another bit or two is worth the silicon :-)


Ciao,
Michael.



end of thread, other threads:[~2024-10-28 13:45 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-24  7:45 darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries" Thorsten Leemhuis
2024-10-24  9:58 ` Vlastimil Babka
2024-10-24 10:23   ` Vlastimil Babka
2024-10-24 10:49     ` Petr Tesarik
2024-10-24 10:56       ` Vlastimil Babka
2024-10-24 11:13         ` Petr Tesarik
2024-10-24 13:29           ` Vlastimil Babka
2024-10-24 14:14             ` Petr Tesarik
2024-10-24 11:20     ` Matthias Bodenbinder
2024-10-24 15:12 ` [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes Vlastimil Babka
2024-10-24 15:47   ` Lorenzo Stoakes
2024-10-24 16:00     ` Lorenzo Stoakes
2024-10-24 16:04     ` Vlastimil Babka
2024-10-24 16:17       ` Lorenzo Stoakes
2024-10-28 13:45     ` Michael Matz
2024-10-24 18:32   ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox