* [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
@ 2022-12-23 13:52 Yin Fengwei
  2022-12-27 20:15 ` Nathan Chancellor
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Yin Fengwei @ 2022-12-23 13:52 UTC (permalink / raw)
  To: riel, nathan, akpm, shy828301, willy, kirill.shutemov,
	ying.huang, feng.tang, zhengjun.xing, linux-mm
  Cc: fengwei.yin

A kernel build regression with LLVM was reported here:
https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
against commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
boundaries"), and that commit was reverted.

It turned out the regression is related to ld.lld's use of
madvise(MADV_DONTNEED) with a len parameter that is not PMD_SIZE aligned.
trace-bpfcc captured:
531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4

If the underlying physical page is a THP, madvise(MADV_DONTNEED) can
raise split_queue_lock contention significantly. perf showed the
following data:
    14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
       entry_SYSCALL_64_after_hwframe
           11.52%
                entry_SYSCALL_64_after_hwframe
                do_syscall_64
                __x64_sys_madvise
                do_madvise.part.0
                zap_page_range
                unmap_single_vma
                unmap_page_range
                page_remove_rmap
                deferred_split_huge_page
                __lock_text_start
                native_queued_spin_lock_slowpath

If a THP can't be removed from the rmap as a whole, it is removed
partially by removing its sub-pages from the rmap one by one. Even if
the THP head page has already been added to the deferred queue,
split_queue_lock is acquired just to check whether the head page is in
the queue. Thus the contention on split_queue_lock is raised.

Before acquiring split_queue_lock, check and bail out early if the THP
head page is already in the queue. The check without holding
split_queue_lock could race with deferred_split_scan(), but it doesn't
impact correctness here.

Test result of building kernel with ld.lld:
commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        6:07.99 real,   26367.77 user,  5063.35 sys

commit f35b5d7d676e:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        7:22.15 real,   26235.03 user,  12504.55 sys

commit f35b5d7d676e with the fixing patch:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        6:08.49 real,   26520.15 user,  5047.91 sys

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
My first thought was to change the per-node deferred queue to a per-CPU
one, but that is complicated and may be overkill.

For the race with the lockless check, I didn't see an obvious issue.
But I could be missing something; let me know if so. Thanks.


 mm/huge_memory.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa..7cde9f702e63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
 	if (PageSwapCache(page))
 		return;
 
+	if (!list_empty(page_deferred_list(page)))
+		return;
+
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (list_empty(page_deferred_list(page))) {
 		count_vm_event(THP_DEFERRED_SPLIT_PAGE);

base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
-- 
2.34.1



Thread overview: 9+ messages
2022-12-23 13:52 [RFC PATCH] mm/thp: check and bail out if page in deferred queue already Yin Fengwei
2022-12-27 20:15 ` Nathan Chancellor
2022-12-29  1:14   ` Yin, Fengwei
2022-12-28  1:24 ` David Rientjes
2022-12-29  1:15   ` Yin, Fengwei
2022-12-28  2:06 ` Huang, Ying
2022-12-29  1:15   ` Yin, Fengwei
2022-12-28 23:33 ` Andrew Morton
2022-12-29  1:29   ` Yin, Fengwei
