linux-mm.kvack.org archive mirror
* [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
@ 2026-04-08  9:24 Hrushikesh Salunke
  2026-04-08  9:47 ` Vlastimil Babka (SUSE)
  2026-04-08 11:32 ` [syzbot ci] " syzbot ci
  0 siblings, 2 replies; 8+ messages in thread
From: Hrushikesh Salunke @ 2026-04-08  9:24 UTC (permalink / raw)
  To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy
  Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
	shivankg, hsalunke

When init_on_alloc is enabled, kernel_init_pages() clears every page
one at a time, calling clear_page() per page.  This is unnecessarily
slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
real workloads.

On 64-bit (!HIGHMEM) systems, switch to clearing pages in batch via
clear_pages(), bypassing the per-page kmap_local_page()/kunmap_local()
overhead and allowing the arch clearing primitive to operate on the full
contiguous range in a single invocation.  The batch size is the full
allocation when the preempt model is preemptible (preemption points are
implicit), or PROCESS_PAGES_NON_PREEMPT_BATCH otherwise, with
cond_resched() between batches to limit scheduling latency under
cooperative preemption.

The HIGHMEM path is kept as-is since those pages require kmap.

Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:

  Before: 0.445s
  After:  0.166s  (-62.7%, 2.68x faster)

Kernel time (sys) reduction per workload with init_on_alloc=1:

  Workload            Before       After       Change
  Graph500 64C128T    30m 41.8s    15m 14.8s   -50.3%
  Graph500 16C32T     15m 56.7s     9m 43.7s   -39.0%
  Pagerank 32T         1m 58.5s     1m 12.8s   -38.5%
  Pagerank 128T        2m 36.3s     1m 40.4s   -35.7%

Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
---
base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35

 mm/page_alloc.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1c5430cad4e..178cbebadd50 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
 
 	/* s390's use of memset() could override KASAN redzones. */
 	kasan_disable_current();
-	for (i = 0; i < numpages; i++)
-		clear_highpage_kasan_tagged(page + i);
+
+	if (!IS_ENABLED(CONFIG_HIGHMEM)) {
+		void *addr = kasan_reset_tag(page_address(page));
+		unsigned int unit = preempt_model_preemptible() ?
+					numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
+		int count;
+
+		for (i = 0; i < numpages; i += count) {
+			cond_resched();
+			count = min_t(int, unit, numpages - i);
+			clear_pages(addr + (i << PAGE_SHIFT), count);
+		}
+	} else {
+		for (i = 0; i < numpages; i++)
+			clear_highpage_kasan_tagged(page + i);
+	}
+
 	kasan_enable_current();
 }
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
  2026-04-08  9:24 [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages() Hrushikesh Salunke
@ 2026-04-08  9:47 ` Vlastimil Babka (SUSE)
  2026-04-08 10:44   ` Salunke, Hrushikesh
  2026-04-08 11:32 ` [syzbot ci] " syzbot ci
  1 sibling, 1 reply; 8+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-04-08  9:47 UTC (permalink / raw)
  To: Hrushikesh Salunke, akpm, surenb, mhocko, jackmanb, hannes, ziy
  Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
	shivankg, David Hildenbrand

On 4/8/26 11:24, Hrushikesh Salunke wrote:
> When init_on_alloc is enabled, kernel_init_pages() clears every page
> one at a time, calling clear_page() per page.  This is unnecessarily
> slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
> real workloads.
> 
> On 64-bit (!HIGHMEM) systems, switch to clearing pages in batch via
> clear_pages(), bypassing the per-page kmap_local_page()/kunmap_local()
> overhead and allowing the arch clearing primitive to operate on the full
> contiguous range in a single invocation.  The batch size is the full
> allocation when the preempt model is preemptible (preemption points are
> implicit), or PROCESS_PAGES_NON_PREEMPT_BATCH otherwise, with
> cond_resched() between batches to limit scheduling latency under
> cooperative preemption.
> 
> The HIGHMEM path is kept as-is since those pages require kmap.
> 
> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
> 
>   Before: 0.445s
>   After:  0.166s  (-62.7%, 2.68x faster)
> 
> Kernel time (sys) reduction per workload with init_on_alloc=1:
> 
>   Workload            Before       After       Change
>   Graph500 64C128T    30m 41.8s    15m 14.8s   -50.3%
>   Graph500 16C32T     15m 56.7s     9m 43.7s   -39.0%
>   Pagerank 32T         1m 58.5s     1m 12.8s   -38.5%
>   Pagerank 128T        2m 36.3s     1m 40.4s   -35.7%
> 
> Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
> ---
> base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35

Any way to reuse the code added by [1], e.g. clear_user_highpages()?

[1]
https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/

> 
>  mm/page_alloc.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b1c5430cad4e..178cbebadd50 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
>  
>  	/* s390's use of memset() could override KASAN redzones. */
>  	kasan_disable_current();
> -	for (i = 0; i < numpages; i++)
> -		clear_highpage_kasan_tagged(page + i);
> +
> +	if (!IS_ENABLED(CONFIG_HIGHMEM)) {
> +		void *addr = kasan_reset_tag(page_address(page));
> +		unsigned int unit = preempt_model_preemptible() ?
> +					numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
> +		int count;
> +
> +		for (i = 0; i < numpages; i += count) {
> +			cond_resched();
> +			count = min_t(int, unit, numpages - i);
> +			clear_pages(addr + (i << PAGE_SHIFT), count);
> +		}
> +	} else {
> +		for (i = 0; i < numpages; i++)
> +			clear_highpage_kasan_tagged(page + i);
> +	}
> +
>  	kasan_enable_current();
>  }
>  




* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
  2026-04-08  9:47 ` Vlastimil Babka (SUSE)
@ 2026-04-08 10:44   ` Salunke, Hrushikesh
  2026-04-08 10:53     ` David Hildenbrand (Arm)
                       ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Salunke, Hrushikesh @ 2026-04-08 10:44 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes, ziy
  Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
	shivankg, David Hildenbrand, hsalunke


On 08-04-2026 15:17, Vlastimil Babka (SUSE) wrote:

> On 4/8/26 11:24, Hrushikesh Salunke wrote:
>> [...]
> Any way to reuse the code added by [1], e.g. clear_user_highpages()?
>
> [1]
> https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/

Thanks for the review. Sure, I will check if code reuse is possible.
Meanwhile I found another issue with the current patch.

kernel_init_pages() runs inside the allocator (post_alloc_hook and
__free_pages_prepare), so it inherits whatever context the caller is in.
Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
page allocation happens while the PTE lock and RCU read lock are held,
making the cond_resched() in the clearing loop illegal:

[ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
[ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
[ 1997.353572] preempt_count: 1, expected: 0
[ 1997.353706] RCU nest depth: 1, expected: 0
[ 1997.353837] 3 locks held by bash/19725:
[ 1997.353839]  #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
[ 1997.353850]  #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
[ 1997.353855]  #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
[ 1997.353868] Call Trace:
[ 1997.353870]  <TASK>
[ 1997.353873]  dump_stack_lvl+0x91/0xb0
[ 1997.353877]  __might_resched+0x15f/0x290
[ 1997.353882]  kernel_init_pages+0x4b/0xa0
[ 1997.353886]  get_page_from_freelist+0x406/0x1e60
[ 1997.353895]  __alloc_frozen_pages_noprof+0x1d8/0x1730
[ 1997.353912]  alloc_pages_mpol+0xa4/0x190
[ 1997.353917]  alloc_pages_noprof+0x59/0xd0
[ 1997.353919]  get_free_pages_noprof+0x11/0x40
[ 1997.353921]  __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
[ 1997.353923]  __zap_vma_range+0x1bbd/0x1f40
[ 1997.353931]  unmap_vmas+0xd9/0x1d0
[ 1997.353934]  exit_mmap+0x10a/0x430
[ 1997.353943]  __mmput+0x3d/0x130
[ 1997.353947]  do_exit+0x2a7/0xae0
[ 1997.353951]  do_group_exit+0x36/0xa0
[ 1997.353953]  __x64_sys_exit_group+0x18/0x20
[ 1997.353959]  do_syscall_64+0xe1/0x710
[ 1997.353990]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1997.354003]  </TASK>

This also means clear_contig_highpages() can't be directly reused here
since it has an unconditional might_sleep() + cond_resched(). I'll look
into this. Any suggestions on the right way to handle cond_resched()
in a context that may or may not be atomic?

Thanks,
Hrushikesh




* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
  2026-04-08 10:44   ` Salunke, Hrushikesh
@ 2026-04-08 10:53     ` David Hildenbrand (Arm)
  2026-04-08 11:16     ` Raghavendra K T
  2026-04-08 15:32     ` Andrew Morton
  2 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-08 10:53 UTC (permalink / raw)
  To: Salunke, Hrushikesh, Vlastimil Babka (SUSE),
	akpm, surenb, mhocko, jackmanb, hannes, ziy
  Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora, shivankg

On 4/8/26 12:44, Salunke, Hrushikesh wrote:
> 
> On 08-04-2026 15:17, Vlastimil Babka (SUSE) wrote:
> 
>> On 4/8/26 11:24, Hrushikesh Salunke wrote:
>>> [...]
>> Any way to reuse the code added by [1], e.g. clear_user_highpages()?
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/
> 
> Thanks for the review. Sure, I will check if code reuse is possible.
> Meanwhile I found another issue with the current patch.
> 
> kernel_init_pages() runs inside the allocator (post_alloc_hook and
> __free_pages_prepare), so it inherits whatever context the caller is in.
> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
> page allocation happens while the PTE lock and RCU read lock are held,
> making the cond_resched() in the clearing loop illegal:
> 
> [...]
> 
> This also means clear_contig_highpages() can't be directly reused here
> since it has an unconditional might_sleep() + cond_resched(). I'll look
> into this. Any suggestions on the right way to handle cond_resched()
> in a context that may or may not be atomic?

clear_contig_highpages() is prepared to handle arbitrary sizes,
including 1 GiB chunks or even larger.

The question is whether you even have to use
PROCESS_PAGES_NON_PREEMPT_BATCH, given that we cannot trigger a manual
resched either way (the assumption being that the memory we are clearing
is not that big; well, on arm64 it can still be 512 MiB).

So I wonder what happens when you just use clear_pages().

Likely you should provide a clear_highpages_kasan_tagged() and a
clear_highpages() ?

So you would be calling clear_highpages_kasan_tagged() here that would
just default to calling clear_highpages() unless kasan applies etc.

-- 
Cheers,

David



* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
  2026-04-08 10:44   ` Salunke, Hrushikesh
  2026-04-08 10:53     ` David Hildenbrand (Arm)
@ 2026-04-08 11:16     ` Raghavendra K T
  2026-04-08 16:24       ` Raghavendra K T
  2026-04-08 15:32     ` Andrew Morton
  2 siblings, 1 reply; 8+ messages in thread
From: Raghavendra K T @ 2026-04-08 11:16 UTC (permalink / raw)
  To: Salunke, Hrushikesh, Vlastimil Babka (SUSE),
	akpm, surenb, mhocko, jackmanb, hannes, ziy
  Cc: linux-mm, linux-kernel, bharata, ankur.a.arora, shivankg,
	David Hildenbrand



On 4/8/2026 4:14 PM, Salunke, Hrushikesh wrote:
> 
> On 08-04-2026 15:17, Vlastimil Babka (SUSE) wrote:
> 
>> On 4/8/26 11:24, Hrushikesh Salunke wrote:
>>> [...]
>> Any way to reuse the code added by [1], e.g. clear_user_highpages()?
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/
> 
> Thanks for the review. Sure, I will check if code reuse is possible.
> Meanwhile I found another issue with the current patch.
> 
> kernel_init_pages() runs inside the allocator (post_alloc_hook and
> __free_pages_prepare), so it inherits whatever context the caller is in.
> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
> page allocation happens while the PTE lock and RCU read lock are held,
> making the cond_resched() in the clearing loop illegal:
> 
> [...]
> 
> This also means clear_contig_highpages() can't be directly reused here
> since it has an unconditional might_sleep() + cond_resched(). I'll look
> into this. Any suggestions on the right way to handle cond_resched()
> in a context that may or may not be atomic?
> 
> Thanks,
> Hrushikesh
> 
>>>   mm/page_alloc.c | 19 +++++++++++++++++--
>>>   1 file changed, 17 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index b1c5430cad4e..178cbebadd50 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
>>>
>>>        /* s390's use of memset() could override KASAN redzones. */
>>>        kasan_disable_current();
>>> -     for (i = 0; i < numpages; i++)
>>> -             clear_highpage_kasan_tagged(page + i);
>>> +
>>> +     if (!IS_ENABLED(CONFIG_HIGHMEM)) {
>>> +             void *addr = kasan_reset_tag(page_address(page));
>>> +             unsigned int unit = preempt_model_preemptible() ?
>>> +                                     numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>>> +             int count;
>>> +
>>> +             for (i = 0; i < numpages; i += count) {
>>> +                     cond_resched();

Just thinking: considering that for a preemptible/preempt_auto kernel,
preempt_count() already knows about the preemption points and decides
where preemption can happen,

and

for a non-preemptible or voluntary kernel it is safe to preempt at
PROCESS_PAGES_NON_PREEMPT_BATCH granularity,

do we need cond_resched() here?

Let me know if I am missing something.

>>> +                     count = min_t(int, unit, numpages - i);
>>> +                     clear_pages(addr + (i << PAGE_SHIFT), count);
>>> +             }
>>> +     } else {
>>> +             for (i = 0; i < numpages; i++)
>>> +                     clear_highpage_kasan_tagged(page + i);
>>> +     }
>>> +
>>>        kasan_enable_current();
>>>   }
>>>

Regards
- Raghu




* [syzbot ci] Re: mm/page_alloc: use batch page clearing in kernel_init_pages()
  2026-04-08  9:24 [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages() Hrushikesh Salunke
  2026-04-08  9:47 ` Vlastimil Babka (SUSE)
@ 2026-04-08 11:32 ` syzbot ci
  1 sibling, 0 replies; 8+ messages in thread
From: syzbot ci @ 2026-04-08 11:32 UTC (permalink / raw)
  To: akpm, ankur.a.arora, bharata, hannes, hsalunke, jackmanb,
	linux-kernel, linux-mm, mhocko, rkodsara, shivankg, surenb,
	vbabka, ziy
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] mm/page_alloc: use batch page clearing in kernel_init_pages()
https://lore.kernel.org/all/20260408092441.435133-1-hsalunke@amd.com
* [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()

and found the following issue:
WARNING in preempt_model_full

Full report is available here:
https://ci.syzbot.org/series/be6c0534-641b-42aa-8b73-ab8f592ec267

***

WARNING in preempt_model_full

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      0d90551ea699ef3d1a85cd7a1a7e21e8d4f04db2
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/22cbd293-31c6-46b9-b8d4-2ff590ce406b/config

CPU topo: Max. logical packages:   2
CPU topo: Max. logical nodes:      1
CPU topo: Num. nodes per package:  1
CPU topo: Max. logical dies:       2
CPU topo: Max. dies per package:   1
CPU topo: Max. threads per core:   1
CPU topo: Num. cores per package:     1
CPU topo: Num. threads per package:   1
CPU topo: Allowing 2 present CPUs plus 0 hotplug CPUs
kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x000fffff]
PM: hibernation: Registered nosave memory: [mem 0x7ffdf000-0xffffffff]
[gap 0xc0000000-0xfed1bfff] available for PCI devices
Booting paravirtualized kernel on KVM
clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
Zone ranges:
  DMA      [mem 0x0000000000001000-0x0000000000ffffff]
  DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
  Normal   [mem 0x0000000100000000-0x000000023fffffff]
  Device   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000001000-0x000000000009efff]
  node   0: [mem 0x0000000000100000-0x000000007ffdefff]
  node   0: [mem 0x0000000100000000-0x0000000160000fff]
  node   1: [mem 0x0000000160001000-0x000000023fffffff]
Initmem setup node 0 [mem 0x0000000000001000-0x0000000160000fff]
Initmem setup node 1 [mem 0x0000000160001000-0x000000023fffffff]
On node 0, zone DMA: 1 pages in unavailable ranges
On node 0, zone DMA: 97 pages in unavailable ranges
On node 0, zone Normal: 33 pages in unavailable ranges
setup_percpu: NR_CPUS:8 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:2
percpu: Embedded 71 pages/cpu s250120 r8192 d32504 u2097152
kvm-guest: PV spinlocks disabled, no host support
Kernel command line: earlyprintk=serial net.ifnames=0 sysctl.kernel.hung_task_all_cpu_backtrace=1 ima_policy=tcb nf-conntrack-ftp.ports=20000 nf-conntrack-tftp.ports=20000 nf-conntrack-sip.ports=20000 nf-conntrack-irc.ports=20000 nf-conntrack-sane.ports=20000 binder.debug_mask=0 rcupdate.rcu_expedited=1 rcupdate.rcu_cpu_stall_cputime=1 no_hash_pointers page_owner=on sysctl.vm.nr_hugepages=4 sysctl.vm.nr_overcommit_hugepages=4 secretmem.enable=1 sysctl.max_rcu_stall_to_panic=1 msr.allow_writes=off coredump_filter=0xffff root=/dev/sda console=ttyS0 vsyscall=native numa=fake=2 kvm-intel.nested=1 spec_store_bypass_disable=prctl nopcid vivid.n_devs=64 vivid.multiplanar=1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2 netrom.nr_ndevs=32 rose.rose_ndevs=32 smp.csd_lock_timeout=100000 watchdog_thresh=55 workqueue.watchdog_thresh=140 sysctl.net.core.netdev_unregister_timeout_secs=140 dummy_hcd.num=32 max_loop=32 nbds_max=32 \
Kernel command line: comedi.comedi_num_legacy_minors=4 panic_on_warn=1 root=/dev/sda console=ttyS0 root=/dev/sda1
Unknown kernel command line parameters "nbds_max=32", will be passed to user space.
printk: log buffer data + meta data: 262144 + 917504 = 1179648 bytes
software IO TLB: area num 2.
Fallback order for Node 0: 0 1 
Fallback order for Node 1: 1 0 
Built 2 zonelists, mobility grouping on.  Total pages: 1834877
Policy zone: Normal
mem auto-init: stack:all(zero), heap alloc:on, heap free:off
stackdepot: allocating hash table via alloc_large_system_hash
stackdepot hash table entries: 1048576 (order: 12, 16777216 bytes, linear)
stackdepot: allocating space for 8192 stack pools via memblock
------------[ cut here ]------------
preempt_dynamic_mode == preempt_dynamic_undefined
WARNING: kernel/sched/core.c:7743 at preempt_model_full+0x1e/0x30, CPU#0: swapper/0
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted syzkaller #0 PREEMPT(undef) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:preempt_model_full+0x1e/0x30
Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 83 3d 35 d9 cb 0c ff 74 10 83 3d 2c d9 cb 0c 02 0f 94 c0 2e e9 93 fa 19 0a 90 <0f> 0b 90 eb ea 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90
RSP: 0000:ffffffff8e407a78 EFLAGS: 00010046
RAX: 1ffffffff1c359d9 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffea0004000000
RBP: 0000000000040100 R08: ffffffff8221aafb R09: 0000000000000000
R10: ffffed1020000000 R11: ffffed1024206961 R12: 0000000000000001
R13: 0000000000000001 R14: ffff888100000000 R15: dffffc0000000000
FS:  0000000000000000(0000) GS:ffff88818de62000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e54c000 CR4: 00000000000000b0
Call Trace:
 <TASK>
 kernel_init_pages+0x6d/0xe0
 post_alloc_hook+0xae/0x1e0
 get_page_from_freelist+0x24ba/0x2540
 __alloc_frozen_pages_noprof+0x18d/0x380
 alloc_pages_mpol+0x235/0x490
 alloc_pages_noprof+0xac/0x2a0
 __pud_alloc+0x3a/0x460
 preallocate_vmalloc_pages+0x386/0x3d0
 mm_core_init+0x79/0xb0
 start_kernel+0x15a/0x3d0
 x86_64_start_reservations+0x24/0x30
 x86_64_start_kernel+0x143/0x1c0
 common_startup_64+0x13e/0x147
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.



* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
  2026-04-08 10:44   ` Salunke, Hrushikesh
  2026-04-08 10:53     ` David Hildenbrand (Arm)
  2026-04-08 11:16     ` Raghavendra K T
@ 2026-04-08 15:32     ` Andrew Morton
  2 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2026-04-08 15:32 UTC (permalink / raw)
  To: Salunke, Hrushikesh
  Cc: Vlastimil Babka (SUSE),
	surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	rkodsara, bharata, ankur.a.arora, shivankg, David Hildenbrand

On Wed, 8 Apr 2026 16:14:03 +0530 "Salunke, Hrushikesh" <hsalunke@amd.com> wrote:

> kernel_init_pages() runs inside the allocator (post_alloc_hook and
> __free_pages_prepare), so it inherits whatever context the caller is in.
> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
> page allocation happens while the PTE lock and RCU read lock are held,
> making the cond_resched() in the clearing loop illegal:
> 
> [ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
> [ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
> [ 1997.353572] preempt_count: 1, expected: 0
> [ 1997.353706] RCU nest depth: 1, expected: 0
> [ 1997.353837] 3 locks held by bash/19725:
> [ 1997.353839]  #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
> [ 1997.353850]  #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
> [ 1997.353855]  #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
> [ 1997.353868] Call Trace:
> [ 1997.353870]  <TASK>
> [ 1997.353873]  dump_stack_lvl+0x91/0xb0
> [ 1997.353877]  __might_resched+0x15f/0x290
> [ 1997.353882]  kernel_init_pages+0x4b/0xa0
> [ 1997.353886]  get_page_from_freelist+0x406/0x1e60
> [ 1997.353895]  __alloc_frozen_pages_noprof+0x1d8/0x1730
> [ 1997.353912]  alloc_pages_mpol+0xa4/0x190
> [ 1997.353917]  alloc_pages_noprof+0x59/0xd0
> [ 1997.353919]  get_free_pages_noprof+0x11/0x40
> [ 1997.353921]  __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
> [ 1997.353923]  __zap_vma_range+0x1bbd/0x1f40
> [ 1997.353931]  unmap_vmas+0xd9/0x1d0
> [ 1997.353934]  exit_mmap+0x10a/0x430
> [ 1997.353943]  __mmput+0x3d/0x130
> [ 1997.353947]  do_exit+0x2a7/0xae0

tlb_next_batch() is (fortunately) using GFP_NOWAIT.  Perhaps you can
alter your patch to skip the cond_resched() if the caller is attempting
an atomic allocation.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
  2026-04-08 11:16     ` Raghavendra K T
@ 2026-04-08 16:24       ` Raghavendra K T
  0 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2026-04-08 16:24 UTC (permalink / raw)
  To: Salunke, Hrushikesh, Vlastimil Babka (SUSE),
	akpm, surenb, mhocko, jackmanb, hannes, ziy
  Cc: linux-mm, linux-kernel, bharata, ankur.a.arora, shivankg,
	David Hildenbrand



On 4/8/2026 4:46 PM, Raghavendra K T wrote:
> 
> 
> On 4/8/2026 4:14 PM, Salunke, Hrushikesh wrote:
>>
[...]
>>>>   mm/page_alloc.c | 19 +++++++++++++++++--
>>>>   1 file changed, 17 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index b1c5430cad4e..178cbebadd50 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
>>>>
>>>>        /* s390's use of memset() could override KASAN redzones. */
>>>>        kasan_disable_current();
>>>> -     for (i = 0; i < numpages; i++)
>>>> -             clear_highpage_kasan_tagged(page + i);
>>>> +
>>>> +     if (!IS_ENABLED(CONFIG_HIGHMEM)) {
>>>> +             void *addr = kasan_reset_tag(page_address(page));
>>>> +             unsigned int unit = preempt_model_preemptible() ?
>>>> +                                     numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>>>> +             int count;
>>>> +
>>>> +             for (i = 0; i < numpages; i += count) {
>>>> +                     cond_resched();
> 
> Just thinking,
> Considering that for preemptible kernel/preempt_auto preempt_count()
> knows about preemption points to decide where it can preempt,
> 
> and
> 
> for non_preemptible kernel and voluntary kernel it is safe to do
> preemption at PROCESS_PAGES_NON_PREEMPT_BATCH granularity

s/preemption/clear_page/

> do we need cond_resched() here ?
> 
> Let me know if I am missing something.
> 

I see Andrew has the same thought (to remove cond_resched()); sorry for
not being crisp earlier :)

But please ensure you test with the following preempt configs so there
are no surprises: preempt=none, voluntary, full(/auto).

Regards
- Raghu




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-04-08 16:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-08  9:24 [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages() Hrushikesh Salunke
2026-04-08  9:47 ` Vlastimil Babka (SUSE)
2026-04-08 10:44   ` Salunke, Hrushikesh
2026-04-08 10:53     ` David Hildenbrand (Arm)
2026-04-08 11:16     ` Raghavendra K T
2026-04-08 16:24       ` Raghavendra K T
2026-04-08 15:32     ` Andrew Morton
2026-04-08 11:32 ` [syzbot ci] " syzbot ci

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox