From: "David Hildenbrand (Arm)" <david@kernel.org>
To: "Salunke, Hrushikesh" <hsalunke@amd.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
surenb@google.com, mhocko@suse.com, jackmanb@google.com,
hannes@cmpxchg.org, ziy@nvidia.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, rkodsara@amd.com, bharata@amd.com,
ankur.a.arora@oracle.com, shivankg@amd.com
Subject: Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
Date: Thu, 9 Apr 2026 11:00:33 +0200
Message-ID: <4dd26573-85cc-446a-b2b7-2aeab8aa2417@kernel.org>
In-Reply-To: <fcc68286-d2ae-4e51-b4b2-886af115ad7c@amd.com>
On 4/9/26 10:55, Salunke, Hrushikesh wrote:
>
> On 08-04-2026 21:02, Andrew Morton wrote:
>
>> On Wed, 8 Apr 2026 16:14:03 +0530 "Salunke, Hrushikesh" <hsalunke@amd.com> wrote:
>>
>>> kernel_init_pages() runs inside the allocator (post_alloc_hook and
>>> __free_pages_prepare), so it inherits whatever context the caller is in.
>>> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
>>> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
>>> page allocation happens while the PTE lock and RCU read lock are held,
>>> making the cond_resched() in the clearing loop illegal:
>>>
>>> [ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
>>> [ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
>>> [ 1997.353572] preempt_count: 1, expected: 0
>>> [ 1997.353706] RCU nest depth: 1, expected: 0
>>> [ 1997.353837] 3 locks held by bash/19725:
>>> [ 1997.353839] #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
>>> [ 1997.353850] #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
>>> [ 1997.353855] #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
>>> [ 1997.353868] Call Trace:
>>> [ 1997.353870] <TASK>
>>> [ 1997.353873] dump_stack_lvl+0x91/0xb0
>>> [ 1997.353877] __might_resched+0x15f/0x290
>>> [ 1997.353882] kernel_init_pages+0x4b/0xa0
>>> [ 1997.353886] get_page_from_freelist+0x406/0x1e60
>>> [ 1997.353895] __alloc_frozen_pages_noprof+0x1d8/0x1730
>>> [ 1997.353912] alloc_pages_mpol+0xa4/0x190
>>> [ 1997.353917] alloc_pages_noprof+0x59/0xd0
>>> [ 1997.353919] get_free_pages_noprof+0x11/0x40
>>> [ 1997.353921] __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
>>> [ 1997.353923] __zap_vma_range+0x1bbd/0x1f40
>>> [ 1997.353931] unmap_vmas+0xd9/0x1d0
>>> [ 1997.353934] exit_mmap+0x10a/0x430
>>> [ 1997.353943] __mmput+0x3d/0x130
>>> [ 1997.353947] do_exit+0x2a7/0xae0
>> tlb_next_batch() is (fortunately) using GFP_NOWAIT. Perhaps you can
>> alter your patch to not call the cond_resched() if caller is attempting
>> an atomic allocation.
>
>
> Thanks Vlastimil, David, Andrew, and Raghu for the reviews.
>
> After looking into this more, I think adding cond_resched() here was
> overkill. I agree that dropping cond_resched() and
> PROCESS_PAGES_NON_PREEMPT_BATCH entirely and just calling clear_pages()
> is the right approach. There's no case where cond_resched() in
> kernel_init_pages() is both necessary and safe:
>
> - It's unsafe in atomic context, as the BUG shows (tlb_next_batch()
>   allocates under PTE lock + RCU read lock via GFP_NOWAIT).
> - It's unnecessary for common allocations (order-0, mTHP, 2 MiB) which
>   clear in well under 1 ms.
> - For 1 GiB hugepages, kernel_init_pages() only runs during the
>   initial admin-triggered allocation. When processes later fault on
>   those pages, clearing goes through folio_zero_user() ->
>   clear_contig_highpages(), not kernel_init_pages().
>
> So rather than guarding cond_resched() with GFP flags (as Andrew
> suggested), I'll remove it entirely in v2. That keeps things simple and
> preserves the scheduling characteristics of the original code, while
> still gaining the batch clearing performance benefit.
>
> Regarding the 512 MiB arm64 case that David mentioned: is the stall from
> clearing that much memory without cond_resched() under PREEMPT_NONE
> acceptable, or should it be handled differently?
That stall would already happen today, because the current code has no
cond_resched() either. So nothing to worry about, I guess.
>
> I can introduce clear_highpages_kasan_tagged() / clear_highpages()
> helpers, or keep v2 minimal with the logic inline in
> kernel_init_pages(). Any preference?
I'd prefer not sprinkling IS_ENABLED(CONFIG_HIGHMEM) around and simply
calling a clear_highpages_kasan_tagged() from kernel_init_pages().
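
To illustrate the shape of that, here is a minimal userspace sketch, not
kernel code. Every model_* name is hypothetical, the model_highmem flag
merely stands in for CONFIG_HIGHMEM, and the kmap and KASAN tag handling
the real helper would need are left out. It only shows the HIGHMEM
decision living inside the helper, so that kernel_init_pages() is a
single call without cond_resched():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

/* Stand-in for CONFIG_HIGHMEM; set to true to exercise the per-page path. */
static const bool model_highmem = false;

/* Counts clearing calls, only to make the batching visible. */
static unsigned int model_clear_calls;

/* Models clear_pages(): clear npages contiguous pages in one call. */
static void model_clear_pages(void *addr, unsigned int npages)
{
	memset(addr, 0, (size_t)npages * MODEL_PAGE_SIZE);
	model_clear_calls++;
}

/*
 * Models the proposed clear_highpages_kasan_tagged(): the HIGHMEM decision
 * is made here, so the caller needs no IS_ENABLED() check of its own.
 */
static void model_clear_highpages_kasan_tagged(unsigned char *base,
					       unsigned int npages)
{
	unsigned int i;

	assert(base != NULL);

	if (!model_highmem) {
		/* Everything is directly addressable: one batched clear. */
		model_clear_pages(base, npages);
		return;
	}

	/* With HIGHMEM, pages are mapped one by one, so clear one at a time. */
	for (i = 0; i < npages; i++)
		model_clear_pages(base + (size_t)i * MODEL_PAGE_SIZE, 1);
}

/* Models kernel_init_pages(): no cond_resched(), just the helper call. */
static void model_kernel_init_pages(unsigned char *base, unsigned int npages)
{
	model_clear_highpages_kasan_tagged(base, npages);
}
```

With model_highmem left false, clearing four pages through
model_kernel_init_pages() issues a single model_clear_pages() call;
with it set to true, the same request issues four.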
--
Cheers,
David