From: "Salunke, Hrushikesh" <hsalunke@amd.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>, <surenb@google.com>,
<mhocko@suse.com>, <jackmanb@google.com>, <hannes@cmpxchg.org>,
<ziy@nvidia.com>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>, <rkodsara@amd.com>,
<bharata@amd.com>, <ankur.a.arora@oracle.com>, <shivankg@amd.com>,
David Hildenbrand <david@redhat.com>, <hsalunke@amd.com>
Subject: Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
Date: Thu, 9 Apr 2026 14:25:39 +0530
Message-ID: <fcc68286-d2ae-4e51-b4b2-886af115ad7c@amd.com>
In-Reply-To: <20260408083229.45d1a083f17484d3b2678855@linux-foundation.org>
On 08-04-2026 21:02, Andrew Morton wrote:
> On Wed, 8 Apr 2026 16:14:03 +0530 "Salunke, Hrushikesh" <hsalunke@amd.com> wrote:
>
>> kernel_init_pages() runs inside the allocator (post_alloc_hook and
>> __free_pages_prepare), so it inherits whatever context the caller is in.
>> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
>> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
>> page allocation happens while the PTE lock and RCU read lock are held,
>> making the cond_resched() in the clearing loop illegal:
>>
>> [ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
>> [ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
>> [ 1997.353572] preempt_count: 1, expected: 0
>> [ 1997.353706] RCU nest depth: 1, expected: 0
>> [ 1997.353837] 3 locks held by bash/19725:
>> [ 1997.353839] #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
>> [ 1997.353850] #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
>> [ 1997.353855] #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
>> [ 1997.353868] Call Trace:
>> [ 1997.353870] <TASK>
>> [ 1997.353873] dump_stack_lvl+0x91/0xb0
>> [ 1997.353877] __might_resched+0x15f/0x290
>> [ 1997.353882] kernel_init_pages+0x4b/0xa0
>> [ 1997.353886] get_page_from_freelist+0x406/0x1e60
>> [ 1997.353895] __alloc_frozen_pages_noprof+0x1d8/0x1730
>> [ 1997.353912] alloc_pages_mpol+0xa4/0x190
>> [ 1997.353917] alloc_pages_noprof+0x59/0xd0
>> [ 1997.353919] get_free_pages_noprof+0x11/0x40
>> [ 1997.353921] __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
>> [ 1997.353923] __zap_vma_range+0x1bbd/0x1f40
>> [ 1997.353931] unmap_vmas+0xd9/0x1d0
>> [ 1997.353934] exit_mmap+0x10a/0x430
>> [ 1997.353943] __mmput+0x3d/0x130
>> [ 1997.353947] do_exit+0x2a7/0xae0
> tlb_next_batch() is (fortunately) using GFP_NOWAIT. Perhaps you can
> alter your patch to not call the cond_resched() if caller is attempting
> an atomic allocation.

Thanks Vlastimil, David, Andrew, and Raghu for the reviews.

After looking into this more, I think adding cond_resched() here was
overkill. I agree that dropping cond_resched() and
PROCESS_PAGES_NON_PREEMPT_BATCH entirely and just calling clear_pages()
is the right approach. There's no case where cond_resched() in
kernel_init_pages() is both necessary and safe:

- It's unsafe in atomic context, as the BUG shows (tlb_next_batch()
  allocates under PTE lock + RCU read lock via GFP_NOWAIT).
- It's unnecessary for common allocations (order-0, mTHP, 2MB) which
  clear in well under 1ms.
- For 1 GiB hugepages, kernel_init_pages() only runs during the
  initial admin-triggered allocation. When processes later fault on
  those pages, clearing goes through folio_zero_user() ->
  clear_contig_highpages(), not kernel_init_pages().

So rather than guarding cond_resched() with GFP flags (as Andrew
suggested), I'll remove it entirely in v2. That keeps things simple and
preserves the scheduling characteristics of the original code, while
still getting the performance benefit of batch clearing.
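
To make the two options concrete, here is a minimal userspace sketch,
not the actual patch. model_clear_pages(), model_kernel_init_pages(),
MODEL_PAGE_SIZE, MODEL_BATCH and the may_sleep flag are all hypothetical
stand-ins: in the kernel the clearing would be done by clear_pages(),
and a GFP-based guard would presumably be derived from the allocation
flags (for example via gfpflags_allow_blocking()). The v2 plan
corresponds to the may_sleep == false path only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for this model */
#define MODEL_BATCH     8u      /* hypothetical number of pages per batch */

/* Counts the points where a real cond_resched() would have been called. */
static unsigned int model_resched_points;

/* Stand-in for clear_pages(): zero npages contiguous pages in one call. */
static void model_clear_pages(void *addr, size_t npages)
{
	memset(addr, 0, npages * MODEL_PAGE_SIZE);
}

/*
 * Model of the clearing step.
 *
 * may_sleep == false: the v2 plan. The whole range is cleared in a single
 * call and no rescheduling point is reached.
 *
 * may_sleep == true: the GFP-guarded alternative. The range is cleared in
 * batches, with a simulated rescheduling point between batches.
 */
static void model_kernel_init_pages(void *addr, size_t npages, bool may_sleep)
{
	unsigned char *p = addr;

	if (!may_sleep) {
		model_clear_pages(p, npages);
		return;
	}

	while (npages) {
		size_t n = npages < MODEL_BATCH ? npages : MODEL_BATCH;

		model_clear_pages(p, n);
		p += n * MODEL_PAGE_SIZE;
		npages -= n;
		if (npages)
			model_resched_points++; /* where cond_resched() would sit */
	}
}
```

The guarded variant needs the allocation context plumbed down to the
clearing loop, which is the extra complexity that removing
cond_resched() avoids.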

Regarding the 512 MiB arm64 case that David mentioned: is the stall
from clearing that much memory without cond_resched() under
PREEMPT_NONE acceptable, or should it be handled differently?
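
For a rough sense of scale, the stall can be estimated as size divided
by clearing bandwidth. The sketch below assumes 10 GiB/s, which is an
illustrative figure and not a measurement from this thread; real
bandwidth depends on the CPU, the clearing primitive and cache state.
clear_time_us() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

#define KIB 1024ull
#define MIB (1024ull * KIB)
#define GIB (1024ull * MIB)

/* Assumed clearing bandwidth: illustrative only, not measured. */
#define ASSUMED_CLEAR_BW (10ull * GIB)

/*
 * Time in microseconds to clear `bytes` at `bytes_per_sec`,
 * using integer arithmetic (rounds down).
 */
static uint64_t clear_time_us(uint64_t bytes, uint64_t bytes_per_sec)
{
	return bytes * 1000000ull / bytes_per_sec;
}
```

Under that assumption, clear_time_us(2 * MIB, ASSUMED_CLEAR_BW) is
about 195 us, clear_time_us(512 * MIB, ASSUMED_CLEAR_BW) is 50000 us
(50 ms), and clear_time_us(1 * GIB, ASSUMED_CLEAR_BW) is 100000 us
(100 ms).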

I can introduce clear_highpages_kasan_tagged() / clear_highpages()
helpers, or keep v2 minimal with the logic inline in
kernel_init_pages(). Any preference?

I'll test v2 across preempt=none,voluntary,full,auto with
init_on_alloc=1 and CONFIG_DEBUG_ATOMIC_SLEEP=y before sending.

Regards,
Hrushikesh