linux-mm.kvack.org archive mirror
* [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set
@ 2026-01-21 19:10 Waiman Long
  2026-01-21 19:43 ` Andrew Morton
  0 siblings, 1 reply; 8+ messages in thread
From: Waiman Long @ 2026-01-21 19:10 UTC
  To: Mike Rapoport, Andrew Morton, Sebastian Andrzej Siewior,
	Clark Williams, Steven Rostedt
  Cc: linux-mm, linux-kernel, linux-rt-devel, Wei Yang,
	David Hildenbrand, Waiman Long

Commit 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk()
in deferred_grow_zone()") made deferred_grow_zone() call
deferred_init_memmap_chunk() within a pgdat_resize_lock() critical
section with irqs disabled. That commit added an irqs_disabled() check
in deferred_init_memmap_chunk() to avoid calling cond_resched() in that
case. For a PREEMPT_RT kernel build, however, spin_lock_irqsave() does
not disable interrupts but instead calls rcu_read_lock(). This leads to
the following bug report.

  BUG: sleeping function called from invalid context at mm/mm_init.c:2091
  in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  RCU nest depth: 1, expected: 0
  3 locks held by swapper/0/1:
   #0: ffff80008471b7a0 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x28/0x40
   #1: ffff003bdfffef48 (&pgdat->node_size_lock){+.+.}-{3:3}, at: deferred_grow_zone+0x140/0x278
   #2: ffff800084acf600 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1b4/0x408
  CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.19.0-rc6-test #1 PREEMPT_{RT,(full)}
  Tainted: [W]=WARN
  Call trace:
   show_stack+0x20/0x38 (C)
   dump_stack_lvl+0xdc/0xf8
   dump_stack+0x1c/0x28
   __might_resched+0x384/0x530
   deferred_init_memmap_chunk+0x560/0x688
   deferred_grow_zone+0x190/0x278
   _deferred_grow_zone+0x18/0x30
   get_page_from_freelist+0x780/0xf78
   __alloc_frozen_pages_noprof+0x1dc/0x348
   alloc_slab_page+0x30/0x110
   allocate_slab+0x98/0x2a0
   new_slab+0x4c/0x80
   ___slab_alloc+0x5a4/0x770
   __slab_alloc.constprop.0+0x88/0x1e0
   __kmalloc_node_noprof+0x2c0/0x598
   __sdt_alloc+0x3b8/0x728
   build_sched_domains+0xe0/0x1260
   sched_init_domains+0x14c/0x1c8
   sched_init_smp+0x9c/0x1d0
   kernel_init_freeable+0x218/0x358
   kernel_init+0x28/0x208
   ret_from_fork+0x10/0x20

Fix it by checking rcu_preempt_depth() as well to prevent calling
cond_resched(). Note that CONFIG_PREEMPT_RCU should always be enabled
in a PREEMPT_RT kernel.
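
For reference, pgdat_resize_lock() is a thin wrapper around
spin_lock_irqsave() on the node_size_lock, roughly as below (a
paraphrase of include/linux/memory_hotplug.h, not the exact code):

	static inline void pgdat_resize_lock(struct pglist_data *pgdat,
					     unsigned long *flags)
	{
		/* non-RT: disables irqs; RT: sleeping lock + rcu_read_lock() */
		spin_lock_irqsave(&pgdat->node_size_lock, *flags);
	}

On PREEMPT_RT, spin_lock_irqsave() maps to a sleeping rt_spin_lock()
that enters an RCU read-side critical section instead of disabling
interrupts, so irqs_disabled() remains false inside the critical
section.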

Fixes: 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/mm_init.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index fc2a6f1e518f..4d6b6dac6f58 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2085,7 +2085,12 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 
 			spfn = chunk_end;
 
-			if (irqs_disabled())
+			/*
+			 * pgdat_resize_lock() only disables irqs in non-RT
+			 * kernels but calls rcu_read_lock() in a PREEMPT_RT
+			 * kernel.
+			 */
+			if (irqs_disabled() || rcu_preempt_depth())
 				touch_nmi_watchdog();
 			else
 				cond_resched();
-- 
2.52.0




* Re: [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set
  2026-01-21 19:10 [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set Waiman Long
@ 2026-01-21 19:43 ` Andrew Morton
  2026-01-21 20:07   ` Waiman Long
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2026-01-21 19:43 UTC
  To: Waiman Long
  Cc: Mike Rapoport, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, linux-mm, linux-kernel, linux-rt-devel, Wei Yang,
	David Hildenbrand, Paul E. McKenney

On Wed, 21 Jan 2026 14:10:36 -0500 Waiman Long <longman@redhat.com> wrote:

> Commit 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk()
> in deferred_grow_zone()") made deferred_grow_zone() call
> deferred_init_memmap_chunk() within a pgdat_resize_lock() critical
> section with irqs disabled. That commit added an irqs_disabled() check
> in deferred_init_memmap_chunk() to avoid calling cond_resched() in that
> case. For a PREEMPT_RT kernel build, however, spin_lock_irqsave() does
> not disable interrupts but instead calls rcu_read_lock(). This leads to
> the following bug report.
> 
>   BUG: sleeping function called from invalid context at mm/mm_init.c:2091
>   in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
>   preempt_count: 0, expected: 0
>   RCU nest depth: 1, expected: 0
>   3 locks held by swapper/0/1:
>    #0: ffff80008471b7a0 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x28/0x40
>    #1: ffff003bdfffef48 (&pgdat->node_size_lock){+.+.}-{3:3}, at: deferred_grow_zone+0x140/0x278
>    #2: ffff800084acf600 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1b4/0x408
>   CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.19.0-rc6-test #1 PREEMPT_{RT,(full)}
>   Tainted: [W]=WARN
>   Call trace:
>    show_stack+0x20/0x38 (C)
>    dump_stack_lvl+0xdc/0xf8
>    dump_stack+0x1c/0x28
>    __might_resched+0x384/0x530
>    deferred_init_memmap_chunk+0x560/0x688
>    deferred_grow_zone+0x190/0x278
>    _deferred_grow_zone+0x18/0x30
>    get_page_from_freelist+0x780/0xf78
>    __alloc_frozen_pages_noprof+0x1dc/0x348
>    alloc_slab_page+0x30/0x110
>    allocate_slab+0x98/0x2a0
>    new_slab+0x4c/0x80
>    ___slab_alloc+0x5a4/0x770
>    __slab_alloc.constprop.0+0x88/0x1e0
>    __kmalloc_node_noprof+0x2c0/0x598
>    __sdt_alloc+0x3b8/0x728
>    build_sched_domains+0xe0/0x1260
>    sched_init_domains+0x14c/0x1c8
>    sched_init_smp+0x9c/0x1d0
>    kernel_init_freeable+0x218/0x358
>    kernel_init+0x28/0x208
>    ret_from_fork+0x10/0x20
> 
> Fix it by checking rcu_preempt_depth() as well to prevent calling
> cond_resched(). Note that CONFIG_PREEMPT_RCU should always be enabled
> in a PREEMPT_RT kernel.
>
> ...
> 
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2085,7 +2085,12 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  
>  			spfn = chunk_end;
>  
> -			if (irqs_disabled())
> +			/*
> +			 * pgdat_resize_lock() only disables irqs in non-RT
> +			 * kernels but calls rcu_read_lock() in a PREEMPT_RT
> +			 * kernel.
> +			 */
> +			if (irqs_disabled() || rcu_preempt_depth())
>  				touch_nmi_watchdog();

rcu_preempt_depth() seems a fairly internal low-level thing - it's
rarely used.

Is there a more official way of detecting this condition?  Maybe even
#ifdef CONFIG_PREEMPT_RCU?



* Re: [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set
  2026-01-21 19:43 ` Andrew Morton
@ 2026-01-21 20:07   ` Waiman Long
  2026-01-21 21:27     ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Waiman Long @ 2026-01-21 20:07 UTC
  To: Andrew Morton, Sebastian Andrzej Siewior
  Cc: Mike Rapoport, Clark Williams, Steven Rostedt, linux-mm,
	linux-kernel, linux-rt-devel, Wei Yang, David Hildenbrand,
	Paul E. McKenney


On 1/21/26 2:43 PM, Andrew Morton wrote:
> On Wed, 21 Jan 2026 14:10:36 -0500 Waiman Long<longman@redhat.com> wrote:
>
>> Commit 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk()
>> in deferred_grow_zone()") made deferred_grow_zone() call
>> deferred_init_memmap_chunk() within a pgdat_resize_lock() critical
>> section with irqs disabled. That commit added an irqs_disabled() check
>> in deferred_init_memmap_chunk() to avoid calling cond_resched() in that
>> case. For a PREEMPT_RT kernel build, however, spin_lock_irqsave() does
>> not disable interrupts but instead calls rcu_read_lock(). This leads to
>> the following bug report.
>>
>>    BUG: sleeping function called from invalid context at mm/mm_init.c:2091
>>    in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
>>    preempt_count: 0, expected: 0
>>    RCU nest depth: 1, expected: 0
>>    3 locks held by swapper/0/1:
>>     #0: ffff80008471b7a0 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x28/0x40
>>     #1: ffff003bdfffef48 (&pgdat->node_size_lock){+.+.}-{3:3}, at: deferred_grow_zone+0x140/0x278
>>     #2: ffff800084acf600 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1b4/0x408
>>    CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.19.0-rc6-test #1 PREEMPT_{RT,(full)}
>>    Tainted: [W]=WARN
>>    Call trace:
>>     show_stack+0x20/0x38 (C)
>>     dump_stack_lvl+0xdc/0xf8
>>     dump_stack+0x1c/0x28
>>     __might_resched+0x384/0x530
>>     deferred_init_memmap_chunk+0x560/0x688
>>     deferred_grow_zone+0x190/0x278
>>     _deferred_grow_zone+0x18/0x30
>>     get_page_from_freelist+0x780/0xf78
>>     __alloc_frozen_pages_noprof+0x1dc/0x348
>>     alloc_slab_page+0x30/0x110
>>     allocate_slab+0x98/0x2a0
>>     new_slab+0x4c/0x80
>>     ___slab_alloc+0x5a4/0x770
>>     __slab_alloc.constprop.0+0x88/0x1e0
>>     __kmalloc_node_noprof+0x2c0/0x598
>>     __sdt_alloc+0x3b8/0x728
>>     build_sched_domains+0xe0/0x1260
>>     sched_init_domains+0x14c/0x1c8
>>     sched_init_smp+0x9c/0x1d0
>>     kernel_init_freeable+0x218/0x358
>>     kernel_init+0x28/0x208
>>     ret_from_fork+0x10/0x20
>>
>> Fix it by checking rcu_preempt_depth() as well to prevent calling
>> cond_resched(). Note that CONFIG_PREEMPT_RCU should always be enabled
>> in a PREEMPT_RT kernel.
>>
>> ...
>>
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -2085,7 +2085,12 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>>   
>>   			spfn = chunk_end;
>>   
>> -			if (irqs_disabled())
>> +			/*
>> +			 * pgdat_resize_lock() only disables irqs in non-RT
>> +			 * kernels but calls rcu_read_lock() in a PREEMPT_RT
>> +			 * kernel.
>> +			 */
>> +			if (irqs_disabled() || rcu_preempt_depth())
>>   				touch_nmi_watchdog();
> rcu_preempt_depth() seems a fairly internal low-level thing - it's
> rarely used.
That is true. Besides the scheduler, the workqueue code also uses 
rcu_preempt_depth(). The API is declared in "include/linux/rcupdate.h", 
which is included directly or indirectly by many kernel files. So even 
though it is rarely used, it is still a public API.

>
> Is there a more official way of detecting this condition?  Maybe even
> #ifdef CONFIG_PREEMPT_RCU?
>
I am not aware of a more official way of detecting this. Maybe Sebastian 
has some ideas. rcu_preempt_depth() is defined whether 
CONFIG_PREEMPT_RCU is set or not, so we don't need an "#ifdef 
CONFIG_PREEMPT_RCU". Maybe I should explicitly include 
"include/linux/rcupdate.h" in mm/mm_init.c just to be sure.

CONFIG_PREEMPT_RCU defaults to on if PREEMPT_RT is set. With 
!CONFIG_PREEMPT_RCU, rcu_preempt_depth() is hard-coded to 0 and the 
check will be optimized out.
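
For reference, a minimal sketch of the two variants (paraphrased from
include/linux/rcupdate.h; the exact form may differ by kernel version):

	#ifdef CONFIG_PREEMPT_RCU
	/* depth of rcu_read_lock() nesting tracked in task_struct */
	#define rcu_preempt_depth() READ_ONCE(current->rcu_read_lock_nesting)
	#else
	/* no preemptible RCU: depth is a compile-time 0 */
	static inline int rcu_preempt_depth(void)
	{
		return 0;
	}
	#endif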

Cheers,
Longman



* Re: [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set
  2026-01-21 20:07   ` Waiman Long
@ 2026-01-21 21:27     ` Paul E. McKenney
  2026-01-22  7:57       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2026-01-21 21:27 UTC
  To: Waiman Long
  Cc: Andrew Morton, Sebastian Andrzej Siewior, Mike Rapoport,
	Clark Williams, Steven Rostedt, linux-mm, linux-kernel,
	linux-rt-devel, Wei Yang, David Hildenbrand

On Wed, Jan 21, 2026 at 03:07:46PM -0500, Waiman Long wrote:
> On 1/21/26 2:43 PM, Andrew Morton wrote:
> > On Wed, 21 Jan 2026 14:10:36 -0500 Waiman Long<longman@redhat.com> wrote:
> > 
> > > Commit 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk()
> > > in deferred_grow_zone()") made deferred_grow_zone() call
> > > deferred_init_memmap_chunk() within a pgdat_resize_lock() critical
> > > section with irqs disabled. That commit added an irqs_disabled() check
> > > in deferred_init_memmap_chunk() to avoid calling cond_resched() in that
> > > case. For a PREEMPT_RT kernel build, however, spin_lock_irqsave() does
> > > not disable interrupts but instead calls rcu_read_lock(). This leads to
> > > the following bug report.
> > > 
> > >    BUG: sleeping function called from invalid context at mm/mm_init.c:2091
> > >    in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
> > >    preempt_count: 0, expected: 0
> > >    RCU nest depth: 1, expected: 0
> > >    3 locks held by swapper/0/1:
> > >     #0: ffff80008471b7a0 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x28/0x40
> > >     #1: ffff003bdfffef48 (&pgdat->node_size_lock){+.+.}-{3:3}, at: deferred_grow_zone+0x140/0x278
> > >     #2: ffff800084acf600 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1b4/0x408
> > >    CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.19.0-rc6-test #1 PREEMPT_{RT,(full)}
> > >    Tainted: [W]=WARN
> > >    Call trace:
> > >     show_stack+0x20/0x38 (C)
> > >     dump_stack_lvl+0xdc/0xf8
> > >     dump_stack+0x1c/0x28
> > >     __might_resched+0x384/0x530
> > >     deferred_init_memmap_chunk+0x560/0x688
> > >     deferred_grow_zone+0x190/0x278
> > >     _deferred_grow_zone+0x18/0x30
> > >     get_page_from_freelist+0x780/0xf78
> > >     __alloc_frozen_pages_noprof+0x1dc/0x348
> > >     alloc_slab_page+0x30/0x110
> > >     allocate_slab+0x98/0x2a0
> > >     new_slab+0x4c/0x80
> > >     ___slab_alloc+0x5a4/0x770
> > >     __slab_alloc.constprop.0+0x88/0x1e0
> > >     __kmalloc_node_noprof+0x2c0/0x598
> > >     __sdt_alloc+0x3b8/0x728
> > >     build_sched_domains+0xe0/0x1260
> > >     sched_init_domains+0x14c/0x1c8
> > >     sched_init_smp+0x9c/0x1d0
> > >     kernel_init_freeable+0x218/0x358
> > >     kernel_init+0x28/0x208
> > >     ret_from_fork+0x10/0x20
> > > 
> > > Fix it by checking rcu_preempt_depth() as well to prevent calling
> > > cond_resched(). Note that CONFIG_PREEMPT_RCU should always be enabled
> > > in a PREEMPT_RT kernel.
> > > 
> > > ...
> > > 
> > > --- a/mm/mm_init.c
> > > +++ b/mm/mm_init.c
> > > @@ -2085,7 +2085,12 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > >   			spfn = chunk_end;
> > > -			if (irqs_disabled())
> > > +			/*
> > > +			 * pgdat_resize_lock() only disables irqs in non-RT
> > > +			 * kernels but calls rcu_read_lock() in a PREEMPT_RT
> > > +			 * kernel.
> > > +			 */
> > > +			if (irqs_disabled() || rcu_preempt_depth())
> > >   				touch_nmi_watchdog();
> > rcu_preempt_depth() seems a fairly internal low-level thing - it's
> > rarely used.
> That is true. Besides the scheduler, the workqueue code also uses
> rcu_preempt_depth(). The API is declared in "include/linux/rcupdate.h",
> which is included directly or indirectly by many kernel files. So even
> though it is rarely used, it is still a public API.

It is a bit tricky, for example, given a kernel built with both
CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT_DYNAMIC=y, it will never
invoke touch_nmi_watchdog(), even if it really is in an RCU read-side
critical section.  This is because it was intended for lockdep-like use,
where (for example) you don't want to complain about sleeping in an RCU
read-side critical section unless you are 100% sure that you are in fact
in an RCU read-side critical section.

Maybe something like this?

	if (irqs_disabled() || !IS_ENABLED(CONFIG_PREEMPT_RCU) || rcu_preempt_depth())
		touch_nmi_watchdog();

This would *always* invoke touch_nmi_watchdog() for such kernels, which
might or might not be OK.

I freely confess that I am not sure which of these is appropriate in
this setting.

							Thanx, Paul

> > Is there a more official way of detecting this condition?  Maybe even
> > #ifdef CONFIG_PREEMPT_RCU?
> > 
> I am not aware of a more official way of detecting this. Maybe Sebastian
> has some ideas. rcu_preempt_depth() is defined whether CONFIG_PREEMPT_RCU
> is set or not, so we don't need an "#ifdef CONFIG_PREEMPT_RCU". Maybe I
> should explicitly include "include/linux/rcupdate.h" in mm/mm_init.c just
> to be sure.
> 
> CONFIG_PREEMPT_RCU defaults to on if PREEMPT_RT is set. With
> !CONFIG_PREEMPT_RCU, rcu_preempt_depth() is hard-coded to 0 and the
> check will be optimized out.
> 
> Cheers,
> Longman



* Re: [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set
  2026-01-21 21:27     ` Paul E. McKenney
@ 2026-01-22  7:57       ` Sebastian Andrzej Siewior
  2026-01-22  9:47         ` Mike Rapoport
                           ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-01-22  7:57 UTC
  To: Paul E. McKenney
  Cc: Waiman Long, Andrew Morton, Mike Rapoport, Clark Williams,
	Steven Rostedt, linux-mm, linux-kernel, linux-rt-devel, Wei Yang,
	David Hildenbrand

On 2026-01-21 13:27:32 [-0800], Paul E. McKenney wrote:
> > > > --- a/mm/mm_init.c
> > > > +++ b/mm/mm_init.c
> > > > @@ -2085,7 +2085,12 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > > >   			spfn = chunk_end;
> > > > -			if (irqs_disabled())
> > > > +			/*
> > > > +			 * pgdat_resize_lock() only disables irqs in non-RT
> > > > +			 * kernels but calls rcu_read_lock() in a PREEMPT_RT
> > > > +			 * kernel.
> > > > +			 */
> > > > +			if (irqs_disabled() || rcu_preempt_depth())
> > > >   				touch_nmi_watchdog();
> > > rcu_preempt_depth() seems a fairly internal low-level thing - it's
> > > rarely used.
If you acquire a lock from time to time and you pass a bool to let the
function below know whether scheduling is fine or not, then it is
obvious. If you instead choose to check for symptoms of an acquired
lock, then you also have to use the rarely used functions ;)

> > That is true. Besides the scheduler, the workqueue code also uses
> > rcu_preempt_depth(). The API is declared in "include/linux/rcupdate.h",
> > which is included directly or indirectly by many kernel files. So even
> > though it is rarely used, it is still a public API.
> 
> It is a bit tricky, for example, given a kernel built with both
> CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT_DYNAMIC=y, it will never
> invoke touch_nmi_watchdog(), even if it really is in an RCU read-side
> critical section.  This is because it was intended for lockdep-like use,
> where (for example) you don't want to complain about sleeping in an RCU
> read-side critical section unless you are 100% sure that you are in fact
> in an RCU read-side critical section.
> 
> Maybe something like this?
> 
> 	if (irqs_disabled() || !IS_ENABLED(CONFIG_PREEMPT_RCU) || rcu_preempt_depth())
> 		touch_nmi_watchdog();

I don't understand the PREEMPT_NONE+DYNAMIC reasoning. irqs_disabled()
should not be affected by this and rcu_preempt_depth() will be 0 for
!CONFIG_PREEMPT_RCU so I don't think this is required. 

> This would *always* invoke touch_nmi_watchdog() for such kernels, which
> might or might not be OK.
> 
> I freely confess that I am not sure which of these is appropriate in
> this setting.

What about a more straight forward and obvious approach?

diff --git a/mm/mm_init.c b/mm/mm_init.c
index fc2a6f1e518f1..0b283fd48b282 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2059,7 +2059,7 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
  */
 static unsigned long __init
 deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
-			   struct zone *zone)
+			   struct zone *zone, bool may_schedule)
 {
 	int nid = zone_to_nid(zone);
 	unsigned long nr_pages = 0;
@@ -2085,10 +2085,10 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 
 			spfn = chunk_end;
 
-			if (irqs_disabled())
-				touch_nmi_watchdog();
-			else
+			if (may_schedule)
 				cond_resched();
+			else
+				touch_nmi_watchdog();
 		}
 	}
 
@@ -2101,7 +2101,7 @@ deferred_init_memmap_job(unsigned long start_pfn, unsigned long end_pfn,
 {
 	struct zone *zone = arg;
 
-	deferred_init_memmap_chunk(start_pfn, end_pfn, zone);
+	deferred_init_memmap_chunk(start_pfn, end_pfn, zone, true);
 }
 
 static unsigned int __init
@@ -2216,7 +2216,7 @@ bool __init deferred_grow_zone(struct zone *zone, unsigned int order)
 	for (spfn = first_deferred_pfn, epfn = SECTION_ALIGN_UP(spfn + 1);
 	     nr_pages < nr_pages_needed && spfn < zone_end_pfn(zone);
 	     spfn = epfn, epfn += PAGES_PER_SECTION) {
-		nr_pages += deferred_init_memmap_chunk(spfn, epfn, zone);
+		nr_pages += deferred_init_memmap_chunk(spfn, epfn, zone, false);
 	}
 
 	/*

Wouldn't this work?

> 							Thanx, Paul

Sebastian



* Re: [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set
  2026-01-22  7:57       ` Sebastian Andrzej Siewior
@ 2026-01-22  9:47         ` Mike Rapoport
  2026-01-22 17:17         ` Paul E. McKenney
  2026-01-22 17:59         ` Waiman Long
  2 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2026-01-22  9:47 UTC
  To: Sebastian Andrzej Siewior
  Cc: Paul E. McKenney, Waiman Long, Andrew Morton, Clark Williams,
	Steven Rostedt, linux-mm, linux-kernel, linux-rt-devel, Wei Yang,
	David Hildenbrand

On Thu, Jan 22, 2026 at 08:57:47AM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-01-21 13:27:32 [-0800], Paul E. McKenney wrote:
> > 
> > It is a bit tricky, for example, given a kernel built with both
> > CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT_DYNAMIC=y, it will never
> > invoke touch_nmi_watchdog(), even if it really is in an RCU read-side
> > critical section.  This is because it was intended for lockdep-like use,
> > where (for example) you don't want to complain about sleeping in an RCU
> > read-side critical section unless you are 100% sure that you are in fact
> > in an RCU read-side critical section.
> > 
> > Maybe something like this?
> > 
> > 	if (irqs_disabled() || !IS_ENABLED(CONFIG_PREEMPT_RCU) || rcu_preempt_depth())
> > 		touch_nmi_watchdog();
> 
> I don't understand the PREEMPT_NONE+DYNAMIC reasoning. irqs_disabled()
> should not be affected by this and rcu_preempt_depth() will be 0 for
> !CONFIG_PREEMPT_RCU so I don't think this is required. 
> 
> > This would *always* invoke touch_nmi_watchdog() for such kernels, which
> > might or might not be OK.
> > 
> > I freely confess that I am not sure which of these is appropriate in
> > this setting.
> 
> What about a more straight forward and obvious approach?
> 
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index fc2a6f1e518f1..0b283fd48b282 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2059,7 +2059,7 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
>   */
>  static unsigned long __init
>  deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> -			   struct zone *zone)
> +			   struct zone *zone, bool may_schedule)
>  {
>  	int nid = zone_to_nid(zone);
>  	unsigned long nr_pages = 0;
> @@ -2085,10 +2085,10 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  
>  			spfn = chunk_end;
>  
> -			if (irqs_disabled())
> -				touch_nmi_watchdog();
> -			else
> +			if (may_schedule)
>  				cond_resched();
> +			else
> +				touch_nmi_watchdog();
>  		}
>  	}
>  
> @@ -2101,7 +2101,7 @@ deferred_init_memmap_job(unsigned long start_pfn, unsigned long end_pfn,
>  {
>  	struct zone *zone = arg;
>  
> -	deferred_init_memmap_chunk(start_pfn, end_pfn, zone);
> +	deferred_init_memmap_chunk(start_pfn, end_pfn, zone, true);
>  }
>  
>  static unsigned int __init
> @@ -2216,7 +2216,7 @@ bool __init deferred_grow_zone(struct zone *zone, unsigned int order)
>  	for (spfn = first_deferred_pfn, epfn = SECTION_ALIGN_UP(spfn + 1);
>  	     nr_pages < nr_pages_needed && spfn < zone_end_pfn(zone);
>  	     spfn = epfn, epfn += PAGES_PER_SECTION) {
> -		nr_pages += deferred_init_memmap_chunk(spfn, epfn, zone);
> +		nr_pages += deferred_init_memmap_chunk(spfn, epfn, zone, false);
>  	}
>  
>  	/*
> 
> Wouldn't this work?

Yes, it will. And I think this is less fragile and clearer.
 
> Sebastian
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set
  2026-01-22  7:57       ` Sebastian Andrzej Siewior
  2026-01-22  9:47         ` Mike Rapoport
@ 2026-01-22 17:17         ` Paul E. McKenney
  2026-01-22 17:59         ` Waiman Long
  2 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2026-01-22 17:17 UTC
  To: Sebastian Andrzej Siewior
  Cc: Waiman Long, Andrew Morton, Mike Rapoport, Clark Williams,
	Steven Rostedt, linux-mm, linux-kernel, linux-rt-devel, Wei Yang,
	David Hildenbrand

On Thu, Jan 22, 2026 at 08:57:47AM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-01-21 13:27:32 [-0800], Paul E. McKenney wrote:
> > > > > --- a/mm/mm_init.c
> > > > > +++ b/mm/mm_init.c
> > > > > @@ -2085,7 +2085,12 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > > > >   			spfn = chunk_end;
> > > > > -			if (irqs_disabled())
> > > > > +			/*
> > > > > +			 * pgdat_resize_lock() only disables irqs in non-RT
> > > > > +			 * kernels but calls rcu_read_lock() in a PREEMPT_RT
> > > > > +			 * kernel.
> > > > > +			 */
> > > > > +			if (irqs_disabled() || rcu_preempt_depth())
> > > > >   				touch_nmi_watchdog();
> > > > rcu_preempt_depth() seems a fairly internal low-level thing - it's
> > > > rarely used.
> If you acquire a lock from time to time and you pass a bool to let the
> function below know whether scheduling is fine or not, then it is
> obvious. If you instead choose to check for symptoms of an acquired
> lock, then you also have to use the rarely used functions ;)

Agreed.

> > > That is true. Besides the scheduler, the workqueue code also uses
> > > rcu_preempt_depth(). The API is declared in "include/linux/rcupdate.h",
> > > which is included directly or indirectly by many kernel files. So even
> > > though it is rarely used, it is still a public API.
> > 
> > It is a bit tricky, for example, given a kernel built with both
> > CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT_DYNAMIC=y, it will never
> > invoke touch_nmi_watchdog(), even if it really is in an RCU read-side
> > critical section.  This is because it was intended for lockdep-like use,
> > where (for example) you don't want to complain about sleeping in an RCU
> > read-side critical section unless you are 100% sure that you are in fact
> > in an RCU read-side critical section.
> > 
> > Maybe something like this?
> > 
> > 	if (irqs_disabled() || !IS_ENABLED(CONFIG_PREEMPT_RCU) || rcu_preempt_depth())
> > 		touch_nmi_watchdog();
> 
> I don't understand the PREEMPT_NONE+DYNAMIC reasoning. irqs_disabled()
> should not be affected by this and rcu_preempt_depth() will be 0 for
> !CONFIG_PREEMPT_RCU so I don't think this is required. 

If touch_nmi_watchdog() is fast enough that calling it in a potentially
tight loop is OK, then agreed, this is not required.

But maybe I should redefine !PREEMPT rcu_preempt_depth() in terms of
preemptible() or some such.  Thoughts?
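
(Purely as a hypothetical illustration of that idea, not code from this
thread: the !CONFIG_PREEMPT_RCU stub might become something like

	/* hypothetical: report "in a reader" whenever not preemptible */
	static inline int rcu_preempt_depth(void)
	{
		return !preemptible();
	}

noting that preemptible() is hard-coded to 0 on !CONFIG_PREEMPT kernels,
which would again make this always nonzero there.)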

> > This would *always* invoke touch_nmi_watchdog() for such kernels, which
> > might or might not be OK.
> > 
> > I freely confess that I am not sure which of these is appropriate in
> > this setting.
> 
> What about a more straight forward and obvious approach?
> 
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index fc2a6f1e518f1..0b283fd48b282 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2059,7 +2059,7 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
>   */
>  static unsigned long __init
>  deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> -			   struct zone *zone)
> +			   struct zone *zone, bool may_schedule)
>  {
>  	int nid = zone_to_nid(zone);
>  	unsigned long nr_pages = 0;
> @@ -2085,10 +2085,10 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  
>  			spfn = chunk_end;
>  
> -			if (irqs_disabled())
> -				touch_nmi_watchdog();
> -			else
> +			if (may_schedule)
>  				cond_resched();
> +			else
> +				touch_nmi_watchdog();
>  		}
>  	}
>  
> @@ -2101,7 +2101,7 @@ deferred_init_memmap_job(unsigned long start_pfn, unsigned long end_pfn,
>  {
>  	struct zone *zone = arg;
>  
> -	deferred_init_memmap_chunk(start_pfn, end_pfn, zone);
> +	deferred_init_memmap_chunk(start_pfn, end_pfn, zone, true);
>  }
>  
>  static unsigned int __init
> @@ -2216,7 +2216,7 @@ bool __init deferred_grow_zone(struct zone *zone, unsigned int order)
>  	for (spfn = first_deferred_pfn, epfn = SECTION_ALIGN_UP(spfn + 1);
>  	     nr_pages < nr_pages_needed && spfn < zone_end_pfn(zone);
>  	     spfn = epfn, epfn += PAGES_PER_SECTION) {
> -		nr_pages += deferred_init_memmap_chunk(spfn, epfn, zone);
> +		nr_pages += deferred_init_memmap_chunk(spfn, epfn, zone, false);
>  	}
>  
>  	/*
> 
> Wouldn't this work?

I will defer to the mm guys on this question.

							Thanx, Paul



* Re: [PATCH] mm/mm_init: Don't call cond_resched() in deferred_init_memmap_chunk() if rcu_preempt_depth() set
  2026-01-22  7:57       ` Sebastian Andrzej Siewior
  2026-01-22  9:47         ` Mike Rapoport
  2026-01-22 17:17         ` Paul E. McKenney
@ 2026-01-22 17:59         ` Waiman Long
  2 siblings, 0 replies; 8+ messages in thread
From: Waiman Long @ 2026-01-22 17:59 UTC
  To: Sebastian Andrzej Siewior, Paul E. McKenney
  Cc: Waiman Long, Andrew Morton, Mike Rapoport, Clark Williams,
	Steven Rostedt, linux-mm, linux-kernel, linux-rt-devel, Wei Yang,
	David Hildenbrand

On 1/22/26 2:57 AM, Sebastian Andrzej Siewior wrote:
> On 2026-01-21 13:27:32 [-0800], Paul E. McKenney wrote:
>>>>> --- a/mm/mm_init.c
>>>>> +++ b/mm/mm_init.c
>>>>> @@ -2085,7 +2085,12 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>>>>>    			spfn = chunk_end;
>>>>> -			if (irqs_disabled())
>>>>> +			/*
>>>>> +			 * pgdat_resize_lock() only disables irqs in non-RT
>>>>> +			 * kernels but calls rcu_read_lock() in a PREEMPT_RT
>>>>> +			 * kernel.
>>>>> +			 */
>>>>> +			if (irqs_disabled() || rcu_preempt_depth())
>>>>>    				touch_nmi_watchdog();
>>>> rcu_preempt_depth() seems a fairly internal low-level thing - it's
>>>> rarely used.
> If you acquire a lock from time to time and you pass a bool to let the
> function below know whether scheduling is fine or not, then it is
> obvious. If you instead choose to check for symptoms of an acquired
> lock, then you also have to use the rarely used functions ;)
>
>>> That is true. Besides the scheduler, the workqueue code also uses
>>> rcu_preempt_depth(). The API is declared in "include/linux/rcupdate.h",
>>> which is included directly or indirectly by many kernel files. So even
>>> though it is rarely used, it is still a public API.
>> It is a bit tricky, for example, given a kernel built with both
>> CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT_DYNAMIC=y, it will never
>> invoke touch_nmi_watchdog(), even if it really is in an RCU read-side
>> critical section.  This is because it was intended for lockdep-like use,
>> where (for example) you don't want to complain about sleeping in an RCU
>> read-side critical section unless you are 100% sure that you are in fact
>> in an RCU read-side critical section.
>>
>> Maybe something like this?
>>
>> 	if (irqs_disabled() || !IS_ENABLED(CONFIG_PREEMPT_RCU) || rcu_preempt_depth())
>> 		touch_nmi_watchdog();
> I don't understand the PREEMPT_NONE+DYNAMIC reasoning. irqs_disabled()
> should not be affected by this and rcu_preempt_depth() will be 0 for
> !CONFIG_PREEMPT_RCU so I don't think this is required.
>
>> This would *always* invoke touch_nmi_watchdog() for such kernels, which
>> might or might not be OK.
>>
>> I freely confess that I am not sure which of these is appropriate in
>> this setting.
> What about a more straight forward and obvious approach?
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index fc2a6f1e518f1..0b283fd48b282 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2059,7 +2059,7 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
>    */
>   static unsigned long __init
>   deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> -			   struct zone *zone)
> +			   struct zone *zone, bool may_schedule)
>   {
>   	int nid = zone_to_nid(zone);
>   	unsigned long nr_pages = 0;
> @@ -2085,10 +2085,10 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>   
>   			spfn = chunk_end;
>   
> -			if (irqs_disabled())
> -				touch_nmi_watchdog();
> -			else
> +			if (may_schedule)
>   				cond_resched();
> +			else
> +				touch_nmi_watchdog();
>   		}
>   	}
>   
> @@ -2101,7 +2101,7 @@ deferred_init_memmap_job(unsigned long start_pfn, unsigned long end_pfn,
>   {
>   	struct zone *zone = arg;
>   
> -	deferred_init_memmap_chunk(start_pfn, end_pfn, zone);
> +	deferred_init_memmap_chunk(start_pfn, end_pfn, zone, true);
>   }
>   
>   static unsigned int __init
> @@ -2216,7 +2216,7 @@ bool __init deferred_grow_zone(struct zone *zone, unsigned int order)
>   	for (spfn = first_deferred_pfn, epfn = SECTION_ALIGN_UP(spfn + 1);
>   	     nr_pages < nr_pages_needed && spfn < zone_end_pfn(zone);
>   	     spfn = epfn, epfn += PAGES_PER_SECTION) {
> -		nr_pages += deferred_init_memmap_chunk(spfn, epfn, zone);
> +		nr_pages += deferred_init_memmap_chunk(spfn, epfn, zone, false);
>   	}
>   
>   	/*
>
> Wouldn't this work?

Yes, I think that is the better approach. I will post a v3 with this 
change, as Mike has no objection to it. Thanks!

Cheers,
Longman



