linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Christophe Leroy <christophe.leroy@csgroup.eu>
To: "paulmck@kernel.org" <paulmck@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Geert Uytterhoeven <geert@linux-m68k.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"maple-tree@lists.infradead.org" <maple-tree@lists.infradead.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"linux-renesas-soc@vger.kernel.org"
	<linux-renesas-soc@vger.kernel.org>,
	Shanker Donthineni <sdonthineni@nvidia.com>
Subject: Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible
Date: Tue, 12 Sep 2023 14:37:42 +0000	[thread overview]
Message-ID: <3e75add4-4d51-eab9-7029-25660e02b082@csgroup.eu> (raw)
In-Reply-To: <33150b55-970c-4607-9015-af0e50e4112d@paulmck-laptop>



Le 12/09/2023 à 10:14, Paul E. McKenney a écrit :
> On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote:
>> * Paul E. McKenney <paulmck@kernel.org> [230906 14:03]:
>>> On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
>>>> * Paul E. McKenney <paulmck@kernel.org> [230906 13:24]:
>>>>> On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
>>>>>> (Adding Paul & Shanker to Cc list.. please see below for why)
>>>>>>
>>>>>> Apologies on the late response, I was away and have been struggling to
>>>>>> get a working PPC32 test environment.
>>>>>>
>>>>>> * Geert Uytterhoeven <geert@linux-m68k.org> [230829 12:42]:
>>>>>>> 	Hi Liam,
>>>>>>>
>>>>>>> On Fri, 18 Aug 2023, Liam R. Howlett wrote:
>>>>>>>> The current implementation of append may cause duplicate data and/or
>>>>>>>> incorrect ranges to be returned to a reader during an update.  Although
>>>>>>>> this has not been reported or seen, disable the append write operation
>>>>>>>> while the tree is in rcu mode out of an abundance of caution.
>>>>>>
>>>>>> ...
>>>>>>>>
>>
>> ...
>>
>>>>>>> RCU-related configs:
>>>>>>>
>>>>>>>      $ grep RCU .config
>>>>>>>      # RCU Subsystem
>>>>>>>      CONFIG_TINY_RCU=y
> 
> I must have been asleep last time I looked at this.  I was looking at
> Tree RCU.  Please accept my apologies for my lapse.  :-/
> 
> However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would
> have said the same thing, albeit after looking at a lot less RCU code.
> 
> TL;DR:
> 
> 1.	Try making the __setup_irq() function's call to mutex_lock()
> 	instead be as follows:
> 
> 	if (!mutex_trylock(&desc->request_mutex))
> 		mutex_lock(&desc->request_mutex);
> 
> 	This might fail if __setup_irq() has other dependencies on a
> 	fully operational scheduler.
> 
> 2.	Move that ppc32 call to __setup_irq() much later, most definitely
> 	after interrupts have been enabled and the scheduler is fully
> 	operational.  Invoking mutex_lock() before that time is not a
> 	good idea.  ;-)
> 
> For more detail, please read on!
> 
>>>>>>>      # CONFIG_RCU_EXPERT is not set
>>>>>>>      CONFIG_TINY_SRCU=y
>>>>>>>      # end of RCU Subsystem
>>>>>>>      # RCU Debugging
>>>>>>>      # CONFIG_RCU_SCALE_TEST is not set
>>>>>>>      # CONFIG_RCU_TORTURE_TEST is not set
>>>>>>>      # CONFIG_RCU_REF_SCALE_TEST is not set
>>>>>>>      # CONFIG_RCU_TRACE is not set
>>>>>>>      # CONFIG_RCU_EQS_DEBUG is not set
>>>>>>>      # end of RCU Debugging
>>>>>>
>>>>>> I used the configuration from debian 8 and ran 'make oldconfig' to build
>>>>>> my kernel.  I have attached the configuration.
>>
>> ...
>>
>>>>>> It appears to be something to do with struct maple_tree sparse_irqs.  If
>>>>>> you drop the rcu flag from that maple tree, then my configuration boots
>>>>>> without the warning.
>>>>>>
>>>>>> I *think* this is because we will reuse a lot more nodes.  And I *think*
>>>>>> the rcu flag is not needed, since there is a comment about reading the
>>>>>> tree being protected by the mutex sparse_irq_lock within the
>>>>>> kernel/irq/irqdesc.c file.  Shanker, can you comment on that?
>>>>>>
>>>>>> I wonder if there is a limit to the number of RCU free events before
>>>>>> something is triggered to flush them out which could trigger IRQ
>>>>>> enabling? Paul, could this be the case?
>>>>>
>>>>> Are you asking if call_rcu() will re-enable interrupts in the following
>>>>> use case?
>>>>>
>>>>> 	local_irq_disable();
>>>>> 	call_rcu(&p->rh, my_cb_func);
>>>>> 	local_irq_enable();
>>
>> I am not.
>>
>> ...
>>
>>>>>
>>>>> Or am I missing your point?
>>>>
>>>> This is very early in the boot sequence when interrupts have not been
>>>> enabled.  What we are seeing is a WARN_ON() that is triggered by
>>>> interrupts being enabled before they should be enabled.
>>>>
>>>> I was wondering if, for example, I called call_rcu() a lot *before*
>>>> interrupts were enabled, that something could trigger that would either
>>>> enable interrupts or indicate the task needs rescheduling?
>>>
>>> You aren't doing call_rcu() enough to hit OOM, are you?  The actual RCU
>>> callback invocations won't happen until some time after the scheduler
>>> starts up.
>>
>> I am not, it's just a detection of IRQs being enabled early.
>>
>>>
>>>> Specifically the rescheduling part is suspect.  I tracked down the call
>>>> to a mutex_lock() which calls cond_resched(), so could rcu be
>>>> 'encouraging' the rcu window by a reschedule request?
>>>
>>> During boot before interrupts are enabled, RCU has not yet spawned any of
>>> its kthreads.  Therefore, all of its attempts to do wakeups would notice
>>> a NULL task_struct pointer and refrain from actually doing the wakeup.
>>> If it did do the wakeup, you would see a NULL-pointer exception.  See
>>> for example, invoke_rcu_core_kthread(), though that won't happen unless
>>> you booted with rcutree.use_softirq=0.
>>>
>>> Besides, since when did doing a wakeup enable interrupts?  That would
>>> make it hard to do wakeups from hardware interrupt handlers, not?
>>
>> Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a
>> cond_resched().
>>
>> >From what Michael said [1] in this thread, since something has already
>> set TIF_NEED_RESCHED, it will eventually enable interrupts on us.
>>
>> I've traced this to running call_rcu() in kernel/rcu/tiny.c and
>> is_idle_task(current) is true, which means rcu runs:
>> 		/* force scheduling for rcu_qs() */
>>                  resched_cpu(0);
>>
>> the task is set idle in sched_init() -> init_idle() and never changed,
>> afaict.
> 
> Yes, because RCU eventually needs a context switch in order to make
> a grace period happen.  And Maple Tree isn't the only thing invoking
> call_rcu() early, so this has been in place for a very long time.
> 
>> Removing the RCU option from the maple tree in kernel/irq/irqdesc.c
>> fixes the issue by avoiding the maple tree running call_rcu().  I am not
>> sure on the locking of the tree so I feel this change may cause other
>> issues...also it's before lockdep_init(), so any issue I introduce may
>> not be detected.
>>
>> When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that rcu does the
>> same thing, but the IRQs are not enabled on return.  So, resched_cpu(0)
>> is called, but the IRQs warning of enabled isn't triggered.  I failed to
>> find a reason why.
> 
> Here you mean IRQs being enabled upon return from __setup_irq(), correct?
> 
> But yes, __setup_irq() does call mutex_lock().  Which will call
> preempt_schedule_common() via might_sleep() and __cond_resched(), even
> though that is clearly a very bad thing this early.  And let's face it,
> the whole purpose of mutex_lock() is to block when needed.  And a big
> purpose of that might_sleep() is to yell at people invoking this with
> interrupts disabled.
> 
> And most of the wrappers around __setup_irq() look to be intended
> for much later, after interrupts have been enabled.  One exception is
> setup_percpu_irq(), which says that it is for "the early boot process",
> whenever that might be.  But this is only invoked from mips in v6.5.
> 
> The __request_percpu_irq() function is wrappered by request_percpu_irq(),
> and its header comment suggests that it is to be called after there are
> multiple CPUs.  I am not seeing a call that is obviously from ppc32,
> but there are a number of drivers using this with which I am unfamiliar.
> 
> The request_percpu_nmi() has to be followed up on each CPU using
> prepare_percpu_nmi() and enable_percpu_nmi(), so it is not clear that
> it is useful to invoke this before interrupts are enabled.  But this is
> used by ARM, not ppc32 from what I can see.
> 
> So even though I am not seeing how ppc32 invokes __setup_irq() early,
> your testing clearly indicates that it is doing so.
> 
>> I am not entirely sure what makes ppc32 different than other platforms
>> in that the initial task is configured to an idle task and the first
>> call to call_rcu (tiny!) would cause the observed behaviour.
> 
> Maybe something like this in __setup_irq(), right before the
> mutex_lock()?
> 
> 	WARN_ON_ONCE(irqs_disabled());
> 
> This will dump the stack trace showing how __setup_irq() is being invoked
> in early boot on ppc32.
> 
> Again, given that __setup_irq() invokes mutex_lock(), invoking this
> function in its current form before interrupts have been enabled is a
> bad idea.
> 

No trigger of that WARN_ON() I added in __setup_irq() as instructed 
above, still getting (pmac32_defconfig on MAC99 QEMU)

------------[ cut here ]------------
Interrupts were enabled early
WARNING: CPU: 0 PID: 0 at init/main.c:992 start_kernel+0x4d8/0x5c0
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc1-dirty #481
Hardware name: PowerMac3,1 7400 0xc0209 PowerMac
NIP:  c0a6052c LR: c0a6052c CTR: 00000000
REGS: c0c4dee0 TRAP: 0700   Not tainted  (6.6.0-rc1-dirty)
MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 24000282  XER: 20000000

GPR00: c0a6052c c0c4dfa0 c0b92580 0000001d c0b9d128 00000001 c0b9d148 
3ffffdff
GPR08: c0ba80f0 00000000 00000000 3ffffe00 44000282 00000000 00000000 
0199abfc
GPR16: 0199b0a4 7fde7fa4 7fc5ac0c 000000bb 41000000 01c690c8 c0b92014 
c09b4bdc
GPR24: c0c55220 c0ac0000 00000000 efff9109 efff9100 0000000a c0c6d000 
c0b920a0
NIP [c0a6052c] start_kernel+0x4d8/0x5c0
LR [c0a6052c] start_kernel+0x4d8/0x5c0
Call Trace:
[c0c4dfa0] [c0a6052c] start_kernel+0x4d8/0x5c0 (unreliable)
[c0c4dff0] [00003540] 0x3540
Code: 480037b1 48023c05 4bab88ed 90620260 480139e9 4b657ced 7d2000a6 
71298000 41a20014 3c60c09a 3863b78c 4b5e9ccd <0fe00000> 39200000 
99380008 7d2000a6
---[ end trace 0000000000000000 ]---


  parent reply	other threads:[~2023-09-12 14:37 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-19  0:43 [PATCH v2 0/2] maple_tree: mas_wr_append() fix ups Liam R. Howlett
2023-08-19  0:43 ` [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible Liam R. Howlett
2023-08-29 16:42   ` Geert Uytterhoeven
2023-08-31  5:39     ` Michael Ellerman
2023-08-31  8:25       ` Geert Uytterhoeven
2023-08-31  8:45         ` Peng Zhang
2023-08-31  9:43           ` Geert Uytterhoeven
2023-09-06 15:23     ` Liam R. Howlett
2023-09-06 17:23       ` Paul E. McKenney
2023-09-06 17:29         ` Liam R. Howlett
2023-09-06 18:02           ` Paul E. McKenney
2023-09-11 23:54             ` Liam R. Howlett
2023-09-12  8:14               ` Paul E. McKenney
2023-09-12  8:23                 ` Geert Uytterhoeven
2023-09-12  8:30                   ` Paul E. McKenney
2023-09-12  8:34                     ` Geert Uytterhoeven
2023-09-12 10:00                       ` Paul E. McKenney
2023-09-12 13:56                         ` Liam R. Howlett
2023-09-12 14:29                           ` Liam R. Howlett
2023-09-12 15:08                             ` Paul E. McKenney
2023-09-12 15:27                               ` Christophe Leroy
2023-09-12 15:49                                 ` Liam R. Howlett
2023-09-12 15:07                           ` Paul E. McKenney
2023-09-12 15:44                             ` Liam R. Howlett
2023-09-12 16:49                               ` Paul E. McKenney
2023-09-12 17:02                                 ` Christophe Leroy
2023-09-12 17:09                               ` Christophe Leroy
2023-09-12 17:38                                 ` Liam R. Howlett
2023-09-13 13:14                         ` Geert Uytterhoeven
2023-09-13 13:24                           ` Liam R. Howlett
2023-09-13 13:26                             ` Geert Uytterhoeven
2023-09-12 14:37                 ` Christophe Leroy [this message]
2023-09-12 14:10               ` Matthew Wilcox
2023-09-12 14:17                 ` Liam R. Howlett
2023-09-06 19:06       ` Geert Uytterhoeven
2023-09-11 12:27     ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-10-16  8:29       ` Linux regression tracking #update (Thorsten Leemhuis)
2023-08-30 19:49   ` Andreas Schwab
2023-08-31  5:37     ` Michael Ellerman
2023-08-31 19:01       ` Andreas Schwab
2023-09-12 18:15     ` Andreas Schwab
2023-09-12 19:09       ` Liam R. Howlett
2023-09-12 20:01         ` Andreas Schwab
2023-08-19  0:43 ` [PATCH v2 2/2] maple_tree: Clean up mas_wr_append() Liam R. Howlett

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3e75add4-4d51-eab9-7029-25660e02b082@csgroup.eu \
    --to=christophe.leroy@csgroup.eu \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=geert@linux-m68k.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-renesas-soc@vger.kernel.org \
    --cc=maple-tree@lists.infradead.org \
    --cc=paulmck@kernel.org \
    --cc=sdonthineni@nvidia.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox