Date: Tue, 12 Sep 2023 01:14:12 -0700
From: "Paul E. McKenney" <paulmck@kernel.org>
To: "Liam R. Howlett", Geert Uytterhoeven, Andrew Morton,
 maple-tree@lists.infradead.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, stable@vger.kernel.org,
 linux-renesas-soc@vger.kernel.org, Shanker Donthineni
Subject: Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible
Message-ID: <33150b55-970c-4607-9015-af0e50e4112d@paulmck-laptop>
References: <20230819004356.1454718-1-Liam.Howlett@oracle.com>
 <20230819004356.1454718-2-Liam.Howlett@oracle.com>
 <3f86d58e-7f36-c6b4-c43a-2a7bcffd3bd@linux-m68k.org>
 <20230906152325.dblzauybyoq5kd35@revolver>
 <20230906172954.oq4vogeuco25zam7@revolver>
 <495849d6-1dc6-4f38-bce7-23c50df3a99f@paulmck-laptop>
 <20230911235452.xhtnt7ply7ayr53x@revolver>
In-Reply-To: <20230911235452.xhtnt7ply7ayr53x@revolver>

On Mon, Sep 11, 2023 at
07:54:52PM -0400, Liam R. Howlett wrote:
> * Paul E. McKenney [230906 14:03]:
> > On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
> > > * Paul E. McKenney [230906 13:24]:
> > > > On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
> > > > > (Adding Paul & Shanker to Cc list.. please see below for why)
> > > > >
> > > > > Apologies on the late response, I was away and have been struggling to
> > > > > get a working PPC32 test environment.
> > > > >
> > > > > * Geert Uytterhoeven [230829 12:42]:
> > > > > > Hi Liam,
> > > > > >
> > > > > > On Fri, 18 Aug 2023, Liam R. Howlett wrote:
> > > > > > > The current implementation of append may cause duplicate data and/or
> > > > > > > incorrect ranges to be returned to a reader during an update.  Although
> > > > > > > this has not been reported or seen, disable the append write operation
> > > > > > > while the tree is in rcu mode out of an abundance of caution.
> > > > >
> > > > > ...
> > > > >
> > > > > ...
> > > > > >
> > > > > > RCU-related configs:
> > > > > >
> > > > > > $ grep RCU .config
> > > > > > # RCU Subsystem
> > > > > > CONFIG_TINY_RCU=y

I must have been asleep last time I looked at this.  I was looking at
Tree RCU.  Please accept my apologies for my lapse. :-/

However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would
have said the same thing, albeit after looking at a lot less RCU code.

TL;DR:

1.	Try making the __setup_irq() function's call to mutex_lock()
	instead be as follows:

		if (!mutex_trylock(&desc->request_mutex))
			mutex_lock(&desc->request_mutex);

	This might fail if __setup_irq() has other dependencies on a
	fully operational scheduler.

2.	Move that ppc32 call to __setup_irq() much later, most
	definitely after interrupts have been enabled and the scheduler
	is fully operational.  Invoking mutex_lock() before that time
	is not a good idea.  ;-)

For more detail, please read on!
> > > > > > # CONFIG_RCU_EXPERT is not set
> > > > > > CONFIG_TINY_SRCU=y
> > > > > > # end of RCU Subsystem
> > > > > > # RCU Debugging
> > > > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > > > # CONFIG_RCU_TRACE is not set
> > > > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > > > # end of RCU Debugging
> > > > >
> > > > > I used the configuration from debian 8 and ran 'make oldconfig' to build
> > > > > my kernel.  I have attached the configuration.
> > ...
> > > > > It appears to be something to do with struct maple_tree sparse_irqs.  If
> > > > > you drop the rcu flag from that maple tree, then my configuration boots
> > > > > without the warning.
> > > > >
> > > > > I *think* this is because we will reuse a lot more nodes.  And I *think*
> > > > > the rcu flag is not needed, since there is a comment about reading the
> > > > > tree being protected by the mutex sparse_irq_lock within the
> > > > > kernel/irq/irqdesc.c file.  Shanker, can you comment on that?
> > > > >
> > > > > I wonder if there is a limit to the number of RCU free events before
> > > > > something is triggered to flush them out which could trigger IRQ
> > > > > enabling?  Paul, could this be the case?
> > > >
> > > > Are you asking if call_rcu() will re-enable interrupts in the following
> > > > use case?
> > > >
> > > > 	local_irq_disable();
> > > > 	call_rcu(&p->rh, my_cb_func);
> > > > 	local_irq_enable();
>
> I am not.
>
> ...
> > > >
> > > > Or am I missing your point?
> > >
> > > This is very early in the boot sequence when interrupts have not been
> > > enabled.  What we are seeing is a WARN_ON() that is triggered by
> > > interrupts being enabled before they should be enabled.
> > >
> > > I was wondering if, for example, I called call_rcu() a lot *before*
> > > interrupts were enabled, that something could trigger that would either
> > > enable interrupts or indicate the task needs rescheduling?
> >
> > You aren't doing call_rcu() enough to hit OOM, are you?  The actual RCU
> > callback invocations won't happen until some time after the scheduler
> > starts up.
>
> I am not, it's just a detection of IRQs being enabled early.
>
> > > Specifically the rescheduling part is suspect.  I tracked down the call
> > > to a mutex_lock() which calls cond_resched(), so could rcu be
> > > 'encouraging' the rcu window by a reschedule request?
> >
> > During boot before interrupts are enabled, RCU has not yet spawned any of
> > its kthreads.  Therefore, all of its attempts to do wakeups would notice
> > a NULL task_struct pointer and refrain from actually doing the wakeup.
> > If it did do the wakeup, you would see a NULL-pointer exception.  See
> > for example, invoke_rcu_core_kthread(), though that won't happen unless
> > you booted with rcutree.use_softirq=0.
> >
> > Besides, since when did doing a wakeup enable interrupts?  That would
> > make it hard to do wakeups from hardware interrupt handlers, not?
>
> Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a
> cond_resched().
>
> From what Michael said [1] in this thread, since something has already
> set TIF_NEED_RESCHED, it will eventually enable interrupts on us.
>
> I've traced this to running call_rcu() in kernel/rcu/tiny.c and
> is_idle_task(current) is true, which means rcu runs:
> 	/* force scheduling for rcu_qs() */
> 	resched_cpu(0);
>
> the task is set idle in sched_init() -> init_idle() and never changed,
> afaict.

Yes, because RCU eventually needs a context switch in order to make
a grace period happen.  And Maple Tree isn't the only thing invoking
call_rcu() early, so this has been in place for a very long time.
> Removing the RCU option from the maple tree in kernel/irq/irqdesc.c
> fixes the issue by avoiding the maple tree running call_rcu().  I am not
> sure on the locking of the tree so I feel this change may cause other
> issues...also it's before lockdep_init(), so any issue I introduce may
> not be detected.
>
> When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that rcu does the
> same thing, but the IRQs are not enabled on return.  So, resched_cpu(0)
> is called, but the IRQs warning of enabled isn't triggered.  I failed to
> find a reason why.

Here you mean IRQs being enabled upon return from __setup_irq(), correct?

But yes, __setup_irq() does call mutex_lock().  Which will call
preempt_schedule_common() via might_sleep() and __cond_resched(), even
though that is clearly a very bad thing this early.  And let's face it,
the whole purpose of mutex_lock() is to block when needed.  And a big
purpose of that might_sleep() is to yell at people invoking this with
interrupts disabled.

And most of the wrappers around __setup_irq() look to be intended for
much later, after interrupts have been enabled.  One exception is
setup_percpu_irq(), which says that it is for "the early boot process",
whenever that might be.  But this is only invoked from mips in v6.5.
The __request_percpu_irq() function is wrappered by request_percpu_irq(),
and its header comment suggests that it is to be called after there are
multiple CPUs.  I am not seeing a call that is obviously from ppc32,
but there are a number of drivers using this with which I am unfamiliar.
The request_percpu_nmi() has to be followed up on each CPU using
prepare_percpu_nmi() and enable_percpu_nmi(), so it is not clear that
it is useful to invoke this before interrupts are enabled.  But this
is used by ARM, not ppc32 from what I can see.

So even though I am not seeing how ppc32 invokes __setup_irq() early,
your testing clearly indicates that it is doing so.
> I am not entirely sure what makes ppc32 different than other platforms
> in that the initial task is configured to an idle task and the first
> call to call_rcu (tiny!) would cause the observed behaviour.

Maybe something like this in __setup_irq(), right before the mutex_lock()?

	WARN_ON_ONCE(irqs_disabled());

This will dump the stack trace showing how __setup_irq() is being
invoked in early boot on ppc32.  Again, given that __setup_irq() invokes
mutex_lock(), invoking this function in its current form before
interrupts have been enabled is a bad idea.

> Non-tiny rcu calls (as I am sure you know, but others may not)
> kernel/rcu/tree.c which in turn calls __call_rcu_common().  That
> function is far more complex than the tiny version.  Maybe it's part of
> why we see different behaviour based on platforms?  I don't see an idle
> check in that version of call_rcu().

Tree RCU can rely on the grace-period kthread starting up later in
boot.  This kthread will handle any need for a grace period from early
boot invocation of call_rcu().  Tiny RCU does not have such a kthread,
hence the resched_cpu().  Which is fine, as long as nothing (incorrectly!)
invokes a blocking operation before the scheduler has been initialized.

> Or maybe PPC32 has something set incorrectly to cause this failure in
> early boot and I've just found something that needs to be set
> differently?

This might be a failure of imagination on my part, but I certainly don't
see why ppc32 should expect to get away with invoking __setup_irq()
anywhere near that early!  ;-)

But if ppc32 cannot be adjusted so as to invoke __setup_irq() later,
the following change to __setup_irq() might work:

	if (!mutex_trylock(&desc->request_mutex))
		mutex_lock(&desc->request_mutex);

The reason for "might" rather than "will" is that __setup_irq() might
well have other dependencies on the scheduler being up and running.
It would be cleaner for that ppc32 call to __setup_irq() to be moved
later, after the scheduler is up and running.
> > But why not put some WARN_ON_ONCE(!irqs_disabled()) calls in the areas
> > of greatest suspicion, starting from the stack trace generated by that
> > mutex_lock()?  A stray interrupt-enable could be pretty much anywhere.
> >
> > But where are those call_rcu() invocations?  Before rcu_init()?
>
> During init_IRQ(), which is after rcu_init() but before rcu_init_nohz(),
> srcu_init(), and softirq_init() in init/main.c start_kernel().

That is a perfectly legal time and place for a call_rcu() invocation.
It is also legal before rcu_init(), but it has to be late enough that
things like per-CPU variables are accessible.  Which is still very early.

> > Presumably before init is spawned and the early_init() calls.
> >
> > And what is the RCU-related Kconfig and boot-parameter setup?
>
> The .config was attached to the email I sent, and it matches what was
> quoted above in the "RCU-related configs" section.
>
> [1] https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/

Again, apologies for having misread that .config snippet!!!

							Thanx, Paul