From: Thomas Gleixner
To: paulmck@kernel.org
Cc: Linus Torvalds, Peter Zijlstra, Ankur Arora, linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org, akpm@linux-foundation.org, luto@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, willy@infradead.org, mgorman@suse.de, rostedt@goodmis.org, jon.grimm@amd.com, bharata@amd.com, raghavendra.kt@amd.com, boris.ostrovsky@oracle.com, konrad.wilk@oracle.com, jgross@suse.com, andrew.cooper3@citrix.com, Frederic Weisbecker
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
References: <87ttrngmq0.ffs@tglx> <87jzshhexi.ffs@tglx>
Date: Wed, 18 Oct 2023 15:16:12 +0200
Message-ID: <87pm1c3wbn.ffs@tglx>

Paul!

On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> Belatedly calling out some RCU issues. Nothing fatal, just a
> (surprisingly) few adjustments that will need to be made. The key thing
> to note is that from RCU's viewpoint, with this change, all kernels
> are preemptible, though rcu_read_lock() readers remain
> non-preemptible.

Why? Either I'm confused or you or both of us :)

With this approach the kernel is by definition fully preemptible, which
means rcu_read_lock() is preemptible too. That's pretty much the same
situation as with PREEMPT_DYNAMIC.

For throughput's sake this fully preemptible kernel provides a mechanism
to delay preemption for SCHED_OTHER tasks, i.e.
instead of setting NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.

That means the preemption points in preempt_enable() and return from
interrupt to kernel will not see NEED_RESCHED, and the tasks can run
until they either call schedule() or return to user space. That's pretty
much what PREEMPT_NONE does today.

The difference to NONE/VOLUNTARY is that the explicit cond_resched()
points are no longer required, because the scheduler can preempt the
long-running task by setting NEED_RESCHED instead.

That preemption might be suboptimal in some cases compared to
cond_resched(), but from my initial experimentation that's not really an
issue.

> With that:
>
> 1.  As an optimization, given that preempt_count() would always give
>     good information, the scheduling-clock interrupt could sense RCU
>     readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the
>     IPI handlers for expedited grace periods. A nice optimization.
>     Except that...
>
> 2.  The quiescent-state-forcing code currently relies on the presence
>     of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix
>     would be to do resched_cpu() more quickly, but some workloads
>     might not love the additional IPIs. Another approach is to do #1
>     above to replace the quiescent states from cond_resched() with
>     scheduler-tick-interrupt-sensed quiescent states.

Right. The tick can see either that the lazy resched bit has been
"ignored" or some magic "RCU needs a quiescent state" request, and force
a reschedule.

> Plus...
>
> 3.  For nohz_full CPUs that run for a long time in the kernel,
>     there are no scheduling-clock interrupts. RCU reaches for
>     the resched_cpu() hammer a few jiffies into the grace period.
>     And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
>     interrupt-entry code will re-enable its scheduling-clock interrupt
>     upon receiving the resched_cpu() IPI.

You can spare the IPI by setting NEED_RESCHED on the remote CPU, which
will cause it to preempt.
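
The lazy scheme described earlier in this reply (NEED_RESCHED_LAZY, with the tick forcing a reschedule) can be illustrated by a toy user-space model. This is a sketch under stated assumptions, not the kernel implementation: struct toy_cpu, the toy_* helpers, the TOY_* bits and the policy of upgrading a lazy request that is still pending at the next tick are all hypothetical. It shows kernel preemption points (preempt_enable() and return from interrupt to kernel) acting only on the full bit, while the lazy bit is honored by schedule() and return to user space, and the tick promotes an ignored lazy request, or an RCU quiescent-state request, to a full one.

```c
/*
 * Toy user-space model of lazy preemption. NOT kernel code: every
 * toy_* name, the TOY_* bits and the "upgrade a lazy request that is
 * still pending at the next tick" policy are hypothetical.
 */
#include <assert.h>
#include <stdbool.h>

#define TOY_NEED_RESCHED      (1u << 0) /* preempt at the next preemption point */
#define TOY_NEED_RESCHED_LAZY (1u << 1) /* honored only by schedule()/user return */

struct toy_cpu {
	unsigned int flags;
	int preempt_count;
	bool lazy_seen_at_last_tick; /* lazy bit was already pending at the previous tick */
	int nr_switches;             /* number of simulated context switches */
};

/* Simulated context switch: consumes both request bits. */
static void toy_schedule(struct toy_cpu *cpu)
{
	cpu->flags &= ~(TOY_NEED_RESCHED | TOY_NEED_RESCHED_LAZY);
	cpu->lazy_seen_at_last_tick = false;
	cpu->nr_switches++;
}

/* Scheduler asks the current task to give up the CPU (lazy e.g. for SCHED_OTHER). */
static void toy_resched_curr(struct toy_cpu *cpu, bool lazy)
{
	cpu->flags |= lazy ? TOY_NEED_RESCHED_LAZY : TOY_NEED_RESCHED;
}

/* Kernel preemption point (preempt_enable(), irq return to kernel): full bit only. */
static bool toy_kernel_preempt_point(struct toy_cpu *cpu)
{
	if (cpu->preempt_count == 0 && (cpu->flags & TOY_NEED_RESCHED)) {
		toy_schedule(cpu);
		return true;
	}
	return false;
}

static void toy_preempt_disable(struct toy_cpu *cpu)
{
	cpu->preempt_count++;
}

static bool toy_preempt_enable(struct toy_cpu *cpu)
{
	assert(cpu->preempt_count > 0);
	cpu->preempt_count--;
	return toy_kernel_preempt_point(cpu);
}

/* Return to user space honors both the full and the lazy request. */
static bool toy_exit_to_user(struct toy_cpu *cpu)
{
	if (cpu->flags & (TOY_NEED_RESCHED | TOY_NEED_RESCHED_LAZY)) {
		toy_schedule(cpu);
		return true;
	}
	return false;
}

/*
 * Scheduler tick that interrupted kernel code. A lazy request that was
 * already pending at the previous tick (i.e. "ignored"), or an RCU
 * request for a quiescent state, is upgraded to the full bit; the
 * return from the interrupt is then a kernel preemption point.
 */
static bool toy_tick(struct toy_cpu *cpu, bool rcu_wants_qs)
{
	bool lazy = cpu->flags & TOY_NEED_RESCHED_LAZY;

	if ((lazy && cpu->lazy_seen_at_last_tick) || rcu_wants_qs)
		cpu->flags |= TOY_NEED_RESCHED;
	cpu->lazy_seen_at_last_tick = lazy;
	return toy_kernel_preempt_point(cpu);
}
```

In this model a lazy request survives toy_preempt_enable(), is satisfied at once by toy_exit_to_user(), and, if the task stays in the kernel, is forced by the second toy_tick() that still finds it pending. A full request preempts at the first preemption point reached with preempt_count == 0.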

> So nohz_full CPUs should be OK as far as RCU is concerned.
> Other subsystems might have other opinions.
>
> 4.  As another optimization, kvfree_rcu() could unconditionally
>     check preempt_count() to sense a clean environment suitable for
>     memory allocation.

Correct. All the limitations caused by preempt count being useless are
gone.

> 5.  Kconfig files with "select TASKS_RCU if PREEMPTION" must
>     instead say "select TASKS_RCU". This means that the #else
>     in include/linux/rcupdate.h that defines TASKS_RCU in terms of
>     vanilla RCU must go. There might be some fallout if something
>     fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
>     and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
>     rcu_tasks_classic_qs() to do something useful.

In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
remaining would be CONFIG_PREEMPT_RT, which should be renamed to
CONFIG_RT or such, as it does not really change the preemption model
itself. RT just reduces the preemption-disabled sections via the lock
conversions, forced interrupt threading and some more.

> 6.  You might think that RCU Tasks (as opposed to RCU Tasks Trace
>     or RCU Tasks Rude) would need those pesky cond_resched() calls
>     to stick around. The reason is that RCU Tasks readers are ended
>     only by voluntary context switches. This means that although a
>     preemptible infinite loop in the kernel won't inconvenience a
>     real-time task (nor a non-real-time task for all that long),
>     and won't delay grace periods for the other flavors of RCU,
>     it would indefinitely delay an RCU Tasks grace period.
>
>     However, RCU Tasks grace periods seem to be finite in preemptible
>     kernels today, so they should remain finite in limited-preemptible
>     kernels tomorrow. Famous last words...

That's an issue which you have today with preempt FULL, right? So if it
turns out to be a problem, it's not a problem of the new model.

> 7.  RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
>     any algorithmic difference from this change.
>
> 8.  As has been noted elsewhere, in this new limited-preemption
>     mode of operation, rcu_read_lock() readers remain preemptible.
>     This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.

Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?

> 9.  The rcu_preempt_depth() macro could do something useful in
>     limited-preemption kernels. Its current lack of ability in
>     CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.

Correct.

> 10. The cond_resched_rcu() function must remain because we still
>     have non-preemptible rcu_read_lock() readers.

Where?

> 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
>     unchanged, but I must defer to the include/net/ip_vs.h people.

*blink*

> 12. I need to check with the BPF folks on the BPF verifier's
>     definition of BTF_ID(func, rcu_read_unlock_strict).
>
> 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
>     function might have some redundancy across the board instead
>     of just on CONFIG_PREEMPT_RCU=y. Or might not.
>
> 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
>     might need to do something for non-preemptible RCU to make
>     up for the lack of cond_resched() calls. Maybe just drop the
>     "IS_ENABLED()" and execute the body of the current "if" statement
>     unconditionally.

Again. There is no non-preemptible RCU with this model, unless I'm
missing something important here.

> 15. I must defer to others on the mm/pgtable-generic.c file's
>     #ifdef that depends on CONFIG_PREEMPT_RCU.

All those ifdefs should die :)

> While in the area, I noted that KLP seems to depend on cond_resched(),
> but on this I must defer to the KLP people.

Yeah, KLP needs some thought, but that's not rocket science to fix IMO.

> I am sure that I am missing something, but I have not yet seen any
> show-stoppers. Just some needed adjustments.

Right.
If it works out as I think it can, the main adjustments are to remove a
large amount of #ifdef maze and related gunk :)

Thanks,

        tglx