Date: Wed, 25 Oct 2023 15:19:51 -0400
From: Steven Rostedt
To: Mathieu Desnoyers
Cc: Mateusz Guzik, Peter Zijlstra, LKML, Thomas Gleixner, Ankur Arora,
 Linus Torvalds, linux-mm@kvack.org, x86@kernel.org, akpm@linux-foundation.org,
 luto@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
 mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 willy@infradead.org, mgorman@suse.de, jon.grimm@amd.com, bharata@amd.com,
 raghavendra.kt@amd.com, boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
 jgross@suse.com, andrew.cooper3@citrix.com, Joel Fernandes,
 Youssef Esmat, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar,
 Daniel Bristot de Oliveira
Subject: Re: [POC][RFC][PATCH] sched: Extended Scheduler Time Slice
Message-ID: <20231025151951.5f1a9ab1@gandalf.local.home>
In-Reply-To: <0d95385f-1be1-4dcf-93cb-8c5df3bc9d0c@efficios.com>
References: <20231025054219.1acaa3dd@gandalf.local.home>
 <20231025102952.GG37471@noisy.programming.kicks-ass.net>
 <20231025085434.35d5f9e0@gandalf.local.home>
 <20231025135545.GG31201@noisy.programming.kicks-ass.net>
 <20231025103105.5ec64b89@gandalf.local.home>
 <884e4603-4d29-41ae-8715-a070c43482c4@efficios.com>
 <20231025162435.ibhdktcshhzltr3r@f>
 <20231025131731.48461873@gandalf.local.home>
 <0d95385f-1be1-4dcf-93cb-8c5df3bc9d0c@efficios.com>

On Wed, 25 Oct 2023 14:49:44 -0400
Mathieu Desnoyers wrote:

> >
> > No, I wouldn't say it's the same as priority inheritance, which is to help
> > with determinism and not performance. PI adds overhead but removes
> > unbounded latency. On average, a non-PI mutex is faster than a PI mutex,
> > but it can suffer from unbounded priority inversion.
> 
> AFAIU, this is because a PI mutex as implemented by sys futex only boosts the
> priority of the lock owner. In my proposal here the owner would be able to
> borrow scheduler slices from the waiters as well.

I would be worried that this could cause even more scheduling disruption.
Now we would be taking time slices from other CPUs to run the current one?

> 
> [...]
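(For reference, the sys futex PI protocol we're comparing against looks
roughly like the untested sketch below, based on the futex(2) man page, with
all error and retry handling omitted. The point is that the uncontended lock
and unlock paths are pure atomics, and the kernel only gets involved, and only
boosts the owner, once a waiter actually calls FUTEX_LOCK_PI.)

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static inline unsigned int my_tid(void)
{
        return (unsigned int)syscall(SYS_gettid);
}

static void pi_lock(_Atomic unsigned int *f)
{
        unsigned int zero = 0;

        /* Uncontended fast path: CAS 0 -> TID, no syscall. */
        if (atomic_compare_exchange_strong(f, &zero, my_tid()))
                return;

        /* Contended: kernel sets FUTEX_WAITERS, boosts the owner, blocks us. */
        syscall(SYS_futex, f, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void pi_unlock(_Atomic unsigned int *f)
{
        unsigned int tid = my_tid();

        /* Uncontended fast path: CAS TID -> 0, no syscall. */
        if (atomic_compare_exchange_strong(f, &tid, 0))
                return;

        /* FUTEX_WAITERS is set: let the kernel hand off to the top waiter. */
        syscall(SYS_futex, f, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}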
> >
> >>> Hopefully I'm not oversimplifying if I state that we have mainly two
> >>> actors to consider:
> >>>
> >>> [A] the lock owner thread
> >>>
> >>> [B] threads that block trying to acquire the lock
> >>>
> >>> The fast-path here is [A]. [B] can go through a system call, I don't
> >>> think it matters at all.
> >
> > No, B going into a system call can be just as devastating. Adaptive
> > spinning helps with that. The thing here is that if A gets preempted,
> > there will be a lot more B's getting stuck.
> 
> I would indeed combine this with an adaptive spinning scheme to allow waiters
> to stay in userspace if contention is short. As you know, rseq can also help
> there:
> 
> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
> 
> Therefore, it's only the blocking case that would call into the kernel, which
> should not be so devastating.

I think we are in agreement about the blocking case, by which I believe you
mean "go to sleep and wait to be woken up when the resource is available".

> >
> > I implemented the test with futexes (where you go to sleep on contention)
> > and the performance dropped considerably, to the point where the benefit
> > of not having A get preempted made no difference at all. Sure, adaptive
> > spinning helps in that case, but adaptive spinning would only make it as
> > good as my user-space spinning logic is without any changes.
> 
> I'm not sure what you are arguing here.
> 
> My overall idea would be to combine:
> 
> 1) Adaptive spinning in userspace,
> 2) Priority inheritance,
> 3) Scheduler slice inheritance.

The PI and slice inheritance can become very complex. It sounds more like
proxy execution, which is great in theory but very difficult to implement
in practice.

> >
> >>> Those lock addresses could then be used as keys for private locks,
> >>> or transformed into inode/offset keys for shared-memory locks. Threads
> >>> [B] blocking trying to acquire the lock can call a system call which
> >>> would boost the lock owner's slice and/or priority for a given lock key.
> >
> > Do you mean that this would be done in user space? Going into the kernel
> > to do any of this would make it already lost.
> 
> Going to the kernel only happens when threads need to block, which means
> that the typical contended half-happy path should be busy-spinning in
> userspace (adaptive spinning).
> 
> I see why blocking in a scenario where busy-spinning would be better is
> inefficient, but I don't see how going to the kernel for the _blocking_
> case is bad.

My point of view in this patch is for very short duration critical sections.
This patch is not for the blocking use case at all. For that, I'm 100% on
board with adaptive spinning. IOW, "blocking" is out of scope for this patch.

> >
> >>> When the scheduler preempts [A], it would check whether the rseq
> >>> per-thread area has a "held locks" field set and use this information
> >>> to find the slice/priority boost which are currently active for each
> >>> lock, and use this information to boost the task slice/priority
> >>> accordingly.
> >
> > Why do we care about locks here? Note, I'm looking at using this same
> > feature for VMs on interrupt handlers. The only thing user space needs to
> > tell the kernel is "It's not a good time to preempt me, but it will be
> > shortly".
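To illustrate what I mean by that last statement, the user-space side would
look roughly like the sketch below. To be clear, the field and flag names are
made up purely for illustration and are not the actual interface of the patch;
it's just the shape of the protocol: flag the kernel before a short critical
section, and yield right after it if the kernel deferred a preemption for us.

#include <stdatomic.h>
#include <sched.h>

/* Hypothetical per-thread word shared with the kernel (e.g. in the rseq area). */
extern _Atomic unsigned int my_thread_slice_ctrl;

#define SLICE_EXTEND_REQUEST    0x1U    /* user: "not a good time to preempt me" */
#define SLICE_RESCHED_PENDING   0x2U    /* kernel: "I let you run, yield when done" */

static inline void crit_section_enter(void)
{
        atomic_fetch_or_explicit(&my_thread_slice_ctrl,
                                 SLICE_EXTEND_REQUEST, memory_order_relaxed);
}

static inline void crit_section_exit(void)
{
        unsigned int prev;

        prev = atomic_fetch_and_explicit(&my_thread_slice_ctrl,
                                         ~(SLICE_EXTEND_REQUEST | SLICE_RESCHED_PENDING),
                                         memory_order_relaxed);

        /* The kernel deferred a preemption for us; give the CPU back now. */
        if (prev & SLICE_RESCHED_PENDING)
                sched_yield();
}

Usage is then just bracketing the short critical section, whatever it is:

        crit_section_enter();
        /* ... the short held lock, or the vCPU interrupt handler work ... */
        crit_section_exit();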
> Quoting https://lore.kernel.org/lkml/20231024103426.4074d319@gandalf.local.home/
> 
> "The goal is to prevent a thread / vCPU being preempted while holding a lock
> or resource that other threads / vCPUs will want. That is, prevent
> contention, as that's usually the biggest issue with performance in user
> space and VMs."

I should have been more specific. Yes, locks are a major use case here, and
the one I brought up because it's the easiest to understand, but this can be
used for any case where there's a short critical section in which preemption
could be bad. I don't want to implement a policy that says this is only for
locking.

> 
> We care about locks here because this is in fact your own problem statement.
> If you want to consider the different problem of making VM interrupt handlers
> go fast, then you should state it up front. Those two distinct problems may
> or may not require entirely different solutions.

We'll soon know, as we'll be testing that too. Anyway, this patch is for the
"short critical section" case. It's not good for all locking. It's only good
for very short-held locks.

Going back to PREEMPT_RT, I did tests where I implemented NEED_RESCHED_LAZY
on all kernel mutexes (it's currently only used for the PREEMPT_RT spin locks
that turn into mutexes), and found that it made no real difference. Maybe a
little bit, but not enough to push it. The reason is that long-held locks are
either not heavily contended, or, if they are, the fact that they are held for
a long time means that keeping the owner running makes less of a difference
than letting it stop so something else can run. Of course, RT has priority
inheritance, so the PI isn't as much of an overhead compared to the time the
locks are held.

> 
> >>>
> >>> A scheme like this should allow lock priority inheritance without
> >>> requiring system calls on the userspace lock/unlock fast path.
> >
> > Priority inheritance doesn't make sense when everything is running.
> 
> I should have also said that this scheme should allow the lock owner to
> borrow scheduler time slices from waiters (in addition to PI).

As I mentioned, that sounds like proxy execution, which is something that's
been a work in progress for some time now.

Anyway, this patch is a very simple solution that can help a large number of
use cases, even if those use cases are very specific (highly contended, short
duration critical sections).

-- Steve