Date: Wed, 25 Oct 2023 08:54:34 -0400
From: Steven Rostedt
To: Peter Zijlstra
Cc: LKML, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm@kvack.org,
 x86@kernel.org, akpm@linux-foundation.org, luto@kernel.org, bp@alien8.de,
 dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
 juri.lelli@redhat.com, vincent.guittot@linaro.org, willy@infradead.org,
 mgorman@suse.de, jon.grimm@amd.com, bharata@amd.com, raghavendra.kt@amd.com,
 boris.ostrovsky@oracle.com, konrad.wilk@oracle.com, jgross@suse.com,
 andrew.cooper3@citrix.com, Joel Fernandes, Youssef Esmat, Vineeth Pillai,
 Suleiman Souhlal, Ingo Molnar, Daniel Bristot de Oliveira
Subject: Re: [POC][RFC][PATCH] sched: Extended Scheduler Time Slice
Message-ID: <20231025085434.35d5f9e0@gandalf.local.home>
In-Reply-To: <20231025102952.GG37471@noisy.programming.kicks-ass.net>
References: <20231025054219.1acaa3dd@gandalf.local.home>
 <20231025102952.GG37471@noisy.programming.kicks-ass.net>
Peter!

[ After watching Thomas and Paul reply to each other, I figured this is
  the new LKML greeting.
]

On Wed, 25 Oct 2023 12:29:52 +0200
Peter Zijlstra wrote:

> On Wed, Oct 25, 2023 at 05:42:19AM -0400, Steven Rostedt wrote:
>
> > That is, there's this structure for every thread. It's assigned with:
> >
> >	fd = open("/sys/kernel/extend_sched", O_RDWR);
> >	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
> >			  MAP_SHARED, fd, 0);
> >
> > I don't actually like this interface, as it wastes a full page for just
> > two bits :-p  Perhaps it should be a new system call, where it just
> > locks in existing memory from the user application? The requirement is
> > that each thread needs its own bits to play with. It should not be
> > shared with other threads. It could be, as it will not mess up the
> > kernel, but will mess up the application.
>
> What was wrong with using rseq?

I didn't want to overload that for something completely different. This is
not a "restartable sequence".

> > Anyway, to tell the kernel to "extend" the time slice if possible
> > because it's in a critical section, we have:
> >
> >	static void extend(void)
> >	{
> >		if (!extend_map)
> >			return;
> >
> >		extend_map->flags = 1;
> >	}
> >
> > And to say that it's done:
> >
> >	static void unextend(void)
> >	{
> >		unsigned long prev;
> >
> >		if (!extend_map)
> >			return;
> >
> >		prev = xchg(&extend_map->flags, 0);
> >		if (prev & 2)
> >			sched_yield();
> >	}
> >
> > So, bit 1 is for user space to tell the kernel "please extend me", and
> > bit two is for the kernel to tell user space "OK, I extended you, but
> > call sched_yield() when done".
>
> So what if it doesn't ? Can we kill it for not playing nice ?

No, it's no different than a system call running for a long time. You could
set this bit and leave it there for as long as you want, and it should not
affect anything.

What Thomas's PREEMPT_AUTO.patch does is set NEED_RESCHED_LAZY at the tick.
Without my patch, this will not schedule right away, but will schedule when
going into user space.
My patch will ignore the schedule if NEED_RESCHED_LAZY is set when going
into user space. With Thomas's patch, if a task is in the kernel for too
long, then on the next tick (if I read his code correctly), if
NEED_RESCHED_LAZY is still set, it will force the schedule. That is, you
get two ticks instead of one. I may have misread the code, but that's what
it looks like it does in update_deadline() in fair.c.

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches#n587

With my patch, the same thing happens, but in user space. This does not
give any more power to any task. I don't expect nor want this to be a
privileged operation. It's no different than running a long system call.
And EEVDF should even keep it fair: if you use an extra tick, it will go
against your eligibility for the next time around.

Note, NEED_RESCHED still schedules. If an RT or DL task were to wake up, it
will immediately preempt this task regardless of that bit being set.

> [ aside from it being bit 0 and bit 1 as you yourself point out, it is
> also jarring you use a numeral for one and write out the other. ]
>
> That said, I properly hate all these things, extending a slice doesn't
> reliably work and we're always left with people demanding an ever longer
> extension.

We could possibly make it adjustable. I'm guessing that will happen anyway
with Thomas's patch.

Anyway, my test shows that this makes a huge improvement for user-space
implemented spin locks, which I tailored after how PostgreSQL does its spin
locks. That is, this is a real-world use case. I plan to implement this in
PostgreSQL and see what improvements it makes in their tests. I also plan
on testing VMs.

> The *much* better heuristic is what the kernel uses, don't spin if the
> lock holder isn't running.

No it is not. That is a completely useless heuristic for this use case.
That's for waiters, and I would guess it would make no difference in my
test.
The point of this patch is to keep the lock holder running, not the waiter
spinning. The reason for the improvement in my test is that the lock was
always held for a very short time, and when the time slice came up while
the task was holding the lock, it was able to get it extended, release the
lock, and then schedule. Without my patch, you get several hundreds of
these:

 extend-sched-3773 [000]  9628.573272: print:        tracing_mark_write: Have lock!
 extend-sched-3773 [000]  9628.573278: sched_switch: extend-sched:3773 [120] R ==> mysqld:1216 [120]
       mysqld-1216 [000]  9628.573286: sched_switch: mysqld:1216 [120] S ==> extend-sched:3773 [120]
 extend-sched-3773 [000]  9628.573287: print:        tracing_mark_write: released lock!

[ Ironically, this example is preempted by mysqld ]

With my patch, there was only a single instance during the run.

When a lock holder schedules out, it greatly increases contention on that
lock. That's the entire reason Thomas implemented NEED_RESCHED_LAZY in the
first place: the aggressive preemption in PREEMPT_RT caused a lot more
contention on spin locks turned into mutexes. My patch does the exact same
thing for user-space implemented spin locks, which also includes spin locks
in VM kernels. Adaptive spin locks (spin on owner running) helped
PREEMPT_RT for waiters, but that did nothing to help the lock holder being
preempted, which is why NEED_RESCHED_LAZY was still needed even when the
kernel already had adaptive spinners.

The reason I've been told over the last few decades for why people
implement 100% user-space spin locks is that the overhead of going into the
kernel is way too high. Sleeping is much worse (but that is where the
adaptive spinning comes in, which is a separate issue).

Allowing user space to say "hey, give me a few more microseconds and I'm
fine being preempted" is a very good heuristic. And a way for the kernel to
say, "hey, I gave it to you, you better go into the kernel when you can,
otherwise I'll preempt you no matter what!"

-- Steve