Date: Wed, 25 Oct 2023 08:54:34 -0400
From: Steven Rostedt
To: Peter Zijlstra
Cc: LKML, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm@kvack.org,
 x86@kernel.org, akpm@linux-foundation.org, luto@kernel.org, bp@alien8.de,
 dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
 juri.lelli@redhat.com, vincent.guittot@linaro.org, willy@infradead.org,
 mgorman@suse.de, jon.grimm@amd.com, bharata@amd.com, raghavendra.kt@amd.com,
 boris.ostrovsky@oracle.com, konrad.wilk@oracle.com, jgross@suse.com,
 andrew.cooper3@citrix.com, Joel Fernandes, Youssef Esmat, Vineeth Pillai,
 Suleiman Souhlal, Ingo Molnar, Daniel Bristot de Oliveira
Subject: Re: [POC][RFC][PATCH] sched: Extended Scheduler Time Slice
Message-ID: <20231025085434.35d5f9e0@gandalf.local.home>
In-Reply-To: <20231025102952.GG37471@noisy.programming.kicks-ass.net>
References: <20231025054219.1acaa3dd@gandalf.local.home>
 <20231025102952.GG37471@noisy.programming.kicks-ass.net>
Peter!

[ After watching Thomas and Paul reply to each other, I figured this is
  the new LKML greeting.
]

On Wed, 25 Oct 2023 12:29:52 +0200
Peter Zijlstra wrote:

> On Wed, Oct 25, 2023 at 05:42:19AM -0400, Steven Rostedt wrote:
>
> > That is, there's this structure for every thread. It's assigned with:
> >
> >	fd = open("/sys/kernel/extend_sched", O_RDWR);
> >	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
> >			  MAP_SHARED, fd, 0);
> >
> > I don't actually like this interface, as it wastes a full page for just
> > two bits :-p  Perhaps it should be a new system call, where it just
> > locks in existing memory from the user application? The requirement is
> > that each thread needs its own bits to play with. It should not be
> > shared with other threads. It could be, as it will not mess up the
> > kernel, but will mess up the application.
>
> What was wrong with using rseq?

I didn't want to overload that for something completely different. This is
not a "restartable sequence".

> > Anyway, to tell the kernel to "extend" the time slice if possible
> > because it's in a critical section, we have:
> >
> >	static void extend(void)
> >	{
> >		if (!extend_map)
> >			return;
> >
> >		extend_map->flags = 1;
> >	}
> >
> > And to say that it's done:
> >
> >	static void unextend(void)
> >	{
> >		unsigned long prev;
> >
> >		if (!extend_map)
> >			return;
> >
> >		prev = xchg(&extend_map->flags, 0);
> >		if (prev & 2)
> >			sched_yield();
> >	}
> >
> > So, bit 1 is for user space to tell the kernel "please extend me", and
> > bit two is for the kernel to tell user space "OK, I extended you, but
> > call sched_yield() when done".
>
> So what if it doesn't ? Can we kill it for not playing nice ?

No, it's no different than a system call running for a long time. You could
set this bit and leave it there for as long as you want, and it should not
affect anything.

What Thomas's PREEMPT_AUTO.patch does is set NEED_RESCHED_LAZY at the tick.
Without my patch, this will not schedule right away, but will schedule when
going into user space.
My patch will ignore the schedule if NEED_RESCHED_LAZY is set when going
into user space. With Thomas's patch, if a task is in the kernel for too
long, then on the next tick (if I read his code correctly), if
NEED_RESCHED_LAZY is still set, it will force the schedule. That is, you
get two ticks instead of one. I may have misread the code, but that's what
it looks like it does in update_deadline() in fair.c.

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches#n587

With my patch, the same thing happens, but in user space. This does not
give any more power to any task. I don't expect nor want this to be a
privileged operation. It's no different than running a long system call.
And EEVDF should even keep it fair: if you use an extra tick, it will go
against your eligibility for the next time around.

Note, NEED_RESCHED still schedules. If an RT or DL task were to wake up, it
will immediately preempt this task regardless of that bit being set.

> [ aside from it being bit 0 and bit 1 as you yourself point out, it is
> also jarring you use a numeral for one and write out the other. ]
>
> That said, I properly hate all these things, extending a slice doesn't
> reliably work and we're always left with people demanding an ever longer
> extension.

We could possibly make it adjustable. I'm guessing that will happen anyway
with Thomas's patch.

Anyway, my test shows that this makes a huge improvement for user-space
implemented spin locks, which I tailored after how PostgreSQL does its spin
locks. That is, this is a real-world use case. I plan to implement this in
PostgreSQL and see what improvements it makes in their tests. I also plan
on testing VMs.

> The *much* better heuristic is what the kernel uses, don't spin if the
> lock holder isn't running.

No it is not. That is a completely useless heuristic for this use case.
That's for waiters, and I would guess it would make no difference in my
test.
The point of this patch is to keep the lock holder running, not the waiter
spinning. The reason for the improvement in my test is that the lock was
always held for a very short time, and when the time slice came up while
the task was holding the lock, it was able to get it extended, release the
lock, and then schedule. Without my patch, you get several hundreds of
these:

 extend-sched-3773 [000]  9628.573272: print:        tracing_mark_write: Have lock!
 extend-sched-3773 [000]  9628.573278: sched_switch: extend-sched:3773 [120] R ==> mysqld:1216 [120]
       mysqld-1216 [000]  9628.573286: sched_switch: mysqld:1216 [120] S ==> extend-sched:3773 [120]
 extend-sched-3773 [000]  9628.573287: print:        tracing_mark_write: released lock!

[ Ironically, this example is preempted by mysqld ]

With my patch, there was only a single instance during the run.

When a lock holder schedules out, it greatly increases contention on that
lock. That's the entire reason Thomas implemented NEED_RESCHED_LAZY in the
first place: the aggressive preemption in PREEMPT_RT caused a lot more
contention on spin locks turned into mutexes. My patch does the exact same
thing for user-space implemented spin locks, which also includes spin locks
in VM kernels. Adaptive spin locks (spin on owner running) helped
PREEMPT_RT for waiters, but that did nothing to help the lock holder being
preempted, which is why NEED_RESCHED_LAZY was still needed even when the
kernel already had adaptive spinners.

The reason I've been told over the last few decades for why people
implement 100% user-space spin locks is that the overhead of going into the
kernel is way too high. Sleeping is much worse (but that is where the
adaptive spinning comes in, which is a separate issue).

Allowing user space to say "hey, give me a few more microseconds and I'm
fine being preempted" is a very good heuristic. And a way for the kernel to
say, "hey, I gave it to you, you better go into the kernel when you can,
otherwise I'll preempt you no matter what!"

-- Steve