From: Mateusz Guzik <mjguzik@gmail.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ankur Arora <ankur.a.arora@oracle.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, x86@kernel.org, akpm@linux-foundation.org,
	luto@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, willy@infradead.org, mgorman@suse.de,
	jon.grimm@amd.com, bharata@amd.com, raghavendra.kt@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
	jgross@suse.com, andrew.cooper3@citrix.com,
	Joel Fernandes <joel@joelfernandes.org>,
	Youssef Esmat <youssefesmat@chromium.org>,
	Vineeth Pillai <vineethrp@google.com>,
	Suleiman Souhlal <suleiman@google.com>,
	Ingo Molnar <mingo@kernel.org>,
	Daniel Bristot de Oliveira <bristot@kernel.org>
Subject: Re: [POC][RFC][PATCH] sched: Extended Scheduler Time Slice
Date: Wed, 25 Oct 2023 18:24:35 +0200	[thread overview]
Message-ID: <20231025162435.ibhdktcshhzltr3r@f> (raw)
In-Reply-To: <884e4603-4d29-41ae-8715-a070c43482c4@efficios.com>

On Wed, Oct 25, 2023 at 11:42:34AM -0400, Mathieu Desnoyers wrote:
> On 2023-10-25 10:31, Steven Rostedt wrote:
> > On Wed, 25 Oct 2023 15:55:45 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> 
> [...]
> 
> After digging lore for context, here are some thoughts about the actual
> proposal: AFAIU the intent here is to boost the scheduling slice for a
> userspace thread running with a mutex held so it can complete faster,
> and therefore reduce contention.
> 
> I suspect this is not completely unrelated to priority inheritance
> futexes, except that one goal stated by Steven is to increase the
> owner's slice without requiring a system call on the fast path.
> 
> Compared to PI futexes, I think Steven's proposal misses the part
> where a thread waiting on a futex boosts the lock owner's priority
> so it can complete faster. By making the lock owner selfishly claim
> that it needs a larger scheduling slice, it opens the door to
> scheduler disruption, and it's hard to come up with upper-bounds
> that work for all cases.
> 
> Hopefully I'm not oversimplifying if I state that we have mainly two
> actors to consider:
> 
> [A] the lock owner thread
> 
> [B] threads that block trying to acquire the lock
> 
> The fast path here is [A]. [B] can go through a system call; I don't
> think it matters at all.
> 
> So perhaps we can extend the rseq per-thread area with a field that
> implements a "held locks" list that allows [A] to let the kernel know
> that it is currently holding a set of locks (those can be chained when
> locks are nested). It would be updated on lock/unlock with just a few
> stores in userspace.
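
(For illustration, on the userspace side I imagine this would look roughly
like the snippet below. The struct layout, field names and the
HELD_LOCKS_MAX limit are made up, not part of any existing rseq ABI.)

#define HELD_LOCKS_MAX	8

/*
 * Hypothetical per-thread "held locks" area.  In the actual proposal this
 * would live in (an extension of) the rseq per-thread area so the kernel
 * can read it when it considers preempting the thread.
 */
struct held_locks {
	unsigned int nr;			/* locks currently held */
	unsigned long keys[HELD_LOCKS_MAX];	/* lock addresses, usable as keys */
};

static __thread struct held_locks held_locks;

/* Lock fast path: a couple of plain stores, no system call. */
static inline void held_locks_push(void *lock)
{
	if (held_locks.nr < HELD_LOCKS_MAX)
		held_locks.keys[held_locks.nr] = (unsigned long)lock;
	held_locks.nr++;	/* nesting deeper than the array is only counted */
}

/* Unlock fast path. */
static inline void held_locks_pop(void)
{
	held_locks.nr--;
}
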
> 
> Those lock addresses could then be used as keys for private locks,
> or transformed into inode/offset keys for shared-memory locks. Threads
> [B] blocking trying to acquire the lock can call a system call which
> would boost the lock owner's slice and/or priority for a given lock key.
> 
> When the scheduler preempts [A], it would check whether the rseq
> per-thread area has a "held locks" field set, use it to look up the
> slice/priority boosts currently active for each of those locks, and
> boost the task's slice/priority accordingly.
> 
> A scheme like this should allow lock priority inheritance without
> requiring system calls on the userspace lock/unlock fast path.
> 

I think both this proposal and the original in this thread are opening a
can of worms, and I don't think going down that road has been properly
justified. A proper justification would demonstrate a big enough(tm)
improvement over a locking primitive with adaptive spinning.

It is well known that what mostly shafts the performance of regular
userspace locking is all the nasty business of going off CPU to wait.

The original benchmark with slice extension disabled keeps the CPUs busy,
virtually guaranteeing these threads will keep getting preempted, some
of the time while holding the lock. Should that happen, all the other
threads which happened to get preempted actively waste time.

Adaptive spinning was already mentioned elsewhere in the thread, and the
idea itself is at least two decades old. If anything, I find it strange
it did not land years ago.

I see there is a preliminary patch by you which exports the state so
one can nicely spin without even going to the kernel:
https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
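
For reference, below is a minimal sketch of the acquire/release paths of
such an adaptive-spinning futex lock. owner_on_cpu() is a stand-in for
whatever mechanism (e.g. the state exported by the patch above) lets a
waiter cheaply tell whether the owner is currently running; the stub here
always says no, which degrades the lock to a plain futex.

#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct mutex {
	_Atomic int state;	/* 0 = unlocked, 1 = locked */
	pid_t owner;		/* advisory: tid of the current owner */
};

/* Stand-in: a real version would consult the exported scheduling state. */
static int owner_on_cpu(pid_t tid)
{
	(void)tid;
	return 0;
}

static void mutex_lock(struct mutex *m)
{
	for (;;) {
		int expected = 0;

		if (atomic_compare_exchange_strong(&m->state, &expected, 1)) {
			m->owner = syscall(SYS_gettid);
			return;
		}
		/* Owner is on CPU: keep spinning, it should finish soon. */
		if (owner_on_cpu(m->owner))
			continue;
		/* Owner is off CPU: go to sleep until unlock wakes us. */
		syscall(SYS_futex, &m->state, FUTEX_WAIT, 1, NULL, NULL, 0);
	}
}

static void mutex_unlock(struct mutex *m)
{
	m->owner = 0;
	atomic_store(&m->state, 0);
	/* Unconditional wake keeps the sketch simple. */
	syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
}

A real implementation would bound the spin and pause between iterations,
but the point stands: the waiter only goes off CPU when the owner itself
is off CPU, which is the expensive case the slice-extension patch is
trying to make rarer.
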

To be clear, I think a locking primitive which can do adaptive spinning
*and* futexes *and* not get preempted while holding locks is the fastest
option. What is not clear to me is whether it is sufficiently faster than
adaptive spinning and futexes alone.

tl;dr: perhaps someone(tm) could carry the above to a state where it can
be benchmarked against the original patch.

