* [RFC][PATCH 0/2] sched: Extended Scheduler Time Slice revisited
@ 2025-01-31 22:58 Steven Rostedt
2025-01-31 22:58 ` [RFC][PATCH 1/2] sched: Extended scheduler time slice Steven Rostedt
2025-01-31 22:58 ` [RFC][PATCH 2/2] sched: Shorten time that tasks can extend their time slice for Steven Rostedt
0 siblings, 2 replies; 66+ messages in thread
From: Steven Rostedt @ 2025-01-31 22:58 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Thomas Gleixner, Peter Zijlstra, Ankur Arora, Linus Torvalds,
linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli,
vincent.guittot, willy, mgorman, jon.grimm, bharata,
raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal,
Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy,
daniel.wagner, joseph.salisbury, broonie
Extended scheduler time slice
Wow, it's been over a year since I posted my original POC of this patch
series[1]. But that was when PREEMPT_LAZY (NEED_RESCHED_LAZY) was
just being proposed, and this patch set depended on it. Now that
NEED_RESCHED_LAZY is part of mainline, it's time to revisit this proposal.
Quick recap:
PREEMPT_LAZY can be used to dynamically set the preemption model of the
kernel. To emulate the old server model, if the timer tick goes off
and wants to schedule out the currently running SCHED_OTHER task,
it will set NEED_RESCHED_LAZY. Instead of immediately scheduling,
it will only schedule when the current task is exiting to user space.
If the task runs in the kernel for over a tick, then NEED_RESCHED is
set which will force the task to schedule as soon as it is out of any
kernel critical section.
I wanted to give this feature to user space as well. It is a way for
user space to inform the kernel that it too is in a critical section
(perhaps implementing user space spin locks), and if it happens to run out
of its quota and the scheduler wants to schedule it out for another
SCHED_OTHER task, it can get a little more time to release its locks.
The patches use rseq to map a new "cr_counter" field to pass this
information to the kernel. Bit 0 is reserved for the kernel, and the other
31 bits tell the kernel that the task is in a critical section. If any of
those 31 bits are set, the kernel will try to extend the task's time slice
if it deems fit to do so (not guaranteed of course).
The 31 bits allow the task to implement a counter. Note that this rseq
memory is per thread, so it does not need to worry about racing with other
threads. But it does need to worry about racing with the kernel, so
incrementing or decrementing this value should be done with a single local
atomic instruction (like a simple addl or subl on x86, not a separate
read/modify/write sequence).
The counter lets user space add 2 to cr_counter (skipping over bit 0)
when it enters a critical section and subtract 2 when it leaves. If the
counter is then 1, it knows that the kernel extended its time slice, and it
should immediately schedule by making a system call (yield(), but any
system call will work). If it does not, the kernel will force a schedule on
the next scheduler tick.
The first patch implements this and gives the task one more tick, just like
the kernel does before it forces a schedule. But this means user space can
ask for anywhere from 1 extra millisecond up to 10 extra milliseconds,
depending on CONFIG_HZ.
The second patch is based on Peter Zijlstra's patch[2], which injects
a 50us scheduler tick when it extends the task's slice. This way the most
a task will get is 50us extra, which hopefully will not hurt other
tasks' latency. Note that, the way EEVDF works, the task gets penalized
by losing out on eligibility, so even if it does get a little more time,
it may be scheduled less often.
I removed "POC" as I no longer consider this simply a proof-of-concept.
But it is still an RFC, and the patches contain trace_printk()s
to verify that it is indeed working, as well as to analyze the results of
my tests. Those trace_printk()s will be removed if this gets accepted.
I wrote a program[3] to micro-benchmark this. The program creates one
thread per CPU; each thread grabs a user space spin lock, runs in a loop
for around 30us, and releases the spin lock. Then it goes to sleep for
"100 + cpu * 27" microseconds before it wakes up and tries again. This
staggers the different threads. Each of these threads is pinned to its
corresponding CPU.
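Those locking threads look roughly like the following hedged sketch. It is
not the actual extend-sched.c code; the spin lock, busy-loop calibration,
and iteration count are simplified stand-ins:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int lock;

static void spin_lock(void)   { while (atomic_exchange(&lock, 1)) ; }
static void spin_unlock(void) { atomic_store(&lock, 0); }

static void sleep_us(long us)
{
	struct timespec ts = { us / 1000000, (us % 1000000) * 1000 };
	nanosleep(&ts, NULL);
}

static void *lock_thread(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;

	/* Pin this thread to its corresponding CPU */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (int i = 0; i < 10; i++) {	/* the real test runs for 5 seconds */
		spin_lock();
		/* stand-in for the ~30us busy loop under the lock */
		for (volatile int j = 0; j < 1000; j++)
			;
		spin_unlock();
		sleep_us(100 + cpu * 27);	/* stagger the threads */
	}
	return NULL;
}
```

In the real program this is where cr_enter()/cr_exit() (or whatever the
rseq helpers are named) would bracket the spin_lock()/spin_unlock() pair.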
Five more threads are created per CPU, each doing the following:

	while (!data->done) {
		for (i = 0; i < 100; i++)
			wmb();
		do_sleep(10);
		rmb();
	}

Where do_sleep(10) sleeps for 10us. This causes a lot of scheduling
among SCHED_OTHER tasks.
This program keeps track of:
- Total number of loops iterated by the threads that are taking the lock
- The total time it waited to get a lock (and the average time)
- The number of times the lock was contended.
- The max time it waited to get a lock
- The max time it held the lock (and the average time it held a lock).
- It also keeps track of the number of times its time slice was extended
It takes two parameters:
-d disable using rseq to tell the kernel to extend the time slice
-w keep the extension set even while waiting for the lock
Note, -w is meaningless with -d, and was really added as an academic
exercise, as waiting for a critical section isn't necessarily a critical
section itself.
It runs for 5 seconds and then stops. So all numbers are for a 5 second
duration.
Without rseq enabled, we had:
for i in `seq 10` ; do ./extend-sched -d ; done
Finish up
Ran for 105278 times
Total wait time: 7.657703 (avg: 0.000072)
Total contention: 88661
Total extended: 0
max wait: 1410
max: 328 (avg: 43)
Finish up
Ran for 106703 times
Total wait time: 7.371958 (avg: 0.000069)
Total contention: 89252
Total extended: 0
max wait: 1822
max: 410 (avg: 42)
Finish up
Ran for 106679 times
Total wait time: 7.344924 (avg: 0.000068)
Total contention: 89003
Total extended: 0
max wait: 1499
max: 338 (avg: 42)
Finish up
Ran for 106512 times
Total wait time: 7.398154 (avg: 0.000069)
Total contention: 89323
Total extended: 0
max wait: 1231
max: 334 (avg: 42)
Finish up
Ran for 106686 times
Total wait time: 7.369875 (avg: 0.000069)
Total contention: 89141
Total extended: 0
max wait: 1606
max: 448 (avg: 42)
Finish up
Ran for 106291 times
Total wait time: 7.464811 (avg: 0.000070)
Total contention: 89244
Total extended: 0
max wait: 1727
max: 373 (avg: 42)
Finish up
Ran for 106230 times
Total wait time: 7.467716 (avg: 0.000070)
Total contention: 88950
Total extended: 0
max wait: 4084
max: 377 (avg: 42)
Finish up
Ran for 106699 times
Total wait time: 7.369399 (avg: 0.000069)
Total contention: 89085
Total extended: 0
max wait: 1415
max: 348 (avg: 42)
Finish up
Ran for 106648 times
Total wait time: 7.352611 (avg: 0.000068)
Total contention: 89202
Total extended: 0
max wait: 1177
max: 377 (avg: 42)
Finish up
Ran for 106532 times
Total wait time: 7.363098 (avg: 0.000069)
Total contention: 89009
Total extended: 0
max wait: 1454
max: 429 (avg: 42)
Now with a 50us slice extension with rseq:
for i in `seq 10` ; do ./extend-sched ; done
Finish up
Ran for 121185 times
Total wait time: 3.450114 (avg: 0.000028)
Total contention: 84405
Total extended: 19879
max wait: 652
max: 174 (avg: 32)
Finish up
Ran for 120842 times
Total wait time: 3.474066 (avg: 0.000028)
Total contention: 84338
Total extended: 20450
max wait: 487
max: 181 (avg: 32)
Finish up
Ran for 120814 times
Total wait time: 3.473712 (avg: 0.000028)
Total contention: 83938
Total extended: 20418
max wait: 631
max: 185 (avg: 32)
Finish up
Ran for 120918 times
Total wait time: 3.442310 (avg: 0.000028)
Total contention: 83921
Total extended: 20246
max wait: 511
max: 172 (avg: 32)
Finish up
Ran for 120685 times
Total wait time: 3.426023 (avg: 0.000028)
Total contention: 83327
Total extended: 20504
max wait: 488
max: 161 (avg: 32)
Finish up
Ran for 120873 times
Total wait time: 3.477329 (avg: 0.000028)
Total contention: 84139
Total extended: 20808
max wait: 551
max: 172 (avg: 32)
Finish up
Ran for 120667 times
Total wait time: 3.491623 (avg: 0.000028)
Total contention: 84004
Total extended: 20585
max wait: 554
max: 170 (avg: 32)
Finish up
Ran for 121595 times
Total wait time: 3.446635 (avg: 0.000028)
Total contention: 84568
Total extended: 20258
max wait: 543
max: 166 (avg: 32)
Finish up
Ran for 121729 times
Total wait time: 3.437635 (avg: 0.000028)
Total contention: 84825
Total extended: 20143
max wait: 497
max: 165 (avg: 32)
Finish up
Ran for 121545 times
Total wait time: 3.452991 (avg: 0.000028)
Total contention: 84583
Total extended: 20186
max wait: 578
max: 161 (avg: 32)
The averages of the 10 runs:
No extensions:
avg iterations: 106426
avg total wait: 7.416025 seconds
avg avg wait: 0.000069 seconds
contention: 89087
max wait: 1742 us
max: 376.2 us
avg max: 42 us
With rseq extension:
avg iterations: 121085
avg total wait: 3.457244 seconds
avg avg wait: 0.000028 seconds
contention: 84205
max wait: 549 us
max: 171 us
avg max: 32 us
extended: 20347
This shows that with a 50us extra time slice:

 It ran 14659 more iterations (+13.7%)
 It waited a total of 3.958781 seconds less (-53.3%)
 The average wait time was 41us less (-59.4%)
 It had 4882 fewer contentions (-5.4%)
 Its max wait time was 1193us less (-68.5%)
 Its max lock hold time was 205.2us less (-54.5%)
 And the average time it held the lock was 10us less (-23.8%)
After running the extend version, I looked at ftrace to see if
it hit the 50us max:
# trace-cmd show | grep force
extend-sched-29816 [000] dBH.. 76942.819849: rseq_delay_resched_tick: timeout -- force resched
extend-sched-29865 [000] dbh.. 76944.151878: rseq_delay_resched_tick: timeout -- force resched
extend-sched-29865 [000] dBh.. 76945.266837: rseq_delay_resched_tick: timeout -- force resched
extend-sched-29865 [000] dBh.. 76946.182833: rseq_delay_resched_tick: timeout -- force resched
It did so 4 times.
For kicks, here's the run with '-w', where the rseq extension stays set
even while the thread spins waiting for a lock. I would not recommend this
even if it does help. Why extend when it is safe to preempt?
for i in `seq 10` ; do ./extend-sched -w ; done
Finish up
Ran for 120111 times
Total wait time: 3.389300 (avg: 0.000028)
Total contention: 83406
Total extended: 23539
max wait: 438
max: 176 (avg: 32)
Finish up
Ran for 120241 times
Total wait time: 3.377985 (avg: 0.000028)
Total contention: 83586
Total extended: 23458
max wait: 453
max: 246 (avg: 32)
Finish up
Ran for 120140 times
Total wait time: 3.391172 (avg: 0.000028)
Total contention: 83234
Total extended: 23571
max wait: 446
max: 195 (avg: 32)
Finish up
Ran for 120100 times
Total wait time: 3.366652 (avg: 0.000028)
Total contention: 83088
Total extended: 23256
max wait: 2710
max: 2592 (avg: 32)
Finish up
Ran for 120373 times
Total wait time: 3.372495 (avg: 0.000028)
Total contention: 83405
Total extended: 23657
max wait: 460
max: 164 (avg: 32)
Finish up
Ran for 120332 times
Total wait time: 3.389414 (avg: 0.000028)
Total contention: 83752
Total extended: 23487
max wait: 498
max: 223 (avg: 32)
Finish up
Ran for 120411 times
Total wait time: 3.357409 (avg: 0.000027)
Total contention: 83371
Total extended: 23349
max wait: 423
max: 175 (avg: 32)
Finish up
Ran for 120258 times
Total wait time: 3.376960 (avg: 0.000028)
Total contention: 83595
Total extended: 23454
max wait: 385
max: 164 (avg: 32)
Finish up
Ran for 120407 times
Total wait time: 3.366934 (avg: 0.000027)
Total contention: 83649
Total extended: 23351
max wait: 446
max: 164 (avg: 32)
Finish up
Ran for 120397 times
Total wait time: 3.395540 (avg: 0.000028)
Total contention: 83859
Total extended: 23513
max wait: 469
max: 172 (avg: 32)
Again, I would not recommend this, as after running this I looked
at the trace again to see if it hit the max 50us (I did reset the buffer
before running), and I had this:
# trace-cmd show | grep force | wc -l
19697
[1] https://lore.kernel.org/all/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/all/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[3] https://rostedt.org/code/extend-sched.c
Steven Rostedt (2):
sched: Shorten time that tasks can extend their time slice for
sched: Extended scheduler time slice
----
include/linux/entry-common.h | 2 +
include/linux/sched.h | 19 +++++++++
include/uapi/linux/rseq.h | 24 +++++++++++
kernel/entry/common.c | 16 +++++++-
kernel/rseq.c | 96 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 16 ++++++++
kernel/sched/syscalls.c | 6 +++
7 files changed, 177 insertions(+), 2 deletions(-)
^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-01-31 22:58 [RFC][PATCH 0/2] sched: Extended Scheduler Time Slice revisited Steven Rostedt
@ 2025-01-31 22:58 ` Steven Rostedt
  2025-02-01 11:59   ` Peter Zijlstra
  2025-02-01 14:35   ` Mathieu Desnoyers
  2025-01-31 22:58 ` [RFC][PATCH 2/2] sched: Shorten time that tasks can extend their time slice for Steven Rostedt
  1 sibling, 2 replies; 66+ messages in thread
From: Steven Rostedt @ 2025-01-31 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Thomas Gleixner, Peter Zijlstra, Ankur Arora, Linus Torvalds,
	linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal,
	Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy,
	daniel.wagner, joseph.salisbury, broonie

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

This is to improve user space implemented spin locks or any critical
section. It may also be extended for VMs and their guest spin locks as
well, but that will come later.

This adds a new field in the struct rseq called cr_counter. This is a 32 bit
field where bit zero is a flag reserved for the kernel, and the other 31
bits can be used as a counter (although the kernel doesn't care how they
are used, as any bit set means the same).

This works in tandem with PREEMPT_LAZY, where a task can tell the kernel
via the rseq structure that it is in a critical section (like holding a
spin lock) that it will be leaving very shortly, and to ask the kernel to
not preempt it at the moment.

The way this works is, before going into a critical section, the user space
thread will increment the cr_counter by 2 (skipping bit zero, which is
reserved for the kernel).

If the task's time runs out and NEED_RESCHED_LAZY is set, on the way back
out to user space, instead of calling schedule, the kernel will allow user
space to continue to run. For the moment, it lets it run for one more tick
(which will be changed later).

When the kernel lets the thread have some extended time, it will set bit
zero of the rseq cr_counter, to inform the user thread that it was granted
extended time and that it should make a system call immediately after it
leaves its critical section.

When the user thread leaves the critical section, it decrements the counter
by 2, and if the counter equals 1, then it knows that the kernel extended
its time slice, and it will then make a system call to allow the kernel to
schedule it.

If NEED_RESCHED is set, then the rseq is ignored and the kernel will
schedule.

Note, incrementing and decrementing the counter by 2 is just one
implementation that user space can use. As stated, any bit set in the
cr_counter from bit 1 to 31 will cause the kernel to try to grant extra
time.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/sched.h     | 10 ++++++++++
 include/uapi/linux/rseq.h | 24 ++++++++++++++++++++++++
 kernel/entry/common.c     | 14 +++++++++++++-
 kernel/rseq.c             | 30 ++++++++++++++++++++++++++++++
 4 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64934e0830af..8e983d8cf72d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2206,6 +2206,16 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_RSEQ
+
+extern bool rseq_delay_resched(void);
+
+#else
+
+static inline bool rseq_delay_resched(void) { return false; }
+
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..185fe9826ff9 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,18 @@ enum rseq_cs_flags {
 	(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+enum rseq_cr_flags_bit {
+	RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT	= 0,
+};
+
+enum rseq_cr_flags {
+	RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED	=
+		(1U << RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT),
+};
+
+#define RSEQ_CR_FLAG_IN_CRITICAL_SECTION_MASK	\
+	(~RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED)
+
 /*
  * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
@@ -148,6 +160,18 @@ struct rseq {
 	 */
 	__u32 mm_cid;
 
+	/*
+	 * The cr_counter is a way for user space to inform the kernel that
+	 * it is in a critical section. If bits 1-31 are set, then the
+	 * kernel may grant the thread a bit more time (but there is no
+	 * guarantee of how much time or if it is granted at all). If the
+	 * kernel does grant the thread extra time, it will set bit 0 to
+	 * inform user space that it has granted the thread more time and that
+	 * user space should call yield() as soon as it leaves its critical
+	 * section.
+	 */
+	__u32 cr_counter;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e33691d5adf7..50e35f153bf8 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -90,6 +90,8 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 						     unsigned long ti_work)
 {
+	unsigned long ignore_mask = 0;
+
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
@@ -98,9 +100,18 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+		if (ti_work & _TIF_NEED_RESCHED) {
 			schedule();
 
+		} else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+			/* Allow to leave with NEED_RESCHED_LAZY still set */
+			if (rseq_delay_resched()) {
+				trace_printk("Avoid scheduling\n");
+				ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+			} else
+				schedule();
+		}
+
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
 
@@ -127,6 +138,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
+		ti_work &= ~ignore_mask;
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..b792e36a3550 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -339,6 +339,36 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+bool rseq_delay_resched(void)
+{
+	struct task_struct *t = current;
+	u32 flags;
+
+	if (!t->rseq)
+		return false;
+
+	/* Make sure the cr_counter exists */
+	if (current->rseq_len <= offsetof(struct rseq, cr_counter))
+		return false;
+
+	/* If this were to fault, it would likely cause a schedule anyway */
+	if (copy_from_user_nofault(&flags, &t->rseq->cr_counter, sizeof(flags)))
+		return false;
+
+	if (!(flags & RSEQ_CR_FLAG_IN_CRITICAL_SECTION_MASK))
+		return false;
+
+	trace_printk("Extend time slice\n");
+	flags |= RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED;
+
+	if (copy_to_user_nofault(&t->rseq->cr_counter, &flags, sizeof(flags))) {
+		trace_printk("Faulted writing rseq\n");
+		return false;
+	}
+
+	return true;
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 /*
-- 
2.45.2

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-01-31 22:58 ` [RFC][PATCH 1/2] sched: Extended scheduler time slice Steven Rostedt
@ 2025-02-01 11:59   ` Peter Zijlstra
  2025-02-01 12:47     ` Steven Rostedt
  2025-02-01 14:35   ` Mathieu Desnoyers
  1 sibling, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2025-02-01 11:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora,
	Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal,
	Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy,
	daniel.wagner, joseph.salisbury, broonie

On Fri, Jan 31, 2025 at 05:58:38PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> This is to improve user space implemented spin locks or any critical
> section. It may also be extended for VMs and their guest spin locks as
> well, but that will come later.
> 
> This adds a new field in the struct rseq called cr_counter. This is a 32 bit
> field where bit zero is a flag reserved for the kernel, and the other 31
> bits can be used as a counter (although the kernel doesn't care how they
> are used, as any bit set means the same).
> 
> This works in tandem with PREEMPT_LAZY, where a task can tell the kernel
> via the rseq structure that it is in a critical section (like holding a
> spin lock) that it will be leaving very shortly, and to ask the kernel to
> not preempt it at the moment.

I still have full hate for this approach.

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-01 11:59 ` Peter Zijlstra
@ 2025-02-01 12:47   ` Steven Rostedt
  2025-02-01 18:11     ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Steven Rostedt @ 2025-02-01 12:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora,
	Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal,
	Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy,
	daniel.wagner, joseph.salisbury, broonie

On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:

>I still have full hate for this approach.

So what approach would you prefer?

-- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-01 12:47 ` Steven Rostedt
@ 2025-02-01 18:11   ` Peter Zijlstra
  2025-02-01 23:06     ` Steven Rostedt
  2025-02-04 22:44     ` Prakash Sangappa
  0 siblings, 2 replies; 66+ messages in thread
From: Peter Zijlstra @ 2025-02-01 18:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora,
	Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal,
	Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy,
	daniel.wagner, joseph.salisbury, broonie

On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
> 
> On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> >I still have full hate for this approach.
> 
> So what approach would you prefer?

The one that does not rely on the preemption method -- I think I posted
something along those lines, and someone else recently reposted something
based on it.

Tying things to the preemption method is absurdly bad design -- and I've
told you that before.

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-01 18:11 ` Peter Zijlstra
@ 2025-02-01 23:06   ` Steven Rostedt
  2025-02-03  8:43     ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Steven Rostedt @ 2025-02-01 23:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora,
	Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal,
	Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy,
	daniel.wagner, joseph.salisbury, broonie

On Sat, 1 Feb 2025 19:11:29 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
> > 
> > On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > >I still have full hate for this approach.
> > 
> > So what approach would you prefer?
> 
> The one that does not rely on the preemption method -- I think I posted
> something along those lines, and someone else recently reposted something
> based on it.
> 
> Tying things to the preemption method is absurdly bad design -- and I've
> told you that before.

How exactly is it "bad design"? Changing the preemption method itself
changes the way applications schedule and can be very noticeable to the
applications themselves. With no preempt, applications will have high
latency every time any application does a system call. Preempt voluntary
is a little more reactive, but more randomly done.

The preempt lazy kconfig has:

  This option provides a scheduler driven preemption model that is
  fundamentally similar to full preemption, but is less eager to
  preempt SCHED_NORMAL tasks in an attempt to reduce lock holder
  preemption and recover some of the performance gains seen from using
  Voluntary preemption.

This could be a config option called PREEMPT_USER_LAZY that extends the
"reduce lock holder preemption" to user space spin locks. But if your
issue is with relying on the preemption method, does that mean you prefer
to have this feature for any preemption method? That may require still
using the LAZY flag that can cause a schedule in the kernel but not in
user space.

Note, my group is actually more interested in implementing this for VMs.
But that requires another level of redirection of the pointers. That is,
qemu could create a device that shares memory between the guest kernel and
the qemu VCPU thread. The guest kernel could update the counter in this
shared memory before grabbing a raw_spin_lock, which would act like this
patch set does.

The difference would be that the counter would need to live in a memory
page that only has this information in it and not the rseq structure
itself. Mathieu was concerned about leaks and corruption in the rseq
structure by a malicious guest. Thus, the counter would have to be in a
clean memory page that is shared between the guest and the qemu thread.
The rseq would then have a pointer to this memory, and the host kernel
would then have to traverse that pointer to the location of the counter.

In other words, my real goal is to have this working for guests and their
raw_spin_locks. We first tried to do this in KVM directly, but the KVM
maintainers said this is more a generic scheduling issue and doesn't
belong in KVM. I agreed with them.

-- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-01 23:06 ` Steven Rostedt
@ 2025-02-03  8:43   ` Peter Zijlstra
  2025-02-03  8:53     ` Peter Zijlstra
  2025-02-03 16:45     ` Steven Rostedt
  0 siblings, 2 replies; 66+ messages in thread
From: Peter Zijlstra @ 2025-02-03  8:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora,
	Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal,
	Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy,
	daniel.wagner, joseph.salisbury, broonie

On Sat, Feb 01, 2025 at 06:06:17PM -0500, Steven Rostedt wrote:
> On Sat, 1 Feb 2025 19:11:29 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Tying things to the preemption method is absurdly bad design -- and I've
> > told you that before.
> 
> How exactly is it "bad design"? Changing the preemption method itself
> changes the way applications schedule and can be very noticeable to the
> applications themselves.

Lazy is not the default, nor even the recommended preemption method at
this time.

Lazy will not ever be the only preemption method, full isn't going
anywhere.

Lazy only applies to fair (and whatever bpf things end up using
resched_curr_lazy()).

Lazy works on tick granularity, which is variable per the HZ config, and
way too long for any of this nonsense.

So by tying this to lazy, you get something that doesn't actually work
most of the time, and when it works, it has variable and bad behaviour.

So yeah, crap.

This really isn't difficult to understand, and I've told you this
before.

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-03  8:43 ` Peter Zijlstra
@ 2025-02-03  8:53   ` Peter Zijlstra
  0 siblings, 0 replies; 66+ messages in thread
From: Peter Zijlstra @ 2025-02-03  8:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora,
	Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa,
	juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross,
	andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal,
	Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy,
	daniel.wagner, joseph.salisbury, broonie

On Mon, Feb 03, 2025 at 09:43:06AM +0100, Peter Zijlstra wrote:

> Lazy is not the default, nor even the recommended preemption method at
> this time.

Just to clarify this for people reading along; lazy is there for people
to identify performance 'issues' vs voluntary such that those can be
addressed. I'm not at all sure who is spending time on it, but it will
probably take a while.

Once it has been determined lazy is good enough -- as indicated by the
distros having picked it as a default, we can look at deprecating and
removing voluntary (and eventually none).

Also, RT likes it :-)

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-03 8:43 ` Peter Zijlstra 2025-02-03 8:53 ` Peter Zijlstra @ 2025-02-03 16:45 ` Steven Rostedt 2025-02-04 3:28 ` Suleiman Souhlal 2025-02-04 9:16 ` Peter Zijlstra 1 sibling, 2 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-03 16:45 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Mon, 3 Feb 2025 09:43:06 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > Lazy is not the default, nor even the recommended preemption method at > this time. That's OK. If it is considered to be the default in the future, this can wait. > > Lazy will not ever be the only preemption method, full isn't going > anywhere. That's fine too, as full preemption has the same issue of preempting kernel mutexes. Full preemption is for something that likely doesn't want this feature anyway. > > Lazy only applies to fair (and whatever bpf things end up using > resched_curr_lazy()). Is that a problem? User spin locks for RT tasks are very dangerous. If an RT task preempts the owner that is of lower priority, it can cause a deadlock (if the two tasks are pinned to the same CPU). Which BTW, Sebastion mentioned in the Stable RT meeting that glibc supplies a pthread_spin_lock() and doesn't have in the man page anything about this possible scenario. > > Lazy works on tick granularity, which is variable per the HZ config, and > way too long for any of this nonsense. Patch 2 changes that to do what you wrote the last time. It has a max wait time of 50us. 
> So by tying this to lazy, you get something that doesn't actually work
> most of the time, and when it works, it has variable and bad behaviour.

Um no. If we wait for lazy to become the default behavior, it will work most of the time. And when it does work, it has strict behavior of 50us.

> So yeah, crap.

As your rationale was not correct, I will disagree with this being crap.

> This really isn't difficult to understand, and I've told you this
> before.

And I listened to what you told me before. Patch 2 implements the 50us max that you suggested. I separated it out because it made the code simpler to understand and debug. The change log even mentioned:

  For the moment, it lets it run for one more tick (which will be
  changed later).

That "changed later" is the second patch in this series.

As for the "this can wait until lazy is default" comment, that is because we have an "upstream first" policy. As long as there is some buy-in to the changes, we can go ahead and implement it on our devices. We do not have to wait for it to be accepted. But if there's a strong NAK to the idea, it is much harder to get it implemented internally.

I would also implement a way for user space to know if it is supported or not. Perhaps have the cr_counter of the rseq initialized to some value that tells user space this is supported in the current configuration of the kernel? This would ensure there are "no surprises".

Our current use case is actually for VMs, which requires a slightly different method. Instead of having the cr_counter that is used for telling the kernel the task is in a critical section, the rseq would contain a pointer to some user space memory that has that counter. The reason is that this memory would need to be mapped between the VM guest kernel and the VM VCPU emulation thread. Mathieu did not want to allow exposure of the VM VCPU thread's rseq structure to the VM guest kernel. Having a separate memory map for that is more secure.
Then the raw spin locks of the guest VM kernel could be implemented using this method as well. We do find performance issues when a VCPU of a guest kernel is preempted while holding spin locks. We can focus more on this VM use case, and we could then give better benchmarks. But again, this depends on whether or not you intend on NAKing this approach altogether.

-- Steve

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-03 16:45 ` Steven Rostedt @ 2025-02-04 3:28 ` Suleiman Souhlal 2025-02-04 3:57 ` Steven Rostedt 2025-02-04 9:16 ` Peter Zijlstra 1 sibling, 1 reply; 66+ messages in thread From: Suleiman Souhlal @ 2025-02-04 3:28 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Tue, Feb 4, 2025 at 1:45 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Mon, 3 Feb 2025 09:43:06 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > > Lazy is not the default, nor even the recommended preemption method at > > this time. > > That's OK. If it is considered to be the default in the future, this can > wait. > > > > > Lazy will not ever be the only preemption method, full isn't going > > anywhere. > > That's fine too, as full preemption has the same issue of preempting > kernel mutexes. Full preemption is for something that likely doesn't want > this feature anyway. > > > > > Lazy only applies to fair (and whatever bpf things end up using > > resched_curr_lazy()). > > Is that a problem? User spin locks for RT tasks are very dangerous. If an > RT task preempts the owner that is of lower priority, it can cause a > deadlock (if the two tasks are pinned to the same CPU). Which BTW, > Sebastion mentioned in the Stable RT meeting that glibc supplies a > pthread_spin_lock() and doesn't have in the man page anything about this > possible scenario. > > > > > Lazy works on tick granularity, which is variable per the HZ config, and > > way too long for any of this nonsense. 
> > Patch 2 changes that to do what you wrote the last time. It has a max wait > time of 50us. > > > > > So by tying this to lazy, you get something that doesn't actually work > > most of the time, and when it works, it has variable and bad behaviour. > > Um no. If we wait for lazy to become the default behavior, it will work > most of the time. And when it does work, it has strict behavior of 50us. > > > > > So yeah, crap. > > As your rationale was not correct, I will disagree with this being crap. > > > > > > This really isn't difficult to understand, and I've told you this > > before. > > And I listened to what you told me before. Patch 2 implements the 50us max > that you suggested. I separated it out because it made the code simpler to > understand and debug. The change log even mentioned: > > For the moment, it lets it run for one more tick (which will be > changed later). > > That "changed later" is the second patch in this series. > > With the "this can wait until lazy is default", is because we have an > "upstream first" policy. As long as there is some buy-in to the changes, we > can go ahead and implement it on our devices. We do not have to wait for it > to be accepted. But if there's a strong NAK to the idea, it is much harder > to get it implemented internally. Can you explain why this approach requires PREEMPT_LAZY? Could exit_to_user_mode_loop() be changed to something like the following (with maybe some provision to only do it once)? if ((ti_work & _TIF_NEED_RESCHED) && !rseq_delay_resched()) schedule(); I suppose there would also need to be some additional changes to make sure full preemption also doesn't preempt, maybe in preempt_schedule*(). -- Suleiman ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-04 3:28 ` Suleiman Souhlal @ 2025-02-04 3:57 ` Steven Rostedt 0 siblings, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-04 3:57 UTC (permalink / raw) To: Suleiman Souhlal Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Tue, 4 Feb 2025 12:28:41 +0900 Suleiman Souhlal <suleiman@google.com> wrote: > Can you explain why this approach requires PREEMPT_LAZY? > > Could exit_to_user_mode_loop() be changed to something like the > following (with maybe some provision to only do it once)? > > if ((ti_work & _TIF_NEED_RESCHED) && !rseq_delay_resched()) > schedule(); The main reason is that we need to differentiate a preemption based on a SCHED_OTHER scheduling tick, and an RT task waking up. We should not delay any RT tasks ever. If PREEMPT_LAZY becomes default, IIUC then even the old "server" version will have RT tasks preempt tasks within the kernel without waiting for another tick. Currently, the only way to differentiate between a SCHED_OTHER scheduler tick preemption and an RT task waking up is with the NEED_RESCHED_LAZY vs NEED_RESCHED. Now, if we wanted to (and I'm not sure we do), we could add another way to differentiate the two and still allow this to work. > > I suppose there would also need to be some additional changes to make > sure full preemption also doesn't preempt, maybe in > preempt_schedule*(). Which may be quite difficult as the cr_counter is in user space and can only be read from a user space faultable context. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-03 16:45 ` Steven Rostedt 2025-02-04 3:28 ` Suleiman Souhlal @ 2025-02-04 9:16 ` Peter Zijlstra 2025-02-04 12:51 ` Steven Rostedt 1 sibling, 1 reply; 66+ messages in thread From: Peter Zijlstra @ 2025-02-04 9:16 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Mon, Feb 03, 2025 at 11:45:37AM -0500, Steven Rostedt wrote: > > Lazy only applies to fair (and whatever bpf things end up using > > resched_curr_lazy()). > > Is that a problem? User spin locks for RT tasks are very dangerous. If an > RT task preempts the owner that is of lower priority, it can cause a > deadlock (if the two tasks are pinned to the same CPU). Which BTW, > Sebastion mentioned in the Stable RT meeting that glibc supplies a > pthread_spin_lock() and doesn't have in the man page anything about this > possible scenario. Yeah, we've known that for at least a decade if not longer. That's not new. Traditionally glibc people haven't been very RT minded -- the whole condvar thing comes to mind as well. And yes, you can still use the whole 'delay preemption' hint for RT tasks just fine. Spinlocks isn't the only thing. It can be used to make any RSEQ section more likely to succeed. > Patch 2 changes that to do what you wrote the last time. It has a max wait > time of 50us. I'm so confused, WTF do you then need the lazy crap? You're making things needlessly complicated again. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-04 9:16 ` Peter Zijlstra @ 2025-02-04 12:51 ` Steven Rostedt 2025-02-04 13:16 ` Steven Rostedt 2025-02-04 15:30 ` Peter Zijlstra 0 siblings, 2 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-04 12:51 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Tue, 4 Feb 2025 10:16:13 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > And yes, you can still use the whole 'delay preemption' hint for RT > tasks just fine. Spinlocks isn't the only thing. It can be used to make > any RSEQ section more likely to succeed. > > > > Patch 2 changes that to do what you wrote the last time. It has a max wait > > time of 50us. > > I'm so confused, WTF do you then need the lazy crap? > > You're making things needlessly complicated again. Do we really want to delay an RT task by 50us? That will cause a lot more regressions. This is a performance feature not a latency one. RT tasks are about limiting latency and will sacrifice performance to do so. PREEMPT_RT has great minimal latency at the expense of performance. We try to improve performance but never at the sacrifice of latency. RT wakeups are also more predictable. That is, they happen when an event comes in or a timer expires and when an RT task wakes up it is to run ASAP to handle some event. This is about SCHED_OTHER tasks where the scheduler decides when a task gets to run and will preempt it when its quota is over. The task has no idea when that will happen. 
This is about giving the kernel a hint that it's a bad time to interrupt the task, and that if it can just wait another 50us or less, then it would be fine to schedule. SCHED_OTHER tasks never have latency requirements less than a millisecond. And SCHED_OTHER tasks are affected by other SCHED_OTHER tasks, even those that are lower priority.

My patches here were based on where NEED_RESCHED_LAZY came from, and that was from the PREEMPT_RT patch. The problem was that the kernel would allow preemption almost everywhere. This was great for RT tasks, but non-RT tasks suffered performance issues. That's because of the timer tick going off while a SCHED_OTHER task was holding a spin_lock converted into a mutex, and it would be scheduled out while holding that sleeping spin_lock. This increased the amount of contention on that spin_lock, and that affected performance. The fix was to introduce NEED_RESCHED_LAZY, which would not have the SCHED_OTHER task preempt while holding a sleeping spin_lock. This put the performance back close to non-PREEMPT_RT.

This work I'm doing is based on that. It doesn't make sense to delay RT tasks for a performance improvement.

-- Steve

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-04 12:51 ` Steven Rostedt @ 2025-02-04 13:16 ` Steven Rostedt 2025-02-04 15:05 ` Steven Rostedt 2025-02-04 15:30 ` Peter Zijlstra 1 sibling, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-04 13:16 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Tue, 4 Feb 2025 07:51:00 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > > I'm so confused, WTF do you then need the lazy crap? IOW, the "lazy crap" was created to solve this very issue. The holding of sleeping spin locks interrupted by a scheduler tick. I'm just giving user space the same feature that we gave the kernel in PREEMPT_RT. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-04 13:16 ` Steven Rostedt @ 2025-02-04 15:05 ` Steven Rostedt 0 siblings, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-04 15:05 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie

On Tue, 4 Feb 2025 08:16:53 -0500 Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 4 Feb 2025 07:51:00 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > I'm so confused, WTF do you then need the lazy crap?
>
> IOW, the "lazy crap" was created to solve this very issue. The holding of
> sleeping spin locks interrupted by a scheduler tick. I'm just giving user
> space the same feature that we gave the kernel in PREEMPT_RT.

Also, I believe it is best to follow the current preemption method, and that's what NEED_RESCHED_LAZY gives us.

Let's say you have a low priority program, maybe even malicious, that goes into a loop of calling a system call that can run for almost a millisecond without sleeping.

In PREEMPT_NONE, this low priority program can cause RT tasks a latency of a millisecond, because if an RT task wakes up just as the program enters the system call, it will have to wait for the program to exit that system call before it can run, which might be close to that millisecond.

For PREEMPT_VOLUNTARY, the program will only be preempted when it hits a might_sleep() or cond_resched() (as would PREEMPT_NONE on the cond_resched(), but we want to get rid of those).

For PREEMPT_FULL, the program shouldn't affect any other task, because its system call will simply be preempted.
Now let's look at this new feature. It allows a task to ask for some extended time to get out of a critical section if possible. If we decide in the future to replace PREEMPT_NONE and PREEMPT_VOLUNTARY with a dynamic type like:

TYPE       | Sched Tick     | RT Wakeup   | Enter user space   |
===========+================+=============+====================+
None       | Set LAZY       | Set LAZY    | schedule           |
-----------+----------------+-------------+--------------------+
Voluntary? | Set LAZY       | schedule    | schedule           |
-----------+----------------+-------------+--------------------+
Full       | schedule       | schedule    | schedule           |
-----------+----------------+-------------+--------------------+

(The "Enter user space" is when a NEED_RESCHED is set)

Where in NONE, the LAZY flag is set for both the sched tick and the RT wakeup, and it doesn't schedule until it hits user space. In "Voluntary", the LAZY flag is set only for the sched tick on SCHED_OTHER tasks, but RT tasks will get to be scheduled immediately (depending on preempt_disable of course). With "Full" it will schedule whenever it can.

With that task that calls that long system call, which method type above is in place determines the latency of other tasks.

Now, if we add this feature, I want it to behave the same as a long system call. That is, it would only extend the time if a long system call would extend the time, as that means it wouldn't modify the typical behavior of the system for other tasks, but it would help the performance of the task that is requesting this feature.

With this feature:

TYPE       | Sched Tick     | RT Wakeup   | Enter user space      |
===========+================+=============+=======================+
None       | Set LAZY       | Set LAZY    | schedule if !LAZY     |
-----------+----------------+-------------+-----------------------+
Voluntary? | Set LAZY       | schedule    | schedule if !LAZY     |
-----------+----------------+-------------+-----------------------+
Full       | schedule       | schedule    | schedule              |
-----------+----------------+-------------+-----------------------+

Thus, in NONE, it would likely get to extend its time just as if it had called a long system call. This can include even making RT tasks wait a little longer, just like they would wait on a system call.

In "Voluntary", it would only get its timeslice extended if it was another SCHED_OTHER task that is to be scheduled. But if an RT task would wake up, it would schedule immediately, regardless of whether an extended time slice was requested or not.

In "Full", it probably makes sense to simply disable this feature (the program would see that it is disabled when it registers the rseq), as it would never get its time slice extended, since a system call would be preempted immediately if it was interrupted.

So back to your question about why I'm tying this to the "lazy crap": it is because I want the behavior of other tasks to not change due to one task asking for an extended time slice.

-- Steve

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-04 12:51 ` Steven Rostedt 2025-02-04 13:16 ` Steven Rostedt @ 2025-02-04 15:30 ` Peter Zijlstra 2025-02-04 16:11 ` Steven Rostedt 1 sibling, 1 reply; 66+ messages in thread From: Peter Zijlstra @ 2025-02-04 15:30 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Tue, Feb 04, 2025 at 07:51:00AM -0500, Steven Rostedt wrote: > On Tue, 4 Feb 2025 10:16:13 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > > And yes, you can still use the whole 'delay preemption' hint for RT > > tasks just fine. Spinlocks isn't the only thing. It can be used to make > > any RSEQ section more likely to succeed. > > > > > > > Patch 2 changes that to do what you wrote the last time. It has a max wait > > > time of 50us. > > > > I'm so confused, WTF do you then need the lazy crap? > > > > You're making things needlessly complicated again. > > Do we really want to delay an RT task by 50us? If you go back and reread that initial thread, you'll find the 50us is below the scheduling latency that random test box already had. I'm sure more modern systems will have a lower number, and slower systems will have a larger number, but we got to pick a number :/ I'm fine with making it 20us. Or whatever. Its just a stupid number. But yes. If we're going to be doing this, there is absolutely no reason not to allow DEADLINE/FIFO threads the same. Misbehaving FIFO is already a problem, and we can make DL-CBS enforcement punch through it if we have to. And less retries on the RSEQ for FIFO can equally improve performance. 
There is no difference between a 'malicious/broken' userspace consuming the entire window in userspace (50us, 20us whatever it will be) and doing a system call which we know will cause similar delays because it does in-kernel locking. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-04 15:30 ` Peter Zijlstra @ 2025-02-04 16:11 ` Steven Rostedt 2025-02-05 9:07 ` Peter Zijlstra 0 siblings, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-04 16:11 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Tue, 4 Feb 2025 16:30:53 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > If you go back and reread that initial thread, you'll find the 50us is > below the scheduling latency that random test box already had. > > I'm sure more modern systems will have a lower number, and slower > systems will have a larger number, but we got to pick a number :/ > > I'm fine with making it 20us. Or whatever. Its just a stupid number. > > But yes. If we're going to be doing this, there is absolutely no reason > not to allow DEADLINE/FIFO threads the same. Misbehaving FIFO is already > a problem, and we can make DL-CBS enforcement punch through it if we > have to. > > And less retries on the RSEQ for FIFO can equally improve performance. > > There is no difference between a 'malicious/broken' userspace consuming > the entire window in userspace (50us, 20us whatever it will be) and > doing a system call which we know will cause similar delays because it > does in-kernel locking. This is where we will disagree for the reasons I explained in my second email. This feature affects other tasks. And no, making it 20us doesn't make it better. 
Because from what I get from you, if we implement this, it will be available for all preemption methods (including PREEMPT_RT), where we do have less than 50us latency, and even a 20us delay will break those applications.

This was supposed to be only a hint to the kernel, not a complete feature that is hard coded and will override how other tasks behave. As system calls themselves can affect how things are scheduled depending on the preemption method, I didn't want to add something that changes how things are scheduled while ignoring the preemption method that was chosen.

-- Steve

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-04 16:11 ` Steven Rostedt @ 2025-02-05 9:07 ` Peter Zijlstra 2025-02-05 13:10 ` Steven Rostedt 0 siblings, 1 reply; 66+ messages in thread From: Peter Zijlstra @ 2025-02-05 9:07 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Tue, Feb 04, 2025 at 11:11:19AM -0500, Steven Rostedt wrote: > On Tue, 4 Feb 2025 16:30:53 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > > If you go back and reread that initial thread, you'll find the 50us is > > below the scheduling latency that random test box already had. > > > > I'm sure more modern systems will have a lower number, and slower > > systems will have a larger number, but we got to pick a number :/ > > > > I'm fine with making it 20us. Or whatever. Its just a stupid number. > > > > But yes. If we're going to be doing this, there is absolutely no reason > > not to allow DEADLINE/FIFO threads the same. Misbehaving FIFO is already > > a problem, and we can make DL-CBS enforcement punch through it if we > > have to. > > > > And less retries on the RSEQ for FIFO can equally improve performance. > > > > There is no difference between a 'malicious/broken' userspace consuming > > the entire window in userspace (50us, 20us whatever it will be) and > > doing a system call which we know will cause similar delays because it > > does in-kernel locking. > > This is where we will disagree for the reasons I explained in my second > email. This feature affects other tasks. And no, making it 20us doesn't > make it better. 
Because from what I get from you, if we implement this, it > will be available for all preemption methods (including PREEMPT_RT), where > we do have less than 50us latency, and and even a 20us will break those > applications. Then pick another number; RT too has a max scheduling latency number (on some random hardware). If you stay below that, all is fine. > This was supposed to be only a hint to the kernel, not a complete feature That's a contradiction in terms -- even a hint is a feature. > that is hard coded and will override how other tasks behave. Everything has some effect. My point is that if you limit this effect to be less than what it can already effect, you're not making things worse. > As system > calls themselves can make how things are scheduled depending on the > preemption method, What? > I didn't want to add something that will change how > things are scheduled that ignores the preemption method that was chosen. Userspace is totally oblivious to the preemption method chosen, and it damn well should be. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-05 9:07 ` Peter Zijlstra @ 2025-02-05 13:10 ` Steven Rostedt 2025-02-05 13:44 ` Steven Rostedt 0 siblings, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-05 13:10 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Wed, 5 Feb 2025 10:07:36 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > > This is where we will disagree for the reasons I explained in my second > > email. This feature affects other tasks. And no, making it 20us doesn't > > make it better. Because from what I get from you, if we implement this, it > > will be available for all preemption methods (including PREEMPT_RT), where > > we do have less than 50us latency, and and even a 20us will break those > > applications. > > Then pick another number; RT too has a max scheduling latency number (on > some random hardware). If you stay below that, all is fine. So we set it to 1us? Or does this have to calculate what that latency number is for each random hardware? > > > This was supposed to be only a hint to the kernel, not a complete feature > > That's a contradiction in terms -- even a hint is a feature. Yes, a hint is a feature, I meant "complete feature" meaning it being not just a hint, but guaranteed to do something. > > > that is hard coded and will override how other tasks behave. > > Everything has some effect. My point is that if you limit this effect to > be less than what it can already effect, you're not making things worse. 
> > As system calls themselves can affect how things are scheduled
> > depending on the preemption method,
>
> What?

Read my last email:

https://lore.kernel.org/all/20250204100555.1a641b9b@gandalf.local.home/

I went into this in detail.

> > I didn't want to add something that will change how things are
> > scheduled that ignores the preemption method that was chosen.
>
> Userspace is totally oblivious to the preemption method chosen, and it
> damn well should be.

Agreed, and user space doesn't have to know what preemption method was chosen for this. Where does this say that it needs to know? All this does is to give the kernel a hint that it is in a critical section, and the kernel decides to grant some more time or not. The preemption method will influence that decision, but user space doesn't need to know.

-- Steve

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-05 13:10 ` Steven Rostedt @ 2025-02-05 13:44 ` Steven Rostedt 0 siblings, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-05 13:44 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Wed, 5 Feb 2025 08:10:19 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > The preemption method will influence that decision, but user space doesn't > need to be know. If your issue is that this depends on the preemption method, I have a slight change to make it work for all preemption methods. I just replied to Joel about this: https://lore.kernel.org/all/20250205083857.3cc06aa7@gandalf.local.home/ I'll repeat some of it here. 
Currently we have:

TYPE       | Sched Tick            | RT Wakeup            |
===========+=======================+======================+
None       | NEED_RESCHED_LAZY     | NEED_RESCHED_LAZY    |
-----------+-----------------------+----------------------+
Voluntary  | NEED_RESCHED_LAZY     | NEED_RESCHED         |
-----------+-----------------------+----------------------+
Full       | NEED_RESCHED          | NEED_RESCHED         |
-----------+-----------------------+----------------------+

Now, if we set NEED_RESCHED_LAZY in PREEMPT_FULL (and PREEMPT_RT) on a scheduler tick if it interrupted user space (not the kernel), then we have this:

TYPE       | Sched Tick                        | RT Wakeup            |
===========+===================================+======================+
None       | NEED_RESCHED_LAZY                 | NEED_RESCHED_LAZY    |
-----------+-----------------------------------+----------------------+
Voluntary  | NEED_RESCHED_LAZY                 | NEED_RESCHED         |
-----------+-----------------------------------+----------------------+
Full       | NEED_RESCHED or NEED_RESCHED_LAZY | NEED_RESCHED         |
-----------+-----------------------------------+----------------------+

Then going back to user space from the interrupt, we can use rseq and the LAZY bit to know if we should extend the tick or not! Even without rseq, this would behave the same, as NEED_RESCHED_LAZY will schedule just like NEED_RESCHED when going back to user space.

This will allow this scheduler tick extension to work in all the preemption methods.

Would this be something you are more OK with? I can go and test this out.

-- Steve

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-01 18:11   ` Peter Zijlstra
  2025-02-01 23:06     ` Steven Rostedt
@ 2025-02-04 22:44     ` Prakash Sangappa
  2025-02-05  0:56       ` Joel Fernandes
  1 sibling, 1 reply; 66+ messages in thread
From: Prakash Sangappa @ 2025-02-04 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, Thomas Gleixner,
	Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto,
	bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman,
	jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk,
	jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai,
	Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams,
	bigeasy, daniel.wagner, Joseph Salisbury, broonie

> On Feb 1, 2025, at 10:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
>
>> On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> I still have full hate for this approach.
>>
>> So what approach would you prefer?
>
> The one that does not rely on the preemption method -- I think I posted
> something along those lines, and someone else recently reposted something
> based on it.

Here is the RFC I had sent that Peter is referring to:

  https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/

-Prakash.

> Tying things to the preemption method is absurdly bad design -- and I've
> told you that before.

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-04 22:44     ` Prakash Sangappa
@ 2025-02-05  0:56       ` Joel Fernandes
  2025-02-05  3:04         ` Steven Rostedt
  0 siblings, 1 reply; 66+ messages in thread
From: Joel Fernandes @ 2025-02-05 0:56 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, linux-trace-kernel,
	Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86,
	Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross,
	Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar,
	Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner,
	Joseph Salisbury, broonie

> On Feb 4, 2025, at 5:44 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>
>> On Feb 1, 2025, at 10:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
>>>
>>> On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> I still have full hate for this approach.
>>>
>>> So what approach would you prefer?
>>
>> The one that does not rely on the preemption method -- I think I posted
>> something along those lines, and someone else recently reposted something
>> based on it.
>
> Here is the RFC I had sent that Peter is referring

FWIW, I second the idea of a new syscall for this rather than (ab)using
rseq, and also independence from the preemption method. I agree that
something generic is better than relying on the preemption method.

thanks,

 - Joel

>> Tying things to the preemption method is absurdly bad design -- and I've
>> told you that before.

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-05 0:56 ` Joel Fernandes @ 2025-02-05 3:04 ` Steven Rostedt 2025-02-05 5:09 ` Joel Fernandes 0 siblings, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-05 3:04 UTC (permalink / raw) To: Joel Fernandes Cc: Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Tue, 4 Feb 2025 19:56:09 -0500 Joel Fernandes <joel@joelfernandes.org> wrote: > > Here is the RFC I had sent that Peter is referring > > FWIW, I second the idea of a new syscall for this than (ab)using rseq > and also independence from preemption method. I agree that something > generic is better than relying on preemption method. So you are for adding another user/kernel memory mapped section? And you are also OK with allowing any task to make an RT task wait longer? Putting my RT hat back on, I would definitely disable that on any system that requires RT. But for now, I'm looking at implementing this for VMs only. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-05 3:04 ` Steven Rostedt @ 2025-02-05 5:09 ` Joel Fernandes 2025-02-05 13:16 ` Steven Rostedt 0 siblings, 1 reply; 66+ messages in thread From: Joel Fernandes @ 2025-02-05 5:09 UTC (permalink / raw) To: Steven Rostedt Cc: Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Tue, Feb 4, 2025 at 10:03 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Tue, 4 Feb 2025 19:56:09 -0500 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > Here is the RFC I had sent that Peter is referring > > > > FWIW, I second the idea of a new syscall for this than (ab)using rseq > > and also independence from preemption method. I agree that something > > generic is better than relying on preemption method. > > So you are for adding another user/kernel memory mapped section? I don't personally mind that. > And you are also OK with allowing any task to make an RT task wait longer? > > Putting my RT hat back on, I would definitely disable that on any system > that requires RT. Just so I understand, you are basically saying that you want this feature only for FAIR tasks, and allowing RT tasks to extend time slice might actually hurt the latency of (other) RT tasks on the system right? This assumes PREEMPT_RT because the latency is 50us right? But in a poorly designed system, if you have RT tasks at higher priority that preempt things lower in RT, that would already cause latency anyway. Similarly, I would also consider any PREEMPT_RT system that (mis)uses this API in an RT task as also a poorly designed system. 
I think PREEMPT_RT systems generally require careful design anyway. So the fact that a system is poorly designed and thus causes latency is not the kernel's problem IMO. In any case, if you want this to only work on FAIR tasks and not RT tasks, why is that only possible to do with rseq() + LAZY preemption and not Prakash's new API + all preemption modes? Also you can just ignore RT tasks (not that I'm saying that's a good idea but..) in taskshrd_delay_resched() in that patch if you ever wanted to do that. I just feel the RT latency thing is a non-issue AFAICS. thanks, - Joel ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-05 5:09 ` Joel Fernandes @ 2025-02-05 13:16 ` Steven Rostedt 2025-02-05 13:38 ` Steven Rostedt ` (2 more replies) 0 siblings, 3 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-05 13:16 UTC (permalink / raw) To: Joel Fernandes Cc: Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Wed, 5 Feb 2025 00:09:51 -0500 Joel Fernandes <joel@joelfernandes.org> wrote: > On Tue, Feb 4, 2025 at 10:03 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Tue, 4 Feb 2025 19:56:09 -0500 > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > Here is the RFC I had sent that Peter is referring > > > > > > FWIW, I second the idea of a new syscall for this than (ab)using rseq > > > and also independence from preemption method. I agree that something > > > generic is better than relying on preemption method. > > > > So you are for adding another user/kernel memory mapped section? > > I don't personally mind that. I'm glad you don't personally mind it. Are you going to help maintain another memory mapped section? > > > And you are also OK with allowing any task to make an RT task wait longer? > > > > Putting my RT hat back on, I would definitely disable that on any system > > that requires RT. > > Just so I understand, you are basically saying that you want this > feature only for FAIR tasks, and allowing RT tasks to extend time > slice might actually hurt the latency of (other) RT tasks on the > system right? This assumes PREEMPT_RT because the latency is 50us > right? RT tasks don't have a time slice. 
They are affected by events. An external interrupt coming in, or a timer going off that states something is happening. Perhaps we could use this for SCHED_RR or maybe even SCHED_DEADLINE, as those do have time slices. But if it does get used, it should only be used when the task being scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail its guarantees. > > But in a poorly designed system, if you have RT tasks at higher > priority that preempt things lower in RT, that would already cause > latency anyway. Similarly, I would also consider any PREEMPT_RT system And that would be a poorly designed system, and not the problem of the kernel. > that (mis)uses this API in an RT task as also a poorly designed > system. I think PREEMPT_RT systems generally require careful design > anyway. So the fact that a system is poorly designed and thus causes > latency is not the kernel's problem IMO. Correct. And why I don't think this should be used for RT. It's SCHED_OTHER that doesn't have any control of the sched tick, where this hint can help. > > In any case, if you want this to only work on FAIR tasks and not RT > tasks, why is that only possible to do with rseq() + LAZY preemption > and not Prakash's new API + all preemption modes? > > Also you can just ignore RT tasks (not that I'm saying that's a good > idea but..) in taskshrd_delay_resched() in that patch if you ever > wanted to do that. > > I just feel the RT latency thing is a non-issue AFAICS. Have you worked on any RT projects before? -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-05 13:16             ` Steven Rostedt
@ 2025-02-05 13:38               ` Steven Rostedt
  2025-02-05 21:08               ` Prakash Sangappa
  2025-02-06  3:07               ` Joel Fernandes
  2 siblings, 0 replies; 66+ messages in thread
From: Steven Rostedt @ 2025-02-05 13:38 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel,
	Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86,
	Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross,
	Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar,
	Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner,
	Joseph Salisbury, broonie

On Wed, 5 Feb 2025 08:16:35 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Correct. And why I don't think this should be used for RT. It's SCHED_OTHER
> that doesn't have any control of the sched tick, where this hint can help.

Honestly, I don't care that much if it is used on all preemption models,
but I do care if it affects RT tasks.

The LAZY flag just happens to let us know if the next schedule is
mandatory or not. NEED_RESCHED_LAZY means this schedule is not that
important, whereas NEED_RESCHED means it is. That's why I picked it to
decide if the task can get extended or not. It has nothing to do with the
preemption method. The preemption method currently sets that flag.

Now we could just change when NEED_RESCHED_LAZY is set, and this could
work with all preemption methods. To explain this better.
Currently we have:

  TYPE       | Sched Tick            | RT Wakeup            |
  ===========+=======================+======================+
  None       | NEED_RESCHED_LAZY     | NEED_RESCHED_LAZY    |
  -----------+-----------------------+----------------------+
  Voluntary  | NEED_RESCHED_LAZY     | NEED_RESCHED         |
  -----------+-----------------------+----------------------+
  Full       | NEED_RESCHED          | NEED_RESCHED         |
  -----------+-----------------------+----------------------+

Perhaps if we do:

  TYPE       | Sched Tick                        | RT Wakeup            |
  ===========+===================================+======================+
  None       | NEED_RESCHED_LAZY                 | NEED_RESCHED_LAZY    |
  -----------+-----------------------------------+----------------------+
  Voluntary  | NEED_RESCHED_LAZY                 | NEED_RESCHED         |
  -----------+-----------------------------------+----------------------+
  Full       | NEED_RESCHED or NEED_RESCHED_LAZY | NEED_RESCHED         |
  -----------+-----------------------------------+----------------------+

Where on the scheduler tick, for PREEMPT_FULL (and even PREEMPT_RT), we
set NEED_RESCHED if the task is in the kernel and NEED_RESCHED_LAZY if it
is in user space, then this patch will work in all preemption methods.
As the LAZY bit will decide if the task gets extended or not.

That is, any SCHED_OTHER task that is being preempted due to its
scheduler tick can be granted 50us more, regardless of the preemption
method.

Now on PREEMPT_NONE, it may even get to preempt a RT task a bit more, but
RT tasks have more to worry about if they are running on a PREEMPT_NONE
system anyway!

-- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread
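[Editor's note] The proposed table is small enough to encode directly; a throwaway check like the following (enum names invented for illustration) captures the behavior being proposed, including the new user-vs-kernel distinction for PREEMPT_FULL on a tick:

```c
#include <assert.h>
#include <stdbool.h>

/* Which resched flag does the scheduler set?  This simply encodes the
 * proposed table from the email above. */
enum resched_flag  { RF_LAZY, RF_FULL };
enum preempt_model { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };
enum sched_event   { EV_TICK, EV_RT_WAKEUP };

static enum resched_flag pick_flag(enum preempt_model m, enum sched_event ev,
				   bool interrupted_user)
{
	if (ev == EV_RT_WAKEUP)
		return m == MODEL_NONE ? RF_LAZY : RF_FULL;

	/* Scheduler tick: FULL now sets LAZY too, but only when the tick
	 * interrupted user space, so the extension can be considered. */
	if (m == MODEL_FULL)
		return interrupted_user ? RF_LAZY : RF_FULL;

	return RF_LAZY;
}
```

Every SCHED_OTHER tick preemption then arrives as LAZY when the task was in user space, which is exactly the case where the extension hint can be honored.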
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-05 13:16 ` Steven Rostedt 2025-02-05 13:38 ` Steven Rostedt @ 2025-02-05 21:08 ` Prakash Sangappa 2025-02-05 21:19 ` Steven Rostedt 2025-02-06 3:07 ` Joel Fernandes 2 siblings, 1 reply; 66+ messages in thread From: Prakash Sangappa @ 2025-02-05 21:08 UTC (permalink / raw) To: Steven Rostedt Cc: Joel Fernandes, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie > On Feb 5, 2025, at 5:16 AM, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 5 Feb 2025 00:09:51 -0500 > Joel Fernandes <joel@joelfernandes.org> wrote: > >> On Tue, Feb 4, 2025 at 10:03 PM Steven Rostedt <rostedt@goodmis.org> wrote: >>> >>> On Tue, 4 Feb 2025 19:56:09 -0500 >>> Joel Fernandes <joel@joelfernandes.org> wrote: >>> >>>>> Here is the RFC I had sent that Peter is referring >>>> >>>> FWIW, I second the idea of a new syscall for this than (ab)using rseq >>>> and also independence from preemption method. I agree that something >>>> generic is better than relying on preemption method. >>> >>> So you are for adding another user/kernel memory mapped section? >> >> I don't personally mind that. > > I'm glad you don't personally mind it. Are you going to help maintain > another memory mapped section? > The new syscall/API proposed was to provide per thread shared mapped area(shared structure) that are allocated from memory pages that are pinned. So the kernel could access it without the need for a copyin/copyout. The idea is that it would be helpful in places where we cannot take a page fault in the kernel codepath. 
>> >>> And you are also OK with allowing any task to make an RT task wait longer? >>> >>> Putting my RT hat back on, I would definitely disable that on any system >>> that requires RT. >> >> Just so I understand, you are basically saying that you want this >> feature only for FAIR tasks, and allowing RT tasks to extend time >> slice might actually hurt the latency of (other) RT tasks on the >> system right? This assumes PREEMPT_RT because the latency is 50us >> right? > > RT tasks don't have a time slice. They are affected by events. An external > interrupt coming in, or a timer going off that states something is > happening. Perhaps we could use this for SCHED_RR or maybe even > SCHED_DEADLINE, as those do have time slices. > > But if it does get used, it should only be used when the task being > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail > its guarantees. > >> >> But in a poorly designed system, if you have RT tasks at higher >> priority that preempt things lower in RT, that would already cause >> latency anyway. Similarly, I would also consider any PREEMPT_RT system > > And that would be a poorly designed system, and not the problem of the > kernel. > >> that (mis)uses this API in an RT task as also a poorly designed >> system. I think PREEMPT_RT systems generally require careful design >> anyway. So the fact that a system is poorly designed and thus causes >> latency is not the kernel's problem IMO. > > Correct. And why I don't think this should be used for RT. It's SCHED_OTHER > that doesn't have any control of the sched tick, where this hint can help. > >> >> In any case, if you want this to only work on FAIR tasks and not RT >> tasks, why is that only possible to do with rseq() + LAZY preemption >> and not Prakash's new API + all preemption modes? >> >> Also you can just ignore RT tasks (not that I'm saying that's a good >> idea but..) in taskshrd_delay_resched() in that patch if you ever >> wanted to do that. 
>> >> I just feel the RT latency thing is a non-issue AFAICS. > > Have you worked on any RT projects before? > > -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-05 21:08               ` Prakash Sangappa
@ 2025-02-05 21:19                 ` Steven Rostedt
  2025-02-05 21:33                   ` Steven Rostedt
  0 siblings, 1 reply; 66+ messages in thread
From: Steven Rostedt @ 2025-02-05 21:19 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Joel Fernandes, Peter Zijlstra, linux-kernel, linux-trace-kernel,
	Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86,
	Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross,
	Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar,
	Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner,
	Joseph Salisbury, broonie

On Wed, 5 Feb 2025 21:08:47 +0000
Prakash Sangappa <prakash.sangappa@oracle.com> wrote:

> The new syscall/API proposed was to provide a per thread shared mapped
> area (shared structure) that is allocated from memory pages that are
> pinned. So the kernel could access it without the need for a
> copyin/copyout.
>
> The idea is that it would be helpful in places where we cannot take a
> page fault in the kernel codepath.

What places do we need to decide this in a critical path? If we follow my
proposal, where we set NEED_RESCHED_LAZY on sched_tick when it interrupts
user space, then it should all work out.

I agree with Peter about not caring about system calls. If you can do a
system call in a critical path, then just use futexes. The only reason I
support system calls is for debugging. That's because I write into the
trace_marker from user space to see if things are working, and that
requires a system call. But once things are working, I'll make it not
work for system calls too.

-- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-05 21:19 ` Steven Rostedt @ 2025-02-05 21:33 ` Steven Rostedt 2025-02-05 21:36 ` Prakash Sangappa 0 siblings, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-05 21:33 UTC (permalink / raw) To: Prakash Sangappa Cc: Joel Fernandes, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Wed, 5 Feb 2025 16:19:45 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Wed, 5 Feb 2025 21:08:47 +0000 > Prakash Sangappa <prakash.sangappa@oracle.com> wrote: > > > The new syscall/API proposed was to provide per thread shared mapped > > area(shared structure) that are allocated from memory pages that are pinned. > > So the kernel could access it without the need for a copyin/copyout. > > > > The idea is that it would be helpful in places where we cannot take a page > > fault in the kernel codepath. > > What places do we need to decided this in a critical path? If we follow my > proposal, where we set NEED_RESCHED_LAZY on sched_tick when it interrupts > user space, then it should all work out. Actually, it doesn't need to be pinned for kernel critical paths (like an interrupt handler). That's because when entering the user critical path, it writes to the location, which will fault that memory in. If the page isn't there when the kernel accesses it, it most likely means the task isn't in a critical section and there's no reason to extend the tick. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-05 21:33                   ` Steven Rostedt
@ 2025-02-05 21:36                     ` Prakash Sangappa
  0 siblings, 0 replies; 66+ messages in thread
From: Prakash Sangappa @ 2025-02-05 21:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Joel Fernandes, Peter Zijlstra, linux-kernel, linux-trace-kernel,
	Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86,
	Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli,
	vincent.guittot, willy, mgorman, jon.grimm, bharata,
	raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross,
	Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar,
	Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner,
	Joseph Salisbury, broonie

> On Feb 5, 2025, at 1:33 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 5 Feb 2025 16:19:45 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> On Wed, 5 Feb 2025 21:08:47 +0000
>> Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>>
>>> The new syscall/API proposed was to provide a per thread shared mapped
>>> area (shared structure) that is allocated from memory pages that are
>>> pinned. So the kernel could access it without the need for a
>>> copyin/copyout.
>>>
>>> The idea is that it would be helpful in places where we cannot take a
>>> page fault in the kernel codepath.
>>
>> What places do we need to decide this in a critical path? If we follow
>> my proposal, where we set NEED_RESCHED_LAZY on sched_tick when it
>> interrupts user space, then it should all work out.
>
> Actually, it doesn't need to be pinned for kernel critical paths (like an
> interrupt handler). That's because when entering the user critical path,
> it writes to the location, which will fault that memory in. If the page
> isn't there when the kernel accesses it, it most likely means the task
> isn't in a critical section and there's no reason to extend the tick.

Sure, it is not important in this use case.

> -- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-05 13:16 ` Steven Rostedt 2025-02-05 13:38 ` Steven Rostedt 2025-02-05 21:08 ` Prakash Sangappa @ 2025-02-06 3:07 ` Joel Fernandes 2025-02-06 13:30 ` Steven Rostedt 2 siblings, 1 reply; 66+ messages in thread From: Joel Fernandes @ 2025-02-06 3:07 UTC (permalink / raw) To: Steven Rostedt Cc: Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Wed, Feb 5, 2025 at 8:16 AM Steven Rostedt <rostedt@goodmis.org> wrote: [...] > > > > > And you are also OK with allowing any task to make an RT task wait longer? > > > > > > Putting my RT hat back on, I would definitely disable that on any system > > > that requires RT. > > > > Just so I understand, you are basically saying that you want this > > feature only for FAIR tasks, and allowing RT tasks to extend time > > slice might actually hurt the latency of (other) RT tasks on the > > system right? This assumes PREEMPT_RT because the latency is 50us > > right? > > RT tasks don't have a time slice. They are affected by events. An external > interrupt coming in, or a timer going off that states something is > happening. Perhaps we could use this for SCHED_RR or maybe even > SCHED_DEADLINE, as those do have time slices. > > But if it does get used, it should only be used when the task being > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail > its guarantees. > Right, it would apply still to RR/DL though... 
> > In any case, if you want this to only work on FAIR tasks and not RT > > tasks, why is that only possible to do with rseq() + LAZY preemption > > and not Prakash's new API + all preemption modes? > > > > Also you can just ignore RT tasks (not that I'm saying that's a good > > idea but..) in taskshrd_delay_resched() in that patch if you ever > > wanted to do that. > > > > I just feel the RT latency thing is a non-issue AFAICS. > > Have you worked on any RT projects before? Heh.. I think maybe you misunderstood my statement, I was mentioning that I felt (similar to Peter I think) that NOT adopting this feature generically for all tasks due to a concern of 50us latency maybe does not make sense since poorly designed app / random hardware already have this issue. I think the main concern discussed in this thread is (and please CMIIW): 1. Locking down this feature to only SCHED_OTHER versus making it generic (maybe sched_ext could also use it?). 2. Tying it to specific preemption methods which may change user mode behavior/expectation (because LAZY is tied to preemption method). 3. Overloading the purpose of LAZY: My understanding is, the purpose of LAZY is to let the scheduler decide if it wants to preempt based on preemption mode. It is not based on any hint, just on the preemption mode. I guess you are overloading LAZY by making LAZY flag also extend userspace timeslice (versus say making the time-slice extension hint its own thing...). Yes, I have worked on RT projects before -- you would know better than anyone. :-D. But admittedly, I haven't got to work much with PREEMPT_RT systems. - Joel ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 3:07 ` Joel Fernandes @ 2025-02-06 13:30 ` Steven Rostedt 2025-02-06 13:44 ` Sebastian Andrzej Siewior ` (2 more replies) 0 siblings, 3 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-06 13:30 UTC (permalink / raw) To: Joel Fernandes Cc: Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Wed, 5 Feb 2025 22:07:12 -0500 Joel Fernandes <joel@joelfernandes.org> wrote: > > > > RT tasks don't have a time slice. They are affected by events. An external > > interrupt coming in, or a timer going off that states something is > > happening. Perhaps we could use this for SCHED_RR or maybe even > > SCHED_DEADLINE, as those do have time slices. > > > > But if it does get used, it should only be used when the task being > > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail > > its guarantees. > > > > Right, it would apply still to RR/DL though... But it would have to guarantee that the RR it is delaying is of the same priority, and that delaying the DL is not going to cause something to miss its deadline. > > > > In any case, if you want this to only work on FAIR tasks and not RT > > > tasks, why is that only possible to do with rseq() + LAZY preemption > > > and not Prakash's new API + all preemption modes? > > > > > > Also you can just ignore RT tasks (not that I'm saying that's a good > > > idea but..) in taskshrd_delay_resched() in that patch if you ever > > > wanted to do that. > > > > > > I just feel the RT latency thing is a non-issue AFAICS. 
> > > > Have you worked on any RT projects before? > > Heh.. I think maybe you misunderstood my statement, I was mentioning > that I felt (similar to Peter I think) that NOT adopting this feature > generically for all tasks due to a concern of 50us latency maybe does > not make sense since poorly designed app / random hardware already > have this issue. I think the main concern discussed in this thread is > (and please CMIIW): We have code that has sub 100us latency and less. If some random user space application applies this, adding 50us (or even 20us) will break these. And this has nothing to do with poorly designed applications or hardware. By adding this as a feature that works everywhere, you will break use cases that work today. > 1. Locking down this feature to only SCHED_OTHER versus making it > generic (maybe sched_ext could also use it?). sched_ext can do whatever it wants ;-) But the reason I picked SCHED_OTHER is because that's the only policy that has no control of when it gets preempted by lower priority processes. This isn't about "hey I'm in a critical section can you delay higher priority applications with strict deadlines for me?" The scheduler tick comes at random moments. SCHED_OTHER is more about performance and not about latency. Sure, we want better latency when it comes to reaction times, but that's usually in the millisec range. Not microsecond range. RT and DL tasks do care about microseconds. And every microsecond counts. This is why I was fine in limiting this to 50us. > 2. Tying it to specific preemption methods which may change user mode > behavior/expectation (because LAZY is tied to preemption method). Well, every time a user task calls a system call, it is affected by the preemption method. And I also reported how this can work in all preemption methods, but only for SCHED_OTHER. It will just take some work on how the kernel handles NEED_RESCHED_LAZY. User space will be unaware of any of this. > 3. 
Overloading the purpose of LAZY: My understanding is, the purpose
> of LAZY is to let the scheduler decide if it wants to preempt based on
> preemption mode. It is not based on any hint, just on the preemption
> mode. I guess you are overloading LAZY by making the LAZY flag also
> extend the userspace timeslice (versus say making the time-slice
> extension hint its own thing...).

I already replied about that. Note, LAZY was created in PREEMPT_RT for
this very purpose (but in the kernel), and ported to vanilla for a
slightly different purpose. Here's the history:

PREEMPT_RT would convert spin_locks in the kernel to sleeping mutexes.
This made RT tasks respond much faster to events. But non-RT
(SCHED_OTHER) started suffering performance issues. When looking at the
performance issues, we found that they were due to tasks holding these
sleeping spin_locks and being preempted. That is, the preemption of
spin_lock holders was causing more contention and slowing things down
tremendously.

To first handle this, adaptive mutexes were introduced. These would spin
if the owner of the lock was still running, and would go to sleep if the
owner went to sleep. This helped things quite a bit, but PREEMPT_RT still
suffered a performance deficit compared to non-RT. This was because of
the timer tick on SCHED_OTHER tasks that could preempt a task holding a
spin lock.

NEED_RESCHED_LAZY was introduced to remedy this. It would be set for
SCHED_OTHER tasks and NEED_RESCHED for RT tasks. If the task was holding
a sleeping spin lock, NEED_RESCHED_LAZY would not preempt the running
task, but NEED_RESCHED would. If the SCHED_OTHER task was not holding a
sleeping spin_lock, it would be preempted regardless.

This improved the performance of SCHED_OTHER tasks in PREEMPT_RT to be as
good as what was in vanilla.

You see, LAZY was *created* for this purpose: letting the scheduler know
that the running task is in a critical section and that the timer tick
should not preempt a SCHED_OTHER task.
I just wanted to extend this to SCHED_OTHER in user space too.

> Yes, I have worked on RT projects before -- you would know better
> than anyone. :-D. But admittedly, I haven't got to work much with
> PREEMPT_RT systems.

Just using the RT policy to improve performance is not an RT project. I'm
talking about projects where, if you miss a deadline, things crash. Where
the project works very hard to make sure everything works as intended.

I'm totally against allowing SCHED_OTHER to use any feature that can delay
an RT/DL task (unless of course it is to help those tasks, like priority
inheritance).

There are several RT folks on this thread. I wonder if any of them are OK
with this?

-- Steve

^ permalink raw reply [flat|nested] 66+ messages in thread
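The userspace side of the proposal discussed above follows a simple
handshake: the task publishes an "I'm in a critical section" hint, and the
tick-side code may defer a SCHED_OTHER reschedule up to a small cap. The
following Python model is purely illustrative — the names (`Task`,
`tick_wants_resched`) and the 50us cap are assumptions for the sketch, not
the proposed kernel ABI from the patches:

```python
# Illustrative model of the time-slice-extension handshake discussed in
# this thread. Not the real ABI; names and the cap are assumptions.

class Task:
    def __init__(self, policy="SCHED_OTHER"):
        self.policy = policy
        self.extend = False            # userspace: "I'm in a critical section"
        self.need_resched_lazy = False

def tick_wants_resched(task, extended_us=0, cap_us=50):
    """Tick-side decision, per the discussion: only SCHED_OTHER may
    defer, and only up to the cap; RT/DL wakeups are never delayed."""
    if task.policy != "SCHED_OTHER":
        return True                    # preempt immediately for RT/DL
    if task.extend and extended_us < cap_us:
        task.need_resched_lazy = True  # defer: schedule on exit to user space
        return False
    return True

t = Task()
t.extend = True
print(tick_wants_resched(t))                   # False: extension granted
print(tick_wants_resched(t, extended_us=50))   # True: cap exhausted
print(tick_wants_resched(Task("SCHED_FIFO")))  # True: hint ignored for RT
```

Note how the model encodes the point argued in this thread: the hint only
trades SCHED_OTHER time against SCHED_OTHER time, and never changes the
decision for an RT policy.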
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-06 13:30 ` Steven Rostedt
@ 2025-02-06 13:44 ` Sebastian Andrzej Siewior
  2025-02-06 13:48 ` Peter Zijlstra
  2025-02-10 19:43 ` Steven Rostedt
  2025-02-10 14:07 ` Joel Fernandes
  2025-02-10 17:20 ` David Laight
  2 siblings, 2 replies; 66+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-06 13:44 UTC (permalink / raw)
To: Steven Rostedt
Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel,
    linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds,
    linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli,
    vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt,
    Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai,
    Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams,
    daniel.wagner, Joseph Salisbury, broonie

On 2025-02-06 08:30:39 [-0500], Steven Rostedt wrote:
> NEED_RESCHED_LAZY was introduced to remedy this. It would be set for
> SCHED_OTHER tasks and NEED_RESCHED for RT tasks. If the task was holding
> a sleeping spin lock, the NEED_RESCHED_LAZY would not preempt the running
> task, but NEED_RESCHED would. If the SCHED_OTHER task was not holding a
> sleeping spin_lock it would be preempted regardless.
>
> This improved the performance of SCHED_OTHER tasks in PREEMPT_RT to be as
> good as what was in vanilla.
>
> You see, LAZY was *created* for this purpose. Of letting the scheduler know
> that the running task is in a critical section and the timer tick should
> not preempt a SCHED_OTHER task.

This was introduced that way originally. It helped because we did not
preempt the lock owner unless we had to. This isn't the case with LAZY as
of today. I did add a scheduling point in rt_spin_unlock() if LAZY was set
and, based on a few tests, it was something between noise and worse. It
seems that "run to completion" is better than interrupting the kernel in
the middle of whatever it is doing.
"Don't preempt the lock owner" is already handled by LAZY with the scheduling point on return to userland. > I just wanted to extend this to SCHED_OTHER in user space too. > > > > > Yes, I have worked on RT projects before -- you would know better > > than anyone. :-D. But admittedly, I haven't got to work much with > > PREEMPT_RT systems. > > Just using RT policy to improve performance is not an RT project. I'm > talking about projects that if you miss a deadline things crash. Where the > project works very hard to make sure everything works as intended. > > I'm totally against allowing SCHED_OTHER to use any feature that can delay > an RT/DL task (unless of course it is to help those, like priority inheritance). > > There's several RT folks on this thread. I wonder if any of > them are OK with this? If you roll your own (sched_ext) then do what you want as per you sched-policy. If you use this LAZY bit as a hint from userland to kernel as "please give me up to X usec/ ticks before cutting me off" fine. It is SCHED_OTHER vs SCHED_OTHER after all. But please don't delay a wakeup of SCHED_FIFO/ RR/ DL because of this LAZY hint. > -- Steve Sebastian ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 13:44 ` Sebastian Andrzej Siewior @ 2025-02-06 13:48 ` Peter Zijlstra 2025-02-06 13:53 ` Sebastian Andrzej Siewior 2025-02-10 19:43 ` Steven Rostedt 1 sibling, 1 reply; 66+ messages in thread From: Peter Zijlstra @ 2025-02-06 13:48 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Steven Rostedt, Joel Fernandes, Prakash Sangappa, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Thu, Feb 06, 2025 at 02:44:08PM +0100, Sebastian Andrzej Siewior wrote: > SCHED_OTHER vs SCHED_OTHER after all. But please don't delay a wakeup of > SCHED_FIFO/ RR/ DL because of this LAZY hint. Thing will get delayed if interrupts are disabled or kernel has preemption disabled too. So as long as we ensure hint crap is of equal order, nothing cares if you do. If you can't tell the difference between task does hint crap in userspace and task is in the middle of syscall, you can't tell the difference. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 13:48 ` Peter Zijlstra @ 2025-02-06 13:53 ` Sebastian Andrzej Siewior 2025-02-06 13:57 ` Peter Zijlstra 0 siblings, 1 reply; 66+ messages in thread From: Sebastian Andrzej Siewior @ 2025-02-06 13:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, Joel Fernandes, Prakash Sangappa, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On 2025-02-06 14:48:59 [+0100], Peter Zijlstra wrote: > On Thu, Feb 06, 2025 at 02:44:08PM +0100, Sebastian Andrzej Siewior wrote: > > SCHED_OTHER vs SCHED_OTHER after all. But please don't delay a wakeup of > > SCHED_FIFO/ RR/ DL because of this LAZY hint. > > Thing will get delayed if interrupts are disabled or kernel has > preemption disabled too. So as long as we ensure hint crap is of equal > order, nothing cares if you do. > > If you can't tell the difference between task does hint crap in > userspace and task is in the middle of syscall, you can't tell the > difference. I can tell the difference if I see a trace where an interrupt fires, performs a wakeup and the SCHED_OTHER task remains on CPU while the task SCHED_FIFO task sits on the runqueue until a timer fires. Sebastian ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 13:53 ` Sebastian Andrzej Siewior @ 2025-02-06 13:57 ` Peter Zijlstra 2025-02-06 14:20 ` Steven Rostedt 2025-02-06 14:22 ` Sebastian Andrzej Siewior 0 siblings, 2 replies; 66+ messages in thread From: Peter Zijlstra @ 2025-02-06 13:57 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Steven Rostedt, Joel Fernandes, Prakash Sangappa, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Thu, Feb 06, 2025 at 02:53:53PM +0100, Sebastian Andrzej Siewior wrote: > On 2025-02-06 14:48:59 [+0100], Peter Zijlstra wrote: > > On Thu, Feb 06, 2025 at 02:44:08PM +0100, Sebastian Andrzej Siewior wrote: > > > SCHED_OTHER vs SCHED_OTHER after all. But please don't delay a wakeup of > > > SCHED_FIFO/ RR/ DL because of this LAZY hint. > > > > Thing will get delayed if interrupts are disabled or kernel has > > preemption disabled too. So as long as we ensure hint crap is of equal > > order, nothing cares if you do. > > > > If you can't tell the difference between task does hint crap in > > userspace and task is in the middle of syscall, you can't tell the > > difference. > > I can tell the difference if I see a trace where an interrupt fires, > performs a wakeup and the SCHED_OTHER task remains on CPU while the task > SCHED_FIFO task sits on the runqueue until a timer fires. Right, but so what? Same delay will happen if interrupt fires in the middle of a preempt_disable() region. Or if interrupt gets pending while interrupts are disabled, except your trace will not show that. Your worst case response time isn't affected. That's all that matters. 
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-06 13:57 ` Peter Zijlstra
@ 2025-02-06 14:20 ` Steven Rostedt
  2025-02-06 14:22 ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 66+ messages in thread
From: Steven Rostedt @ 2025-02-06 14:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Sebastian Andrzej Siewior, Joel Fernandes, Prakash Sangappa,
    linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora,
    Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen,
    hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata,
    raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3,
    Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers,
    Clark Williams, daniel.wagner, Joseph Salisbury, broonie

On Thu, 6 Feb 2025 14:57:44 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Right, but so what? Same delay will happen if interrupt fires in the
> middle of a preempt_disable() region.
>
> Or if interrupt gets pending while interrupts are disabled, except your
> trace will not show that.
>
> Your worst case response time isn't affected. That's all that matters.

So now if a task has this set, and an interrupt goes off and wakes an RT
task, not only is the time of the interrupt the latency of the RT task, but
also this extension of the SCHED_OTHER task. That is, where it used to be:

            event    RT task scheduled
              |        |
              v        v
time: |-------+-+----+--
                ^
                |
            interrupt

If the interrupt triggered just as the task set this bit, we then have:

            event    set Xus        RT task scheduled
              |        |                 |
              v        v                 v
time: |-------+-+----+-------+--+
                ^            ^
                |            |
            interrupt    Xus timer triggered

This adds on *top* of the current latency; it does not replace it. Yes,
this may not happen often, but in RT we very much do care about the worst
case scenarios. (That's the difference between an RT project and a project
that just uses RT.)

-- Steve

^ permalink raw reply [flat|nested] 66+ messages in thread
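The worst-case argument in the message above is simple arithmetic: when the
extension is granted just as the wakeup interrupt fires, it stacks on top
of the existing wakeup path instead of overlapping with it. A tiny model,
using the illustrative 70us/50us figures that appear elsewhere in this
thread (the function name and numbers are assumptions for the sketch):

```python
# Worst-case RT wakeup latency when a SCHED_OTHER extension is granted
# just as the wakeup interrupt fires; illustrative numbers only.

def rt_wakeup_latency_us(irq_path_us, extension_us, extension_active):
    # The extension delays the schedule point end to end: it adds on
    # *top* of the interrupt/wakeup path, it does not hide inside it.
    return irq_path_us + (extension_us if extension_active else 0)

print(rt_wakeup_latency_us(70, 50, False))  # 70  : worst case without the hint
print(rt_wakeup_latency_us(70, 50, True))   # 120 : with the hint granted
```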
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 13:57 ` Peter Zijlstra 2025-02-06 14:20 ` Steven Rostedt @ 2025-02-06 14:22 ` Sebastian Andrzej Siewior 2025-02-06 14:27 ` Peter Zijlstra 1 sibling, 1 reply; 66+ messages in thread From: Sebastian Andrzej Siewior @ 2025-02-06 14:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, Joel Fernandes, Prakash Sangappa, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On 2025-02-06 14:57:44 [+0100], Peter Zijlstra wrote: > On Thu, Feb 06, 2025 at 02:53:53PM +0100, Sebastian Andrzej Siewior wrote: > > On 2025-02-06 14:48:59 [+0100], Peter Zijlstra wrote: > > > On Thu, Feb 06, 2025 at 02:44:08PM +0100, Sebastian Andrzej Siewior wrote: > > > > SCHED_OTHER vs SCHED_OTHER after all. But please don't delay a wakeup of > > > > SCHED_FIFO/ RR/ DL because of this LAZY hint. > > > > > > Thing will get delayed if interrupts are disabled or kernel has > > > preemption disabled too. So as long as we ensure hint crap is of equal > > > order, nothing cares if you do. > > > > > > If you can't tell the difference between task does hint crap in > > > userspace and task is in the middle of syscall, you can't tell the > > > difference. > > > > I can tell the difference if I see a trace where an interrupt fires, > > performs a wakeup and the SCHED_OTHER task remains on CPU while the task > > SCHED_FIFO task sits on the runqueue until a timer fires. > > Right, but so what? Same delay will happen if interrupt fires in the > middle of a preempt_disable() region. But I do see an interrupt and a wakeup in the middle of a preempt-disable section based on the preempt-counter. 
Then once the preempt-disabled section is done, a sched-switch is expected.

> Or if interrupt gets pending while interrupts are disabled, except your
> trace will not show that.

That is true. In general we try not to have a lot of these, and also push
the "unexpected" parts, such as unrelated interrupts, off the CPU. So you
may end up with an RT thread and SCHED_OTHER threads on the same CPU. It
would be very unfortunate if this "let me have a bit of extra time so I get
out of my critical section" within the SCHED_OTHER application affects the
RT application.

You run SCHED_OTHER tasks and a cyclic interrupt wakes your RT task every
ms. You expect to be scheduled asap, so ideally 0us, but 70us in the real
world. Then this feature adds 20us on top?

> Your worst case response time isn't affected. That's all that matters.

Sebastian

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 14:22 ` Sebastian Andrzej Siewior @ 2025-02-06 14:27 ` Peter Zijlstra 2025-02-06 14:57 ` Steven Rostedt 2025-02-06 15:01 ` Sebastian Andrzej Siewior 0 siblings, 2 replies; 66+ messages in thread From: Peter Zijlstra @ 2025-02-06 14:27 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Steven Rostedt, Joel Fernandes, Prakash Sangappa, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Thu, Feb 06, 2025 at 03:22:34PM +0100, Sebastian Andrzej Siewior wrote: > Then this feature adds 20us on top? The point has always been for the number to be < the observable scheduling latency. I'm not sure what that number is, and it is always hardware dependent. I measured it on a random test box when I did the prototype a long while ago, and ended up at 50us, but for all I know that machine was running a lockdep enabled kernel at the time (won't be the first and certainly won't be the last time I try and do a performance measurement on a debug kernel). That was not the important part -- but everybody fixates on the number, instead of the intent. I'm assuming you have a recent number around -- what's sane? 5us, less? ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 14:27 ` Peter Zijlstra @ 2025-02-06 14:57 ` Steven Rostedt 2025-02-06 15:01 ` Sebastian Andrzej Siewior 1 sibling, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-06 14:57 UTC (permalink / raw) To: Peter Zijlstra Cc: Sebastian Andrzej Siewior, Joel Fernandes, Prakash Sangappa, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Thu, 6 Feb 2025 15:27:17 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > I'm assuming you have a recent number around -- what's sane? 5us, less? It really doesn't matter what the number is. No matter what it is, it adds to the latency. Just adding the timer and another interrupt just doubled the interrupt latency. If an RT task were to wake up but this flag is set to extend the currently running task, even if you made it 5us, it will be more than that. You need to enable a new timer, get back to user space, trigger another interrupt, before you can schedule the RT task from its original time it was to wake up and run. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-06 14:27 ` Peter Zijlstra
  2025-02-06 14:57 ` Steven Rostedt
@ 2025-02-06 15:01 ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 66+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-06 15:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Joel Fernandes, Prakash Sangappa, linux-kernel,
    linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds,
    linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli,
    vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt,
    Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai,
    Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams,
    daniel.wagner, Joseph Salisbury, broonie

On 2025-02-06 15:27:17 [+0100], Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 03:22:34PM +0100, Sebastian Andrzej Siewior wrote:
> > Then this feature adds 20us on top?
>
> The point has always been for the number to be < the observable
> scheduling latency.
>
> I'm not sure what that number is, and it is always hardware dependent. I
> measured it on a random test box when I did the prototype a long while
> ago, and ended up at 50us, but for all I know that machine was running a
> lockdep enabled kernel at the time (won't be the first and certainly
> won't be the last time I try and do a performance measurement on a debug
> kernel).

When I have lockdep enabled, I have scheduling latencies >1ms.

> That was not the important part -- but everybody fixates on the number,
> instead of the intent.

I don't mind delaying a SCHED_OTHER wakeup for the greater good, and if
there has to be a number, 50us, so be it. This is certainly not something I
complain about. I'm just asking not to delay the wakeup of the RT task
which should be on the CPU based on its priority.

Depending on the RT application, it is not just the interrupt and
preempt-off sections that you worry about.
It could also involve PI-boosting a SCHED_OTHER task on a different CPU to
release the lock in question so that the RT application on _this_ CPU can
make progress. So you have 50us on this CPU and 50us on the remote CPU,
because it also does the LAZY thingy for performance reasons. And so the
number doubles.

> I'm assuming you have a recent number around -- what's sane? 5us, less?

As I tried to explain, any additional delay hurts. If your application
requires a latency of 1ms and you get max 100us based on testing, then an
additional 50us certainly won't hurt you. However, if you require 200us
max, you already struggle with 160us, especially if everything fires at
once and the caches are gone. In this case the 5us will still fit the
requirement on paper, but the buffer got smaller. Also, the 5us requires a
timer to fire etc…

There are "bigger" x86 boxes with high clocked CPUs and big caches which
can be partitioned and so on, and everything is nice. There are also
smaller x86 boxes where you have two trace_printk() after each other in an
IRQ-off region 2us apart, and you yell at the scheduler for taking 30us for
a scheduling decision.

Sebastian

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 13:44 ` Sebastian Andrzej Siewior 2025-02-06 13:48 ` Peter Zijlstra @ 2025-02-10 19:43 ` Steven Rostedt 2025-02-10 22:04 ` David Laight 2025-02-11 8:21 ` Sebastian Andrzej Siewior 1 sibling, 2 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-10 19:43 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Thu, 6 Feb 2025 14:44:08 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > I did add a scheduling point in rt_spin_unlock() if LAZY was set and > based on few tests it was something between noise and worse. It seems > that "run to completion" is better than interrupt the kernel in the > middle whatever it is doing. "Don't preempt the lock owner" is already > handled by LAZY with the scheduling point on return to userland. Does that mean that PREEMPT_RT requires a non preempt method for SCHED_OTHER for SCHED_OTHER to not hit the issues that we were originally hitting? That is, with being able to preempt spin_locks in PREEMPT_RT, running a system with PREEMPT_RT in full preemption mode will still suffer performance issues against a non PREEMPT_RT running in full preemption mode? -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-10 19:43 ` Steven Rostedt @ 2025-02-10 22:04 ` David Laight 2025-02-10 22:15 ` Steven Rostedt 2025-02-11 8:21 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 66+ messages in thread From: David Laight @ 2025-02-10 22:04 UTC (permalink / raw) To: Steven Rostedt Cc: Sebastian Andrzej Siewior, Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Mon, 10 Feb 2025 14:43:21 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, 6 Feb 2025 14:44:08 +0100 > Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > > I did add a scheduling point in rt_spin_unlock() if LAZY was set and > > based on few tests it was something between noise and worse. It seems > > that "run to completion" is better than interrupt the kernel in the > > middle whatever it is doing. "Don't preempt the lock owner" is already > > handled by LAZY with the scheduling point on return to userland. > > Does that mean that PREEMPT_RT requires a non preempt method for > SCHED_OTHER for SCHED_OTHER to not hit the issues that we were originally > hitting? That is, with being able to preempt spin_locks in PREEMPT_RT, > running a system with PREEMPT_RT in full preemption mode will still suffer > performance issues against a non PREEMPT_RT running in full preemption mode? My 'gut feel' is that all the context switches with PREEMPT_RT add a significant overhead. It might not matter if your system is lightly loaded (overspecified), but if you need to run at 95%+ cpu then they will hit you hard. 
Maybe you can afford to drop softint and napi code to a high(ish) priority thread, but I'd have thought that most interrupts should stay that way and most spinlocks stay as spinlocks - and probably all disable interrupts! Any interrupts that take 'a long time' or spinlocks that are held for 'a long time' really need changing anyway. But there are some really dreadful bits of code in the kernel. One of the Intel ethernet drivers spins for ages whenever the bios is accessing the hardware - you can't run RTP audio tests on that system. Perhaps interrupt disable and pre-emption times should (optionally) be monitored and a warning output every time they go up significantly. A 'name and shame' policy might improve matters. David > > -- Steve > ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-10 22:04 ` David Laight @ 2025-02-10 22:15 ` Steven Rostedt 0 siblings, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-10 22:15 UTC (permalink / raw) To: David Laight Cc: Sebastian Andrzej Siewior, Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Mon, 10 Feb 2025 22:04:33 +0000 David Laight <david.laight.linux@gmail.com> wrote: > My 'gut feel' is that all the context switches with PREEMPT_RT add a significant > overhead. "gut feel" is normally not something we care about. You can have a gut feel to try new code, but it's the benchmarks and measurements that are the only input for actual decisions. > It might not matter if your system is lightly loaded (overspecified), > but if you need to run at 95%+ cpu then they will hit you hard. > > Maybe you can afford to drop softint and napi code to a high(ish) priority > thread, but I'd have thought that most interrupts should stay that way and > most spinlocks stay as spinlocks - and probably all disable interrupts! > Any interrupts that take 'a long time' or spinlocks that are held for 'a long time' > really need changing anyway. There's a lot of users of PREEMPT_RT where things work under load. We've been arguing about those things being changed for a long time (over 20 years), and that hasn't worked out well. The PREEMPT_RT method was used to fix that. There's just too many places that grab a spin lock and iterate a list that doesn't have a bound size. > > But there are some really dreadful bits of code in the kernel. 
> One of the Intel ethernet drivers spins for ages whenever the bios is accessing > the hardware - you can't run RTP audio tests on that system. > > Perhaps interrupt disable and pre-emption times should (optionally) be monitored > and a warning output every time they go up significantly. > A 'name and shame' policy might improve matters. We have interrupt and preemption off latency tracers. Note, any measurements you add can cause noticeable performance hits. They are small, but still noticeable. Which is why I recommend disabling them on any production system. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-10 19:43 ` Steven Rostedt 2025-02-10 22:04 ` David Laight @ 2025-02-11 8:21 ` Sebastian Andrzej Siewior 2025-02-11 10:57 ` Peter Zijlstra 2025-02-11 15:28 ` Steven Rostedt 1 sibling, 2 replies; 66+ messages in thread From: Sebastian Andrzej Siewior @ 2025-02-11 8:21 UTC (permalink / raw) To: Steven Rostedt Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On 2025-02-10 14:43:21 [-0500], Steven Rostedt wrote: > On Thu, 6 Feb 2025 14:44:08 +0100 > Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > > I did add a scheduling point in rt_spin_unlock() if LAZY was set and > > based on few tests it was something between noise and worse. It seems > > that "run to completion" is better than interrupt the kernel in the > > middle whatever it is doing. "Don't preempt the lock owner" is already > > handled by LAZY with the scheduling point on return to userland. > > Does that mean that PREEMPT_RT requires a non preempt method for > SCHED_OTHER for SCHED_OTHER to not hit the issues that we were originally > hitting? That is, with being able to preempt spin_locks in PREEMPT_RT, > running a system with PREEMPT_RT in full preemption mode will still suffer > performance issues against a non PREEMPT_RT running in full preemption mode? So with LAZY_PREEMPT (not that one that was merged upstream, its predecessor) we had a counter similar to the preemption counter. On each rt_spin_lock() the counter was incremented, on each rt_spin_unlock() the counter was decremented. 
Once the counter hit zero and the lazy preempt flag was set (which was only
set on schedule requests by SCHED_OTHER tasks), we scheduled. This improved
the performance, as we didn't schedule() while holding a spinlock_t and
then bump into the same lock in the next task.

We don't follow this behaviour exactly today.

Adding this behaviour back vs the behaviour we have now doesn't seem to
improve anything at visible levels. We don't have a counter, but we can
look at the RCU nesting counter, which should be zero once locks have been
dropped. So this can be used for testing.

But as I said: using "run to completion" and preempting on the return to
userland, rather than once the lazy flag is seen and all locks have been
released, appears to be better.

It is (now) possible that you run for a long time and get preempted while
holding a spinlock_t. It is however more likely that you release all locks
and get preempted while returning to userland.

> -- Steve

Sebastian

^ permalink raw reply [flat|nested] 66+ messages in thread
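The historical lazy-preempt scheme described above — bump a nesting count
in rt_spin_lock(), drop it in rt_spin_unlock(), and only honour a pending
lazy reschedule once the count reaches zero — can be modelled in a few
lines. This is a sketch of the old behaviour as recounted here, not
current kernel code:

```python
# Sketch of the old PREEMPT_RT lazy-preempt counter: a pending lazy
# reschedule is only honoured once the last rt_spin_lock() is released,
# so we never schedule() while still holding a sleeping spinlock.

class CPU:
    def __init__(self):
        self.lock_nesting = 0
        self.lazy_pending = False
        self.switches = 0

    def rt_spin_lock(self):
        self.lock_nesting += 1

    def rt_spin_unlock(self):
        self.lock_nesting -= 1
        if self.lock_nesting == 0 and self.lazy_pending:
            self.lazy_pending = False
            self.switches += 1         # schedule() runs here, lock-free

cpu = CPU()
cpu.rt_spin_lock()
cpu.rt_spin_lock()
cpu.lazy_pending = True                # tick requested a lazy reschedule
cpu.rt_spin_unlock()                   # outer lock still held: no switch
print(cpu.switches)                    # 0
cpu.rt_spin_unlock()                   # last lock dropped: now we switch
print(cpu.switches)                    # 1
```

This is why the scheme avoided the contention described earlier in the
thread: the next task never woke up just to block on the lock the
preempted task was still holding.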
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
  2025-02-11 8:21 ` Sebastian Andrzej Siewior
@ 2025-02-11 10:57 ` Peter Zijlstra
  0 siblings, 0 replies; 66+ messages in thread
From: Peter Zijlstra @ 2025-02-11 10:57 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Steven Rostedt, Joel Fernandes, Prakash Sangappa, linux-kernel,
    linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds,
    linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli,
    vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt,
    Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai,
    Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams,
    daniel.wagner, Joseph Salisbury, broonie

On Tue, Feb 11, 2025 at 09:21:38AM +0100, Sebastian Andrzej Siewior wrote:
> So with LAZY_PREEMPT (not that one that was merged upstream, its
> predecessor) we had a counter similar to the preemption counter. On each
> rt_spin_lock() the counter was incremented, on each rt_spin_unlock() the
> counter was decremented. Once the counter hit zero and the lazy preempt
> flag was set (which was only set on schedule requests by SCHED_OTHER
> tasks), we scheduled.
> This improved the performance as we didn't schedule() while holding a
> spinlock_t and then bump into the same lock in the next task.
>
> We don't follow this behaviour exactly today.

I think I sent some hackery Mike's way to implement that at some point.
IIRC it wasn't an obvious win. Anyway, it's not too hard to do.

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-11 8:21 ` Sebastian Andrzej Siewior 2025-02-11 10:57 ` Peter Zijlstra @ 2025-02-11 15:28 ` Steven Rostedt 2025-02-12 12:11 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-11 15:28 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Tue, 11 Feb 2025 09:21:38 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > We don't follow this behaviour exactly today. > > Adding this behaviour back vs the behaviour we have now, doesn't seem to > improve anything at visible levels. We don't have a counter but we can > look at the RCU nesting counter which should be zero once locks have > been dropped. So this can be used for testing. > > But as I said: using "run to completion" and preempt on the return > userland rather than once the lazy flag is seen and all locks have been > released appears to be better. > > It is (now) possible that you run for a long time and get preempted > while holding a spinlock_t. It is however more likely that you release > all locks and get preempted while returning to userland. IIUC, today, LAZY causes all SCHED_OTHER tasks to act more like PREEMPT_NONE. Is that correct? Now that the PREEMPT_RT is not one of the preemption selections, when you select PREEMPT_RT, you can pick between CONFIG_PREEMPT and CONFIG_PREEMPT_LAZY. 
Where CONFIG_PREEMPT will preempt the kernel at the scheduler tick if preemption is enabled and CONFIG_PREEMPT_LAZY will not preempt the kernel on a scheduler tick and wait for exit to user space. Sebastian, It appears you only tested the CONFIG_PREEMPT_LAZY selection. Have you tested the difference of how CONFIG_PREEMPT behaves between PREEMPT_RT and no PREEMPT_RT? I think that will show a difference like we had in the past. I can see people picking both PREEMPT_RT and CONFIG_PREEMPT (Full), but then wondering why their non RT tasks are suffering from a performance penalty from that. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-11 15:28 ` Steven Rostedt @ 2025-02-12 12:11 ` Sebastian Andrzej Siewior 2025-02-12 15:00 ` Steven Rostedt 0 siblings, 1 reply; 66+ messages in thread From: Sebastian Andrzej Siewior @ 2025-02-12 12:11 UTC (permalink / raw) To: Steven Rostedt Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On 2025-02-11 10:28:01 [-0500], Steven Rostedt wrote: > On Tue, 11 Feb 2025 09:21:38 +0100 > Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > > We don't follow this behaviour exactly today. > > > > Adding this behaviour back vs the behaviour we have now, doesn't seem to > > improve anything at visible levels. We don't have a counter but we can > > look at the RCU nesting counter which should be zero once locks have > > been dropped. So this can be used for testing. > > > > But as I said: using "run to completion" and preempt on the return > > userland rather than once the lazy flag is seen and all locks have been > > released appears to be better. > > > > It is (now) possible that you run for a long time and get preempted > > while holding a spinlock_t. It is however more likely that you release > > all locks and get preempted while returning to userland. > > IIUC, today, LAZY causes all SCHED_OTHER tasks to act more like > PREEMPT_NONE. Is that correct? Well. First sched-tick will set the LAZY bit, the second sched-tick forces a resched. On PREEMPT_NONE the sched-tick would be set NEED_RESCHED while nothing will force a resched until the task decides to do schedule() on its own. 
So it is slightly different for kernel threads. Unless we talk about userland, here we would have a resched on the return to userland after the sched-tick LAZY or NONE does not matter. > Now that the PREEMPT_RT is not one of the preemption selections, when you > select PREEMPT_RT, you can pick between CONFIG_PREEMPT and > CONFIG_PREEMPT_LAZY. Where CONFIG_PREEMPT will preempt the kernel at the > scheduler tick if preemption is enabled and CONFIG_PREEMPT_LAZY will > not preempt the kernel on a scheduler tick and wait for exit to user space. This is not specific to RT but FULL vs LAZY. But yes. However the second sched-tick will force preemption point even without the exit-to-userland. > Sebastian, > > It appears you only tested the CONFIG_PREEMPT_LAZY selection. Have you > tested the difference of how CONFIG_PREEMPT behaves between PREEMPT_RT and > no PREEMPT_RT? I think that will show a difference like we had in the past. Not that I remember testing PREEMPT vs PREEMPT_RT. I remember people complained about high networking load on RT which became visible due to threaded interrupts (as in top) while for non-RT it was more or less hidden and not clearly visible due to selected accounting. The network performance was mostly the same as far as I remember (that is gbit). > I can see people picking both PREEMPT_RT and CONFIG_PREEMPT (Full), but > then wondering why their non RT tasks are suffering from a performance > penalty from that. We might want to opt in for lazy by default on RT. That was the case in the RT queue until it was replaced with PREEMPT_AUTO. But then why not use LAZY in favour of PREEMPT. Mike had numbers https://lore.kernel.org/all/9df22ebbc2e6d426099bf380477a0ed885068896.camel@gmx.de/ where LAZY had mostly the voluntary performance with fewer context switches than preempt. Which means also without the need for cond_resched() and friends. > -- Steve Sebastian ^ permalink raw reply [flat|nested] 66+ messages in thread
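[Editor's note: the two-tick behaviour Sebastian describes — first tick sets the lazy bit, second tick forces a resched — can be sketched as below. The enum and function names are illustrative, not the kernel's actual flag handling.]

```c
/* Sketch of the per-tick decision for a SCHED_OTHER task under
 * PREEMPT_LAZY, as described above. Names are illustrative. */
enum resched_state {
	TIF_NONE,          /* no resched requested */
	TIF_RESCHED_LAZY,  /* resched on the next return to user space */
	TIF_RESCHED_NOW,   /* force an in-kernel preemption point */
};

static enum resched_state lazy_sched_tick(enum resched_state pending)
{
	if (pending == TIF_NONE)
		return TIF_RESCHED_LAZY;  /* first tick: be lazy */
	return TIF_RESCHED_NOW;           /* second tick: force a resched */
}
```

Under PREEMPT_NONE, by contrast, the tick would only ever set NEED_RESCHED, and nothing would force a kernel thread to switch until it calls schedule() on its own.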
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-12 12:11 ` Sebastian Andrzej Siewior @ 2025-02-12 15:00 ` Steven Rostedt 2025-02-12 15:18 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-12 15:00 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On Wed, 12 Feb 2025 13:11:13 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > IIUC, today, LAZY causes all SCHED_OTHER tasks to act more like > > PREEMPT_NONE. Is that correct? > > Well. First sched-tick will set the LAZY bit, the second sched-tick > forces a resched. > On PREEMPT_NONE the sched-tick would be set NEED_RESCHED while nothing > will force a resched until the task decides to do schedule() on its own. > So it is slightly different for kernel threads. Except that it should schedule on a cond_resched() and the point of adding LAZY was to get rid of all the cond_resched() which in turn gets rid of the need for PREEMPT_NONE. Which was what I was getting at. That PREEMPT_LAZY is really just NONE without the need to sprinkle cond_resched() all over the kernel. Instead of having cond_resched(), we just wait for the next tick. > > Unless we talk about userland, here we would have a resched on the > return to userland after the sched-tick LAZY or NONE does not matter. > > > Now that the PREEMPT_RT is not one of the preemption selections, when you > > select PREEMPT_RT, you can pick between CONFIG_PREEMPT and > > CONFIG_PREEMPT_LAZY. 
Where CONFIG_PREEMPT will preempt the kernel at the > > scheduler tick if preemption is enabled and CONFIG_PREEMPT_LAZY will > > not preempt the kernel on a scheduler tick and wait for exit to user space. > > This is not specific to RT but FULL vs LAZY. But yes. However the second Not true. PREEMPT_RT used to enable PREEMPT_FULL as well (it would preempt everywhere). The issue we found was that spin_locks which would not have been preempted by just FULL alone were being preempted when RT was enabled. That caused a lot more contention with spin locks in the kernel. That is, PREEMPT_RT with PREEMPT_FULL will have a noticeable performance degradation compared to just PREEMPT_FULL alone. > sched-tick will force preemption point even without the > exit-to-userland. > My question still stands. Have you compared PREEMPT_FULL with and without PREEMPT_RT? -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-12 15:00 ` Steven Rostedt @ 2025-02-12 15:18 ` Sebastian Andrzej Siewior 0 siblings, 0 replies; 66+ messages in thread From: Sebastian Andrzej Siewior @ 2025-02-12 15:18 UTC (permalink / raw) To: Steven Rostedt Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, daniel.wagner, Joseph Salisbury, broonie On 2025-02-12 10:00:01 [-0500], Steven Rostedt wrote: > On Wed, 12 Feb 2025 13:11:13 +0100 > Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > > IIUC, today, LAZY causes all SCHED_OTHER tasks to act more like > > > PREEMPT_NONE. Is that correct? > > > > Well. First sched-tick will set the LAZY bit, the second sched-tick > > forces a resched. > > On PREEMPT_NONE the sched-tick would be set NEED_RESCHED while nothing > > will force a resched until the task decides to do schedule() on its own. > > So it is slightly different for kernel threads. > > Except that it should schedule on a cond_resched() and the point of adding > LAZY was to get rid of all the cond_resched() which in turn gets rid of the > need for PREEMPT_NONE. Which was what I was getting at. That PREEMPT_LAZY > is really just NONE without the need to sprinkle cond_resched() all over > the kernel. Instead of having cond_resched(), we just wait for the next > tick. I would argue that we want to get out of the kernel asap and not schedule() if we stumble upon cond_resched(). > > Unless we talk about userland, here we would have a resched on the > > return to userland after the sched-tick LAZY or NONE does not matter. 
> > > > > Now that the PREEMPT_RT is not one of the preemption selections, when you > > > select PREEMPT_RT, you can pick between CONFIG_PREEMPT and > > > CONFIG_PREEMPT_LAZY. Where CONFIG_PREEMPT will preempt the kernel at the > > > scheduler tick if preemption is enabled and CONFIG_PREEMPT_LAZY will > > > not preempt the kernel on a scheduler tick and wait for exit to user space. > > > > This is not specific to RT but FULL vs LAZY. But yes. However the second > > Not true. PREEMPT_RT use to enable PREEMPT_FULL as well (it would preempt > everywhere). The issue we found was that spin_locks which would not have > been preempted by just FULL alone were being preempted when RT was enabled. > That caused a lot more contention with spin locks in the kernel. > > That is PREEMPT_RT with PREEMPT_FULL will have a noticeable performance > degradation compared to just PREEMPT_FULL alone. okay. > > sched-tick will force preemption point even without the > > exit-to-userland. > > > > My question still stands. Have you compared PREEMPT_FULL with and without > PREEMPT_RT? No I have not. > -- Steve Sebastian ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 13:30 ` Steven Rostedt 2025-02-06 13:44 ` Sebastian Andrzej Siewior @ 2025-02-10 14:07 ` Joel Fernandes 2025-02-10 19:48 ` Steven Rostedt 2025-02-10 17:20 ` David Laight 2 siblings, 1 reply; 66+ messages in thread From: Joel Fernandes @ 2025-02-10 14:07 UTC (permalink / raw) To: Steven Rostedt Cc: Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Thu, Feb 6, 2025 at 8:30 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 5 Feb 2025 22:07:12 -0500 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > RT tasks don't have a time slice. They are affected by events. An external > > > interrupt coming in, or a timer going off that states something is > > > happening. Perhaps we could use this for SCHED_RR or maybe even > > > SCHED_DEADLINE, as those do have time slices. > > > > > > But if it does get used, it should only be used when the task being > > > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail > > > its guarantees. > > > > > > > Right, it would apply still to RR/DL though... > > But it would have to guarantee that the RR it is delaying is of the same > priority, and that delaying the DL is not going to cause something to miss > its deadline. See Peter comment: "Then pick another number; RT too has a max scheduling latency number (on some random hardware). If you stay below that, all is fine.". > > 3. Overloading the purpose of LAZY: My understanding is, the purpose > > of LAZY is to let the scheduler decide if it wants to preempt based on > > preemption mode. 
It is not based on any hint, just on the preemption > > mode. I guess you are overloading LAZY by making LAZY flag also extend > > userspace timeslice (versus say making the time-slice extension hint > > its own thing...). > > I already replied about that. Note, LAZY was created in PREEMPT_RT for this > very purpose (but in the kernel), and ported to vanilla for a slightly > different purpose. > > Here's the history: > > PREEMPT_RT would convert spin_locks in the kernel to sleeping mutexes. > > This made RT tasks respond much faster to events. > > But non-RT (SCHED_OTHER) started suffering performance issues. > > When looking at the performance issues, we found that it was due to tasks > holding these sleeping spin_locks and being preempted. That is, the > preemption of holding spin_locks was causing more contention and slowing > things down tremendously. > > To first handle this, adaptive mutexes were introduced. These would spin > if the owner of the lock was still running, and would go to sleep if the > owner goes to sleep. This helped things quite a bit, but PREEMPT_RT was > still suffering a performance deficit compared to non-RT. > > This was because of the timer tick on SCHED_OTHER tasks that could > preempt a task holding a spin lock. > > NEED_RESCHED_LAZY was introduced to remedy this. It would be set for > SCHED_OTHER tasks and NEED_RESCHED for RT tasks. If the task was holding > a sleeping spin lock, the NEED_RESCHED_LAZY would not preempt the running > task, but NEED_RESCHED would. If the SCHED_OTHER task was not holding a > sleeping spin_lock it would be preempted regardless. > > This improved the performance of SCHED_OTHER tasks in PREEMPT_RT to be as > good as what was in vanilla. > > You see, LAZY was *created* for this purpose. Of letting the scheduler know > that the running task is in a critical section and the timer tick should > not preempt a SCHED_OTHER task. > I just wanted to extend this to SCHED_OTHER in user space too.
Currently it does not "let anyone know" it is running in a critical section though. Various paths (update_curr(), wake up) just do a 'lazy' resched until the timer tick has elapsed, or the task returns to usermode/idle at which point schedule() is called. And it does this only for FAIR tasks. That can well happen even if the currently running task is not in a critical section in the kernel at all. Sure, it may benefit critical sections in the upstream kernel but where is that explicit? I still feel we should not overload this in-kernel mechanism for userspace locking and complicate things. > > Yes, I have worked on RT projects before -- you would know better > > than anyone. :-D. But admittedly, I haven't got to work much with > > PREEMPT_RT systems. > > Just using RT policy to improve performance is not an RT project. I'm > talking about projects that if you miss a deadline things crash. Where the > project works very hard to make sure everything works as intended. No no no, I have done way more than applying just the RT policy. So that means you do not know me that well;-).. I have worked on audio driver latency, low latency audio, latency issues in vmalloc code, preempt tracers, irq tracepoints , wake up latency tracers and various scheduler overhead debug — many of those issues dealt with sub millisecond latency.. I also dealt with cpu idle issues in the hardware causing real time latency problems (see my past talks if interested). I was partly a hardware engineer when I started my career and have built circuits. I have Electronics and Computer engineering degrees. - Joel ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-10 14:07 ` Joel Fernandes @ 2025-02-10 19:48 ` Steven Rostedt 0 siblings, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-10 19:48 UTC (permalink / raw) To: Joel Fernandes Cc: Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Mon, 10 Feb 2025 09:07:27 -0500 Joel Fernandes <joel@joelfernandes.org> wrote: > On Thu, Feb 6, 2025 at 8:30 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Wed, 5 Feb 2025 22:07:12 -0500 > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > RT tasks don't have a time slice. They are affected by events. An external > > > > interrupt coming in, or a timer going off that states something is > > > > happening. Perhaps we could use this for SCHED_RR or maybe even > > > > SCHED_DEADLINE, as those do have time slices. > > > > > > > > But if it does get used, it should only be used when the task being > > > > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail > > > > its guarantees. > > > > > > > > > > Right, it would apply still to RR/DL though... > > > > But it would have to guarantee that the RR it is delaying is of the same > > priority, and that delaying the DL is not going to cause something to miss > > its deadline. > > See Peter comment: "Then pick another number; RT too has a max > scheduling latency number (on some random hardware). If you stay below > that, all is fine.". See mine and Sebastion's reply. Just adding another interrupt turn around, and you already lost. > > You see, LAZY was *created* for this purpose. 
Of letting the scheduler know > > that the running task is in a critical section and the timer tick should > > not preempt a SCHED_OTHER task. > > I just wanted to extend this to SCHED_OTHER in user space too. > > Currently it does not "let anyone know" it is running in a critical > section though. Various paths (update_curr(), wake up) just do a > 'lazy' resched until the timer tick has elapsed, or the task returns > to usermode/idle at which point schedule() is called. And it does this > only for FAIR tasks. That can well happen even if the currently > running task is not in a critical section in the kernel at all. Sure, > it may benefit critical sections in the upstream kernel but where is > that explicit? I still feel we should not overload this in-kernel > mechanism for userspace locking and complicate things. That's nice that you don't feel it. But I do. ;-) > > > > Yes, I have worked on RT projects before -- you would know better > > > than anyone. :-D. But admittedly, I haven't got to work much with > > > PREEMPT_RT systems. > > > > Just using RT policy to improve performance is not an RT project. I'm > > talking about projects that if you miss a deadline things crash. Where the > > project works very hard to make sure everything works as intended. > > No no no, I have done way more than applying just the RT policy. So > that means you do not know me that well;-).. I have worked on audio > driver latency, low latency audio, latency issues in vmalloc code, > preempt tracers, irq tracepoints , wake up latency tracers and various > scheduler overhead debug — many of those issues dealt with sub > millisecond latency.. I also dealt with cpu idle issues in the > hardware causing real time latency problems (see my past talks if > interested). I was partly a hardware engineer when I started my > career and have built circuits. I have Electronics and Computer > engineering degrees. None of that sounds to me like an RT project. 
I'm talking about robotics, industrial machines, autonomous driving, power plants, etc. Where everything is measured in WCET. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-06 13:30 ` Steven Rostedt 2025-02-06 13:44 ` Sebastian Andrzej Siewior 2025-02-10 14:07 ` Joel Fernandes @ 2025-02-10 17:20 ` David Laight 2025-02-10 17:27 ` Steven Rostedt 2 siblings, 1 reply; 66+ messages in thread From: David Laight @ 2025-02-10 17:20 UTC (permalink / raw) To: Steven Rostedt Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Thu, 6 Feb 2025 08:30:39 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Wed, 5 Feb 2025 22:07:12 -0500 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > RT tasks don't have a time slice. They are affected by events. An external > > > interrupt coming in, or a timer going off that states something is > > > happening. Perhaps we could use this for SCHED_RR or maybe even > > > SCHED_DEADLINE, as those do have time slices. > > > > > > But if it does get used, it should only be used when the task being > > > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail > > > its guarantees. > > > > > > > Right, it would apply still to RR/DL though... > > But it would have to guarantee that the RR it is delaying is of the same > priority, and that delaying the DL is not going to cause something to miss > its deadline. > > > > > > > In any case, if you want this to only work on FAIR tasks and not RT > > > > tasks, why is that only possible to do with rseq() + LAZY preemption > > > > and not Prakash's new API + all preemption modes? 
> > > > > > > > Also you can just ignore RT tasks (not that I'm saying that's a good > > > > idea but..) in taskshrd_delay_resched() in that patch if you ever > > > > wanted to do that. > > > > > > > > I just feel the RT latency thing is a non-issue AFAICS. > > > > > > Have you worked on any RT projects before? > > > > Heh.. I think maybe you misunderstood my statement, I was mentioning > > that I felt (similar to Peter I think) that NOT adopting this feature > > generically for all tasks due to a concern of 50us latency maybe does > > not make sense since poorly designed app / random hardware already > > have this issue. I think the main concern discussed in this thread is > > (and please CMIIW): > > We have code that has sub 100us latency and less. If some random user space > application applies this, adding 50us (or even 20us) will break these. And > this has nothing to do with poorly designed applications or hardware. > > By adding this as a feature that works everywhere, you will break use cases > that work today. Hmmm... you lose big-time anyway. All you need is a lot of network traffic to 'pinch' the process context until the hardware interrupt, NAPI softint code and rcu softint code completes. That can easily take several milliseconds. We managed to get a trace of a SCHED_FIFO task being pre-empted by a higher priority SCHED_FIFO task. The chosen target cpu was active running a worker thread. That got interrupted and ran softint code for several milliseconds. Other cpu became idle, but the scheduler rather expects to be able to run RT threads on the cpu it chooses. The same can happen if an RT thread grabs a mutex for a short time. All it takes is a hardware interrupt and the mutex hold time goes through the roof. You don't need a context switch to hurt you. The only userspace fix is to replace all the mutexes with atomic operations. (And even they can be griefsome because they are measurably slow.) David ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-10 17:20 ` David Laight @ 2025-02-10 17:27 ` Steven Rostedt 2025-02-10 19:44 ` Steven Rostedt 0 siblings, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-10 17:27 UTC (permalink / raw) To: David Laight Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Mon, 10 Feb 2025 17:20:59 +0000 David Laight <david.laight.linux@gmail.com> wrote: > Hmmm... you lose big-time anyway. > > All you need is a lot of network traffic 'pinch' the process context until > the hardware interrupt, NAPI softint code and rcu softint code completes. > That can easily take several milliseconds. Not on PREEMPT_RT. All that runs as threads. And this is a feature that we would like on RT for non RT tasks. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-10 17:27 ` Steven Rostedt @ 2025-02-10 19:44 ` Steven Rostedt 2025-02-10 21:51 ` David Laight 0 siblings, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-10 19:44 UTC (permalink / raw) To: David Laight Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Mon, 10 Feb 2025 12:27:00 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 10 Feb 2025 17:20:59 +0000 > David Laight <david.laight.linux@gmail.com> wrote: > > > Hmmm... you lose big-time anyway. > > > > All you need is a lot of network traffic 'pinch' the process context until > > the hardware interrupt, NAPI softint code and rcu softint code completes. > > That can easily take several milliseconds. > > Not on PREEMPT_RT. All that runs as threads. > > And this is a feature that we would like on RT for non RT tasks. Actually, this doesn't even need PREEMPT_RT to not hit your example. You can build and boot your system with interrupts as threads, and that also includes softirqs. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-10 19:44 ` Steven Rostedt @ 2025-02-10 21:51 ` David Laight 2025-02-10 21:58 ` Steven Rostedt 0 siblings, 1 reply; 66+ messages in thread From: David Laight @ 2025-02-10 21:51 UTC (permalink / raw) To: Steven Rostedt Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Mon, 10 Feb 2025 14:44:32 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 10 Feb 2025 12:27:00 -0500 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > On Mon, 10 Feb 2025 17:20:59 +0000 > > David Laight <david.laight.linux@gmail.com> wrote: > > > > > Hmmm... you lose big-time anyway. > > > > > > All you need is a lot of network traffic 'pinch' the process context until > > > the hardware interrupt, NAPI softint code and rcu softint code completes. > > > That can easily take several milliseconds. > > > > Not on PREEMPT_RT. All that runs as threads. > > > > And this is a feature that we would like on RT for non RT tasks. > > Actually, this doesn't even need PREEMPT_RT to not hit your example. You > can build and boot your system with interrupts as threads, and that also > includes softirqs. And then you lose lots of receive ethernet packets unless you change all the thread priorities. (I don't recall anything that makes them run at a low FIFO priority.) And don't mention the mess that happens if you have hardware that is raising an interrupt every 10ms and really needs the ISR to run within 10ms. Someone tried to do that into a VM, the interrupts turn up like busses.
None come for ages and then three arrive together :-) David ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-10 21:51 ` David Laight @ 2025-02-10 21:58 ` Steven Rostedt 0 siblings, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-10 21:58 UTC (permalink / raw) To: David Laight Cc: Joel Fernandes, Prakash Sangappa, Peter Zijlstra, linux-kernel, linux-trace-kernel, Thomas Gleixner, Ankur Arora, Linus Torvalds, linux-mm, x86, Andrew Morton, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, Boris Ostrovsky, Konrad Wilk, jgross, Andrew.Cooper3, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, Joseph Salisbury, broonie On Mon, 10 Feb 2025 21:51:16 +0000 David Laight <david.laight.linux@gmail.com> wrote: > And then you lose lots of receive ethernet packets unless you change > all the thread priorities. > (I don't recall anything that makes them run at a low FIFO prioriry.) Interrupt threads by default run at FIFO 50. The softirqs will run with the threads that raise them so at the same priority. See "sched_set_fifo()". -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-01-31 22:58 ` [RFC][PATCH 1/2] sched: Extended scheduler time slice Steven Rostedt 2025-02-01 11:59 ` Peter Zijlstra @ 2025-02-01 14:35 ` Mathieu Desnoyers 2025-02-01 23:08 ` Steven Rostedt 1 sibling, 1 reply; 66+ messages in thread From: Mathieu Desnoyers @ 2025-02-01 14:35 UTC (permalink / raw) To: Steven Rostedt, linux-kernel, linux-trace-kernel Cc: Thomas Gleixner, Peter Zijlstra, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On 2025-01-31 23:58, Steven Rostedt wrote: [...] > @@ -148,6 +160,18 @@ struct rseq { > */ > __u32 mm_cid; > > + /* > + * The cr_counter is a way for user space to inform the kernel that > + * it is in a critical section. If bits 1-31 are set, then the > + * kernel may grant the thread a bit more time (but there is no > + * guarantee of how much time or if it is granted at all). If the > + * kernel does grant the thread extra time, it will set bit 0 to > + * inform user space that it has granted the thread more time and that > + * user space should call yield() as soon as it leaves its critical Does it specifically need to be yield(), or can it be just "entering the kernel" through any system call or trap ? [...] 
> diff --git a/kernel/rseq.c b/kernel/rseq.c > index 9de6e35fe679..b792e36a3550 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -339,6 +339,36 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) > force_sigsegv(sig); > } > > +bool rseq_delay_resched(void) > +{ > + struct task_struct *t = current; > + u32 flags; > + > + if (!t->rseq) > + return false; > + > + /* Make sure the cr_counter exists */ > + if (current->rseq_len <= offsetof(struct rseq, cr_counter)) > + return false; It would be clearer/more precise with < offsetofend(struct rseq, cr_counter) Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-01 14:35 ` Mathieu Desnoyers @ 2025-02-01 23:08 ` Steven Rostedt 2025-02-01 23:18 ` Linus Torvalds 0 siblings, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-01 23:08 UTC (permalink / raw) To: Mathieu Desnoyers Cc: linux-kernel, linux-trace-kernel, Thomas Gleixner, Peter Zijlstra, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Sat, 1 Feb 2025 15:35:13 +0100 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > On 2025-01-31 23:58, Steven Rostedt wrote: > > [...] > > > @@ -148,6 +160,18 @@ struct rseq { > > */ > > __u32 mm_cid; > > > > + /* > > + * The cr_counter is a way for user space to inform the kernel that > > + * it is in a critical section. If bits 1-31 are set, then the > > + * kernel may grant the thread a bit more time (but there is no > > + * guarantee of how much time or if it is granted at all). If the > > + * kernel does grant the thread extra time, it will set bit 0 to > > + * inform user space that it has granted the thread more time and that > > + * user space should call yield() as soon as it leaves its critical > > Does it specifically need to be yield(), or can it be just "entering > the kernel" through any system call or trap ? No it doesn't need to be yield. That just seemed like the obvious system call to use. But any system call would force a schedule. We could just optimize yield() though. > > [...] 
> > > diff --git a/kernel/rseq.c b/kernel/rseq.c > > index 9de6e35fe679..b792e36a3550 100644 > > --- a/kernel/rseq.c > > +++ b/kernel/rseq.c > > @@ -339,6 +339,36 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) > > force_sigsegv(sig); > > } > > > > +bool rseq_delay_resched(void) > > +{ > > + struct task_struct *t = current; > > + u32 flags; > > + > > + if (!t->rseq) > > + return false; > > + > > + /* Make sure the cr_counter exists */ > > + if (current->rseq_len <= offsetof(struct rseq, cr_counter)) > > + return false; > > It would be clearer/more precise with < offsetofend(struct rseq, cr_counter) Ah yeah, thanks! -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-01 23:08 ` Steven Rostedt @ 2025-02-01 23:18 ` Linus Torvalds 2025-02-01 23:35 ` Linus Torvalds 2025-02-02 3:22 ` Steven Rostedt 0 siblings, 2 replies; 66+ messages in thread From: Linus Torvalds @ 2025-02-01 23:18 UTC (permalink / raw) To: Steven Rostedt Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Thomas Gleixner, Peter Zijlstra, Ankur Arora, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Sat, 1 Feb 2025 at 15:08, Steven Rostedt <rostedt@goodmis.org> wrote: > > No it doesn't need to be yield. That just seemed like the obvious > system call to use. But any system call would force a schedule. We > could just optimize yield() though. No, "optimizing" yield is not on the table. Why? Because it has *semantics* that people depend on because it has historical meaning. Things like "move the process to the end of the scheduling queue of that priority". That may sound like a good thing, but it's ABSOLUTELY NOT what you should actually do unless you know *exactly* what the system behavior is. For example, maybe the reason the kernel set NEED_RESCHED_LAZY is that a higher-priority process is ready to run. You haven't used up your time slice yet, but something more important needs the CPU. If you call "sched_yield()", sure, you'll run that higher priority thing. So far so good. But you *also* are literally telling the scheduler to put you at the back of the queue, despite the fact that maybe you are still in line to be run for *your* priority level. So now your performance will absolutely suck, because you just told the scheduler that you are not important, and other processes in your priority level should get priority. So no. 
"yield()" does not mean "just reschedule". It really means "yield my position in the scheduling queue". You are literally better off using absolutely *ANY* other system call. The fact that you are confused about this kind of very basic issue does imply that this patch should absolutely be handled by somebody who knows the scheduler better. Linus ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-01 23:18 ` Linus Torvalds @ 2025-02-01 23:35 ` Linus Torvalds 2025-02-02 3:26 ` Steven Rostedt 2025-02-02 3:22 ` Steven Rostedt 1 sibling, 1 reply; 66+ messages in thread From: Linus Torvalds @ 2025-02-01 23:35 UTC (permalink / raw) To: Steven Rostedt Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Thomas Gleixner, Peter Zijlstra, Ankur Arora, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Sat, 1 Feb 2025 at 15:18, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But you *also* are literally telling the scheduler to put you at the > back of the queue, despite the fact that maybe you are still in line > to be run for *your* priority level. So now your performance will > absolutely suck, because you just told the scheduler that you are not > important, and other processes in your priority level should get > priority. Side note: we've messed with this before, exactly because people have been confused about this before. I have this dim memory of us having at one point decided to actively ignore that "put me last" part of sched_yield because we had huge performance problems with people misusing "sched_yield()". I didn't actually check what current active schedulers do. I would not be in the least surprised if different schedulers end up having very different behavior (particularly any RT scheduler). But the moral of the story ends up being the same: don't use yield() unless you are on some embedded platform and know exactly what the scheduling pattern is - or if you really want to say "I don't want to run now, do *anything* else, my performance doesn't matter". 
A traditional (reasonable) situation might be "I started another task or thread, I need for it to run first and initialize things, I'm polling for that to be done but I don't want to busy-loop if there is real work to be done". Where it really is a complete hack: "my performance doesn't matter because it's a one-time startup thing, and I couldn't be arsed to have real locking". In fact, the whole "I couldn't be arsed" is basically the tag-line for "yield()". Maybe we should rename the system call. Linus ^ permalink raw reply [flat|nested] 66+ messages in thread
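The "poll for another thread's startup without busy-looping" situation described here looks roughly like the sketch below (hypothetical helper names). The yield loop is the hack being discussed; a condition variable or futex is the correct tool when latency matters.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int initialized;

static void *init_thread(void *arg)
{
	(void)arg;
	/* ... one-time setup would happen here ... */
	atomic_store(&initialized, 1);
	return NULL;
}

static void wait_for_init(void)
{
	/*
	 * The "couldn't be arsed to have real locking" pattern: give up the
	 * CPU until the other thread finishes its setup. Tolerable only for
	 * one-time startup where performance doesn't matter.
	 */
	while (!atomic_load(&initialized))
		sched_yield();
}
```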
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-01 23:35 ` Linus Torvalds @ 2025-02-02 3:26 ` Steven Rostedt 0 siblings, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-02 3:26 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Thomas Gleixner, Peter Zijlstra, Ankur Arora, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Sat, 1 Feb 2025 15:35:53 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > I didn't actually check what current active schedulers do. I would > not be in the least surprised if different schedulers end up having > very different behavior (particularly any RT scheduler). The only real use of sched yield() I have ever seen in practice was in a RT application where all tasks were given the same RT FIFO priority, and the application would use the yield() system call to trigger the next task. That's also the only use case that really does have a strict defined behavior for yield(). In SCHED_FIFO, as the name suggests, tasks run in a first-in first-out order. yield() will put the task at the end of that queue. You can use this method to implement a strict application defined scheduler. -- Steve. ^ permalink raw reply [flat|nested] 66+ messages in thread
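The application-defined scheduling pattern Steve describes can be sketched as follows: all threads share one SCHED_FIFO priority, and sched_yield() moves the caller to the tail of the FIFO queue so the peer can run. This is an illustrative sketch, not from the thread: setting SCHED_FIFO needs CAP_SYS_NICE, so it falls back to the default policy when refused, and a shared turn variable enforces the handoff order so the demo also terminates under SCHED_OTHER (under strict single-CPU SCHED_FIFO, the yield in the spin loop is what keeps the spinner from monopolizing the CPU).

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

#define STEPS 6

static atomic_int turn;		/* whose turn it is: 0 or 1 */
static atomic_int pos;
static int sequence[STEPS];	/* records the observed execution order */

static void *worker(void *arg)
{
	int me = (int)(long)arg;

	for (int i = 0; i < STEPS / 2; i++) {
		/*
		 * Under SCHED_FIFO at equal priority on one CPU this spin
		 * would never be preempted; sched_yield() moves the caller
		 * to the tail of the run queue so the other thread can run.
		 */
		while (atomic_load(&turn) != me)
			sched_yield();
		sequence[atomic_fetch_add(&pos, 1)] = me;
		atomic_store(&turn, !me);
	}
	return NULL;
}

/* Returns 0 when the recorded order strictly alternates 0,1,0,1,... */
static int run_handoff(void)
{
	struct sched_param sp = { .sched_priority = 10 };
	pthread_t a, b;

	/* Needs CAP_SYS_NICE; fall back to the default policy if refused. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		fprintf(stderr, "SCHED_FIFO unavailable, staying on the default policy\n");

	pthread_create(&a, NULL, worker, (void *)0L);
	pthread_create(&b, NULL, worker, (void *)1L);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	for (int i = 0; i < STEPS; i++)
		if (sequence[i] != i % 2)
			return -1;
	return 0;
}
```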
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-01 23:18 ` Linus Torvalds 2025-02-01 23:35 ` Linus Torvalds @ 2025-02-02 3:22 ` Steven Rostedt 2025-02-02 7:22 ` Matthew Wilcox 1 sibling, 1 reply; 66+ messages in thread From: Steven Rostedt @ 2025-02-02 3:22 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Thomas Gleixner, Peter Zijlstra, Ankur Arora, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Sat, 1 Feb 2025 15:18:15 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, 1 Feb 2025 at 15:08, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > No it doesn't need to be yield. That just seemed like the obvious > > system call to use. But any system call would force a schedule. We > > could just optimize yield() though. > > No, "optimizing" yield is not on the table. Note, I only mentioned this because Peter had it in his patch he created when I first brought this up. https://lore.kernel.org/all/20231030132949.GA38123@noisy.programming.kicks-ass.net/ SYSCALL_DEFINE0(sched_yield) { + if (current->rseq_sched_delay) { + trace_printk("yield -- made it\n"); + schedule(); + return 0; + } + do_sched_yield(); return 0; } > > Why? Because it has *semantics* that people depend on because it has > historical meaning. Things like "move the process to the end of the > scheduling queue of that priority". Yes, I know the historical meaning. It's also a system call I've been telling people to avoid for the last 20 years. I haven't seen or heard about any application that actually depends on that behavior today. 
> > That may sound like a good thing, but it's ABSOLUTELY NOT what you > should actually do unless you know *exactly* what the system behavior > is. > > For example, maybe the reason the kernel set NEED_RESCHED_LAZY is that > a higher-priority process is ready to run. You haven't used up your > time slice yet, but something more important needs the CPU. If it's RT, it would set NEED_RESCHED and not LAZY. This is only for SCHED_OTHER. Sure, it could be a task with a negative nice value. > > So no. "yield()" does not mean "just reschedule". It rally means > "yield my position in the scheduling queue". The optimization would only treat sched_yield differently if the task had asked for an extended time slice, and it was granted. Like in Peter's patch, if that was the case, it would just call schedule and return. This would not affect yield() for any other task not implementing the rseq extend counter. > > You are literally better off using absolutely *ANY* other system call. > > The fact that you are confused about this kind of very basic issue > does imply that this patch should absolutely be handled by somebody > who knows the scheduler better. > I'm not confused. And before seeing Peter's use of yield(), I was reluctant to use it for the very same reasons you mentioned above. In my test programs, I was simply using getuid(), as that was one of the quickest syscalls. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-02 3:22 ` Steven Rostedt @ 2025-02-02 7:22 ` Matthew Wilcox 2025-02-02 22:29 ` Steven Rostedt 0 siblings, 1 reply; 66+ messages in thread From: Matthew Wilcox @ 2025-02-02 7:22 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Thomas Gleixner, Peter Zijlstra, Ankur Arora, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Sat, Feb 01, 2025 at 10:22:08PM -0500, Steven Rostedt wrote: > And before seeing Peter's use of yield(), I was reluctant to use it for > the very same reasons you mentioned above. In my test programs, I was > simply using getuid(), as that was one of the quickest syscalls. Is getuid() guaranteed to issue a syscall? It feels like the kind of information that a tricksy libc could cache. Traditionally, I think we've used getppid() as the canonical "very cheap syscall" as no layer can cache that information. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice 2025-02-02 7:22 ` Matthew Wilcox @ 2025-02-02 22:29 ` Steven Rostedt 0 siblings, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-02-02 22:29 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Thomas Gleixner, Peter Zijlstra, Ankur Arora, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie On Sun, 2 Feb 2025 07:22:08 +0000 Matthew Wilcox <willy@infradead.org> wrote: > On Sat, Feb 01, 2025 at 10:22:08PM -0500, Steven Rostedt wrote: > > And before seeing Peter's use of yield(), I was reluctant to use it for > > the very same reasons you mentioned above. In my test programs, I was > > simply using getuid(), as that was one of the quickest syscalls. > > Is getuid() guaranteed to issue a syscall? It feels like the kind of > information that a tricksy libc could cache. Traditionally, I think > we've used getppid() as the canonical "very cheap syscall" as no layer > can cache that information. Maybe that was what I used. Can't remember. And I think I even open coded it using syscall() to not even rely on glibc. -- Steve ^ permalink raw reply [flat|nested] 66+ messages in thread
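The open-coded approach mentioned above can be sketched like this (the helper name is hypothetical): invoking the raw syscall number via syscall() bypasses libc entirely, so no caching layer can elide the kernel entry, and getppid is used because no layer can legitimately cache a parent PID that may change.

```c
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Enter the kernel with minimal work done: the raw syscall cannot be
 * satisfied from a libc cache, so it is guaranteed to trap.
 */
static inline long cheap_kernel_entry(void)
{
	return syscall(SYS_getppid);
}
```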
* [RFC][PATCH 2/2] sched: Shorten time that tasks can extend their time slice for 2025-01-31 22:58 [RFC][PATCH 0/2] sched: Extended Scheduler Time Slice revisited Steven Rostedt 2025-01-31 22:58 ` [RFC][PATCH 1/2] sched: Extended scheduler time slice Steven Rostedt @ 2025-01-31 22:58 ` Steven Rostedt 1 sibling, 0 replies; 66+ messages in thread From: Steven Rostedt @ 2025-01-31 22:58 UTC (permalink / raw) To: linux-kernel, linux-trace-kernel Cc: Thomas Gleixner, Peter Zijlstra, Ankur Arora, Linus Torvalds, linux-mm, x86, akpm, luto, bp, dave.hansen, hpa, juri.lelli, vincent.guittot, willy, mgorman, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk, jgross, andrew.cooper3, Joel Fernandes, Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers, Clark Williams, bigeasy, daniel.wagner, joseph.salisbury, broonie From: Steven Rostedt <rostedt@goodmis.org> If a task sets its rseq bit to notify the kernel that it is in a critical section, the kernel currently gives it a full time slice to get out of that section. But that could be anywhere from 1ms to 10ms depending on the CONFIG_HZ value, and this can cause unwanted latency in other applications. Limit the extra time to 50us, which should be long enough for tasks to get out of their critical sections. If a task has a critical section longer than 50us, then it should be using futexes anyway. That is, system calls should not be a bottleneck for critical sections longer than 50us. This makes the code rely not only on CONFIG_RSEQ but also CONFIG_SCHED_HRTICK as it relies on a timer that can be set 50us into the future. The flag rseq_sched_delay is added to the task struct. The exit_to_user_mode_loop() will return the _TIF_NEED_RESCHED_LAZY flag if it granted the task an extended time slice. After interrupts are disabled and the code path is on its way to user space, a new function rseq_delay_resched_fini() is called with the return value of exit_to_user_mode_loop() (ti_work). 
If the _TIF_NEED_RESCHED_LAZY is set in the ti_work, then it will check to see if the task's rseq_sched_delay is already set (in case the task came into user space for some other reason), and if it is not set, then it will enable the schedule timer to trigger again in 50us and set the rseq_sched_delay flag. If that timer goes off, and the current task has the rseq_sched_delay flag set, it will then force a schedule, and also clear the rseq cr_counter flag stating that it had extended time, as user space no longer needs to schedule. sched_yield() has been modified to check whether the task was granted an extended time slice and to do a trace_printk() if it was. This is for testing purposes and will likely be removed in later versions of this patch. This is based on Peter Zijlstra's code: https://lore.kernel.org/all/20231030132949.GA38123@noisy.programming.kicks-ass.net/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> --- include/linux/entry-common.h | 2 + include/linux/sched.h | 11 +++++- kernel/entry/common.c | 2 +- kernel/rseq.c | 76 +++++++++++++++++++++++++++++++++--- kernel/sched/core.c | 16 ++++++++ kernel/sched/syscalls.c | 6 +++ 6 files changed, 106 insertions(+), 7 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index fc61d0205c97..1e0970276726 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -330,6 +330,8 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs) arch_exit_to_user_mode_prepare(regs, ti_work); + rseq_delay_resched_fini(ti_work); + /* Ensure that kernel state is sane for a return to userspace */ kmap_assert_nomap(); lockdep_assert_irqs_disabled(); diff --git a/include/linux/sched.h b/include/linux/sched.h index 8e983d8cf72d..3c9d3ca9c5ad 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -967,6 +967,9 @@ struct task_struct { #ifdef CONFIG_RT_MUTEXES unsigned sched_rt_mutex:1; #endif +#if defined(CONFIG_RSEQ) && defined(CONFIG_SCHED_HRTICK) + unsigned 
rseq_sched_delay:1; +#endif /* Bit to tell TOMOYO we're in execve(): */ unsigned in_execve:1; @@ -2206,16 +2209,22 @@ static inline bool owner_on_cpu(struct task_struct *owner) unsigned long sched_cpu_util(int cpu); #endif /* CONFIG_SMP */ -#ifdef CONFIG_RSEQ +#if defined(CONFIG_RSEQ) && defined(CONFIG_SCHED_HRTICK) extern bool rseq_delay_resched(void); +extern void rseq_delay_resched_fini(unsigned long ti_work); +extern void rseq_delay_resched_tick(void); #else static inline bool rseq_delay_resched(void) { return false; } +static inline void rseq_delay_resched_fini(unsigned long ti_work) { } +static inline void rseq_delay_resched_tick(void) { } #endif +extern void hrtick_local_start(u64 delay); + #ifdef CONFIG_SCHED_CORE extern void sched_core_free(struct task_struct *tsk); extern void sched_core_fork(struct task_struct *p); diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 50e35f153bf8..349f274d7185 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -142,7 +142,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs, } /* Return the latest work state for arch_exit_to_user_mode() */ - return ti_work; + return ti_work | ignore_mask; } /* diff --git a/kernel/rseq.c b/kernel/rseq.c index b792e36a3550..701c4801a111 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -339,35 +339,101 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) force_sigsegv(sig); } +#ifdef CONFIG_SCHED_HRTICK +void rseq_delay_resched_fini(unsigned long ti_work) +{ + extern void hrtick_local_start(u64 delay); + struct task_struct *t = current; + + if (!t->rseq) + return; + + if (!(ti_work & _TIF_NEED_RESCHED_LAZY)) { + /* Clear any previous setting of rseq_sched_delay */ + t->rseq_sched_delay = 0; + return; + } + + /* No need to start the timer if it is already started */ + if (t->rseq_sched_delay) + return; + + /* + * IRQs off, guaranteed to return to userspace, start timer on this CPU + * to limit the 
resched-overdraft. + * + * If your critical section is longer than 50 us you get to keep the + * pieces. + */ + + t->rseq_sched_delay = 1; + hrtick_local_start(50 * NSEC_PER_USEC); +} + bool rseq_delay_resched(void) { struct task_struct *t = current; u32 flags; if (!t->rseq) - return false; + goto nodelay; /* Make sure the cr_counter exists */ if (current->rseq_len <= offsetof(struct rseq, cr_counter)) - return false; + goto nodelay; /* If this were to fault, it would likely cause a schedule anyway */ if (copy_from_user_nofault(&flags, &t->rseq->cr_counter, sizeof(flags))) - return false; + goto nodelay; if (!(flags & RSEQ_CR_FLAG_IN_CRITICAL_SECTION_MASK)) - return false; + goto nodelay; trace_printk("Extend time slice\n"); flags |= RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED; if (copy_to_user_nofault(&t->rseq->cr_counter, &flags, sizeof(flags))) { trace_printk("Faulted writing rseq\n"); - return false; + goto nodelay; } return true; + +nodelay: + t->rseq_sched_delay = 0; + return false; +} + +void rseq_delay_resched_tick(void) +{ + struct task_struct *t = current; + + if (t->rseq_sched_delay) { + u32 flags; + + set_tsk_need_resched(t); + t->rseq_sched_delay = 0; + trace_printk("timeout -- force resched\n"); + + /* + * Now remove the flag that denotes it was extended, as this will + * force a schedule and user space no longer needs to. 
+ */ + + /* Just in case user space unregistered its rseq */ + if (!t->rseq) + return; + + if (copy_from_user_nofault(&flags, &t->rseq->cr_counter, sizeof(flags))) + return; + + flags &= ~RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED; + + if (copy_to_user_nofault(&t->rseq->cr_counter, &flags, sizeof(flags))) + return; + } } +#endif /* CONFIG_SCHED_HRTICK */ #ifdef CONFIG_DEBUG_RSEQ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3e5a6bf587f9..77d671dcd161 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -815,6 +815,7 @@ void update_rq_clock(struct rq *rq) static void hrtick_clear(struct rq *rq) { + rseq_delay_resched_tick(); if (hrtimer_active(&rq->hrtick_timer)) hrtimer_cancel(&rq->hrtick_timer); } @@ -830,6 +831,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer) WARN_ON_ONCE(cpu_of(rq) != smp_processor_id()); + rseq_delay_resched_tick(); + rq_lock(rq, &rf); update_rq_clock(rq); rq->donor->sched_class->task_tick(rq, rq->curr, 1); @@ -903,6 +906,16 @@ void hrtick_start(struct rq *rq, u64 delay) #endif /* CONFIG_SMP */ +void hrtick_local_start(u64 delay) +{ + struct rq *rq = this_rq(); + struct rq_flags rf; + + rq_lock(rq, &rf); + hrtick_start(rq, delay); + rq_unlock(rq, &rf); +} + static void hrtick_rq_init(struct rq *rq) { #ifdef CONFIG_SMP @@ -6711,6 +6724,9 @@ static void __sched notrace __schedule(int sched_mode) picked: clear_tsk_need_resched(prev); clear_preempt_need_resched(); +#ifdef CONFIG_RSEQ + prev->rseq_sched_delay = 0; +#endif #ifdef CONFIG_SCHED_DEBUG rq->last_seen_need_resched_ns = 0; #endif diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index ff0e5ab4e37c..1d981599e890 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -1379,6 +1379,12 @@ static void do_sched_yield(void) */ SYSCALL_DEFINE0(sched_yield) { + if (current->rseq_sched_delay) { + trace_printk("yield -- made it\n"); + schedule(); + return 0; + } + do_sched_yield(); return 0; } -- 2.45.2 ^ permalink raw reply [flat|nested] 66+ 
messages in thread
end of thread, other threads:[~2025-02-12 15:19 UTC | newest] Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-01-31 22:58 [RFC][PATCH 0/2] sched: Extended Scheduler Time Slice revisited Steven Rostedt 2025-01-31 22:58 ` [RFC][PATCH 1/2] sched: Extended scheduler time slice Steven Rostedt 2025-02-01 11:59 ` Peter Zijlstra 2025-02-01 12:47 ` Steven Rostedt 2025-02-01 18:11 ` Peter Zijlstra 2025-02-01 23:06 ` Steven Rostedt 2025-02-03 8:43 ` Peter Zijlstra 2025-02-03 8:53 ` Peter Zijlstra 2025-02-03 16:45 ` Steven Rostedt 2025-02-04 3:28 ` Suleiman Souhlal 2025-02-04 3:57 ` Steven Rostedt 2025-02-04 9:16 ` Peter Zijlstra 2025-02-04 12:51 ` Steven Rostedt 2025-02-04 13:16 ` Steven Rostedt 2025-02-04 15:05 ` Steven Rostedt 2025-02-04 15:30 ` Peter Zijlstra 2025-02-04 16:11 ` Steven Rostedt 2025-02-05 9:07 ` Peter Zijlstra 2025-02-05 13:10 ` Steven Rostedt 2025-02-05 13:44 ` Steven Rostedt 2025-02-04 22:44 ` Prakash Sangappa 2025-02-05 0:56 ` Joel Fernandes 2025-02-05 3:04 ` Steven Rostedt 2025-02-05 5:09 ` Joel Fernandes 2025-02-05 13:16 ` Steven Rostedt 2025-02-05 13:38 ` Steven Rostedt 2025-02-05 21:08 ` Prakash Sangappa 2025-02-05 21:19 ` Steven Rostedt 2025-02-05 21:33 ` Steven Rostedt 2025-02-05 21:36 ` Prakash Sangappa 2025-02-06 3:07 ` Joel Fernandes 2025-02-06 13:30 ` Steven Rostedt 2025-02-06 13:44 ` Sebastian Andrzej Siewior 2025-02-06 13:48 ` Peter Zijlstra 2025-02-06 13:53 ` Sebastian Andrzej Siewior 2025-02-06 13:57 ` Peter Zijlstra 2025-02-06 14:20 ` Steven Rostedt 2025-02-06 14:22 ` Sebastian Andrzej Siewior 2025-02-06 14:27 ` Peter Zijlstra 2025-02-06 14:57 ` Steven Rostedt 2025-02-06 15:01 ` Sebastian Andrzej Siewior 2025-02-10 19:43 ` Steven Rostedt 2025-02-10 22:04 ` David Laight 2025-02-10 22:15 ` Steven Rostedt 2025-02-11 8:21 ` Sebastian Andrzej Siewior 2025-02-11 10:57 ` Peter Zijlstra 2025-02-11 15:28 ` Steven Rostedt 2025-02-12 12:11 ` Sebastian Andrzej Siewior 
2025-02-12 15:00 ` Steven Rostedt 2025-02-12 15:18 ` Sebastian Andrzej Siewior 2025-02-10 14:07 ` Joel Fernandes 2025-02-10 19:48 ` Steven Rostedt 2025-02-10 17:20 ` David Laight 2025-02-10 17:27 ` Steven Rostedt 2025-02-10 19:44 ` Steven Rostedt 2025-02-10 21:51 ` David Laight 2025-02-10 21:58 ` Steven Rostedt 2025-02-01 14:35 ` Mathieu Desnoyers 2025-02-01 23:08 ` Steven Rostedt 2025-02-01 23:18 ` Linus Torvalds 2025-02-01 23:35 ` Linus Torvalds 2025-02-02 3:26 ` Steven Rostedt 2025-02-02 3:22 ` Steven Rostedt 2025-02-02 7:22 ` Matthew Wilcox 2025-02-02 22:29 ` Steven Rostedt 2025-01-31 22:58 ` [RFC][PATCH 2/2] sched: Shorten time that tasks can extend their time slice for Steven Rostedt