From: Ankur Arora <ankur.a.arora@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ankur Arora <ankur.a.arora@oracle.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
akpm@linux-foundation.org, luto@kernel.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
willy@infradead.org, mgorman@suse.de, rostedt@goodmis.org,
jon.grimm@amd.com, bharata@amd.com, raghavendra.kt@amd.com,
boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
jgross@suse.com, andrew.cooper3@citrix.com
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Date: Tue, 19 Sep 2023 12:05:07 -0700 [thread overview]
Message-ID: <874jjq56ho.fsf@oracle.com> (raw)
In-Reply-To: <87cyyfxd4k.ffs@tglx>
Thomas Gleixner <tglx@linutronix.de> writes:
> On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
>> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>>> > The problem with the REP prefix (and Xen hypercalls) is that
>>> > they're long running instructions and it becomes fundamentally
>>> > impossible to put a cond_resched() in.
>>> >
>>> >> Yes. I'm starting to think that the only sane solution is to
>>> >> limit cases that can do this a lot, and the "instruction pointer
>>> >> region" approach would certainly work.
>>> >
>>> > From a code locality / I-cache POV, I think a sorted list of
>>> > (non overlapping) ranges might be best.
>>>
>>> Yeah, agreed. There are a few problems with doing that though.
>>>
>>> I was thinking of using a check of this kind to schedule out when
>>> it is executing in this "reschedulable" section:
>>> !preempt_count() && in_resched_function(regs->rip);
>>>
>>> For preemption=full, this should mostly work.
>>> For preemption=voluntary, though, this'll only work with out-of-line
>>> locks, not if the lock is inlined.
>>>
>>> (Both should have problems with __this_cpu_* and the like, but
>>> maybe we can handwave that away with sparse/objtool etc.)
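
To make that concrete, a minimal sketch of such a range check, assuming
a sorted, non-overlapping range table (the names resched_range,
resched_ranges[] and in_resched_function() are illustrative, not from
the series):

/*
 * Illustrative only: a sorted array of non-overlapping text ranges
 * that are considered safe to reschedule from, searched by IP.
 */
struct resched_range {
        unsigned long start;
        unsigned long end;
};

/* Sorted by ->start, non-overlapping; e.g. emitted at link time. */
static const struct resched_range resched_ranges[] = { /* ... */ };

static bool in_resched_function(unsigned long ip)
{
        int lo = 0, hi = ARRAY_SIZE(resched_ranges) - 1;

        while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;

                if (ip < resched_ranges[mid].start)
                        hi = mid - 1;
                else if (ip >= resched_ranges[mid].end)
                        lo = mid + 1;
                else
                        return true;
        }
        return false;
}
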
>>
>> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
>> thing, and then only search the range when TIF flag is set.
>>
>> And I'm thinking it might be a good idea to have objtool validate the
>> range only contains simple instructions, the moment it contains control
>> flow I'm thinking it's too complicated.
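
A sketch of that combined check at irqentry exit (TIF_ALLOW_RESCHED is
from this series; the helper name and placement are made up for
illustration):

/*
 * The cheap TIF test gates the more expensive range search; only then
 * do we consider preempting from the interrupt return path.
 */
static bool irqentry_allow_resched(struct pt_regs *regs)
{
        return test_thread_flag(TIF_ALLOW_RESCHED) &&
               !preempt_count() &&
               in_resched_function(instruction_pointer(regs));
}
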
>
> Can we take a step back and look at the problem from a scheduling
> perspective?
>
> The basic operation of a non-preemptible kernel is time slice
> scheduling, which means that a task can run more or less undisturbed for
> a full time slice once it gets on the CPU unless it schedules away
> voluntarily via a blocking operation.
>
> This works pretty well as long as everything runs in userspace as the
> preemption points in the return to user space path are independent of
> the preemption model.
>
> These preemption points handle both time slice exhaustion and priority
> based preemption.
>
> With PREEMPT=NONE these are the only available preemption points.
>
> That means that kernel code can run more or less indefinitely until it
> schedules out or returns to user space, which is obviously not possible
> for kernel threads.
>
> To prevent starvation the kernel gained voluntary preemption points,
> i.e. cond_resched(), which has to be added manually to code as a
> developer sees fit.
>
> Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
> additional preemption points. might_resched() utilizes the existing
> might_sleep() debug points, which are in code paths which might block on
> a contended resource. These debug points are mostly in core and
> infrastructure code and are in code paths which can block anyway. The
> only difference is that they allow preemption even when the resource is
> uncontended.
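
For reference, a condensed sketch of how those debug points double as
preemption points (config symbol spelling has varied across releases,
so take the exact names with a grain of salt):

#ifdef CONFIG_PREEMPT_VOLUNTARY
# define might_resched()        _cond_resched()         /* may schedule */
#else
# define might_resched()        do { } while (0)        /* debug only */
#endif

#define might_sleep() \
        do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
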
>
> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preempt_count as a potential preemption point.
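
And the full-preemption counterpart, roughly as in
include/linux/preempt.h: the zero transition of the count is what
triggers the preemption check.

#define preempt_enable()                                        \
do {                                                            \
        barrier();                                              \
        if (unlikely(preempt_count_dec_and_test()))             \
                __preempt_schedule();                           \
} while (0)
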
>
> Now we have the situation of long running data copies or data clear
> operations which run fully in hardware, but can be interrupted. As the
> interrupt return to kernel mode does not preempt in the NONE and
> VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
> chunk size and adding cond_resched() again.
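
I.e. the familiar pattern, sketched here with an arbitrary chunk size
(not from the series, purely illustrative):

static void clear_region_chunked(void *addr, unsigned long len)
{
        while (len) {
                /* chunk size is a heuristic, which is exactly the problem */
                unsigned long n = min(len, (unsigned long)SZ_1M);

                memset(addr, 0, n);
                addr += n;
                len -= n;
                cond_resched();
        }
}
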
>
> That's ugly and does not work for long lasting hardware operations so we
> ended up with the suggestion of TIF_ALLOW_RESCHED to work around
> that. But again this needs to be manually annotated in the same way as an
> IP-range-based preemption scheme requires annotation.
>
> TBH. I detest all of this.
>
> Both cond_resched() and might_sleep/sched() are completely random
> mechanisms as seen from time slice operation, and the data chunk based
> mechanism is just a heuristic which works as well as heuristics tend to
> work. allow_resched() is not any different, and IP based preemption
> mechanisms are not going to be any better.

Agreed. I was looking at how to add resched sections etc., and in
addition to the randomness, the choice of where exactly to add them
seemed quite fuzzy. A recipe for future cruft.

> The approach here is: prevent the scheduler from making decisions and then
> mitigate the fallout with heuristics.
>
> That's just backwards as it moves resource control out of the scheduler
> into random code which has absolutely no business to do resource
> control.
>
> We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
> held sections became preemptible caused even more preemption activity
> than on a PREEMPT=FULL kernel. The worst side effect of that was
> extensive lock contention.
>
> The way we addressed that was to add a lazy preemption mode, which
> tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
> preempt tasks which all belong to the SCHED_OTHER scheduling class. This
> works pretty well and gains back a massive amount of performance for the
> non-realtime throughput oriented tasks without affecting the
> schedulability of real-time tasks at all. IOW, it does not take control
> away from the scheduler. It cooperates with the scheduler and leaves the
> ultimate decisions to it.
>
> I think we can do something similar for the problem at hand, which
> avoids most of these heuristic horrors and control boundary violations.
>
> The main issue is that long running operations do not honour the time
> slice and we work around that with cond_resched() and now have ideas
> with this new TIF bit and IP ranges.
>
> None of that is really well defined with respect to time slices. In fact
> it's not defined at all versus any aspect of scheduling behaviour.
>
> What about the following:
>
> 1) Keep preemption count and the real preemption points enabled
> unconditionally. That's not more overhead than the current
> DYNAMIC_PREEMPT mechanism as long as the preemption count does not
> go to zero, i.e. the folded NEED_RESCHED bit stays set.
>
> From earlier experiments I know that the overhead of preempt_count
> is minimal and only really observable with micro benchmarks.
> Otherwise it ends up in the noise as long as the slow path is not
> taken.
>
> I did a quick check comparing a plain inc/dec pair vs. the
> DYNAMIC_PREEMPT inc/dec_and_test+NOOP mechanism and the delta is
> in the non-conclusive noise.
>
> 20 years ago this was a real issue because we did not have:
>
> - the folding of NEED_RESCHED into the preempt count
>
> - the cacheline optimizations which make the preempt count cache
> pretty much always cache hot
>
> - the hardware was way less capable
>
> I'm not saying that preempt_count is completely free today as it
> obviously adds more text and affects branch predictors, but as the
> major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> acceptable and tolerable tradeoff.
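
As a conceptual sketch of why the folded counter keeps the fast path to
a single decrement-and-test (the real x86 version is a percpu asm
sequence; this is only a model of the idea):

/*
 * The resched request is kept as an inverted high bit of the preempt
 * count, so the count only reaches zero when the nesting level is
 * zero AND a reschedule has been requested.
 */
#define PREEMPT_NEED_RESCHED    0x80000000      /* stored inverted */

static inline void fold_need_resched(int *count)
{
        *count &= ~PREEMPT_NEED_RESCHED;        /* inverted: cleared == requested */
}

static inline bool preempt_count_dec_and_test_model(int *count)
{
        return --*count == 0;
}
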
>
> 2) When the scheduler wants to set NEED_RESCHED, it sets
> NEED_RESCHED_LAZY instead which is only evaluated in the return to
> user space preemption points.
>
> As NEED_RESCHED_LAZY is not folded into the preemption count the
> preemption count won't become zero, so the task can continue until
> it hits return to user space.
>
> That preserves the existing behaviour.
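
Sketched against the existing exit-to-user work loop (the lazy flag is
this proposal; the rest is an approximation of kernel/entry/common.c):

static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                            unsigned long ti_work)
{
        while (ti_work & EXIT_TO_USER_MODE_WORK) {
                local_irq_enable_exit_to_user(ti_work);

                /* both the real and the lazy bit are honoured here ... */
                if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
                        schedule();

                /* ... signals, notify-resume etc. elided ... */

                local_irq_disable_exit_to_user();
                ti_work = read_thread_flags();
        }
        return ti_work;
}
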
>
> 3) When the scheduler tick observes that the time slice is exhausted,
> then it folds the NEED_RESCHED bit into the preempt count which
> causes the real preemption points to actually preempt including
> the return from interrupt to kernel path.

Right, and currently we check cond_resched() all the time in the
expectation that something might need a resched.

Folding it in, with the scheduler determining when the next preemption
happens, seems to make a lot of sense to me.

Thanks
Ankur
> That even allows the scheduler to enforce preemption for e.g. RT
> class tasks without changing anything else.
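
A sketch of that escalation from the tick (slice_exhausted() is a
hypothetical helper; set_preempt_need_resched() is the existing
folding primitive):

static void tick_escalate_resched(struct task_struct *curr)
{
        if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY) &&
            slice_exhausted(curr)) {
                /* upgrade the lazy request into a real, folded one */
                set_tsk_need_resched(curr);
                set_preempt_need_resched();
        }
}
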
>
> I'm pretty sure that this gets rid of cond_resched(), which is an
> impressive list of instances:
>
> ./drivers 392
> ./fs 318
> ./mm 189
> ./kernel 184
> ./arch 95
> ./net 83
> ./include 46
> ./lib 36
> ./crypto 16
> ./sound 16
> ./block 11
> ./io_uring 13
> ./security 11
> ./ipc 3
>
> That list clearly documents that the majority of these
> cond_resched() invocations is in code which neither should care
> nor should have any influence on the core scheduling decision
> machinery.
>
> I think it's worth a try as it just fits into the existing preemption
> scheme, solves the issue of long running kernel functions, prevents
> invalid preemption and can utilize the existing instrumentation and
> debug infrastructure.
>
> Most importantly it gives control back to the scheduler and does not
> make it depend on the mercy of cond_resched(), allow_resched() or
> whatever heuristics sprinkled all over the kernel.
> To me this makes a lot of sense, but I might be on the completely wrong
> track. So feel free to tell me that I'm completely nuts and/or just not
> seeing the obvious.
>
> Thanks,
>
> tglx
--
ankur