* Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
[not found] ` <87357q228f.ffs@tglx>
@ 2023-02-01 4:53 ` Hillf Danton
2023-02-01 12:02 ` Frederic Weisbecker
0 siblings, 1 reply; 4+ messages in thread
From: Hillf Danton @ 2023-02-01 4:53 UTC
To: Thomas Gleixner
Cc: Yu Liao, fweisbec, mingo, liwei391, adobriyan, mirsad.todorovac,
linux-kernel, linux-mm, Peter Zijlstra
On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@linutronix.de>
>
> Seriously this procfs accuracy is the least of the problems and if this
> would be the only issue then we could trivially fix it by declaring that
> the procfs output might go backwards. It's an estimate after all. If
> there would be a real reason to ensure monotonicity there then we could
> easily do that in the readout code.
>
> But the real issue is that both get_cpu_idle_time_us() and
> get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> worse than the above procfs idle time going backwards.
>
> If update_ts_time_stats() is invoked concurrently for the same CPU then
> ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> numbers.
>
> This has been broken 12 years ago in commit 595aac488b54 ("sched:
> Introduce a function to update the idle statistics").
[...]
>
> P.S.: I hate the spinlock in the idle code path, but I don't have a
> better idea.

Provided the percpu rule is enforced, the random numbers mentioned above
could be eliminated without adding another spinlock.

Hillf

+++ b/kernel/time/tick-sched.c
@@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
 /*
  * Updates the per-CPU time idle statistics counters
  */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
+				int io, u64 *last_update_time)
 {
 	ktime_t delta;
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
+
 	if (ts->idle_active) {
 		delta = ktime_sub(now, ts->idle_entrytime);
+
+		/* update is only expected on the local CPU */
+		if (cpu != smp_processor_id()) {
+			if (io)
+				delta = ktime_add(ts->iowait_sleeptime, delta);
+			else
+				delta = ktime_add(ts->idle_sleeptime, delta);
+			return ktime_to_us(delta);
+		}
+
 		if (nr_iowait_cpu(cpu) > 0)
 			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 		else
@@ -654,14 +667,12 @@ update_ts_time_stats(int cpu, struct tic
 		ts->idle_entrytime = now;
 	}
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
+	return 0;
 }
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
-	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+	update_ts_time_stats(smp_processor_id(), ts, now, 0, NULL);
 	ts->idle_active = 0;
 	sched_clock_idle_wakeup_event();
@@ -698,7 +709,9 @@ u64 get_cpu_idle_time_us(int cpu, u64 *l
 	now = ktime_get();
 	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
+		u64 ret = update_ts_time_stats(cpu, ts, now, 0, last_update_time);
+		if (ret)
+			return ret;
 		idle = ts->idle_sleeptime;
 	} else {
 		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
@@ -739,7 +752,9 @@ u64 get_cpu_iowait_time_us(int cpu, u64
 	now = ktime_get();
 	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
+		u64 ret = update_ts_time_stats(cpu, ts, now, 1, last_update_time);
+		if (ret)
+			return ret;
 		iowait = ts->iowait_sleeptime;
 	} else {
 		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
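
Editorial sketch, not part of the posted patch: the quoted remark that
monotonicity could be enforced "in the readout code" can be illustrated in
userspace C11 roughly as below. The helper name monotonic_report() and the
last_reported variable are hypothetical and do not exist in the kernel; the
sketch only shows the clamping idea and says nothing about how procfs would
actually store the last value.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical readout-side clamp: never report a value smaller than the
 * largest value reported so far. last_reported is an assumed per-counter
 * variable shared by all readers.
 */
static uint64_t monotonic_report(_Atomic uint64_t *last_reported,
				 uint64_t sample)
{
	uint64_t prev = atomic_load(last_reported);

	/* Try to publish the new sample while it is still the maximum. */
	while (sample > prev) {
		/* On failure, prev is reloaded with the current value. */
		if (atomic_compare_exchange_weak(last_reported, &prev, sample))
			return sample;
	}

	/* A larger (or equal) value was already reported: repeat it. */
	return prev;
}
```

This only hides backward jumps from readers; it does nothing about the
concurrent updates to ts that the rest of the thread is concerned with.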
* Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
2023-02-01 4:53 ` [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us() Hillf Danton
@ 2023-02-01 12:02 ` Frederic Weisbecker
2023-02-01 14:01 ` Hillf Danton
0 siblings, 1 reply; 4+ messages in thread
From: Frederic Weisbecker @ 2023-02-01 12:02 UTC
To: Hillf Danton
Cc: Thomas Gleixner, Yu Liao, fweisbec, mingo, liwei391, adobriyan,
mirsad.todorovac, linux-kernel, linux-mm, Peter Zijlstra
On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@linutronix.de>
> >
> > Seriously this procfs accuracy is the least of the problems and if this
> > would be the only issue then we could trivially fix it by declaring that
> > the procfs output might go backwards. It's an estimate after all. If
> > there would be a real reason to ensure monotonicity there then we could
> > easily do that in the readout code.
> >
> > But the real issue is that both get_cpu_idle_time_us() and
> > get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> > worse than the above procfs idle time going backwards.
> >
> > If update_ts_time_stats() is invoked concurrently for the same CPU then
> > ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> > numbers.
> >
> > This has been broken 12 years ago in commit 595aac488b54 ("sched:
> > Introduce a function to update the idle statistics").
>
> [...]
>
> >
> > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > better idea.
>
> Provided the percpu rule is enforced, the random numbers mentioned above
> could be erased without another spinlock added.
>
> Hillf
> +++ b/kernel/time/tick-sched.c
> @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> /*
> * Updates the per-CPU time idle statistics counters
> */
> -static void
> -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> + int io, u64 *last_update_time)
> {
> ktime_t delta;
>
> + if (last_update_time)
> + *last_update_time = ktime_to_us(now);
> +
> if (ts->idle_active) {
> delta = ktime_sub(now, ts->idle_entrytime);
> +
> + /* update is only expected on the local CPU */
> + if (cpu != smp_processor_id()) {

Why not just update it only on idle exit then?

> + if (io)

I fear it's not up to the caller to decide whether the idle time is IO or not.

> + delta = ktime_add(ts->iowait_sleeptime, delta);
> + else
> + delta = ktime_add(ts->idle_sleeptime, delta);
> + return ktime_to_us(delta);
> + }
> +
> if (nr_iowait_cpu(cpu) > 0)
> ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> else

But you kept the old update above.

So if this is not the local CPU, what do you do?

You'd need to return (without updating iowait_sleeptime):

    ts->idle_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)

Right? But then you may race with the local updater and risk returning
the delta added twice. So you need at least a seqcount.

But in the end, nr_iowait_cpu() is broken because that counter can be
decremented remotely, so the whole thing is beyond repair:

CPU 0                   CPU 1                   CPU 2
-----                   -----                   ------
//io_schedule() TASK A
current->in_iowait = 1
rq(0)->nr_iowait++
//switch to idle
                        // READ /proc/stat
                        // See nr_iowait_cpu(0) == 1
                        return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)

                                                //try_to_wake_up(TASK A)
                                                rq(0)->nr_iowait--
//idle exit
// See nr_iowait_cpu(0) == 0
ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

Thanks.
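
Editorial sketch, not code from the thread: the "you need at least a
seqcount" remark can be illustrated with a minimal userspace C11 model. The
struct and function names (idle_stats, idle_stats_fold, idle_stats_read) are
hypothetical stand-ins for the tick_sched fields discussed above, and plain
sequentially consistent atomics are used instead of the kernel's seqcount
API. It assumes a single writer (the owning CPU) and any number of remote
readers that never modify the counters.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU idle accounting fields. */
struct idle_stats {
	atomic_uint seq;			/* odd while an update is in flight */
	_Atomic uint64_t idle_sleeptime;	/* accumulated idle time */
	_Atomic uint64_t idle_entrytime;	/* start of current idle period */
	atomic_int idle_active;
};

/* Writer side: assumed to run only on the owning CPU. */
static void idle_stats_fold(struct idle_stats *s, uint64_t now)
{
	atomic_fetch_add(&s->seq, 1);		/* sequence becomes odd */
	if (atomic_load(&s->idle_active)) {
		uint64_t delta = now - atomic_load(&s->idle_entrytime);

		atomic_store(&s->idle_sleeptime,
			     atomic_load(&s->idle_sleeptime) + delta);
		atomic_store(&s->idle_entrytime, now);
	}
	atomic_fetch_add(&s->seq, 1);		/* sequence becomes even */
}

/* Reader side: computes the total without writing to the counters. */
static uint64_t idle_stats_read(struct idle_stats *s, uint64_t now)
{
	unsigned int start;
	uint64_t total;

	do {
		/* Wait until no update is in flight. */
		while ((start = atomic_load(&s->seq)) & 1)
			;
		total = atomic_load(&s->idle_sleeptime);
		if (atomic_load(&s->idle_active)) {
			uint64_t entry = atomic_load(&s->idle_entrytime);

			/* Guard against a 'now' sampled before the fold. */
			if (now > entry)
				total += now - entry;
		}
		/* Retry if the writer ran while we were reading. */
	} while (atomic_load(&s->seq) != start);

	return total;
}
```

The retry loop is what keeps a reader from pairing a freshly folded
idle_sleeptime with a stale idle_entrytime, which is the "delta added twice"
case. It does not address the nr_iowait_cpu() classification problem shown
in the diagram.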
* Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
2023-02-01 12:02 ` Frederic Weisbecker
@ 2023-02-01 14:01 ` Hillf Danton
2023-02-01 14:28 ` Frederic Weisbecker
0 siblings, 1 reply; 4+ messages in thread
From: Hillf Danton @ 2023-02-01 14:01 UTC
To: Frederic Weisbecker
Cc: Thomas Gleixner, Yu Liao, fweisbec, mingo, liwei391, adobriyan,
mirsad.todorovac, linux-kernel, linux-mm, Peter Zijlstra
On Wed, 1 Feb 2023 13:02:41 +0100 Frederic Weisbecker <frederic@kernel.org>
> On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> > On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@linutronix.de>
> > >
> > > Seriously this procfs accuracy is the least of the problems and if this
> > > would be the only issue then we could trivially fix it by declaring that
> > > the procfs output might go backwards. It's an estimate after all. If
> > > there would be a real reason to ensure monotonicity there then we could
> > > easily do that in the readout code.
> > >
> > > But the real issue is that both get_cpu_idle_time_us() and
> > > get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> > > worse than the above procfs idle time going backwards.
> > >
> > > If update_ts_time_stats() is invoked concurrently for the same CPU then
> > > ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> > > numbers.
> > >
> > > This has been broken 12 years ago in commit 595aac488b54 ("sched:
> > > Introduce a function to update the idle statistics").
> >
> > [...]
> >
> > >
> > > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > > better idea.
> >
> > Provided the percpu rule is enforced, the random numbers mentioned above
> > could be erased without another spinlock added.
> >
> > Hillf
> > +++ b/kernel/time/tick-sched.c
> > @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> > /*
> > * Updates the per-CPU time idle statistics counters
> > */
> > -static void
> > -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> > +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> > + int io, u64 *last_update_time)
> > {
> > ktime_t delta;
> >
> > + if (last_update_time)
> > + *last_update_time = ktime_to_us(now);
> > +
> > if (ts->idle_active) {
> > delta = ktime_sub(now, ts->idle_entrytime);
> > +
> > + /* update is only expected on the local CPU */
> > + if (cpu != smp_processor_id()) {
>
> Why not just update it only on idle exit then?

This aligns with idle exit as much as it can by disallowing remote updates.

>
> > + if (io)
>
> I fear it's not up to the caller to decide whether the idle time is IO or not.

Could you elaborate a bit on your concern, given the callers of this function?

>
> > + delta = ktime_add(ts->iowait_sleeptime, delta);
> > + else
> > + delta = ktime_add(ts->idle_sleeptime, delta);
> > + return ktime_to_us(delta);

Based on the above comments, I guess you missed this line, which prevents
get_cpu_idle_time_us() and get_cpu_iowait_time_us() from updating ts.

> > + }
> > +
> > if (nr_iowait_cpu(cpu) > 0)
> > ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> > else
>
> But you kept the old update above.
>
> So if this is not the local CPU, what do you do?
>
> You'd need to return (without updating iowait_sleeptime):
>
> ts->idle_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
>
> Right?

Yes, the diff goes as you suggest.

> But then you may race with the local updater and risk returning
> the delta added twice. So you need at least a seqcount.

Add a seqcount if needed. No problem.

>
> But in the end, nr_iowait_cpu() is broken because that counter can be
> decremented remotely and so the whole thing is beyond repair:
>
> CPU 0                   CPU 1                   CPU 2
> -----                   -----                   ------
> //io_schedule() TASK A
> current->in_iowait = 1
> rq(0)->nr_iowait++
> //switch to idle
>                         // READ /proc/stat
>                         // See nr_iowait_cpu(0) == 1
>                         return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
>
>                                                 //try_to_wake_up(TASK A)
>                                                 rq(0)->nr_iowait--
> //idle exit
> // See nr_iowait_cpu(0) == 0
> ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

Ah, I see your point.

The diff disallows updating ts remotely, and with my proposal ts is updated
at idle exit, so what nr_iowait_cpu() breaks is mitigated.

Thanks for taking a look, particularly the race linked to nr_iowait_cpu().
Hillf
* Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
2023-02-01 14:01 ` Hillf Danton
@ 2023-02-01 14:28 ` Frederic Weisbecker
0 siblings, 0 replies; 4+ messages in thread
From: Frederic Weisbecker @ 2023-02-01 14:28 UTC
To: Hillf Danton
Cc: Thomas Gleixner, Yu Liao, fweisbec, mingo, liwei391, adobriyan,
mirsad.todorovac, linux-kernel, linux-mm, Peter Zijlstra
On Wed, Feb 01, 2023 at 10:01:17PM +0800, Hillf Danton wrote:
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> > > /*
> > > * Updates the per-CPU time idle statistics counters
> > > */
> > > -static void
> > > -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> > > +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> > > + int io, u64 *last_update_time)
> > > {
> > > ktime_t delta;
> > >
> > > + if (last_update_time)
> > > + *last_update_time = ktime_to_us(now);
> > > +
> > > if (ts->idle_active) {
> > > delta = ktime_sub(now, ts->idle_entrytime);
> > > +
> > > + /* update is only expected on the local CPU */
> > > + if (cpu != smp_processor_id()) {
> >
> > Why not just update it only on idle exit then?
>
> This aligns with idle exit as much as it can by disallowing remote updates.

I mean, why bother updating if idle exit already does it for us?

One possibility is that we get more precise values if we read during
long idle periods in which nr_iowait_cpu() changes in the middle.

> >
> > > + if (io)
> >
> > I fear it's not up to the caller to decide whether the idle time is IO or not.
>
> Could you elaborate a bit on your concern, given the callers of this function?

You are arbitrarily declaring the elapsing idle time to be IO or not depending
on the caller, without checking nr_iowait_cpu(). Or am I missing something?

> >
> > > + delta = ktime_add(ts->iowait_sleeptime, delta);
> > > + else
> > > + delta = ktime_add(ts->idle_sleeptime, delta);
> > > + return ktime_to_us(delta);
>
> Based on the above comments, I guess you missed this line, which prevents
> get_cpu_idle_time_us() and get_cpu_iowait_time_us() from updating ts.

Right...

> > But then you may race with the local updater and risk returning
> > the delta added twice. So you need at least a seqcount.
>
> Add a seqcount if needed. No problem.
> >
> > But in the end, nr_iowait_cpu() is broken because that counter can be
> > decremented remotely and so the whole thing is beyond repair:
> >
> > CPU 0                   CPU 1                   CPU 2
> > -----                   -----                   ------
> > //io_schedule() TASK A
> > current->in_iowait = 1
> > rq(0)->nr_iowait++
> > //switch to idle
> >                         // READ /proc/stat
> >                         // See nr_iowait_cpu(0) == 1
> >                         return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
> >
> >                                                 //try_to_wake_up(TASK A)
> >                                                 rq(0)->nr_iowait--
> > //idle exit
> > // See nr_iowait_cpu(0) == 0
> > ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)
>
> Ah, I see your point.
>
> The diff disallows updating ts remotely, and with my proposal ts is updated
> at idle exit, so what nr_iowait_cpu() breaks is mitigated.

Only halfway mitigated. This doesn't prevent backward or forward jumps
at all when non-updating readers are involved.

Thanks.

>
> Thanks for taking a look, particularly the race linked to nr_iowait_cpu().
>
> Hillf
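
Editorial sketch, not code from the thread: the interleaving in the diagram
can be replayed deterministically in userspace C to show the kind of
backward jump a non-updating reader can observe. Everything here is a
hypothetical model (struct cpu_model, read_iowait(), idle_exit(), and the
timestamps 100/150/200/250 are made up for illustration); it mirrors the
logic quoted above, not the real kernel data structures.

```c
#include <stdint.h>

/* Hypothetical model of the CPU 0 state involved in the race. */
struct cpu_model {
	uint64_t idle_sleeptime;
	uint64_t iowait_sleeptime;
	uint64_t idle_entrytime;
	int idle_active;
	int nr_iowait;		/* may be decremented by another CPU */
};

/* Non-updating remote reader of the iowait time (the CPU 1 step). */
static uint64_t read_iowait(const struct cpu_model *c, uint64_t now)
{
	uint64_t t = c->iowait_sleeptime;

	if (c->idle_active && c->nr_iowait > 0)
		t += now - c->idle_entrytime;
	return t;
}

/* Local idle exit: the whole period is classified by nr_iowait *now*. */
static void idle_exit(struct cpu_model *c, uint64_t now)
{
	uint64_t delta = now - c->idle_entrytime;

	if (c->nr_iowait > 0)
		c->iowait_sleeptime += delta;
	else
		c->idle_sleeptime += delta;
	c->idle_active = 0;
}

/* Replays the diagram; returns 1 if the iowait reading went backwards. */
static int iowait_goes_backwards(void)
{
	/* CPU 0: task A blocked in io_schedule(), CPU idle since t=100. */
	struct cpu_model cpu0 = {
		.idle_entrytime = 100, .idle_active = 1, .nr_iowait = 1,
	};
	uint64_t first, second;

	first = read_iowait(&cpu0, 150);	/* CPU 1 reads /proc/stat */
	cpu0.nr_iowait--;			/* CPU 2 wakes task A */
	idle_exit(&cpu0, 200);			/* CPU 0 leaves idle */
	second = read_iowait(&cpu0, 250);	/* a later read */

	return second < first;
}
```

In this model the first read attributes the elapsed time to iowait, while
idle exit later books the same period as plain idle, so the second iowait
reading is smaller than the first even though nothing updated ts remotely.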
Thread overview: 4+ messages
[not found] <20230128020051.2328465-1-liaoyu15@huawei.com>
[not found] ` <87357q228f.ffs@tglx>
2023-02-01 4:53 ` [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us() Hillf Danton
2023-02-01 12:02 ` Frederic Weisbecker
2023-02-01 14:01 ` Hillf Danton
2023-02-01 14:28 ` Frederic Weisbecker