* Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
From: Hillf Danton @ 2023-02-01  4:53 UTC
To: Thomas Gleixner
Cc: Yu Liao, fweisbec, mingo, liwei391, adobriyan, mirsad.todorovac,
    linux-kernel, linux-mm, Peter Zijlstra

On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@linutronix.de>
>
> Seriously this procfs accuracy is the least of the problems and if this
> would be the only issue then we could trivially fix it by declaring that
> the procfs output might go backwards. It's an estimate after all. If
> there would be a real reason to ensure monotonicity there then we could
> easily do that in the readout code.
>
> But the real issue is that both get_cpu_idle_time_us() and
> get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> worse than the above procfs idle time going backwards.
>
> If update_ts_time_stats() is invoked concurrently for the same CPU then
> ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> numbers.
>
> This has been broken 12 years ago in commit 595aac488b54 ("sched:
> Introduce a function to update the idle statistics").

[...]

>
> P.S.: I hate the spinlock in the idle code path, but I don't have a
> better idea.

Provided the percpu rule is enforced, the random numbers mentioned above
could be erased without another spinlock added.

Hillf
+++ b/kernel/time/tick-sched.c
@@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
 /*
  * Updates the per-CPU time idle statistics counters
  */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
+				int io, u64 *last_update_time)
 {
 	ktime_t delta;
 
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
+
 	if (ts->idle_active) {
 		delta = ktime_sub(now, ts->idle_entrytime);
+
+		/* update is only expected on the local CPU */
+		if (cpu != smp_processor_id()) {
+			if (io)
+				delta = ktime_add(ts->iowait_sleeptime, delta);
+			else
+				delta = ktime_add(ts->idle_sleeptime, delta);
+			return ktime_to_us(delta);
+		}
+
 		if (nr_iowait_cpu(cpu) > 0)
 			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 		else
@@ -654,14 +667,12 @@ update_ts_time_stats(int cpu, struct tic
 		ts->idle_entrytime = now;
 	}
 
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
+	return 0;
 }
 
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
-	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+	update_ts_time_stats(smp_processor_id(), ts, now, 0, NULL);
 	ts->idle_active = 0;
 
 	sched_clock_idle_wakeup_event();
@@ -698,7 +709,9 @@ u64 get_cpu_idle_time_us(int cpu, u64 *l
 
 	now = ktime_get();
 	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
+		u64 ret = update_ts_time_stats(cpu, ts, now, 0, last_update_time);
+		if (ret)
+			return ret;
 		idle = ts->idle_sleeptime;
 	} else {
 		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
@@ -739,7 +752,9 @@ u64 get_cpu_iowait_time_us(int cpu, u64
 
 	now = ktime_get();
 	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
+		u64 ret = update_ts_time_stats(cpu, ts, now, 1, last_update_time);
+		if (ret)
+			return ret;
 		iowait = ts->iowait_sleeptime;
 	} else {
 		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {

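The "random numbers" concern quoted above can be illustrated with a small
userspace model. This is only a sketch, not kernel code: struct ts_model,
ts_update() and ts_update_racy() are hypothetical names, plain integers
stand in for ktime_t, and the racy function replays just one possible
interleaving of two unsynchronized callers of update_ts_time_stats() for
the same CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct tick_sched fields involved. */
struct ts_model {
	int64_t idle_entrytime;	/* timestamp of the last accounting point */
	int64_t idle_sleeptime;	/* accumulated idle time */
};

/* One complete, uninterrupted accounting step, as a single caller runs it. */
void ts_update(struct ts_model *ts, int64_t now)
{
	assert(now >= ts->idle_entrytime);
	ts->idle_sleeptime += now - ts->idle_entrytime;
	ts->idle_entrytime = now;
}

/*
 * One possible interleaving of two unsynchronized callers: both compute
 * their delta from the same idle_entrytime before either stores a new
 * one, so the overlapping interval is accounted twice.
 */
int64_t ts_update_racy(struct ts_model *ts, int64_t now_a, int64_t now_b)
{
	int64_t delta_a = now_a - ts->idle_entrytime;	/* caller A reads */
	int64_t delta_b = now_b - ts->idle_entrytime;	/* caller B reads the same value */

	ts->idle_sleeptime += delta_a;	/* caller A writes back */
	ts->idle_entrytime = now_a;
	ts->idle_sleeptime += delta_b;	/* caller B writes back on top */
	ts->idle_entrytime = now_b;
	return ts->idle_sleeptime;
}
```

With idle_entrytime = 100 and callers at 150 and 160, two serialized
ts_update() calls accumulate 60, while this interleaving accumulates 110
for the same 60 units of idle time.
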
* Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
From: Frederic Weisbecker @ 2023-02-01 12:02 UTC
To: Hillf Danton
Cc: Thomas Gleixner, Yu Liao, fweisbec, mingo, liwei391, adobriyan,
    mirsad.todorovac, linux-kernel, linux-mm, Peter Zijlstra

On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@linutronix.de>
> >
> > Seriously this procfs accuracy is the least of the problems and if this
> > would be the only issue then we could trivially fix it by declaring that
> > the procfs output might go backwards. It's an estimate after all. If
> > there would be a real reason to ensure monotonicity there then we could
> > easily do that in the readout code.
> >
> > But the real issue is that both get_cpu_idle_time_us() and
> > get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> > worse than the above procfs idle time going backwards.
> >
> > If update_ts_time_stats() is invoked concurrently for the same CPU then
> > ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> > numbers.
> >
> > This has been broken 12 years ago in commit 595aac488b54 ("sched:
> > Introduce a function to update the idle statistics").
>
> [...]
>
> >
> > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > better idea.
>
> Provided the percpu rule is enforced, the random numbers mentioned above
> could be erased without another spinlock added.
>
> Hillf
> +++ b/kernel/time/tick-sched.c
> @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
>  /*
>   * Updates the per-CPU time idle statistics counters
>   */
> -static void
> -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> +				int io, u64 *last_update_time)
>  {
>  	ktime_t delta;
>  
> +	if (last_update_time)
> +		*last_update_time = ktime_to_us(now);
> +
>  	if (ts->idle_active) {
>  		delta = ktime_sub(now, ts->idle_entrytime);
> +
> +		/* update is only expected on the local CPU */
> +		if (cpu != smp_processor_id()) {

Why not just update it only on idle exit then?

> +			if (io)

I fear it's not up to the caller to decide if the idle time is IO or not.

> +				delta = ktime_add(ts->iowait_sleeptime, delta);
> +			else
> +				delta = ktime_add(ts->idle_sleeptime, delta);
> +			return ktime_to_us(delta);
> +		}
> +
>  		if (nr_iowait_cpu(cpu) > 0)
>  			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
>  		else

But you kept the old update above.

So if this is not the local CPU, what do you do?

You'd need to return (without updating iowait_sleeptime):

      ts->idle_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)

Right?

But then you may race with the local updater, risking returning
the delta added twice. So you need at least a seqcount.

But in the end, nr_iowait_cpu() is broken because that counter can be
decremented remotely and so the whole thing is beyond repair:

CPU 0                       CPU 1                       CPU 2
-----                       -----                       ------
//io_schedule() TASK A
current->in_iowait = 1
rq(0)->nr_iowait++
//switch to idle
                            // READ /proc/stat
                            // See nr_iowait_cpu(0) == 1
                            return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)

                                                        //try_to_wake_up(TASK A)
                                                        rq(0)->nr_iowait--
//idle exit
// See nr_iowait_cpu(0) == 0
ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

Thanks.

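The seqcount idea mentioned above (a remote reader that never writes and
retries when it overlaps a local update) could look roughly like the
following userspace sketch. Everything here is an assumption made for
illustration: struct idle_stats and the idle_stats_*() helpers are
hypothetical, C11 atomics with default (sequentially consistent) ordering
stand in for the kernel's write_seqcount_begin()/write_seqcount_end() and
read_seqcount_begin()/read_seqcount_retry(), and iowait accounting is left
out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace model; field names mirror struct tick_sched. */
struct idle_stats {
	atomic_uint seq;		/* odd while an update is in flight */
	atomic_int idle_active;
	_Atomic int64_t idle_entrytime;
	_Atomic int64_t idle_sleeptime;
};

void idle_stats_init(struct idle_stats *s)
{
	atomic_init(&s->seq, 0);
	atomic_init(&s->idle_active, 0);
	atomic_init(&s->idle_entrytime, 0);
	atomic_init(&s->idle_sleeptime, 0);
}

/* Writer side: only the owning CPU calls these, so writers never race. */
void idle_stats_enter(struct idle_stats *s, int64_t now)
{
	atomic_fetch_add(&s->seq, 1);		/* begin: seq becomes odd */
	atomic_store(&s->idle_entrytime, now);
	atomic_store(&s->idle_active, 1);
	atomic_fetch_add(&s->seq, 1);		/* end: seq becomes even */
}

void idle_stats_exit(struct idle_stats *s, int64_t now)
{
	atomic_fetch_add(&s->seq, 1);
	atomic_store(&s->idle_sleeptime,
		     atomic_load(&s->idle_sleeptime) + now -
		     atomic_load(&s->idle_entrytime));
	atomic_store(&s->idle_active, 0);
	atomic_fetch_add(&s->seq, 1);
}

/* Reader side: any CPU; never writes, retries if an update overlapped. */
int64_t idle_stats_read(struct idle_stats *s, int64_t now)
{
	unsigned int start;
	int64_t total;

	do {
		start = atomic_load(&s->seq);
		total = atomic_load(&s->idle_sleeptime);
		if (atomic_load(&s->idle_active))
			total += now - atomic_load(&s->idle_entrytime);
	} while ((start & 1) || atomic_load(&s->seq) != start);

	return total;
}
```

The retry loop is what keeps a reader from combining an already updated
idle_sleeptime with a stale idle_active and idle_entrytime, which is the
"delta added twice" case. It does nothing about the nr_iowait_cpu()
problem shown in the diagram.
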
* Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
From: Hillf Danton @ 2023-02-01 14:01 UTC
To: Frederic Weisbecker
Cc: Thomas Gleixner, Yu Liao, fweisbec, mingo, liwei391, adobriyan,
    mirsad.todorovac, linux-kernel, linux-mm, Peter Zijlstra

On Wed, 1 Feb 2023 13:02:41 +0100 Frederic Weisbecker <frederic@kernel.org>
> On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> > On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@linutronix.de>
> > >
> > > Seriously this procfs accuracy is the least of the problems and if this
> > > would be the only issue then we could trivially fix it by declaring that
> > > the procfs output might go backwards. It's an estimate after all. If
> > > there would be a real reason to ensure monotonicity there then we could
> > > easily do that in the readout code.
> > >
> > > But the real issue is that both get_cpu_idle_time_us() and
> > > get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> > > worse than the above procfs idle time going backwards.
> > >
> > > If update_ts_time_stats() is invoked concurrently for the same CPU then
> > > ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> > > numbers.
> > >
> > > This has been broken 12 years ago in commit 595aac488b54 ("sched:
> > > Introduce a function to update the idle statistics").
> >
> > [...]
> >
> > >
> > > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > > better idea.
> >
> > Provided the percpu rule is enforced, the random numbers mentioned above
> > could be erased without another spinlock added.
> >
> > Hillf
> > +++ b/kernel/time/tick-sched.c
> > @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> >  /*
> >   * Updates the per-CPU time idle statistics counters
> >   */
> > -static void
> > -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> > +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> > +				int io, u64 *last_update_time)
> >  {
> >  	ktime_t delta;
> >  
> > +	if (last_update_time)
> > +		*last_update_time = ktime_to_us(now);
> > +
> >  	if (ts->idle_active) {
> >  		delta = ktime_sub(now, ts->idle_entrytime);
> > +
> > +		/* update is only expected on the local CPU */
> > +		if (cpu != smp_processor_id()) {
>
> Why not just update it only on idle exit then?

This aligns with idle exit as much as it can by disallowing remote update.

>
> > +			if (io)
>
> I fear it's not up to the caller to decide if the idle time is IO or not.

Could you elaborate a bit on your concern, given the callers of this function?

>
> > +				delta = ktime_add(ts->iowait_sleeptime, delta);
> > +			else
> > +				delta = ktime_add(ts->idle_sleeptime, delta);
> > +			return ktime_to_us(delta);

Based on the above comments, I guess you missed this line which prevents
get_cpu_idle_time_us() and get_cpu_iowait_time_us() from updating ts.

> > +		}
> > +
> >  		if (nr_iowait_cpu(cpu) > 0)
> >  			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> >  		else
>
> But you kept the old update above.
>
> So if this is not the local CPU, what do you do?
>
> You'd need to return (without updating iowait_sleeptime):
>
>       ts->idle_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
>
> Right?

Yes, the diff goes as you suggest.

> But then you may race with the local updater, risking returning
> the delta added twice. So you need at least a seqcount.

Add seqcount if needed. No problem.

>
> But in the end, nr_iowait_cpu() is broken because that counter can be
> decremented remotely and so the whole thing is beyond repair:
>
> CPU 0                       CPU 1                       CPU 2
> -----                       -----                       ------
> //io_schedule() TASK A
> current->in_iowait = 1
> rq(0)->nr_iowait++
> //switch to idle
>                             // READ /proc/stat
>                             // See nr_iowait_cpu(0) == 1
>                             return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
>
>                                                         //try_to_wake_up(TASK A)
>                                                         rq(0)->nr_iowait--
> //idle exit
> // See nr_iowait_cpu(0) == 0
> ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

Ah, I see your point.

The diff disallows remotely updating ts, and it is updated in idle exit
after my proposal, so what nr_iowait_cpu() breaks is mitigated.

Thanks for taking a look, particularly the race linked to nr_iowait_cpu().

Hillf

* Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
From: Frederic Weisbecker @ 2023-02-01 14:28 UTC
To: Hillf Danton
Cc: Thomas Gleixner, Yu Liao, fweisbec, mingo, liwei391, adobriyan,
    mirsad.todorovac, linux-kernel, linux-mm, Peter Zijlstra

On Wed, Feb 01, 2023 at 10:01:17PM +0800, Hillf Danton wrote:
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> > >  /*
> > >   * Updates the per-CPU time idle statistics counters
> > >   */
> > > -static void
> > > -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> > > +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> > > +				int io, u64 *last_update_time)
> > >  {
> > >  	ktime_t delta;
> > >  
> > > +	if (last_update_time)
> > > +		*last_update_time = ktime_to_us(now);
> > > +
> > >  	if (ts->idle_active) {
> > >  		delta = ktime_sub(now, ts->idle_entrytime);
> > > +
> > > +		/* update is only expected on the local CPU */
> > > +		if (cpu != smp_processor_id()) {
> >
> > Why not just update it only on idle exit then?
>
> This aligns with idle exit as much as it can by disallowing remote update.

I mean why bother updating if idle does it for us already?

One possibility is that we get some more precise values if we read during
long idle periods with nr_iowait_cpu() changes in the middle.

> >
> > > +			if (io)
> >
> > I fear it's not up to the caller to decide if the idle time is IO or not.
>
> Could you elaborate a bit on your concern, given the callers of this function?

You are randomly stating whether the elapsing idle time is IO or not
depending on the caller, without verifying nr_iowait_cpu(). Or am I
missing something?

> >
> > > +				delta = ktime_add(ts->iowait_sleeptime, delta);
> > > +			else
> > > +				delta = ktime_add(ts->idle_sleeptime, delta);
> > > +			return ktime_to_us(delta);
>
> Based on the above comments, I guess you missed this line which prevents
> get_cpu_idle_time_us() and get_cpu_iowait_time_us() from updating ts.

Right...

> > But then you may race with the local updater, risking returning
> > the delta added twice. So you need at least a seqcount.
>
> Add seqcount if needed. No problem.
> >
> > But in the end, nr_iowait_cpu() is broken because that counter can be
> > decremented remotely and so the whole thing is beyond repair:
> >
> > CPU 0                       CPU 1                       CPU 2
> > -----                       -----                       ------
> > //io_schedule() TASK A
> > current->in_iowait = 1
> > rq(0)->nr_iowait++
> > //switch to idle
> >                             // READ /proc/stat
> >                             // See nr_iowait_cpu(0) == 1
> >                             return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
> >
> >                                                         //try_to_wake_up(TASK A)
> >                                                         rq(0)->nr_iowait--
> > //idle exit
> > // See nr_iowait_cpu(0) == 0
> > ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)
>
> Ah, I see your point.
>
> The diff disallows remotely updating ts, and it is updated in idle exit
> after my proposal, so what nr_iowait_cpu() breaks is mitigated.

Only halfway mitigated. This doesn't prevent backward or forward jumps
when non-updating readers are involved at all.

Thanks.

>
> Thanks for taking a look, particularly the race linked to nr_iowait_cpu().
>
> Hillf

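The backward jump that a non-updating reader can still observe follows
directly from the three-CPU diagram, and can be replayed in a
single-threaded model. This is a sketch under stated assumptions, not
kernel code: struct cpu_model and the model_*() functions are
hypothetical, timestamps are plain integers, and the three "CPUs" are
just sequential steps in the order the diagram gives.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of CPU 0's state; names mirror the kernel fields. */
struct cpu_model {
	int nr_iowait;
	int idle_active;
	int64_t idle_entrytime;
	int64_t idle_sleeptime;
	int64_t iowait_sleeptime;
};

/* Non-updating remote reader, as in the CPU 1 column. */
int64_t model_read_iowait(const struct cpu_model *c, int64_t now)
{
	if (c->idle_active && c->nr_iowait > 0)
		return c->iowait_sleeptime + (now - c->idle_entrytime);
	return c->iowait_sleeptime;
}

/* Local idle exit, as in the last CPU 0 step. */
void model_idle_exit(struct cpu_model *c, int64_t now)
{
	int64_t delta = now - c->idle_entrytime;

	if (c->nr_iowait > 0)
		c->iowait_sleeptime += delta;
	else
		c->idle_sleeptime += delta;
	c->idle_active = 0;
}

/* Replays the interleaving; returns how far the iowait readout went back. */
int64_t model_replay_race(void)
{
	struct cpu_model cpu0 = {
		.nr_iowait = 1, .idle_active = 1, .idle_entrytime = 100,
	};
	int64_t first, second;

	first = model_read_iowait(&cpu0, 150);	/* CPU 1: sees nr_iowait == 1 */
	cpu0.nr_iowait--;			/* CPU 2: wakes task A remotely */
	model_idle_exit(&cpu0, 160);		/* CPU 0: sees nr_iowait == 0 */
	second = model_read_iowait(&cpu0, 170);	/* CPU 1: reads again */

	return first - second;
}
```

In this replay the first read reports 50 units of iowait time, the idle
exit then credits the whole 60-unit interval to idle_sleeptime, and the
second read reports 0, so the value visible through the reader moves
backwards by 50 even though no remote CPU ever wrote to the statistics.
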
Thread overview: 4+ messages
     [not found] <20230128020051.2328465-1-liaoyu15@huawei.com>
     [not found] ` <87357q228f.ffs@tglx>
2023-02-01  4:53   ` [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us() Hillf Danton
2023-02-01 12:02     ` Frederic Weisbecker
2023-02-01 14:01       ` Hillf Danton
2023-02-01 14:28         ` Frederic Weisbecker