linux-mm.kvack.org archive mirror
* Re: PSI idle-shutoff
       [not found] <20220913140817.GA9091@hu-pkondeti-hyd.qualcomm.com>
@ 2022-10-10 10:57 ` Hillf Danton
  2022-10-10 21:16   ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Hillf Danton @ 2022-10-10 10:57 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: Johannes Weiner, Suren Baghdasaryan, linux-mm, linux-kernel,
	quic_charante

On 13 Sep 2022 19:38:17 +0530 Pavan Kondeti <quic_pkondeti@quicinc.com>
> Hi
> 
> Since psi_avgs_work()->collect_percpu_times()->get_recent_times() runs
> from a kworker thread, the PSI_NONIDLE condition would be observed, as
> there is a RUNNING task. So we would always end up re-arming the work.
> 
> If the work is re-armed from psi_avgs_work() itself, the backing-off
> logic in psi_task_change() (to be moved to psi_task_switch() soon) can't
> help: the work is already scheduled, so we don't do anything there.
> 
> Probably I am missing something here. Can you please clarify how we
> shut off re-arming of the psi avgs work?

Instead of open coding schedule_delayed_work() in a bid to check whether
the timer hits the idle task (see delayed_work_timer_fn()), the idle task
is tracked in psi_task_switch() and checked by the kworker to see if it
preempted the idle task.

Only for thoughts now.

Hillf

+++ b/kernel/sched/psi.c
@@ -412,6 +412,8 @@ static u64 update_averages(struct psi_gr
 	return avg_next_update;
 }
 
+static DEFINE_PER_CPU(int, prev_task_is_idle);
+
 static void psi_avgs_work(struct work_struct *work)
 {
 	struct delayed_work *dwork;
@@ -439,7 +441,7 @@ static void psi_avgs_work(struct work_st
 	if (now >= group->avg_next_update)
 		group->avg_next_update = update_averages(group, now);
 
-	if (nonidle) {
+	if (nonidle && 0 == per_cpu(prev_task_is_idle, raw_smp_processor_id())) {
 		schedule_delayed_work(dwork, nsecs_to_jiffies(
 				group->avg_next_update - now) + 1);
 	}
@@ -859,6 +861,7 @@ void psi_task_switch(struct task_struct
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
 
+		per_cpu(prev_task_is_idle, cpu) = 0;
 		/*
 		 * When we're going to sleep, psi_dequeue() lets us
 		 * handle TSK_RUNNING, TSK_MEMSTALL_RUNNING and
@@ -888,7 +891,8 @@ void psi_task_switch(struct task_struct
 			for (; group; group = iterate_groups(prev, &iter))
 				psi_group_change(group, cpu, clear, set, now, true);
 		}
-	}
+	} else
+		per_cpu(prev_task_is_idle, cpu) = 1;
 }
 
 /**


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: PSI idle-shutoff
  2022-10-10 10:57 ` PSI idle-shutoff Hillf Danton
@ 2022-10-10 21:16   ` Suren Baghdasaryan
  2022-10-11 11:38     ` Hillf Danton
  0 siblings, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2022-10-10 21:16 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Pavan Kondeti, Johannes Weiner, linux-mm, linux-kernel, quic_charante

On Mon, Oct 10, 2022 at 3:57 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On 13 Sep 2022 19:38:17 +0530 Pavan Kondeti <quic_pkondeti@quicinc.com>
> > Hi
> >
> > Since psi_avgs_work()->collect_percpu_times()->get_recent_times() runs
> > from a kworker thread, the PSI_NONIDLE condition would be observed, as
> > there is a RUNNING task. So we would always end up re-arming the work.
> >
> > If the work is re-armed from psi_avgs_work() itself, the backing-off
> > logic in psi_task_change() (to be moved to psi_task_switch() soon) can't
> > help: the work is already scheduled, so we don't do anything there.
> >
> > Probably I am missing something here. Can you please clarify how we
> > shut off re-arming of the psi avgs work?
>
> Instead of open coding schedule_delayed_work() in a bid to check whether
> the timer hits the idle task (see delayed_work_timer_fn()), the idle task
> is tracked in psi_task_switch() and checked by the kworker to see if it
> preempted the idle task.
>
> Only for thoughts now.
>
> Hillf
>
> +++ b/kernel/sched/psi.c
> @@ -412,6 +412,8 @@ static u64 update_averages(struct psi_gr
>         return avg_next_update;
>  }
>
> +static DEFINE_PER_CPU(int, prev_task_is_idle);
> +
>  static void psi_avgs_work(struct work_struct *work)
>  {
>         struct delayed_work *dwork;
> @@ -439,7 +441,7 @@ static void psi_avgs_work(struct work_st
>         if (now >= group->avg_next_update)
>                 group->avg_next_update = update_averages(group, now);
>
> -       if (nonidle) {
> +       if (nonidle && 0 == per_cpu(prev_task_is_idle, raw_smp_processor_id())) {

This condition would be incorrect if nonidle was set by a cpu other
than raw_smp_processor_id() while
prev_task_is_idle[raw_smp_processor_id()] == 1. IOW, if some activity
happens on a non-current cpu, we would fail to reschedule
psi_avgs_work for it. This can be fixed in collect_percpu_times() by
considering prev_task_is_idle for all other CPUs as well. However,
Chengming's approach seems simpler to me TBH and does not require an
additional per-cpu variable.

>                 schedule_delayed_work(dwork, nsecs_to_jiffies(
>                                 group->avg_next_update - now) + 1);
>         }
> @@ -859,6 +861,7 @@ void psi_task_switch(struct task_struct
>         if (prev->pid) {
>                 int clear = TSK_ONCPU, set = 0;
>
> +               per_cpu(prev_task_is_idle, cpu) = 0;
>                 /*
>                  * When we're going to sleep, psi_dequeue() lets us
>                  * handle TSK_RUNNING, TSK_MEMSTALL_RUNNING and
> @@ -888,7 +891,8 @@ void psi_task_switch(struct task_struct
>                         for (; group; group = iterate_groups(prev, &iter))
>                                 psi_group_change(group, cpu, clear, set, now, true);
>                 }
> -       }
> +       } else
> +               per_cpu(prev_task_is_idle, cpu) = 1;
>  }
>
>  /**
>



* Re: PSI idle-shutoff
  2022-10-10 21:16   ` Suren Baghdasaryan
@ 2022-10-11 11:38     ` Hillf Danton
  2022-10-11 17:11       ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Hillf Danton @ 2022-10-11 11:38 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Pavan Kondeti, Johannes Weiner, linux-mm, linux-kernel, quic_charante

On 10 Oct 2022 14:16:26 -0700 Suren Baghdasaryan <surenb@google.com>
> On Mon, Oct 10, 2022 at 3:57 AM Hillf Danton <hdanton@sina.com> wrote:
> > On 13 Sep 2022 19:38:17 +0530 Pavan Kondeti <quic_pkondeti@quicinc.com>
> > > Hi
> > >
> > > Since psi_avgs_work()->collect_percpu_times()->get_recent_times() runs
> > > from a kworker thread, the PSI_NONIDLE condition would be observed, as
> > > there is a RUNNING task. So we would always end up re-arming the work.
> > >
> > > If the work is re-armed from psi_avgs_work() itself, the backing-off
> > > logic in psi_task_change() (to be moved to psi_task_switch() soon) can't
> > > help: the work is already scheduled, so we don't do anything there.
> > >
> > > Probably I am missing something here. Can you please clarify how we
> > > shut off re-arming of the psi avgs work?
> >
> > Instead of open coding schedule_delayed_work() in a bid to check whether
> > the timer hits the idle task (see delayed_work_timer_fn()), the idle task
> > is tracked in psi_task_switch() and checked by the kworker to see if it
> > preempted the idle task.
> >
> > Only for thoughts now.
> >
> > Hillf
> >
> > +++ b/kernel/sched/psi.c
> > @@ -412,6 +412,8 @@ static u64 update_averages(struct psi_gr
> >         return avg_next_update;
> >  }
> >
> > +static DEFINE_PER_CPU(int, prev_task_is_idle);
> > +
> >  static void psi_avgs_work(struct work_struct *work)
> >  {
> >         struct delayed_work *dwork;
> > @@ -439,7 +441,7 @@ static void psi_avgs_work(struct work_st
> >         if (now >= group->avg_next_update)
> >                 group->avg_next_update = update_averages(group, now);
> >
> > -       if (nonidle) {
> > +       if (nonidle && 0 == per_cpu(prev_task_is_idle, raw_smp_processor_id())) {
> 
> This condition would be incorrect if nonidle was set by a cpu other
> than raw_smp_processor_id() while
> prev_task_is_idle[raw_smp_processor_id()] == 1.

Thanks for taking a look.

> IOW, if some activity happens on a non-current cpu, we would fail to
> reschedule psi_avgs_work for it.

Given activities on remote CPUs, can you specify what prevents psi_avgs_work
from being scheduled on remote CPUs if for example the local CPU has been
idle for a second?

> This can be fixed in collect_percpu_times() by
> considering prev_task_is_idle for all other CPUs as well. However
> Chengming's approach seems simpler to me TBH and does not require an
> additional per-cpu variable.

Good ideas are always welcome.



* Re: PSI idle-shutoff
  2022-10-11 11:38     ` Hillf Danton
@ 2022-10-11 17:11       ` Suren Baghdasaryan
  2022-10-12  6:20         ` Hillf Danton
  0 siblings, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2022-10-11 17:11 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Pavan Kondeti, Johannes Weiner, linux-mm, linux-kernel, quic_charante

On Tue, Oct 11, 2022 at 4:38 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On 10 Oct 2022 14:16:26 -0700 Suren Baghdasaryan <surenb@google.com>
> > On Mon, Oct 10, 2022 at 3:57 AM Hillf Danton <hdanton@sina.com> wrote:
> > > On 13 Sep 2022 19:38:17 +0530 Pavan Kondeti <quic_pkondeti@quicinc.com>
> > > > Hi
> > > >
> > > > Since psi_avgs_work()->collect_percpu_times()->get_recent_times() runs
> > > > from a kworker thread, the PSI_NONIDLE condition would be observed, as
> > > > there is a RUNNING task. So we would always end up re-arming the work.
> > > >
> > > > If the work is re-armed from psi_avgs_work() itself, the backing-off
> > > > logic in psi_task_change() (to be moved to psi_task_switch() soon) can't
> > > > help: the work is already scheduled, so we don't do anything there.
> > > >
> > > > Probably I am missing something here. Can you please clarify how we
> > > > shut off re-arming of the psi avgs work?
> > >
> > > Instead of open coding schedule_delayed_work() in a bid to check whether
> > > the timer hits the idle task (see delayed_work_timer_fn()), the idle task
> > > is tracked in psi_task_switch() and checked by the kworker to see if it
> > > preempted the idle task.
> > >
> > > Only for thoughts now.
> > >
> > > Hillf
> > >
> > > +++ b/kernel/sched/psi.c
> > > @@ -412,6 +412,8 @@ static u64 update_averages(struct psi_gr
> > >         return avg_next_update;
> > >  }
> > >
> > > +static DEFINE_PER_CPU(int, prev_task_is_idle);
> > > +
> > >  static void psi_avgs_work(struct work_struct *work)
> > >  {
> > >         struct delayed_work *dwork;
> > > @@ -439,7 +441,7 @@ static void psi_avgs_work(struct work_st
> > >         if (now >= group->avg_next_update)
> > >                 group->avg_next_update = update_averages(group, now);
> > >
> > > -       if (nonidle) {
> > > +       if (nonidle && 0 == per_cpu(prev_task_is_idle, raw_smp_processor_id())) {
> >
> > This condition would be incorrect if nonidle was set by a cpu other
> > than raw_smp_processor_id() while
> > prev_task_is_idle[raw_smp_processor_id()] == 1.
>
> Thanks for taking a look.

Thanks for the suggestion!

>
> > IOW, if some activity happens on a non-current cpu, we would fail to
> > reschedule psi_avgs_work for it.
>
> Given activities on remote CPUs, can you specify what prevents psi_avgs_work
> from being scheduled on remote CPUs if for example the local CPU has been
> idle for a second?

I'm not a scheduler expert, but I can imagine some work that finished
running on a big core A and generated some activity since the last
time psi_avgs_work executed. With no other activity, the next
psi_avgs_work could be scheduled on a small core B to conserve power.
There might be other cases involving cpuset limitation changes or cpu
offlining, but I didn't think too hard about these. The bottom line is
that I don't think we should be designing mechanisms which rely on
assumptions about how tasks will be scheduled. Even if these
assumptions are correct today, they might change in the future and
things will break in unexpected places.

>
> > This can be fixed in collect_percpu_times() by
> > considering prev_task_is_idle for all other CPUs as well. However
> > Chengming's approach seems simpler to me TBH and does not require an
> > additional per-cpu variable.
>
> Good ideas are always welcome.

No question about that. Thanks!



* Re: PSI idle-shutoff
  2022-10-11 17:11       ` Suren Baghdasaryan
@ 2022-10-12  6:20         ` Hillf Danton
  2022-10-12 15:40           ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Hillf Danton @ 2022-10-12  6:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Pavan Kondeti, Johannes Weiner, linux-mm, linux-kernel, quic_charante

On 11 Oct 2022 10:11:58 -0700 Suren Baghdasaryan <surenb@google.com>
>On Tue, Oct 11, 2022 at 4:38 AM Hillf Danton <hdanton@sina.com> wrote:
>>
>> Given activities on remote CPUs, can you specify what prevents psi_avgs_work
>> from being scheduled on remote CPUs if for example the local CPU has been
>> idle for a second?
> 
> I'm not a scheduler expert but I can imagine some work that finished
> running on a big core A and generated some activity since the last
> time psi_avgs_work executed.  With no other activity the next
> psi_avgs_work could be scheduled on a small core B to conserve power.

Given core A and B, nothing prevents.

> There might be other cases involving cpuset limitation changes or cpu
> offlining but I didn't think too hard about these. The bottom line, I
> don't think we should be designing mechanisms which rely on
> assumptions about how tasks will be scheduled. Even if these

The mention of tasks here makes me guess that we are on different pages -
scheduling work has little to do with how tasks are scheduled, and is no
more than queuing work on the system_wq in the case of psi_avgs_work,

> assumptions are correct today they might change in the future and
> things will break in unexpected places.

with nothing assumed.



* Re: PSI idle-shutoff
  2022-10-12  6:20         ` Hillf Danton
@ 2022-10-12 15:40           ` Suren Baghdasaryan
  0 siblings, 0 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2022-10-12 15:40 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Pavan Kondeti, Johannes Weiner, linux-mm, linux-kernel, quic_charante

On Tue, Oct 11, 2022 at 11:20 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On 11 Oct 2022 10:11:58 -0700 Suren Baghdasaryan <surenb@google.com>
> >On Tue, Oct 11, 2022 at 4:38 AM Hillf Danton <hdanton@sina.com> wrote:
> >>
> >> Given activities on remote CPUs, can you specify what prevents psi_avgs_work
> >> from being scheduled on remote CPUs if for example the local CPU has been
> >> idle for a second?
> >
> > I'm not a scheduler expert but I can imagine some work that finished
> > running on a big core A and generated some activity since the last
> > time psi_avgs_work executed.  With no other activity the next
> > psi_avgs_work could be scheduled on a small core B to conserve power.
>
> Given core A and B, nothing prevents.
>
> > There might be other cases involving cpuset limitation changes or cpu
> > offlining but I didn't think too hard about these. The bottom line, I
> > don't think we should be designing mechanisms which rely on
> > assumptions about how tasks will be scheduled. Even if these
>
> The tasks here makes me guess that we are on different pages - scheduling
> work has little to do with how tasks are scheduled, and is no more than
> queuing work on the system_wq in the case of psi_avgs_work,

I must have misunderstood your question then. My original concern was
that, in the above example, your suggested patch would not reschedule
psi_avgs_work to aggregate the activity recorded on core A. That is
easily fixable, but a simpler approach looks possible.

>
> > assumptions are correct today they might change in the future and
> > things will break in unexpected places.
>
> with nothing assumed.
>



end of thread, other threads:[~2022-10-12 15:40 UTC | newest]

Thread overview: 6+ messages
     [not found] <20220913140817.GA9091@hu-pkondeti-hyd.qualcomm.com>
2022-10-10 10:57 ` PSI idle-shutoff Hillf Danton
2022-10-10 21:16   ` Suren Baghdasaryan
2022-10-11 11:38     ` Hillf Danton
2022-10-11 17:11       ` Suren Baghdasaryan
2022-10-12  6:20         ` Hillf Danton
2022-10-12 15:40           ` Suren Baghdasaryan
