[Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces

ksummit.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

* [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
@ 2014-05-06 12:54 Rafael J. Wysocki
  2014-05-06 13:37 ` Dave Jones
                   ` (5 more replies)
  0 siblings, 6 replies; 37+ messages in thread
From: Rafael J. Wysocki @ 2014-05-06 12:54 UTC (permalink / raw)
  To: ksummit-discuss
  Cc: Len Brown, Peter Zijlstra, Daniel Lezcano, Amit Kucheria, Ingo Molnar

Hi All,

During a recent discussion on linux-pm/LKML regarding the integration of the
scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4) it became
apparent that the kernel might benefit from adding interfaces to let it know
how far it should go with saving energy, possibly at the expense of performance.

First of all, it would be good to have a place where subsystems and device
drivers can go and check what the current "energy conservation bias" is in
case they need to make a decision between delivering more performance and
using less energy.  Second, it would be good to provide user space with
a means to tell the kernel whether it should care more about performance or
energy.  Finally, it would be good to be able to adjust the overall "energy
conservation bias" automatically in response to certain "power" events such
as "battery is low/critical" etc.

It doesn't seem to be clear currently what level and scope of such interfaces
is appropriate and where to place them.  Would a global knob be useful?  Or
should they be per-subsystem, per-driver, per-task, per-cgroup etc?

It also is not particularly clear what representation of "energy conservation
bias" would be most useful.  Should that be a number or a set of well-defined
discrete levels that can be given names (like "max performance", "high
prerformance", "balanced" etc.)?  If a number, then what units to use and
how many different values to take into account?

The people involved in the scheduler/cpuidle discussion mentioned above were:
 * Amit Kucheria
 * Ingo Molnar
 * Daniel Lezcano
 * Morten Rasmussen
 * Peter Zijlstra
and me, but I think that this topic may be interesting to others too (especially
to Len who proposed a global "enefgy conservation bias" interface a few years ago).

Please let me know what you think.

Kind regards,
Rafael

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 12:54 [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces Rafael J. Wysocki
@ 2014-05-06 13:37 ` Dave Jones
  2014-05-06 13:49 ` Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Dave Jones @ 2014-05-06 13:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano,
	Amit Kucheria, Ingo Molnar

On Tue, May 06, 2014 at 02:54:03PM +0200, Rafael J. Wysocki wrote:

 > First of all, it would be good to have a place where subsystems and device
 > drivers can go and check what the current "energy conservation bias" is in
 > case they need to make a decision between delivering more performance and
 > using less energy.  Second, it would be good to provide user space with
 > a means to tell the kernel whether it should care more about performance or
 > energy.  Finally, it would be good to be able to adjust the overall "energy
 > conservation bias" automatically in response to certain "power" events such
 > as "battery is low/critical" etc.
 > 
 > It doesn't seem to be clear currently what level and scope of such interfaces
 > is appropriate and where to place them.  Would a global knob be useful?  Or
 > should they be per-subsystem, per-driver, per-task, per-cgroup etc?

I had thoughts about something along these lines a few years ago, when I
was still doing cpufreq stuff.

Using s/cpuidle/cpufreq/ but same principles..

 > It also is not particularly clear what representation of "energy conservation
 > bias" would be most useful.  Should that be a number or a set of well-defined
 > discrete levels that can be given names (like "max performance", "high
 > prerformance", "balanced" etc.)?  If a number, then what units to use and
 > how many different values to take into account?

I always thought that exposing frequencies to userspace was cpufreq's
biggest mistake.  If I were to do it all over again, I would do
something probably like the latter example above.

Switching governors from working system-wide to per-process would allow
users to make a lot more decisions like "don't ever change speed for
this pid", which isn't really do-able with our existing framework.

What /proc/pid/power/policy defaults to for each new pid would likely
still need to be configurable, but having users able to set the global
policy to dynamic (ie, on-demand) scaling, while also being able to do

 echo powersave > /proc/$(pidof seti-alien-detector)/power/policy

would I think be a much more deterministic interface over what we have now.
(Plus apps themselves could set their own policy this way).

The advantage of moving to policy names vs frequencies also means that
we could use a single power saving policy for cpufreq, cpuidle, and
whatever else we come up with.

The scheduler might also be able to make better decisions if we maintain
separate lists for each policy-type, prioritizing performance over
power-save etc.

	Dave

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 12:54 [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces Rafael J. Wysocki
  2014-05-06 13:37 ` Dave Jones
@ 2014-05-06 13:49 ` Peter Zijlstra
  2014-05-06 14:51   ` Morten Rasmussen
  2014-05-08 12:29   ` Rafael J. Wysocki
  2014-05-06 14:34 ` Morten Rasmussen
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2014-05-06 13:49 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Len Brown, ksummit-discuss, Daniel Lezcano, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 2267 bytes --]

On Tue, May 06, 2014 at 02:54:03PM +0200, Rafael J. Wysocki wrote:
> Hi All,
> 
> During a recent discussion on linux-pm/LKML regarding the integration of the
> scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4) it became
> apparent that the kernel might benefit from adding interfaces to let it know
> how far it should go with saving energy, possibly at the expense of performance.
> 
> First of all, it would be good to have a place where subsystems and device
> drivers can go and check what the current "energy conservation bias" is in
> case they need to make a decision between delivering more performance and
> using less energy.  Second, it would be good to provide user space with
> a means to tell the kernel whether it should care more about performance or
> energy.  Finally, it would be good to be able to adjust the overall "energy
> conservation bias" automatically in response to certain "power" events such
> as "battery is low/critical" etc.
> 
> It doesn't seem to be clear currently what level and scope of such interfaces
> is appropriate and where to place them.  Would a global knob be useful?  Or
> should they be per-subsystem, per-driver, per-task, per-cgroup etc?

per-task and per-cgroup doesn't seem to make sense to me; its the
hardware that consumes energy.

per-subsystem sounds right to me; I don't care which particular instance
of graphics cards I have, I want whichever one(s) I have to obey.

global doesn't make sense, like stated earlier I absolutely detest
automagic backlight dimming, whereas I don't particularly care about
compute speed at all.

So while I might want a energy conserving bias for say the CPU and GPU,
I most definitely don't want that to dim the screen.

> It also is not particularly clear what representation of "energy conservation
> bias" would be most useful.  Should that be a number or a set of well-defined
> discrete levels that can be given names (like "max performance", "high
> prerformance", "balanced" etc.)?  If a number, then what units to use and
> how many different values to take into account?

Yeah, fun.. we're not even sure how to make it do the 0,1 variants, and
now you want a sliding scale to make it do in-betweens ;-)

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 12:54 [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces Rafael J. Wysocki
  2014-05-06 13:37 ` Dave Jones
  2014-05-06 13:49 ` Peter Zijlstra
@ 2014-05-06 14:34 ` Morten Rasmussen
  2014-05-06 17:51 ` Preeti U Murthy
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2014-05-06 14:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano,
	Amit Kucheria, Ingo Molnar

On Tue, May 06, 2014 at 01:54:03PM +0100, Rafael J. Wysocki wrote:
> Hi All,
> 
> During a recent discussion on linux-pm/LKML regarding the integration of the
> scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4) it became
> apparent that the kernel might benefit from adding interfaces to let it know
> how far it should go with saving energy, possibly at the expense of performance.
> 
> First of all, it would be good to have a place where subsystems and device
> drivers can go and check what the current "energy conservation bias" is in
> case they need to make a decision between delivering more performance and
> using less energy.  Second, it would be good to provide user space with
> a means to tell the kernel whether it should care more about performance or
> energy.  Finally, it would be good to be able to adjust the overall "energy
> conservation bias" automatically in response to certain "power" events such
> as "battery is low/critical" etc.
> 
> It doesn't seem to be clear currently what level and scope of such interfaces
> is appropriate and where to place them.  Would a global knob be useful?  Or
> should they be per-subsystem, per-driver, per-task, per-cgroup etc?

A single global knob would mean that all subsystems and drivers would go
into performance mode if just one task needs high performance in one
subsystem (unless we ignore the request and let the task suffer). I
think that would be acceptable? Userspace/muddleware would have to
continuously track task requirements and when they are active to
influence the current knob setting. Either by setting it directly or
providing input to the kernel that affects the current knob setting.

> It also is not particularly clear what representation of "energy conservation
> bias" would be most useful.  Should that be a number or a set of well-defined
> discrete levels that can be given names (like "max performance", "high
> prerformance", "balanced" etc.)?  If a number, then what units to use and
> how many different values to take into account?

I don't think two or three discrete settings would be sufficient. As
mentioned in the thread, energy-awareness is not a big switch to turn
everything off or down to a minimum. It is a change of optimization
objective to also consider energy along with performance. The question
is how much performance we are willing to sacrifice to save energy.
IMHO, your energy value proposal in the thread would be useful to make
that decision. However, units and if it is useful for all subsystems and
drivers is unclear to me.

> The people involved in the scheduler/cpuidle discussion mentioned above were:
>  * Amit Kucheria
>  * Ingo Molnar
>  * Daniel Lezcano
>  * Morten Rasmussen
>  * Peter Zijlstra
> and me, but I think that this topic may be interesting to others too (especially
> to Len who proposed a global "enefgy conservation bias" interface a few years ago).
> 
> Please let me know what you think.

I'm indeed interested in this topic.

Thanks,
Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 13:49 ` Peter Zijlstra
@ 2014-05-06 14:51   ` Morten Rasmussen
  2014-05-06 15:39     ` Peter Zijlstra
  2014-05-08 12:29   ` Rafael J. Wysocki
  1 sibling, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2014-05-06 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Len Brown, ksummit-discuss, Daniel Lezcano, Amit Kucheria, Ingo Molnar

On Tue, May 06, 2014 at 02:49:09PM +0100, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 02:54:03PM +0200, Rafael J. Wysocki wrote:
> > Hi All,
> > 
> > During a recent discussion on linux-pm/LKML regarding the integration of the
> > scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4) it became
> > apparent that the kernel might benefit from adding interfaces to let it know
> > how far it should go with saving energy, possibly at the expense of performance.
> > 
> > First of all, it would be good to have a place where subsystems and device
> > drivers can go and check what the current "energy conservation bias" is in
> > case they need to make a decision between delivering more performance and
> > using less energy.  Second, it would be good to provide user space with
> > a means to tell the kernel whether it should care more about performance or
> > energy.  Finally, it would be good to be able to adjust the overall "energy
> > conservation bias" automatically in response to certain "power" events such
> > as "battery is low/critical" etc.
> > 
> > It doesn't seem to be clear currently what level and scope of such interfaces
> > is appropriate and where to place them.  Would a global knob be useful?  Or
> > should they be per-subsystem, per-driver, per-task, per-cgroup etc?
> 
> per-task and per-cgroup doesn't seem to make sense to me; its the
> hardware that consumes energy.

True. But performance requirements are associated with tasks or groups
of tasks. We also need an interface to get input from userspace to tell
us when it is acceptable to potentially loose performance to save
energy. IIUC, that is Rafael's second point above.

Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 14:51   ` Morten Rasmussen
@ 2014-05-06 15:39     ` Peter Zijlstra
  2014-05-06 16:04       ` Morten Rasmussen
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2014-05-06 15:39 UTC (permalink / raw)
  To: Morten Rasmussen; +Cc: Len Brown, ksummit-discuss, Daniel Lezcano, Ingo Molnar

On Tue, May 06, 2014 at 03:51:25PM +0100, Morten Rasmussen wrote:
> On Tue, May 06, 2014 at 02:49:09PM +0100, Peter Zijlstra wrote:
> > On Tue, May 06, 2014 at 02:54:03PM +0200, Rafael J. Wysocki wrote:
> > > Hi All,
> > > 
> > > During a recent discussion on linux-pm/LKML regarding the integration of the
> > > scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4) it became
> > > apparent that the kernel might benefit from adding interfaces to let it know
> > > how far it should go with saving energy, possibly at the expense of performance.
> > > 
> > > First of all, it would be good to have a place where subsystems and device
> > > drivers can go and check what the current "energy conservation bias" is in
> > > case they need to make a decision between delivering more performance and
> > > using less energy.  Second, it would be good to provide user space with
> > > a means to tell the kernel whether it should care more about performance or
> > > energy.  Finally, it would be good to be able to adjust the overall "energy
> > > conservation bias" automatically in response to certain "power" events such
> > > as "battery is low/critical" etc.
> > > 
> > > It doesn't seem to be clear currently what level and scope of such interfaces
> > > is appropriate and where to place them.  Would a global knob be useful?  Or
> > > should they be per-subsystem, per-driver, per-task, per-cgroup etc?
> > 
> > per-task and per-cgroup doesn't seem to make sense to me; its the
> > hardware that consumes energy.
> 
> True. But performance requirements are associated with tasks or groups
> of tasks. We also need an interface to get input from userspace to tell
> us when it is acceptable to potentially loose performance to save
> energy. IIUC, that is Rafael's second point above.

That's the QoS thing. While related I don't think we should confuse the
two.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 15:39     ` Peter Zijlstra
@ 2014-05-06 16:04       ` Morten Rasmussen
  0 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2014-05-06 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Len Brown, ksummit-discuss, Daniel Lezcano, Amit Kucheria, Ingo Molnar

On Tue, May 06, 2014 at 04:39:56PM +0100, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 03:51:25PM +0100, Morten Rasmussen wrote:
> > On Tue, May 06, 2014 at 02:49:09PM +0100, Peter Zijlstra wrote:
> > > On Tue, May 06, 2014 at 02:54:03PM +0200, Rafael J. Wysocki wrote:
> > > > Hi All,
> > > > 
> > > > During a recent discussion on linux-pm/LKML regarding the integration of the
> > > > scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4) it became
> > > > apparent that the kernel might benefit from adding interfaces to let it know
> > > > how far it should go with saving energy, possibly at the expense of performance.
> > > > 
> > > > First of all, it would be good to have a place where subsystems and device
> > > > drivers can go and check what the current "energy conservation bias" is in
> > > > case they need to make a decision between delivering more performance and
> > > > using less energy.  Second, it would be good to provide user space with
> > > > a means to tell the kernel whether it should care more about performance or
> > > > energy.  Finally, it would be good to be able to adjust the overall "energy
> > > > conservation bias" automatically in response to certain "power" events such
> > > > as "battery is low/critical" etc.
> > > > 
> > > > It doesn't seem to be clear currently what level and scope of such interfaces
> > > > is appropriate and where to place them.  Would a global knob be useful?  Or
> > > > should they be per-subsystem, per-driver, per-task, per-cgroup etc?
> > > 
> > > per-task and per-cgroup doesn't seem to make sense to me; its the
> > > hardware that consumes energy.
> > 
> > True. But performance requirements are associated with tasks or groups
> > of tasks. We also need an interface to get input from userspace to tell
> > us when it is acceptable to potentially loose performance to save
> > energy. IIUC, that is Rafael's second point above.
> 
> That's the QoS thing. While related I don't think we should confuse the
> two.

Fully agree. We need both.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 12:54 [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2014-05-06 14:34 ` Morten Rasmussen
@ 2014-05-06 17:51 ` Preeti U Murthy
  2014-05-08 12:58   ` Rafael J. Wysocki
  2014-05-07 21:03 ` Paul Gortmaker
  2014-05-12 11:53 ` Amit Kucheria
  5 siblings, 1 reply; 37+ messages in thread
From: Preeti U Murthy @ 2014-05-06 17:51 UTC (permalink / raw)
  To: Rafael J. Wysocki, ksummit-discuss, Peter Zijlstra, Morten Rasmussen
  Cc: Len Brown, Daniel Lezcano, Ingo Molnar, Amit Kucheria

Hi,

On 05/06/2014 06:24 PM, Rafael J. Wysocki wrote:
> Hi All,
>
> During a recent discussion on linux-pm/LKML regarding the integration
of the
> scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4)
it became
> apparent that the kernel might benefit from adding interfaces to let
it know
> how far it should go with saving energy, possibly at the expense of
performance.
>
> First of all, it would be good to have a place where subsystems and device
> drivers can go and check what the current "energy conservation bias" is in
> case they need to make a decision between delivering more performance and
> using less energy.  Second, it would be good to provide user space with
> a means to tell the kernel whether it should care more about
performance or
> energy.  Finally, it would be good to be able to adjust the overall
"energy
> conservation bias" automatically in response to certain "power" events
such
> as "battery is low/critical" etc.

With respect to the point around user space being able to tell the
kernel what it wants I have the following idea.This is actually
extending what Dave quoted in his reply to this thread:

"
The advantage of moving to policy names vs frequencies also means that
we could use a single power saving policy for cpufreq, cpuidle, and
whatever else we come up with.

The scheduler might also be able to make better decisions if we maintain
separate lists for each policy-type, prioritizing performance over
power-save etc."

Tuned today exposes profiles like powersave, performance which set
kernel parameters, cpu-freq and cpu idle governors for these extreme use
cases. In powersave policy we do not worry about performance and vice
versa. However if one finds these as aggressive approaches to their
goals, there is balanced profile as well, which switches to powersave at
low load and to performance at high load. Even if latency sensitive
workloads run in this profile they will get hit only during the switch
from powersave to performance mode, but thereafter will get their way.

The advantage of having the concept of profiles is as Dave mentions,if
the user chooses a specific tuned profile, *multiple sub-system settings
can be taken care of in one place*. The profile could make way for
cpufreq, cpuidle, scheduler, device driver settings provided each of
these expose parameters which allow tuning of their decisions. So to
answer your question of if device drivers must probe the user settings,
I don't think so. These profiles can set the required driver parameters
which should automatically then kick in?

Today cpuidle and cpufreq already expose these settings through
governors. I am also assuming device drivers have scope for tuning their
functions through some such user exposed parameters. Memory can come
under this ambit too. Now lets consider scheduler which is set to join
this league.
   We could discuss and come up with some suitable parameters like
discrete levels of Perf/Watt which will allow the scheduler to take
appropriate decisions. (Of course we will need to work on this decision
making part of the scheduler.) So the tuned profiles could further
include the scheduler settings as well.

The point is that, profiles is a nice way of allowing the user to make
his choices. If he does not want to put in too much effort apart from
making a choice of profile, he can simply switch the currently active
profile to the one that meets his goal and not bother about the settings
it is doing internally. If he instead wants to have more fine grained
control over the settings, he can create a custom profile deriving out
of the existing tuned profiles.

Look at an example for a tuned profile for performance:
start() gets called when the profile is switched to and stop() when its
turned off. We could include the scheduling parameters in the profile
when we come up with the set of them.

 start() {
     [ "$USB_AUTOSUSPEND" = 1 ] && enable_usb_autosuspend
     set_disk_alpm min_power
     enable_cpu_multicore_powersave
     set_cpu_governor ondemand
     enable_snd_ac97_powersave
     set_hda_intel_powersave 10
     enable_wifi_powersave
     set_radeon_powersave auto
     return 0
 }

 stop() {
     [ "$USB_AUTOSUSPEND" = 1 ] && disable_usb_autosuspend
     set_disk_alpm max_performance
     disable_cpu_multicore_powersave
     restore_cpu_governor
     restore_snd_ac97_powersave
     restore_hda_intel_powersave
     disable_wifi_powersave
     restore_radeon_powersave
     return 0
 }

>
> It doesn't seem to be clear currently what level and scope of such
interfaces
> is appropriate and where to place them.  Would a global knob be
useful?  Or
> should they be per-subsystem, per-driver, per-task, per-cgroup etc?

A global knob would be useful in the case where the user chooses
performance policy for example. It means he expects the kernel to
*never* sacrifice performance for powersave. Now assume that a set of
tasks is running on 4 cpus out of 10. If the user has chosen performance
policy, *none of the 10 cpus should enter deep idle states* lest they
affect the latency of the tasks. Here a global knob would do well.

For less aggressive policies like balanced policy, a per-task policy
would do very well. Assume the above same scenario, we would want to
disable deep idle states only for those 4 cpus that we are running on
and allow the remaining 6 to enter deep idle states. Of course this
would mean that if the task gets scheduled on one of those 6, it would
take a latency hit, but only initially. The per-task knob would then
prevent that cpu from entering deep idle states henceforth. Or we could
use cgroups to prevent even such a thing from happening and make it a
per-cgroup knob if even the initial latency hit cannot be tolerated.

So having both per-task and global knobs may help depending on the profiles.
>
> It also is not particularly clear what representation of "energy
conservation
> bias" would be most useful.  Should that be a number or a set of
well-defined
> discrete levels that can be given names (like "max performance", "high
> prerformance", "balanced" etc.)?  If a number, then what units to use and
> how many different values to take into account?

Currently tuned has a good set of initial profiles. We could start with
them and add tunings which could be discrete values or could be policy
names depending on the sub-system. As for scheduler we could start with
with auto, power, performance and then move on to discrete values I guess.
>
> The people involved in the scheduler/cpuidle discussion mentioned
above were:
>  * Amit Kucheria
>  * Ingo Molnar
>  * Daniel Lezcano
>  * Morten Rasmussen
>  * Peter Zijlstra
> and me, but I think that this topic may be interesting to others too
(especially

I have been working on improving Energy management on PowerPC over the
last year. Specifically I have worked on extending the tick broadcast
framework in the kernel to support deep idle states and helped review
and improvise the cpufreq driver for PowerNV platforms.

https://lkml.org/lkml/2014/2/7/608
http://thread.gmane.org/gmane.linux.power-management.general/44175

Besides this I have been helping out in efforts to integrate cpuidle
with scheduler over the last year.

I wish to be a part of this discussion and look forward to sharing my
ideas on Energy management in the kernel. I am very interested in
bringing to the table, the challenges and solutions that we have on
PowerPC in the area of Energy Management. Please consider my
participation in this discussion.

Thank you

Regards
Preeti U Murthy

> to Len who proposed a global "enefgy conservation bias" interface a
few years ago).
>
> Please let me know what you think.
>
> Kind regards,
> Rafael
>
> _______________________________________________
> Ksummit-discuss mailing list
> Ksummit-discuss@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 12:54 [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2014-05-06 17:51 ` Preeti U Murthy
@ 2014-05-07 21:03 ` Paul Gortmaker
  2014-05-12 11:53 ` Amit Kucheria
  5 siblings, 0 replies; 37+ messages in thread
From: Paul Gortmaker @ 2014-05-07 21:03 UTC (permalink / raw)
  To: Rafael J. Wysocki, ksummit-discuss
  Cc: Peter Zijlstra, Len Brown, Daniel Lezcano, Ingo Molnar

On 14-05-06 08:54 AM, Rafael J. Wysocki wrote:
> Hi All,
> 
> During a recent discussion on linux-pm/LKML regarding the integration of the
> scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4) it became
> apparent that the kernel might benefit from adding interfaces to let it know
> how far it should go with saving energy, possibly at the expense of performance.

These links from last year might be of general interest:

http://lwn.net/Articles/571414/

http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013

Paul.
--

> 
> First of all, it would be good to have a place where subsystems and device
> drivers can go and check what the current "energy conservation bias" is in
> case they need to make a decision between delivering more performance and
> using less energy.  Second, it would be good to provide user space with
> a means to tell the kernel whether it should care more about performance or
> energy.  Finally, it would be good to be able to adjust the overall "energy
> conservation bias" automatically in response to certain "power" events such
> as "battery is low/critical" etc.
> 
> It doesn't seem to be clear currently what level and scope of such interfaces
> is appropriate and where to place them.  Would a global knob be useful?  Or
> should they be per-subsystem, per-driver, per-task, per-cgroup etc?
> 
> It also is not particularly clear what representation of "energy conservation
> bias" would be most useful.  Should that be a number or a set of well-defined
> discrete levels that can be given names (like "max performance", "high
> prerformance", "balanced" etc.)?  If a number, then what units to use and
> how many different values to take into account?
> 
> The people involved in the scheduler/cpuidle discussion mentioned above were:
>  * Amit Kucheria
>  * Ingo Molnar
>  * Daniel Lezcano
>  * Morten Rasmussen
>  * Peter Zijlstra
> and me, but I think that this topic may be interesting to others too (especially
> to Len who proposed a global "enefgy conservation bias" interface a few years ago).
> 
> Please let me know what you think.
> 
> Kind regards,
> Rafael
> 
> _______________________________________________
> Ksummit-discuss mailing list
> Ksummit-discuss@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 13:49 ` Peter Zijlstra
  2014-05-06 14:51   ` Morten Rasmussen
@ 2014-05-08 12:29   ` Rafael J. Wysocki
  1 sibling, 0 replies; 37+ messages in thread
From: Rafael J. Wysocki @ 2014-05-08 12:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Len Brown, ksummit-discuss, Daniel Lezcano, Ingo Molnar

On Tuesday, May 06, 2014 03:49:09 PM Peter Zijlstra wrote:
> 
> --2F7AbV2suvT8PGoH
> Content-Type: text/plain; charset=us-ascii
> Content-Disposition: inline
> Content-Transfer-Encoding: quoted-printable
> 
> On Tue, May 06, 2014 at 02:54:03PM +0200, Rafael J. Wysocki wrote:
> > Hi All,
> >=20
> > During a recent discussion on linux-pm/LKML regarding the integration of =
> the
> > scheduler with cpuidle (http://marc.info/?t=3D139834240600003&r=3D1&w=3D4=
> ) it became
> > apparent that the kernel might benefit from adding interfaces to let it k=
> now
> > how far it should go with saving energy, possibly at the expense of perfo=
> rmance.
> >=20
> > First of all, it would be good to have a place where subsystems and device
> > drivers can go and check what the current "energy conservation bias" is in
> > case they need to make a decision between delivering more performance and
> > using less energy.  Second, it would be good to provide user space with
> > a means to tell the kernel whether it should care more about performance =
> or
> > energy.  Finally, it would be good to be able to adjust the overall "ener=
> gy
> > conservation bias" automatically in response to certain "power" events su=
> ch
> > as "battery is low/critical" etc.
> >=20
> > It doesn't seem to be clear currently what level and scope of such interf=
> aces
> > is appropriate and where to place them.  Would a global knob be useful?  =
> Or
> > should they be per-subsystem, per-driver, per-task, per-cgroup etc?
> 
> per-task and per-cgroup doesn't seem to make sense to me; its the
> hardware that consumes energy.
> 
> per-subsystem sounds right to me; I don't care which particular instance
> of graphics cards I have, I want whichever one(s) I have to obey.
> 
> global doesn't make sense, like stated earlier I absolutely detest
> automagic backlight dimming, whereas I don't particularly care about
> compute speed at all.

Except that subsustems may not be as independent as they may seem to be.
Take the graphics and the CPU.  They may share thermal constraints, for
example, and other things like clocks and voltage regulators.

Per-subsystem settings may not be adequate in such cases.

> So while I might want a energy conserving bias for say the CPU and GPU,
> I most definitely don't want that to dim the screen.

That happens because user space has its own ideas about what's appropriate. :-)

The question here is whether or not we want the kernel to react to events
like "we're on battery now" or "battery is low" which are global.

So while knobs may be per subsystem, there needs to be some kind of coordination
it seems or at least a notification mechanism that the subsystems can subscribe
to.

> > It also is not particularly clear what representation of "energy conserva=
> tion
> > bias" would be most useful.  Should that be a number or a set of well-def=
> ined
> > discrete levels that can be given names (like "max performance", "high
> > prerformance", "balanced" etc.)?  If a number, then what units to use and
> > how many different values to take into account?
> 
> Yeah, fun.. we're not even sure how to make it do the 0,1 variants, and
> now you want a sliding scale to make it do in-betweens ;-)

Well, if we don't think we can do better that 0,1, that will be the choice
I guess. :-)

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 17:51 ` Preeti U Murthy
@ 2014-05-08 12:58   ` Rafael J. Wysocki
  2014-05-08 14:57     ` Iyer, Sundar
  2014-05-10 16:59     ` Preeti U Murthy
  0 siblings, 2 replies; 37+ messages in thread
From: Rafael J. Wysocki @ 2014-05-08 12:58 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Tuesday, May 06, 2014 11:21:35 PM Preeti U Murthy wrote:
> Hi,
> 
> On 05/06/2014 06:24 PM, Rafael J. Wysocki wrote:
> > Hi All,
> >
> > During a recent discussion on linux-pm/LKML regarding the integration
> of the
> > scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4)
> it became
> > apparent that the kernel might benefit from adding interfaces to let
> it know
> > how far it should go with saving energy, possibly at the expense of
> performance.
> >
> > First of all, it would be good to have a place where subsystems and device
> > drivers can go and check what the current "energy conservation bias" is in
> > case they need to make a decision between delivering more performance and
> > using less energy.  Second, it would be good to provide user space with
> > a means to tell the kernel whether it should care more about
> performance or
> > energy.  Finally, it would be good to be able to adjust the overall
> "energy
> > conservation bias" automatically in response to certain "power" events
> such
> > as "battery is low/critical" etc.
> 
> With respect to the point around user space being able to tell the
> kernel what it wants I have the following idea.This is actually
> extending what Dave quoted in his reply to this thread:
> 
> "
> The advantage of moving to policy names vs frequencies also means that
> we could use a single power saving policy for cpufreq, cpuidle, and
> whatever else we come up with.
> 
> The scheduler might also be able to make better decisions if we maintain
> separate lists for each policy-type, prioritizing performance over
> power-save etc."

I generally agree with this.

> Tuned today exposes profiles like powersave, performance which set
> kernel parameters, cpu-freq and cpu idle governors for these extreme use
> cases. In powersave policy we do not worry about performance and vice
> versa. However if one finds these as aggressive approaches to their
> goals, there is balanced profile as well, which switches to powersave at
> low load and to performance at high load. Even if latency sensitive
> workloads run in this profile they will get hit only during the switch
> from powersave to performance mode, but thereafter will get their way.
> 
> The advantage of having the concept of profiles is as Dave mentions,if
> the user chooses a specific tuned profile, *multiple sub-system settings
> can be taken care of in one place*. The profile could make way for
> cpufreq, cpuidle, scheduler, device driver settings provided each of
> these expose parameters which allow tuning of their decisions. So to
> answer your question of if device drivers must probe the user settings,
> I don't think so. These profiles can set the required driver parameters
> which should automatically then kick in?

That's something I was thinking about too, but the difficulty here is in
how to define the profiles (that is, what settings in each subsystem are
going to be affected by a profile change) and in deciding when to switch
profiles and which profile is the most appropriate going forward.

IOW, the high-level concept looks nice, but the details of the implementation
are important too. :-)

> Today cpuidle and cpufreq already expose these settings through
> governors.

cpufreq governors are kind of tied to specific "energy efficiency" profiles,
performance, powersave, on-demand.  However, cpuidle governors are rather
different in that respect.

> I am also assuming device drivers have scope for tuning their
> functions through some such user exposed parameters. Memory can come
> under this ambit too. Now lets consider scheduler which is set to join
> this league.
>    We could discuss and come up with some suitable parameters like
> discrete levels of Perf/Watt which will allow the scheduler to take

I prefer the amout of work per energy unit to perf/Watt (which is the same
number BTW), but that's just a detail.

> appropriate decisions. (Of course we will need to work on this decision
> making part of the scheduler.) So the tuned profiles could further
> include the scheduler settings as well.
> 
> The point is that, profiles is a nice way of allowing the user to make
> his choices. If he does not want to put in too much effort apart from
> making a choice of profile, he can simply switch the currently active
> profile to the one that meets his goal and not bother about the settings
> it is doing internally. If he instead wants to have more fine grained
> control over the settings, he can create a custom profile deriving out
> of the existing tuned profiles.
> 
> Look at an example for a tuned profile for performance:
> start() gets called when the profile is switched to and stop() when its
> turned off. We could include the scheduling parameters in the profile
> when we come up with the set of them.
> 
>  start() {
>      [ "$USB_AUTOSUSPEND" = 1 ] && enable_usb_autosuspend
>      set_disk_alpm min_power
>      enable_cpu_multicore_powersave
>      set_cpu_governor ondemand
>      enable_snd_ac97_powersave
>      set_hda_intel_powersave 10
>      enable_wifi_powersave
>      set_radeon_powersave auto
>      return 0
>  }
> 
>  stop() {
>      [ "$USB_AUTOSUSPEND" = 1 ] && disable_usb_autosuspend
>      set_disk_alpm max_performance
>      disable_cpu_multicore_powersave
>      restore_cpu_governor
>      restore_snd_ac97_powersave
>      restore_hda_intel_powersave
>      disable_wifi_powersave
>      restore_radeon_powersave
>      return 0
>  }

You seem to think that user space would operate those profiles, but the
experience so far is that user space is not actually good at doing things
like that.  We have exposed a number of PM-related knobs to user space,
but in may cases it actively refuses to use them (we have dropped a couple
of them too for this very reason).

This means expecting user space *alone* to do the right thing and tell the
kernel what to do next with the help of all of the individual knobs spread
all over the place is not entirely realistic in my view.

Yes, I think there should be ways for user space to indicate what its
current preference (or policy if you will) is, but those should be
relatively simple and starightforward to use.

For example, we have a per-device knob that user space can use to indicate
whether or not runtime PM should be used for the devices, if available.
As a result, if a user wants to enable runtime PM for all devices, she or
he has to go through all of them and switch the knob for each one individually,
which it would be easier to use a common big switch for that.  And that big
switch would be more likely to be actually used just because it is big
and makes a big difference.

> > It doesn't seem to be clear currently what level and scope of such
> interfaces
> > is appropriate and where to place them.  Would a global knob be
> useful?  Or
> > should they be per-subsystem, per-driver, per-task, per-cgroup etc?
> 
> A global knob would be useful in the case where the user chooses
> performance policy for example. It means he expects the kernel to
> *never* sacrifice performance for powersave. Now assume that a set of
> tasks is running on 4 cpus out of 10. If the user has chosen performance
> policy, *none of the 10 cpus should enter deep idle states* lest they
> affect the latency of the tasks. Here a global knob would do well.
> 
> For less aggressive policies like balanced policy, a per-task policy
> would do very well. Assume the above same scenario, we would want to
> disable deep idle states only for those 4 cpus that we are running on
> and allow the remaining 6 to enter deep idle states. Of course this
> would mean that if the task gets scheduled on one of those 6, it would
> take a latency hit, but only initially. The per-task knob would then
> prevent that cpu from entering deep idle states henceforth. Or we could
> use cgroups to prevent even such a thing from happening and make it a
> per-cgroup knob if even the initial latency hit cannot be tolerated.

I'm still seeing a problem with mixing tasks with different "energy"
settings.  If there are "performance" and "energy friendly" tasks to
run at the same time, it is not particularly clear how to the load
balancer should handle them, for one example.

> So having both per-task and global knobs may help depending on the profiles.
> >
> > It also is not particularly clear what representation of "energy
> conservation
> > bias" would be most useful.  Should that be a number or a set of
> well-defined
> > discrete levels that can be given names (like "max performance", "high
> > prerformance", "balanced" etc.)?  If a number, then what units to use and
> > how many different values to take into account?
> 
> Currently tuned has a good set of initial profiles. We could start with
> them and add tunings which could be discrete values or could be policy
> names depending on the sub-system. As for scheduler we could start with
> with auto, power, performance and then move on to discrete values I guess.

What you're suggesting seems to be to start with the "levels" that are
defined currently, by cpufreq governors for one example, and then to add
more over time as needed.  Is that correct?


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-08 12:58   ` Rafael J. Wysocki
@ 2014-05-08 14:57     ` Iyer, Sundar
  2014-05-12 16:44       ` Preeti U Murthy
  2014-05-10 16:59     ` Preeti U Murthy
  1 sibling, 1 reply; 37+ messages in thread
From: Iyer, Sundar @ 2014-05-08 14:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Preeti U Murthy
  Cc: Peter Zijlstra, Brown, Len, Daniel Lezcano, Ingo Molnar, ksummit-discuss

> -----Original Message-----
> From: ksummit-discuss-bounces@lists.linuxfoundation.org [mailto:ksummit-
> discuss-bounces@lists.linuxfoundation.org] On Behalf Of Rafael J.
> Wysocki
> Sent: Thursday, May 8, 2014 6:28 PM

> That's something I was thinking about too, but the difficulty here is in how to
> define the profiles (that is, what settings in each subsystem are going to be
> affected by a profile change) and in deciding when to switch profiles and
> which profile is the most appropriate going forward.
> 
> IOW, the high-level concept looks nice, but the details of the implementation
> are important too. :-)

I agree. Defining these profiles and trying to fit them into a system definition, 
system usage policy and above all user usage policy is where the sticking point is.

> > Today cpuidle and cpufreq already expose these settings through
> > governors.
> 
> cpufreq governors are kind of tied to specific "energy efficiency" profiles,
> performance, powersave, on-demand.  However, cpuidle governors are

I am not sure if that is correct. IMO Cpufreq governors function simply as per
policies defined to meet user experience. A system may choose to sacrifice
user experience @ the cost of running the CPU at the lowest frequency, but the
governor has no idea if it was really energy efficient for the platform. Similarly,
the governor might decide to run at a higher turbo frequency for better user
responsiveness, but it still doesn't know if it was energy efficient running @ those
frequencies. I am coming back to the point that energy efficiency is countable
_only_ at the platform level: if it results in a longer battery life w/o needing to plug in.

> > Look at an example for a tuned profile for performance:
> > start() gets called when the profile is switched to and stop() when
> > its turned off. We could include the scheduling parameters in the
> > profile when we come up with the set of them.
> >
> >  start() {
> >      [ "$USB_AUTOSUSPEND" = 1 ] && enable_usb_autosuspend
> >      set_disk_alpm min_power
> >      enable_cpu_multicore_powersave
> >      set_cpu_governor ondemand
> >      enable_snd_ac97_powersave
> >      set_hda_intel_powersave 10
> >      enable_wifi_powersave
> >      set_radeon_powersave auto
> >      return 0
> >  }
> >
> >  stop() {
> >      [ "$USB_AUTOSUSPEND" = 1 ] && disable_usb_autosuspend
> >      set_disk_alpm max_performance
> >      disable_cpu_multicore_powersave
> >      restore_cpu_governor
> >      restore_snd_ac97_powersave
> >      restore_hda_intel_powersave
> >      disable_wifi_powersave
> >      restore_radeon_powersave
> >      return 0
> >  }
> 
> You seem to think that user space would operate those profiles, but the
> experience so far is that user space is not actually good at doing things like
> that.  We have exposed a number of PM-related knobs to user space, but in
> may cases it actively refuses to use them (we have dropped a couple of
> them too for this very reason).

Please correct me if I am wrong, but items like wifi_powersave are something
default on most systems especially with per-device runtime power management.
Most devices are runtime managed and I don't see a strong reason to switch
device policies as energy saving methods.

> > A global knob would be useful in the case where the user chooses
> > performance policy for example. It means he expects the kernel to
> > *never* sacrifice performance for powersave. Now assume that a set of
> > tasks is running on 4 cpus out of 10. If the user has chosen
> > performance policy, *none of the 10 cpus should enter deep idle
> > states* lest they affect the latency of the tasks. Here a global knob would
> do well.

For this specific example, when you say the *user* has chosen the policy, do
you mean a user space daemon that takes care of this or the application itself?

How are we going to know if we will really save energy by limiting deep idle states
on all the 10 CPUs? Please help me understand this.

Cheers!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-08 12:58   ` Rafael J. Wysocki
  2014-05-08 14:57     ` Iyer, Sundar
@ 2014-05-10 16:59     ` Preeti U Murthy
  1 sibling, 0 replies; 37+ messages in thread
From: Preeti U Murthy @ 2014-05-10 16:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On 05/08/2014 06:28 PM, Rafael J. Wysocki wrote:
>>
>> The advantage of having the concept of profiles is as Dave mentions,if
>> the user chooses a specific tuned profile, *multiple sub-system settings
>> can be taken care of in one place*. The profile could make way for
>> cpufreq, cpuidle, scheduler, device driver settings provided each of
>> these expose parameters which allow tuning of their decisions. So to
>> answer your question of if device drivers must probe the user settings,
>> I don't think so. These profiles can set the required driver parameters
>> which should automatically then kick in?
> 
> That's something I was thinking about too, but the difficulty here is in
> how to define the profiles (that is, what settings in each subsystem are
> going to be affected by a profile change) and in deciding when to switch
> profiles and which profile is the most appropriate going forward.
> 
> IOW, the high-level concept looks nice, but the details of the implementation
> are important too. :-)

I was thinking something as elementary as a powersave profile,
performance profile and balanced profile. The default should be balanced
profile where runtime pm of all devices kick in. The powersave profile
has conservative pm; a static powersave mode. The performance profile
has zero latency tolerance; no power management.

The kernel will remain in balanced profile unless the user changes his
choice of profile. In balanced profile, the behaviour of the
participating sub-systems should be
1.monitor load/relevant metric
2. Adjust device settings in a cycle.

> 
>> Today cpuidle and cpufreq already expose these settings through
>> governors.
> 
> cpufreq governors are kind of tied to specific "energy efficiency" profiles,
> performance, powersave, on-demand.  However, cpuidle governors are rather
> different in that respect.
> 
>> I am also assuming device drivers have scope for tuning their
>> functions through some such user exposed parameters. Memory can come
>> under this ambit too. Now lets consider scheduler which is set to join
>> this league.
>>    We could discuss and come up with some suitable parameters like
>> discrete levels of Perf/Watt which will allow the scheduler to take
> 
> I prefer the amout of work per energy unit to perf/Watt (which is the same
> number BTW), but that's just a detail.

Right. It makes it clearer.
> 
>> appropriate decisions. (Of course we will need to work on this decision
>> making part of the scheduler.) So the tuned profiles could further
>> include the scheduler settings as well.
>>
>> The point is that, profiles is a nice way of allowing the user to make
>> his choices. If he does not want to put in too much effort apart from
>> making a choice of profile, he can simply switch the currently active
>> profile to the one that meets his goal and not bother about the settings
>> it is doing internally. If he instead wants to have more fine grained
>> control over the settings, he can create a custom profile deriving out
>> of the existing tuned profiles.
>>
>> Look at an example for a tuned profile for performance:
>> start() gets called when the profile is switched to and stop() when its
>> turned off. We could include the scheduling parameters in the profile
>> when we come up with the set of them.
>>
>>  start() {
>>      [ "$USB_AUTOSUSPEND" = 1 ] && enable_usb_autosuspend
>>      set_disk_alpm min_power
>>      enable_cpu_multicore_powersave
>>      set_cpu_governor ondemand
>>      enable_snd_ac97_powersave
>>      set_hda_intel_powersave 10
>>      enable_wifi_powersave
>>      set_radeon_powersave auto
>>      return 0
>>  }
>>
>>  stop() {
>>      [ "$USB_AUTOSUSPEND" = 1 ] && disable_usb_autosuspend
>>      set_disk_alpm max_performance
>>      disable_cpu_multicore_powersave
>>      restore_cpu_governor
>>      restore_snd_ac97_powersave
>>      restore_hda_intel_powersave
>>      disable_wifi_powersave
>>      restore_radeon_powersave
>>      return 0
>>  }
> 
> You seem to think that user space would operate those profiles, but the
> experience so far is that user space is not actually good at doing things
> like that.  We have exposed a number of PM-related knobs to user space,
> but in may cases it actively refuses to use them (we have dropped a couple
> of them too for this very reason).
> 
> This means expecting user space *alone* to do the right thing and tell the
> kernel what to do next with the help of all of the individual knobs spread
> all over the place is not entirely realistic in my view.
> 
> Yes, I think there should be ways for user space to indicate what its
> current preference (or policy if you will) is, but those should be
> relatively simple and starightforward to use.
> 
> For example, we have a per-device knob that user space can use to indicate
> whether or not runtime PM should be used for the devices, if available.
> As a result, if a user wants to enable runtime PM for all devices, she or
> he has to go through all of them and switch the knob for each one individually,
> which it would be easier to use a common big switch for that.  And that big
> switch would be more likely to be actually used just because it is big
> and makes a big difference.

I agree. You are right, it is absurd to rely on user space to make fine
grained settings and the kernel to completely rely on it for its system
wide energy management decisions. So as I mentioned above and as you
state too we should have a big switch between three levels: perf,
powersave and balanced.

> 
>>> It doesn't seem to be clear currently what level and scope of such
>> interfaces
>>> is appropriate and where to place them.  Would a global knob be
>> useful?  Or
>>> should they be per-subsystem, per-driver, per-task, per-cgroup etc?
>>
>> A global knob would be useful in the case where the user chooses
>> performance policy for example. It means he expects the kernel to
>> *never* sacrifice performance for powersave. Now assume that a set of
>> tasks is running on 4 cpus out of 10. If the user has chosen performance
>> policy, *none of the 10 cpus should enter deep idle states* lest they
>> affect the latency of the tasks. Here a global knob would do well.
>>
>> For less aggressive policies like balanced policy, a per-task policy
>> would do very well. Assume the above same scenario, we would want to
>> disable deep idle states only for those 4 cpus that we are running on
>> and allow the remaining 6 to enter deep idle states. Of course this
>> would mean that if the task gets scheduled on one of those 6, it would
>> take a latency hit, but only initially. The per-task knob would then
>> prevent that cpu from entering deep idle states henceforth. Or we could
>> use cgroups to prevent even such a thing from happening and make it a
>> per-cgroup knob if even the initial latency hit cannot be tolerated.
> 
> I'm still seeing a problem with mixing tasks with different "energy"
> settings.  If there are "performance" and "energy friendly" tasks to
> run at the same time, it is not particularly clear how to the load
> balancer should handle them, for one example.

By per-task what I meant was the cpu that the task is running on should
adjust the choice of idle states and frequency to run on depending on
the latency requirement of the task. The load balancer will not behave
any differently.
   When a latency intolerant task is run on a cpu, whichever it might
be, the cpuidle governor will disable deep idle states on that cpu or
run the cpu at turbo mode say. Of course there is more to this than such
a simplistic view. But just to convey an idea I had in mind about the
advantage of per-task policy.
> 
> What you're suggesting seems to be to start with the "levels" that are
> defined currently, by cpufreq governors for one example, and then to add
> more over time as needed.  Is that correct?

Thats right.

Regards
Preeti U Murthy
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-06 12:54 [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces Rafael J. Wysocki
                   ` (4 preceding siblings ...)
  2014-05-07 21:03 ` Paul Gortmaker
@ 2014-05-12 11:53 ` Amit Kucheria
  2014-05-12 12:31   ` Morten Rasmussen
  2014-05-12 20:58   ` Mark Brown
  5 siblings, 2 replies; 37+ messages in thread
From: Amit Kucheria @ 2014-05-12 11:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Tue, May 6, 2014 at 6:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> Hi All,
>
> During a recent discussion on linux-pm/LKML regarding the integration of the
> scheduler with cpuidle (http://marc.info/?t=139834240600003&r=1&w=4) it became
> apparent that the kernel might benefit from adding interfaces to let it know
> how far it should go with saving energy, possibly at the expense of performance.

Thanks for bringing this up Rafael.

It is clear that the energy-efficiency objective is a multi-level
problem that depends on the HW architecture and application running on
it. There is no single policy that is always correct even on a single
HW platform - we'll be able to come up with use-cases that'll break
our carefully crafted policies. So we need the kernel to provide
mechanisms to select specific optimisations for our platform and then
ways to bypass them at runtime in some use-cases.

> First of all, it would be good to have a place where subsystems and device
> drivers can go and check what the current "energy conservation bias" is in
> case they need to make a decision between delivering more performance and
> using less energy.  Second, it would be good to provide user space with

Drivers are always designed to go as fast as possible until there is
nothing to do and runtime PM kicks in. Do we really want drivers that
slow down file copy to the USB stick because we are on battery? Or
degrade audio/video quality to save power? The only usecase I can come
up with where this makes sense is the wifi connection where the driver
should perhaps throttle bitrates if the network isn't being used
actively. But that is a driver-internal decision.

Between generic power domains, runtime PM and pm-qos, we seem to have
infrastructure in place to allow subystems and drivers to influence
system behaviour. Is anything missing here? Or is it just a matter of
having a centralised location (scheduler?)to deal with all this input
from the system?

> a means to tell the kernel whether it should care more about performance or
> energy.  Finally, it would be good to be able to adjust the overall "energy
> conservation bias" automatically in response to certain "power" events such
> as "battery is low/critical" etc.

In most cases middleware such as Android power HAL, gnome power
manager or tuned will be the user here. These arbitrators consolidate
diverse user preferences and poke a few sysfs files to get the desired
behaviour, including preventing PeterZ's backlight from dimming when
he is on battery :) While I agree about exposing the knobs to the
middleware, I don't want to depend on it to setup everything correctly
- we need sane defaults in the kernel.

> It doesn't seem to be clear currently what level and scope of such interfaces
> is appropriate and where to place them.  Would a global knob be useful?  Or
> should they be per-subsystem, per-driver, per-task, per-cgroup etc?

One other thing I'd like to touch upon is privilege - who gets to turn
these knobs? If we're thinking per-process scope, we need a default
"no policy" to deal with app marketplaces where a rogue application
could run down your battery or worse burn your fingers.

> It also is not particularly clear what representation of "energy conservation
> bias" would be most useful.  Should that be a number or a set of well-defined
> discrete levels that can be given names (like "max performance", "high
> prerformance", "balanced" etc.)?  If a number, then what units to use and
> how many different values to take into account?

I have a hard time figuring out how to map these levels to performance
/ power optimisations I care about. Say I have the following
optimisation techniques available today that I can change at runtime.

#define XX_TASK_PACKING              0x00000001  /* opposite of the
default spread policy */
#define XX_DISABLE_OVERDRIVE    0x00000002  /* disables expensive P-states */
#define XX_FORCE_DEEP_IDLE        0x00000004  /* go to deep idle
states even if activity on system dictates low-latency idling - useful
for thermal throttling aka idle injection */
#define XX_FORCE_SHALLOW_IDLE 0x00000008  /* keep cpu in low-latency
idle states for performance reasons */
#define XX_FOO_TECHNIQUE           0x00000010

This is a mix of power and performance objectives that apply on a
per-cpu and/or per-cluster level. The challenge here is the lack of
consistency - some of these conflict with each other but are not
necessary opposites of each other. Some of them are good for
performance and power. How do I categorize them into 'max
performance', 'balanced' or 'power save' ?

> The people involved in the scheduler/cpuidle discussion mentioned above were:
>  * Amit Kucheria
>  * Ingo Molnar
>  * Daniel Lezcano
>  * Morten Rasmussen
>  * Peter Zijlstra
> and me, but I think that this topic may be interesting to others too (especially
> to Len who proposed a global "enefgy conservation bias" interface a few years ago).
>
> Please let me know what you think.

Again, thanks for bringing this up. This is an important interface discussion.

Regards,
Amit

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 11:53 ` Amit Kucheria
@ 2014-05-12 12:31   ` Morten Rasmussen
  2014-05-13  5:52     ` Amit Kucheria
  2014-05-12 20:58   ` Mark Brown
  1 sibling, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2014-05-12 12:31 UTC (permalink / raw)
  To: Amit Kucheria
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Mon, May 12, 2014 at 12:53:11PM +0100, Amit Kucheria wrote:
> On Tue, May 6, 2014 at 6:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > a means to tell the kernel whether it should care more about performance or
> > energy.  Finally, it would be good to be able to adjust the overall "energy
> > conservation bias" automatically in response to certain "power" events such
> > as "battery is low/critical" etc.
> 
> In most cases middleware such as Android power HAL, gnome power
> manager or tuned will be the user here. These arbitrators consolidate
> diverse user preferences and poke a few sysfs files to get the desired
> behaviour, including preventing PeterZ's backlight from dimming when
> he is on battery :) While I agree about exposing the knobs to the
> middleware, I don't want to depend on it to setup everything correctly
> - we need sane defaults in the kernel.
> 
> > It doesn't seem to be clear currently what level and scope of such interfaces
> > is appropriate and where to place them.  Would a global knob be useful?  Or
> > should they be per-subsystem, per-driver, per-task, per-cgroup etc?
> 
> One other thing I'd like to touch upon is privilege - who gets to turn
> these knobs? If we're thinking per-process scope, we need a default
> "no policy" to deal with app marketplaces where a rogue application
> could run down your battery or worse burn your fingers.

The middleware power manager as you mention above seems to be a good
candidate. The kernel wouldn't know which tasks are trusted to behave
nicely so I think that is a user-space/middleware problem to deal with.

> 
> > It also is not particularly clear what representation of "energy conservation
> > bias" would be most useful.  Should that be a number or a set of well-defined
> > discrete levels that can be given names (like "max performance", "high
> > prerformance", "balanced" etc.)?  If a number, then what units to use and
> > how many different values to take into account?
> 
> I have a hard time figuring out how to map these levels to performance
> / power optimisations I care about. Say I have the following
> optimisation techniques available today that I can change at runtime.
> 
> #define XX_TASK_PACKING              0x00000001  /* opposite of the
> default spread policy */
> #define XX_DISABLE_OVERDRIVE    0x00000002  /* disables expensive P-states */
> #define XX_FORCE_DEEP_IDLE        0x00000004  /* go to deep idle
> states even if activity on system dictates low-latency idling - useful
> for thermal throttling aka idle injection */
> #define XX_FORCE_SHALLOW_IDLE 0x00000008  /* keep cpu in low-latency
> idle states for performance reasons */
> #define XX_FOO_TECHNIQUE           0x00000010
> 
> This is a mix of power and performance objectives that apply on a
> per-cpu and/or per-cluster level. The challenge here is the lack of
> consistency - some of these conflict with each other but are not
> necessary opposites of each other. Some of them are good for
> performance and power. How do I categorize them into 'max
> performance', 'balanced' or 'power save' ?

You can't. Since platforms are different, different techniques will have
different impacts on the performance/energy trade-off. As I have said in
the original thread, we need to distinguish between techniques to change
behaviour (like the ones you have listed above) and optimization goals.
Whether a specific technique can bring us closer to our current
optimization goal (performance/energy trade-off) depends on the
platform.

Instead of a static mapping between techniques and the power/energy knob
setting we need to give the kernel enough information about the system
topology and energy costs figure out which technique should be applied
to get closer to the goal. For example, if the kernel knows the wake-up
costs (energy) of the cpus and tracks task behaviour it should be able
to figure out whether it makes sense to apply task packing. Similarly,
if we know the energy-efficiency for the P-states, we can more try to
avoid them if they are really expensive.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-08 14:57     ` Iyer, Sundar
@ 2014-05-12 16:44       ` Preeti U Murthy
  2014-05-13 23:36         ` Rafael J. Wysocki
  0 siblings, 1 reply; 37+ messages in thread
From: Preeti U Murthy @ 2014-05-12 16:44 UTC (permalink / raw)
  To: Iyer, Sundar
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On 05/08/2014 08:27 PM, Iyer, Sundar wrote:
>> -----Original Message-----
>> From: ksummit-discuss-bounces@lists.linuxfoundation.org [mailto:ksummit-
>> discuss-bounces@lists.linuxfoundation.org] On Behalf Of Rafael J.
>> Wysocki
>> Sent: Thursday, May 8, 2014 6:28 PM
>  
>> That's something I was thinking about too, but the difficulty here is in how to
>> define the profiles (that is, what settings in each subsystem are going to be
>> affected by a profile change) and in deciding when to switch profiles and
>> which profile is the most appropriate going forward.
>>
>> IOW, the high-level concept looks nice, but the details of the implementation
>> are important too. :-)
> 
> I agree. Defining these profiles and trying to fit them into a system definition, 
> system usage policy and above all user usage policy is where the sticking point is.
> 
>>> Today cpuidle and cpufreq already expose these settings through
>>> governors.
>>
>> cpufreq governors are kind of tied to specific "energy efficiency" profiles,
>> performance, powersave, on-demand.  However, cpuidle governors are
> 
> I am not sure if that is correct. IMO Cpufreq governors function simply as per
> policies defined to meet user experience. A system may choose to sacrifice
> user experience @ the cost of running the CPU at the lowest frequency, but the
> governor has no idea if it was really energy efficient for the platform. Similarly,

The governor will never know if it was energy efficient. It will only
take decisions from the data it has at its behest. And from the data
that is exposed by the platform, if it appears that running the cpus at
lowest frequency is the best bet to meet the system policy, it will do
so. It cannot do better than this IMO and as I pointed in the previous
mail this should be good enough too. Better than not having a governor no?

> the governor might decide to run at a higher turbo frequency for better user
> responsiveness, but it still doesn't know if it was energy efficient running @ those
> frequencies. I am coming back to the point that energy efficiency is countable
> _only_ at the platform level: if it results in a longer battery life w/o needing to plug in.

Not every profile above is catering to energy savings. The very fact
that the governor decided to run the cpus at turbo frequency means that
it not looking at energy efficiency but merely at short bursts of high
performance. This will definitely drop down battery life. But if the
user chose a profile where turbo mode is enabled it means he is ok with
these side effects.

We know certain general facts about cpu frequencies. Like running in
turbo mode would mean the cpus could possibly get throttled and lead to
eventual drop in performance. These are platform level. But having an
idea about these things help us design the algorithms in the kernel.

> 
>>> Look at an example for a tuned profile for performance:
>>> start() gets called when the profile is switched to and stop() when
>>> its turned off. We could include the scheduling parameters in the
>>> profile when we come up with the set of them.
>>>
>>>  start() {
>>>      [ "$USB_AUTOSUSPEND" = 1 ] && enable_usb_autosuspend
>>>      set_disk_alpm min_power
>>>      enable_cpu_multicore_powersave
>>>      set_cpu_governor ondemand
>>>      enable_snd_ac97_powersave
>>>      set_hda_intel_powersave 10
>>>      enable_wifi_powersave
>>>      set_radeon_powersave auto
>>>      return 0
>>>  }
>>>
>>>  stop() {
>>>      [ "$USB_AUTOSUSPEND" = 1 ] && disable_usb_autosuspend
>>>      set_disk_alpm max_performance
>>>      disable_cpu_multicore_powersave
>>>      restore_cpu_governor
>>>      restore_snd_ac97_powersave
>>>      restore_hda_intel_powersave
>>>      disable_wifi_powersave
>>>      restore_radeon_powersave
>>>      return 0
>>>  }
>>
>> You seem to think that user space would operate those profiles, but the
>> experience so far is that user space is not actually good at doing things like
>> that.  We have exposed a number of PM-related knobs to user space, but in
>> may cases it actively refuses to use them (we have dropped a couple of
>> them too for this very reason).
> 
> Please correct me if I am wrong, but items like wifi_powersave are something
> default on most systems especially with per-device runtime power management.
> Most devices are runtime managed and I don't see a strong reason to switch
> device policies as energy saving methods.

Hmm I am not aware of this. This was a sample tuned profile on Fedora
that I used as an example.

> 
>>> A global knob would be useful in the case where the user chooses
>>> performance policy for example. It means he expects the kernel to
>>> *never* sacrifice performance for powersave. Now assume that a set of
>>> tasks is running on 4 cpus out of 10. If the user has chosen
>>> performance policy, *none of the 10 cpus should enter deep idle
>>> states* lest they affect the latency of the tasks. Here a global knob would
>> do well.
> 
> For this specific example, when you say the *user* has chosen the policy, do
> you mean a user space daemon that takes care of this or the application itself?

I mean the "user"; i.e. whoever is in charge of deciding the system
policies. For virtualized systems it could be the system administrator
who could decide a specific policy for a VM. For laptops it is us, users.
> 
> How are we going to know if we will really save energy by limiting deep idle states
> on all the 10 CPUs? Please help me understand this.

We will not save energy by limiting idle states.By limiting idle states
we ensure that we do not affect the latency requirement of the task
running on the cpu.

Regards
Preeti U Murthy
> 
> Cheers!
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 11:53 ` Amit Kucheria
  2014-05-12 12:31   ` Morten Rasmussen
@ 2014-05-12 20:58   ` Mark Brown
  1 sibling, 0 replies; 37+ messages in thread
From: Mark Brown @ 2014-05-12 20:58 UTC (permalink / raw)
  To: Amit Kucheria
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 1531 bytes --]

On Mon, May 12, 2014 at 05:23:11PM +0530, Amit Kucheria wrote:
> On Tue, May 6, 2014 at 6:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

> > First of all, it would be good to have a place where subsystems and device
> > drivers can go and check what the current "energy conservation bias" is in
> > case they need to make a decision between delivering more performance and
> > using less energy.  Second, it would be good to provide user space with

> Drivers are always designed to go as fast as possible until there is
> nothing to do and runtime PM kicks in. Do we really want drivers that
> slow down file copy to the USB stick because we are on battery? Or
> degrade audio/video quality to save power? The only usecase I can come
> up with where this makes sense is the wifi connection where the driver
> should perhaps throttle bitrates if the network isn't being used
> actively. But that is a driver-internal decision.

There's some tradeoffs around audio as well actually - typically there
is a lot of room for degrading performance without much impact on real
world users.  That said of course hardware manufacturers are constantly
working to eliminate the need for such tradeoffs so the longer we leave
this stuff the less relevant it becomes.  I'd guess this is fairly
common for analogue circuits, a similar thing used to be the case with
PMICs though for modern devices the need for explict tuning has been
mostly eliminated and the hardware can do it autonomously (at least for
the bits that burn most power).

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 12:31   ` Morten Rasmussen
@ 2014-05-13  5:52     ` Amit Kucheria
  2014-05-13  9:59       ` Morten Rasmussen
  0 siblings, 1 reply; 37+ messages in thread
From: Amit Kucheria @ 2014-05-13  5:52 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Mon, May 12, 2014 at 6:01 PM, Morten Rasmussen
<morten.rasmussen@arm.com> wrote:
> On Mon, May 12, 2014 at 12:53:11PM +0100, Amit Kucheria wrote:
>> On Tue, May 6, 2014 at 6:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

<snip>

> The middleware power manager as you mention above seems to be a good
> candidate. The kernel wouldn't know which tasks are trusted to behave
> nicely so I think that is a user-space/middleware problem to deal with.
>
>>
>> > It also is not particularly clear what representation of "energy conservation
>> > bias" would be most useful.  Should that be a number or a set of well-defined
>> > discrete levels that can be given names (like "max performance", "high
>> > prerformance", "balanced" etc.)?  If a number, then what units to use and
>> > how many different values to take into account?
>>
>> I have a hard time figuring out how to map these levels to performance
>> / power optimisations I care about. Say I have the following
>> optimisation techniques available today that I can change at runtime.
>>
>> #define XX_TASK_PACKING              0x00000001  /* opposite of the
>> default spread policy */
>> #define XX_DISABLE_OVERDRIVE    0x00000002  /* disables expensive P-states */
>> #define XX_FORCE_DEEP_IDLE        0x00000004  /* go to deep idle
>> states even if activity on system dictates low-latency idling - useful
>> for thermal throttling aka idle injection */
>> #define XX_FORCE_SHALLOW_IDLE 0x00000008  /* keep cpu in low-latency
>> idle states for performance reasons */
>> #define XX_FOO_TECHNIQUE           0x00000010
>>
>> This is a mix of power and performance objectives that apply on a
>> per-cpu and/or per-cluster level. The challenge here is the lack of
>> consistency - some of these conflict with each other but are not
>> necessary opposites of each other. Some of them are good for
>> performance and power. How do I categorize them into 'max
>> performance', 'balanced' or 'power save' ?
>
> You can't. Since platforms are different, different techniques will have
> different impacts on the performance/energy trade-off. As I have said in
> the original thread, we need to distinguish between techniques to change
> behaviour (like the ones you have listed above) and optimization goals.
> Whether a specific technique can bring us closer to our current
> optimization goal (performance/energy trade-off) depends on the
> platform.

Right. So we are saying that state names like "powersave",
"balanced/auto", "performance" will be platform-defined. Is it worth
defining them at all then?

I expect that these techniques can be counted on our fingers, so why
not just expose them directly to the system? The middleware and even
other kernel subsystems can directly toggle their state based on
current conditions.

I'm assuming here that cpufreq and cpuidle mechanism will get merged
into the scheduler core at some point in the near future.

Regards,
Amit

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-13  5:52     ` Amit Kucheria
@ 2014-05-13  9:59       ` Morten Rasmussen
  2014-05-13 23:55         ` Rafael J. Wysocki
  0 siblings, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2014-05-13  9:59 UTC (permalink / raw)
  To: Amit Kucheria
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Tue, May 13, 2014 at 06:52:01AM +0100, Amit Kucheria wrote:
> On Mon, May 12, 2014 at 6:01 PM, Morten Rasmussen
> <morten.rasmussen@arm.com> wrote:
> > On Mon, May 12, 2014 at 12:53:11PM +0100, Amit Kucheria wrote:
> >> On Tue, May 6, 2014 at 6:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> 
> <snip>
> 
> > The middleware power manager as you mention above seems to be a good
> > candidate. The kernel wouldn't know which tasks are trusted to behave
> > nicely so I think that is a user-space/middleware problem to deal with.
> >
> >>
> >> > It also is not particularly clear what representation of "energy conservation
> >> > bias" would be most useful.  Should that be a number or a set of well-defined
> >> > discrete levels that can be given names (like "max performance", "high
> >> > prerformance", "balanced" etc.)?  If a number, then what units to use and
> >> > how many different values to take into account?
> >>
> >> I have a hard time figuring out how to map these levels to performance
> >> / power optimisations I care about. Say I have the following
> >> optimisation techniques available today that I can change at runtime.
> >>
> >> #define XX_TASK_PACKING              0x00000001  /* opposite of the
> >> default spread policy */
> >> #define XX_DISABLE_OVERDRIVE    0x00000002  /* disables expensive P-states */
> >> #define XX_FORCE_DEEP_IDLE        0x00000004  /* go to deep idle
> >> states even if activity on system dictates low-latency idling - useful
> >> for thermal throttling aka idle injection */
> >> #define XX_FORCE_SHALLOW_IDLE 0x00000008  /* keep cpu in low-latency
> >> idle states for performance reasons */
> >> #define XX_FOO_TECHNIQUE           0x00000010
> >>
> >> This is a mix of power and performance objectives that apply on a
> >> per-cpu and/or per-cluster level. The challenge here is the lack of
> >> consistency - some of these conflict with each other but are not
> >> necessary opposites of each other. Some of them are good for
> >> performance and power. How do I categorize them into 'max
> >> performance', 'balanced' or 'power save' ?
> >
> > You can't. Since platforms are different, different techniques will have
> > different impacts on the performance/energy trade-off. As I have said in
> > the original thread, we need to distinguish between techniques to change
> > behaviour (like the ones you have listed above) and optimization goals.
> > Whether a specific technique can bring us closer to our current
> > optimization goal (performance/energy trade-off) depends on the
> > platform.
> 
> Right. So we are saying that state names like "powersave",
> "balanced/auto", "performance" will be platform-defined. Is it worth
> defining them at all then?

No. I see "powersave", "auto", and "balance/auto" as objectives.
Objectives are platform independent. If we want stay within a certain
energy budget, for example consume less than X joules for playing one
hour of music, we set the performance/energy knob accordingly. It is
then up to the kernel to apply the right techniques to achieve the
objective on the particular platform. That will of course mean that the
kernel needs to be better informed about the platform energy
characteristics than it is today.

It is really a question of where we want to put all the details about
the platform. In user-space and let some daemon control a long list of
kernel parameters, or in the kernel and have a simple objective
user-space interface where the performance/energy trade-off can be
tuned (like the energy cost target as Rafael proposed).

> I expect that these techniques can be counted on our fingers, so why
> not just expose them directly to the system? The middleware and even
> other kernel subsystems can directly toggle their state based on
> current conditions.

It might make sense for controlling subsystems like (GPU, wifi,
modem,...) as middleware should often have a very good idea about which
subsystems that are in use. However, the techniques for scheduling might
not be either on or off. For example, task packing might make sense to
a certain degree under certain circumstances that cannot been seen from
user-space. The scheduler has access to detailed task information like
load averages and wake-up counts, which might necessary to determine
when to apply a specific technique. Forcing deep idle to reduce power
could potentially have the exact opposite effect if applied in scenarios
with frequent wake-ups as you risk burning more energy during the
transitions than you save while in deep idle.

IMHO, exposing the techniques to user-space implies exporting most of
the energy-awarness problem to user-space. My concern is that the
interface will become complicated and the user-space daemon needs to be
strictly in sync with the kernel. A simple(r) objective interface
(energy cost target for scheduling/idle/freq, subsystems not in use,
...) might be easier use and maintain.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 16:44       ` Preeti U Murthy
@ 2014-05-13 23:36         ` Rafael J. Wysocki
  2014-05-15 10:37           ` Preeti U Murthy
  0 siblings, 1 reply; 37+ messages in thread
From: Rafael J. Wysocki @ 2014-05-13 23:36 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Monday, May 12, 2014 10:14:24 PM Preeti U Murthy wrote:
> On 05/08/2014 08:27 PM, Iyer, Sundar wrote:
> >> -----Original Message-----
> >> From: ksummit-discuss-bounces@lists.linuxfoundation.org [mailto:ksummit-
> >> discuss-bounces@lists.linuxfoundation.org] On Behalf Of Rafael J.
> >> Wysocki
> >> Sent: Thursday, May 8, 2014 6:28 PM
> >  
> >> That's something I was thinking about too, but the difficulty here is in how to
> >> define the profiles (that is, what settings in each subsystem are going to be
> >> affected by a profile change) and in deciding when to switch profiles and
> >> which profile is the most appropriate going forward.
> >>
> >> IOW, the high-level concept looks nice, but the details of the implementation
> >> are important too. :-)
> > 
> > I agree. Defining these profiles and trying to fit them into a system definition, 
> > system usage policy and above all user usage policy is where the sticking point is.
> > 
> >>> Today cpuidle and cpufreq already expose these settings through
> >>> governors.
> >>
> >> cpufreq governors are kind of tied to specific "energy efficiency" profiles,
> >> performance, powersave, on-demand.  However, cpuidle governors are
> > 
> > I am not sure if that is correct. IMO Cpufreq governors function simply as per
> > policies defined to meet user experience. A system may choose to sacrifice
> > user experience @ the cost of running the CPU at the lowest frequency, but the
> > governor has no idea if it was really energy efficient for the platform. Similarly,
> 
> The governor will never know if it was energy efficient. It will only
> take decisions from the data it has at its behest. And from the data
> that is exposed by the platform, if it appears that running the cpus at
> lowest frequency is the best bet to meet the system policy, it will do
> so. It cannot do better than this IMO and as I pointed in the previous
> mail this should be good enough too. Better than not having a governor no?
> 
> > the governor might decide to run at a higher turbo frequency for better user
> > responsiveness, but it still doesn't know if it was energy efficient running @ those
> > frequencies. I am coming back to the point that energy efficiency is countable
> > _only_ at the platform level: if it results in a longer battery life w/o needing to plug in.
> 
> Not every profile above is catering to energy savings. The very fact
> that the governor decided to run the cpus at turbo frequency means that
> it not looking at energy efficiency but merely at short bursts of high
> performance. This will definitely drop down battery life. But if the
> user chose a profile where turbo mode is enabled it means he is ok with
> these side effects.

You seem to be assuming a bit about the user.  Who may be a kid playing a game
on his phone or tablet and having no idea what "turbo" is whatsoever. :-)

> We know certain general facts about cpu frequencies. Like running in
> turbo mode would mean the cpus could possibly get throttled and lead to
> eventual drop in performance. These are platform level. But having an
> idea about these things help us design the algorithms in the kernel.

That I can agree with, but I wouldn't count on user input too much.

Of course, I'm for giving users who know what they are doing as much power
as reasonably possible, but on the other hand they are not really likely
to use that power, on the average at least.

[cut]

> > 
> > For this specific example, when you say the *user* has chosen the policy, do
> > you mean a user space daemon that takes care of this or the application itself?
> 
> I mean the "user"; i.e. whoever is in charge of deciding the system
> policies. For virtualized systems it could be the system administrator
> who could decide a specific policy for a VM. For laptops it is us, users.

What about phones, tablets, Android-based TV sets, Tizen-based watches??

> > How are we going to know if we will really save energy by limiting deep idle states
> > on all the 10 CPUs? Please help me understand this.
> 
> We will not save energy by limiting idle states.By limiting idle states
> we ensure that we do not affect the latency requirement of the task
> running on the cpu.

OK, so now, to be a little more specific, how is the task supposed to specify
that latency requirement?


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-13  9:59       ` Morten Rasmussen
@ 2014-05-13 23:55         ` Rafael J. Wysocki
  2014-05-14 20:21           ` Daniel Vetter
  0 siblings, 1 reply; 37+ messages in thread
From: Rafael J. Wysocki @ 2014-05-13 23:55 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Tuesday, May 13, 2014 10:59:29 AM Morten Rasmussen wrote:
> On Tue, May 13, 2014 at 06:52:01AM +0100, Amit Kucheria wrote:
> > On Mon, May 12, 2014 at 6:01 PM, Morten Rasmussen
> > <morten.rasmussen@arm.com> wrote:
> > > On Mon, May 12, 2014 at 12:53:11PM +0100, Amit Kucheria wrote:
> > >> On Tue, May 6, 2014 at 6:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > 
> > <snip>
> > 
> > > The middleware power manager as you mention above seems to be a good
> > > candidate. The kernel wouldn't know which tasks are trusted to behave
> > > nicely so I think that is a user-space/middleware problem to deal with.
> > >
> > >>
> > >> > It also is not particularly clear what representation of "energy conservation
> > >> > bias" would be most useful.  Should that be a number or a set of well-defined
> > >> > discrete levels that can be given names (like "max performance", "high
> > >> > prerformance", "balanced" etc.)?  If a number, then what units to use and
> > >> > how many different values to take into account?
> > >>
> > >> I have a hard time figuring out how to map these levels to performance
> > >> / power optimisations I care about. Say I have the following
> > >> optimisation techniques available today that I can change at runtime.
> > >>
> > >> #define XX_TASK_PACKING              0x00000001  /* opposite of the
> > >> default spread policy */
> > >> #define XX_DISABLE_OVERDRIVE    0x00000002  /* disables expensive P-states */
> > >> #define XX_FORCE_DEEP_IDLE        0x00000004  /* go to deep idle
> > >> states even if activity on system dictates low-latency idling - useful
> > >> for thermal throttling aka idle injection */
> > >> #define XX_FORCE_SHALLOW_IDLE 0x00000008  /* keep cpu in low-latency
> > >> idle states for performance reasons */
> > >> #define XX_FOO_TECHNIQUE           0x00000010
> > >>
> > >> This is a mix of power and performance objectives that apply on a
> > >> per-cpu and/or per-cluster level. The challenge here is the lack of
> > >> consistency - some of these conflict with each other but are not
> > >> necessary opposites of each other. Some of them are good for
> > >> performance and power. How do I categorize them into 'max
> > >> performance', 'balanced' or 'power save' ?
> > >
> > > You can't. Since platforms are different, different techniques will have
> > > different impacts on the performance/energy trade-off. As I have said in
> > > the original thread, we need to distinguish between techniques to change
> > > behaviour (like the ones you have listed above) and optimization goals.
> > > Whether a specific technique can bring us closer to our current
> > > optimization goal (performance/energy trade-off) depends on the
> > > platform.
> > 
> > Right. So we are saying that state names like "powersave",
> > "balanced/auto", "performance" will be platform-defined. Is it worth
> > defining them at all then?
> 
> No. I see "powersave", "auto", and "balance/auto" as objectives.
> Objectives are platform independent. If we want stay within a certain
> energy budget, for example consume less than X joules for playing one
> hour of music, we set the performance/energy knob accordingly. It is
> then up to the kernel to apply the right techniques to achieve the
> objective on the particular platform. That will of course mean that the
> kernel needs to be better informed about the platform energy
> characteristics than it is today.
> 
> It is really a question of where we want to put all the details about
> the platform. In user-space and let some daemon control a long list of
> kernel parameters, or in the kernel and have a simple objective
> user-space interface where the performance/energy trade-off can be
> tuned (like the energy cost target as Rafael proposed).
> 
> > I expect that these techniques can be counted on our fingers, so why
> > not just expose them directly to the system? The middleware and even
> > other kernel subsystems can directly toggle their state based on
> > current conditions.
> 
> It might make sense for controlling subsystems like (GPU, wifi,
> modem,...) as middleware should often have a very good idea about which
> subsystems that are in use. However, the techniques for scheduling might
> not be either on or off. For example, task packing might make sense to
> a certain degree under certain circumstances that cannot been seen from
> user-space. The scheduler has access to detailed task information like
> load averages and wake-up counts, which might necessary to determine
> when to apply a specific technique. Forcing deep idle to reduce power
> could potentially have the exact opposite effect if applied in scenarios
> with frequent wake-ups as you risk burning more energy during the
> transitions than you save while in deep idle.
> 
> IMHO, exposing the techniques to user-space implies exporting most of
> the energy-awarness problem to user-space. My concern is that the
> interface will become complicated and the user-space daemon needs to be
> strictly in sync with the kernel. A simple(r) objective interface
> (energy cost target for scheduling/idle/freq, subsystems not in use,
> ...) might be easier use and maintain.

Precisely.

In addition to that, the CPU scheduler is not the only subsystem where
things may not be either on or off.  Take the graphics for example (at
least in the case of more complicated adapters).  There certainly are
energy efficiency vs performance tradeoffs there.

Thanks!

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-13 23:55         ` Rafael J. Wysocki
@ 2014-05-14 20:21           ` Daniel Vetter
  0 siblings, 0 replies; 37+ messages in thread
From: Daniel Vetter @ 2014-05-14 20:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Len Brown, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Wed, May 14, 2014 at 1:55 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> In addition to that, the CPU scheduler is not the only subsystem where
> things may not be either on or off.  Take the graphics for example (at
> least in the case of more complicated adapters).  There certainly are
> energy efficiency vs performance tradeoffs there.

There's actually really hilarious interaction issues going on. Up
until recently we've had bugs with the gpu turbo where you had to keep
the cpu unecessarily busy to get the highest gpu frequencies. Which
meant that optimizations in the userspace driver to be more cpu
efficient actually reduced performance. Essentially the gpu wasn't
ramping up because the cpu was slow in delivering new work and the cpu
wasn't ramping up becasue the gpu was slow in processing it and so
resulted in stalls. Keeping one of them artificially busy helped ;-)

Now we do ramp the gpu freq manually in the driver and help out the
cpu side by using io_schedule waits instead of normal ones.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-13 23:36         ` Rafael J. Wysocki
@ 2014-05-15 10:37           ` Preeti U Murthy
  0 siblings, 0 replies; 37+ messages in thread
From: Preeti U Murthy @ 2014-05-15 10:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On 05/14/2014 05:06 AM, Rafael J. Wysocki wrote:
> On Monday, May 12, 2014 10:14:24 PM Preeti U Murthy wrote:
>> On 05/08/2014 08:27 PM, Iyer, Sundar wrote:
>>>> -----Original Message-----
>>>> From: ksummit-discuss-bounces@lists.linuxfoundation.org [mailto:ksummit-
>>>> discuss-bounces@lists.linuxfoundation.org] On Behalf Of Rafael J.
>>>> Wysocki
>>>> Sent: Thursday, May 8, 2014 6:28 PM
>>>  
>>>> That's something I was thinking about too, but the difficulty here is in how to
>>>> define the profiles (that is, what settings in each subsystem are going to be
>>>> affected by a profile change) and in deciding when to switch profiles and
>>>> which profile is the most appropriate going forward.
>>>>
>>>> IOW, the high-level concept looks nice, but the details of the implementation
>>>> are important too. :-)
>>>
>>> I agree. Defining these profiles and trying to fit them into a system definition, 
>>> system usage policy and above all user usage policy is where the sticking point is.
>>>
>>>>> Today cpuidle and cpufreq already expose these settings through
>>>>> governors.
>>>>
>>>> cpufreq governors are kind of tied to specific "energy efficiency" profiles,
>>>> performance, powersave, on-demand.  However, cpuidle governors are
>>>
>>> I am not sure if that is correct. IMO Cpufreq governors function simply as per
>>> policies defined to meet user experience. A system may choose to sacrifice
>>> user experience @ the cost of running the CPU at the lowest frequency, but the
>>> governor has no idea if it was really energy efficient for the platform. Similarly,
>>
>> The governor will never know if it was energy efficient. It will only
>> take decisions from the data it has at its behest. And from the data
>> that is exposed by the platform, if it appears that running the cpus at
>> lowest frequency is the best bet to meet the system policy, it will do
>> so. It cannot do better than this IMO and as I pointed in the previous
>> mail this should be good enough too. Better than not having a governor no?
>>
>>> the governor might decide to run at a higher turbo frequency for better user
>>> responsiveness, but it still doesn't know if it was energy efficient running @ those
>>> frequencies. I am coming back to the point that energy efficiency is countable
>>> _only_ at the platform level: if it results in a longer battery life w/o needing to plug in.
>>
>> Not every profile above is catering to energy savings. The very fact
>> that the governor decided to run the cpus at turbo frequency means that
>> it not looking at energy efficiency but merely at short bursts of high
>> performance. This will definitely drop down battery life. But if the
>> user chose a profile where turbo mode is enabled it means he is ok with
>> these side effects.
> 
> You seem to be assuming a bit about the user.  Who may be a kid playing a game
> on his phone or tablet and having no idea what "turbo" is whatsoever. :-)
> 
>> We know certain general facts about cpu frequencies. Like running in
>> turbo mode would mean the cpus could possibly get throttled and lead to
>> eventual drop in performance. These are platform level. But having an
>> idea about these things help us design the algorithms in the kernel.
> 
> That I can agree with, but I wouldn't count on user input too much.
> 
> Of course, I'm for giving users who know what they are doing as much power
> as reasonably possible, but on the other hand they are not really likely
> to use that power, on the average at least.

Hmm..but then who helps the kernel in decisions like "should i spread
the load vs pack the load"? In an earlier attempt towards power aware
scheduler, a user policy would help it decide this.
  This time in addition to user policy or the user policy apart,
sufficient platform level details about the energy impact of spreading
vs packing should help the kernel decide. Is this what you are
suggesting with regard to the energy aware decisions by the kernel ?

> 
> [cut]
> 
>>>
>>> For this specific example, when you say the *user* has chosen the policy, do
>>> you mean a user space daemon that takes care of this or the application itself?
>>
>> I mean the "user"; i.e. whoever is in charge of deciding the system
>> policies. For virtualized systems it could be the system administrator
>> who could decide a specific policy for a VM. For laptops it is us, users.
> 
> What about phones, tablets, Android-based TV sets, Tizen-based watches??

In these cases I guess the kernel has to monitor the system by itself
and decide appropriately?
> 
>>> How are we going to know if we will really save energy by limiting deep idle states
>>> on all the 10 CPUs? Please help me understand this.
>>
>> We will not save energy by limiting idle states.By limiting idle states
>> we ensure that we do not affect the latency requirement of the task
>> running on the cpu.
> 
> OK, so now, to be a little more specific, how is the task supposed to specify
> that latency requirement?

Coming to think of it, a per task latency requirement is not making
sense for the following reason.

As long as a task is running on a cpu, the cpu will not enter any idle
state. The issue of latency comes up when the task sleeps. If the task
is guaranteed to wake up on the same cpu as that it slept on, we could
use the task latency requirement to decide the idle states that the cpu
could enter into.
   However in the scheduler today the task can wake up on *any cpu*. We
can't possibly disable the deep idle states on every cpu for this
reason. So looks like this needs more thought.

Regards
Preeti U Murthy

> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-13 23:41       ` Rafael J. Wysocki
@ 2014-05-14  9:15         ` Daniel Lezcano
  0 siblings, 0 replies; 37+ messages in thread
From: Daniel Lezcano @ 2014-05-14  9:15 UTC (permalink / raw)
  To: Rafael J. Wysocki, Preeti U Murthy
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Ingo Molnar

On 05/14/2014 01:41 AM, Rafael J. Wysocki wrote:
> On Monday, May 12, 2014 10:43:22 PM Preeti U Murthy wrote:
>> On 05/12/2014 04:44 PM, Morten Rasmussen wrote:
>>> On Thu, May 08, 2014 at 09:59:39AM +0100, Preeti U Murthy wrote:
>>>> On 05/07/2014 10:50 AM, Iyer, Sundar wrote:
>
> [cut]
>
>>> While I agree that there are mechanisms to deal with thermal throttling
>>> already, I think it is somewhat related to energy-awareness. If you need
>>> throttling due to thermal constraints you are burning too much power in
>>> your system. If you factor in energy-effiency and the requirements for
>>> the current use-case you might be able to stay within the power budget
>>> with a smaller performance impact than blindly throttling all
>>> subsystems.
>>
>> True. I was intending to distinguish the who and the why in the above
>> two situations. My only point was that thermal throttling is undertaken
>> by platform and is a safety mechanism whereas switching to energy saving
>> mode when battery is low is undertaken by the kernel and will lead to
>> better end-user experience i.e.battery longevity. Yes the kernel is
>> expected to prevent the system from being throttled as much as possible.
>
> The kernel may very well be responsible for thermal throttling in some cases.
> At least it needs to be able to respond to "do not draw more power than this"
> type of requests.
>
> [cut]
>
>>>
>>> IIUC, you are proposing to have profiles setting a lot of kernel
>>> tunables rather than a single knob to control energy-awareness?
>>>
>>> My concern with profiles is that it basically exports most of the
>>> energy-awareness decision problems to user-space. Maybe I'm missing
>>> something? IMHO, it would be better to have more accurate energy related
>>> topology information in the kernel so it would be able to take the
>>> decisions.
>>
>> You are right. We shouldn't be exposing so many knobs to user-space and
>> expect the kernel to make good decisions based on these knobs being
>> tweaked by user space. How about a high level classification of profiles
>> like balanced, performance, powersave? These alone can be chosen by the
>> user and the lower end tunings left to the discretion of the kernel.
>
> Well, so we're actually back to a central knob with three levels effectively. :-)

May be we can consider adding the large number of knobs with debugfs ?


-- 
  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 10:55       ` Iyer, Sundar
@ 2014-05-13 23:48         ` Rafael J. Wysocki
  0 siblings, 0 replies; 37+ messages in thread
From: Rafael J. Wysocki @ 2014-05-13 23:48 UTC (permalink / raw)
  To: Iyer, Sundar
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Monday, May 12, 2014 10:55:23 AM Iyer, Sundar wrote:
> > -----Original Message-----
> > From: Morten Rasmussen [mailto:morten.rasmussen@arm.com]
> > Sent: Monday, May 12, 2014 4:02 PM
> >
> > > And which is why I mentioned that this is heavily platform dependent.
> > > This is completely dependent on how the rest of the system power
> > management works.
> > 
> > Agree. Race to halt/idle is not universally a good idea. It depends of the
> > platform energy efficiency at the higher performance states, idle power
> > consumption, system topology, use-case, and anything else that consumes
> > power while the tasks are running. For example, if your energy efficiency is
> > really bad in the turbo states, it might be worth going a bit slower if the total
> > energy can be reduced.
> 
> Apart from the specifics of the CPU/topology, race to halt doesn't contribute significant
> to workloads which are offloaded/accelerated: e.g. video, media workloads.
> 
> That said, I think the energy conservation boils down to (not limited to):
> 
> a. should we schedule wide (multiple CPUs) vs local (fewer CPUs);
> b. should we burst (higher P-states) vs run slow (lower P-states); 
> c. is the control resource (power, clock etc.) shared wide or local to the unit;
> d. Is the "local good" aka sub-system conservation resulting in "global good" aka
> platform conservation?
> e. what is the extent of options we want to load the user with: is the user going to toggle
> some 200 switches to get the best experience or the user space/kernel will abstract a bulk
> of these and provide more intelligent actions/decisions?
> 
> And I think the following should be the general outline of any efforts:
> 
> a. if the savings result in violation of any user defined quality-of-service for the experience (
> finite FPS, finite computational requirements like encode/decode compute requirement etc.)
> b. if we can conserve energy at the "platform" level vs "sub-system" level;
> c. If we do save @ the "sub-system" level, how much of this is dependent on the specific
> system architecture/topology/ vs "generic"; or in other words, how much of hit will a different
> architecture suffer (?)

All of these things are worth considering, I agree.

That said, my original question was about what kind of interfaces related to
energy conservation bias were needed.

Can you derive any suggestions from the above?


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 17:13     ` Preeti U Murthy
  2014-05-12 17:30       ` Iyer, Sundar
  2014-05-13  6:28       ` Amit Kucheria
@ 2014-05-13 23:41       ` Rafael J. Wysocki
  2014-05-14  9:15         ` Daniel Lezcano
  2 siblings, 1 reply; 37+ messages in thread
From: Rafael J. Wysocki @ 2014-05-13 23:41 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Monday, May 12, 2014 10:43:22 PM Preeti U Murthy wrote:
> On 05/12/2014 04:44 PM, Morten Rasmussen wrote:
> > On Thu, May 08, 2014 at 09:59:39AM +0100, Preeti U Murthy wrote:
> >> On 05/07/2014 10:50 AM, Iyer, Sundar wrote:

[cut]

> > While I agree that there are mechanisms to deal with thermal throttling
> > already, I think it is somewhat related to energy-awareness. If you need
> > throttling due to thermal constraints you are burning too much power in
> > your system. If you factor in energy-effiency and the requirements for
> > the current use-case you might be able to stay within the power budget
> > with a smaller performance impact than blindly throttling all
> > subsystems.
> 
> True. I was intending to distinguish the who and the why in the above
> two situations. My only point was that thermal throttling is undertaken
> by platform and is a safety mechanism whereas switching to energy saving
> mode when battery is low is undertaken by the kernel and will lead to
> better end-user experience i.e.battery longevity. Yes the kernel is
> expected to prevent the system from being throttled as much as possible.

The kernel may very well be responsible for thermal throttling in some cases.
At least it needs to be able to respond to "do not draw more power than this"
type of requests.

[cut]

> > 
> > IIUC, you are proposing to have profiles setting a lot of kernel
> > tunables rather than a single knob to control energy-awareness?
> > 
> > My concern with profiles is that it basically exports most of the
> > energy-awareness decision problems to user-space. Maybe I'm missing
> > something? IMHO, it would be better to have more accurate energy related
> > topology information in the kernel so it would be able to take the
> > decisions.
> 
> You are right. We shouldn't be exposing so many knobs to user-space and
> expect the kernel to make good decisions based on these knobs being
> tweaked by user space. How about a high level classification of profiles
> like balanced, performance, powersave? These alone can be chosen by the
> user and the lower end tunings left to the discretion of the kernel.

Well, so we're actually back to a central knob with three levels effectively. :-)

Thanks!


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 16:06     ` Preeti U Murthy
@ 2014-05-13 23:29       ` Rafael J. Wysocki
  0 siblings, 0 replies; 37+ messages in thread
From: Rafael J. Wysocki @ 2014-05-13 23:29 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Monday, May 12, 2014 09:36:15 PM Preeti U Murthy wrote:
> On 05/08/2014 07:53 PM, Iyer, Sundar wrote:
> >> -----Original Message-----
> >> From: Preeti U Murthy [mailto:preeti@linux.vnet.ibm.com]
> >> Sent: Thursday, May 8, 2014 2:30 PM
> >> To: Iyer, Sundar; Peter Zijlstra; Rafael J. Wysocki
> >> Cc: Brown, Len; Daniel Lezcano; Ingo Molnar; ksummit-

[cut]

> 
> The cpuidle sub-system is behaving fairly well on some of the platforms.
> A lot of CPU  power management is platform specific. But by exposing
> arch specific details like the details about the idle states that are
> present, through the cpuidle drivers, the kernel is able to make
> reasonably good predictions about the duration of idleness of a cpu and
> choose the idle state that it must enter into.
>   The point is that we have succeeded in the past in getting the high
> level power management reasonably right in the kernel although they were
> platform dependent.

The problem had fewer dimensions then (so to speak), however.

The set of available C-states might be different, but that pretty much was
all that needed to be taken into account.

Today, there are C-states per core, per package (also per module on some
platforms and so on) and there may be platform dependencies (like some
C-states are not available if certain I/O devices are not in the "right"
states).  That makes the picture a bit less clean and there are more
places where tradeoffs come into play, at least potentially.

Let alone the whole C-states vs P-states issue (Is it better to run at a
lower frequency for time X, or is it better to run at a higher frequency
for time Y, Y < X?).

Thanks!

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 17:13     ` Preeti U Murthy
  2014-05-12 17:30       ` Iyer, Sundar
@ 2014-05-13  6:28       ` Amit Kucheria
  2014-05-13 23:41       ` Rafael J. Wysocki
  2 siblings, 0 replies; 37+ messages in thread
From: Amit Kucheria @ 2014-05-13  6:28 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Mon, May 12, 2014 at 10:43 PM, Preeti U Murthy
<preeti@linux.vnet.ibm.com> wrote:
> On 05/12/2014 04:44 PM, Morten Rasmussen wrote:
>> On Thu, May 08, 2014 at 09:59:39AM +0100, Preeti U Murthy wrote:
>>> On 05/07/2014 10:50 AM, Iyer, Sundar wrote:

<snip>

>>> Thats why I suggested the concept of profiles. If the user does not like
>>> the existing system profiles, he can derive from one of them that comes
>>> closes to his requirements and amend his preferences.
>>
>> IIUC, you are proposing to have profiles setting a lot of kernel
>> tunables rather than a single knob to control energy-awareness?
>>
>> My concern with profiles is that it basically exports most of the
>> energy-awareness decision problems to user-space. Maybe I'm missing
>> something? IMHO, it would be better to have more accurate energy related
>> topology information in the kernel so it would be able to take the
>> decisions.
>
> You are right. We shouldn't be exposing so many knobs to user-space and
> expect the kernel to make good decisions based on these knobs being
> tweaked by user space. How about a high level classification of profiles
> like balanced, performance, powersave? These alone can be chosen by the
> user and the lower end tunings left to the discretion of the kernel.

Hi Preeti,

In the other sub-thread, I'm arguing against such categorisation. :)

The optimisation techniques we have at our disposal don't neatly fit
into the "balanced", "performance", "powersave" baskets. And there
aren't really that many of them that we shouldn't directly expose
them, IMHO.

/Amit

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 17:13     ` Preeti U Murthy
@ 2014-05-12 17:30       ` Iyer, Sundar
  2014-05-13  6:28       ` Amit Kucheria
  2014-05-13 23:41       ` Rafael J. Wysocki
  2 siblings, 0 replies; 37+ messages in thread
From: Iyer, Sundar @ 2014-05-12 17:30 UTC (permalink / raw)
  To: Preeti U Murthy, Morten Rasmussen
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

> -----Original Message-----
> From: Preeti U Murthy [mailto:preeti@linux.vnet.ibm.com]
> Sent: Monday, May 12, 2014 10:43 PM
> To: Morten Rasmussen
> 
> You are right. We shouldn't be exposing so many knobs to user-space and
> expect the kernel to make good decisions based on these knobs being
> tweaked by user space. How about a high level classification of profiles like
> balanced, performance, powersave? These alone can be chosen by the user
> and the lower end tunings left to the discretion of the kernel.

Also, may I suggest to try to limit the discussion preliminary to CPU only? Should
we try to see if we can get a detailed policy, conditions specifically w.r.t CPU scheduling?

I am trying to sum up thoughts in my mind; please correct/add/edit as needed.

Scheduler options:
a. Spread wide: Utilize as much as CPUs as possible for tasks
b. Limit Local: Utilize as less as CPUs as possible.

Resources impacting these options:
a. if CPUs share common resources like power;
b. if CPUs share common cache; OIOW, cost of moving data around CPUs
c. CPU Energy efficiency profiles (BIG or little or xyz)

Factors affecting these options:

Ideally, we would like to spread wide if:
a. tasks/processes are heavily multi-threaded and parallelized;
b. asynchronous tasks that are single threaded and shifting them to multiple CPUs will
avoid boosting CPU load and hence P-states;
c. there is explicit hint from the application/user space about lack of data dependency;

does the list seem good to start?

Cheers!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 11:14   ` Morten Rasmussen
@ 2014-05-12 17:13     ` Preeti U Murthy
  2014-05-12 17:30       ` Iyer, Sundar
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Preeti U Murthy @ 2014-05-12 17:13 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On 05/12/2014 04:44 PM, Morten Rasmussen wrote:
> On Thu, May 08, 2014 at 09:59:39AM +0100, Preeti U Murthy wrote:
>> On 05/07/2014 10:50 AM, Iyer, Sundar wrote:
>>>
>>>>> provide user space with a means to tell the kernel whether it should
>>>>> care more about performance or energy.  Finally, it would be good to
>>>>> be able to adjust the overall "energy conservation bias" automatically
>>>
>>> Instead of either energy or performance, would it be easier to look if
>>> it were a "just enough performance" metric? Rather than worry about a reduced
>>> performance to save energy, it would be IMO better to try to optimize the energy
>>> within the constraints of the required performance. Of course, those constraints
>>> could be changed.
>>>
>>> e.g. if the display would communicate it doesn't need to refresh more than 60fps,
>>> this could be communicated to the GPU/CPU to control the bias for these sub-systems
>>> accordingly.
>>
>> We don't really give the user a black and white choice like performance
>> and power-save alone. There is a proposal for 'auto' profile which
>> balances between the two.
>>
>> To give an example where we already expose a parameter for defining
>> performance constraint is the PM_QOS_CPU_DMA_LATENCY, where we tell the
>> cpuidle sub-system that any idle state not adhering to this latency
>> requirement should not be entered into. So we are saying you cannot
>> sacrifice latency beyond this threshold. Here we look at powersavings
>> but within a said constraint.
>>
>> The point is we will certainly look to provide the user with a mix and
>> match of powersave and performance profiles but to get started we begin
>> with powersave and performance.
> 
> I think most users would be interested in "auto" or whatever the
> performance/energy-effiency trade-off will be called. What would
> "powersave" do? Reduce power to a minium? It is energy that users of
> battery powered devices are interested in and is generally not minimized
> by minimizing power. Race to halt/idle is an excellent example of the
> opposite. A powersave setting like the powersave cpufreq governor
> wouldn't be very useful IMHO.

I agree that "powersave" is not bringing much to the table. For example
in the case of powersave cpu frequency governor. It reduces power but
results in very bad energy efficiency of tasks.
  However the reason I mentioned powersave is to emphasize the
importance of this policy in scheduler. In scheduler this policy is
looked to earn us good energy efficiency; i.e. Consolidate to fewer
power domains if powersave policy is chosen as against spreading of
tasks which would usually have been done. You could however call it
"auto" too since the patches that were written in this direction would
automatically spread the load if it got beyond a certain threshold.

> 
> I'm much more interested the performance/energy-efficiency trade-off.
> That is, how much energy are we willing to pay for a certain level of
> performance.
> 
>>
>>>
>>>>> in response to certain "power" events such as "battery is low/critical" etc.
>>>
>>> Would I be wrong if I said the thermal throttling is already an example of this?
>>> When the battery is critical/temperature is unbearable, the system cuts down
>>> the performance of sub-systems like CPU, display etc.
>>
>> Thermal throttling is an entirely different game altogether IMHO. We
>> throttle cpus to save the system from getting heated up and thus
>> damaged. That is to say if we don't do this, the system will become
>> unusable not just now, but forever.
>>
>> However if you look at the example of switching to energy save mode when
>> battery is low, this is to give the user a better user experience. IOW
>> if we didn't do that the system would die down, the user would have to
>> plug in the power supply and restart the machine. It would have wasted
>> some of his time. But no harm done,only a dissatisfied user.
>>
>> Now compare both the above scenarios: while the former should
>> necessarily be there if the platform has enabled turbo cpu frequency
>> ranges, the latter is an enhanced kernel behaviour to better end user
>> experience.
>>
>> We already have safety mechanisms like thermal throttling in the kernel
>> and the platforms today. That is not where we lack. We lack in providing
>> a better end user experience depending on his requirement for power
>> efficiency.
> 
> While I agree that there are mechanisms to deal with thermal throttling
> already, I think it is somewhat related to energy-awareness. If you need
> throttling due to thermal constraints you are burning too much power in
> your system. If you factor in energy-effiency and the requirements for
> the current use-case you might be able to stay within the power budget
> with a smaller performance impact than blindly throttling all
> subsystems.

True. I was intending to distinguish the who and the why in the above
two situations. My only point was that thermal throttling is undertaken
by platform and is a safety mechanism whereas switching to energy saving
mode when battery is low is undertaken by the kernel and will lead to
better end-user experience i.e.battery longevity. Yes the kernel is
expected to prevent the system from being throttled as much as possible.

> 
>>
>>>
>>>> per-subsystem sounds right to me; I don't care which particular instance of
>>>> graphics cards I have, I want whichever one(s) I have to obey.
>>>>
>>>> global doesn't make sense, like stated earlier I absolutely detest automagic
>>>> backlight dimming, whereas I don't particularly care about compute speed at
>>>> all.
>>>
>>> That calls for highly customized preferences for what to control: in most cases
>>> the dimmed backlight itself saves a considerable amount of energy which wouldn't
>>> be matched by a CPU (or a GPU) control. On a battery device, the first preference
>>> would be to dim out the screen but still allow the user a good battery life and 
>>> user experience.
>>
>> Thats why I suggested the concept of profiles. If the user does not like
>> the existing system profiles, he can derive from one of them that comes
>> closes to his requirements and amend his preferences.
> 
> IIUC, you are proposing to have profiles setting a lot of kernel
> tunables rather than a single knob to control energy-awareness?
> 
> My concern with profiles is that it basically exports most of the
> energy-awareness decision problems to user-space. Maybe I'm missing
> something? IMHO, it would be better to have more accurate energy related
> topology information in the kernel so it would be able to take the
> decisions.

You are right. We shouldn't be exposing so many knobs to user-space and
expect the kernel to make good decisions based on these knobs being
tweaked by user space. How about a high level classification of profiles
like balanced, performance, powersave? These alone can be chosen by the
user and the lower end tunings left to the discretion of the kernel.

Regards
Preeti U Murthy
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-08 14:23   ` Iyer, Sundar
  2014-05-12 10:31     ` Morten Rasmussen
@ 2014-05-12 16:06     ` Preeti U Murthy
  2014-05-13 23:29       ` Rafael J. Wysocki
  1 sibling, 1 reply; 37+ messages in thread
From: Preeti U Murthy @ 2014-05-12 16:06 UTC (permalink / raw)
  To: Iyer, Sundar
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On 05/08/2014 07:53 PM, Iyer, Sundar wrote:
>> -----Original Message-----
>> From: Preeti U Murthy [mailto:preeti@linux.vnet.ibm.com]
>> Sent: Thursday, May 8, 2014 2:30 PM
>> To: Iyer, Sundar; Peter Zijlstra; Rafael J. Wysocki
>> Cc: Brown, Len; Daniel Lezcano; Ingo Molnar; ksummit-
> 
>> True that 'race to halt' also ends up saving energy. But when the kernel goes
>> conservative on energy, the scheduler would look at racing to idle *within a
>> power domain* as much as possible. Only if the load crosses a certain
>> threshold would it spread across to other power domains.
> 
> I think Rafael mentioned in an another thread about shared supplies and resources.
> In such a case, the race-to-idle within a power domain may actually negate the overall
> platform savings.

Perhaps. However that does not mean that we will not save any power.

To answer your below question, I was referring to the CPU power domains
since we were talking about 'race to halt'. Scheduler spreading the
tasks across cpus leads to race to halt and possibly saving power since
the tasks finish quicker.

That told, with regard to power savings when there are shared resources,
if the scheduler consolidates tasks to one socket out of two because the
arch exposed the sockets as separate power domains, we will save power
at a processor level. However if they have shared memory controllers, it
would mean that the controller would still be powered on. That is still
fine and we cannot do much about it given the condition that there are
some tasks on the system. But we *can* save power somewhere; better than
not being aware of the power domains and randomly spreading tasks.

This patch lkml.org/lkml/2014/4/11/142 introduces the concept of CPU
power domains precisely to help the scheduler decide the placement of
tasks better for power savings.

> 
> And to confirm, you are referring to generic power domains beyond the CPU right?
> 
>> These are general heuristics. These simple heuristics must work out for most
>> platforms but may not work for all. If it works for majority of the cases then
>> I believe we can safely call it a success.
> 
> And which is why I mentioned that this is heavily platform dependent. This is 
> completely dependent on how the rest of the system power management works.

I don't think you can say *completely* dependent on the platform. If
every aspect of power management is dependent on the platform, there is
very little we can do in the kernel.

The cpuidle sub-system is behaving fairly well on some of the platforms.
A lot of CPU  power management is platform specific. But by exposing
arch specific details like the details about the idle states that are
present, through the cpuidle drivers, the kernel is able to make
reasonably good predictions about the duration of idleness of a cpu and
choose the idle state that it must enter into.
  The point is that we have succeeded in the past in getting the high
level power management reasonably right in the kernel although they were
platform dependent.

Regards
Preeti U Murthy

> 
> Cheers!
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-08  8:59 ` Preeti U Murthy
  2014-05-08 14:23   ` Iyer, Sundar
@ 2014-05-12 11:14   ` Morten Rasmussen
  2014-05-12 17:13     ` Preeti U Murthy
  1 sibling, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2014-05-12 11:14 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Thu, May 08, 2014 at 09:59:39AM +0100, Preeti U Murthy wrote:
> On 05/07/2014 10:50 AM, Iyer, Sundar wrote:
> > 
> >>> provide user space with a means to tell the kernel whether it should
> >>> care more about performance or energy.  Finally, it would be good to
> >>> be able to adjust the overall "energy conservation bias" automatically
> > 
> > Instead of either energy or performance, would it be easier to look if
> > it were a "just enough performance" metric? Rather than worry about a reduced
> > performance to save energy, it would be IMO better to try to optimize the energy
> > within the constraints of the required performance. Of course, those constraints
> > could be changed.
> > 
> > e.g. if the display would communicate it doesn't need to refresh more than 60fps,
> > this could be communicated to the GPU/CPU to control the bias for these sub-systems
> > accordingly.
> 
> We don't really give the user a black and white choice like performance
> and power-save alone. There is a proposal for 'auto' profile which
> balances between the two.
> 
> To give an example where we already expose a parameter for defining
> performance constraint is the PM_QOS_CPU_DMA_LATENCY, where we tell the
> cpuidle sub-system that any idle state not adhering to this latency
> requirement should not be entered into. So we are saying you cannot
> sacrifice latency beyond this threshold. Here we look at powersavings
> but within a said constraint.
> 
> The point is we will certainly look to provide the user with a mix and
> match of powersave and performance profiles but to get started we begin
> with powersave and performance.

I think most users would be interested in "auto" or whatever the
performance/energy-effiency trade-off will be called. What would
"powersave" do? Reduce power to a minium? It is energy that users of
battery powered devices are interested in and is generally not minimized
by minimizing power. Race to halt/idle is an excellent example of the
opposite. A powersave setting like the powersave cpufreq governor
wouldn't be very useful IMHO.

I'm much more interested the performance/energy-efficiency trade-off.
That is, how much energy are we willing to pay for a certain level of
performance.

> 
> > 
> >>> in response to certain "power" events such as "battery is low/critical" etc.
> > 
> > Would I be wrong if I said the thermal throttling is already an example of this?
> > When the battery is critical/temperature is unbearable, the system cuts down
> > the performance of sub-systems like CPU, display etc.
> 
> Thermal throttling is an entirely different game altogether IMHO. We
> throttle cpus to save the system from getting heated up and thus
> damaged. That is to say if we don't do this, the system will become
> unusable not just now, but forever.
> 
> However if you look at the example of switching to energy save mode when
> battery is low, this is to give the user a better user experience. IOW
> if we didn't do that the system would die down, the user would have to
> plug in the power supply and restart the machine. It would have wasted
> some of his time. But no harm done,only a dissatisfied user.
> 
> Now compare both the above scenarios: while the former should
> necessarily be there if the platform has enabled turbo cpu frequency
> ranges, the latter is an enhanced kernel behaviour to better end user
> experience.
> 
> We already have safety mechanisms like thermal throttling in the kernel
> and the platforms today. That is not where we lack. We lack in providing
> a better end user experience depending on his requirement for power
> efficiency.

While I agree that there are mechanisms to deal with thermal throttling
already, I think it is somewhat related to energy-awareness. If you need
throttling due to thermal constraints you are burning too much power in
your system. If you factor in energy-effiency and the requirements for
the current use-case you might be able to stay within the power budget
with a smaller performance impact than blindly throttling all
subsystems.

> 
> > 
> >> per-subsystem sounds right to me; I don't care which particular instance of
> >> graphics cards I have, I want whichever one(s) I have to obey.
> >>
> >> global doesn't make sense, like stated earlier I absolutely detest automagic
> >> backlight dimming, whereas I don't particularly care about compute speed at
> >> all.
> > 
> > That calls for highly customized preferences for what to control: in most cases
> > the dimmed backlight itself saves a considerable amount of energy which wouldn't
> > be matched by a CPU (or a GPU) control. On a battery device, the first preference
> > would be to dim out the screen but still allow the user a good battery life and 
> > user experience.
> 
> Thats why I suggested the concept of profiles. If the user does not like
> the existing system profiles, he can derive from one of them that comes
> closes to his requirements and amend his preferences.

IIUC, you are proposing to have profiles setting a lot of kernel
tunables rather than a single knob to control energy-awareness?

My concern with profiles is that it basically exports most of the
energy-awareness decision problems to user-space. Maybe I'm missing
something? IMHO, it would be better to have more accurate energy related
topology information in the kernel so it would be able to take the
decisions.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-12 10:31     ` Morten Rasmussen
@ 2014-05-12 10:55       ` Iyer, Sundar
  2014-05-13 23:48         ` Rafael J. Wysocki
  0 siblings, 1 reply; 37+ messages in thread
From: Iyer, Sundar @ 2014-05-12 10:55 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

> -----Original Message-----
> From: Morten Rasmussen [mailto:morten.rasmussen@arm.com]
> Sent: Monday, May 12, 2014 4:02 PM
>
> > And which is why I mentioned that this is heavily platform dependent.
> > This is completely dependent on how the rest of the system power
> management works.
> 
> Agree. Race to halt/idle is not universally a good idea. It depends of the
> platform energy efficiency at the higher performance states, idle power
> consumption, system topology, use-case, and anything else that consumes
> power while the tasks are running. For example, if your energy efficiency is
> really bad in the turbo states, it might be worth going a bit slower if the total
> energy can be reduced.

Apart from the specifics of the CPU/topology, race to halt doesn't contribute significant
to workloads which are offloaded/accelerated: e.g. video, media workloads.

That said, I think the energy conservation boils down to (not limited to):

a. should we schedule wide (multiple CPUs) vs local (fewer CPUs);
b. should we burst (higher P-states) vs run slow (lower P-states); 
c. is the control resource (power, clock etc.) shared wide or local to the unit;
d. Is the "local good" aka sub-system conservation resulting in "global good" aka
platform conservation?
e. what is the extent of options we want to load the user with: is the user going to toggle
some 200 switches to get the best experience or the user space/kernel will abstract a bulk
of these and provide more intelligent actions/decisions?

And I think the following should be the general outline of any efforts:

a. if the savings result in violation of any user defined quality-of-service for the experience (
finite FPS, finite computational requirements like encode/decode compute requirement etc.)
b. if we can conserve energy at the "platform" level vs "sub-system" level;
c. If we do save @ the "sub-system" level, how much of this is dependent on the specific
system architecture/topology/ vs "generic"; or in other words, how much of hit will a different
architecture suffer (?)

Cheers!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-08 14:23   ` Iyer, Sundar
@ 2014-05-12 10:31     ` Morten Rasmussen
  2014-05-12 10:55       ` Iyer, Sundar
  2014-05-12 16:06     ` Preeti U Murthy
  1 sibling, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2014-05-12 10:31 UTC (permalink / raw)
  To: Iyer, Sundar
  Cc: Brown, Len, ksummit-discuss, Peter Zijlstra, Daniel Lezcano, Ingo Molnar

On Thu, May 08, 2014 at 03:23:58PM +0100, Iyer, Sundar wrote:
> > -----Original Message-----
> > From: Preeti U Murthy [mailto:preeti@linux.vnet.ibm.com]
> > Sent: Thursday, May 8, 2014 2:30 PM
> > To: Iyer, Sundar; Peter Zijlstra; Rafael J. Wysocki
> > Cc: Brown, Len; Daniel Lezcano; Ingo Molnar; ksummit-
> 
> > True that 'race to halt' also ends up saving energy. But when the kernel goes
> > conservative on energy, the scheduler would look at racing to idle *within a
> > power domain* as much as possible. Only if the load crosses a certain
> > threshold would it spread across to other power domains.
> 
> I think Rafael mentioned in an another thread about shared supplies and resources.
> In such a case, the race-to-idle within a power domain may actually negate the overall
> platform savings.
> 
> And to confirm, you are referring to generic power domains beyond the CPU right?
> 
> > These are general heuristics. These simple heuristics must work out for most
> > platforms but may not work for all. If it works for majority of the cases then
> > I believe we can safely call it a success.
> 
> And which is why I mentioned that this is heavily platform dependent. This is 
> completely dependent on how the rest of the system power management works.

Agree. Race to halt/idle is not universally a good idea. It depends of
the platform energy efficiency at the higher performance states, idle
power consumption, system topology, use-case, and anything else that
consumes power while the tasks are running. For example, if your energy
efficiency is really bad in the turbo states, it might be worth going a
bit slower if the total energy can be reduced.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-08  8:59 ` Preeti U Murthy
@ 2014-05-08 14:23   ` Iyer, Sundar
  2014-05-12 10:31     ` Morten Rasmussen
  2014-05-12 16:06     ` Preeti U Murthy
  2014-05-12 11:14   ` Morten Rasmussen
  1 sibling, 2 replies; 37+ messages in thread
From: Iyer, Sundar @ 2014-05-08 14:23 UTC (permalink / raw)
  To: Preeti U Murthy, Peter Zijlstra, Rafael J. Wysocki
  Cc: Brown, Len, Daniel Lezcano, Ingo Molnar, ksummit-discuss

> -----Original Message-----
> From: Preeti U Murthy [mailto:preeti@linux.vnet.ibm.com]
> Sent: Thursday, May 8, 2014 2:30 PM
> To: Iyer, Sundar; Peter Zijlstra; Rafael J. Wysocki
> Cc: Brown, Len; Daniel Lezcano; Ingo Molnar; ksummit-

> True that 'race to halt' also ends up saving energy. But when the kernel goes
> conservative on energy, the scheduler would look at racing to idle *within a
> power domain* as much as possible. Only if the load crosses a certain
> threshold would it spread across to other power domains.

I think Rafael mentioned in an another thread about shared supplies and resources.
In such a case, the race-to-idle within a power domain may actually negate the overall
platform savings.

And to confirm, you are referring to generic power domains beyond the CPU right?

> These are general heuristics. These simple heuristics must work out for most
> platforms but may not work for all. If it works for majority of the cases then
> I believe we can safely call it a success.

And which is why I mentioned that this is heavily platform dependent. This is 
completely dependent on how the rest of the system power management works.

Cheers!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
  2014-05-07  5:20 Iyer, Sundar
@ 2014-05-08  8:59 ` Preeti U Murthy
  2014-05-08 14:23   ` Iyer, Sundar
  2014-05-12 11:14   ` Morten Rasmussen
  0 siblings, 2 replies; 37+ messages in thread
From: Preeti U Murthy @ 2014-05-08  8:59 UTC (permalink / raw)
  To: Iyer, Sundar, Peter Zijlstra, Rafael J. Wysocki
  Cc: Brown, Len, Daniel Lezcano, Ingo Molnar, ksummit-discuss

On 05/07/2014 10:50 AM, Iyer, Sundar wrote:
>> -----Original Message-----
>> From: ksummit-discuss-bounces@lists.linuxfoundation.org [mailto:ksummit-
>> discuss-bounces@lists.linuxfoundation.org] On Behalf Of Peter Zijlstra
> 
>>> (http://marc.info/?t=139834240600003&r=1&w=4) it became apparent that
> 
>>> First of all, it would be good to have a place where subsystems and
>>> device drivers can go and check what the current "energy conservation
>>> bias" is in case they need to make a decision between delivering more
>>> performance and using less energy.  Second, it would be good to
> 
> It might sound a stupid question, but isn't this entirely dependent on the platform?
> 
> A higher performance will translate into better energy only if the "race to halt" was
> true and the system/platform had a nice power/performance/energy curve. E.g. if the
> task got completed quicker enough (reduced t) to offset the most probably increased
> current consumption (increased i @ constant v).
> 
> Am I wrong? What would happen on a platform, where more performance means
> using more energy?

True that 'race to halt' also ends up saving energy. But when the kernel
goes conservative on energy, the scheduler would look at racing to idle
*within a power domain* as much as possible. Only if the load crosses a
certain threshold would it spread across to other power domains.

But if it is asked not to sacrifice on performance, it will more readily
spread across power domains.

These are general heuristics. These simple heuristics must work out for
most platforms but may not work for all. If it works for majority of the
cases then I believe we can safely call it a success.

> 
>>> provide user space with a means to tell the kernel whether it should
>>> care more about performance or energy.  Finally, it would be good to
>>> be able to adjust the overall "energy conservation bias" automatically
> 
> Instead of either energy or performance, would it be easier to look if
> it were a "just enough performance" metric? Rather than worry about a reduced
> performance to save energy, it would be IMO better to try to optimize the energy
> within the constraints of the required performance. Of course, those constraints
> could be changed.
> 
> e.g. if the display would communicate it doesn't need to refresh more than 60fps,
> this could be communicated to the GPU/CPU to control the bias for these sub-systems
> accordingly.

We don't really give the user a black and white choice like performance
and power-save alone. There is a proposal for 'auto' profile which
balances between the two.

To give an example where we already expose a parameter for defining
performance constraint is the PM_QOS_CPU_DMA_LATENCY, where we tell the
cpuidle sub-system that any idle state not adhering to this latency
requirement should not be entered into. So we are saying you cannot
sacrifice latency beyond this threshold. Here we look at powersavings
but within a said constraint.

The point is we will certainly look to provide the user with a mix and
match of powersave and performance profiles but to get started we begin
with powersave and performance.

> 
>>> in response to certain "power" events such as "battery is low/critical" etc.
> 
> Would I be wrong if I said the thermal throttling is already an example of this?
> When the battery is critical/temperature is unbearable, the system cuts down
> the performance of sub-systems like CPU, display etc.

Thermal throttling is an entirely different game altogether IMHO. We
throttle cpus to save the system from getting heated up and thus
damaged. That is to say if we don't do this, the system will become
unusable not just now, but forever.

However if you look at the example of switching to energy save mode when
battery is low, this is to give the user a better user experience. IOW
if we didn't do that the system would die down, the user would have to
plug in the power supply and restart the machine. It would have wasted
some of his time. But no harm done,only a dissatisfied user.

Now compare both the above scenarios: while the former should
necessarily be there if the platform has enabled turbo cpu frequency
ranges, the latter is an enhanced kernel behaviour to better end user
experience.

We already have safety mechanisms like thermal throttling in the kernel
and the platforms today. That is not where we lack. We lack in providing
a better end user experience depending on his requirement for power
efficiency.

> 
>> per-subsystem sounds right to me; I don't care which particular instance of
>> graphics cards I have, I want whichever one(s) I have to obey.
>>
>> global doesn't make sense, like stated earlier I absolutely detest automagic
>> backlight dimming, whereas I don't particularly care about compute speed at
>> all.
> 
> That calls for highly customized preferences for what to control: in most cases
> the dimmed backlight itself saves a considerable amount of energy which wouldn't
> be matched by a CPU (or a GPU) control. On a battery device, the first preference
> would be to dim out the screen but still allow the user a good battery life and 
> user experience.

Thats why I suggested the concept of profiles. If the user does not like
the existing system profiles, he can derive from one of them that comes
closes to his requirements and amend his preferences.

Regards
Preeti U Murthy
> 
> Cheers!
> _______________________________________________
> Ksummit-discuss mailing list
> Ksummit-discuss@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces
@ 2014-05-07  5:20 Iyer, Sundar
  2014-05-08  8:59 ` Preeti U Murthy
  0 siblings, 1 reply; 37+ messages in thread
From: Iyer, Sundar @ 2014-05-07  5:20 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Brown, Len, Daniel Lezcano, Ingo Molnar, ksummit-discuss

> -----Original Message-----
> From: ksummit-discuss-bounces@lists.linuxfoundation.org [mailto:ksummit-
> discuss-bounces@lists.linuxfoundation.org] On Behalf Of Peter Zijlstra

> > (http://marc.info/?t=139834240600003&r=1&w=4) it became apparent that

> > First of all, it would be good to have a place where subsystems and
> > device drivers can go and check what the current "energy conservation
> > bias" is in case they need to make a decision between delivering more
> > performance and using less energy.  Second, it would be good to

It might sound a stupid question, but isn't this entirely dependent on the platform?

A higher performance will translate into better energy only if the "race to halt" was
true and the system/platform had a nice power/performance/energy curve. E.g. if the
task got completed quicker enough (reduced t) to offset the most probably increased
current consumption (increased i @ constant v).

Am I wrong? What would happen on a platform, where more performance means
using more energy?

> > provide user space with a means to tell the kernel whether it should
> > care more about performance or energy.  Finally, it would be good to
> > be able to adjust the overall "energy conservation bias" automatically

Instead of either energy or performance, would it be easier to look if
it were a "just enough performance" metric? Rather than worry about a reduced
performance to save energy, it would be IMO better to try to optimize the energy
within the constraints of the required performance. Of course, those constraints
could be changed.

e.g. if the display would communicate it doesn't need to refresh more than 60fps,
this could be communicated to the GPU/CPU to control the bias for these sub-systems
accordingly.

> > in response to certain "power" events such as "battery is low/critical" etc.

Would I be wrong if I said the thermal throttling is already an example of this?
When the battery is critical/temperature is unbearable, the system cuts down
the performance of sub-systems like CPU, display etc.

> per-subsystem sounds right to me; I don't care which particular instance of
> graphics cards I have, I want whichever one(s) I have to obey.
> 
> global doesn't make sense, like stated earlier I absolutely detest automagic
> backlight dimming, whereas I don't particularly care about compute speed at
> all.

That calls for highly customized preferences for what to control: in most cases
the dimmed backlight itself saves a considerable amount of energy which wouldn't
be matched by a CPU (or a GPU) control. On a battery device, the first preference
would be to dim out the screen but still allow the user a good battery life and 
user experience.

Cheers!

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2014-05-15 10:42 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-06 12:54 [Ksummit-discuss] [TECH(CORE?) TOPIC] Energy conservation bias interfaces Rafael J. Wysocki
2014-05-06 13:37 ` Dave Jones
2014-05-06 13:49 ` Peter Zijlstra
2014-05-06 14:51   ` Morten Rasmussen
2014-05-06 15:39     ` Peter Zijlstra
2014-05-06 16:04       ` Morten Rasmussen
2014-05-08 12:29   ` Rafael J. Wysocki
2014-05-06 14:34 ` Morten Rasmussen
2014-05-06 17:51 ` Preeti U Murthy
2014-05-08 12:58   ` Rafael J. Wysocki
2014-05-08 14:57     ` Iyer, Sundar
2014-05-12 16:44       ` Preeti U Murthy
2014-05-13 23:36         ` Rafael J. Wysocki
2014-05-15 10:37           ` Preeti U Murthy
2014-05-10 16:59     ` Preeti U Murthy
2014-05-07 21:03 ` Paul Gortmaker
2014-05-12 11:53 ` Amit Kucheria
2014-05-12 12:31   ` Morten Rasmussen
2014-05-13  5:52     ` Amit Kucheria
2014-05-13  9:59       ` Morten Rasmussen
2014-05-13 23:55         ` Rafael J. Wysocki
2014-05-14 20:21           ` Daniel Vetter
2014-05-12 20:58   ` Mark Brown
2014-05-07  5:20 Iyer, Sundar
2014-05-08  8:59 ` Preeti U Murthy
2014-05-08 14:23   ` Iyer, Sundar
2014-05-12 10:31     ` Morten Rasmussen
2014-05-12 10:55       ` Iyer, Sundar
2014-05-13 23:48         ` Rafael J. Wysocki
2014-05-12 16:06     ` Preeti U Murthy
2014-05-13 23:29       ` Rafael J. Wysocki
2014-05-12 11:14   ` Morten Rasmussen
2014-05-12 17:13     ` Preeti U Murthy
2014-05-12 17:30       ` Iyer, Sundar
2014-05-13  6:28       ` Amit Kucheria
2014-05-13 23:41       ` Rafael J. Wysocki
2014-05-14  9:15         ` Daniel Lezcano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox