PSI vs. CPU overhead for client computing

linux-mm.kvack.org archive mirror

* PSI vs. CPU overhead for client computing
@ 2019-04-23 18:57 Luigi Semenzato
  2019-04-23 22:04 ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Luigi Semenzato @ 2019-04-23 18:57 UTC
  To: Linux Memory Management List

I and others are working on improving system behavior under memory
pressure on Chrome OS.  We use zram, which swaps to a
statically-configured compressed RAM disk.  One challenge that we have
is that the footprint of our workloads is highly variable.  With zram,
we have to set the size of the swap partition at boot time.  When the
(logical) swap partition is full, we're left with some amount of RAM
usable by file and anonymous pages (we can ignore the rest).  We don't
get to control this amount dynamically.  Thus if the workload fits
nicely in it, everything works well.  If it doesn't, then the rate of
anonymous page faults can be quite high, causing large CPU overhead
for compression/decompression (as well as for other parts of the MM).

In Chrome OS and Android, we have the luxury that we can reduce
pressure by terminating processes (tab discard in Chrome OS, app kill
in Android---which incidentally also runs in parallel with Chrome OS
on some chromebooks).  To help decide when to reduce pressure, we
would like to have a reliable and device-independent measure of MM CPU
overhead.  I have looked into PSI and have a few questions.  I am also
looking for alternative suggestions.

PSI measures the time during which some, or all, tasks are blocked on
memory allocation.  In some experiments, this doesn't seem to
correlate too well with CPU overhead (which instead correlates fairly
well with page fault rates).  Could this be because it includes
pressure from file page faults?  Is there some way of interpreting PSI
numbers so that the pressure from file pages is ignored?
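
For reference, a minimal sketch of reading the numbers in question,
assuming the /proc/pressure/memory layout of one "some" line and one
"full" line, each with avg10/avg60/avg300 percentages and a cumulative
"total" in microseconds.  The helper names and the sample values are
made up for illustration.  Note that this format carries no breakdown
by page type, so file and anonymous stalls cannot be separated from it.

```python
def parse_pressure(text):
    """Parse the contents of a PSI pressure file into nested dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        stats = {}
        for pair in fields[1:]:
            key, _, value = pair.partition("=")
            # "total" is cumulative stall time in microseconds;
            # the avg* fields are percentages over 10/60/300 seconds.
            stats[key] = int(value) if key == "total" else float(value)
        result[fields[0]] = stats
    return result


def stall_fraction(earlier, later, interval_s, kind="some"):
    """Fraction of a custom interval spent stalled, from two samples."""
    delta_us = later[kind]["total"] - earlier[kind]["total"]
    return delta_us / (interval_s * 1_000_000)


# Illustrative samples (not measured data), taken 5 seconds apart.
SAMPLE_T0 = (
    "some avg10=0.00 avg60=0.00 avg300=0.00 total=1000000\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=400000\n"
)
SAMPLE_T1 = (
    "some avg10=4.50 avg60=1.20 avg300=0.30 total=1500000\n"
    "full avg10=0.90 avg60=0.25 avg300=0.06 total=500000\n"
)
```

Differencing "total" over a window of one's own choosing sidesteps the
fixed 10/60/300 second averages, which may matter when the decision to
discard a tab has to be made quickly.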

What is the purpose of "some" and "full" in the PSI measurements?  The
chrome browser is a multi-process app and there is a lot of IPC.  When
process A is blocked on memory allocation, it cannot respond to IPC
from process B, thus effectively both processes are blocked on
allocation, but we don't see that.  Also, there are situations in
which some "uninteresting" processes keep running.  So it's not clear we
can rely on "full".  Or maybe I am misunderstanding?  "Some" may be a
better measure, but again it doesn't measure indirect blockage.

The kernel contains various cpustat measurements, including some
slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
Would adding a CPUTIME_MEM be out of the question?
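
To make the comparison concrete, here is a rough sketch of how the
existing cpustat buckets look from userspace, assuming the usual field
order of the aggregate "cpu" line in /proc/stat.  CPUTIME_MEM does not
exist; the closest existing bucket is "system", which lumps all kernel
work together and so is at best a loose upper bound on MM overhead.
The function names and sample values are illustrative only.

```python
CPU_FIELDS = (
    "user", "nice", "system", "idle", "iowait",
    "irq", "softirq", "steal", "guest", "guest_nice",
)


def parse_cpu_line(text):
    """Return the aggregate 'cpu' line of /proc/stat as a dict of ticks."""
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def system_share(earlier, later):
    """Fraction of elapsed CPU time accounted as 'system' between samples."""
    # guest and guest_nice are already included in user and nice,
    # so only the first eight fields are summed for the total.
    summed = CPU_FIELDS[:8]
    deltas = {name: later[name] - earlier[name] for name in summed}
    total = sum(deltas.values())
    return deltas["system"] / total if total else 0.0


# Illustrative samples (not measured data).
STAT_T0 = "cpu  100 0 50 800 10 0 5 0 0 0\ncpu0 100 0 50 800 10 0 5 0 0 0\n"
STAT_T1 = "cpu  150 0 150 1050 10 0 5 0 0 0\ncpu0 150 0 150 1050 10 0 5 0 0 0\n"
```

A dedicated counter would be consumed the same way, as a delta between
two samples divided by the elapsed total.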

Thanks!

* Re: PSI vs. CPU overhead for client computing
  2019-04-23 18:57 PSI vs. CPU overhead for client computing Luigi Semenzato
@ 2019-04-23 22:04 ` Suren Baghdasaryan
  2019-04-24  4:54   ` Luigi Semenzato
  2019-04-24 16:36   ` Johannes Weiner
  0 siblings, 2 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2019-04-23 22:04 UTC
  To: Luigi Semenzato; +Cc: Linux Memory Management List, Johannes Weiner

Hi Luigi,

On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@google.com> wrote:
>
> I and others are working on improving system behavior under memory
> pressure on Chrome OS.  We use zram, which swaps to a
> statically-configured compressed RAM disk.  One challenge that we have
> is that the footprint of our workloads is highly variable.  With zram,
> we have to set the size of the swap partition at boot time.  When the
> (logical) swap partition is full, we're left with some amount of RAM
> usable by file and anonymous pages (we can ignore the rest).  We don't
> get to control this amount dynamically.  Thus if the workload fits
> nicely in it, everything works well.  If it doesn't, then the rate of
> anonymous page faults can be quite high, causing large CPU overhead
> for compression/decompression (as well as for other parts of the MM).
>
> In Chrome OS and Android, we have the luxury that we can reduce
> pressure by terminating processes (tab discard in Chrome OS, app kill
> in Android---which incidentally also runs in parallel with Chrome OS
> on some chromebooks).  To help decide when to reduce pressure, we
> would like to have a reliable and device-independent measure of MM CPU
> overhead.  I have looked into PSI and have a few questions.  I am also
> looking for alternative suggestions.
>
> PSI measures the times spent when some and all tasks are blocked by
> memory allocation.  In some experiments, this doesn't seem to
> correlate too well with CPU overhead (which instead correlates fairly
> well with page fault rates).  Could this be because it includes
> pressure from file page faults?

This might be caused by thrashing (see:
https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).

>  Is there some way of interpreting PSI
> numbers so that the pressure from file pages is ignored?

I don't think so, but I might be wrong.  Looking at
https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
you could probably use delayacct to distinguish file thrashing.
Remember, however, that PSI takes into account the number of CPUs and
the number of currently non-idle tasks in its pressure calculations,
so the raw delay numbers might not be very useful here.

> What is the purpose of "some" and "full" in the PSI measurements?  The
> chrome browser is a multi-process app and there is a lot of IPC.  When
> process A is blocked on memory allocation, it cannot respond to IPC
> from process B, thus effectively both processes are blocked on
> allocation, but we don't see that.

I don't think PSI would account for such an indirect stall when A is
waiting for B and B is blocked on memory access. B's stall will be
accounted for but I don't think A's blocked time will go into PSI
calculations. The process inter-dependencies are probably out of scope
for PSI.

> Also, there are situations in
> which some "uninteresting" process keep running.  So it's not clear we
> can rely on "full".  Or maybe I am misunderstanding?  "Some" may be a
> better measure, but again it doesn't measure indirect blockage.

Johannes explains the SOME and FULL calculations here:
https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
and includes a couple of examples, the last one showing FULL>0 with
some tasks still running.
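
A small sketch of that calculation, transcribed from the comment block
in psi.c as best I can tell (the linked source is authoritative).  The
function name is made up, and the kernel does not evaluate this
globally: per the same comment, it approximates the result from
per-CPU state times.

```python
def psi_some_full(nr_nonidle, nr_delayed, nr_running, nr_cpus):
    """Instantaneous SOME and FULL ratios for one resource.

    threads = min(nr_nonidle_tasks, nr_cpus)
       SOME = min(nr_delayed_tasks / threads, 1)
       FULL = (threads - min(nr_running_tasks, threads)) / threads
    """
    threads = min(nr_nonidle, nr_cpus)
    if threads == 0:
        # Nothing wants to run, so there is no pressure to report.
        return 0.0, 0.0
    some = min(nr_delayed / threads, 1.0)
    full = (threads - min(nr_running, threads)) / threads
    return some, full


# 257 number crunchers on 256 CPUs: one task is delayed at any time,
# but every CPU stays busy, so SOME is about 0.4% and FULL is 0.
CRUNCHERS = psi_some_full(nr_nonidle=257, nr_delayed=1,
                          nr_running=257, nr_cpus=256)

# 1 of 4 tasks delayed on memory, with at least 4 CPUs: a quarter of
# the potential is lost, so FULL > 0 although three tasks still run.
ONE_OF_FOUR = psi_some_full(nr_nonidle=4, nr_delayed=1,
                            nr_running=3, nr_cpus=4)
```

The second case is why an "uninteresting" task that keeps running does
not necessarily pin FULL at zero on a multi-CPU machine.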

> The kernel contains various cpustat measurements, including some
> slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> Would adding a CPUTIME_MEM be out of the question?
>
> Thanks!
>

Just my 2 cents; Johannes, being the author, might have more to say here.


* Re: PSI vs. CPU overhead for client computing
  2019-04-23 22:04 ` Suren Baghdasaryan
@ 2019-04-24  4:54   ` Luigi Semenzato
  2019-04-24 14:49     ` Suren Baghdasaryan
  2019-04-24 16:36   ` Johannes Weiner
  1 sibling, 1 reply; 6+ messages in thread
From: Luigi Semenzato @ 2019-04-24  4:54 UTC
  To: Suren Baghdasaryan; +Cc: Linux Memory Management List, Johannes Weiner

Thank you very much Suren.

On Tue, Apr 23, 2019 at 3:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Hi Luigi,
>
> On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@google.com> wrote:
> >
> > I and others are working on improving system behavior under memory
> > pressure on Chrome OS.  We use zram, which swaps to a
> > statically-configured compressed RAM disk.  One challenge that we have
> > is that the footprint of our workloads is highly variable.  With zram,
> > we have to set the size of the swap partition at boot time.  When the
> > (logical) swap partition is full, we're left with some amount of RAM
> > usable by file and anonymous pages (we can ignore the rest).  We don't
> > get to control this amount dynamically.  Thus if the workload fits
> > nicely in it, everything works well.  If it doesn't, then the rate of
> > anonymous page faults can be quite high, causing large CPU overhead
> > for compression/decompression (as well as for other parts of the MM).
> >
> > In Chrome OS and Android, we have the luxury that we can reduce
> > pressure by terminating processes (tab discard in Chrome OS, app kill
> > in Android---which incidentally also runs in parallel with Chrome OS
> > on some chromebooks).  To help decide when to reduce pressure, we
> > would like to have a reliable and device-independent measure of MM CPU
> > overhead.  I have looked into PSI and have a few questions.  I am also
> > looking for alternative suggestions.
> >
> > PSI measures the times spent when some and all tasks are blocked by
> > memory allocation.  In some experiments, this doesn't seem to
> > correlate too well with CPU overhead (which instead correlates fairly
> > well with page fault rates).  Could this be because it includes
> > pressure from file page faults?
>
> This might be caused by thrashing (see:
> https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).
>
> >  Is there some way of interpreting PSI
> > numbers so that the pressure from file pages is ignored?
>
> I don't think so but I might be wrong. Notice here
> https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
> you could probably use delayacct to distinguish file thrashing,
> however remember that PSI takes into account the number of CPUs and
> the number of currently non-idle tasks in its pressure calculations,
> so the raw delay numbers might not be very useful here.

OK.

> > What is the purpose of "some" and "full" in the PSI measurements?  The
> > chrome browser is a multi-process app and there is a lot of IPC.  When
> > process A is blocked on memory allocation, it cannot respond to IPC
> > from process B, thus effectively both processes are blocked on
> > allocation, but we don't see that.
>
> I don't think PSI would account such an indirect stall when A is
> waiting for B and B is blocked on memory access. B's stall will be
> accounted for but I don't think A's blocked time will go into PSI
> calculations. The process inter-dependencies are probably out of scope
> for PSI.

Right, that's what I was also saying.  It would be near impossible to
figure it out.  It may also be that statistically it doesn't matter,
as long as the workload characteristics don't change dramatically.
Which unfortunately they might...

> > Also, there are situations in
> > which some "uninteresting" process keep running.  So it's not clear we
> > can rely on "full".  Or maybe I am misunderstanding?  "Some" may be a
> > better measure, but again it doesn't measure indirect blockage.
>
> Johannes explains the SOME and FULL calculations here:
> https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
> and includes couple examples with the last one showing FULL>0 and some
> tasks still running.

Thank you, yes, those are good explanations.  I am still not sure how
to use this in our case.

I thought about using the page fault rate as a proxy for the
allocation overhead.  Unfortunately it is difficult to figure out the
baseline, because: 1. it is device-dependent (that's not
insurmountable: we could compute a per-device baseline offline); 2.
the CPUs can go in and out of turbo mode, or temperature-throttling,
and the notion of a constant "baseline" fails miserably.
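
For concreteness, a minimal sketch of the fault-rate proxy, assuming
the "name value" layout of /proc/vmstat and the pgfault, pgmajfault,
pswpin and pswpout counters.  The helper names and sample values are
illustrative.  This only produces raw rates; it does nothing about the
baseline problem described above.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def rates_per_second(earlier, later, interval_s,
                     names=("pgfault", "pgmajfault", "pswpin", "pswpout")):
    """Per-second rates of selected counters between two samples."""
    return {
        name: (later[name] - earlier[name]) / interval_s
        for name in names
        if name in earlier and name in later
    }


# Illustrative samples (not measured data), taken 2 seconds apart.
VMSTAT_T0 = (
    "nr_free_pages 12345\n"
    "pgfault 1000\n"
    "pgmajfault 100\n"
    "pswpin 50\n"
    "pswpout 70\n"
)
VMSTAT_T1 = (
    "nr_free_pages 11000\n"
    "pgfault 6000\n"
    "pgmajfault 600\n"
    "pswpin 450\n"
    "pswpout 270\n"
)
```

The same fault rate costs a different share of CPU time depending on
the device and on its current clock speed, which is why a fixed
threshold on these numbers does not transfer well.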

> > The kernel contains various cpustat measurements, including some
> > slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> > Would adding a CPUTIME_MEM be out of the question?

Any opinion on CPUTIME_MEM?

Thanks again!

> > Thanks!
> >
>
> Just my 2 cents and Johannes being the author might have more to say here.


* Re: PSI vs. CPU overhead for client computing
  2019-04-24  4:54   ` Luigi Semenzato
@ 2019-04-24 14:49     ` Suren Baghdasaryan
  2019-04-25 17:31       ` Luigi Semenzato
  0 siblings, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2019-04-24 14:49 UTC
  To: Luigi Semenzato; +Cc: Linux Memory Management List, Johannes Weiner

On Tue, Apr 23, 2019 at 9:54 PM Luigi Semenzato <semenzato@google.com> wrote:
>
> Thank you very much Suren.
>
> On Tue, Apr 23, 2019 at 3:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Hi Luigi,
> >
> > On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@google.com> wrote:
> > >
> > > I and others are working on improving system behavior under memory
> > > pressure on Chrome OS.  We use zram, which swaps to a
> > > statically-configured compressed RAM disk.  One challenge that we have
> > > is that the footprint of our workloads is highly variable.  With zram,
> > > we have to set the size of the swap partition at boot time.  When the
> > > (logical) swap partition is full, we're left with some amount of RAM
> > > usable by file and anonymous pages (we can ignore the rest).  We don't
> > > get to control this amount dynamically.  Thus if the workload fits
> > > nicely in it, everything works well.  If it doesn't, then the rate of
> > > anonymous page faults can be quite high, causing large CPU overhead
> > > for compression/decompression (as well as for other parts of the MM).
> > >
> > > In Chrome OS and Android, we have the luxury that we can reduce
> > > pressure by terminating processes (tab discard in Chrome OS, app kill
> > > in Android---which incidentally also runs in parallel with Chrome OS
> > > on some chromebooks).  To help decide when to reduce pressure, we
> > > would like to have a reliable and device-independent measure of MM CPU
> > > overhead.  I have looked into PSI and have a few questions.  I am also
> > > looking for alternative suggestions.
> > >
> > > PSI measures the times spent when some and all tasks are blocked by
> > > memory allocation.  In some experiments, this doesn't seem to
> > > correlate too well with CPU overhead (which instead correlates fairly
> > > well with page fault rates).  Could this be because it includes
> > > pressure from file page faults?
> >
> > This might be caused by thrashing (see:
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).
> >
> > >  Is there some way of interpreting PSI
> > > numbers so that the pressure from file pages is ignored?
> >
> > I don't think so but I might be wrong. Notice here
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
> > you could probably use delayacct to distinguish file thrashing,
> > however remember that PSI takes into account the number of CPUs and
> > the number of currently non-idle tasks in its pressure calculations,
> > so the raw delay numbers might not be very useful here.
>
> OK.
>
> > > What is the purpose of "some" and "full" in the PSI measurements?  The
> > > chrome browser is a multi-process app and there is a lot of IPC.  When
> > > process A is blocked on memory allocation, it cannot respond to IPC
> > > from process B, thus effectively both processes are blocked on
> > > allocation, but we don't see that.
> >
> > I don't think PSI would account such an indirect stall when A is
> > waiting for B and B is blocked on memory access. B's stall will be
> > accounted for but I don't think A's blocked time will go into PSI
> > calculations. The process inter-dependencies are probably out of scope
> > for PSI.
>
> Right, that's what I was also saying.  It would be near impossible to
> figure it out.  It may also be that statistically it doesn't matter,
> as long as the workload characteristics don't change dramatically.
> Which unfortunately they might...
>
> > > Also, there are situations in
> > > which some "uninteresting" process keep running.  So it's not clear we
> > > can rely on "full".  Or maybe I am misunderstanding?  "Some" may be a
> > > better measure, but again it doesn't measure indirect blockage.
> >
> > Johannes explains the SOME and FULL calculations here:
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
> > and includes couple examples with the last one showing FULL>0 and some
> > tasks still running.
>
> Thank you, yes, those are good explanation.  I am still not sure how
> to use this in our case.
>
> I thought about using the page fault rate as a proxy for the
> allocation overhead.  Unfortunately it is difficult to figure out the
> baseline, because: 1. it is device-dependent (that's not
> insurmountable: we could compute a per-device baseline offline); 2.
> the CPUs can go in and out of turbo mode, or temperature-throttling,
> and the notion of a constant "baseline" fails miserably.
>
> > > The kernel contains various cpustat measurements, including some
> > > slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> > > Would adding a CPUTIME_MEM be out of the question?
>
> Any opinion on CPUTIME_MEM?

I guess some description of how you plan to calculate it would be
helpful.  A simple raw delay counter might not be very useful; that's
why PSI performs more elaborate calculations.
Maybe posting a small RFC patch with code would get more attention and
let you collect more feedback.

> Thanks again!
>
> > > Thanks!
> > >
> >
> > Just my 2 cents and Johannes being the author might have more to say here.


* Re: PSI vs. CPU overhead for client computing
  2019-04-24 14:49     ` Suren Baghdasaryan
@ 2019-04-25 17:31       ` Luigi Semenzato
  0 siblings, 0 replies; 6+ messages in thread
From: Luigi Semenzato @ 2019-04-25 17:31 UTC
  To: Suren Baghdasaryan; +Cc: Linux Memory Management List, Johannes Weiner

Thank you, I can try to do that.

It's not trivial to get right though.  I have to find the right
compromise.  A horribly wrong patch won't be taken seriously, but a
completely correct one would be a bit too much work, given the
probability that it will get rejected.

Thanks also to Johannes for the clarification!



* Re: PSI vs. CPU overhead for client computing
  2019-04-23 22:04 ` Suren Baghdasaryan
  2019-04-24  4:54   ` Luigi Semenzato
@ 2019-04-24 16:36   ` Johannes Weiner
  1 sibling, 0 replies; 6+ messages in thread
From: Johannes Weiner @ 2019-04-24 16:36 UTC
  To: Suren Baghdasaryan; +Cc: Luigi Semenzato, Linux Memory Management List

On Tue, Apr 23, 2019 at 03:04:16PM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@google.com> wrote:
> > The chrome browser is a multi-process app and there is a lot of IPC.  When
> > process A is blocked on memory allocation, it cannot respond to IPC
> > from process B, thus effectively both processes are blocked on
> > allocation, but we don't see that.
> 
> I don't think PSI would account such an indirect stall when A is
> waiting for B and B is blocked on memory access. B's stall will be
> accounted for but I don't think A's blocked time will go into PSI
> calculations. The process inter-dependencies are probably out of scope
> for PSI.

Well, yes and no. We don't do explicit dependency tracking, but when A
is waiting on B it's also not considered productive, so it doesn't
factor into the equation. psi will see B blocked on memory and no
other productive processes, which means FULL state until B resumes.
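
A toy illustration of this point, using the simple single-CPU
definitions (SOME = some task is delayed; FULL = some task is delayed
and no task is running).  The task states, timelines and function
names are invented for the example and are not how the kernel tracks
state.

```python
RUNNING, MEMSTALL, SLEEPING = "running", "memstall", "sleeping"


def classify_tick(states):
    """Return (some, full) booleans for one instant of task states."""
    nr_delayed = sum(s == MEMSTALL for s in states)
    nr_running = sum(s == RUNNING for s in states)
    some = nr_delayed != 0
    full = nr_delayed != 0 and nr_running == 0
    return some, full


def pressure_over(timeline):
    """Fractions of ticks spent in the SOME and FULL states."""
    if not timeline:
        return 0.0, 0.0
    some_ticks = full_ticks = 0
    for states in timeline:
        some, full = classify_tick(states)
        some_ticks += some
        full_ticks += full
    return some_ticks / len(timeline), full_ticks / len(timeline)


# Each tuple is (state of A, state of B).  A sends a request and
# sleeps on the reply; B stalls on memory while servicing it.
IPC_TIMELINE = [
    (RUNNING, SLEEPING),
    (SLEEPING, RUNNING),
    (SLEEPING, MEMSTALL),
    (SLEEPING, MEMSTALL),
    (RUNNING, SLEEPING),
]

# Same scenario plus an unrelated task C that never stops running.
WITH_BYSTANDER = [states + (RUNNING,) for states in IPC_TIMELINE]
```

In the first timeline both ratios come out at 40%: A's wait is never
attributed to memory, yet the two stalled ticks count as FULL because
nothing else is productive.  In the second, the bystander keeps FULL
at zero in this single-CPU model while SOME is unchanged.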

