* PSI vs. CPU overhead for client computing
From: Luigi Semenzato @ 2019-04-23 18:57 UTC
To: Linux Memory Management List

I and others are working on improving system behavior under memory
pressure on Chrome OS. We use zram, which swaps to a
statically-configured compressed RAM disk. One challenge that we have
is that the footprint of our workloads is highly variable. With zram,
we have to set the size of the swap partition at boot time. When the
(logical) swap partition is full, we're left with some amount of RAM
usable by file and anonymous pages (we can ignore the rest). We don't
get to control this amount dynamically. Thus if the workload fits
nicely in it, everything works well. If it doesn't, then the rate of
anonymous page faults can be quite high, causing large CPU overhead
for compression/decompression (as well as for other parts of the MM).

In Chrome OS and Android, we have the luxury that we can reduce
pressure by terminating processes (tab discard in Chrome OS, app kill
in Android---which incidentally also runs in parallel with Chrome OS
on some chromebooks). To help decide when to reduce pressure, we
would like to have a reliable and device-independent measure of MM CPU
overhead. I have looked into PSI and have a few questions. I am also
looking for alternative suggestions.

PSI measures the times spent when some and all tasks are blocked by
memory allocation. In some experiments, this doesn't seem to
correlate too well with CPU overhead (which instead correlates fairly
well with page fault rates). Could this be because it includes
pressure from file page faults? Is there some way of interpreting PSI
numbers so that the pressure from file pages is ignored?

What is the purpose of "some" and "full" in the PSI measurements? The
chrome browser is a multi-process app and there is a lot of IPC. When
process A is blocked on memory allocation, it cannot respond to IPC
from process B, thus effectively both processes are blocked on
allocation, but we don't see that. Also, there are situations in
which some "uninteresting" processes keep running. So it's not clear we
can rely on "full". Or maybe I am misunderstanding? "Some" may be a
better measure, but again it doesn't measure indirect blockage.

The kernel contains various cpustat measurements, including some
slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
Would adding a CPUTIME_MEM be out of the question?

Thanks!
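
For context on the interface discussed above: PSI exposes memory
pressure in /proc/pressure/memory as a "some" line and a "full" line,
each carrying avg10, avg60 and avg300 (percentage of time stalled over
the last 10, 60 and 300 seconds) and total (cumulative stall time in
microseconds). A minimal Python sketch of parsing that file follows.
The function name parse_psi and the SAMPLE values are illustrative
assumptions and do not come from the thread.

```python
def parse_psi(text):
    """Parse the contents of a PSI file such as /proc/pressure/memory.

    Returns a dict like {"some": {"avg10": 1.5, ..., "total": 123456}, ...}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is an integer microsecond counter; the averages are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


# Made-up sample in the documented file format, used when PSI is unavailable.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.50 avg60=0.25 avg300=0.05 total=65432\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/pressure/memory") as f:
            print(parse_psi(f.read()))
    except OSError:
        # Kernel without PSI support, or a non-Linux system.
        print(parse_psi(SAMPLE))
```

The parser only reads the numbers; it does not separate file-backed
from anonymous pressure, which is the open question in the message
above.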

* Re: PSI vs. CPU overhead for client computing
From: Suren Baghdasaryan @ 2019-04-23 22:04 UTC
To: Luigi Semenzato; +Cc: Linux Memory Management List, Johannes Weiner

Hi Luigi,

On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@google.com> wrote:
>
> I and others are working on improving system behavior under memory
> pressure on Chrome OS. We use zram, which swaps to a
> statically-configured compressed RAM disk. One challenge that we have
> is that the footprint of our workloads is highly variable. With zram,
> we have to set the size of the swap partition at boot time. When the
> (logical) swap partition is full, we're left with some amount of RAM
> usable by file and anonymous pages (we can ignore the rest). We don't
> get to control this amount dynamically. Thus if the workload fits
> nicely in it, everything works well. If it doesn't, then the rate of
> anonymous page faults can be quite high, causing large CPU overhead
> for compression/decompression (as well as for other parts of the MM).
>
> In Chrome OS and Android, we have the luxury that we can reduce
> pressure by terminating processes (tab discard in Chrome OS, app kill
> in Android---which incidentally also runs in parallel with Chrome OS
> on some chromebooks). To help decide when to reduce pressure, we
> would like to have a reliable and device-independent measure of MM CPU
> overhead. I have looked into PSI and have a few questions. I am also
> looking for alternative suggestions.
>
> PSI measures the times spent when some and all tasks are blocked by
> memory allocation. In some experiments, this doesn't seem to
> correlate too well with CPU overhead (which instead correlates fairly
> well with page fault rates). Could this be because it includes
> pressure from file page faults?

This might be caused by thrashing (see:
https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).

> Is there some way of interpreting PSI
> numbers so that the pressure from file pages is ignored?

I don't think so, but I might be wrong. Notice that here
https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
you could probably use delayacct to distinguish file thrashing.
However, remember that PSI takes into account the number of CPUs and
the number of currently non-idle tasks in its pressure calculations,
so the raw delay numbers might not be very useful here.

> What is the purpose of "some" and "full" in the PSI measurements? The
> chrome browser is a multi-process app and there is a lot of IPC. When
> process A is blocked on memory allocation, it cannot respond to IPC
> from process B, thus effectively both processes are blocked on
> allocation, but we don't see that.

I don't think PSI would account for such an indirect stall when A is
waiting for B and B is blocked on memory access. B's stall will be
accounted for but I don't think A's blocked time will go into PSI
calculations. The process inter-dependencies are probably out of scope
for PSI.

> Also, there are situations in
> which some "uninteresting" processes keep running. So it's not clear we
> can rely on "full". Or maybe I am misunderstanding? "Some" may be a
> better measure, but again it doesn't measure indirect blockage.

Johannes explains the SOME and FULL calculations here:
https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
and includes a couple of examples, with the last one showing FULL>0 and
some tasks still running.

> The kernel contains various cpustat measurements, including some
> slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> Would adding a CPUTIME_MEM be out of the question?
>
> Thanks!
>

Just my 2 cents; Johannes, being the author, might have more to say here.
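
The SOME and FULL calculations referenced in the psi.c comment can be
sketched in a few lines of Python. This is a paraphrase of that
comment as best it can be recalled, not the kernel code itself; the
function name psi_some_full and its parameter names are assumptions
made for illustration, and the linked source remains the authority.
The idea is that pressure is normalized by the number of execution
threads that could be doing useful work, which is the smaller of the
non-idle task count and the CPU count.

```python
def psi_some_full(nr_delayed, nr_running, nr_nonidle, nr_cpus):
    """Return (some, full) as fractions between 0 and 1 for one instant.

    nr_delayed: tasks currently stalled on the resource
    nr_running: tasks currently doing productive work
    nr_nonidle: tasks that are either delayed or running
    nr_cpus:    CPUs available
    """
    threads = min(nr_nonidle, nr_cpus)
    if threads == 0:
        # Nothing wants to run, so there is no pressure to report.
        return 0.0, 0.0
    some = min(nr_delayed / threads, 1.0)
    full = (threads - min(nr_running, threads)) / threads
    return some, full


if __name__ == "__main__":
    # One of four tasks stalled on a 4-CPU machine: three tasks still run,
    # yet a quarter of the potential compute is lost.
    print(psi_some_full(nr_delayed=1, nr_running=3, nr_nonidle=4, nr_cpus=4))
    # 257 busy tasks on 256 CPUs: one task waits, but no CPU goes unused.
    print(psi_some_full(nr_delayed=1, nr_running=256, nr_nonidle=257, nr_cpus=256))
```

The first call illustrates the case mentioned above, where FULL is
greater than zero even though some tasks are still running.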

* Re: PSI vs. CPU overhead for client computing
From: Luigi Semenzato @ 2019-04-24 4:54 UTC
To: Suren Baghdasaryan; +Cc: Linux Memory Management List, Johannes Weiner

Thank you very much Suren.

On Tue, Apr 23, 2019 at 3:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Hi Luigi,
>
> On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@google.com> wrote:
> >
> > [...]
> >
> > PSI measures the times spent when some and all tasks are blocked by
> > memory allocation. In some experiments, this doesn't seem to
> > correlate too well with CPU overhead (which instead correlates fairly
> > well with page fault rates). Could this be because it includes
> > pressure from file page faults?
>
> This might be caused by thrashing (see:
> https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).
>
> > Is there some way of interpreting PSI
> > numbers so that the pressure from file pages is ignored?
>
> I don't think so, but I might be wrong. Notice that here
> https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
> you could probably use delayacct to distinguish file thrashing.
> However, remember that PSI takes into account the number of CPUs and
> the number of currently non-idle tasks in its pressure calculations,
> so the raw delay numbers might not be very useful here.

OK.

> > What is the purpose of "some" and "full" in the PSI measurements? The
> > chrome browser is a multi-process app and there is a lot of IPC. When
> > process A is blocked on memory allocation, it cannot respond to IPC
> > from process B, thus effectively both processes are blocked on
> > allocation, but we don't see that.
>
> I don't think PSI would account for such an indirect stall when A is
> waiting for B and B is blocked on memory access. B's stall will be
> accounted for but I don't think A's blocked time will go into PSI
> calculations. The process inter-dependencies are probably out of scope
> for PSI.

Right, that's what I was also saying. It would be near impossible to
figure it out. It may also be that statistically it doesn't matter,
as long as the workload characteristics don't change dramatically.
Which unfortunately they might...

> > Also, there are situations in
> > which some "uninteresting" processes keep running. So it's not clear we
> > can rely on "full". Or maybe I am misunderstanding? "Some" may be a
> > better measure, but again it doesn't measure indirect blockage.
>
> Johannes explains the SOME and FULL calculations here:
> https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
> and includes a couple of examples, with the last one showing FULL>0 and
> some tasks still running.

Thank you, yes, those are good explanations. I am still not sure how
to use this in our case.

I thought about using the page fault rate as a proxy for the
allocation overhead. Unfortunately it is difficult to figure out the
baseline, because: 1. it is device-dependent (that's not
insurmountable: we could compute a per-device baseline offline); 2.
the CPUs can go in and out of turbo mode, or temperature-throttling,
and the notion of a constant "baseline" fails miserably.

> > The kernel contains various cpustat measurements, including some
> > slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> > Would adding a CPUTIME_MEM be out of the question?

Any opinion on CPUTIME_MEM?

Thanks again!
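
The page-fault-rate proxy mentioned above can be sketched as follows,
assuming the counters are taken from /proc/vmstat, which reports
cumulative pgfault and pgmajfault counts. The helper names
parse_vmstat and fault_rates are hypothetical. The sketch yields only
raw faults per second; it does nothing about the baseline problem
described above (device dependence, turbo mode, thermal throttling).

```python
import time


def parse_vmstat(text):
    """Turn "name value" lines, as found in /proc/vmstat, into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def fault_rates(before, after, interval_s):
    """Faults per second between two snapshots taken interval_s seconds apart."""
    return {
        key: (after[key] - before[key]) / interval_s
        for key in ("pgfault", "pgmajfault")
    }


if __name__ == "__main__":
    interval_s = 1.0
    try:
        with open("/proc/vmstat") as f:
            first = parse_vmstat(f.read())
        time.sleep(interval_s)
        with open("/proc/vmstat") as f:
            second = parse_vmstat(f.read())
        print(fault_rates(first, second, interval_s))
    except OSError:
        print("/proc/vmstat is not available on this system")
```

A policy built on these rates would still need a per-device notion of
what counts as "too many" faults, which is exactly the difficulty
raised in the message.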

* Re: PSI vs. CPU overhead for client computing
From: Suren Baghdasaryan @ 2019-04-24 14:49 UTC
To: Luigi Semenzato; +Cc: Linux Memory Management List, Johannes Weiner

On Tue, Apr 23, 2019 at 9:54 PM Luigi Semenzato <semenzato@google.com> wrote:
>
> [...]
>
> I thought about using the page fault rate as a proxy for the
> allocation overhead. Unfortunately it is difficult to figure out the
> baseline, because: 1. it is device-dependent (that's not
> insurmountable: we could compute a per-device baseline offline); 2.
> the CPUs can go in and out of turbo mode, or temperature-throttling,
> and the notion of a constant "baseline" fails miserably.
>
> > > The kernel contains various cpustat measurements, including some
> > > slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> > > Would adding a CPUTIME_MEM be out of the question?
>
> Any opinion on CPUTIME_MEM?

I guess some description of how you plan to calculate it would be
helpful. A simple raw delay counter might not be very useful; that's
why PSI performs more elaborate calculations.
Maybe posting a small RFC patch with code would get more attention and
you can collect more feedback.
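
To illustrate why a raw counter alone says little: the total= field
that PSI reports is itself a cumulative stall counter in microseconds,
and it becomes a pressure figure only once it is related to the
wall-clock time over which it grew. The sketch below shows that
normalization step only. The name stall_percent is an assumption for
illustration, and this is neither how PSI computes its running
averages nor a proposal for how a CPUTIME_MEM counter would be
calculated.

```python
def stall_percent(prev_total_us, cur_total_us, interval_s):
    """Percentage of a sampling window spent stalled.

    prev_total_us, cur_total_us: two readings of a cumulative stall
    counter in microseconds; interval_s: seconds between the readings.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    stalled_us = cur_total_us - prev_total_us
    return 100.0 * stalled_us / (interval_s * 1_000_000)


if __name__ == "__main__":
    # Half a second of accumulated stall over a two-second window.
    print(stall_percent(1_000_000, 1_500_000, 2.0))
```

The same delta could mean very different things on machines with
different CPU counts or different numbers of runnable tasks, which is
the reason given above for PSI's more elaborate accounting.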

* Re: PSI vs. CPU overhead for client computing
From: Luigi Semenzato @ 2019-04-25 17:31 UTC
To: Suren Baghdasaryan; +Cc: Linux Memory Management List, Johannes Weiner

Thank you, I can try to do that. It's not trivial to get right
though. I have to find the right compromise. A horribly wrong patch
won't be taken seriously, but a completely correct one would be a bit
too much work, given the probability that it will get rejected.

Thanks also to Johannes for the clarification!

On Wed, Apr 24, 2019 at 7:49 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [...]
>
> I guess some description of how you plan to calculate it would be
> helpful. A simple raw delay counter might not be very useful; that's
> why PSI performs more elaborate calculations.
> Maybe posting a small RFC patch with code would get more attention and
> you can collect more feedback.

* Re: PSI vs. CPU overhead for client computing
From: Johannes Weiner @ 2019-04-24 16:36 UTC
To: Suren Baghdasaryan; +Cc: Luigi Semenzato, Linux Memory Management List

On Tue, Apr 23, 2019 at 03:04:16PM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@google.com> wrote:
> > The chrome browser is a multi-process app and there is a lot of IPC. When
> > process A is blocked on memory allocation, it cannot respond to IPC
> > from process B, thus effectively both processes are blocked on
> > allocation, but we don't see that.
>
> I don't think PSI would account for such an indirect stall when A is
> waiting for B and B is blocked on memory access. B's stall will be
> accounted for but I don't think A's blocked time will go into PSI
> calculations. The process inter-dependencies are probably out of scope
> for PSI.

Well, yes and no. We don't do explicit dependency tracking, but when A
is waiting on B it's also not considered productive, so it doesn't
factor into the equation. psi will see B blocked on memory and no
other productive processes, which means FULL state until B resumes.
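
The scenario described here can be worked through with a small Python
sketch. It assumes the SOME/FULL ratios from the psi.c comment linked
earlier in the thread, and it models a task sleeping on IPC as "idle",
meaning it is neither running nor stalled on memory; that modeling and
the names pressure_from_states and the state labels are assumptions
made for illustration.

```python
def pressure_from_states(task_states, nr_cpus):
    """Compute instantaneous SOME/FULL fractions from per-task states.

    Each state is "running", "memstall", or "idle". Idle tasks (for
    example, ones sleeping on IPC) are not counted as productive and do
    not enter the calculation.
    """
    states = list(task_states)
    nr_delayed = states.count("memstall")
    nr_running = states.count("running")
    threads = min(nr_delayed + nr_running, nr_cpus)
    if threads == 0:
        return {"some": 0.0, "full": 0.0}
    return {
        "some": min(nr_delayed / threads, 1.0),
        "full": (threads - min(nr_running, threads)) / threads,
    }


if __name__ == "__main__":
    # One process sleeps on IPC while the other is blocked on memory.
    print(pressure_from_states(["idle", "memstall"], nr_cpus=4))
    # Same situation plus one unrelated process that keeps running.
    print(pressure_from_states(["idle", "memstall", "running"], nr_cpus=4))
```

With only the IPC waiter and the memory-stalled task present, both
fractions come out at 1.0, matching the FULL state described above.
Adding one unrelated running process halves both values under these
assumptions, which relates to the earlier concern about
"uninteresting" processes.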