From: Balbir Singh <bsingharora@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Tejun Heo <tj@kernel.org>,
surenb@google.com, Vinayak Menon <vinmenon@codeaurora.org>,
Christoph Lameter <cl@linux.com>, Mike Galbraith <efault@gmx.de>,
Shakeel Butt <shakeelb@google.com>, linux-mm <linux-mm@kvack.org>,
cgroups@vger.kernel.org,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
kernel-team@fb.com
Subject: Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
Date: Tue, 24 Jul 2018 07:14:02 +1000
Message-ID: <CAKTCnzmt_CnfZMMdK9_-rBrL4kUmoE70nVbnE58CJp++FP0CCQ@mail.gmail.com>
In-Reply-To: <20180712172942.10094-1-hannes@cmpxchg.org>
On Fri, Jul 13, 2018 at 3:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> PSI aggregates and reports the overall wallclock time in which the
> tasks in a system (or cgroup) wait for contended hardware resources.
>
> This helps users understand the resource pressure their workloads are
> under, which allows them to root-cause and fix throughput and latency
> problems caused by overcommitting, underprovisioning, or suboptimal job
> placement in a grid, as well as to anticipate major disruptions like OOM.
>
> This version 2 of the series incorporates a ton of feedback from
> PeterZ and SurenB; more details at the end of this email.
>
> Real-world applications
>
> We're using the data collected by psi (and its previous incarnation,
> memdelay) quite extensively at Facebook, with several success stories.
>
> One use case is avoiding OOM hangs/livelocks. These happen because
> the OOM killer is triggered by reclaim not being able to free pages,
> but with fast flash devices there is *always* some clean and
> up-to-date cache to reclaim; the OOM killer never kicks in, even as
> tasks spend 90% of the time thrashing the cache pages of their own
> executables. There is no situation where this ever makes sense in
> practice. We wrote a <100 line POC python script to monitor memory
> pressure and kill stuff way before such pathological thrashing leads
> to full system losses that require forcible hard resets.
>
> We've since extended and deployed this code into other places to
> guarantee latency and throughput SLAs, since they're usually violated
> way before the kernel OOM killer would ever kick in.
>
> The idea is to eventually incorporate this back into the kernel, so
> that Linux can avoid OOM livelocks (which TECHNICALLY aren't memory
> deadlocks, but indistinguishable to the user) out of the box.
>
> We also use psi memory pressure for loadshedding. Our batch job
> infrastructure used to use heuristics based on various VM stats to
> anticipate OOM situations, with lackluster success. We switched it to
> psi and managed to anticipate and avoid OOM kills and hangs fairly
> reliably. The reduction of OOM outages in the worker pool raised the
> pool's aggregate productivity, and we were able to switch that service
> to smaller machines.
>
> Lastly, we use cgroups to isolate a machine's main workload from
> maintenance crap like package upgrades, logging, configuration, as
> well as to prevent multiple workloads on a machine from stepping on
> each others' toes. We were not able to configure this properly without
> the pressure metrics; we would see latency or bandwidth drops, but it
> would often be hard or even impossible to root-cause post-mortem.
>
> We now log and graph pressure for the containers in our fleet and can
> trivially link latency spikes and throughput drops to shortages of
> specific resources after the fact, and fix the job config/scheduling.
>
> I've also received feedback and feature requests from Android for the
> purpose of low-latency OOM killing. The on-demand stats aggregation in
> the last patch of this series is for this purpose, to allow Android to
> react to pressure before the system starts visibly hanging.
>
> How do you use this feature?
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
> cpu.pressure, memory.pressure and io.pressure files, which simply
> aggregate task stalls at the cgroup level instead of system-wide.
>
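To make the layout concrete, here is a quick python check of what such a
kernel exposes (the cgroup name below is a made-up example, assuming a
cgroup2 mount at /sys/fs/cgroup):

import os

# System-wide files, present when the kernel is built with CONFIG_PSI=y:
for name in ("cpu", "memory", "io"):
    path = "/proc/pressure/" + name
    print(path, "exists:", os.path.exists(path))

# Per-cgroup equivalents on cgroup2, e.g. for a hypothetical group "workload":
for name in ("cpu.pressure", "memory.pressure", "io.pressure"):
    path = "/sys/fs/cgroup/workload/" + name
    print(path, "exists:", os.path.exists(path))
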
> The cpu file contains one line:
>
> some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
>
> The averages give the percentage of walltime in which one or more
> tasks are delayed on the runqueue while another task has the
> CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell
> short-term trends from long-term ones, similarly to the load average.
>
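The format is easy to consume from userspace; a minimal python sketch
for parsing it (field names taken from the example above -- this is not
code from the series):

def read_pressure(path="/proc/pressure/cpu"):
    # Parse "some avg10=... avg60=... avg300=... total=..." lines.
    metrics = {}
    with open(path) as f:
        for line in f:
            kind, rest = line.split(None, 1)   # "some" (and "full" for memory/io)
            metrics[kind] = {k: float(v) for k, v in
                             (kv.split("=") for kv in rest.split())}
    return metrics

cpu = read_pressure()
print("CPU some-pressure over the last 10s: %.2f%%" % cpu["some"]["avg10"])
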
Does the mechanism scale? I am a little concerned about how frequently
this infrastructure is monitored/read/acted upon. Why aren't existing
mechanisms sufficient -- and why does the average delay calculation live
in the kernel?
> The total= value gives the absolute stall time in microseconds. This
> allows detecting latency spikes that might be too short to sway the
> running averages. It also allows custom time averaging in case the
> 10s/1m/5m windows aren't adequate for the use case (or are too coarse
> with future hardware).
>
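As an illustration of the custom averaging mentioned above: sample total=
twice and turn the delta into a stall percentage over a window of your
own choosing (a rough sketch; the 3-second window is arbitrary):

import time

def total_stall_us(path="/proc/pressure/cpu", kind="some"):
    # Return the absolute stall time in microseconds for the given line.
    with open(path) as f:
        for line in f:
            if line.startswith(kind):
                return int(line.split("total=")[1])

WINDOW = 3.0  # seconds; pick whatever suits the workload
before = total_stall_us()
time.sleep(WINDOW)
after = total_stall_us()
print("stalled %.2f%% of the last %.0fs"
      % ((after - before) / (WINDOW * 1e6) * 100, WINDOW))
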
> What to make of this "some" metric? If CPU utilization is at 100% and
> CPU pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even if 50%
> of the threads were to go idle (as most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> would do better with more resources. Depending on what your priorities
> and options are, raised "some" numbers may or may not require action.
>
> The memory file contains two lines:
>
> some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
> full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
>
> The some line is the same as for cpu: the time in which at least one
> task is stalled on the resource. In the case of memory, this includes
> waiting on swap-in, page cache refaults and page reclaim.
>
> The full line, however, indicates time in which *nobody* is using the
> CPU productively due to pressure: all non-idle tasks are waiting for
> memory in one form or another. Significant time spent in there is a
> good trigger for killing things, moving jobs to other machines, or
> dropping incoming requests, since neither the jobs nor the machine
> overall are making much headway.
>
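A userspace monitor of the kind described earlier can key off exactly
this line; a rough sketch (this is not the actual POC script -- the
threshold and the victim pid are placeholders for illustration):

import os, signal, time

THRESHOLD = 40.0    # percent "full" over 10s; arbitrary for illustration
VICTIM_PID = 1234   # hypothetical; a real tool would pick its own victim

def full_avg10(path="/proc/pressure/memory"):
    # Extract the avg10 value from the "full" line.
    with open(path) as f:
        for line in f:
            if line.startswith("full"):
                return float(line.split("avg10=")[1].split()[0])

while True:
    if full_avg10() > THRESHOLD:
        os.kill(VICTIM_PID, signal.SIGKILL)
        break
    time.sleep(1)
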
> The io file is similar to memory. Because the block layer doesn't have
> a concept of hardware contention right now (how much longer is my IO
> request taking due to other tasks?), it reports CPU potential lost on
> all IO delays, not just the potential lost due to competition.
>
There is no mention of the overhead this introduces in general; maybe
the details are in the patches. I'll read through them.

Balbir Singh.