From: Daniel Drake <drake@endlessm.com>
To: hannes@cmpxchg.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux@endlessm.com,
	linux-block@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>, Balbir Singh <bsingharora@gmail.com>,
	Mike Galbraith <efault@gmx.de>, Oliver Yang <yangoliver@me.com>,
	Shakeel Butt <shakeelb@google.com>, xxx xxx <x.qendo@gmail.com>,
	Taras Kondratiuk <takondra@cisco.com>,
	Daniel Walker <danielwa@cisco.com>,
	Vinayak Menon <vinmenon@codeaurora.org>,
	Ruslan Ruslichenko <rruslich@cisco.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
Date: Mon, 16 Jul 2018 10:57:45 -0500
Message-ID: <20180716155745.10368-1-drake@endlessm.com>
In-Reply-To: <20180712172942.10094-1-hannes@cmpxchg.org>

Hi Johannes,

Thanks for your work on psi! 

We have also been investigating the "thrashing problem" on our Endless
desktop OS. We have seen that systems can easily get into a state where the
UI becomes unresponsive to input, and the mouse cursor becomes extremely
slow or stuck when the system is running out of memory. We are working with
a full GNOME desktop environment on systems with only 2GB RAM, and
sometimes no real swap (although zram-swap helps mitigate the problem to
some extent).

My analysis so far indicates that when the system is low on memory and
hits this condition, it spends much of its time under
__alloc_pages_direct_reclaim. "perf trace -F" shows a great many page
faults in executable code while this is going on. I believe the kernel is
swapping out executable code in order to satisfy memory allocation
requests, but that swapped-out code is needed again a moment later, so it
gets swapped back in via the page fault handler, and all this activity
severely impairs the system's ability to respond to user input.
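
As an aside, the fault storm is also visible without perf: the
pgmajfault counter in /proc/vmstat climbs rapidly while the system is
thrashing. A minimal sketch of sampling it (the 1-second interval is an
arbitrary choice):

  #include <stdio.h>
  #include <unistd.h>

  /* Return the pgmajfault counter from /proc/vmstat, or -1 on error. */
  static long read_pgmajfault(void)
  {
      char line[128];
      long val = -1;
      FILE *f = fopen("/proc/vmstat", "r");

      if (!f)
          return -1;
      while (fgets(line, sizeof(line), f))
          if (sscanf(line, "pgmajfault %ld", &val) == 1)
              break;
      fclose(f);
      return val;
  }

  int main(void)
  {
      long prev = read_pgmajfault();

      for (;;) {
          sleep(1);
          long cur = read_pgmajfault();
          printf("major faults/sec: %ld\n", cur - prev);
          prev = cur;
      }
  }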

I appreciate the kernel's attempt to keep processes alive, but on the
desktop we see that the system rarely recovers from this situation, so a
hard shutdown is the only way out. In this case we view it as desirable
for the OOM killer to step in (it does not, because direct reclaim never
actually fails).

I recently looked at the cpuset memory_pressure counter, which seemed
promising, but in practice it turned out not to be a useful
representation of thrashing. It measures the rate at which
__perform_reclaim() is called, yet I have observed that as the system
gets deeper into thrashing, __perform_reclaim() is actually called less
and less frequently, because each invocation takes longer and longer to
complete (after 2 minutes of thrashing, a single call can take close to
1s).

Instead of the call rate, it seems necessary to measure the amount of
work done in that codepath, and that is what you are doing with psi.

I tried psi on a 2GB RAM system with no swap (and no zram-swap) and was
pleased with the results in combination with this sample userspace code:

https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd

It invokes the OOM killer when memory full_avg10 is >= 10%, i.e. it
kills if all tasks were simultaneously blocked on memory management for
a total of at least 1s within a 10s window.
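
For reference, a minimal sketch of that approach (this is not the gist
itself; triggering the OOM killer via the sysrq 'f' key is one possible
mechanism, and it assumes sysrq is enabled):

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      char line[256];

      for (;;) {
          FILE *f = fopen("/proc/pressure/memory", "r");
          if (!f) {
              perror("/proc/pressure/memory");
              return 1;
          }
          while (fgets(line, sizeof(line), f)) {
              double avg10;
              /* "full" is the time in which all tasks were stalled. */
              if (sscanf(line, "full avg10=%lf", &avg10) == 1 &&
                  avg10 >= 10.0) {
                  /* sysrq 'f' asks the kernel to run the OOM killer. */
                  FILE *s = fopen("/proc/sysrq-trigger", "w");
                  if (s) {
                      fputc('f', s);
                      fclose(s);
                  }
              }
          }
          fclose(f);
          sleep(5);
      }
  }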

In initial tests it works very well. The system recovers quickly from
thrashing after the daemon steps in and kills a process, and I have yet
to see a premature kill. It would be great to see this upstream soon.

I also support your idea of having the kernel offer mechanisms to
handle this directly in the future; it would be nice not to have to
delegate this task to userspace, and there is also the risk that
userspace itself becomes so starved that it cannot step in.

The only question I have is about the format of the data in /proc. The
memory file returns two lines with several values on each line, which
requires a bit more parsing than the "one value per file" approach that
has become prevalent in sysfs in recent years. Would it make sense to
instead expose a single value in (say) /proc/pressure/memory/full_avg10?
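
For illustration, the current two-line output looks something like this
(the numbers are made up, and the exact field layout may differ in your
version of the series):

  some avg10=0.12 avg60=1.52 avg300=0.50
  full avg10=10.34 avg60=0.80 avg300=0.21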

Thanks
Daniel
